Lending Club Final Project

Made by Alexander Demidov, Yang Zeng, Mier Chen, Kopal Jain


What is Lending Club?

And why we chose this project.

Lending Club is the world’s leading online marketplace that allows borrowers and investors to connect. The platform uses technology to bring borrowers and lenders together and acts as a broker: it screens borrowers, facilitates transactions, and keeps track of loans over their lifetimes. Investors purchase Notes (which correspond to fractions of loans) with the goal of earning a return from interest payments and principal repayment. Borrowers who apply to Lending Club typically take out loans to consolidate debt, improve their homes, or finance major purchases.

The goal of this project is to create a service that helps investors make better decisions when choosing which Lending Club notes to invest in. Specifically, our model uses a classification algorithm, with Loan Status as the response variable, to predict whether individual notes offered by Lending Club will be fully paid or charged off. Predicting which notes will be charged off gives investors useful insight into the potential performance of individual notes and of their overall portfolio.

Currently, investors can rely on Lending Club’s grade and sub-grade system to gauge whether a borrower will default. We want to build a service that investors can use to predict which loans within a particular sub-grade will be fully paid. If our model is better at selecting fully paid loans than a random draw, it will be a valuable service for investors.

Data Merging

Combine all the data we can use.

Prior to data cleaning, the data sets were merged to cover loans from 2011-2017. To preserve the original data, twelve data sets were copied and the copies were merged. To verify that the column names were identical in each data frame, the columns of each were counted and printed so that they could be compared visually. The combined data set was stored in a new CSV file that served as the input for data cleaning.
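The merge step can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code we ran: the function name `merge_csv_files` and the tiny stand-in files are hypothetical, and the header check is done programmatically here instead of by eye.

```python
import csv
import tempfile
from pathlib import Path


def merge_csv_files(paths, out_path):
    """Concatenate CSV files that share an identical header; return the row count."""
    expected_header = None
    total_rows = 0
    with open(out_path, "w", newline="") as out_file:
        writer = csv.writer(out_file)
        for path in paths:
            with open(path, newline="") as in_file:
                reader = csv.reader(in_file)
                header = next(reader)
                if expected_header is None:
                    # The first file defines the columns of the merged output.
                    expected_header = header
                    writer.writerow(header)
                elif header != expected_header:
                    raise ValueError(f"column mismatch in {path}")
                for row in reader:
                    writer.writerow(row)
                    total_rows += 1
    return total_rows


# Tiny stand-in files (hypothetical); the real inputs would be the LoanStats CSV files.
tmp_dir = Path(tempfile.mkdtemp())
part_a = tmp_dir / "part_a.csv"
part_b = tmp_dir / "part_b.csv"
part_a.write_text("id,loan_status\n1,Fully Paid\n2,Charged Off\n")
part_b.write_text("id,loan_status\n3,Fully Paid\n")
merged_path = tmp_dir / "merged.csv"
merged_rows = merge_csv_files([part_a, part_b], merged_path)
```

Raising an error on a header mismatch is a design choice for the sketch: it stops the merge before misaligned columns can reach the cleaning step.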

Merging the data sets ensures that as much data as possible is available for feature engineering and modeling. By using data from a range of years, we are able to extract a variety of information from the loan history, which gives our models a better chance of performing well when determining which loans could be charged off.

  • LoanStats_2007-2011: 42538
  • LoanStats_2012-2013: 188183
  • LoanStats_2014: 235631
  • LoanStats_2015: 421097
  • LoanStats_2016: 434415
  • LoanStats_2017: 443587
  • Total: 1765451

Data Formatting

Clean up the raw data.

  • Convert int_rate from percentage string to decimal
  • Convert revol_util from the percentage string to decimal
  • Extract issue_mth and issue_year from issue_d
  • Create a new column length_cr_hist using the earliest_cr_line and issue_d
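The formatting steps above might look like the sketch below. It rests on assumptions that are not stated on this page: dates are taken to be in a "Dec-2011" style, and `length_cr_hist` is measured in whole months. The helper names are hypothetical.

```python
from datetime import datetime


def percent_to_decimal(text):
    """Turn a string such as ' 13.56%' into 0.1356; blank values become None."""
    text = text.strip()
    if not text:
        return None
    return float(text.rstrip("%")) / 100.0


def parse_month_year(text):
    """Parse a 'Dec-2011' style date (assumed format) into a datetime."""
    return datetime.strptime(text.strip(), "%b-%Y")


def format_loan(raw):
    """Apply the four formatting steps to one raw loan record (a dict of strings)."""
    issue = parse_month_year(raw["issue_d"])
    earliest = parse_month_year(raw["earliest_cr_line"])
    return {
        "int_rate": percent_to_decimal(raw["int_rate"]),
        "revol_util": percent_to_decimal(raw["revol_util"]),
        "issue_mth": issue.month,
        "issue_year": issue.year,
        # Credit history length in whole months (the unit is an assumption).
        "length_cr_hist": (issue.year - earliest.year) * 12
        + (issue.month - earliest.month),
    }


# Made-up example record, for illustration only.
example = format_loan({
    "int_rate": " 13.56%",
    "revol_util": "83.7%",
    "issue_d": "Dec-2011",
    "earliest_cr_line": "Jan-1999",
})
```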

Variable Selection

Choose which predictors to use in our models.

Whenever an investor purchases notes from a loan, they receive a summary of all the recorded information about the borrower and the loan. We use only this information in our model. We also keep some additional helpful columns, for example "total_pymnt", for reference purposes.

Missing Data

We handled missing data differently for different predictors.

The `title` column contains far too many distinct values. Almost every loan has a unique title, so we can't use the column directly for modeling. However, we later attempted to extract the most common words that appear in the `title` column. More details can be found in feature engineering.
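One simple way to find the most common words in a free-text column is sketched below. The function name, the stop-word list, and the sample titles are all made up for illustration, and the approach may differ from the one used in our feature engineering.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be longer.
STOP_WORDS = frozenset({"to", "for", "my", "the", "a", "of", "and"})


def top_title_words(titles, n=2):
    """Return the n most frequent non-stop words across all titles."""
    counts = Counter()
    for title in titles:
        if not title:
            continue  # skip missing titles
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)


# Made-up sample titles.
titles = [
    "Debt consolidation",
    "debt consolidation loan",
    "Pay off debt",
    "Credit card refinancing",
    None,
    "Home improvement",
]
top_words = top_title_words(titles, n=2)
```

Each frequent word could then become an indicator feature, for example whether a title contains "debt".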

Exploratory Data Analysis

Prepare our data for modeling.

The plot of Grade vs. log of Annual Income for charged-off and fully paid loans shows how annual income relates to whether a borrower’s loan is charged off or fully paid. It also lets us look for a trend across grades (A to G). Although Grade A shows slightly higher annual income than the other grades, there is no significant difference in annual income between grades. However, regardless of grade, borrowers with higher income are more likely to repay the loan in full, while borrowers with lower income are more likely to be charged off. Hence annual income may have some predictive value in the model.

Models

Our models.

Here we can see the proportion of Fully Paid loans that is actually present in our data set, compared to what each model predicts. We can see that the model predictions are influenced by the way that instances of Fully Paid and Charged Off classes are distributed in the training data. When trained on the original data, random forest and AdaBoost predict a higher number of Fully Paid loans than is actually there for all sub-grades (this explains the high recall noted earlier). When trained on the class re-balanced data, random forest and AdaBoost predict a lower number of Fully Paid loans than is actually there for all sub-grades. However, given the noisy nature of the data, such results are not unexpected.
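The comparison described above can be sketched as follows, assuming each test-set loan is available as a (sub-grade, actual status, predicted status) triple. The function name and toy records are hypothetical. The share of loans that are actually Fully Paid is also the expected hit rate of a random draw, so it is the baseline against which the precision of the predictions is compared.

```python
from collections import defaultdict

FULLY_PAID = "Fully Paid"


def fully_paid_share_by_sub_grade(records):
    """records: iterable of (sub_grade, actual_status, predicted_status)."""
    totals = defaultdict(lambda: {"n": 0, "actual": 0, "predicted": 0, "hits": 0})
    for sub_grade, actual, predicted in records:
        t = totals[sub_grade]
        t["n"] += 1
        t["actual"] += actual == FULLY_PAID
        t["predicted"] += predicted == FULLY_PAID
        t["hits"] += actual == FULLY_PAID and predicted == FULLY_PAID
    summary = {}
    for sub_grade, t in totals.items():
        summary[sub_grade] = {
            # Share of loans that really were Fully Paid (random-draw baseline).
            "actual_share": t["actual"] / t["n"],
            # Share of loans the model labeled Fully Paid.
            "predicted_share": t["predicted"] / t["n"],
            # Of the loans labeled Fully Paid, the share that really were.
            "precision": t["hits"] / t["predicted"] if t["predicted"] else None,
        }
    return summary


# Toy records, for illustration only.
records = [
    ("C1", "Fully Paid", "Fully Paid"),
    ("C1", "Fully Paid", "Fully Paid"),
    ("C1", "Charged Off", "Charged Off"),
    ("C1", "Charged Off", "Fully Paid"),
    ("D1", "Fully Paid", "Fully Paid"),
    ("D1", "Charged Off", "Charged Off"),
]
summary = fully_paid_share_by_sub_grade(records)
```

In the toy data, the model labels more C1 loans Fully Paid than are actually there (0.75 against 0.5), which is the kind of over-prediction described above for models trained on the original data.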

Final Result and thoughts

What we learned from this project.

Our project goal was to employ the methods of data science to create a predictive tool that investors could use to select loans available through the Lending Club platform. One of the most important factors for any fixed-income investor is the odds that a borrower will stop paying and renege on the loan obligation. The Lending Club site provides data about such occurrences throughout its history, along with extensive information on all loans and borrowers. We collected this data, cleaned it, and merged it into one file. Through exploratory data analysis and feature engineering we selected a set of predictors appropriate for modeling. We used three main principles in feature selection: 1) the features had to be publicly available to investors at the time the loan was made, 2) the features had to have enough values to be meaningfully used in the model, and 3) the features had to be potentially helpful for separating loans in the Fully Paid class from loans in the Charged Off class.
During EDA we discovered that many predictors were weak – they did not separate the classes well. Therefore, our hope was that interesting interactions would be discovered amongst the features during modeling, and that this would lead to good predictions. We also had to contend with the issue of unbalanced data in the response classes (there were many more Fully Paid loans than Charged Off ones). The solution that we came up with was to fit some of the models with training sets that had re-balanced classes (via sub-grade stratified under-sampling of the larger class).
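A minimal sketch of sub-grade stratified under-sampling is shown below. It assumes that, within each sub-grade, the larger class is cut down to exactly the size of the smaller one. The exact ratio we used is not stated here, and the function name and toy loans are hypothetical.

```python
import random
from collections import defaultdict

STATUSES = ("Fully Paid", "Charged Off")


def undersample_by_sub_grade(loans, seed=0):
    """Within each sub-grade, randomly keep equally many loans of each status."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(lambda: defaultdict(list))
    for loan in loans:
        groups[loan["sub_grade"]][loan["loan_status"]].append(loan)

    balanced = []
    for by_status in groups.values():
        # Size of the smaller class; 0 if a class is absent, which drops
        # that sub-grade from the balanced set entirely.
        keep = min(len(by_status[status]) for status in STATUSES)
        for status in STATUSES:
            balanced.extend(rng.sample(by_status[status], keep))
    return balanced


# Toy training set: C1 is unbalanced (5 vs 2), D1 is already balanced (3 vs 3).
loans = (
    [{"sub_grade": "C1", "loan_status": "Fully Paid"} for _ in range(5)]
    + [{"sub_grade": "C1", "loan_status": "Charged Off"} for _ in range(2)]
    + [{"sub_grade": "D1", "loan_status": "Fully Paid"} for _ in range(3)]
    + [{"sub_grade": "D1", "loan_status": "Charged Off"} for _ in range(3)]
)
balanced = undersample_by_sub_grade(loans)
```

Only the training set would be re-balanced in this way. The test set keeps its original class distribution so that the evaluation reflects what an investor would actually face.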
After completing the above steps, we used logistic regression, random forest and boosting classifiers to make predictions for loans in grades C, D, E and F. Our best model turned out to be AdaBoost trained on the balanced data set. Its performance on the test set was superior to that of the trivial model for each sub-grade. (The trivial model would construct a portfolio of loans by picking them at random.) If the performance of our best model on the test set were to generalize to future predictions, then we contend that it would be of great benefit to investors in improving their loan portfolio performance. We were careful in evaluating the results of our model to make sure that they were not produced by some artifact during training.
There are several avenues for future work to extend our project and to use the Lending Club data in general. One interesting issue to consider carefully is whether or not Lending Club changed its grading procedures during its operations, and what impact this might have on predicting charge offs based on past data. Another exciting area of research would be to delve into the declined loans data, which we did not examine, and to investigate how Lending Club selects loans for its platform. This kind of project might have less direct application for investors, but it could provide interesting insights into the operations of Lending Club, including whether or not there are any potential instances of discrimination that occur.

About Discrimination

Why is Discrimination an issue?

Although lenders now make a good-faith determination of a borrower’s ability to afford a loan, lending discrimination hasn’t been eliminated. Lending discrimination occurs when a lender takes certain protected personal characteristics into account when denying a loan or imposing unfair terms on loans. Preliminary studies have shown that people of color pay higher interest rates than people who identify as white. In addition, young borrowers with lower education and women of color receive the highest rates. Although the federal Equal Credit Opportunity Act (ECOA) prohibits lenders from discriminating on the basis of race, religion, sex, or age, lending discrimination remains a challenge to be solved in the marketplace. In addition, unequal treatment of minorities with respect to race, gender and age is not motivated by racism alone; it can also stem from lower creditworthiness or other economic disparities.

Any Discrimination in Lending Club?

Although Lending Club makes investors promise not to violate borrower discrimination laws, unfair treatment of borrowers by investors is still possible. Even with the limited demographic information available, Lending Club provides the first three digits of the borrower’s zip code, which can reveal the borrower’s geographic location to the investor. Investors can then guess the distribution of various groups in that location, which gives some probability of the borrower’s race. In addition, in 2009 Lending Club’s SEC filings provided investors with information about the borrower’s hometown and current location, along with a message that might have included phrases like “my husband” or “my wife”, which indirectly disclosed the borrower’s gender and marital status.

Future Work?

Our modeling does not directly track discrimination practices by Lending Club, since the analysis is not based on declined loans or on any information about the borrower’s race, age, or gender. Although Lending Club doesn’t disclose a borrower’s race, age, or gender, these attributes can be revealed indirectly through demographic information. In future models, we could investigate locations based on zip codes and look for potential discrimination by lenders. In addition, we could investigate declined loans and see whether there is any connection between them and the locations they come from.

References

[1] Lending Club - How to use
https://www.lendingclub.com/investing/alternative-assets/how-it-works

[2] Lending Club Statistics. LendingClub
www.lendingclub.com/info/download-data.action.

[3] “Credit & Lending Discrimination and Borrowers' Rights.” Findlaw, Thomson Reuters,
https://civilrights.findlaw.com/discrimination/credit-lending-discrimination-and-borrowers-rights.html.

[4] “Lenders Can’t Discriminate, but What about Investors?” FT Alphaville, The Financial Times Ltd,
https://ftalphaville.ft.com/2016/01/13/2150093/lenders-cant-discriminate-but-what-about-investors/.