Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp01794080786
Title: Predictive Analytics of Loan Issuance And Default using Random Forests in the Online Peer-to-Peer Lending Marketplace
Authors: Haque, Anne
Advisors: Mian, Atif R.
Department: Economics
Class Year: 2017
Abstract: With the prominence of Big Data, disintermediation in the financial sector due to a shift towards online platforms, as well as increasingly complex predictive analytic techniques, credit-scoring models that can accurately classify credit risk have been developed and may be especially useful in the context of online peer-to-peer lending. This study compares machine-learning techniques for classifying consumer loans at America’s two most popular online peer-to-peer lenders. Using data from Lending Club and Prosper Marketplace, this study uses borrower characteristics as inputs to predict loan issuance and loan default, and compares the explanatory power of Random Forests against a baseline Logistic Regression model. The study also attempts to extract the variables that, according to the models, are most important for predicting the outcomes of interest, and applies the models to calculating expected returns for the firms. The study found that, unlike previous literature where one model often significantly outperformed the others, Random Forest only slightly outperformed Logistic Regression for both loan issuance and loan default predictions, and only in terms of accuracy; Logistic Regression often had larger AUC values. Hence, the results overall were inconclusive with regard to either model and varied between the time periods and the companies investigated. We used the statistical program R. After splitting the data into a 75% training set and a 25% test set, we performed 3-fold cross-validation on a subsample of our data to reduce bias in the training set and to provide a more robust measure of accuracy. This study suggests that though Random Forests and complex predictive techniques are powerful tools, they may be best for discrete variables, while Logistic Regression works best with continuous variables with specified cutoffs.
We also found that fewer input variables may be powerful if chosen carefully: the prediction accuracy on the loan issuance data, which included only 7 variables, was much greater than that of the default predictions, which included up to 57 input variables. Furthermore, our AUC and kappa measures were not very high overall. We believe this was due to the unbalanced nature of the datasets; the rigorous pre-processing required to clean the data, which removed many observations; the use of only a subsample of observations, owing to the intensive computational power needed to predict on such a large dataset; and the number of categorical variables included in this study. Lastly, we found that, based strictly on our models and the random subsample of data, Logistic Regression seems to match the observed results within our dataset more closely when calculating expected return, yet the peer-to-peer lending companies may employ algorithms modelled more closely on Random Forest or other complex techniques that allow higher default thresholds to be set and subsequently result in higher expected returns when optimizing for true positive outcomes or minimizing false negative outcomes.
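The gap the abstract describes between accuracy on one hand and AUC and kappa on the other is typical of unbalanced data. The following is a minimal Python sketch of the three measures, not code from the thesis (which used R). The 95/5 class split, the trivial "never predict default" model, and all function names are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Share of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Agreement beyond chance for binary labels coded 0/1."""
    n = len(y_true)
    p_observed = accuracy(y_true, y_pred)
    # Chance agreement computed from the marginal rate of each class.
    true_rate = sum(y_true) / n
    pred_rate = sum(y_pred) / n
    p_chance = true_rate * pred_rate + (1 - true_rate) * (1 - pred_rate)
    if p_chance == 1.0:
        return 0.0  # degenerate case: only one class present everywhere
    return (p_observed - p_chance) / (1 - p_chance)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative.

    Ties count as one half. Quadratic in sample size, so only suitable
    for small illustrative inputs.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical sample: 95 repaid loans (0) and 5 defaults (1).
y_true = [0] * 95 + [1] * 5
# A trivial model that never predicts default and scores every loan alike.
y_pred = [0] * 100
scores = [0.1] * 100

acc = accuracy(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
area = auc(y_true, scores)
print(f"accuracy={acc:.2f} kappa={kappa:.2f} auc={area:.2f}")
```

On this made-up sample the trivial model reaches 95% accuracy, yet its kappa is 0 and its AUC is 0.5, meaning it does no better than chance at separating defaults from repaid loans. This is one way high accuracy and low AUC or kappa can coexist on unbalanced data.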
URI: http://arks.princeton.edu/ark:/88435/dsp01794080786
Type of Material: Princeton University Senior Theses
Language: en_US
Appears in Collections: Economics, 1927-2020

Files in This Item:
File                    Size      Format
THESIS_ANNEHAQUE.pdf    3.13 MB   Adobe PDF


Items in Dataspace are protected by copyright, with all rights reserved, unless otherwise indicated.