Predicting Loan Approval

Rob Martin

In this Kaggle competition, the goal is to predict whether an applicant is approved for a loan, using features such as their age, income, and reason for requesting the loan. I just completed a course on supervised learning with scikit-learn, and this challenge is my attempt to put that knowledge into practice.

# Import required libraries and read in dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

loan_df = pd.read_csv("loan_train_data.csv")
loan_df.head()
id person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length loan_status
0 0 37 35000 RENT 0.0 EDUCATION B 6000 11.49 0.17 N 14 0
1 1 22 56000 OWN 6.0 MEDICAL C 4000 13.35 0.07 N 2 0
2 2 29 28800 OWN 8.0 PERSONAL A 6000 8.90 0.21 N 10 0
3 3 30 70000 RENT 14.0 VENTURE B 12000 11.11 0.17 N 5 0
4 4 22 60000 RENT 2.0 MEDICAL A 6000 6.92 0.10 N 3 0
loan_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58645 entries, 0 to 58644
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          58645 non-null  int64  
 1   person_age                  58645 non-null  int64  
 2   person_income               58645 non-null  int64  
 3   person_home_ownership       58645 non-null  object 
 4   person_emp_length           58645 non-null  float64
 5   loan_intent                 58645 non-null  object 
 6   loan_grade                  58645 non-null  object 
 7   loan_amnt                   58645 non-null  int64  
 8   loan_int_rate               58645 non-null  float64
 9   loan_percent_income         58645 non-null  float64
 10  cb_person_default_on_file   58645 non-null  object 
 11  cb_person_cred_hist_length  58645 non-null  int64  
 12  loan_status                 58645 non-null  int64  
dtypes: float64(3), int64(6), object(4)
memory usage: 5.8+ MB

It seems Kaggle has been nice and provided a dataset with no missing values, so there shouldn't be much cleaning up, but I will remove the id column since pandas provides an index automatically. I'll have a look at the categorical features as well to see what I'm working with.

# Remove redundant id column
loan_df.drop("id", axis=1, inplace=True)
loan_df["person_home_ownership"].value_counts()
person_home_ownership
RENT        30594
MORTGAGE    24824
OWN          3138
OTHER          89
Name: count, dtype: int64
loan_df["loan_intent"].value_counts()
loan_intent
EDUCATION            12271
MEDICAL              10934
PERSONAL             10016
VENTURE              10011
DEBTCONSOLIDATION     9133
HOMEIMPROVEMENT       6280
Name: count, dtype: int64
loan_df["loan_grade"].value_counts()
loan_grade
A    20984
B    20400
C    11036
D     5034
E     1009
F      149
G       33
Name: count, dtype: int64
loan_df["cb_person_default_on_file"].value_counts()
cb_person_default_on_file
N    49943
Y     8702
Name: count, dtype: int64

Some of the categories have so little data that the model might struggle to learn anything useful from them, for example "OTHER" in person_home_ownership (89 rows) and "G" in loan_grade (33 rows). However, it isn't obvious what to do with these categories right now, so I'll leave them alone for the moment. Later I will prepare all of these features for the model using One-Hot Encoding. Now I'll have a quick look at the distribution of the numerical data to see if there are any patterns or anything that needs cleaning.
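
As a quick aside before the histograms: if those sparse levels did cause problems later, one option would be to merge them into a catch-all category. This is just a sketch under that assumption, not something I'm applying here:

# Sketch (not run): collapse rare loan grades into a single catch-all level
rare_grades = loan_df["loan_grade"].value_counts()
rare_grades = rare_grades[rare_grades < 200].index  # would pick up "F" and "G"
merged_grades = loan_df["loan_grade"].replace(list(rare_grades), "RARE")  # hypothetical merged column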

# Draw a histogram for all the numeric features
loan_df.hist(figsize=(12,8))
plt.show()

It is a bit concerning that the person_age feature has a scale going up to 120. I'll check the maximum value for that column.

loan_df["person_age"].max()
123
loan_df[loan_df["person_age"]==123]
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length loan_status
47336 123 36000 MORTGAGE 7.0 PERSONAL B 6700 10.75 0.18 N 4 0

I mean... I guess that's possible?

loan_df["person_age"].value_counts()
person_age
23     7726
22     7051
24     6395
25     5067
27     4450
26     3874
28     3707
29     3270
30     2333
31     1917
21     1795
32     1565
33     1306
36     1117
34     1041
37      992
35      862
38      745
39      536
40      438
41      433
43      320
42      291
44      229
46      164
45      163
47      125
48       97
53       75
51       69
50       63
52       62
54       60
49       59
58       35
55       34
56       29
60       28
57       25
65       13
61       13
20       12
66       11
64       10
70       10
62        7
69        6
59        6
73        3
84        2
80        2
76        1
123       1
Name: count, dtype: int64

This person is definitely an outlier anyway, so I'll remove them from the dataset.

# Remove the person aged 123
loan_df.drop(loan_df[loan_df["person_age"]==123].index, inplace=True)
loan_df["person_age"].max()
84
loan_df["person_income"].describe().round()
count      58644.0
mean       64047.0
std        37931.0
min         4200.0
25%        42000.0
50%        58000.0
75%        75600.0
max      1900000.0
Name: person_income, dtype: float64
loan_df["person_emp_length"].describe().round()
count    58644.0
mean         5.0
std          4.0
min          0.0
25%          2.0
50%          4.0
75%          7.0
max        123.0
Name: person_emp_length, dtype: float64
loan_df[loan_df["person_emp_length"]==123]
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length loan_status
41079 28 60350 MORTGAGE 123.0 MEDICAL D 25000 15.95 0.35 Y 6 1
49252 21 192000 MORTGAGE 123.0 VENTURE B 20000 11.49 0.10 N 2 0

The income distribution looks plausible (the 1,900,000 maximum is high but not impossible), but these two people have definitely not been employed for 123 years, especially not the 21-year-old. I'll remove them from the dataset too.

# Remove the people with 123 years of employment
loan_df.drop(loan_df[loan_df["person_emp_length"]==123].index, inplace=True)
loan_df["person_emp_length"].max()
41.0

The rest of the histograms look reasonable, so hopefully all of the invalid data has been removed now. Since Kaggle already provides a separate test set, there's no need to split this data into training and test sets myself. Two steps remain to prepare the data for training the model: first, scaling the numerical features, since they cover very different ranges of values; second, encoding the categorical features with One-Hot Encoding.
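
As an aside, scikit-learn can bundle both of these steps into a single ColumnTransformer, which also guarantees the test set ends up with exactly the same columns. I'm doing it step by step below instead so I can inspect the intermediate results, but a rough sketch of the combined approach would be:

# Sketch (alternative approach, not used below): scale and one-hot encode in one transformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_cols = ["person_age", "person_income", "person_emp_length", "loan_amnt",
            "loan_int_rate", "loan_percent_income", "cb_person_cred_hist_length"]
cat_cols = ["person_home_ownership", "loan_intent", "loan_grade", "cb_person_default_on_file"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
# preprocess.fit_transform(loan_df[num_cols + cat_cols]) would give the model-ready feature matrix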

loan_df.reset_index(inplace=True)
# Split the features into numeric and categorical data
loan_df_num = loan_df.drop(columns=["person_home_ownership", "loan_intent", "loan_grade",
                                    "cb_person_default_on_file", "loan_status"])
loan_df_cat = loan_df[["person_home_ownership", "loan_intent", "loan_grade",
                       "cb_person_default_on_file"]]
# Create the labels
loan_df_lab = loan_df["loan_status"]
# Use StandardScaler to scale the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_loan_df_num = scaler.fit_transform(loan_df_num)
scaled_loan_df_num = pd.DataFrame(scaled_loan_df_num, columns=loan_df_num.columns)
scaled_loan_df_num.head()
index person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length
0 -1.731971 1.569782 -0.765783 -1.204620 -0.578277 0.267650 0.117408 2.031742
1 -1.731912 -0.921760 -0.212101 0.334194 -0.937774 0.880567 -0.973228 -0.946496
2 -1.731853 0.240960 -0.929251 0.847132 -0.578277 -0.585820 0.553662 1.038996
3 -1.731794 0.407063 0.157021 2.385947 0.500213 0.142431 0.117408 -0.201937
4 -1.731735 -0.921760 -0.106637 -0.691682 -0.578277 -1.238280 -0.646037 -0.698310
# Use the Pandas built-in One Hot Encoder
onehot_loan_df_cat = pd.get_dummies(loan_df_cat, dtype=int)
onehot_loan_df_cat.head()
person_home_ownership_MORTGAGE person_home_ownership_OTHER person_home_ownership_OWN person_home_ownership_RENT loan_intent_DEBTCONSOLIDATION loan_intent_EDUCATION loan_intent_HOMEIMPROVEMENT loan_intent_MEDICAL loan_intent_PERSONAL loan_intent_VENTURE loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G cb_person_default_on_file_N cb_person_default_on_file_Y
0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0
1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0
2 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0
3 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0
4 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0
# Combine the numeric and categorical features
new_loan_df = pd.concat([scaled_loan_df_num, onehot_loan_df_cat], axis=1)
new_loan_df.head()
index person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length person_home_ownership_MORTGAGE person_home_ownership_OTHER ... loan_intent_VENTURE loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G cb_person_default_on_file_N cb_person_default_on_file_Y
0 -1.731971 1.569782 -0.765783 -1.204620 -0.578277 0.267650 0.117408 2.031742 0 0 ... 0 0 1 0 0 0 0 0 1 0
1 -1.731912 -0.921760 -0.212101 0.334194 -0.937774 0.880567 -0.973228 -0.946496 0 0 ... 0 0 0 1 0 0 0 0 1 0
2 -1.731853 0.240960 -0.929251 0.847132 -0.578277 -0.585820 0.553662 1.038996 0 0 ... 0 1 0 0 0 0 0 0 1 0
3 -1.731794 0.407063 0.157021 2.385947 0.500213 0.142431 0.117408 -0.201937 0 0 ... 1 0 1 0 0 0 0 0 1 0
4 -1.731735 -0.921760 -0.106637 -0.691682 -0.578277 -1.238280 -0.646037 -0.698310 0 0 ... 0 1 0 0 0 0 0 0 1 0

5 rows × 27 columns

Now the data looks like it's in a format that's ready to train the model. Just to test it, I'll make a simple model without any hyperparameter tuning or cross validation.

# Make a K-Nearest Neighbors model
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(new_loan_df, loan_df_lab)
KNeighborsClassifier(n_neighbors=3)
knn.score(new_loan_df, loan_df_lab)
0.951280652092357
loan_df_lab.value_counts()
loan_status
0    50293
1     8349
Name: count, dtype: int64

The basic model already has a score of 0.951, which looks good, but that was measured on the same data the model was trained on, and only about 14% of the loans in the dataset were approved, so a plain accuracy number can be misleading here. Next I'll use cross-validation to get a fairer estimate.
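
First, though, a quick sanity check on that accuracy number. A classifier that always predicts "not approved" would already get roughly 0.86 accuracy given the class balance, so it's worth looking at an imbalance-aware metric as well. A small sketch of that check (not part of the main flow):

# Sketch: compare against a majority-class baseline and check F1 on the minority class
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

baseline = DummyClassifier(strategy="most_frequent")  # always predicts loan_status = 0
print(cross_val_score(baseline, new_loan_df, loan_df_lab, cv=5).mean())  # should be about 0.86
print(cross_val_score(knn, new_loan_df, loan_df_lab, cv=5, scoring="f1").mean())  # F1 for class 1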

from sklearn.model_selection import cross_val_score
scores = cross_val_score(knn, new_loan_df, loan_df_lab, cv=5)
scores
array([0.91891892, 0.92096513, 0.92385744, 0.92556276, 0.92291951])

So the model as it stands is more realistically around 0.92 accuracy on unseen data. Finally, time for some hyperparameter tuning.

from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "n_neighbors": [1,3,5,10,20],
    "weights": ["uniform", "distance"],
    "metric": ["minkowski", "manhattan"]
}

grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=StratifiedKFold(10))
grid_search.fit(new_loan_df, loan_df_lab)

print("Parameters:", grid_search.best_params_)
print("Score:", grid_search.best_score_)
Parameters: {'metric': 'manhattan', 'n_neighbors': 10, 'weights': 'distance'}
Score: 0.9356605478658631

I could try some more values for n_neighbors to squeeze the score a little higher, but I'll leave it at this. K-Nearest Neighbors reaches a cross-validated score of around 0.936 after tuning. For comparison, I'll try out Random Forest.

from sklearn.ensemble import RandomForestClassifier 

clf = RandomForestClassifier(n_estimators=100)
clf.fit(new_loan_df, loan_df_lab)
RandomForestClassifier()
clf.score(new_loan_df, loan_df_lab)
0.9999658947512022

That is an extremely high score, but it was again measured on the training data, so let's see whether the model is just overfitting...

scores = cross_val_score(clf, new_loan_df, loan_df_lab, cv=5)
scores
array([0.49160201, 0.95174354, 0.95063097, 0.95370055, 0.95182469])

I'm not sure why the first fold scores so low. My guess is that it has something to do with row order: cross_val_score splits the folds in order by default, and I also left the scaled index column in as a feature, so the first fold sees index values the model never trained on. One quick way to test that guess would be to rerun the cross-validation with shuffled folds, as sketched below. Either way, Random Forest seems to perform better than K-Nearest Neighbors, with scores around 0.95 on the other folds, so I'll do some hyperparameter tuning like before.
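
Here's that shuffled-fold check as a sketch (I'm noting it as a diagnostic rather than running it now):

# Sketch: rerun the cross-validation with shuffled stratified folds to test the row-order guess
from sklearn.model_selection import StratifiedKFold, cross_val_score

shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cross_val_score(clf, new_loan_df, loan_df_lab, cv=shuffled_cv)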

param_grid = {
    "n_estimators": [15, 50, 100, 200, 500],
    "max_depth": [5, 10, 15],
    "max_features": ["sqrt", "log2", None]
}

grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=StratifiedKFold(10))
grid_search.fit(new_loan_df, loan_df_lab)

print("Parameters:", grid_search.best_params_)
print("Score:", grid_search.best_score_)
Parameters: {'max_depth': 15, 'max_features': 'log2', 'n_estimators': 100}
Score: 0.9504281852132277

Since the best max_depth was 15, which was also the highest value I tried, I'll keep tuning with larger depths to see if performance improves further.

param_grid = {
    "n_estimators": [100],
    "max_depth": [15, 20, 25],
    "max_features": ["log2"]
}

grid_search2 = GridSearchCV(estimator=clf, param_grid=param_grid, cv=StratifiedKFold(10))
grid_search2.fit(new_loan_df, loan_df_lab)

print("Parameters:", grid_search2.best_params_)
print("Score:", grid_search2.best_score_)
Parameters: {'max_depth': 25, 'max_features': 'log2', 'n_estimators': 100}
Score: 0.9505134803194665

I don't really know how Random Forest works under the hood yet, but I do know that increasing max_depth lets the trees fit the training data more and more closely, at the risk of overfitting. Since the model with max_depth of 25 only scored marginally better, I will use the original grid search result (max_depth of 15) as my final solution. Using cross-validation, this model had a score of 0.950, and now I'll use it on the Kaggle test data (after one note-to-self below).
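
Note to self: one way to probe that max_depth intuition later would be to compare training accuracy against cross-validated accuracy as the depth grows; a widening gap between the two is a rough sign of overfitting. A sketch of that check, not run here:

# Sketch: watch the gap between training accuracy and cross-validated accuracy as max_depth grows
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

for depth in [5, 15, 25, None]:
    rf = RandomForestClassifier(n_estimators=100, max_features="log2", max_depth=depth, random_state=0)
    rf.fit(new_loan_df, loan_df_lab)
    train_acc = rf.score(new_loan_df, loan_df_lab)
    cv_acc = cross_val_score(rf, new_loan_df, loan_df_lab, cv=5).mean()
    print(depth, round(train_acc, 3), round(cv_acc, 3))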

test_df = pd.read_csv("loan_test_data.csv")
test_df.head()
id person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 58645 23 69000 RENT 3.0 HOMEIMPROVEMENT F 25000 15.76 0.36 N 2
1 58646 26 96000 MORTGAGE 6.0 PERSONAL C 10000 12.68 0.10 Y 4
2 58647 26 30000 RENT 5.0 VENTURE E 4000 17.19 0.13 Y 2
3 58648 33 50000 RENT 4.0 DEBTCONSOLIDATION A 7000 8.90 0.14 N 7
4 58649 26 102000 MORTGAGE 8.0 HOMEIMPROVEMENT D 15000 16.32 0.15 Y 4
# Keep id for submission
ids = test_df["id"]

# Remove redundant id column
test_df.drop("id", axis=1, inplace=True)
test_df.reset_index(inplace=True)
# Split the features into numeric and categorical data
test_df_num = test_df.drop(columns=["person_home_ownership", "loan_intent", "loan_grade",
                                    "cb_person_default_on_file"])
test_df_cat = test_df[["person_home_ownership", "loan_intent", "loan_grade",
                       "cb_person_default_on_file"]]
test_df_num.head()
index person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length
0 0 23 69000 3.0 25000 15.76 0.36 2
1 1 26 96000 6.0 10000 12.68 0.10 4
2 2 26 30000 5.0 4000 17.19 0.13 2
3 3 33 50000 4.0 7000 8.90 0.14 7
4 4 26 102000 8.0 15000 16.32 0.15 4
# Use the already-fitted StandardScaler to scale the test features
scaled_test_df_num = scaler.transform(test_df_num)
scaled_test_df_num = pd.DataFrame(scaled_test_df_num, columns=test_df_num.columns)
scaled_test_df_num.head()
index person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length
0 -1.731971 -0.755657 0.130655 -0.435213 2.836942 1.674723 2.189617 -0.946496
1 -1.731912 -0.257349 0.842532 0.334194 0.140717 0.659785 -0.646037 -0.450123
2 -1.731853 -0.257349 -0.897612 0.077725 -0.937774 2.145944 -0.318847 -0.946496
3 -1.731794 0.905371 -0.370296 -0.178744 -0.398528 -0.585820 -0.209783 0.294436
4 -1.731735 -0.257349 1.000727 0.847132 1.039459 1.859257 -0.100719 -0.450123
# Use the Pandas built-in One Hot Encoder (this only works because the test set happens to
# contain every category; a OneHotEncoder fitted on the training data would be safer)
onehot_test_df_cat = pd.get_dummies(test_df_cat, dtype=int)
onehot_test_df_cat.head()
person_home_ownership_MORTGAGE person_home_ownership_OTHER person_home_ownership_OWN person_home_ownership_RENT loan_intent_DEBTCONSOLIDATION loan_intent_EDUCATION loan_intent_HOMEIMPROVEMENT loan_intent_MEDICAL loan_intent_PERSONAL loan_intent_VENTURE loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G cb_person_default_on_file_N cb_person_default_on_file_Y
0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0
1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1
2 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1
3 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0
4 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1
# Combine the numeric and categorical features
new_test_df = pd.concat([scaled_test_df_num, onehot_test_df_cat], axis=1)
new_test_df.head()
index person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length person_home_ownership_MORTGAGE person_home_ownership_OTHER ... loan_intent_VENTURE loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G cb_person_default_on_file_N cb_person_default_on_file_Y
0 -1.731971 -0.755657 0.130655 -0.435213 2.836942 1.674723 2.189617 -0.946496 0 0 ... 0 0 0 0 0 0 1 0 1 0
1 -1.731912 -0.257349 0.842532 0.334194 0.140717 0.659785 -0.646037 -0.450123 1 0 ... 0 0 0 1 0 0 0 0 0 1
2 -1.731853 -0.257349 -0.897612 0.077725 -0.937774 2.145944 -0.318847 -0.946496 0 0 ... 1 0 0 0 0 1 0 0 0 1
3 -1.731794 0.905371 -0.370296 -0.178744 -0.398528 -0.585820 -0.209783 0.294436 0 0 ... 0 1 0 0 0 0 0 0 1 0
4 -1.731735 -0.257349 1.000727 0.847132 1.039459 1.859257 -0.100719 -0.450123 1 0 ... 0 0 0 0 1 0 0 0 0 1

5 rows × 27 columns

Now the test data is finally ready for some predictions! The submission format is simply a file with the id and the predicted loan_status for each row.

pred = grid_search.predict(new_test_df)
pred
array([1, 0, 1, ..., 0, 0, 1], dtype=int64)
submission = pd.DataFrame(pred, index=ids, columns=["loan_status"])
submission
loan_status
id
58645 1
58646 0
58647 1
58648 0
58649 0
... ...
97738 0
97739 0
97740 0
97741 0
97742 1

39098 rows × 1 columns

# Write the predictions to a csv file
submission.to_csv("submission.csv")

I ended up getting a score of 0.859 on the leaderboard, which is a lot lower than I expected. My first thought was that the model had overfitted to the training data, but I used cross-validation, so I'm not sure how that would have happened. EDIT: Kaggle scores this competition with a different metric, which might explain why the number is lower than my cross-validation accuracy.
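
If the leaderboard metric turns out to be something probability-based like ROC AUC (I'd have to double-check the competition page), then submitting predicted probabilities instead of hard 0/1 labels might score noticeably better. A sketch of what that submission would look like (the filename is just a placeholder):

# Sketch: submit predicted probabilities instead of hard labels
proba = grid_search.predict_proba(new_test_df)[:, 1]  # probability that loan_status == 1
submission_proba = pd.DataFrame(proba, index=ids, columns=["loan_status"])
submission_proba.to_csv("submission_proba.csv")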

For my first classification project, I would say it was pretty successful. I was able to clean the data, prepare it by scaling the numerical features and encoding the categorical ones, train two different models, and perform some hyperparameter tuning. My next step is to learn how these models actually work under the hood so I can make better-informed decisions about which models and settings to use in future projects.