Predicting Loan Approval

Rob Martin

In this Kaggle competition, the goal is to predict whether an applicant is approved for a loan, using features such as their age, income, and reason for requesting the loan. I just completed a course on supervised learning with scikit-learn, and this challenge is my attempt to put that knowledge into practice.

# Import required libraries and read in dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

loan_df = pd.read_csv("loan_train_data.csv")
loan_df.head()
id person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length loan_status
0 0 37 35000 RENT 0.0 EDUCATION B 6000 11.49 0.17 N 14 0
1 1 22 56000 OWN 6.0 MEDICAL C 4000 13.35 0.07 N 2 0
2 2 29 28800 OWN 8.0 PERSONAL A 6000 8.90 0.21 N 10 0
3 3 30 70000 RENT 14.0 VENTURE B 12000 11.11 0.17 N 5 0
4 4 22 60000 RENT 2.0 MEDICAL A 6000 6.92 0.10 N 3 0
loan_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58645 entries, 0 to 58644
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   id                          58645 non-null  int64  
 1   person_age                  58645 non-null  int64  
 2   person_income               58645 non-null  int64  
 3   person_home_ownership       58645 non-null  object 
 4   person_emp_length           58645 non-null  float64
 5   loan_intent                 58645 non-null  object 
 6   loan_grade                  58645 non-null  object 
 7   loan_amnt                   58645 non-null  int64  
 8   loan_int_rate               58645 non-null  float64
 9   loan_percent_income         58645 non-null  float64
 10  cb_person_default_on_file   58645 non-null  object 
 11  cb_person_cred_hist_length  58645 non-null  int64  
 12  loan_status                 58645 non-null  int64  
dtypes: float64(3), int64(6), object(4)
memory usage: 5.8+ MB

It seems Kaggle has been nice and provided a dataset with no missing values, so there shouldn't be much cleaning up, but I will remove the id column since pandas provides an index automatically. I'll have a look at the categorical features as well to see what I'm working with.

# Remove redundant id column
loan_df.drop("id", axis=1, inplace=True)
loan_df["person_home_ownership"].value_counts()
person_home_ownership
RENT        30594
MORTGAGE    24824
OWN          3138
OTHER          89
Name: count, dtype: int64
loan_df["loan_intent"].value_counts()
loan_intent
EDUCATION            12271
MEDICAL              10934
PERSONAL             10016
VENTURE              10011
DEBTCONSOLIDATION     9133
HOMEIMPROVEMENT       6280
Name: count, dtype: int64
loan_df["loan_grade"].value_counts()
loan_grade
A    20984
B    20400
C    11036
D     5034
E     1009
F      149
G       33
Name: count, dtype: int64
loan_df["cb_person_default_on_file"].value_counts()
cb_person_default_on_file
N    49943
Y     8702
Name: count, dtype: int64

Some of the categories have so little data that the model might struggle to learn anything useful from them, for example "OTHER" in person_home_ownership (89 rows) and "G" in loan_grade (33 rows). However, it isn't obvious what to do with these categories right now, so I'll leave them alone for the moment. Later I will prepare all of these features for the model using One-Hot Encoding. Now I'll have a quick look at the distribution of the numerical data to see if there are any patterns or anything that needs cleaning.
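
As a quick aside before the histograms: if those sparse levels did cause problems later, one option would be to merge them into a catch-all category. This is just a sketch under that assumption, not something I'm applying here:

# Sketch (not run): collapse rare loan grades into a single catch-all level
rare_grades = loan_df["loan_grade"].value_counts()
rare_grades = rare_grades[rare_grades < 200].index  # would pick up "F" and "G"
merged_grades = loan_df["loan_grade"].replace(list(rare_grades), "RARE")  # hypothetical merged column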

# Draw a histogram for all the numeric features
loan_df.hist(figsize=(12,8))
plt.show()

It is a bit concerning that the person_age feature has a scale going up to 120. I'll check the maximum value for that column.

loan_df["person_age"].max()
123
loan_df[loan_df["person_age"]==123]
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length loan_status
47336 123 36000 MORTGAGE 7.0 PERSONAL B 6700 10.75 0.18 N 4 0

I mean... I guess that's possible?

loan_df["person_age"].value_counts()
person_age
23     7726
22     7051
24     6395
25     5067
27     4450
26     3874
28     3707
29     3270
30     2333
31     1917
21     1795
32     1565
33     1306
36     1117
34     1041
37      992
35      862
38      745
39      536
40      438
41      433
43      320
42      291
44      229
46      164
45      163
47      125
48       97
53       75
51       69
50       63
52       62
54       60
49       59
58       35
55       34
56       29
60       28
57       25
65       13
61       13
20       12
66       11
64       10
70       10
62        7
69        6
59        6
73        3
84        2
80        2
76        1
123       1
Name: count, dtype: int64

This person is definitely an outlier anyway, so I'll remove them from the dataset.

# Remove the person aged 123
loan_df.drop(loan_df[loan_df["person_age"]==123].index, inplace=True)
loan_df["person_age"].max()
84
loan_df["person_income"].describe().round()
count      58644.0
mean       64047.0
std        37931.0
min         4200.0
25%        42000.0
50%        58000.0
75%        75600.0
max      1900000.0
Name: person_income, dtype: float64
loan_df["person_emp_length"].describe().round()
count    58644.0
mean         5.0
std          4.0
min          0.0
25%          2.0
50%          4.0
75%          7.0
max        123.0
Name: person_emp_length, dtype: float64
loan_df[loan_df["person_emp_length"]==123]
person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length loan_status
41079 28 60350 MORTGAGE 123.0 MEDICAL D 25000 15.95 0.35 Y 6 1
49252 21 192000 MORTGAGE 123.0 VENTURE B 20000 11.49 0.10 N 2 0

The income distribution looks plausible (the 1,900,000 maximum is high but not impossible), but these two people have definitely not been employed for 123 years, especially not the 21-year-old. I'll remove them from the dataset too.

# Remove the people with 123 years of employment
loan_df.drop(loan_df[loan_df["person_emp_length"]==123].index, inplace=True)
loan_df["person_emp_length"].max()
41.0

The rest of the histograms look reasonable, so hopefully all of the invalid data has been removed now. Since Kaggle already provides a separate test set, there's no need to split this data into training and test sets myself. Two steps remain to prepare the data for training the model: first, scaling the numerical features, since they cover very different ranges of values; second, encoding the categorical features with One-Hot Encoding.
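
As an aside, scikit-learn can bundle both of these steps into a single ColumnTransformer, which also guarantees the test set ends up with exactly the same columns. I'm doing it step by step below instead so I can inspect the intermediate results, but a rough sketch of the combined approach would be:

# Sketch (alternative approach, not used below): scale and one-hot encode in one transformer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

num_cols = ["person_age", "person_income", "person_emp_length", "loan_amnt",
            "loan_int_rate", "loan_percent_income", "cb_person_cred_hist_length"]
cat_cols = ["person_home_ownership", "loan_intent", "loan_grade", "cb_person_default_on_file"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
# preprocess.fit_transform(loan_df[num_cols + cat_cols]) would give the model-ready feature matrix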

loan_df.reset_index(inplace=True)
# Split the features into numeric and categorical data
loan_df_num = loan_df.drop(columns=["person_home_ownership", "loan_intent", "loan_grade",
                                    "cb_person_default_on_file", "loan_status"])
loan_df_cat = loan_df[["person_home_ownership", "loan_intent", "loan_grade",
                       "cb_person_default_on_file"]]
# Create the labels
loan_df_lab = loan_df["loan_status"]
# Use StandardScaler to scale the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_loan_df_num = scaler.fit_transform(loan_df_num)
scaled_loan_df_num = pd.DataFrame(scaled_loan_df_num, columns=loan_df_num.columns)
scaled_loan_df_num.head()
index person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length
0 -1.731971 1.569782 -0.765783 -1.204620 -0.578277 0.267650 0.117408 2.031742
1 -1.731912 -0.921760 -0.212101 0.334194 -0.937774 0.880567 -0.973228 -0.946496
2 -1.731853 0.240960 -0.929251 0.847132 -0.578277 -0.585820 0.553662 1.038996
3 -1.731794 0.407063 0.157021 2.385947 0.500213 0.142431 0.117408 -0.201937
4 -1.731735 -0.921760 -0.106637 -0.691682 -0.578277 -1.238280 -0.646037 -0.698310
# Use the Pandas built-in One Hot Encoder
onehot_loan_df_cat = pd.get_dummies(loan_df_cat, dtype=int)
onehot_loan_df_cat.head()
person_home_ownership_MORTGAGE person_home_ownership_OTHER person_home_ownership_OWN person_home_ownership_RENT loan_intent_DEBTCONSOLIDATION loan_intent_EDUCATION loan_intent_HOMEIMPROVEMENT loan_intent_MEDICAL loan_intent_PERSONAL loan_intent_VENTURE loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G cb_person_default_on_file_N cb_person_default_on_file_Y
0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0
1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0
2 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0
3 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0
4 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0
# Combine the numeric and categorical features
new_loan_df = pd.concat([scaled_loan_df_num, onehot_loan_df_cat], axis=1)
new_loan_df.head()
index person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length person_home_ownership_MORTGAGE person_home_ownership_OTHER ... loan_intent_VENTURE loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G cb_person_default_on_file_N cb_person_default_on_file_Y
0 -1.731971 1.569782 -0.765783 -1.204620 -0.578277 0.267650 0.117408 2.031742 0 0 ... 0 0 1 0 0 0 0 0 1 0
1 -1.731912 -0.921760 -0.212101 0.334194 -0.937774 0.880567 -0.973228 -0.946496 0 0 ... 0 0 0 1 0 0 0 0 1 0
2 -1.731853 0.240960 -0.929251 0.847132 -0.578277 -0.585820 0.553662 1.038996 0 0 ... 0 1 0 0 0 0 0 0 1 0
3 -1.731794 0.407063 0.157021 2.385947 0.500213 0.142431 0.117408 -0.201937 0 0 ... 1 0 1 0 0 0 0 0 1 0
4 -1.731735 -0.921760 -0.106637 -0.691682 -0.578277 -1.238280 -0.646037 -0.698310 0 0 ... 0 1 0 0 0 0 0 0 1 0

5 rows × 27 columns

Now the data looks like it's in a format that's ready to train the model. Just to test it, I'll make a simple model without any hyperparameter tuning or cross validation.

# Make a K-Nearest Neighbors model
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(new_loan_df, loan_df_lab)
KNeighborsClassifier(n_neighbors=3)
knn.score(new_loan_df, loan_df_lab)
0.951280652092357
loan_df_lab.value_counts()
loan_status
0    50293
1     8349
Name: count, dtype: int64

The basic model already has a score of 0.951, which looks good, but that was measured on the same data the model was trained on, and only about 14% of the loans in the dataset were approved, so a plain accuracy number can be misleading here. Next I'll use cross-validation to get a fairer estimate.
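
First, though, a quick sanity check on that accuracy number. A classifier that always predicts "not approved" would already get roughly 0.86 accuracy given the class balance, so it's worth looking at an imbalance-aware metric as well. A small sketch of that check (not part of the main flow):

# Sketch: compare against a majority-class baseline and check F1 on the minority class
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

baseline = DummyClassifier(strategy="most_frequent")  # always predicts loan_status = 0
print(cross_val_score(baseline, new_loan_df, loan_df_lab, cv=5).mean())  # should be about 0.86
print(cross_val_score(knn, new_loan_df, loan_df_lab, cv=5, scoring="f1").mean())  # F1 for class 1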

from sklearn.model_selection import cross_val_score
scores = cross_val_score(knn, new_loan_df, loan_df_lab, cv=5)
scores
array([0.91891892, 0.92096513, 0.92385744, 0.92556276, 0.92291951])

So the model as it stands is more realistically around 0.92 accuracy on unseen data. Finally, time for some hyperparameter tuning.

from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "n_neighbors": [1,3,5,10,20],
    "weights": ["uniform", "distance"],
    "metric": ["minkowski", "manhattan"]
}

grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=StratifiedKFold(10))
grid_search.fit(new_loan_df, loan_df_lab)

print("Parameters:", grid_search.best_params_)
print("Score:", grid_search.best_score_)
Parameters: {'metric': 'manhattan', 'n_neighbors': 10, 'weights': 'distance'}
Score: 0.9356605478658631

I could try some more values for n_neighbors to squeeze the score a little higher, but I'll leave it at this. K-Nearest Neighbors reaches a cross-validated score of around 0.936 after tuning. For comparison, I'll try out Random Forest.

from sklearn.ensemble import RandomForestClassifier 

clf = RandomForestClassifier(n_estimators=100)
clf.fit(new_loan_df, loan_df_lab)
RandomForestClassifier()
clf.score(new_loan_df, loan_df_lab)
0.9999658947512022

That is an extremely high score, but it was again measured on the training data, so let's see whether the model is just overfitting...

scores = cross_val_score(clf, new_loan_df, loan_df_lab, cv=5)
scores
array([0.49160201, 0.95174354, 0.95063097, 0.95370055, 0.95182469])

I'm not sure why the first fold scores so low. My guess is that it has something to do with row order: cross_val_score splits the folds in order by default, and I also left the scaled index column in as a feature, so the first fold sees index values the model never trained on. One quick way to test that guess would be to rerun the cross-validation with shuffled folds, as sketched below. Either way, Random Forest seems to perform better than K-Nearest Neighbors, with scores around 0.95 on the other folds, so I'll do some hyperparameter tuning like before.
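
Here's that shuffled-fold check as a sketch (I'm noting it as a diagnostic rather than running it now):

# Sketch: rerun the cross-validation with shuffled stratified folds to test the row-order guess
from sklearn.model_selection import StratifiedKFold, cross_val_score

shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cross_val_score(clf, new_loan_df, loan_df_lab, cv=shuffled_cv)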

param_grid = {
    "n_estimators": [15, 50, 100, 200, 500],
    "max_depth": [5, 10, 15],
    "max_features": ["sqrt", "log2", None]
}

grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=StratifiedKFold(10))
grid_search.fit(new_loan_df, loan_df_lab)

print("Parameters:", grid_search.best_params_)
print("Score:", grid_search.best_score_)
Parameters: {'max_depth': 15, 'max_features': 'log2', 'n_estimators': 100}
Score: 0.9504281852132277

Since the best max_depth was 15, which was also the highest value I tried, I'll keep tuning with larger depths to see if performance improves further.

param_grid = {
    "n_estimators": [100],
    "max_depth": [15, 20, 25],
    "max_features": ["log2"]
}

grid_search2 = GridSearchCV(estimator=clf, param_grid=param_grid, cv=StratifiedKFold(10))
grid_search2.fit(new_loan_df, loan_df_lab)

print("Parameters:", grid_search2.best_params_)
print("Score:", grid_search2.best_score_)
Parameters: {'max_depth': 25, 'max_features': 'log2', 'n_estimators': 100}
Score: 0.9505134803194665

I don't really know how Random Forest works under the hood yet, but I do know that increasing max_depth lets the trees fit the training data more and more closely, at the risk of overfitting. Since the model with max_depth of 25 only scored marginally better, I will use the original grid search result (max_depth of 15) as my final solution. Using cross-validation, this model had a score of 0.950, and now I'll use it on the Kaggle test data (after one note-to-self below).
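
Note to self: one way to probe that max_depth intuition later would be to compare training accuracy against cross-validated accuracy as the depth grows; a widening gap between the two is a rough sign of overfitting. A sketch of that check, not run here:

# Sketch: watch the gap between training accuracy and cross-validated accuracy as max_depth grows
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

for depth in [5, 15, 25, None]:
    rf = RandomForestClassifier(n_estimators=100, max_features="log2", max_depth=depth, random_state=0)
    rf.fit(new_loan_df, loan_df_lab)
    train_acc = rf.score(new_loan_df, loan_df_lab)
    cv_acc = cross_val_score(rf, new_loan_df, loan_df_lab, cv=5).mean()
    print(depth, round(train_acc, 3), round(cv_acc, 3))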

test_df = pd.read_csv("loan_test_data.csv")
test_df.head()
id person_age person_income person_home_ownership person_emp_length loan_intent loan_grade loan_amnt loan_int_rate loan_percent_income cb_person_default_on_file cb_person_cred_hist_length
0 58645 23 69000 RENT 3.0 HOMEIMPROVEMENT F 25000 15.76 0.36 N 2
1 58646 26 96000 MORTGAGE 6.0 PERSONAL C 10000 12.68 0.10 Y 4
2 58647 26 30000 RENT 5.0 VENTURE E 4000 17.19 0.13 Y 2
3 58648 33 50000 RENT 4.0 DEBTCONSOLIDATION A 7000 8.90 0.14 N 7
4 58649 26 102000 MORTGAGE 8.0 HOMEIMPROVEMENT D 15000 16.32 0.15 Y 4
# Keep id for submission
ids = test_df["id"]

# Remove redundant id column
test_df.drop("id", axis=1, inplace=True)
test_df.reset_index(inplace=True)
# Split the features into numeric and categorical data
test_df_num = test_df.drop(columns=["person_home_ownership", "loan_intent", "loan_grade",
                                    "cb_person_default_on_file"])
test_df_cat = test_df[["person_home_ownership", "loan_intent", "loan_grade",
                       "cb_person_default_on_file"]]
test_df_num.head()
index person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length
0 0 23 69000 3.0 25000 15.76 0.36 2
1 1 26 96000 6.0 10000 12.68 0.10 4
2 2 26 30000 5.0 4000 17.19 0.13 2
3 3 33 50000 4.0 7000 8.90 0.14 7
4 4 26 102000 8.0 15000 16.32 0.15 4
# Use the already-fitted StandardScaler to scale the test features
scaled_test_df_num = scaler.transform(test_df_num)
scaled_test_df_num = pd.DataFrame(scaled_test_df_num, columns=test_df_num.columns)
scaled_test_df_num.head()
index person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length
0 -1.731971 -0.755657 0.130655 -0.435213 2.836942 1.674723 2.189617 -0.946496
1 -1.731912 -0.257349 0.842532 0.334194 0.140717 0.659785 -0.646037 -0.450123
2 -1.731853 -0.257349 -0.897612 0.077725 -0.937774 2.145944 -0.318847 -0.946496
3 -1.731794 0.905371 -0.370296 -0.178744 -0.398528 -0.585820 -0.209783 0.294436
4 -1.731735 -0.257349 1.000727 0.847132 1.039459 1.859257 -0.100719 -0.450123
# Use the Pandas built-in One Hot Encoder (this only works because the test set happens to
# contain every category; a OneHotEncoder fitted on the training data would be safer)
onehot_test_df_cat = pd.get_dummies(test_df_cat, dtype=int)
onehot_test_df_cat.head()
person_home_ownership_MORTGAGE person_home_ownership_OTHER person_home_ownership_OWN person_home_ownership_RENT loan_intent_DEBTCONSOLIDATION loan_intent_EDUCATION loan_intent_HOMEIMPROVEMENT loan_intent_MEDICAL loan_intent_PERSONAL loan_intent_VENTURE loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G cb_person_default_on_file_N cb_person_default_on_file_Y
0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0
1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1
2 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1
3 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0
4 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1
# Combine the numeric and categorical features
new_test_df = pd.concat([scaled_test_df_num, onehot_test_df_cat], axis=1)
new_test_df.head()
index person_age person_income person_emp_length loan_amnt loan_int_rate loan_percent_income cb_person_cred_hist_length person_home_ownership_MORTGAGE person_home_ownership_OTHER ... loan_intent_VENTURE loan_grade_A loan_grade_B loan_grade_C loan_grade_D loan_grade_E loan_grade_F loan_grade_G cb_person_default_on_file_N cb_person_default_on_file_Y
0 -1.731971 -0.755657 0.130655 -0.435213 2.836942 1.674723 2.189617 -0.946496 0 0 ... 0 0 0 0 0 0 1 0 1 0
1 -1.731912 -0.257349 0.842532 0.334194 0.140717 0.659785 -0.646037 -0.450123 1 0 ... 0 0 0 1 0 0 0 0 0 1
2 -1.731853 -0.257349 -0.897612 0.077725 -0.937774 2.145944 -0.318847 -0.946496 0 0 ... 1 0 0 0 0 1 0 0 0 1
3 -1.731794 0.905371 -0.370296 -0.178744 -0.398528 -0.585820 -0.209783 0.294436 0 0 ... 0 1 0 0 0 0 0 0 1 0
4 -1.731735 -0.257349 1.000727 0.847132 1.039459 1.859257 -0.100719 -0.450123 1 0 ... 0 0 0 0 1 0 0 0 0 1

5 rows × 27 columns

Now the test data is finally ready for some predictions! The submission format is simply a file with the id and the predicted loan_status for each row.

pred = grid_search.predict(new_test_df)
pred
array([1, 0, 1, ..., 0, 0, 1], dtype=int64)
submission = pd.DataFrame(pred, index=ids, columns=["loan_status"])
submission
loan_status
id
58645 1
58646 0
58647 1
58648 0
58649 0
... ...
97738 0
97739 0
97740 0
97741 0
97742 1

39098 rows × 1 columns

# Write the predictions to a csv file
submission.to_csv("submission.csv")

I ended up getting a score of 0.859 on the leaderboard, which is a lot lower than I expected. My first thought was that the model had overfitted to the training data, but I used cross-validation, so I'm not sure how that would have happened. EDIT: Kaggle scores this competition with a different metric, which might explain why the number is lower than my cross-validation accuracy.
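
If the leaderboard metric turns out to be something probability-based like ROC AUC (I'd have to double-check the competition page), then submitting predicted probabilities instead of hard 0/1 labels might score noticeably better. A sketch of what that submission would look like (the filename is just a placeholder):

# Sketch: submit predicted probabilities instead of hard labels
proba = grid_search.predict_proba(new_test_df)[:, 1]  # probability that loan_status == 1
submission_proba = pd.DataFrame(proba, index=ids, columns=["loan_status"])
submission_proba.to_csv("submission_proba.csv")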

For my first classification project, I would say it was pretty successful. I was able to clean the data, prepare it by scaling the numerical features and encoding the categorical ones, train two different models, and perform some hyperparameter tuning. My next step is to learn how these models actually work under the hood so I can make better-informed decisions about which models and settings to use in future projects.