In this Kaggle competition, the goal is to predict whether an applicant is approved for a loan using features such as their age, income, and reason for requesting the loan. I just completed a course on supervised learning with scikit-learn, and I'll use what I learned to attempt this challenge.
# Import required libraries and read in dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
loan_df = pd.read_csv("loan_train_data.csv")
loan_df.head()
| | id | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length | loan_status |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 37 | 35000 | RENT | 0.0 | EDUCATION | B | 6000 | 11.49 | 0.17 | N | 14 | 0 |
1 | 1 | 22 | 56000 | OWN | 6.0 | MEDICAL | C | 4000 | 13.35 | 0.07 | N | 2 | 0 |
2 | 2 | 29 | 28800 | OWN | 8.0 | PERSONAL | A | 6000 | 8.90 | 0.21 | N | 10 | 0 |
3 | 3 | 30 | 70000 | RENT | 14.0 | VENTURE | B | 12000 | 11.11 | 0.17 | N | 5 | 0 |
4 | 4 | 22 | 60000 | RENT | 2.0 | MEDICAL | A | 6000 | 6.92 | 0.10 | N | 3 | 0 |
loan_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58645 entries, 0 to 58644
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   id                          58645 non-null  int64
 1   person_age                  58645 non-null  int64
 2   person_income               58645 non-null  int64
 3   person_home_ownership       58645 non-null  object
 4   person_emp_length           58645 non-null  float64
 5   loan_intent                 58645 non-null  object
 6   loan_grade                  58645 non-null  object
 7   loan_amnt                   58645 non-null  int64
 8   loan_int_rate               58645 non-null  float64
 9   loan_percent_income         58645 non-null  float64
 10  cb_person_default_on_file   58645 non-null  object
 11  cb_person_cred_hist_length  58645 non-null  int64
 12  loan_status                 58645 non-null  int64
dtypes: float64(3), int64(6), object(4)
memory usage: 5.8+ MB
It seems Kaggle has been nice and provided a dataset with no missing values, so there shouldn't be much cleaning up, but I will remove the id column since pandas provides an index automatically. I'll have a look at the categorical columns as well to see what I'm working with.
# Remove redundant id column
loan_df.drop("id", axis=1, inplace=True)
loan_df["person_home_ownership"].value_counts()
person_home_ownership
RENT        30594
MORTGAGE    24824
OWN          3138
OTHER          89
Name: count, dtype: int64
loan_df["loan_intent"].value_counts()
loan_intent
EDUCATION            12271
MEDICAL              10934
PERSONAL             10016
VENTURE              10011
DEBTCONSOLIDATION     9133
HOMEIMPROVEMENT       6280
Name: count, dtype: int64
loan_df["loan_grade"].value_counts()
loan_grade
A    20984
B    20400
C    11036
D     5034
E     1009
F      149
G       33
Name: count, dtype: int64
loan_df["cb_person_default_on_file"].value_counts()
cb_person_default_on_file
N    49943
Y     8702
Name: count, dtype: int64
Some of the categories have so little data that the model might not learn how to handle them effectively, for example "OTHER" in person_home_ownership (89 rows) and "G" in loan_grade (33 rows). It isn't obvious what to do with them at this point, so I'll leave them alone for now; later I will prepare all of these features for the model using One Hot Encoding. One option for the rare levels, sketched just below, would be to merge them into a single bucket. After that, I'll have a quick look at the distribution of the numerical data to see if there are any patterns or values that need cleaning.
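Here is that sketch. It is only illustrative: the 100-row threshold and the "RARE" label are my own choices, not anything the dataset or course prescribes, and I apply it to a copy so the original DataFrame is untouched.
# Collapse infrequent category levels into a single "RARE" bucket (illustrative only)
def collapse_rare(series, min_count=100, label="RARE"):
    counts = series.value_counts()
    rare_levels = counts[counts < min_count].index
    return series.where(~series.isin(rare_levels), label)
# Applied to a copy so the original DataFrame is left untouched
loan_df_grouped = loan_df.copy()
loan_df_grouped["loan_grade"] = collapse_rare(loan_df_grouped["loan_grade"])
loan_df_grouped["person_home_ownership"] = collapse_rare(loan_df_grouped["person_home_ownership"])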
# Draw histograms for all the numeric features
loan_df.hist(figsize=(12,8))
plt.show()
It is a bit concerning that the person_age feature has a scale going up to 120. I'll check the maximum value for that column.
loan_df["person_age"].max()
123
loan_df[loan_df["person_age"]==123]
| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length | loan_status |
---|---|---|---|---|---|---|---|---|---|---|---|---|
47336 | 123 | 36000 | MORTGAGE | 7.0 | PERSONAL | B | 6700 | 10.75 | 0.18 | N | 4 | 0 |
I mean... I guess that's possible?
loan_df["person_age"].value_counts()
person_age
23     7726
22     7051
24     6395
25     5067
27     4450
26     3874
28     3707
29     3270
30     2333
31     1917
21     1795
32     1565
33     1306
36     1117
34     1041
37      992
35      862
38      745
39      536
40      438
41      433
43      320
42      291
44      229
46      164
45      163
47      125
48       97
53       75
51       69
50       63
52       62
54       60
49       59
58       35
55       34
56       29
60       28
57       25
65       13
61       13
20       12
66       11
64       10
70       10
62        7
69        6
59        6
73        3
84        2
80        2
76        1
123       1
Name: count, dtype: int64
This person is definitely an outlier anyway, so I'll remove them from the dataset.
# Remove the person aged 123
loan_df.drop(loan_df[loan_df["person_age"]==123].index, inplace=True)
loan_df["person_age"].max()
84
loan_df["person_income"].describe().round()
count      58644.0
mean       64047.0
std        37931.0
min         4200.0
25%        42000.0
50%        58000.0
75%        75600.0
max      1900000.0
Name: person_income, dtype: float64
loan_df["person_emp_length"].describe().round()
count    58644.0
mean         5.0
std          4.0
min          0.0
25%          2.0
50%          4.0
75%          7.0
max        123.0
Name: person_emp_length, dtype: float64
loan_df[loan_df["person_emp_length"]==123]
| | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length | loan_status |
---|---|---|---|---|---|---|---|---|---|---|---|---|
41079 | 28 | 60350 | MORTGAGE | 123.0 | MEDICAL | D | 25000 | 15.95 | 0.35 | Y | 6 | 1 |
49252 | 21 | 192000 | MORTGAGE | 123.0 | VENTURE | B | 20000 | 11.49 | 0.10 | N | 2 | 0 |
The distribution of income seems right, but these two people have definitely not been employed for 123 years. I'll remove them from the dataset too.
# Remove the people with 123 years of employment
loan_df.drop(loan_df[loan_df["person_emp_length"]==123].index, inplace=True)
loan_df["person_emp_length"].max()
41.0
The rest of the histograms look reasonable, so hopefully all of the invalid data has been removed now. Since Kaggle already provides a separate test dataset, there's no need to split the data into training and testing data. There are two steps to prepare the data for training the model: first, scaling all of the numerical features, since they have very different ranges of values; second, dealing with the categorical features using One Hot Encoding.
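As a side note, scikit-learn's ColumnTransformer can bundle both of these steps into one object. This is just a sketch of that alternative (the handle_unknown="ignore" option guards against category levels that appear in only one of the datasets); below I do the two steps by hand instead.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
num_cols = ["person_age", "person_income", "person_emp_length", "loan_amnt",
            "loan_int_rate", "loan_percent_income", "cb_person_cred_hist_length"]
cat_cols = ["person_home_ownership", "loan_intent", "loan_grade", "cb_person_default_on_file"]
# Scale the numeric columns and one-hot encode the categorical ones in a single transformer
preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
# preprocess.fit_transform(loan_df.drop(columns="loan_status")) would return the prepared feature matrix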
# Reset the index so the rows line up when the scaled and encoded features are recombined later
# (note that this also adds the old index back as an "index" column)
loan_df.reset_index(inplace=True)
# Split the features into numeric and categorical data
loan_df_num = loan_df.drop(columns=["person_home_ownership", "loan_intent", "loan_grade",
                                    "cb_person_default_on_file", "loan_status"])
loan_df_cat = loan_df[["person_home_ownership", "loan_intent", "loan_grade",
                       "cb_person_default_on_file"]]
# Create the labels
loan_df_lab = loan_df["loan_status"]
# Use StandardScaler to scale the numeric features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_loan_df_num = scaler.fit_transform(loan_df_num)
scaled_loan_df_num = pd.DataFrame(scaled_loan_df_num, columns=loan_df_num.columns)
scaled_loan_df_num.head()
| | index | person_age | person_income | person_emp_length | loan_amnt | loan_int_rate | loan_percent_income | cb_person_cred_hist_length |
---|---|---|---|---|---|---|---|---|
0 | -1.731971 | 1.569782 | -0.765783 | -1.204620 | -0.578277 | 0.267650 | 0.117408 | 2.031742 |
1 | -1.731912 | -0.921760 | -0.212101 | 0.334194 | -0.937774 | 0.880567 | -0.973228 | -0.946496 |
2 | -1.731853 | 0.240960 | -0.929251 | 0.847132 | -0.578277 | -0.585820 | 0.553662 | 1.038996 |
3 | -1.731794 | 0.407063 | 0.157021 | 2.385947 | 0.500213 | 0.142431 | 0.117408 | -0.201937 |
4 | -1.731735 | -0.921760 | -0.106637 | -0.691682 | -0.578277 | -1.238280 | -0.646037 | -0.698310 |
# One-hot encode the categorical features with pandas' get_dummies
onehot_loan_df_cat = pd.get_dummies(loan_df_cat, dtype=int)
onehot_loan_df_cat.head()
| | person_home_ownership_MORTGAGE | person_home_ownership_OTHER | person_home_ownership_OWN | person_home_ownership_RENT | loan_intent_DEBTCONSOLIDATION | loan_intent_EDUCATION | loan_intent_HOMEIMPROVEMENT | loan_intent_MEDICAL | loan_intent_PERSONAL | loan_intent_VENTURE | loan_grade_A | loan_grade_B | loan_grade_C | loan_grade_D | loan_grade_E | loan_grade_F | loan_grade_G | cb_person_default_on_file_N | cb_person_default_on_file_Y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
# Combine the numeric and categorical features
new_loan_df = pd.concat([scaled_loan_df_num, onehot_loan_df_cat], axis=1)
new_loan_df.head()
| | index | person_age | person_income | person_emp_length | loan_amnt | loan_int_rate | loan_percent_income | cb_person_cred_hist_length | person_home_ownership_MORTGAGE | person_home_ownership_OTHER | ... | loan_intent_VENTURE | loan_grade_A | loan_grade_B | loan_grade_C | loan_grade_D | loan_grade_E | loan_grade_F | loan_grade_G | cb_person_default_on_file_N | cb_person_default_on_file_Y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1.731971 | 1.569782 | -0.765783 | -1.204620 | -0.578277 | 0.267650 | 0.117408 | 2.031742 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | -1.731912 | -0.921760 | -0.212101 | 0.334194 | -0.937774 | 0.880567 | -0.973228 | -0.946496 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | -1.731853 | 0.240960 | -0.929251 | 0.847132 | -0.578277 | -0.585820 | 0.553662 | 1.038996 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | -1.731794 | 0.407063 | 0.157021 | 2.385947 | 0.500213 | 0.142431 | 0.117408 | -0.201937 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | -1.731735 | -0.921760 | -0.106637 | -0.691682 | -0.578277 | -1.238280 | -0.646037 | -0.698310 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 27 columns
Now the data looks like it's in a format that's ready to train the model. Just to test it, I'll make a simple model without any hyperparameter tuning or cross validation.
# Make a K-Nearest Neighbors model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(new_loan_df, loan_df_lab)
KNeighborsClassifier(n_neighbors=3)
knn.score(new_loan_df, loan_df_lab)
0.951280652092357
loan_df_lab.value_counts()
loan_status
0    50293
1     8349
Name: count, dtype: int64
The basic model already has a score of 0.951, which sounds pretty good. But that was measured on the training data, and only about 14% of the loans in the dataset were approved, so plain accuracy on such an imbalanced dataset can be misleading. I'll check a simple baseline first (sketched below) and then implement cross-validation.
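The baseline check is something I'm adding as a sketch: scikit-learn's DummyClassifier always predicts the most frequent class, which shows how much of the 0.951 comes from the class imbalance alone.
from sklearn.dummy import DummyClassifier
# A "model" that always predicts the most frequent class (0)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(new_loan_df, loan_df_lab)
baseline.score(new_loan_df, loan_df_lab)  # roughly 0.86, since about 86% of the labels are 0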
from sklearn.model_selection import cross_val_score
scores = cross_val_score(knn, new_loan_df, loan_df_lab, cv=5)
scores
array([0.91891892, 0.92096513, 0.92385744, 0.92556276, 0.92291951])
So a more realistic estimate of the model's accuracy is around 0.92. Finally, time for some hyperparameter tuning.
from sklearn.model_selection import GridSearchCV, StratifiedKFold
param_grid = {
"n_neighbors": [1,3,5,10,20],
"weights": ["uniform", "distance"],
"metric": ["minkowski", "manhattan"]
}
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=StratifiedKFold(10))
grid_search.fit(new_loan_df, loan_df_lab)
print("Parameters:", grid_search.best_params_)
print("Score:", grid_search.best_score_)
Parameters: {'metric': 'manhattan', 'n_neighbors': 10, 'weights': 'distance'}
Score: 0.9356605478658631
I could try more values for n_neighbors to optimize the score further, but I'll leave it at this: K-Nearest Neighbors produces a score of around 0.936 after tuning. For comparison, I'll try out Random Forest.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)
clf.fit(new_loan_df, loan_df_lab)
RandomForestClassifier()
clf.score(new_loan_df, loan_df_lab)
0.9999658947512022
That is an extremely high score, but let's see whether the model is just overfitting the training data...
scores = cross_val_score(clf, new_loan_df, loan_df_lab, cv=5)
scores
array([0.49160201, 0.95174354, 0.95063097, 0.95370055, 0.95182469])
I'm not sure why the first score is so low; one possibility is that cross_val_score splits the data into unshuffled folds by default, so if the rows are ordered in some way, the first fold can look quite different from the rest. Either way, Random Forest seems to perform better than K-Nearest Neighbors, with a score of around 0.95 on average. After a quick check on that first fold (sketched below), I'll do some hyperparameter tuning like before.
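The check: repeating the cross-validation with shuffled, stratified folds should even out the scores if row ordering is the problem. This is just a diagnostic I'm adding; the random_state of 42 is arbitrary.
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Shuffle the rows before splitting so each fold sees a random sample of the data
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cross_val_score(clf, new_loan_df, loan_df_lab, cv=shuffled_cv)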
param_grid = {
"n_estimators": [15, 50, 100, 200, 500],
"max_depth": [5, 10, 15],
"max_features": ["sqrt", "log2", None]
}
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=StratifiedKFold(10))
grid_search.fit(new_loan_df, loan_df_lab)
print("Parameters:", grid_search.best_params_)
print("Score:", grid_search.best_score_)
Parameters: {'max_depth': 15, 'max_features': 'log2', 'n_estimators': 100}
Score: 0.9504281852132277
Since the best max_depth was 15, the highest value I tried, I'll continue tuning with larger depths to see if the performance improves.
param_grid = {
"n_estimators": [100],
"max_depth": [15, 20, 25],
"max_features": ["log2"]
}
grid_search2 = GridSearchCV(estimator=clf, param_grid=param_grid, cv=StratifiedKFold(10))
grid_search2.fit(new_loan_df, loan_df_lab)
print("Parameters:", grid_search2.best_params_)
print("Score:", grid_search2.best_score_)
Parameters: {'max_depth': 25, 'max_features': 'log2', 'n_estimators': 100}
Score: 0.9505134803194665
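Before settling on a depth, here is a diagnostic sketch I'm adding with scikit-learn's validation_curve, comparing training accuracy against cross-validated accuracy at each depth. The depth values are just the ones tried above, and this is not part of my final pipeline.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve
depths = [5, 10, 15, 20, 25]
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, max_features="log2"),
    new_loan_df, loan_df_lab,
    param_name="max_depth", param_range=depths, cv=5,
)
# A growing gap between training and cross-validated accuracy suggests the extra depth is mostly memorising
for depth, train_acc, cv_acc in zip(depths, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(depth, round(train_acc, 3), round(cv_acc, 3))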
I don't really know how Random Forest works in detail yet, but I do know that increasing max_depth lets the trees fit the training data more closely, at the risk of overfitting, so the cross-validated score won't necessarily keep improving. Since the model with max_depth of 25 only had a marginally better score, I will use the model from the first grid search as my final solution. Using cross-validation, this model had a score of 0.950, and now I'll use it on the Kaggle test data.
test_df = pd.read_csv("loan_test_data.csv")
test_df.head()
| | id | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58645 | 23 | 69000 | RENT | 3.0 | HOMEIMPROVEMENT | F | 25000 | 15.76 | 0.36 | N | 2 |
1 | 58646 | 26 | 96000 | MORTGAGE | 6.0 | PERSONAL | C | 10000 | 12.68 | 0.10 | Y | 4 |
2 | 58647 | 26 | 30000 | RENT | 5.0 | VENTURE | E | 4000 | 17.19 | 0.13 | Y | 2 |
3 | 58648 | 33 | 50000 | RENT | 4.0 | DEBTCONSOLIDATION | A | 7000 | 8.90 | 0.14 | N | 7 |
4 | 58649 | 26 | 102000 | MORTGAGE | 8.0 | HOMEIMPROVEMENT | D | 15000 | 16.32 | 0.15 | Y | 4 |
# Keep id for submission
ids = test_df["id"]
# Remove redundant id column
test_df.drop("id", axis=1, inplace=True)
# Reset the index so the test rows line up when the features are recombined, as with the training data
test_df.reset_index(inplace=True)
# Split the features into numeric and categorical data
test_df_num = test_df.drop(columns=["person_home_ownership", "loan_intent", "loan_grade",
                                    "cb_person_default_on_file"])
test_df_cat = test_df[["person_home_ownership", "loan_intent", "loan_grade",
                       "cb_person_default_on_file"]]
test_df_num.head()
| | index | person_age | person_income | person_emp_length | loan_amnt | loan_int_rate | loan_percent_income | cb_person_cred_hist_length |
---|---|---|---|---|---|---|---|---|
0 | 0 | 23 | 69000 | 3.0 | 25000 | 15.76 | 0.36 | 2 |
1 | 1 | 26 | 96000 | 6.0 | 10000 | 12.68 | 0.10 | 4 |
2 | 2 | 26 | 30000 | 5.0 | 4000 | 17.19 | 0.13 | 2 |
3 | 3 | 33 | 50000 | 4.0 | 7000 | 8.90 | 0.14 | 7 |
4 | 4 | 26 | 102000 | 8.0 | 15000 | 16.32 | 0.15 | 4 |
# Use the StandardScaler fitted on the training data to scale the test features
scaled_test_df_num = scaler.transform(test_df_num)
scaled_test_df_num = pd.DataFrame(scaled_test_df_num, columns=test_df_num.columns)
scaled_test_df_num.head()
| | index | person_age | person_income | person_emp_length | loan_amnt | loan_int_rate | loan_percent_income | cb_person_cred_hist_length |
---|---|---|---|---|---|---|---|---|
0 | -1.731971 | -0.755657 | 0.130655 | -0.435213 | 2.836942 | 1.674723 | 2.189617 | -0.946496 |
1 | -1.731912 | -0.257349 | 0.842532 | 0.334194 | 0.140717 | 0.659785 | -0.646037 | -0.450123 |
2 | -1.731853 | -0.257349 | -0.897612 | 0.077725 | -0.937774 | 2.145944 | -0.318847 | -0.946496 |
3 | -1.731794 | 0.905371 | -0.370296 | -0.178744 | -0.398528 | -0.585820 | -0.209783 | 0.294436 |
4 | -1.731735 | -0.257349 | 1.000727 | 0.847132 | 1.039459 | 1.859257 | -0.100719 | -0.450123 |
# One-hot encode the categorical features with pandas' get_dummies
onehot_test_df_cat = pd.get_dummies(test_df_cat, dtype=int)
onehot_test_df_cat.head()
| | person_home_ownership_MORTGAGE | person_home_ownership_OTHER | person_home_ownership_OWN | person_home_ownership_RENT | loan_intent_DEBTCONSOLIDATION | loan_intent_EDUCATION | loan_intent_HOMEIMPROVEMENT | loan_intent_MEDICAL | loan_intent_PERSONAL | loan_intent_VENTURE | loan_grade_A | loan_grade_B | loan_grade_C | loan_grade_D | loan_grade_E | loan_grade_F | loan_grade_G | cb_person_default_on_file_N | cb_person_default_on_file_Y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
3 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
# Combine the numeric and categorical features
new_test_df = pd.concat([scaled_test_df_num, onehot_test_df_cat], axis=1)
new_test_df.head()
| | index | person_age | person_income | person_emp_length | loan_amnt | loan_int_rate | loan_percent_income | cb_person_cred_hist_length | person_home_ownership_MORTGAGE | person_home_ownership_OTHER | ... | loan_intent_VENTURE | loan_grade_A | loan_grade_B | loan_grade_C | loan_grade_D | loan_grade_E | loan_grade_F | loan_grade_G | cb_person_default_on_file_N | cb_person_default_on_file_Y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -1.731971 | -0.755657 | 0.130655 | -0.435213 | 2.836942 | 1.674723 | 2.189617 | -0.946496 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1 | -1.731912 | -0.257349 | 0.842532 | 0.334194 | 0.140717 | 0.659785 | -0.646037 | -0.450123 | 1 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
2 | -1.731853 | -0.257349 | -0.897612 | 0.077725 | -0.937774 | 2.145944 | -0.318847 | -0.946496 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
3 | -1.731794 | 0.905371 | -0.370296 | -0.178744 | -0.398528 | -0.585820 | -0.209783 | 0.294436 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | -1.731735 | -0.257349 | 1.000727 | 0.847132 | 1.039459 | 1.859257 | -0.100719 | -0.450123 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
5 rows × 27 columns
Now this data is finally ready to make some predictions! The format for submission is simply a DataFrame with the id and the predicted loan_status.
pred = grid_search.predict(new_test_df)
pred
array([1, 0, 1, ..., 0, 0, 1], dtype=int64)
submission = pd.DataFrame(pred, index=ids, columns=["loan_status"])
submission
id | loan_status
---|---
58645 | 1 |
58646 | 0 |
58647 | 1 |
58648 | 0 |
58649 | 0 |
... | ... |
97738 | 0 |
97739 | 0 |
97740 | 0 |
97741 | 0 |
97742 | 1 |
39098 rows × 1 columns
# Write the predictions to a csv file
submission.to_csv("submission.csv")
I ended up getting a score of 0.859, which is a lot lower than I expected. This suggests that my model was overfit to the training data, but I used cross-validation, so I'm not sure how that would happen. EDIT: Kaggle scores this competition with a different metric, which might explain why the score was lower than I expected.
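In hindsight, if the scoring method is a ranking metric such as ROC AUC (I haven't confirmed which metric this competition uses), submitting predicted probabilities rather than hard 0/1 labels would likely have scored better. Here is a rough sketch of that alternative submission; the file name is my own choice.
# Probability of the positive class instead of a hard label
pred_proba = grid_search.predict_proba(new_test_df)[:, 1]
submission_proba = pd.DataFrame({"loan_status": pred_proba}, index=ids)
submission_proba.to_csv("submission_proba.csv")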
For my first classification project, I would say it was pretty successful. I was able to clean the data, prepare it by scaling the numerical features and encoding the categorical features, train two different models, and do some hyperparameter tuning. My next step is to learn how these models actually work under the hood so I can make better informed decisions about which models and settings to use.