Machine Learning Applications in Credit Risk Modelling
Master's thesis defence: Maryam Nasheem Irshad
Title: Machine Learning Applications in Credit Risk Modelling
Abstract: This master's thesis explores different model settings and their performance in credit scoring. The study begins by tuning the hyperparameters of individual models and evaluating their performance on the complete credit dataset as well as on a feature-selected credit dataset. The models considered are Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Random Forest (RF), and AdaBoost (AB). Hyperparameters are optimized using grid search and randomized search. The models are evaluated using accuracy, AUC, precision, recall, F1-score, type-I error, and type-II error. Among the individual models, RF performs best on the complete credit dataset, achieving the highest F1-score of 0.814. On the feature-selected dataset, the SVM model achieves the highest F1-score of 0.714, although the RF model outperforms the others in terms of accuracy and AUC. Comparing the complete and feature-selected datasets reveals that neither consistently leads to better results across all models and measures. Next, the study investigates whether unsupervised clustering techniques can enhance predictive performance in credit scoring models. The optimal number of clusters is determined using the elbow method and the silhouette method, and the feature-selected dataset is partitioned into subsets using the k-means algorithm. Five base learners are applied to each cluster, and hyperparameter optimization is performed. The cluster-based models are evaluated and compared with their individual model counterparts, and they generally outperform them. The cluster-based RF model achieves the highest performance, with an F1-score of 0.731, surpassing the best F1-score obtained by the individual models (0.713). Furthermore, the study presents ROC curves and AUC values to compare the classifiers. The results demonstrate the effectiveness of clustering techniques in improving the performance of credit scoring models. Overall, this thesis provides insights into model settings and performance in credit scoring, highlighting the potential of cluster-based models for enhancing predictive accuracy.
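The cluster-then-classify idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: it assumes scikit-learn, uses a synthetic dataset in place of the credit data, picks the number of clusters by silhouette score only (no elbow method), fits a single Random Forest per cluster, and skips hyperparameter tuning. All variable names and the helper `pick_k` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a credit dataset (assumption: not the thesis data).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# k-means is distance based, so standardize features using training data only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)


def pick_k(X, candidates=range(2, 6)):
    """Return the candidate k with the highest mean silhouette score."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get)


best_k = pick_k(X_train_s)
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X_train_s)
train_clusters = kmeans.labels_
test_clusters = kmeans.predict(X_test_s)

# Fit one classifier per training cluster.
models = {}
for c in range(best_k):
    mask = train_clusters == c
    if len(np.unique(y_train[mask])) < 2:
        # Only one class present in this cluster: predict that class.
        models[c] = int(y_train[mask][0])
    else:
        models[c] = RandomForestClassifier(
            n_estimators=100, random_state=0).fit(X_train_s[mask], y_train[mask])

# Route each test point to the model of its nearest cluster centre.
y_pred = np.empty_like(y_test)
for c in range(best_k):
    mask = test_clusters == c
    if not mask.any():
        continue
    model = models[c]
    y_pred[mask] = model if isinstance(model, int) else model.predict(X_test_s[mask])

cluster_f1 = f1_score(y_test, y_pred)
print(f"k = {best_k}, cluster-based F1 = {cluster_f1:.3f}")
```

On synthetic data the resulting score says nothing about the thesis results; the sketch only shows how test observations are assigned to a cluster and scored by that cluster's own model.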
Supervisor: Rolf Poulsen
External examiner: Nina Lange, DTU