In this paper we investigate the role of sample size and class distribution
in credit risk assessment, focusing on real-life imbalanced data sets. Choosing the
optimal sample is critical to the quality of predictive models and has
become an increasingly important topic with the recent advances in automating
lending decision processes and the ever-growing richness of data collected by
financial institutions. To address the observed research gap, a large-scale
experimental evaluation on real-life data sets with different characteristics was
performed, using several classification algorithms and performance measures.
Results indicate that the optimal class distribution depends on several factors,
namely the performance measure, the classification algorithm and the data set
characteristics. The study also provides valuable insight into how to design the
training sample to maximize prediction performance, and into the suitability of
different classification algorithms, assessed through their sensitivity to class
imbalance and sample size.