I would like to thank everyone who took the time to read and engage with my article. Your support and feedback are truly appreciated.
You can reproduce the analysis on my GitHub repository: Credit Scoring with Python.
Credit scoring is not just about training a machine learning algorithm and evaluating its performance with an AUC or a Gini coefficient.
Many beginners in modeling rush into model training, skipping crucial steps that determine whether a model is truly robust and interpretable. This enthusiasm, which lasts only a few minutes — just long enough for the performance metrics to appear on the screen — often obscures the more in-depth and rigorous work that precedes this stage.
In credit risk, the quality of a model depends heavily on the variables it uses. A variable that seems predictive in a training dataset may behave inconsistently across time or across different populations. If we ignore this, we risk building a model that performs well in development but fails in production.
This raises three fundamental questions. Do the selected variables exhibit a constant credit risk over time? Does the trend of this risk remain stable from year to year? Does the distribution of these variables remain comparable across the training, test, and out-of-period datasets?
To answer these questions, I proceed in three steps:
- I first define the concepts of monotonicity and stability in credit scoring.
- Then I apply these concepts to the seven variables selected in my previous post.
- Finally, I evaluate dataset stability using the Population Stability Index (PSI) across years and across the train, test, and out-of-time datasets.
Presenting the Data
In my previous post, I presented a simple method that combines variable relationship analysis with cross-validation to robustly select variables for a scoring model. This method is easy to understand, easy to implement, and powerful, especially when combined with logistic regression, which remains the reference model in credit scoring.
I retained seven variables after the selection process:
- five numerical variables: person_income, person_age, person_emp_length, loan_int_rate, and loan_percent_income;
- two categorical variables: person_home_ownership and cb_person_default_on_file.
The question I now ask is whether these variables are truly relevant for estimating the parameters of the final scoring model, and how I can interpret the risk direction of each variable.
Defining Monotonicity and Stability
Monotonicity refers to the analysis of the risk direction of a pre-selected variable. For a continuous variable, it answers the following question: when the value of the variable increases or decreases, does the credit risk increase or decrease accordingly?
For example, in a corporate context, we expect that when a company’s revenue increases, its financial situation improves. Conversely, when its revenue decreases, its financial situation deteriorates. This is the risk direction.
Stability goes one step further. It answers the question: is this risk direction consistently respected across multiple years, or do we observe risk inversions? A risk inversion occurs when, despite an increase in revenue, the financial situation deteriorates — or vice versa. Stability gives a long-term view of the variable’s behavior and supports informed decision-making.
In credit scoring, we study both the monotonicity of variables and their stability over time. We also study the stability of variable distributions between consecutive years and between the train, test, and out-of-time datasets.
Monotonicity and Stability of Variables
This analysis acts as a pre-selection step. If a variable shows a risk inversion over time, we must either treat it or remove it from the model. For continuous variables, treatment typically consists of discretizing the variable and aggregating its bins. For categorical variables, we can directly combine certain categories.
Defining the Risk Direction
The first step is to assign a risk direction to each variable.
For a continuous variable, we assign a “+” sign if we expect that an increase in the variable leads to an increase in credit risk. We assign a “−” sign if we expect that an increase leads to a decrease in credit risk.
For a binary categorical variable, we assign a “+” sign if moving from the least risky to the most risky category increases the risk. We assign a “−” sign if it decreases the risk.
For a multi-category variable, we do not assign a binary sign. Instead, we rank the categories from least risky to most risky based on their empirical default rate. The category with the lowest default rate is the least risky; the one with the highest is the most risky. We then validate this ranking with business experts.
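The empirical ranking described above can be sketched in a few lines of pandas. The column names (person_home_ownership, default) follow the dataset used in this article, but the sample values below are purely illustrative:

```python
import pandas as pd

# Illustrative sample: a multi-category variable and a binary default flag.
df = pd.DataFrame({
    "person_home_ownership": ["OWN", "OWN", "MORTGAGE", "MORTGAGE",
                              "RENT", "RENT", "RENT", "OTHER"],
    "default": [0, 0, 0, 1, 1, 1, 0, 1],
})

# Empirical default rate per category, sorted from least to most risky.
ranking = (
    df.groupby("person_home_ownership")["default"]
      .mean()
      .sort_values()
)
print(ranking)
```

The resulting ordering is what we then submit to business experts for validation.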
The table below summarizes the expected risk direction for each continuous variable studied. A “+” means that an increase in the variable is expected to increase credit risk and therefore the computed probability of default. A “−” means the opposite.

I make two specific comments here. For person_age, age is a sensitive variable that may lead to discrimination against counterparties. We expect both very young and very old counterparties to carry higher risk, which makes it difficult to assign a single direction. We therefore let the data reveal the risk pattern. For person_home_ownership, the variable has multiple categories, making it equally difficult to assign a binary direction a priori. We expect the RENT category to carry the highest risk, followed by MORTGAGE, then OWN, with the OTHER category capturing counterparties in more ambiguous housing situations. We let the data confirm this ordering.
Practical Approach
In practice, we evaluate the empirical default rate over time for defined values of the explanatory variables. For values we define as risky, we expect higher default rates. For values we define as less risky, we expect lower default rates.
For continuous variables, we discretize them using quantiles. Using terciles — Q1, Q2, and Q3 — we compute the default rate of each bin for each year. If a variable has a “+” sign, the default rate in Q1 must be lower than in Q2, which must be lower than in Q3, for every period. Graphically, the curve for Q3 sits above the curve for Q2, which sits above the curve for Q1.
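As a minimal sketch of this check, the snippet below discretizes a continuous variable into terciles fitted on the training set, computes the default rate of each tercile per year, and verifies the "+" ordering Q1 < Q2 < Q3 in every period. The column names follow the article's dataset; the data itself is simulated for illustration:

```python
import numpy as np
import pandas as pd

# Simulated training set with a "+" risk direction:
# higher interest rates default more often.
rng = np.random.default_rng(0)
n = 3000
train = pd.DataFrame({
    "loan_int_rate": rng.uniform(5, 20, n),
    "year": rng.choice([2016, 2017, 2018], n),
})
train["default"] = (rng.uniform(0, 20, n) < train["loan_int_rate"] - 4).astype(int)

# Tercile edges estimated on the training set only; the outer edges are
# opened to +/- infinity so that new data always falls into a bin.
edges = train["loan_int_rate"].quantile([0, 1/3, 2/3, 1]).values
edges[0], edges[-1] = -np.inf, np.inf
train["bin"] = pd.cut(train["loan_int_rate"], bins=edges,
                      labels=["Q1", "Q2", "Q3"])

# Default rate per tercile per year: rows = years, columns = Q1..Q3.
rates = train.pivot_table(index="year", columns="bin", values="default",
                          aggfunc="mean", observed=False)
print(rates)

# Monotonicity check for a "+" variable: Q1 < Q2 < Q3 in every year.
monotone = ((rates["Q1"] < rates["Q2"]) & (rates["Q2"] < rates["Q3"])).all()
print("Monotonic in all years:", monotone)
```

Plotting one curve per tercile from the `rates` table reproduces the graphical check described above.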

For categorical variables, we compute the default rate of each category for each period. The curve for the most risky category must consistently sit above the curves for all other categories.
Application: Monotonicity and Stability of the Seven Variables
We apply this framework to the seven pre-selected variables. The distribution of the “default” variable by year in the training set is as follows:

Continuous Variables
We discretize the continuous variables into terciles on the training set.
Person Income The risk monotonicity is respected in all periods. Counterparties with the lowest incomes show the highest default rates across all years. We observe no risk inversion.

Person Age The risk monotonicity is not respected. We observe a risk inversion, and Q2 is not present in all years. This variable lacks the predictive power to differentiate between good and very good counterparties. I exclude it from further modeling.

Employment Length The risk monotonicity is globally respected across all years.

Interest Rate The risk monotonicity is respected for all years.

Loan Percent Income The risk monotonicity is globally respected across all years for this variable.

Categorical Variables
Historical Default (cb_person_default_on_file) The risk monotonicity is respected. Counterparties with a history of default show higher default rates across all periods. This result is entirely coherent.

Home Ownership (person_home_ownership) The risk monotonicity is respected at a global level but not at a per-year level for 2016, 2017, and 2018. 
In this situation, we have several options. I choose to regroup the variable into three categories: OWN, MORTGAGE, and (RENT + OTHER). After regrouping, the risk monotonicity is globally respected.
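The regrouping itself is a one-line mapping in pandas. The category names mirror the choice made above; the sample values and the merged label RENT_OTHER are illustrative:

```python
import pandas as pd

# Illustrative sample of the person_home_ownership variable.
s = pd.Series(["OWN", "MORTGAGE", "RENT", "OTHER", "RENT"])

# Merge RENT and OTHER into a single, more populated category.
regrouped = s.replace({"RENT": "RENT_OTHER", "OTHER": "RENT_OTHER"})
print(regrouped.value_counts())
```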
Summary
This monotonicity analysis leads me to exclude the variable person_age, whose risk stability is not respected. I retain the six remaining variables for the next step.
Dataset Stability
I now study the stability of variable distributions. The objective is to ensure that the distribution of each variable remains consistent across years and between the train, test, and out-of-time datasets.
The Population Stability Index (PSI)
We use the PSI — a practical indicator widely used in credit scoring — to measure distributional shifts. It applies directly to categorical variables. For continuous variables, we discretize them first. In this article, I use terciles for continuous variables.
For each variable, we compute the proportion of observations in each bin or category for both datasets. The PSI then compares, bin by bin, the proportions observed in the reference dataset versus the target dataset, using the following logarithmic formula:
PSI = \sum_{i=1}^{k} (p_i - q_i) \cdot \ln\left(\frac{p_i}{q_i}\right)
Here, pᵢ and qᵢ denote the proportions in bin i of the reference and target datasets, respectively. The indicator is read as follows: below 10%, the variable is considered stable; between 10% and 25%, only a moderate shift is observed; above 25%, the shift is significant.
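The PSI formula can be implemented directly. The sketch below is a minimal version: it aligns the bins of the two samples, guards empty bins with a small epsilon to avoid taking the logarithm of zero, and checks that two samples drawn from the same distribution land well under the 10% threshold. The data is simulated for illustration:

```python
import numpy as np
import pandas as pd

def psi(reference: pd.Series, target: pd.Series, eps: float = 1e-6) -> float:
    """Population Stability Index between two binned/categorical series.

    Implements PSI = sum_i (p_i - q_i) * ln(p_i / q_i), where p_i and q_i
    are the bin proportions in the reference and target samples.
    """
    bins = sorted(set(reference.unique()) | set(target.unique()))
    p = reference.value_counts(normalize=True).reindex(bins, fill_value=0) + eps
    q = target.value_counts(normalize=True).reindex(bins, fill_value=0) + eps
    return float(np.sum((p - q) * np.log(p / q)))

# Two samples from the same distribution: PSI should be far below 10%.
rng = np.random.default_rng(42)
ref = pd.Series(rng.choice(["Q1", "Q2", "Q3"], 5000, p=[0.3, 0.4, 0.3]))
tgt = pd.Series(rng.choice(["Q1", "Q2", "Q3"], 5000, p=[0.3, 0.4, 0.3]))
print(f"PSI = {psi(ref, tgt):.4f}")
```

For continuous variables, the same function applies once the series have been discretized into terciles, as described above.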
Year-to-Year Stability
I evaluate whether the distribution of each variable has shifted from one year to the next.

All variables are stable over time — no threshold violation is observed (PSI below 10%).
Stability Across Datasets
I evaluate the stability of variable distributions across the three datasets, testing three scenarios:
- Train vs Test,
- Train vs Out-of-Time,
- And Test vs Out-of-Time.

No threshold violation is observed across all scenarios, which confirms that the selected risk drivers are stable between the estimation and evaluation sets.
Conclusion
In this article, I presented a rigorous framework for studying monotonicity and stability in a scoring model. I showed how to assign a risk direction to each variable, how to validate this direction across years, and how to detect distributional shifts using the PSI. This step — often skipped in practice — is essential to ensuring that the model I build is not only performant, but also robust, interpretable, and reliable over time.
In my next post, I will present the estimation of the final scoring model using the six retained variables.
Image Credits
All images and visualizations in this article were created by the author using Python (pandas, matplotlib, seaborn, and plotly) and Excel, unless otherwise stated.
Data & Licensing
The dataset used in this article is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
This license allows anyone to share and adapt the dataset for any purpose, including commercial use, provided that proper attribution is given to the source.
For more details, see the official CC BY 4.0 license text.
Disclaimer
Any remaining errors or inaccuracies are the author’s responsibility. Feedback and corrections are welcome.