How credit scoring models actually work: a data-driven breakdown

We trained a credit scoring model on 32,437 real loan applications. Here's what actually predicts default — by loan grade, income, home ownership, and loan-to-income ratio — with a working logistic regression that scores AUC 0.871.

If you've ever applied for a loan and been told “your application is in review,” you've handed your data to a credit scoring model — almost certainly a flavor of logistic regression or gradient boosting wrapped in a UI. It looked at a handful of numbers about you and emitted a probability that you'd repay. That probability is what got you approved, declined, or quietly bumped to a higher interest rate.

The mechanics are not magic. They're also not the polished narrative that lenders publish on their FAQ pages. So instead of explaining what credit scoring models say they do, we trained one and looked at what it actually does. The modeling discipline is closely related to what we do for equities at QScoring — see how to read a QScore for the parallel on the equity side.

The data: Kaggle's credit-risk-dataset, which contains 32,581 anonymized loan applications. After dropping a handful of clearly bad rows (ages above 80, a few income outliers) and filling in missing values, we were left with 32,437 records: a meaningful sample with a default rate of 21.9%. That's higher than what major banks see because the dataset includes a chunk of subprime applicants, which is actually useful: we want a signal-rich population for modeling.
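For readers who want to picture the cleaning step, here is a minimal sketch on toy records. It is not the pipeline used for this post: the field names (`person_age`, `person_income`, `loan_int_rate`), the income cap, and the choice of median imputation are all assumptions made for illustration.

```python
from statistics import median

def clean(rows, max_age=80, max_income=1_000_000):
    """Drop implausible rows, then fill missing interest rates with the median.

    max_income and median imputation are illustrative assumptions,
    not the thresholds or method used in the post.
    """
    kept = [
        r for r in rows
        if r["person_age"] <= max_age and r["person_income"] <= max_income
    ]
    known_rates = [r["loan_int_rate"] for r in kept if r["loan_int_rate"] is not None]
    fill = median(known_rates)
    return [
        {**r, "loan_int_rate": fill if r["loan_int_rate"] is None else r["loan_int_rate"]}
        for r in kept
    ]

# Toy records, not rows from the real dataset.
toy = [
    {"person_age": 25, "person_income": 40_000, "loan_int_rate": 9.0},
    {"person_age": 144, "person_income": 60_000, "loan_int_rate": 12.0},    # implausible age
    {"person_age": 31, "person_income": 55_000, "loan_int_rate": None},     # missing rate
    {"person_age": 40, "person_income": 70_000, "loan_int_rate": 13.0},
    {"person_age": 29, "person_income": 6_000_000, "loan_int_rate": 11.0},  # income outlier
]
cleaned = clean(toy)
print(len(cleaned), "rows kept")
```

Filtering before imputing matters: otherwise the bad rows would leak into the fill value.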

Headline numbers: 32,437 applications · 21.9% default rate · median income $55,000 · median loan $8,000 · median rate 11.0%.

Here's what we found, one feature at a time, ending with a working model.

1. The lender's own grade is the single best published predictor

Every loan in the dataset comes pre-tagged with a loan_grade from A (best) to G (worst), assigned by the originating lender's internal scoring system. It's tempting to treat grade as cheating, since it's a model output, not a raw input. But it tells you something interesting: how good the lender's existing model is.

Spoiler: very good. The default rate by grade looks like this:

Bar chart showing default rate climbing from 10% for grade A loans to 98.4% for grade G loans
Figure 1. The default rate climbs from 10.0% for grade A to 98.4% for grade G. Roughly two-thirds of borrowers fall into the relatively safe A and B grades; the worst grades are small populations but spectacularly risky.

That's roughly a 10× lift in default rate between the lender's best and worst tiers, and almost everything in the worst tier defaults. The takeaway isn't “use someone else's score.” It's that some signal exists, and a model can find it. The interesting question is how much of that signal you can recover from raw features, without using the grade. We'll come back to that when we train the model.
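The per-grade default rate and the lift between tiers are simple group-by arithmetic. A minimal sketch on toy data, assuming each record carries a `loan_grade` and a binary `loan_status` where 1 means default (the `loan_status` name and encoding are assumptions):

```python
from collections import defaultdict

def default_rate_by(rows, key):
    """Return {group: share of rows in that group with loan_status == 1}."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for r in rows:
        totals[r[key]] += 1
        defaults[r[key]] += r["loan_status"]
    return {g: defaults[g] / totals[g] for g in totals}

# Toy data, not the real dataset: 10 grade-A loans (1 default), 5 grade-G loans (4 defaults).
toy = (
    [{"loan_grade": "A", "loan_status": 1}] * 1
    + [{"loan_grade": "A", "loan_status": 0}] * 9
    + [{"loan_grade": "G", "loan_status": 1}] * 4
    + [{"loan_grade": "G", "loan_status": 0}] * 1
)
rates = default_rate_by(toy, "loan_grade")
lift = rates["G"] / rates["A"]  # how many times riskier the worst tier is
print(rates, lift)
```

The same function works for any categorical column, such as home ownership or loan intent.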

2. Where you live matters more than people admit

Lenders are not allowed to discriminate based on a long list of protected characteristics, and housing-related variables sit close to that line. But “do you rent, own, or carry a mortgage” is on the application, and the data is unambiguous:

Horizontal bar chart: renters default at 31.6%, mortgage holders 12.6%, outright owners 7.5%
Figure 2. Renters default on loans at 31.6%, about 4.2× the rate of outright homeowners (7.5%) and 2.5× the rate of mortgage holders (12.6%).

The mortgage-holder effect is the interesting one. A mortgage payment is, by definition, a large recurring liability, so naively you'd expect more defaults among mortgage holders, not fewer. The data shows the inverse because of a selection effect: people who hold a mortgage have already passed someone else's credit screen. The variable is a proxy for “the financial system has previously vouched for this person.”

Renting, by contrast, is a near-universal state for younger applicants and lower-income workers. The variable doesn't punish renters — it captures everything that “I rent” tends to correlate with: less savings, less stable employment, less prior credit history. The lender doesn't care which of those is causal. The model doesn't care either.

3. Why you're borrowing matters more than how much

Horizontal bar chart showing default rates by loan intent
Figure 3. Default rates range from 14.9% on venture loans up to 28.7% on debt consolidation loans. The intent label captures a lot of context the dollar amount can't.

“What is the loan for?” is a free-form question on most applications, but lenders bucket the answer. The buckets are predictive in a way that makes intuitive sense once you see it:

The cheap-and-cheerful signal isn't how much the borrower wants. It's why.

4. Income, by itself, is misleading

Income is the variable everyone thinks should be the most important one. It isn't.

Two overlapping density curves of income for repaid vs defaulted loans showing significant overlap
Figure 4. Defaulters skew lower-income, but the distributions overlap heavily. Plenty of $40k-earners repay on time; plenty of $90k-earners default. Income alone is a weak predictor.

If you draw a line at the median income ($55,000) and predict “everyone above repays, everyone below defaults,” you'd be wrong nearly half the time. The reason income looks important in lender narratives is that it appears in combinations — specifically, in the ratio of loan size to income, which is much more interesting.

5. Loan-to-income ratio is the single strongest raw signal

This is the signal that surprises people. The total loan amount doesn't matter much in isolation. Neither does income. But the ratio — the percentage of your annual income represented by the loan — separates good and bad loans almost as well as the lender's full grade does:

Bar chart showing default rate increasing with loan-to-income ratio, from 10% under 5% to 78.6% above 50%
Figure 5. Loan-to-income ratio: under 5% of income, defaults run 10.2%. Once the loan exceeds 30% of annual income, default rates jump to 69.7%. Above 50%, it's 78.6%.

The discontinuity around 30% is striking. It's also financially intuitive: a loan worth 30%+ of annual income usually means the monthly payment is a non-trivial fraction of monthly take-home pay. Combined with rent, food, healthcare, and existing debt, that load is mathematically hard to carry. The model doesn't need to “understand” this; it just sees the historical data and weights the ratio heavily.
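To make the ratio concrete, here is a small sketch that computes loan-to-income and assigns it to a bucket. The bucket edges are an assumption: the post only names the 5%, 30%, and 50% cut points, so the chart's actual bins may be finer.

```python
from bisect import bisect_right

# Illustrative edges built from the cut points named in the text.
EDGES = [0.05, 0.30, 0.50]
LABELS = ["under 5%", "5% to 30%", "30% to 50%", "above 50%"]

def loan_to_income(loan_amount, annual_income):
    """Loan size as a fraction of annual income."""
    return loan_amount / annual_income

def bucket(ratio):
    """Map a ratio to its label; a ratio exactly on an edge goes to the higher bucket."""
    return LABELS[bisect_right(EDGES, ratio)]

# A hypothetical borrower with the post's median loan ($8,000) and median income ($55,000).
ratio = loan_to_income(8_000, 55_000)
print(f"{ratio:.1%} -> {bucket(ratio)}")
```

A borrower like that sits at about 14.5% of income, well below the 30% mark where default rates jump.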

Building a working credit scoring model

With the patterns in hand, we trained a baseline model. The setup:

Results on the held-out 8,110 applications:

ROC AUC 0.871 · accuracy 80.5% · recall 77.6% (defaults caught) · precision 53.8%
ROC curve well above the diagonal with AUC 0.871
Figure 6. The ROC curve sits well above the diagonal. An AUC of 0.871 means that if you pick one defaulter and one non-defaulter at random, the model assigns the defaulter the higher risk score about 87% of the time.
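That pairwise reading of AUC can be computed directly, which is a useful sanity check on what the number means. A minimal sketch with made-up scores (the brute-force loop is fine for illustration but too slow for a real test set):

```python
def pairwise_auc(scores, labels):
    """AUC as the share of (defaulter, non-defaulter) pairs ranked correctly.

    Tied scores count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up risk scores: 1 = defaulted, 0 = repaid.
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
auc = pairwise_auc(scores, labels)
print(round(auc, 3))
```

Here 8 of the 9 defaulter/non-defaulter pairs are ordered correctly; the one miss is the defaulter scored 0.4 sitting below the repaid loan scored 0.6.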

Recall of 77.6% means the model catches more than three-quarters of the people who will actually default. That's good. Precision of 53.8% means that of everyone the model flags as risky, just over half actually default; the rest are false positives.

This precision/recall tradeoff is the entire game in credit scoring. Tilt the threshold one way and you approve more good borrowers but eat more losses. Tilt it the other way and your loss rate drops but you reject creditworthy people. Every lender chooses where on that curve to operate, based on their cost of capital and their tolerance for charge-offs. The model doesn't make that choice; a product manager does.
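A small sketch shows how moving the threshold trades one error type for the other. The scores and labels below are invented for illustration; they are not model output from this post.

```python
def confusion_at(scores, labels, threshold):
    """Count (tp, fp, fn, tn) when scores >= threshold are flagged as risky."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1   # risky borrower correctly declined
        elif flagged and y == 0:
            fp += 1   # good borrower wrongly declined
        elif y == 1:
            fn += 1   # defaulter approved: a real loss
        else:
            tn += 1   # good borrower approved
    return tp, fp, fn, tn

# Invented risk scores: 1 = defaulted, 0 = repaid.
scores = [0.95, 0.80, 0.65, 0.55, 0.45, 0.35, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

for t in (0.3, 0.5, 0.7):
    tp, fp, fn, tn = confusion_at(scores, labels, t)
    print(f"threshold={t}: declined good={fp}, approved defaulters={fn}")
```

Raising the threshold from 0.3 to 0.7 cuts wrongly declined borrowers from 3 to 0 but doubles approved defaulters from 1 to 2. That is the curve a lender has to pick a point on.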

Confusion matrix: 5152 true negatives, 1184 false positives, 397 false negatives, 1377 true positives
Figure 7. Confusion matrix at the default threshold. The 1,184 false positives (borrowers flagged risky who would have repaid) are the lender's opportunity cost. The 397 false negatives are the actual losses.
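The headline metrics follow directly from these four counts, which makes them easy to check by hand:

```python
# Counts from the confusion matrix in Figure 7.
tn, fp, fn, tp = 5152, 1184, 397, 1377

total = tn + fp + fn + tp           # size of the held-out set
accuracy = (tp + tn) / total        # share of all calls that were right
recall = tp / (tp + fn)             # share of actual defaulters caught
precision = tp / (tp + fp)          # share of flagged borrowers who defaulted

print(total, f"{accuracy:.1%}", f"{recall:.1%}", f"{precision:.1%}")
```

The counts sum to the 8,110 held-out applications, and the three ratios round to the 80.5% accuracy, 77.6% recall, and 53.8% precision reported above.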

What the model actually weights

Because we used logistic regression on standardized features, the coefficients have a clean meaning: each one is the change in log-odds of default per one-standard-deviation move in that feature, holding the others constant. Positive coefficients raise the model's risk estimate; negative coefficients lower it.
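To see what a standardized coefficient does to a prediction, it helps to convert log-odds back to odds and probabilities. A minimal sketch; the intercept and coefficient below are hypothetical values chosen for illustration, not the fitted values from our model.

```python
import math

def sigmoid(z):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def odds_ratio(coef):
    """Multiplicative change in the odds of default per one-SD increase."""
    return math.exp(coef)

# Hypothetical values, not the fitted model.
intercept = -1.5   # log-odds of default for a borrower at the mean of every feature
coef = 1.0         # coefficient on one standardized feature

p_at_mean = sigmoid(intercept)
p_plus_one_sd = sigmoid(intercept + coef)

print(f"odds ratio: {odds_ratio(coef):.2f}")
print(f"default probability: {p_at_mean:.1%} -> {p_plus_one_sd:.1%}")
```

The odds ratio is constant, but the change in probability is not: the same coefficient moves a borrower near 50% risk much more, in percentage points, than one near 2%.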

Bar chart of standardized logistic regression coefficients showing loan_percent_income as the biggest positive driver
Figure 8. Standardized coefficients. Red raises risk; green lowers it. Loan-to-income ratio dominates everything else.

The top risk-raising features:

The top risk-lowering features:

The five rules that fall out of the data

If you wanted to summarize what a working credit model has learned — in language a borrower would actually understand — it comes down to five rules:

The same discipline applied to stocks

QScoring isn't a credit bureau. We score equities, not borrowers. But the modeling discipline is the same one: take a noisy population (companies instead of applicants), extract a handful of features that actually predict the outcome you care about (forward returns instead of defaults), and resist the temptation to over-engineer.

The lesson from credit scoring that we apply to equity scoring: simple linear models built on the right features beat complex models built on the wrong ones. The credit model above uses 11 features and a logistic regression and reaches an AUC of 0.871. A neural network on the same data, in our testing, gets to about 0.89: marginally better, completely uninterpretable, and harder to defend in a regulated context.

For equity scoring the same principle holds. We've spent more time choosing the features than choosing the algorithm. If you're curious about which features we use and what their information coefficients look like, that's on the methodology page — with the same level of disclosure you just read above. Browse the live ticker scores to see the factor breakdown for any name in the universe.
