Predicting loan defaults: what the data tells us banks miss

The credit-risk dataset we covered in our last post was generous. Loan grades that ran from 10% defaults at Grade A to 98% at Grade G. Loan-to-income ratios with a sharp cliff at 30%. A logistic regression got to AUC 0.871. Reading that post, you'd be forgiven for thinking credit scoring is a solved problem.

This post is about what happens when the dataset isn't generous.

We pulled the hemanthsai7/loandefault Kaggle dataset — 67,463 anonymized loan applications, 35 features per loan, a real-world 9.25% default rate. Then we trained the two standard credit scoring models: logistic regression (the model class bank regulators are most comfortable with) and random forest (the model class data scientists reach for when they want non-linear interactions).

Headline: 67,463 loans · 9.25% default rate · best ROC AUC achieved was 0.527 (random forest), barely above the 0.5 random baseline.

Both models barely beat random. Here's why — and what the right lesson is.

1. The headline features are flat

If you visit any lender's FAQ page, you'll see the same four risk signals advertised: loan grade, home ownership, verification status, and interest rate. The implication is that these are how the bank decides whether you're a good risk.

In this dataset, they aren't.

[Image: three-panel bar chart showing default rate by loan grade, home ownership, and verification status, all hovering near the 9.25% overall baseline]

**Figure 1.** Default rate by loan grade, home ownership, and verification status. The dashed line is the overall 9.25% default rate. Grade A defaults at 8.7%; Grade G defaults at 10.6%. That's a **1.9 percentage point** spread across what is supposed to be the lender's most discriminating risk tier.

For comparison: the previous credit-risk dataset showed an 88 percentage point spread between Grade A and Grade G. Here it's 1.9. The grade variable in this dataset is essentially noise: it has the column heading of a risk signal but none of the discrimination.

Home ownership tells the same story: mortgage holders default at 8.9%, outright owners at 10.2%, renters at 9.6%. Verification status is even flatter: not verified (9.2%), source verified (9.3%), verified (9.1%). None of these are signals; they're window dressing.
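As a rough illustration of how group-level default rates like these can be tabulated, here is a minimal Python sketch. The rows are a tiny made-up sample, not the Kaggle data, and the `grade` and `default` field names are placeholders chosen for the example.

```python
from collections import defaultdict


def default_rate_by_group(rows, group_key, target_key="default"):
    """Return {group: (n_loans, default_rate)} for a list of dict rows."""
    counts = defaultdict(lambda: [0, 0])  # group -> [n_loans, n_defaults]
    for row in rows:
        bucket = counts[row[group_key]]
        bucket[0] += 1
        bucket[1] += row[target_key]
    return {group: (n, d / n) for group, (n, d) in counts.items()}


# Made-up toy rows for illustration only (not the Kaggle data).
toy_loans = (
    [{"grade": "A", "default": 1}] + [{"grade": "A", "default": 0}] * 3
    + [{"grade": "G", "default": 1}] * 2 + [{"grade": "G", "default": 0}] * 2
)

rates = default_rate_by_group(toy_loans, "grade")
overall = sum(row["default"] for row in toy_loans) / len(toy_loans)
group_rates = [rate for _, rate in rates.values()]
spread = max(group_rates) - min(group_rates)

for group, (n, rate) in sorted(rates.items()):
    print(f"{group}: n={n}, default rate={rate:.1%}")
print(f"overall={overall:.1%}, spread={spread * 100:.1f} percentage points")
```

The spread between the highest and lowest group rate is the number to compare against the overall baseline: a useful risk variable should produce a spread that is large relative to it.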

Interest rate is the most striking of the four:

[Image: two overlapping density curves of interest rates for repaid vs defaulted loans; the curves are nearly identical]

**Figure 2.** Interest rate distributions for repaid vs defaulted loans. Mean for repaid: 11.84%. Mean for defaulted: 11.88%. The lender priced both groups almost identically. Either the original underwriting model treated all of these applications the same way, or the rate is set by factors uncorrelated with the default outcome.

Either way, interest rate cannot help a downstream model predict default in this dataset.

2. Where signal does live — and it's thin

If the headline features don't carry signal, what does? Two columns: Public Record and Delinquency - two years. Both are negative-history flags — incidents the borrower has already accumulated before this loan was originated.

[Image: two bar charts of default rate by number of public records and by number of 2-year delinquencies, both showing rising default rates with more flags]

**Figure 3.** The features that do separate borrowers, but only weakly. Borrowers with 3+ public records default at 12.4% vs 9.2% for those with none. Borrowers with 4+ recent delinquencies default at 12.7% vs 9.2%. Real signal, but the populations are tiny: the 4+ delinquency bucket has only 577 borrowers out of 67,463.
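To make "the populations are tiny" concrete, one option is to put a rough uncertainty band around the 12.7% rate for the 577-borrower bucket. The sketch below uses a normal-approximation interval, which is our assumption for illustration and not a calculation the analysis above reports; the function name is hypothetical.

```python
import math


def default_rate_interval(rate, n, z=1.96):
    """Normal-approximation interval for a default rate observed on n loans."""
    se = math.sqrt(rate * (1 - rate) / n)
    return rate - z * se, rate + z * se, se


# Figures quoted in the post for the 4+ delinquency bucket.
low, high, se = default_rate_interval(rate=0.127, n=577)
print(f"standard error={se:.4f}, interval=({low:.3f}, {high:.3f})")
```

With only 577 borrowers the interval is a little over five percentage points wide, so the bucket's true default rate is pinned down only loosely.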

The maximum absolute Pearson correlation between any individual feature and the default outcome is 0.011. For comparison, a feature would need a correlation of roughly 0.05 or more to be considered weakly informative in most credit modeling contexts. Every feature in this dataset falls below that “weakly informative” line.
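For readers who want to run the same screen on their own data, here is a minimal sketch of a per-feature correlation check against a binary default flag. The column names and values are made up for illustration, and `max_abs_correlation` is a hypothetical helper, not part of any library.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def max_abs_correlation(columns, target):
    """Return (feature_name, |r|) for the feature most correlated with target."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Made-up toy columns for illustration only (not the Kaggle data).
toy_target = [0, 0, 1, 0, 1, 0]
toy_columns = {
    "public_record": [0, 0, 2, 1, 3, 0],
    "interest_rate": [11.5, 12.0, 11.8, 11.9, 11.7, 12.1],
}

best_name, best_r = max_abs_correlation(toy_columns, toy_target)
print(f"strongest feature: {best_name}, |r|={best_r:.3f}")
```

Pearson correlation only captures linear association, so a low value here is a warning sign and not proof that a feature is useless; it is a cheap first screen before fitting any model.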

Why this matters. Real production scorecards from FICO, VantageScore, and the major bureaus use features this dataset doesn't have: trade-line credit utilization, hard-inquiry velocity, payment recency on each tradeline, balance trajectory, and total debt burden across all existing obligations. Those features routinely show correlations of 0.15–0.30 with default. The dataset we're using simply doesn't include them. What “banks miss”, for the banks that use thin feature sets like this one, is the feature set itself.

3. Logistic regression vs. random forest

The conventional wisdom on credit modeling: start with logistic regression because regulators understand it, then move to random forest or gradient boosting if you need to capture non-linear feature interactions. Random forests, the story goes, can wring extra signal from interactions a linear model misses.

To test this, we trained both on the same 50,597-row train split (75%) and evaluated on the same 16,866-row test split (25%). To keep the comparison apples-to-apples in the face of 9.25% class imbalance, we tuned each model's probability threshold so that each one flags the same share of loans as predicted defaults: the 9.25% it scores as riskiest. That gives both models the same “lender risk appetite.”
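A prevalence-matched threshold of this kind can be sketched in a few lines. This is an illustration under our own assumptions, not the exact tuning code used for the results: `prevalence_matched_flags` is a hypothetical helper, the scores are made up, and ties are broken by original order.

```python
def prevalence_matched_flags(scores, prevalence):
    """Flag the highest-scoring `prevalence` fraction of loans as predicted defaults."""
    n_flag = round(len(scores) * prevalence)
    # Rank loan indices from riskiest (highest score) to safest.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    flagged = set(ranked[:n_flag])
    return [1 if i in flagged else 0 for i in range(len(scores))]


# Made-up predicted default probabilities for ten loans.
toy_scores = [0.10, 0.90, 0.30, 0.80, 0.20, 0.05, 0.40, 0.60, 0.70, 0.15]
flags = prevalence_matched_flags(toy_scores, prevalence=0.20)
print(flags)  # the two highest scores (0.90 and 0.80) are flagged
```

Fixing the flagged share instead of the probability cutoff means two models with differently calibrated scores are compared on the same number of rejections.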

Results:

LR → RF: AUC 0.520 → 0.527 · recall 9.8% → 9.9% · precision 9.8% → 9.9% · F1 0.098 → 0.099. The random baseline AUC is 0.500.

Both models are barely above the random-coin-flip baseline. The random forest's extra capacity bought us 0.007 of AUC (0.7 percentage points). At the operating threshold, that translates to one additional default caught out of every ~1,500 loans flagged.

[Image: grouped bar chart comparing AUC, precision, recall, and F1 for logistic regression vs random forest; both models close, both modest]

**Figure 4.** Side-by-side metric comparison. Random forest edges out logistic regression on every metric, but the margins are within statistical noise. The story isn't “RF wins”; it's “neither model can rescue a weak feature set.”

[Image: ROC curves for logistic regression and random forest; both lines hug the diagonal baseline closely]

**Figure 5.** ROC curves. A perfect classifier hugs the upper-left corner; a random classifier hugs the diagonal. Both models hug the diagonal. The random forest's curve is fractionally above the LR curve, which is fractionally above random.

This is what AUC 0.52 looks like. It's not zero signal — both models are statistically above the 0.5 baseline — but it's the kind of signal you'd want to verify with a much larger sample before betting any actual capital on it.
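One way to build intuition for what an AUC of 0.52 means: AUC equals the probability that a randomly chosen defaulted loan gets a higher score than a randomly chosen repaid loan, with ties counted as half. The sketch below computes it directly from that definition on made-up scores. It is quadratic in the number of loans, so it is meant for illustration and not for a 16,866-row test set.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up labels (1 = default) and model scores.
toy_labels = [0, 0, 1, 0, 1, 0]
toy_model_scores = [0.2, 0.4, 0.35, 0.1, 0.8, 0.5]
toy_auc = roc_auc(toy_labels, toy_model_scores)
print(f"AUC = {toy_auc:.2f}")
```

Read this way, an AUC of 0.52 says the model ranks a defaulter above a repayer about 52 times out of 100, against 50 for a coin flip.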

4. The confusion matrices tell the operational story

AUC summarizes the model's ranking quality. The confusion matrix shows what happens when you actually use the model to make decisions. At the prevalence-matched threshold, here's how each model performs on the 16,866-loan test set:

[Image: two confusion matrices side by side; logistic regression and random forest show very similar predicted-default counts and similar true-positive vs false-positive ratios]

**Figure 6.** Confusion matrices at each model's tuned threshold (each model flags ~9.25% of test loans as predicted defaults). LR catches 153 actual defaults out of 1,560 (recall 9.8%) at the cost of 1,408 false positives. RF catches 154 (recall 9.9%) at the cost of 1,407 false positives. The two matrices are essentially identical.

To put that in business terms: for every 100 loans the model flags as risky, only about 10 will actually default. The other 90 would have repaid if approved. That's a precision of about 10%, barely above the 9.25% you'd get by flagging loans at random.

A lender deploying this model into production wouldn't see meaningfully lower losses. They'd just reject more good borrowers.
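The precision, recall, and F1 figures can be reproduced from the confusion-matrix counts quoted in Figure 6. The short sketch below does that arithmetic and adds a lift figure (precision divided by the base rate), which is our own addition for illustration; `threshold_metrics` is a hypothetical helper name.

```python
def threshold_metrics(true_pos, false_pos, actual_defaults, base_rate):
    """Precision, recall, F1, and lift over random flagging from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / actual_defaults
    f1 = 2 * precision * recall / (precision + recall)
    lift = precision / base_rate  # 1.0 means no better than flagging at random
    return {"precision": precision, "recall": recall, "f1": f1, "lift": lift}


# Counts quoted in the post for the 16,866-loan test set (1,560 actual defaults).
lr = threshold_metrics(true_pos=153, false_pos=1408, actual_defaults=1560, base_rate=0.0925)
rf = threshold_metrics(true_pos=154, false_pos=1407, actual_defaults=1560, base_rate=0.0925)

for name, m in (("LR", lr), ("RF", rf)):
    print(f"{name}: precision={m['precision']:.3f} recall={m['recall']:.3f} "
          f"f1={m['f1']:.3f} lift={m['lift']:.2f}")
```

Both lifts come out only a few percent above 1.0, which is the arithmetic behind "barely above random."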

5. What the random forest “found”

Even when the model performs poorly, its feature importance distribution tells us something. If the random forest had found a small handful of strong predictors, we'd see a sharp drop-off in the importance ranking. If it found no predictors at all, every feature would contribute roughly equally.

[Image: horizontal bar chart of top 14 random forest feature importances; values are tightly clustered between 0.045 and 0.060, with no dominant feature]

**Figure 7.** Random forest feature importance, top 14. The top feature (`Loan Amount`) has importance 0.056. The 14th feature has importance 0.034. That's a remarkably flat distribution, exactly what you'd expect when the model is grasping at marginal signal across many weak features rather than relying on a few strong ones.

The features at the top of the ranking (Loan Amount, Home Value Reported, Total Received Late Fee, Interest Rate, Revolving Utilities, Funded Amount Investor) are mostly continuous variables. The model is treating the task as a high-dimensional ranking problem rather than finding categorical “buckets” of high-risk borrowers, because no such buckets exist in this dataset.

That's the diagnostic. A random forest with a flat importance distribution and a sub-0.55 AUC is telling you: the answer isn't in this data.
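One simple way to quantify "flat" is to compare each importance with the uniform share 1/n that every feature would get if none stood out. The sketch below is a hypothetical helper run on made-up importance vectors, not the fitted model's values.

```python
def importance_vs_uniform(importances):
    """Ratio of each importance to the uniform share 1/n, largest first."""
    total = sum(importances)
    n = len(importances)
    shares = [value / total for value in importances]  # normalize to sum to 1
    return sorted((share * n for share in shares), reverse=True)


# Made-up importance vectors for illustration only.
flat = [0.26, 0.25, 0.25, 0.24]      # no feature stands out
peaked = [0.70, 0.10, 0.10, 0.10]    # one dominant feature

print([round(r, 2) for r in importance_vs_uniform(flat)])
print([round(r, 2) for r in importance_vs_uniform(peaked)])
```

Ratios that all sit close to 1.0 indicate a flat distribution; a leading ratio several times larger than 1.0 indicates a dominant predictor.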

6. The right lesson

It would be easy to read this post as a takedown of random forests, or of logistic regression, or of credit scoring in general. None of those are the right read.

The right read is: data quality and feature selection beat model choice. Every single time.

If the features carry signal, as in the credit-risk dataset, even logistic regression — the oldest, simplest classifier in the toolkit — gets to AUC 0.87.
If the features don't carry signal, as in this loan-default dataset, even random forest — a non-linear ensemble with hundreds of trees — barely beats random guessing.

Switching algorithms is the cheapest thing in a modeling project. It costs an hour of compute and a one-line code change. Adding a genuinely informative feature can take weeks of data engineering, vendor negotiation, or new data collection. So when teams ship a weak model and reach for a fancier algorithm before reaching for better features, they're optimizing the wrong axis.

7. Why this matters for an equity investor

QScoring isn't a credit bureau. We score equities. But the modeling discipline rhymes:

Equity factor research spent decades arguing about whether the value, size, momentum, profitability, and investment factors are real, robust, and persistent. The factor zoo problem (Cochrane's “hundreds of significant factors discovered”) is the equity-research equivalent of an overfit model finding spurious signal in noise.
Single-stock scoring needs features that historically separate winners from losers — measured by their information coefficient against forward returns, not their statistical fit in an in-sample regression.
Feature engineering matters more than algorithm choice. QScoring uses a deliberately small, vetted feature set drawn from the academic factor literature, scored consistently across the universe. Adding a deep-learning ranker on top of weak features would be exactly the mistake this post is warning against.
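For readers unfamiliar with the information coefficient mentioned above, it is commonly computed as the Spearman rank correlation between a score and subsequent returns. The sketch below shows that calculation on made-up numbers; it is a generic illustration, not QScoring's implementation, and the function names are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return 0.0 if vx == 0 or vy == 0 else cov / math.sqrt(vx * vy)


def information_coefficient(scores, forward_returns):
    """Spearman rank correlation between scores and forward returns."""
    return pearson(average_ranks(scores), average_ranks(forward_returns))


# Made-up factor scores and next-period returns for five tickers.
toy_factor_scores = [0.9, 0.1, 0.5, 0.7, 0.3]
toy_forward_returns = [0.04, -0.02, 0.03, 0.01, -0.01]
ic = information_coefficient(toy_factor_scores, toy_forward_returns)
print(f"IC = {ic:.2f}")
```

Because it works on ranks, the measure asks only whether higher-scored names tended to outperform lower-scored ones, which is the out-of-sample question a feature has to answer.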

Browse the methodology page to see which features we use and how they're combined, or look at any individual ticker's live score to see the factor breakdown, including each factor's contribution and the underlying metric values.
