
Detecting credit card fraud: when 99.8% accuracy means your model caught nothing

We trained logistic regression and random forest on 284,807 European card transactions (0.173% fraud rate). ROC AUC said one model was better. PR-AUC said the opposite. With extreme class imbalance, only one of those metrics tells the truth.

Every introductory machine-learning class teaches accuracy and ROC AUC as the headline metrics for binary classification. Most of those classes also use balanced datasets where those metrics work reasonably well — cancer/no-cancer split 50/50, churn/no-churn split 30/70, that kind of thing.

Real-world fraud detection is not that. The fraud rate on this Kaggle dataset is 0.173%, about one in 579 transactions. At that level of imbalance, the metric you choose decides whether your fraud team thinks the model is a success or a failure. We're going to show why.

The data: mlg-ulb/creditcardfraud, a publicly available dataset of 284,807 European cardholder transactions across a 48-hour window in September 2013. Of those, 492 are confirmed fraud. The features are anonymized — 28 of the 30 columns are PCA-transformed for privacy (labeled V1 through V28), with only Time and Amount kept in their original form.

That's a more interesting setup than it sounds. We can't lean on the columns being “income” or “loan grade.” The model has to find signal in features it can't name. Which makes this a particularly clean test of where the signal actually lives — and which metric is reading it correctly.

Headline: 284,807 transactions · 0.173% fraud rate · best ROC AUC 0.968 (LR) · best PR-AUC 0.824 (RF) · 87% of fraud caught by reviewing the top 1% of flagged transactions.

1. The class imbalance is what eats your metrics

Stacked horizontal bar showing 284,315 legitimate transactions and only 492 fraud cases — fraud is barely visible
Figure 1. 284,315 legitimate transactions, 492 fraud cases. If you predicted “legitimate” for every transaction in this dataset, you'd be wrong 492 times out of 284,807, for an accuracy of 99.83%. That number sounds like a win. It catches zero fraud.

This is the imbalance trap. When the rare class is what matters, the metric you reach for first — overall accuracy — is dominated by the easy class. A model that simply predicts “legitimate” for every single transaction beats almost any naive predictor at “accuracy.” The number looks great. The customers losing money to fraud see no benefit.
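The arithmetic behind that trap is short enough to check by hand. A minimal sketch, using only the counts from Figure 1 (the variable names are ours):

```python
# Counts taken from Figure 1.
n_legit = 284_315
n_fraud = 492
n_total = n_legit + n_fraud

# A "model" that labels every transaction legitimate.
true_positives = 0          # it never flags anything
true_negatives = n_legit    # every legitimate transaction is labeled correctly
false_negatives = n_fraud   # every fraud case is missed

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(f"accuracy:     {accuracy:.2%}")   # 99.83%
print(f"fraud recall: {recall:.0%}")     # 0%
```

Accuracy and recall are computed from the same four counts. Only recall looks at the row that matters here.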

The same problem leaks into the next metric most teams reach for: ROC AUC. We'll get to that one in a minute.

2. Where the fraud actually lives

Before we train any models, two things from the raw data are worth flagging — both because they're counterintuitive and because they shape the modeling choices later.

First, fraud has a time-of-day pattern. The dataset spans 48 hours, and if we treat transaction time modulo 24 hours as a rough “hour of day,” the fraud rate isn't constant:

Bar chart of fraud rate by hour-of-day showing elevated rates between 2am and 6am
Figure 2. Fraud rate by hour-of-day. The overnight window (2am–6am local) shows rates up to 4× the overall baseline. That is when card-not-present fraud rings prefer to operate, because human review queues are thinner and customers are asleep and slower to notice charges.
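For readers who want to reproduce the bucketing, here is a rough sketch of the "time modulo 24 hours" idea. It assumes the Time column holds seconds elapsed since the first transaction in the dataset, so "hour 0" is the hour of the first transaction, not necessarily midnight. The function name and the six sample rows are made up for illustration.

```python
from collections import defaultdict

def fraud_rate_by_hour(times_s, labels):
    """Return {hour: fraud rate}, treating elapsed seconds modulo 24h as hour of day."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for t, y in zip(times_s, labels):
        hour = int(t // 3600) % 24   # fold the 48-hour window onto one day
        totals[hour] += 1
        frauds[hour] += y
    return {h: frauds[h] / totals[h] for h in sorted(totals)}

# Made-up rows: (elapsed seconds, 1 = fraud). 27h and 30h fold onto hours 3 and 6.
times = [0, 1800, 3 * 3600, 3 * 3600 + 10, 27 * 3600, 30 * 3600]
labels = [0, 0, 1, 0, 0, 1]

rates = fraud_rate_by_hour(times, labels)
print(rates)
```

With only two days of data, each hour bucket contains at most two calendar days, so the per-hour rates are noisy and worth reading as a hint rather than a measurement.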

Second, fraud amounts are smaller than legitimate ones — at the median. This one surprises people:

Histogram of transaction amounts on log scale showing fraud distribution skewed toward smaller amounts compared to legitimate transactions
Figure 3. Transaction amount distribution (log-x scale). Median legitimate transaction: $22.00. Median fraud transaction: $9.25. The intuition that “fraud is big-ticket purchases” is wrong on this dataset — the most common fraud pattern is small-dollar card-testing transactions where the attacker validates that a stolen card works before attempting larger charges.

The mean tells a different story (fraud mean is $122 vs $88 for legit — a few huge fraud transactions pull the average up), which is why looking at distributions matters. Means lie under fat tails. So do single-threshold “flag if amount > X” rules.
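A toy example makes the mean-versus-median gap concrete. The amounts below are invented, not drawn from the dataset: six small charges and one very large one, with a hypothetical "flag if amount > 100" rule applied on top.

```python
from statistics import mean, median

# Invented amounts: mostly small card-testing charges plus one large purchase.
amounts = [1.00, 2.00, 5.00, 9.00, 12.00, 30.00, 2000.00]

print(f"median: {median(amounts):.2f}")   # 9.00
print(f"mean:   {mean(amounts):.2f}")     # 294.14

# A single-threshold rule tuned to the mean-sized transaction.
threshold = 100.00   # hypothetical cutoff
flagged = [a for a in amounts if a > threshold]
print(f"flagged {len(flagged)} of {len(amounts)}")   # flagged 1 of 7
```

One outlier moves the mean far above every typical value, and a rule calibrated to that mean misses six of the seven charges.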

3. Training two models

Same two model families as our previous posts on credit-risk modeling and loan default prediction: logistic regression (linear, regulator-friendly) and random forest (non-linear ensemble). Both were trained with class_weight="balanced" on a 70/30 stratified split. Test set: 85,443 transactions, 148 of them fraud.
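To see what the balanced setting does at this fraud rate, the sketch below applies the heuristic scikit-learn documents for class_weight="balanced", namely n_samples / (n_classes * class_count). The training-set counts are our assumption: they are obtained by subtracting the test-set figures above from the dataset totals.

```python
# Totals and test-set counts quoted in the post.
n_total, n_fraud_total = 284_807, 492
n_test, n_fraud_test = 85_443, 148

# Assumption: everything not in the test set was used for training.
n_train = n_total - n_test                     # 199,364
n_fraud_train = n_fraud_total - n_fraud_test   # 344
n_legit_train = n_train - n_fraud_train        # 199,020

def balanced_weights(counts):
    """Per-class weight: n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * k) for label, k in counts.items()}

weights = balanced_weights({"legit": n_legit_train, "fraud": n_fraud_train})
ratio = weights["fraud"] / weights["legit"]

print(f"legit weight: {weights['legit']:.4f}")   # 0.5009
print(f"fraud weight: {weights['fraud']:.2f}")   # 289.77
print(f"ratio:        {ratio:.1f}")              # 578.5
```

Under these assumptions, each missed fraud costs the training loss about as much as 579 misclassified legitimate transactions. That pushes both models toward flagging more, which is consistent with the high recall and low precision logistic regression shows at the 0.5 threshold.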

The headline numbers:

ROC AUC: LR 0.968 / RF 0.949
PR-AUC: LR 0.705 / RF 0.824
Precision @ 0.5: LR 6.7% / RF 96.5%
Recall @ 0.5: LR 87.8% / RF 73.6%

Read those carefully. The two models are doing very different things, and the headline summary metric you pick decides which one looks better.

4. ROC AUC vs. PR-AUC — the chart that explains everything

The single most important plot in this post:

Two panels: ROC curves on the left showing both models near top-left corner, PR curves on the right showing the random forest meaningfully higher than logistic regression
Figure 4. Same two models, two metrics. Left (ROC): both curves hug the upper-left corner. LR's AUC is 0.968 and RF's is 0.949, so LR looks better. Right (PR): RF clearly dominates. RF's PR-AUC is 0.824 vs LR's 0.705. Different stories, same models, same data.

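This is the lesson. The two AUC metrics disagree because they answer different questions. ROC AUC asks how well the model ranks fraud above legitimate transactions, trading true positive rate against false positive rate. PR-AUC asks how much of what the model flags is actually fraud at each level of recall.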

At extreme class imbalance, ROC AUC saturates near 1.0 for any model that gets the easy stuff right. The difference between “0.95 ROC AUC” and “0.97 ROC AUC” sounds tiny but can hide a 2× difference in how many false alarms your analysts wade through to find the same number of real frauds.
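The link between a point on the ROC curve and the analyst's workload is a one-line identity: precision = TPR·π / (TPR·π + FPR·(1 − π)), where π is the fraud prevalence. The sketch below plugs in this dataset's prevalence and two hypothetical operating points that share a true positive rate of 85% and differ only in false positive rate (1% vs 0.5%). Those operating points are illustrative and were not read off the models in this post.

```python
def precision_from_rates(tpr, fpr, prevalence):
    """Precision implied by a ROC operating point at a given class prevalence."""
    tp = tpr * prevalence
    fp = fpr * (1.0 - prevalence)
    return tp / (tp + fp)

def false_alarms_per_catch(tpr, fpr, prevalence):
    """Expected false positives reviewed for each true fraud caught."""
    return (fpr * (1.0 - prevalence)) / (tpr * prevalence)

prevalence = 492 / 284_807   # fraud rate of the dataset

# Two hypothetical operating points, almost indistinguishable on a ROC plot.
p_a = precision_from_rates(0.85, 0.010, prevalence)
p_b = precision_from_rates(0.85, 0.005, prevalence)
alarms_a = false_alarms_per_catch(0.85, 0.010, prevalence)
alarms_b = false_alarms_per_catch(0.85, 0.005, prevalence)

print(f"FPR 1.0%: precision {p_a:.1%}, {alarms_a:.1f} false alarms per fraud caught")
print(f"FPR 0.5%: precision {p_b:.1%}, {alarms_b:.1f} false alarms per fraud caught")
```

Half a percentage point of FPR is a barely visible shift along the x-axis of a ROC plot, yet at this prevalence it halves the number of false alarms reviewed per fraud caught.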

Rule of thumb. If the positive class is below ~5% of your data — fraud, rare disease screening, ad clicks, churn at sub-monthly intervals — make PR-AUC your headline metric. ROC AUC stays in the deck as a sanity check, but the optimization target should be PR-AUC or a directly operational quantity like recall-at-fixed-FPR or precision-at-top-K.
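Both metrics can be computed from scratch in a few lines, which helps show why they diverge. The sketch below uses the rank-based definition of ROC AUC and average precision as the PR summary. Average precision is one common estimator of PR-AUC, and the post does not say which one produced its numbers. The 1,000-row toy ranking is made up, with two frauds placed at ranks 5 and 10.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy ranking: 1,000 transactions with strictly decreasing scores.
n = 1000
scores = [1.0 - i / n for i in range(n)]
labels = [0] * n
labels[4] = 1   # one fraud ranked 5th
labels[9] = 1   # one fraud ranked 10th

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC AUC: {auc:.3f}")            # 0.994
print(f"average precision: {ap:.3f}")   # 0.200
```

The ranking puts both frauds in the top 1%, so ROC AUC is nearly perfect. An analyst working down the list still reviews four legitimate transactions for every fraud, and average precision is the number that reflects that.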

5. The threshold matters more than the algorithm

Both models output a probability between 0 and 1. The default is to flag anything above 0.5 as fraud. With class imbalance this severe, that's rarely the right cutoff.

Two confusion matrices for random forest: at threshold 0.5 showing high precision and lower recall, at top-0.5% threshold showing more captured frauds but more false positives
Figure 5. Same random forest, two thresholds. Left (default 0.5): precision is 96.5% and recall is 73.6%, which means only 4 false alarms in 85,443 transactions, but 39 actual frauds missed. Right (top-0.5% threshold): precision drops to 29.4%, but recall rises to 85.1%. The model now catches 126 of 148 frauds at the cost of 302 false positives the analysts will review.

Which one is “better” depends entirely on the fraud team's operational budget. If they can review 300 transactions per day, the top-0.5% threshold catches more fraud. If they can only review 50 per day, the high-precision threshold is the right cut. The model didn't change — only the operating point did.
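Picking the operating point from a review budget, as opposed to a fixed 0.5 cutoff, can be sketched as follows. The helper names and the ten toy scores are hypothetical. Rounding the budget up is our choice, and tied scores at the cutoff can flag slightly more than the budget.

```python
import math

def threshold_for_budget(scores, review_fraction):
    """Score cutoff that flags about the top `review_fraction` of transactions."""
    k = max(1, math.ceil(len(scores) * review_fraction))
    return sorted(scores, reverse=True)[k - 1]

def confusion_at(labels, scores, threshold):
    """Return (tp, fp, fn, tn) when scores >= threshold are flagged as fraud."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Toy model output: ten transactions, three of them fraud.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05, 0.01]
labels = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]

default_cm = confusion_at(labels, scores, 0.5)          # (2, 2, 1, 5)
budget_thr = threshold_for_budget(scores, 0.5)          # 0.40
budget_cm = confusion_at(labels, scores, budget_thr)    # (3, 2, 0, 5)
print(default_cm, budget_thr, budget_cm)
```

The same scores produce two different confusion matrices, and the only input that changed is how many transactions the team is willing to review. Applied to the 85,443-row test set, a 0.5% budget rounds up to 428 flagged transactions, which matches the 126 + 302 in Figure 5.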

6. The operational view: cumulative gains

For a fraud team, the metric they actually care about is roughly: if I have capacity to review the top X% of flagged transactions, what fraction of fraud will I catch? That's a cumulative gains curve:

Cumulative gains curve showing that reviewing the top 1% of transactions ranked by random forest risk catches ~87% of fraud
Figure 6. Cumulative gains. Reviewing the top 1% of model-flagged transactions catches 87% of fraud with the random forest and 86% with logistic regression. By 5%, both models are over 90%. The diagonal is what you'd get reviewing transactions at random.

This is the chart you want in front of a fraud-ops director. “Give me a budget to review 1% of transactions and I'll catch 87% of your fraud” is a defensible business case. “My model has 0.96 ROC AUC” is not.
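A cumulative gains curve is simple to compute once transactions are sorted by score. The sketch below is a minimal version with a made-up ten-row example. The function name is ours, and the small epsilon only guards against floating-point noise when rounding the review count up.

```python
import math

def cumulative_gains(labels, scores, fractions):
    """Share of all positives found in the top `f` of the ranking, for each f."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda t: -t[0])]
    total_pos = sum(ranked)
    n = len(ranked)
    gains = {}
    for f in fractions:
        k = min(n, math.ceil(n * f - 1e-9))   # number of transactions reviewed
        gains[f] = sum(ranked[:k]) / total_pos
    return gains

# Toy model output: ten transactions, three of them fraud.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05, 0.01]
labels = [1, 0, 1, 0, 1, 0, 0, 0, 0, 0]

gains = cumulative_gains(labels, scores, [0.2, 0.5, 1.0])
for f, g in gains.items():
    print(f"review top {f:.0%}: catch {g:.0%} of fraud")
```

Reviewing at random would catch a fraction f of the fraud for a review fraction f, which is the diagonal in Figure 6. The gap between the curve and that diagonal is what the review budget is buying.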

7. What the random forest found in anonymized features

Even though V1–V28 are PCA components and have no human-readable names, the random forest's importance ranking tells us how concentrated the fraud signal is across them:

Horizontal bar chart of top 12 random forest feature importances showing V17, V14, V12, V10 dominating with importance over 0.10 each
Figure 7. Random forest feature importance, top 12. Unlike the loan-default dataset, where importance was uniformly distributed across 30 weak features, here a handful of PCA components (V17, V14, V12, V10) carry most of the signal. The fraud features were anonymized, but in a way that preserves enough variance for the model to discriminate.

This is what “a model is finding something” looks like in feature-importance terms: a few features doing most of the work, with a long tail of marginal contributors. Compare that to our loan-default post, where the top feature's importance was barely above the bottom feature's — that was the signature of a model grasping at noise.

8. What this means for equity scoring

QScoring scores equities, not fraud, but the metric-choice question rhymes. Single-stock scoring is also an imbalanced ranking problem: the genuine breakouts and breakdowns over the next quarter are the rare events; most stocks drift along with sector and market beta. The wrong evaluation metric makes everything look like it's working.

If you only remember one thing from this post: the metric you optimize is the metric you get. Choose accordingly.
