a breakdown of

Heart Disease
Classification Results

A step-by-step visual walkthrough of C4.5, CART, and pruned CART models - from raw data all the way to final predictions.

✦ 302 instances ✦ 13 features ✦ 3 models ✦ best recall: 87.9%
1
Step One
The Dataset 🌸

We're using the Cleveland Heart Disease dataset. After removing duplicate rows (the Kaggle version had 1,025 rows but 723 were duplicates!), we're left with 302 clean unique patients. The goal: predict whether each patient has heart disease based on 13 clinical measurements.
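A minimal sketch of the de-duplication idea, using only the standard library. The three-column CSV below is a made-up stand-in (the real file has 13 features plus a target), and `drop_duplicate_rows` is a hypothetical helper name, not something from the original notebook.

```python
import csv
import io


def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # lists are unhashable, tuples are fine as set members
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical extract: three of the five data rows are identical.
SAMPLE_CSV = """age,cp,target
63,3,1
63,3,1
41,1,0
63,3,1
57,0,0
"""

reader = csv.reader(io.StringIO(SAMPLE_CSV))
header = next(reader)
rows = list(reader)
unique_rows = drop_duplicate_rows(rows)
print(f"{len(rows)} -> {len(unique_rows)} rows")  # 5 -> 3 rows
```

The same check on the real file is what takes 1,025 rows down to 302.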

302 Clean Rows · 13 Features · 160 No Disease · 142 Disease · 53/47 % Split
Class Balance - Nearly Equal ✨
No Heart Disease (0) - 160 patients
Heart Disease (1) - 142 patients
💡
Why balance matters: With a roughly 50/50 split, accuracy is a meaningful metric here - unlike imbalanced datasets where a lazy model could score 80% by always predicting the majority class without learning anything.
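To put a number on the "lazy model" idea, here is a small sketch of the majority-class baseline. The Cleveland counts come from this walkthrough; the 80/20 dataset is a hypothetical example of imbalance.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of a model that always predicts the most common class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


cleveland = {"no_disease": 160, "disease": 142}  # counts from this walkthrough
skewed = {"no_disease": 80, "disease": 20}       # hypothetical imbalanced dataset

print(f"{majority_baseline_accuracy(cleveland):.1%}")  # 53.0%
print(f"{majority_baseline_accuracy(skewed):.1%}")     # 80.0%
```

Any model here has to beat roughly 53% accuracy before it has learned anything useful.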
♡ ♡ ♡
2
Step Two
Preprocessing 🎀

Before any learning can happen, the data needs to be cleaned and converted. Decision trees need numbers - so categorical columns get turned into binary yes/no columns using One-Hot Encoding.

What Happens & Why - Step by Step
🗑️
Remove Duplicates: The Kaggle heart.csv has 1,025 rows but only 302 are unique. Duplicates leak into both train and test sets, causing fake perfect scores - like the 100% accuracy we saw earlier!
1,025 → 302 rows
🔢
Binarise Target: Original values are 0-4 (severity). We simplify: 0 stays 0 (healthy), anything 1-4 becomes 1 (disease present). Binary classification is cleaner and clinically meaningful.
0,1,2,3,4 → 0 or 1
🏷️
One-Hot Encode Categoricals: 8 categorical columns (sex, cp, fbs, restecg, exang, slope, ca, thal) expand into binary yes/no columns, so the model won't treat category 3 as "bigger" than category 1.
13 features → 28 columns
✂️
Stratified 80/20 Split: 241 rows for training, 61 for testing. Stratified means the 53/47% class balance is preserved in both subsets so the test set is representative.
241 train · 61 test
⚖️
Balanced Class Weights: class_weight='balanced' weights each class inversely to its frequency, so errors on the rarer class (here, disease) are penalised slightly more. With near-equal classes the effect is small, but it's good practice for imbalanced medical data.
class_weight='balanced'
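The preprocessing steps above can be sketched in plain Python. The helper names and the sample `cp` values are illustrative assumptions. The weight formula, n_samples / (n_classes * class_count), is the one scikit-learn documents for `class_weight='balanced'`. For simplicity it is applied here to the full 160/142 counts, whereas a real pipeline would compute it on the training split.

```python
from collections import Counter


def binarise_target(severity):
    """Map severity 0 to 0 (healthy) and severities 1-4 to 1 (disease present)."""
    return 0 if severity == 0 else 1


def one_hot(values, prefix):
    """Expand one categorical column into 0/1 columns, one per observed category."""
    categories = sorted(set(values))
    return {f"{prefix}_{c}": [int(v == c) for v in values] for c in categories}


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


cp = [0, 2, 1, 0, 3]                 # hypothetical chest-pain codes for five patients
encoded = one_hot(cp, "cp")          # produces cp_0, cp_1, cp_2, cp_3
labels = [0] * 160 + [1] * 142       # class counts from this walkthrough
weights = balanced_class_weights(labels)

print([binarise_target(s) for s in range(5)])
print(encoded["cp_0"])
print({cls: round(w, 3) for cls, w in weights.items()})
```

The disease class ends up with a weight just above 1 and the healthy class just below 1, which is why the effect is small on this dataset.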
♡ ♡ ♡
3
Step Three
Training the Models 💻

Three models are trained on the same 241 rows. The only differences are how they choose which feature to split on at each decision node, and whether branches get pruned afterwards.

How Each Model Makes Decisions
🌹
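C4.5 - criterion="entropy": Asks: "Which split most reduces uncertainty?" Uses entropy (information theory). Classic C4.5 also corrects for attributes with many categories using Gain Ratio. Assuming the model was built with scikit-learn, as the parameter names suggest, criterion="entropy" uses plain information gain on binary splits without that correction, so this is best read as a C4.5-style tree. Grew to depth 9 with 33 leaf nodes.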
Entropy · Gain Ratio · depth 9 · 33 leaves
💜
CART - criterion="gini": Asks: "Which split minimises misclassification probability?" Uses Gini Impurity, which is cheaper to compute than entropy because it needs no logarithms. Always makes binary splits. Without pruning, it overfits badly on small data.
Gini Impurity · Binary splits · depth ? · 29 leaves
🌿
CART Pruned - ccp_alpha tuned by CV: Same as CART but branches that don't justify their complexity are removed. Alpha selected by 5-fold cross-validation on the training set. Reduces to 24 leaves.
Gini · Cost-Complexity Pruning · 24 leaves
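The split criteria named above can be written out in a few lines. This is a sketch of the formulas, not the training code: the parent counts are the full-dataset class counts from this walkthrough, and the candidate split into two children is a hypothetical example.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a class-count vector."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def gini(counts):
    """Gini impurity of a class-count vector."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_drop(parent, children, impurity):
    """Parent impurity minus the size-weighted impurity of the children."""
    total = sum(parent)
    weighted = sum(sum(ch) / total * impurity(ch) for ch in children)
    return impurity(parent) - weighted


def gain_ratio(parent, children):
    """Classic C4.5 gain ratio: information gain divided by split information."""
    split_info = entropy([sum(ch) for ch in children])
    gain = impurity_drop(parent, children, entropy)
    return gain / split_info if split_info else 0.0


parent = [160, 142]                  # [no disease, disease]
children = [[120, 30], [40, 112]]    # hypothetical yes/no split of those patients

print(f"parent entropy : {entropy(parent):.4f}")
print(f"parent gini    : {gini(parent):.4f}")
print(f"info gain      : {impurity_drop(parent, children, entropy):.4f}")
print(f"gini decrease  : {impurity_drop(parent, children, gini):.4f}")
print(f"gain ratio     : {gain_ratio(parent, children):.4f}")
```

Both criteria reward the same thing, children that are purer than the parent. They just measure purity on slightly different scales.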
♡ ♡ ♡
4
Step Four
Performance Metrics 📊

Each model is evaluated on the 61 test instances it never saw during training. Four metrics give us the full picture - because accuracy alone hides class-specific failures.

What Each Metric Actually Means 💕
🎯
Accuracy: "What % of all predictions were correct?" - Reliable here because classes are balanced.
🔍
Precision: "When we predicted disease, how often were we right?" - High precision = few false alarms.
🩺
Recall ← most important: "Of all real disease cases, how many did we catch?" - Missing a diagnosis is dangerous. This is the metric that matters most clinically.
⚖️
F1 Score: "Balance between precision and recall." - Best single number to rank models fairly.
Side-by-Side Comparison - Our Actual Results
Accuracy: C4.5 78.7% 👑 · CART 68.9% · Pruned 77.0%
Precision: C4.5 83.3% 👑 · CART 73.3% · Pruned 74.4%
Recall 🩺 Most Important in Medicine: C4.5 75.8% · CART 66.7% · Pruned 87.9% 👑
F1 Score: C4.5 0.794 · CART 0.698 · Pruned 0.806 👑
♡ ♡ ♡
5
Step Five
Confusion Matrices 🔢

A confusion matrix breaks predictions into four boxes. The most important for heart disease is False Negatives - real disease cases the model missed. These are the dangerous errors because the patient goes home thinking they're fine.

🩺
How to read each box: True Positive (TP) = correctly caught disease  ·  True Negative (TN) = correctly identified healthy  ·  False Negative (FN) = missed disease ← worst!  ·  False Positive (FP) = unnecessary false alarm
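Here is a small sketch of how the four boxes are counted from true labels and predictions. The eight labels below are invented for illustration, and `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp), treating 1 as disease and 0 as no disease."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1   # disease, correctly caught
        elif truth == 1:
            fn += 1   # disease, missed
        elif pred == 1:
            fp += 1   # healthy, false alarm
        else:
            tn += 1   # healthy, correctly cleared
    return tn, fp, fn, tp


y_true = [1, 1, 1, 0, 0, 0, 1, 0]  # made-up ground truth
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]  # made-up predictions

print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```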
C4.5
True Neg. 23 · False Pos. 5
False Neg. ⚠️ 8 · True Pos. 25
Missed 8 disease cases. Fewest false alarms (5).
CART
True Neg. 20 · False Pos. 8
False Neg. ⚠️ 11 · True Pos. 22
Missed 11 disease cases - worst of the three.
CART Pruned ✨
True Neg. 18 · False Pos. 10
False Neg. ✓ 4 · True Pos. 29
Only missed 4 disease cases. Best for clinical use.
💡
The trade-off in plain English: Pruned CART raises more false alarms (10 vs 5 for C4.5) but misses only 4 real cases vs 8. For a screening task like this, that is usually the right trade-off: a false alarm leads to more tests, while a missed diagnosis can be fatal.
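All four metrics follow directly from the confusion matrices above, so they can be recomputed as a sanity check. The counts are the ones reported in this walkthrough; the function and dictionary names are illustrative.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# (tn, fp, fn, tp) for each model on the 61 test patients.
MATRICES = {
    "C4.5": (23, 5, 8, 25),
    "CART": (20, 8, 11, 22),
    "CART Pruned": (18, 10, 4, 29),
}

for name, counts in MATRICES.items():
    m = metrics(*counts)
    print(
        f"{name:12s} acc={m['accuracy']:.1%} prec={m['precision']:.1%} "
        f"rec={m['recall']:.1%} f1={m['f1']:.3f}"
    )
```

Each matrix sums to 61 test patients, 33 with disease and 28 without.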
♡ ♡ ♡
6
Step Six
Why Pruning Helped ✂️

Pruning is the single most impactful thing in this experiment. Without it, CART memorises the 241 training patients including their noise, then struggles on new ones. Pruning removes branches that don't reflect genuine patterns.

Before vs After Pruning
CART Unpruned
29
leaf nodes · F1: 0.698
CART Pruned ✨
24
leaf nodes · F1: 0.806 · −17%
1️⃣
Grow the full tree: CART builds until the training set is memorised - 29 leaves, overfit.
2️⃣
Generate candidate alphas: The pruning path provides a list of α values - each removes more branches. Higher α = simpler tree.
3️⃣
5-fold cross-validation: Training data is split into 5 groups. For each α, train on 4, validate on 1 - five times - and average the F1 scores. Every row gets validated exactly once.
4️⃣
Pick the best α & retrain: The α with the highest average CV F1 is used to build the final pruned tree. 29 → 24 leaves.
Recall: 66.7% → 87.9%  ·  F1: 0.698 → 0.806
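The four steps above can be sketched with scikit-learn, assuming that is the library behind the `ccp_alpha` and `class_weight` parameters. The data here is a synthetic stand-in generated by `make_classification`, not the real 241 training rows, so the leaf counts it prints will differ from the 29 → 24 reported above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training set (an assumption, not the real data).
X_train, y_train = make_classification(
    n_samples=241, n_features=28, n_informative=5, random_state=0
)


def make_tree(alpha=0.0):
    """CART-style tree with the settings described in this walkthrough."""
    return DecisionTreeClassifier(
        criterion="gini", class_weight="balanced", ccp_alpha=alpha, random_state=0
    )


# 1) Grow the full, unpruned tree.
full_tree = make_tree().fit(X_train, y_train)

# 2) Candidate alphas from the pruning path. The last one prunes to a single
#    node, so it is dropped; clipping guards against tiny negative values.
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))

# 3) 5-fold cross-validation: mean F1 for every candidate alpha.
scores = [
    cross_val_score(make_tree(a), X_train, y_train, cv=5, scoring="f1").mean()
    for a in alphas
]

# 4) Pick the alpha with the best mean F1 and retrain on all training rows.
best_alpha = float(alphas[int(np.argmax(scores))])
pruned_tree = make_tree(best_alpha).fit(X_train, y_train)

print(f"best alpha: {best_alpha:.5f}")
print(f"leaves: {full_tree.get_n_leaves()} -> {pruned_tree.get_n_leaves()}")
```

Because the alpha is chosen using only training folds, the 61 test patients stay untouched until the final evaluation.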
♡ ♡ ♡
7
Step Seven
Feature Importances 🌺

After training, the pruned CART model tells us exactly which features it actually used. Remarkably, only 3 features out of 28 do all the work - everything else was pruned away as noise. This makes the model beautifully interpretable.

What the Model Actually Learned to Look At
🫀 Chest Pain (cp_0) · 70.9%
🧬 Thalassemia (thal_2) · 17.2%
🔬 Vessels (ca_0) · 12.0%
🫀 cp_0 · 70.9%
Asymptomatic chest pain. Patients with heart disease often show no typical symptoms - making the absence of typical chest pain a paradoxically strong predictor.
🧬 thal_2 · 17.2%
Reversible defect on the thallium stress test (the dataset's thal field). Indicates areas of the heart with reduced blood flow under stress - a direct clinical marker of coronary artery disease.
🔬 ca_0 · 12.0%
Zero major vessels visible by fluoroscopy. Healthy vessels show up clearly - none visible suggests blockages restricting blood flow.
Clinically meaningful! All three features the pruned model selected align directly with known cardiac disease indicators in medical literature. A model explainable with just three questions - any clinician could understand and audit every single prediction.
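A short sketch of how the "features actually used" list can be read off an importance table. The three non-zero shares are the ones reported above. The zero entries for `age` and `chol` are illustrative placeholders for columns the pruned tree never splits on. In scikit-learn these values would typically come from a fitted tree's `feature_importances_` attribute.

```python
def used_features(importances, threshold=0.0):
    """Return (name, share) pairs above the threshold, largest share first."""
    kept = [(name, value) for name, value in importances.items() if value > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


importances = {
    "age": 0.0,       # placeholder: never used in a split
    "cp_0": 0.709,
    "chol": 0.0,      # placeholder: never used in a split
    "ca_0": 0.120,
    "thal_2": 0.172,
}

for name, share in used_features(importances):
    print(f"{name:8s} {share:.1%}")
```

Importances say how much each feature contributed to the splits, not in which direction it pushes a prediction. For direction, read the tree itself.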
♡ ♡ ♡
8
Step Eight
Final Verdict 🏆

Here's the complete picture. Each model has its strengths - but for a medical classification task where missing a diagnosis is the worst possible error, CART Pruned wins clearly.

🌹
C4.5
0.794
F1 · Best Precision (83.3%)
Most careful about false alarms - only 5. But misses 8 real disease cases. Best choice if sending people for unnecessary tests is very costly. Outperforms unpruned CART overall.
💜
CART
0.698
F1 · Weakest Overall
Overfits on just 241 training rows. Misses 11 disease cases - worst of the three. Without pruning, CART memorises noise instead of learning real patterns on small datasets.
🏆
CART Pruned
0.806
F1 · Best Recall (87.9%) · Recommended
Catches 29 of 33 disease cases. Only misses 4. More false alarms (10) but that's the right medical trade-off. Also the most interpretable - just 3 features drive every decision.
The 4 Things to Remember 💕
🎀
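Pruning beats algorithm choice: Pruning improved CART's recall by 21 points and F1 by 0.108, while switching from CART to C4.5 improved recall by only about 9 points and F1 by 0.096. On accuracy the two gaps are similar (about 8 vs 10 points), so on the metrics that matter most here, pruning matters more than which algorithm we pick.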
🎀
Small datasets amplify overfitting: With only 302 rows, unpruned trees memorise training noise easily. This is why CART unpruned scored worst despite being a strong algorithm on larger datasets.
🎀
Recall is the right metric here: Even though CART Pruned has lower accuracy than C4.5 (77.0% vs 78.7%), it catches 4 more disease cases. In medicine, that's what actually matters.
🎀
3 features tell the whole story: Chest pain type, thalassemia result, and vessel count explain everything. A beautifully simple, clinically interpretable model that any doctor could audit and trust.
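The "pruning vs algorithm choice" comparison is easiest to judge when both gaps are measured on the same metric. This sketch recomputes them from the confusion matrices reported earlier. The helper names are illustrative.

```python
# (tn, fp, fn, tp) for each model on the 61 test patients.
MATRICES = {
    "C4.5": (23, 5, 8, 25),
    "CART": (20, 8, 11, 22),
    "CART Pruned": (18, 10, 4, 29),
}


def accuracy(tn, fp, fn, tp):
    return (tp + tn) / (tn + fp + fn + tp)


def recall(tn, fp, fn, tp):
    return tp / (tp + fn)


def gap_in_points(metric, better, worse):
    """Difference between two models on one metric, in percentage points."""
    return 100 * (metric(*MATRICES[better]) - metric(*MATRICES[worse]))


for metric in (recall, accuracy):
    pruning = gap_in_points(metric, "CART Pruned", "CART")
    algorithm = gap_in_points(metric, "C4.5", "CART")
    print(f"{metric.__name__:8s} pruning: {pruning:+.1f}  algorithm: {algorithm:+.1f}")
```

On recall, pruning is worth more than twice the algorithm switch. On accuracy, the two effects are close.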