Heart Disease ML

Applying Machine Learning models to predict heart disease risk from health-related tabular indicators.

Modality 1: Tabular Data

1. From EDA to Tabular ML Task

This section presents our machine learning workflow for the Heart Disease Health Indicators dataset. We built a binary classification pipeline to predict heart disease risk, drawing on the Assignment 1 EDA and covering feature preprocessing, a six-model comparison, threshold tuning, and a final test evaluation.

Overview

  • Dataset: Heart Disease Health Indicators
  • Task: Binary classification
  • Target: HeartDiseaseorAttack
  • Goal: Predict heart disease risk from health-related indicators

Problem Setup

  • Total samples: 253,680
  • Input features: 21
  • Feature groups:

    Continuous: BMI, MentHlth, PhysHlth

    Ordinal: Diabetes, GenHlth, Age, Education, Income

    Binary: Remaining indicators

EDA → ML Decisions

  • Class imbalance: Used stratified splitting, class-weighted models, and imbalance-aware metrics.

  • Duplicate rows: Kept in the main pipeline; identical responses may represent different respondents in survey data.

  • Mixed feature types: Grouped variables for suitable preprocessing (scaling vs. encoding).

2. Data Split & Preprocessing

Preprocessing was designed based on both feature types and model requirements to ensure a fair comparison.

📊 Data Split

Split        Share   Samples
Train        70%     177,576
Validation   15%     38,052
Test         15%     38,052

Note: Stratified splitting was used to preserve the target class distribution across all subsets.
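
A minimal sketch of how this 70/15/15 stratified split could be produced with scikit-learn (the file path, variable names, and random seed are illustrative assumptions):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Heart Disease Health Indicators dataset (path is an assumption)
df = pd.read_csv("heart_disease_health_indicators.csv")
X = df.drop(columns=["HeartDiseaseorAttack"])
y = df["HeartDiseaseorAttack"]

# First split: 70% train vs. 30% temporary, preserving the ~90/10 class ratio
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

# Second split: divide the remaining 30% equally into validation and test (15% each)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, stratify=y_temp, random_state=42
)
```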

⚙️ Preprocessing Strategy

Linear Models

Continuous and ordinal features were standardized using StandardScaler. Binary features were kept unchanged.

Tree-based & Boosting

Raw feature values are used directly; no scaling is required because tree-based splits are invariant to monotonic feature transformations.

End-to-End Workflow

Raw Data → Stratified Split → Model-Specific Preprocessing → Model Training → Validation Evaluation

💡 Key takeaway: This design respects the unique characteristics of each model family while keeping the evaluation objective and comparable.

3. Models Applied

We compared six models to evaluate different machine learning approaches for tabular classification, ranging from basic baselines to advanced ensemble methods.

  • Dummy Classifier: Baseline model used for reference
  • Logistic Regression (balanced): Linear, interpretable baseline for binary classification
  • Random Forest (balanced): Ensemble tree model for non-linear relationships and feature interactions
  • Extra Trees (balanced): Randomized tree ensemble used for comparison with Random Forest
  • HistGradientBoosting (balanced): Sequential boosting model that corrects previous errors and often performs well on structured tabular data
  • XGBoost (balanced): Advanced gradient boosting model designed for strong predictive performance on tabular datasets
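
A sketch of how this lineup could be assembled; the full hyperparameters are listed in Section 6. Note the assumptions: class_weight on HistGradientBoostingClassifier requires scikit-learn ≥ 1.2, and the scale_pos_weight value approximates the negative/positive ratio for this dataset.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier,
    ExtraTreesClassifier,
    HistGradientBoostingClassifier,
)
from xgboost import XGBClassifier

models = {
    "Dummy": DummyClassifier(strategy="most_frequent"),
    "LogisticRegression_Balanced": LogisticRegression(class_weight="balanced", max_iter=3000),
    "RandomForest_Balanced": RandomForestClassifier(class_weight="balanced_subsample"),
    "ExtraTrees_Balanced": ExtraTreesClassifier(class_weight="balanced"),
    "HistGradientBoosting_Balanced": HistGradientBoostingClassifier(class_weight="balanced"),
    # XGBoost has no class_weight; scale_pos_weight ≈ neg/pos ratio (about 9.6 here, an approximation)
    "XGBoost_Balanced": XGBClassifier(scale_pos_weight=9.6, tree_method="hist"),
}
```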

4. Evaluation Strategy

All models were trained on the 70% training set and rigorously evaluated on the 15% validation set before the final test.

Metrics Used

  • Accuracy: Overall percentage of correct predictions.
  • Balanced Accuracy: Arithmetic mean of class-specific recall scores.
  • Precision: Ratio of correctly predicted positive cases to total predicted positives.
  • Recall: Ratio of correctly predicted positive cases to all actual positives.
  • F1-Score: Harmonic mean of Precision and Recall.
  • ROC-AUC: Model's ability to discriminate between classes.
  • PR-AUC: Best for evaluating minority-class detection in imbalanced data.
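
A minimal sketch of how these metrics can be computed on the validation set with scikit-learn (`model`, `X_val`, and `y_val` are illustrative names):

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    average_precision_score,
)

# Probabilities for the positive class; labels at the default 0.5 threshold
y_prob = model.predict_proba(X_val)[:, 1]
y_pred = (y_prob >= 0.5).astype(int)

metrics = {
    "accuracy": accuracy_score(y_val, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_val, y_pred),
    "precision": precision_score(y_val, y_pred),
    "recall": recall_score(y_val, y_pred),
    "f1": f1_score(y_val, y_pred),
    "roc_auc": roc_auc_score(y_val, y_prob),            # threshold-free ranking metric
    "pr_auc": average_precision_score(y_val, y_prob),   # PR-AUC (average precision)
}
```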

Model Selection Priority

Given the 90/10 class imbalance, we prioritized metrics that penalize false negatives and reward correct minority classification.

  1. PR-AUC (Primary Priority)
  2. F1-score (Secondary)
  3. ROC-AUC (Ranking Power)
  4. Recall (Clinical Safety)

5. Preprocessing Configuration

📈 Linear Preprocessor
  • Used for: Logistic Regression
  • Scaled features: Continuous + Ordinal features
  • Scaler: StandardScaler
  • Binary features: Passthrough (unchanged)
  • Purpose: Make feature scales suitable for linear modeling

🌿 Tree Preprocessor
  • Used for: Dummy, RF, ET, HistGB, XGBoost
  • Feature handling: Passthrough
  • Scaling: Not required
  • Purpose: Preserve raw feature values for tree-based splits
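
A sketch of how these two preprocessors could be defined (column lists follow the feature groups in Section 1; `X_train` is an illustrative DataFrame and the exact implementation may differ):

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

continuous = ["BMI", "MentHlth", "PhysHlth"]
ordinal = ["Diabetes", "GenHlth", "Age", "Education", "Income"]
# Every remaining column is a binary 0/1 indicator
binary = [c for c in X_train.columns if c not in continuous + ordinal]

# Linear preprocessor: scale continuous + ordinal features, pass binary features through
linear_preprocessor = ColumnTransformer(
    transformers=[
        ("scale", StandardScaler(), continuous + ordinal),
        ("keep", "passthrough", binary),
    ]
)

# Tree preprocessor: pass all features through unchanged
tree_preprocessor = "passthrough"
```

Each estimator is then combined with its preprocessor in a Pipeline, so scaling is fitted only on the training data and the comparison stays fair.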

6. Model Parameters

Baseline Reference

🔘 Dummy Classifier
  • Preprocessing: Tree preprocessor
  • Strategy: most_frequent
  • Class balancing: Not applied

Linear Model

📈 Logistic Regression
  • Preprocessing: Linear preprocessor
  • Penalty / C: l2 / 0.5
  • Solver: liblinear
  • Max Iterations: 3000
  • Weight: balanced

Ensemble Model

🌲 Random Forest
  • N Estimators / Depth: 300 / 12
  • Min Samples Split: 10
  • Min Samples Leaf: 10
  • Weight: balanced_subsample

Native Boosting

🚀 HistGradientBoosting
  • Learning Rate / Iterations: 0.05 / 300
  • Max Depth / Leaf: 8 / 20
  • L2 Regularization: 1.0
  • Weight: balanced

Randomized Ensemble

🌴 Extra Trees
  • N Estimators / Depth: 400 / 12
  • Min Samples Split: 10
  • Min Samples Leaf: 10
  • Weight: balanced

Top Performance Model

⚔️ XGBoost (Balanced, PR-AUC Optimized)
  • Estimators: 400
  • Max Depth: 4
  • Learning Rate: 0.05
  • Subsample: 0.8
  • Colsample: 0.8
  • Alpha / Lambda: 0.5 / 2.0
  • Gamma / Weight: 1.0 / pos
  • Tree Method: hist
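
Translated into code, this configuration would look roughly as follows (parameter names follow the xgboost scikit-learn wrapper; setting scale_pos_weight to the negative/positive ratio is our reading of the "pos" weight and is an assumption):

```python
from xgboost import XGBClassifier

# Negative/positive ratio of the training labels, used as the "pos" class weight
pos_ratio = (y_train == 0).sum() / (y_train == 1).sum()

xgb_balanced = XGBClassifier(
    n_estimators=400,
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.5,
    reg_lambda=2.0,
    gamma=1.0,
    scale_pos_weight=pos_ratio,
    tree_method="hist",
    eval_metric="aucpr",  # assumption: monitor PR-AUC during training
)
```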

7. Formulas & Metrics

Evaluation Metrics

Metric         Mathematical Formula       Clinical Meaning
Accuracy       (TP + TN) / Total          Overall correct predictions
Precision      TP / (TP + FP)             Correctness among predicted positives
Recall         TP / (TP + FN)             Ability to detect actual positives
F1-score       2 × (P × R) / (P + R)      Balance between Precision and Recall
Specificity    TN / (TN + FP)             Ability to detect actual negatives

Confusion Matrix Terms

  • TP (True Positive): Correctly predicted heart disease
  • TN (True Negative): Correctly predicted no disease
  • FP (False Positive): Type I error, healthy but predicted ill
  • FN (False Negative): Type II error, actual illness missed
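
To make the formulas concrete, here is a small illustrative helper (not part of the original pipeline) that derives the metrics in the table above from the four confusion-matrix counts:

```python
def metrics_from_confusion(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the report's metrics from raw confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0,
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
    }
```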

8. Individual Model Results

For each model, the validation metrics are reported together with its confusion matrix, ROC curve, and precision-recall curve (figures not included here).

9. Summary Table / Model Comparison

Comparison of validation performance across all tested models at a default threshold of 0.5.

# Model Accuracy Balanced Acc Precision Recall F1 ROC-AUC PR-AUC
0 XGBoost_Balanced 0.7411 0.7740 0.2411 0.8145 0.3721 0.8485 0.3753
1 HistGradientBoosting_Balanced 0.7352 0.7732 0.2376 0.8200 0.3684 0.8482 0.3729
2 LogisticRegression_Balanced 0.7516 0.7721 0.2467 0.7974 0.3768 0.8454 0.3655
3 RandomForest_Balanced 0.7632 0.7674 0.2525 0.7726 0.3806 0.8457 0.3640
4 ExtraTrees_Balanced 0.7428 0.7694 0.2405 0.8022 0.3701 0.8441 0.3634
5 Dummy Classifier 0.9058 0.5000 0.0000 0.0000 0.0000 0.5000 0.0942

📌 Performance Findings

  • XGBoost_Balanced achieved the best overall validation ranking with PR-AUC = 0.3753 and ROC-AUC = 0.8485.

  • HistGradientBoosting_Balanced was a close second and showed the highest Recall = 0.8200.

  • LogisticRegression_Balanced remained a strong linear baseline with competitive ranking performance.

  • Random Forest and Extra Trees performed well, but ranked slightly lower overall in the prioritized metrics.

  • Note: the Dummy Classifier had the highest Accuracy but failed completely on the minority class, missing all heart disease cases.

Final Conclusion

Based on the superior PR-AUC and balanced metrics, XGBoost_Balanced was selected as the best overall model for the final threshold tuning and evaluation phase.

10. Threshold Tuning

Why threshold tuning?

  • The default threshold (0.50) is not always optimal for imbalanced data.
  • Different thresholds create different trade-offs between Precision, Recall, and F1-score.
  • We tuned the threshold to improve minority-class detection and find a better balance.

Interpretation

  • Setting the threshold to 0.69 gave the best Precision–Recall trade-off based on the highest F1-score.
  • This tuned threshold was then used for the final evaluation on the test set.

Result

  • The default threshold 0.50 was not optimal for this imbalanced problem.
  • The best threshold was 0.69, where the model achieved the highest F1-score = 0.3852.
  • At this point, the model maintained Recall = 0.6911.
  • Threshold tuning helped the model make more balanced decisions; a sketch of the sweep is shown below.
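
A minimal sketch of the threshold sweep on the validation predictions (`best_model`, `X_val`, and `y_val` are illustrative names; the actual tuning code may differ):

```python
import numpy as np
from sklearn.metrics import f1_score

# Probabilities for the positive class on the validation set
y_prob_val = best_model.predict_proba(X_val)[:, 1]

# Sweep candidate thresholds and keep the one with the highest F1-score
thresholds = np.arange(0.05, 0.96, 0.01)
f1_scores = [f1_score(y_val, (y_prob_val >= t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1_scores))]  # ≈ 0.69 in our run
```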

11. Final Test Evaluation (XGBoost_Balanced)

Final Results

Test-set metrics, confusion matrix, ROC curve, and precision-recall curve (figures not included here).

📊 Final Evaluation Insights

  • The test results show that XGBoost_Balanced generalizes well on unseen data.
  • ROC-AUC = 0.8511 and PR-AUC = 0.3733 indicate good ranking performance under class imbalance.
  • Recall = 0.6027 means the model can detect a meaningful portion of actual positive cases.
  • Precision = 0.3268 shows that the model still produces many false positives.
  • The confusion matrix suggests that the model favors detecting more positive cases, even at the cost of extra false alarms.
  • Overall, the model achieves a reasonable trade-off between minority-class detection and overall classification quality.
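
For completeness, a sketch of how the tuned threshold could be applied to the held-out test set (`best_model` and `best_threshold` are illustrative names from the sketches above):

```python
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    roc_auc_score,
    average_precision_score,
)

# Apply the tuned threshold (≈ 0.69) to the test-set probabilities
y_prob_test = best_model.predict_proba(X_test)[:, 1]
y_pred_test = (y_prob_test >= best_threshold).astype(int)

print(confusion_matrix(y_test, y_pred_test))
print(classification_report(y_test, y_pred_test, digits=4))
print("ROC-AUC:", roc_auc_score(y_test, y_prob_test))
print("PR-AUC :", average_precision_score(y_test, y_prob_test))
```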

12. Conclusion

Final summary of the machine learning pipeline and its impact on heart disease prediction.

1. Assignment 1 EDA directly guided the design of the tabular ML pipeline, especially in handling class imbalance, keeping duplicate survey responses, and grouping features for tailored preprocessing.

2. A complete end-to-end workflow was built, encompassing feature grouping, model-specific scaling, a six-model comparison, and precision-recall threshold tuning.

3. Threshold tuning proved important: shifting from the default 0.50 to the tuned 0.69 improved the precision-recall balance of the model's predictions.

Best Performing Model: XGBoost_Balanced

Achieved the strongest balance between ranking ability (AUC) and minority-class detection (F1/PR) on the validation set.

Final Test Evaluation

  • ROC-AUC: 0.8511
  • PR-AUC: 0.3733
  • F1-Score: 0.4238

No significant overfitting was observed, and the model generalizes well to unseen data.

Overall, the final pipeline provides a strong and practical baseline for heart disease risk prediction on imbalanced tabular health data.