1. From EDA to Tabular ML Task
This section presents our machine learning workflow for the Heart Disease Health Indicators dataset. Guided by the Assignment 1 EDA, we built a binary classification pipeline to predict heart disease risk, covering feature preprocessing, a six-model comparison, threshold tuning, and a final test evaluation.
Overview
- Dataset: Heart Disease Health Indicators
- Task: Binary classification
- Target: HeartDiseaseorAttack
- Goal: Predict heart disease risk from health-related indicators
Problem Setup
- Total samples: 253,680
- Input features: 21
- Feature groups:
  - Continuous: BMI, MentHlth, PhysHlth
  - Ordinal: Diabetes, GenHlth, Age, Education, Income
  - Binary: remaining indicators
EDA → ML Decisions
- Used stratified splitting, class-weighted models, and imbalance-aware metrics to address the ~90/10 class imbalance.
- Kept duplicate rows in the main pipeline, since identical responses may represent different respondents in survey data.
- Grouped variables by type so each group receives suitable preprocessing (scaling vs. encoding).
2. Data Split & Preprocessing
Preprocessing was designed based on both feature types and model requirements to ensure a fair comparison.
📊 Data Split
Note: Stratified splitting was used to preserve the target class distribution across all subsets.
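A minimal sketch of this stratified split, assuming the raw data is loaded into a pandas DataFrame `df` and the 70% / 15% / 15% train/validation/test proportions referenced in Section 4; `random_state=42` is an illustrative choice:

```python
# Stratified 70/15/15 split (sketch; `df` and random_state are assumptions).
from sklearn.model_selection import train_test_split

X = df.drop(columns=["HeartDiseaseorAttack"])
y = df["HeartDiseaseorAttack"]

# Carve off 70% for training, preserving the ~90/10 class ratio.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
# Split the remaining 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)
```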
⚙️ Preprocessing Strategy
Linear Models
Continuous and ordinal features were standardized using StandardScaler. Binary features were kept unchanged.
Tree-based & Boosting
Raw feature values were used directly; no scaling is required, as these models are invariant to monotonic feature transformations.
End-to-End Workflow
Key takeaway: this design respects the characteristics of each model family while keeping the comparison fair.
3. Models Applied
We compared six models, ranging from basic baselines to advanced ensemble methods, to evaluate different machine learning approaches for tabular classification; an illustrative configuration sketch follows the table.
| Machine Learning Model | Role & Strategy |
|---|---|
| Dummy Classifier | Baseline model used for reference |
| Logistic Regression (balanced) | Linear, interpretable baseline for binary classification |
| Random Forest (balanced) | Ensemble tree model for non-linear relationships and feature interactions |
| Extra Trees (balanced) | Randomized tree ensemble used for comparison with Random Forest |
| HistGradientBoosting (balanced) | Sequential boosting model that corrects previous errors and often performs well on structured tabular data |
| XGBoost (balanced) | Advanced gradient boosting model designed for strong predictive performance on tabular datasets |
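A sketch of how this six-model lineup can be instantiated, assuming scikit-learn ≥ 1.2 and the xgboost package; the tree counts and `random_state` are illustrative assumptions, and the `_Balanced` variants are expressed through `class_weight` (or `scale_pos_weight` for XGBoost, which lacks a `class_weight` option):

```python
# Illustrative model lineup (hyperparameters are assumptions, not the
# exact values used in this report).
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier, ExtraTreesClassifier, HistGradientBoostingClassifier,
)
from xgboost import XGBClassifier

pos_weight = 0.9058 / 0.0942  # approximate negative/positive ratio (90/10 imbalance)

models = {
    "Dummy": DummyClassifier(strategy="most_frequent"),
    "LogisticRegression_Balanced": LogisticRegression(class_weight="balanced", max_iter=1000),
    "RandomForest_Balanced": RandomForestClassifier(class_weight="balanced", n_estimators=300, random_state=42),
    "ExtraTrees_Balanced": ExtraTreesClassifier(class_weight="balanced", n_estimators=300, random_state=42),
    # class_weight for HistGradientBoosting requires scikit-learn >= 1.2
    "HistGradientBoosting_Balanced": HistGradientBoostingClassifier(class_weight="balanced", random_state=42),
    "XGBoost_Balanced": XGBClassifier(scale_pos_weight=pos_weight, eval_metric="logloss", random_state=42),
}
```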
4. Evaluation Strategy
All models were trained on the 70% training set and rigorously evaluated on the 15% validation set before the final test.
Metrics Used
- Accuracy: overall percentage of correct predictions.
- Balanced Accuracy: arithmetic mean of class-specific recall scores.
- Precision: ratio of correctly predicted positive cases to total predicted positives.
- Recall: ratio of correctly predicted positive cases to all actual positives.
- F1-score: harmonic mean of Precision and Recall.
- ROC-AUC: the model's ability to discriminate between classes.
- PR-AUC: best suited for evaluating minority-class detection in imbalanced data.
Model Selection Priority
Given the 90/10 class imbalance, we prioritized metrics that penalize false negatives and reward correct minority classification.
1. PR-AUC: primary priority
2. F1-score: secondary priority
3. ROC-AUC: ranking power
4. Recall: clinical safety (missed cases are costly)
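A sketch of how these metrics can be computed on the validation split, assuming a fitted classifier `model` and the `X_val` / `y_val` arrays from the split sketch in Section 2; PR-AUC is computed here as average precision:

```python
# Validation metrics at the default 0.50 threshold (sketch).
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, precision_score,
    recall_score, f1_score, roc_auc_score, average_precision_score,
)

y_pred = model.predict(X_val)              # hard labels at the 0.50 default
y_prob = model.predict_proba(X_val)[:, 1]  # positive-class probabilities

metrics = {
    "Accuracy": accuracy_score(y_val, y_pred),
    "Balanced Acc": balanced_accuracy_score(y_val, y_pred),
    "Precision": precision_score(y_val, y_pred),
    "Recall": recall_score(y_val, y_pred),
    "F1": f1_score(y_val, y_pred),
    "ROC-AUC": roc_auc_score(y_val, y_prob),           # ranking power
    "PR-AUC": average_precision_score(y_val, y_prob),  # primary priority
}
```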
5. Preprocessing Configuration
Scaled configuration (linear models)

| Setting | Value |
|---|---|
| Used for | Logistic Regression |
| Scaled features | Continuous + ordinal features |
| Scaler | StandardScaler |
| Binary features | Passthrough (unchanged) |
| Purpose | Make feature scales suitable for linear modeling |

Passthrough configuration (tree-based models)

| Setting | Value |
|---|---|
| Used for | Dummy, RF, ET, HistGB, XGBoost |
| Feature handling | Passthrough |
| Scaling | Not required |
| Purpose | Preserve raw feature values for tree-based splits |
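A sketch of the scaled configuration as scikit-learn objects, using the feature groups from Section 1; pipeline and step names are illustrative:

```python
# Scaled configuration for the linear model; tree-based models receive
# the raw feature matrix directly (no transformer needed).
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

continuous = ["BMI", "MentHlth", "PhysHlth"]
ordinal = ["Diabetes", "GenHlth", "Age", "Education", "Income"]

scaled_preproc = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous + ordinal)],
    remainder="passthrough",  # binary indicators pass through unchanged
)

linear_pipe = Pipeline([
    ("preproc", scaled_preproc),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
```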
6. Model Parameters
Each model's role in the comparison:

- Dummy Classifier: baseline reference
- Logistic Regression (balanced): linear model
- Random Forest (balanced): ensemble model
- HistGradientBoosting (balanced): native boosting
- Extra Trees (balanced): randomized ensemble
- XGBoost (balanced): top-performing model
7. Formulas & Metrics
Evaluation Metrics
| Metric | Mathematical Formula | Clinical Meaning |
|---|---|---|
| Accuracy | (TP + TN) / Total | Overall correct predictions |
| Precision | TP / (TP + FP) | Correctness among predicted positives |
| Recall | TP / (TP + FN) | Ability to detect actual positives |
| F1-score | 2 × (P × R) / (P + R) | Balance between Precision and Recall |
| Specificity | TN / (TN + FP) | Ability to detect actual negatives |
Confusion Matrix Terms
- TP (True Positive): heart disease case correctly predicted as positive
- TN (True Negative): healthy case correctly predicted as negative
- FP (False Positive): healthy case incorrectly predicted as positive (false alarm)
- FN (False Negative): heart disease case incorrectly predicted as negative (missed case)
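As a worked check of the F1 formula, plugging in the final test-set values from Section 11 (Precision = 0.3268, Recall = 0.6027):

$$F_1 = \frac{2 \times 0.3268 \times 0.6027}{0.3268 + 0.6027} = \frac{0.3939}{0.9295} \approx 0.424$$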
8. Individual Model Results
For each model, validation results were reported as metric scores, a confusion matrix, an ROC curve, and a Precision-Recall curve (figures not reproduced here).
9. Summary Table / Model Comparison
Comparison of validation performance across all tested models at the default threshold of 0.50.
| # | Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | XGBoost_Balanced | 0.7411 | 0.7740 | 0.2411 | 0.8145 | 0.3721 | 0.8485 | 0.3753 |
| 1 | HistGradientBoosting_Balanced | 0.7352 | 0.7732 | 0.2376 | 0.8200 | 0.3684 | 0.8482 | 0.3729 |
| 2 | LogisticRegression_Balanced | 0.7516 | 0.7721 | 0.2467 | 0.7974 | 0.3768 | 0.8454 | 0.3655 |
| 3 | RandomForest_Balanced | 0.7632 | 0.7674 | 0.2525 | 0.7726 | 0.3806 | 0.8457 | 0.3640 |
| 4 | ExtraTrees_Balanced | 0.7428 | 0.7694 | 0.2405 | 0.8022 | 0.3701 | 0.8441 | 0.3634 |
| 5 | Dummy Classifier | 0.9058 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 0.0942 |
📌 Performance Findings
- XGBoost_Balanced achieved the best overall validation ranking with PR-AUC = 0.3753 and ROC-AUC = 0.8485.
- HistGradientBoosting_Balanced was a close second and showed the highest Recall = 0.8200.
- LogisticRegression_Balanced remained a strong linear baseline with competitive ranking performance.
- Random Forest and Extra Trees performed well but ranked slightly lower on the prioritized metrics.
- ⚠️ The Dummy Classifier had the highest Accuracy (0.9058) but failed completely on the minority class, missing every heart disease case.
Final Conclusion
Based on the superior PR-AUC and balanced metrics, XGBoost_Balanced was selected as the best overall model for the final threshold tuning and evaluation phase.
10. Threshold Tuning
Why threshold tuning?
- The default threshold (0.50) is not always optimal for imbalanced data.
- Different thresholds create different trade-offs between Precision, Recall, and F1-score.
- We tuned the threshold on the validation set to improve minority-class detection and find a better balance, as sketched below.
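A minimal sketch of this sweep, assuming the validation probabilities `y_prob` and labels `y_val` from the earlier sketches; `precision_recall_curve` supplies the candidate thresholds:

```python
# Threshold sweep: pick the probability cutoff that maximizes F1 (sketch).
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_val, y_prob)
f1 = 2 * precision * recall / (precision + recall + 1e-12)  # guard against 0/0

best = np.argmax(f1[:-1])          # last PR point has no matching threshold
best_threshold = thresholds[best]  # reported as 0.69 in this study
print(f"best threshold = {best_threshold:.2f}, F1 = {f1[best]:.4f}")
```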
Interpretation
- ✅ Setting the threshold to 0.69 gave the best Precision–Recall trade-off based on the highest F1-score.
- ✅ This tuned threshold was then used for the final evaluation on the test set.
Result
- The default threshold of 0.50 was not optimal for this imbalanced problem.
- The best threshold was 0.69, where the model achieved its highest F1-score = 0.3852.
- At this threshold, the model maintained Recall = 0.6911.
- Threshold tuning helped the model make more balanced decisions.
11. Final Test Evaluation (XGBoost_Balanced)
Final Results: Test Set Metrics
Figures (not reproduced here): test confusion matrix, final ROC curve (test set), final Precision-Recall curve (test set).
📊 Final Evaluation Insights
- ✓ The test results show that XGBoost_Balanced generalizes well on unseen data.
- ✓ ROC-AUC = 0.8511 and PR-AUC = 0.3733 indicate good ranking performance under class imbalance.
- ✓ Recall = 0.6027 means the model detects roughly 60% of actual positive cases.
- ✓ Precision = 0.3268 shows that the model still produces many false positives.
- ✓ The confusion matrix suggests that the model favors detecting more positive cases, even at the cost of extra false alarms.
- ✓ Overall, the model achieves a reasonable trade-off between minority-class detection and overall classification quality; a sketch of this final evaluation follows.
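A sketch of how the tuned cutoff is applied in this final evaluation, assuming the fitted XGBoost model `model` and the held-out `X_test` / `y_test` arrays from the split sketch in Section 2:

```python
# Final test evaluation at the tuned threshold of 0.69 (sketch).
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_prob_test = model.predict_proba(X_test)[:, 1]
y_pred_test = (y_prob_test >= 0.69).astype(int)  # tuned cutoff instead of 0.50

print(confusion_matrix(y_test, y_pred_test))
print("Precision:", precision_score(y_test, y_pred_test))
print("Recall:", recall_score(y_test, y_pred_test))
```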
12. Conclusion
Final summary of the machine learning pipeline and its impact on heart disease prediction.
Assignment 1 EDA directly guided the design of the tabular ML pipeline, especially in handling class imbalance, keeping duplicate survey responses, and grouping features for tailored preprocessing.
A complete end-to-end workflow was built, encompassing feature grouping, model-specific scaling, a six-model comparison, and precision-recall threshold tuning.
Threshold tuning proved important: shifting from the default 0.50 to the tuned 0.69 raised the validation F1-score from 0.3721 to 0.3852, yielding more balanced decisions for clinical prediction.
Best Performing Model
XGBoost_Balanced achieved the strongest balance between ranking ability (ROC-AUC, PR-AUC) and minority-class detection (F1, Recall) on the validation set.
Final Test Evaluation
No significant overfitting was observed: test ROC-AUC (0.8511) and PR-AUC (0.3733) closely match the validation scores (0.8485 and 0.3753), indicating good generalization on unseen data.
Overall, the final pipeline provides a strong and practical baseline for heart disease risk prediction on imbalanced tabular health data.