1. From EDA to Tabular ML Task
This section presents our machine learning workflow for the Heart Disease Health Indicators dataset. Guided by the Assignment 1 EDA, we built a binary classification pipeline to predict heart disease risk, covering feature preprocessing, a six-model comparison, threshold tuning, and a final test evaluation.
Overview
- Dataset: Heart Disease Health Indicators
- Task: Binary classification
- Target: HeartDiseaseorAttack
- Goal: Predict heart disease risk from health-related indicators
Problem Setup
- Total samples: 253,680
- Input features: 21
- Feature groups:
  - Continuous: BMI, MentHlth, PhysHlth
  - Ordinal: Diabetes, GenHlth, Age, Education, Income
  - Binary: remaining indicators
EDA → ML Decisions
- Used stratified splitting, class-weighted models, and imbalance-aware metrics to address the ~90/10 class imbalance.
- Kept duplicate rows in the main pipeline, since identical responses may represent different respondents in survey data.
- Grouped variables by type so each group receives suitable preprocessing (scaling vs. encoding).
2. Data Split & Preprocessing
Preprocessing was designed based on both feature types and model requirements to ensure a fair comparison.
📊 Data Split
Note: Stratified splitting was used to preserve the target class distribution across all subsets.
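A minimal sketch of this stratified split, assuming the raw data is loaded into a pandas DataFrame `df` and the 70% / 15% / 15% train/validation/test proportions referenced in Section 4; `random_state=42` is an illustrative choice:

```python
# Stratified 70/15/15 split (sketch; `df` and random_state are assumptions).
from sklearn.model_selection import train_test_split

X = df.drop(columns=["HeartDiseaseorAttack"])
y = df["HeartDiseaseorAttack"]

# Carve off 70% for training, preserving the ~90/10 class ratio.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)
# Split the remaining 30% in half: 15% validation, 15% test.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42
)
```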
⚙️ Preprocessing Strategy
Linear Models
Continuous and ordinal features were standardized using StandardScaler. Binary features were kept unchanged.
Tree-based & Boosting
Raw feature values were used directly; no scaling is required, as these models are invariant to monotonic feature transformations.
End-to-End Workflow
Key takeaway: this design respects the characteristics of each model family while keeping the comparison fair.
3. Models Applied
We compared six models, ranging from basic baselines to advanced ensemble methods, to evaluate different machine learning approaches for tabular classification; an illustrative configuration sketch follows the table.
| Machine Learning Model | Role & Strategy |
|---|---|
| Dummy Classifier | Baseline model used for reference |
| Logistic Regression (balanced) | Linear, interpretable baseline for binary classification |
| Random Forest (balanced) | Ensemble tree model for non-linear relationships and feature interactions |
| Extra Trees (balanced) | Randomized tree ensemble used for comparison with Random Forest |
| HistGradientBoosting (balanced) | Sequential boosting model that corrects previous errors and often performs well on structured tabular data |
| XGBoost (balanced) | Advanced gradient boosting model designed for strong predictive performance on tabular datasets |
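A sketch of how this six-model lineup can be instantiated, assuming scikit-learn ≥ 1.2 and the xgboost package; the tree counts and `random_state` are illustrative assumptions, and the `_Balanced` variants are expressed through `class_weight` (or `scale_pos_weight` for XGBoost, which lacks a `class_weight` option):

```python
# Illustrative model lineup (hyperparameters are assumptions, not the
# exact values used in this report).
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier, ExtraTreesClassifier, HistGradientBoostingClassifier,
)
from xgboost import XGBClassifier

pos_weight = 0.9058 / 0.0942  # approximate negative/positive ratio (90/10 imbalance)

models = {
    "Dummy": DummyClassifier(strategy="most_frequent"),
    "LogisticRegression_Balanced": LogisticRegression(class_weight="balanced", max_iter=1000),
    "RandomForest_Balanced": RandomForestClassifier(class_weight="balanced", n_estimators=300, random_state=42),
    "ExtraTrees_Balanced": ExtraTreesClassifier(class_weight="balanced", n_estimators=300, random_state=42),
    # class_weight for HistGradientBoosting requires scikit-learn >= 1.2
    "HistGradientBoosting_Balanced": HistGradientBoostingClassifier(class_weight="balanced", random_state=42),
    "XGBoost_Balanced": XGBClassifier(scale_pos_weight=pos_weight, eval_metric="logloss", random_state=42),
}
```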
4. Evaluation Strategy
All models were trained on the 70% training set and rigorously evaluated on the 15% validation set before the final test.
Metrics Used
- Accuracy: overall percentage of correct predictions.
- Balanced Accuracy: arithmetic mean of class-specific recall scores.
- Precision: ratio of correctly predicted positive cases to total predicted positives.
- Recall: ratio of correctly predicted positive cases to all actual positives.
- F1-score: harmonic mean of Precision and Recall.
- ROC-AUC: the model's ability to discriminate between classes.
- PR-AUC: best suited for evaluating minority-class detection in imbalanced data.
Model Selection Priority
Given the 90/10 class imbalance, we prioritized metrics that penalize false negatives and reward correct minority classification.
1. PR-AUC: primary priority
2. F1-score: secondary priority
3. ROC-AUC: ranking power
4. Recall: clinical safety (missed cases are costly)
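A sketch of how these metrics can be computed on the validation split, assuming a fitted classifier `model` and the `X_val` / `y_val` arrays from the split sketch in Section 2; PR-AUC is computed here as average precision:

```python
# Validation metrics at the default 0.50 threshold (sketch).
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, precision_score,
    recall_score, f1_score, roc_auc_score, average_precision_score,
)

y_pred = model.predict(X_val)              # hard labels at the 0.50 default
y_prob = model.predict_proba(X_val)[:, 1]  # positive-class probabilities

metrics = {
    "Accuracy": accuracy_score(y_val, y_pred),
    "Balanced Acc": balanced_accuracy_score(y_val, y_pred),
    "Precision": precision_score(y_val, y_pred),
    "Recall": recall_score(y_val, y_pred),
    "F1": f1_score(y_val, y_pred),
    "ROC-AUC": roc_auc_score(y_val, y_prob),           # ranking power
    "PR-AUC": average_precision_score(y_val, y_prob),  # primary priority
}
```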
5. Preprocessing Configuration
Scaled configuration (linear models)

| Setting | Value |
|---|---|
| Used for | Logistic Regression |
| Scaled features | Continuous + ordinal features |
| Scaler | StandardScaler |
| Binary features | Passthrough (unchanged) |
| Purpose | Make feature scales suitable for linear modeling |

Passthrough configuration (tree-based models)

| Setting | Value |
|---|---|
| Used for | Dummy, RF, ET, HistGB, XGBoost |
| Feature handling | Passthrough |
| Scaling | Not required |
| Purpose | Preserve raw feature values for tree-based splits |
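A sketch of the scaled configuration as scikit-learn objects, using the feature groups from Section 1; pipeline and step names are illustrative:

```python
# Scaled configuration for the linear model; tree-based models receive
# the raw feature matrix directly (no transformer needed).
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

continuous = ["BMI", "MentHlth", "PhysHlth"]
ordinal = ["Diabetes", "GenHlth", "Age", "Education", "Income"]

scaled_preproc = ColumnTransformer(
    transformers=[("scale", StandardScaler(), continuous + ordinal)],
    remainder="passthrough",  # binary indicators pass through unchanged
)

linear_pipe = Pipeline([
    ("preproc", scaled_preproc),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
```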
6. Model Parameters
Each model's role in the comparison:

- Dummy Classifier: baseline reference
- Logistic Regression (balanced): linear model
- Random Forest (balanced): ensemble model
- HistGradientBoosting (balanced): native boosting
- Extra Trees (balanced): randomized ensemble
- XGBoost (balanced): top-performing model
7. Formulas & Metrics
Evaluation Metrics
| Metric | Mathematical Formula | Clinical Meaning |
|---|---|---|
| Accuracy | (TP + TN) / Total | Overall correct predictions |
| Precision | TP / (TP + FP) | Correctness among predicted positives |
| Recall | TP / (TP + FN) | Ability to detect actual positives |
| F1-score | 2 × (P × R) / (P + R) | Balance between Precision and Recall |
| Specificity | TN / (TN + FP) | Ability to detect actual negatives |
Confusion Matrix Terms
- TP (True Positive): heart disease case correctly predicted as positive
- TN (True Negative): healthy case correctly predicted as negative
- FP (False Positive): healthy case incorrectly predicted as positive (false alarm)
- FN (False Negative): heart disease case incorrectly predicted as negative (missed case)
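As a worked check of the F1 formula, plugging in the final test-set values from Section 11 (Precision = 0.3268, Recall = 0.6027):

$$F_1 = \frac{2 \times 0.3268 \times 0.6027}{0.3268 + 0.6027} = \frac{0.3939}{0.9295} \approx 0.424$$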
8. Individual Model Results
For each model, validation results were reported as metric scores, a confusion matrix, an ROC curve, and a Precision-Recall curve (figures not reproduced here).
9. Summary Table / Model Comparison
Comparison of validation performance across all tested models at the default threshold of 0.50.
| # | Model | Accuracy | Balanced Acc | Precision | Recall | F1 | ROC-AUC | PR-AUC |
|---|---|---|---|---|---|---|---|---|
| 0 | XGBoost_Balanced | 0.7411 | 0.7740 | 0.2411 | 0.8145 | 0.3721 | 0.8485 | 0.3753 |
| 1 | HistGradientBoosting_Balanced | 0.7352 | 0.7732 | 0.2376 | 0.8200 | 0.3684 | 0.8482 | 0.3729 |
| 2 | LogisticRegression_Balanced | 0.7516 | 0.7721 | 0.2467 | 0.7974 | 0.3768 | 0.8454 | 0.3655 |
| 3 | RandomForest_Balanced | 0.7632 | 0.7674 | 0.2525 | 0.7726 | 0.3806 | 0.8457 | 0.3640 |
| 4 | ExtraTrees_Balanced | 0.7428 | 0.7694 | 0.2405 | 0.8022 | 0.3701 | 0.8441 | 0.3634 |
| 5 | Dummy Classifier | 0.9058 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.5000 | 0.0942 |
📌 Performance Findings
- XGBoost_Balanced achieved the best overall validation ranking with PR-AUC = 0.3753 and ROC-AUC = 0.8485.
- HistGradientBoosting_Balanced was a close second and showed the highest Recall = 0.8200.
- LogisticRegression_Balanced remained a strong linear baseline with competitive ranking performance.
- Random Forest and Extra Trees performed well but ranked slightly lower on the prioritized metrics.
- ⚠️ The Dummy Classifier had the highest Accuracy (0.9058) but failed completely on the minority class, missing every heart disease case.
Final Conclusion
Based on the superior PR-AUC and balanced metrics, XGBoost_Balanced was selected as the best overall model for the final threshold tuning and evaluation phase.
10. Threshold Tuning
Why threshold tuning?
- The default threshold (0.50) is not always optimal for imbalanced data.
- Different thresholds create different trade-offs between Precision, Recall, and F1-score.
- We tuned the threshold on the validation set to improve minority-class detection and find a better balance, as sketched below.
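A minimal sketch of this sweep, assuming the validation probabilities `y_prob` and labels `y_val` from the earlier sketches; `precision_recall_curve` supplies the candidate thresholds:

```python
# Threshold sweep: pick the probability cutoff that maximizes F1 (sketch).
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_val, y_prob)
f1 = 2 * precision * recall / (precision + recall + 1e-12)  # guard against 0/0

best = np.argmax(f1[:-1])          # last PR point has no matching threshold
best_threshold = thresholds[best]  # reported as 0.69 in this study
print(f"best threshold = {best_threshold:.2f}, F1 = {f1[best]:.4f}")
```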
Interpretation
- ✅ Setting the threshold to 0.69 gave the best Precision–Recall trade-off based on the highest F1-score.
- ✅ This tuned threshold was then used for the final evaluation on the test set.
Result
- The default threshold of 0.50 was not optimal for this imbalanced problem.
- The best threshold was 0.69, where the model achieved its highest F1-score = 0.3852.
- At this threshold, the model maintained Recall = 0.6911.
- Threshold tuning helped the model make more balanced decisions.
11. Final Test Evaluation (XGBoost_Balanced)
Final Results: Test Set Metrics
Figures (not reproduced here): test confusion matrix, final ROC curve (test set), final Precision-Recall curve (test set).
📊 Final Evaluation Insights
- ✓ The test results show that XGBoost_Balanced generalizes well on unseen data.
- ✓ ROC-AUC = 0.8511 and PR-AUC = 0.3733 indicate good ranking performance under class imbalance.
- ✓ Recall = 0.6027 means the model detects roughly 60% of actual positive cases.
- ✓ Precision = 0.3268 shows that the model still produces many false positives.
- ✓ The confusion matrix suggests that the model favors detecting more positive cases, even at the cost of extra false alarms.
- ✓ Overall, the model achieves a reasonable trade-off between minority-class detection and overall classification quality; a sketch of this final evaluation follows.
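A sketch of how the tuned cutoff is applied in this final evaluation, assuming the fitted XGBoost model `model` and the held-out `X_test` / `y_test` arrays from the split sketch in Section 2:

```python
# Final test evaluation at the tuned threshold of 0.69 (sketch).
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_prob_test = model.predict_proba(X_test)[:, 1]
y_pred_test = (y_prob_test >= 0.69).astype(int)  # tuned cutoff instead of 0.50

print(confusion_matrix(y_test, y_pred_test))
print("Precision:", precision_score(y_test, y_pred_test))
print("Recall:", recall_score(y_test, y_pred_test))
```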
12. Conclusion
Final summary of the machine learning pipeline and its impact on heart disease prediction.
Assignment 1 EDA directly guided the design of the tabular ML pipeline, especially in handling class imbalance, keeping duplicate survey responses, and grouping features for tailored preprocessing.
A complete end-to-end workflow was built, encompassing feature grouping, model-specific scaling, a six-model comparison, and precision-recall threshold tuning.
Threshold tuning proved important: shifting from the default 0.50 to the tuned 0.69 raised the validation F1-score from 0.3721 to 0.3852, yielding more balanced decisions for clinical prediction.
Best Performing Model
XGBoost_Balanced achieved the strongest balance between ranking ability (ROC-AUC, PR-AUC) and minority-class detection (F1, Recall) on the validation set.
Final Test Evaluation
No significant overfitting was observed: test ROC-AUC (0.8511) and PR-AUC (0.3733) closely match the validation scores (0.8485 and 0.3753), indicating good generalization on unseen data.
Overall, the final pipeline provides a strong and practical baseline for heart disease risk prediction on imbalanced tabular health data.