1. Dataset Overview
| Feature Name | Type | Description |
|---|---|---|
| HighBP | Binary | Indicator for high blood pressure (0 = no, 1 = yes) |
| HighChol | Binary | Indicator for high cholesterol (0 = no, 1 = yes) |
| CholCheck | Binary | Had cholesterol checked (0 = no, 1 = yes) |
| BMI | Continuous | Body Mass Index of the individual |
| Smoker | Binary | Smoking status (0 = no, 1 = yes) |
| Stroke | Binary | History of stroke (0 = no, 1 = yes) |
| Diabetes | Ordinal | Diabetes status (0 = no, 1 = borderline, 2 = diabetes) |
| PhysActivity | Binary | Physical activity (0 = no, 1 = yes) |
| Fruits | Binary | Consumes fruits regularly (0 = no, 1 = yes) |
| Veggies | Binary | Consumes vegetables regularly (0 = no, 1 = yes) |
| HvyAlcoholConsump | Binary | Heavy alcohol consumption (0 = no, 1 = yes) |
| AnyHealthcare | Binary | Has healthcare coverage (0 = no, 1 = yes) |
| NoDocbcCost | Binary | Did not visit doctor due to cost (0 = no, 1 = yes) |
| DiffWalk | Binary | Difficulty walking (0 = no, 1 = yes) |
| Sex | Binary | Biological sex (0 = female, 1 = male) |
| GenHlth | Ordinal | Self-rated general health (1 = excellent, 5 = poor) |
| MentHlth | Continuous | Number of days mental health was not good in past month |
| PhysHlth | Continuous | Number of days physical health was not good in past month |
| Age | Ordinal | Age groups encoded (1–13) |
| Education | Ordinal | Education level encoded (1–6) |
| Income | Ordinal | Income level (ordinal encoding) |
| HeartDiseaseorAttack | Binary | Target: had heart disease or a heart attack (0 = no heart disease, 1 = heart disease) |
2. Visualizations
Heart Disease Target Distribution
Statistical Insights
- The target classes are strongly imbalanced.
- Most respondents are in the No Heart Disease group (90.58%).
- Only 9.42% of respondents belong to the Heart Disease group.
- This imbalance may cause models to favor the majority class.
- Therefore, handling imbalance and choosing proper evaluation metrics are necessary.
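To make the effect of this imbalance concrete, here is a minimal sketch using hypothetical toy labels (not the real data) with roughly the same 9:1 ratio. It shows that a classifier which always predicts the majority class reaches about 90% accuracy while detecting no heart disease cases. The function name `class_distribution` and the toy counts are assumptions for illustration.

```python
from collections import Counter


def class_distribution(labels):
    """Return {class: percentage of samples} for a sequence of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: 100.0 * n / total for cls, n in counts.items()}


# Hypothetical toy labels with roughly a 9:1 imbalance (not the real dataset).
labels = [0] * 906 + [1] * 94
dist = class_distribution(labels)

# Baseline "model" that always predicts the majority class.
majority = max(dist, key=dist.get)
predictions = [majority] * len(labels)

# Accuracy looks high, but recall on the positive class is zero.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
recall_pos = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / labels.count(1)

print(dist, accuracy, recall_pos)
```

This is why recall, precision, or F1 on the positive class tend to be more informative than accuracy for this target.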
Univariate analysis
Distributions of continuous and categorical features across the dataset.
1. Continuous Features
Continuous Features Insights
- BMI is right-skewed, with most values concentrated in the normal-to-overweight range and a small number of very high values.
- MentHlth and PhysHlth are heavily right-skewed and zero-inflated, as most respondents report 0 unhealthy days.
- Both variables also show a spike at 30 days, indicating a subgroup with persistent health problems.
- Overall, these variables are not normally distributed and may contain outliers, especially in the right tail.
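The skew and zero-inflation described above can be quantified with two simple statistics: the sample skewness and the share of values equal to 0 or 30. The sketch below uses hypothetical toy day counts shaped like MentHlth/PhysHlth, not the real columns; the helper names `skewness` and `share_equal` are assumptions.

```python
def skewness(values):
    """Fisher-Pearson skewness g1 = m3 / m2**1.5 (positive means right-skewed)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def share_equal(values, target):
    """Fraction of values that are exactly equal to `target`."""
    return sum(v == target for v in values) / len(values)


# Hypothetical toy data: mostly 0 unhealthy days, with a small spike at 30.
days = [0] * 70 + [2] * 10 + [5] * 8 + [15] * 4 + [30] * 8

print("skewness:", skewness(days))
print("share of zeros:", share_equal(days, 0))
print("share at 30 days:", share_equal(days, 30))
```

A clearly positive skewness together with a large share of zeros is the pattern reported for MentHlth and PhysHlth.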
2. Binary Features
Binary Features Insights
- HighBP, HighChol, and Smoker have substantial “Yes” counts, indicating these risk factors are fairly common.
- CholCheck and AnyHealthcare are overwhelmingly “Yes”, showing that most respondents had healthcare access and cholesterol screening.
- Stroke and HvyAlcoholConsump are strongly imbalanced toward “No”, so positive cases are rare.
- PhysActivity, Fruits, and Veggies are mostly “Yes”, suggesting many respondents report healthy behaviors.
- DiffWalk remains mostly “No”, but its positive cases are still notable.
- Sex is relatively balanced, with slightly more females than males.
3. Ordinal Features
Ordinal Features Insights
- Diabetes is dominated by the “No” category, while borderline diabetes is the least common group.
- For GenHlth, most respondents report Good or Very Good health, while Poor health is relatively uncommon.
- Age is more concentrated in the middle-to-older groups, suggesting strong representation of older adults.
- Education and Income are skewed toward higher levels, with lower categories appearing less frequently.
Correlation Heatmap
Correlation Matrix Insights
- Most pairwise correlations are weak to moderate, which suggests there is no serious multicollinearity within the dataset.
- GenHlth, Age, DiffWalk, HighBP, and Stroke show the strongest positive correlations with heart disease.
- Income, Education, and PhysActivity are negatively correlated with the target variable.
- The strongest inter-feature relationships appear between GenHlth–PhysHlth, PhysHlth–DiffWalk, and Education–Income.
- Overall, heart disease risk seems to be related to a combination of health, lifestyle, and socioeconomic factors.
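As an illustration of how features can be ranked by their correlation with the target, the sketch below computes Pearson correlations on a small hypothetical sample. The values are invented, and only the column names are borrowed from the dataset. The sign of each coefficient gives the direction of the association, and sorting by absolute value gives the ranking.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy columns (not the real data).
features = {
    "GenHlth": [1, 2, 2, 3, 4, 5, 3, 2],
    "PhysActivity": [1, 1, 1, 0, 0, 0, 1, 1],
}
target = [0, 0, 0, 0, 1, 1, 1, 0]

correlations = {name: pearson(col, target) for name, col in features.items()}

# Rank features by the strength (absolute value) of their correlation.
ranked = sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, r in ranked:
    print(f"{name}: {r:+.3f}")
```

Pearson correlation on binary and ordinal codes is only a rough guide, so the ranking is best read alongside the feature-versus-target plots.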
Feature vs Target Analysis
Bivariate Analysis Insights
- Heart disease is noticeably more common among respondents with HighBP, HighChol, Stroke, Diabetes, and DiffWalk.
- Healthier behaviors such as physical activity and fruit/vegetable consumption are clearly associated with lower heart disease prevalence.
- Older age groups and individuals with poorer self-rated general health show the strongest associations with heart disease.
- Socioeconomic factors, specifically lower income and lower education, also appear related to higher heart disease risk.
- Some variables like CholCheck and AnyHealthcare provide limited visual separation because the vast majority of respondents fall into a single "Yes" category.
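The comparisons above amount to computing the heart disease rate within each category of a feature. A minimal sketch of that calculation follows, using invented toy values and a hypothetical helper named `prevalence_by_group`.

```python
from collections import defaultdict


def prevalence_by_group(groups, target):
    """Return {group value: share of rows with target == 1} for each group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for g, t in zip(groups, target):
        totals[g] += 1
        positives[g] += t
    return {g: positives[g] / totals[g] for g in sorted(totals)}


# Hypothetical toy data: HighBP flag and heart disease target (not real values).
high_bp = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
heart_disease = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]

rates = prevalence_by_group(high_bp, heart_disease)
print(rates)
```

When nearly all rows share one category, as with CholCheck and AnyHealthcare, the rate for the rare category rests on few samples and the overall separation stays small.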
Outlier Detection (IQR method)
| Feature | Outlier Count | Percentage | Severity | Valid Range (IQR) |
|---|---|---|---|---|
| HighBP | 0 | 0.0% | ✅ Low | [-1.50, 2.50] |
| HighChol | 0 | 0.0% | ✅ Low | [-1.50, 2.50] |
| CholCheck | 9,470 | 3.73% | ✅ Low | [1.00, 1.00] |
| BMI | 9,847 | 3.88% | ✅ Low | [13.50, 41.50] |
| Smoker | 0 | 0.0% | ✅ Low | [-1.50, 2.50] |
| Stroke | 10,292 | 4.06% | ✅ Low | [0.00, 0.00] |
| Diabetes | 39,977 | 15.76% | ❌ High | [0.00, 0.00] |
| PhysActivity | 61,760 | 24.35% | ❌ High | [1.00, 1.00] |
| Fruits | 0 | 0.0% | ✅ Low | [-1.50, 2.50] |
| Veggies | 47,839 | 18.86% | ❌ High | [1.00, 1.00] |
| HvyAlcoholConsump | 14,256 | 5.62% | ⚠️ Medium | [0.00, 0.00] |
| AnyHealthcare | 12,417 | 4.89% | ✅ Low | [1.00, 1.00] |
| NoDocbcCost | 21,354 | 8.42% | ⚠️ Medium | [0.00, 0.00] |
| GenHlth | 12,081 | 4.76% | ✅ Low | [0.50, 4.50] |
| MentHlth | 36,208 | 14.27% | ⚠️ Medium | [-3.00, 5.00] |
| PhysHlth | 40,949 | 16.14% | ❌ High | [-4.50, 7.50] |
| DiffWalk | 42,675 | 16.82% | ❌ High | [0.00, 0.00] |
| Sex | 0 | 0.0% | ✅ Low | [-1.50, 2.50] |
| Age | 0 | 0.0% | ✅ Low | [0.00, 16.00] |
| Education | 0 | 0.0% | ✅ Low | [1.00, 9.00] |
| Income | 0 | 0.0% | ✅ Low | [0.50, 12.50] |
Outlier Analysis Insights
- Outliers are concentrated in only a subset of features.
- PhysActivity, Veggies, DiffWalk, PhysHlth, and Diabetes have the highest outlier proportions.
- BMI, MentHlth, and PhysHlth show outliers consistent with their skewed distributions.
- Several variables such as Age, Education, Income, and Sex have no detected outliers.
- For binary/ordinal features, IQR-based outliers may reflect class imbalance rather than true anomalies.
- Outlier treatment should focus mainly on continuous variables.
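The ranges in the table follow from the usual IQR rule, `[Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]`. The sketch below applies it to two hypothetical binary columns to show why a rare flag yields the range [0.00, 0.00], so every positive case is counted as an outlier, while a balanced flag yields [-1.50, 2.50] with no outliers. The severity cut-offs (below 5% Low, 5% to 15% Medium, 15% and above High) are an assumption inferred from the table, not a documented rule, and the quartiles use the inclusive linear interpolation that NumPy and pandas apply by default.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) bounds of the IQR rule with multiplier k."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outlier_summary(values, k=1.5):
    """Count values outside the IQR bounds and label the severity."""
    low, high = iqr_bounds(values, k)
    count = sum(v < low or v > high for v in values)
    pct = 100.0 * count / len(values)
    # Assumed thresholds, inferred from the table above.
    severity = "Low" if pct < 5 else "Medium" if pct < 15 else "High"
    return {"bounds": (low, high), "count": count, "pct": pct, "severity": severity}


# Hypothetical toy columns (not the real data).
rare_flag = [0] * 96 + [1] * 4        # rare positives, similar in shape to Stroke
balanced_flag = [0] * 50 + [1] * 50   # roughly balanced, similar in shape to Sex

print(outlier_summary(rare_flag))
print(outlier_summary(balanced_flag))
```

For the rare flag, Q1 and Q3 are both 0, so the IQR collapses to zero and the "outlier" count simply equals the number of positive cases.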
Feature Importance
Model Predictors Insights
- BMI is the most influential feature in the model, showing the highest relative importance score.
- Age, Income, PhysHlth, GenHlth, MentHlth, and Education are also strong predictors of heart disease risk.
- Several medical and lifestyle factors such as Diabetes, Stroke, HighBP, and PhysActivity have moderate importance.
- CholCheck, AnyHealthcare, and HvyAlcoholConsump contribute the least to this specific model's performance.
- Overall, the model suggests that heart disease risk depends on a mix of health condition, lifestyle, and demographic factors.
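If these scores are impurity-based (the usual default in Random Forest implementations), they can favor features with many distinct values, such as BMI. One model-agnostic cross-check is permutation importance: shuffle one feature at a time and measure how much accuracy drops. The sketch below is a plain-Python illustration of that technique, not the method used to produce the ranking above. Here `predict` is a hypothetical stand-in for a fitted model's prediction function, and the toy rows are invented.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which predict(row) equals the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(predict, rows, labels, n_features, n_repeats=5, seed=0):
    """Mean accuracy drop after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical toy data: feature 0 determines the label, feature 1 is noise.
rows = [(i % 2, (i // 2) % 2) for i in range(20)]
labels = [r[0] for r in rows]


def predict(row):
    """Stand-in for a fitted model: uses only feature 0."""
    return row[0]


importances = permutation_importance(predict, rows, labels, n_features=2)
print(importances)
```

A feature the model never relies on gets an importance of zero, while shuffling an informative feature lowers accuracy. In practice the drop would be measured on held-out data.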
3. Key Insights
- ✓ The dataset contains 253,680 samples with 22 columns (21 features plus the target HeartDiseaseorAttack), all numeric; many binary features reflect health status and lifestyle.
- ✓ The target is highly imbalanced: 90.58% without heart disease, 9.42% with heart disease.
- ✓ Strong predictors include Age, Diabetes, GenHlth, PhysHlth, Income, and BMI. Individuals with heart disease tend to be older, have worse general health, and have a higher BMI.
- ✓ Binary features: HighBP, HighChol, Stroke, and DiffWalk are strongly associated with heart disease; PhysActivity appears protective, while features like Fruits, Veggies, and CholCheck are less discriminative.
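- ✓ Outliers: flagged values in BMI, MentHlth, and PhysHlth reflect their right-skewed distributions; Age, Education, and Income show no outliers; Diabetes and GenHlth are encoded categories, so their flagged values likely reflect category imbalance rather than true anomalies.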
- ✓ Correlations: Most features show weak-to-moderate correlations; meaningful clusters exist, such as GenHlth–PhysHlth–MentHlth and Education–Income.
- ✓ Feature importance (Random Forest): BMI, Age, Income, PhysHlth, and GenHlth/Education are the most important features, reflecting medical, lifestyle, and socioeconomic factors.