Heart Disease EDA

Exploratory Data Analysis to identify key health factors affecting cardiovascular risk.

Modality 1: Tabular Data

1. Dataset Overview

Total Samples: 253,680
Features: 21
Target variable: 1
Duplicates: 23,899 (~9.4%)

| Feature Name | Type | Description |
| --- | --- | --- |
| HighBP | Binary | Indicator for high blood pressure (0 = no, 1 = yes) |
| HighChol | Binary | Indicator for high cholesterol (0 = no, 1 = yes) |
| CholCheck | Binary | Had cholesterol checked (0 = no, 1 = yes) |
| BMI | Continuous | Body Mass Index of the individual |
| Smoker | Binary | Smoking status (0 = no, 1 = yes) |
| Stroke | Binary | History of stroke (0 = no, 1 = yes) |
| Diabetes | Ordinal | Diabetes status (0 = no, 1 = borderline, 2 = diabetes) |
| PhysActivity | Binary | Physical activity (0 = no, 1 = yes) |
| Fruits | Binary | Consumes fruits regularly (0 = no, 1 = yes) |
| Veggies | Binary | Consumes vegetables regularly (0 = no, 1 = yes) |
| HvyAlcoholConsump | Binary | Heavy alcohol consumption (0 = no, 1 = yes) |
| AnyHealthcare | Binary | Has healthcare coverage (0 = no, 1 = yes) |
| NoDocbcCost | Binary | Did not visit doctor due to cost (0 = no, 1 = yes) |
| DiffWalk | Binary | Difficulty walking (0 = no, 1 = yes) |
| Sex | Binary | Biological sex (0 = female, 1 = male) |
| GenHlth | Ordinal | Self-rated general health (1 = excellent, 5 = poor) |
| MentHlth | Continuous | Number of days mental health was not good in past month |
| PhysHlth | Continuous | Number of days physical health was not good in past month |
| Age | Ordinal | Age groups encoded (1–13) |
| Education | Ordinal | Education level encoded (1–6) |
| Income | Ordinal | Income level encoded (1–8) |
| HeartDiseaseorAttack | Binary | Target: had heart disease or a heart attack (0 = no, 1 = yes) |
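
The duplicate figure in the overview can be reproduced with a simple count. The sketch below is a minimal illustration in plain Python, assuming a duplicate is any row that exactly repeats an earlier row (the first occurrence is not counted), which is also the default behaviour of pandas `DataFrame.duplicated`. The source does not state which definition was used, and the toy rows are invented for illustration.

```python
from collections import Counter


def count_duplicates(rows):
    """Count rows that exactly repeat an earlier row.

    Assumption: the first occurrence of each row is kept and only the
    repeats are counted as duplicates.
    """
    counts = Counter(rows)
    return sum(c - 1 for c in counts.values())


# Toy rows (hypothetical values): (HighBP, HighChol, BMI)
toy_rows = [
    (1, 0, 27.0),
    (1, 0, 27.0),   # repeat of row 0
    (0, 1, 31.5),
    (1, 0, 27.0),   # repeat of row 0
    (0, 0, 22.0),
]
toy_duplicates = count_duplicates(toy_rows)

# Share of duplicates using the totals reported in the overview.
duplicate_share_pct = 100 * 23_899 / 253_680
```

With the reported totals, `duplicate_share_pct` comes out at roughly 9.42, matching the "~9.4%" shown above.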

2. Visualizations

Heart Disease Target Distribution

Statistical Insights

  • The target classes are strongly imbalanced.
  • Most respondents are in the No Heart Disease group (90.58%).
  • Only 9.42% of respondents belong to the Heart Disease group.
  • This imbalance may cause models to favor the majority class.
  • Therefore, handling imbalance and choosing proper evaluation metrics are necessary.
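
To make the metric concern concrete, the sketch below scores a trivial classifier that always predicts "No Heart Disease" on a sample with the same 90.58% / 9.42% split. It is an illustration only: the 10,000-row sample and the helper names are assumptions, not part of the original analysis.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Compute accuracy, recall, specificity, and balanced accuracy."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    balanced_accuracy = (recall + specificity) / 2
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical sample of 10,000 respondents with the reported class split.
y_true = [1] * 942 + [0] * 9058
# Majority-class baseline: always predict "No Heart Disease".
y_pred = [0] * len(y_true)

baseline = summarize(y_true, y_pred)
```

The baseline reaches 90.58% accuracy while detecting no heart disease cases at all (recall 0, balanced accuracy 0.5), which is why accuracy alone is a poor yardstick for this target.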

Univariate analysis

Distributions of continuous and categorical features across the dataset.

1. Continuous Features

Continuous Features Insights
  • BMI is right-skewed, with most values concentrated in the normal-to-overweight range and a small number of very high values.
  • MentHlth and PhysHlth are heavily right-skewed and zero-inflated, as most respondents report 0 unhealthy days.
  • Both variables also show a spike at 30 days, indicating a subgroup with persistent health problems.
  • Overall, these variables are not normally distributed and may contain outliers, especially in the right tail.
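
One way to quantify the zero inflation, the spike at 30 days, and the right skew described above is sketched below. The day counts are invented toy values, not the dataset itself, and the skewness formula is the plain moment-based (population) version, chosen here as an assumption.

```python
from collections import Counter


def value_shares(values, targets=(0, 30)):
    """Fraction of observations equal to each target value."""
    counts = Counter(values)
    n = len(values)
    return {t: counts[t] / n for t in targets}


def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (positive means a right tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Toy "unhealthy days" column: mostly 0, a spread of values, a spike at 30.
toy_days = [0] * 70 + [1, 2, 3, 5, 7, 10, 14, 15, 20, 25] * 2 + [30] * 10

shares = value_shares(toy_days)   # share of 0-day and 30-day responses
skew = skewness(toy_days)         # positive for a right-skewed column
```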

2. Binary Features

Binary Features Insights
  • HighBP, HighChol, and Smoker have substantial “Yes” counts, indicating these risk factors are fairly common.
  • CholCheck and AnyHealthcare are overwhelmingly “Yes”, showing that most respondents had healthcare access and cholesterol screening.
  • Stroke and HvyAlcoholConsump are strongly imbalanced toward “No”, so positive cases are rare.
  • PhysActivity, Fruits, and Veggies are mostly “Yes”, suggesting many respondents report healthy behaviors.
  • DiffWalk remains mostly “No”, but its positive cases are still notable.
  • Sex is relatively balanced, with slightly more females than males.

3. Ordinal Features

Ordinal Features Insights
  • Diabetes is dominated by the “No” category, while borderline diabetes is the least common group.
  • For GenHlth, most respondents report Good or Very Good health, while Poor health is relatively uncommon.
  • Age is more concentrated in the middle-to-older groups, suggesting strong representation of older adults.
  • Education and Income are skewed toward higher levels, with lower categories appearing less frequently.

Correlation Heatmap

Correlation Matrix Insights
  • Most correlations are weak to moderate, indicating there is no serious multicollinearity within the dataset.
  • GenHlth, Age, DiffWalk, HighBP, and Stroke show the strongest positive correlations with heart disease.
  • Income, Education, and PhysActivity are negatively correlated with the target variable.
  • The strongest inter-feature relationships appear between GenHlth–PhysHlth, PhysHlth–DiffWalk, and Education–Income.
  • Overall, heart disease risk seems to be related to a combination of health, lifestyle, and socioeconomic factors.
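
The heatmap values are pairwise correlation coefficients. The sketch below assumes Pearson correlation (the source does not name the coefficient) and uses a handful of invented respondents to show how features can be ranked by their correlation with the target.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


# Toy data for ten hypothetical respondents (not taken from the dataset).
heart_disease = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
features = {
    "GenHlth": [1, 2, 2, 3, 3, 4, 4, 5, 5, 5],
    "PhysActivity": [1, 1, 1, 1, 0, 1, 0, 1, 0, 0],
}

# Correlation of each feature with the target, strongest positive first.
target_corr = {name: pearson(col, heart_disease) for name, col in features.items()}
ranked = sorted(target_corr, key=target_corr.get, reverse=True)
```

For a binary target coded 0/1, this coefficient is the point-biserial correlation, so its magnitude depends on class balance as well as on the strength of the relationship.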

Feature vs Target Analysis

Bivariate Analysis Insights
  • Heart disease is noticeably more common among respondents with HighBP, HighChol, Stroke, Diabetes, and DiffWalk.
  • Healthier behaviors such as physical activity and fruit/vegetable consumption are clearly associated with lower heart disease prevalence.
  • Older age groups and individuals with poorer self-rated general health show the strongest associations with heart disease.
  • Socioeconomic factors, specifically lower income and lower education, also appear related to higher heart disease risk.
  • Some variables like CholCheck and AnyHealthcare provide limited visual separation because the vast majority of respondents fall into a single "Yes" category.
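
The comparisons above amount to computing heart disease prevalence within each level of a feature. A minimal sketch follows, using invented toy values and a hypothetical helper name:

```python
def prevalence_by_group(feature, target):
    """Share of target == 1 within each distinct value of the feature."""
    totals, positives = {}, {}
    for group, label in zip(feature, target):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + label
    return {group: positives[group] / totals[group] for group in sorted(totals)}


# Toy data for ten hypothetical respondents (not taken from the dataset).
high_bp = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
heart_disease = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

rates = prevalence_by_group(high_bp, heart_disease)
# rates[1] is the prevalence among HighBP = 1, rates[0] among HighBP = 0.
```

When a feature is almost entirely one category, as with CholCheck and AnyHealthcare, the rate for the small group rests on few respondents, so it is worth reporting group sizes next to the prevalence.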

Outlier Detection (IQR method)

| Feature | Outlier Count | Percentage | Severity | Valid Range (IQR) |
| --- | --- | --- | --- | --- |
| HighBP | 0 | 0.0% | ✅ Low | [-1.50, 2.50] |
| HighChol | 0 | 0.0% | ✅ Low | [-1.50, 2.50] |
| CholCheck | 9,470 | 3.73% | ✅ Low | [1.00, 1.00] |
| BMI | 9,847 | 3.88% | ✅ Low | [13.50, 41.50] |
| Smoker | 0 | 0.0% | ✅ Low | [-1.50, 2.50] |
| Stroke | 10,292 | 4.06% | ✅ Low | [0.00, 0.00] |
| Diabetes | 39,977 | 15.76% | ❌ High | [0.00, 0.00] |
| PhysActivity | 61,760 | 24.35% | ❌ High | [1.00, 1.00] |
| Fruits | 0 | 0.0% | ✅ Low | [-1.50, 2.50] |
| Veggies | 47,839 | 18.86% | ❌ High | [1.00, 1.00] |
| HvyAlcoholConsump | 14,256 | 5.62% | ⚠️ Medium | [0.00, 0.00] |
| AnyHealthcare | 12,417 | 4.89% | ✅ Low | [1.00, 1.00] |
| NoDocbcCost | 21,354 | 8.42% | ⚠️ Medium | [0.00, 0.00] |
| GenHlth | 12,081 | 4.76% | ✅ Low | [0.50, 4.50] |
| MentHlth | 36,208 | 14.27% | ⚠️ Medium | [-3.00, 5.00] |
| PhysHlth | 40,949 | 16.14% | ❌ High | [-4.50, 7.50] |
| DiffWalk | 42,675 | 16.82% | ❌ High | [0.00, 0.00] |
| Sex | 0 | 0.0% | ✅ Low | [-1.50, 2.50] |
| Age | 0 | 0.0% | ✅ Low | [0.00, 16.00] |
| Education | 0 | 0.0% | ✅ Low | [1.00, 9.00] |
| Income | 0 | 0.0% | ✅ Low | [0.50, 12.50] |

Outlier Analysis Insights
  • Outliers are concentrated in only a subset of features.
  • PhysActivity, Veggies, DiffWalk, PhysHlth, and Diabetes have the highest outlier proportions.
  • BMI, MentHlth, and PhysHlth show outliers consistent with their skewed distributions.
  • Several variables such as Age, Education, Income, and Sex have no detected outliers.
  • For binary/ordinal features, IQR-based outliers may reflect class imbalance rather than true anomalies.
  • Outlier treatment should focus mainly on continuous variables.
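
The valid ranges in the table follow the usual IQR rule, [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. The sketch below shows the arithmetic and why the rule behaves oddly on binary columns. The BMI quartiles of 24 and 31 are inferred from the reported range rather than stated in the source, and the quantile interpolation method is an assumption.

```python
from statistics import quantiles


def iqr_bounds(q1, q3, k=1.5):
    """Lower and upper fences of the IQR rule for given quartiles."""
    iqr = q3 - q1
    return (q1 - k * iqr, q3 + k * iqr)


def count_iqr_outliers(values, k=1.5):
    """Count values falling outside the IQR fences.

    Assumption: quartiles use linear interpolation over the observed data
    (statistics.quantiles with method="inclusive").
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    low, high = iqr_bounds(q1, q3, k)
    return sum(1 for v in values if v < low or v > high)


# Quartiles inferred from the reported BMI range [13.50, 41.50].
bmi_bounds = iqr_bounds(24, 31)

# A binary feature with Q1 = 0 and Q3 = 1 can never produce outliers.
balanced_binary_bounds = iqr_bounds(0, 1)

# A binary feature where 1 is rare has Q1 = Q3 = 0, so every 1 is flagged.
rare_binary = [0] * 8 + [1] * 2
rare_outliers = count_iqr_outliers(rare_binary)
```

This is the mechanism behind rows such as Stroke or DiffWalk: the range collapses to [0.00, 0.00], and the "outlier" count is simply the number of respondents in the minority class.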

Feature Importance

Model Predictors Insights
  • BMI is the most influential feature in the model, showing the highest relative importance score.
  • Age, Income, PhysHlth, GenHlth, MentHlth, and Education are also strong predictors of heart disease risk.
  • Several medical and lifestyle factors such as Diabetes, Stroke, HighBP, and PhysActivity have moderate importance.
  • CholCheck, AnyHealthcare, and HvyAlcoholConsump contribute the least to this specific model's performance.
  • Overall, the model suggests that heart disease risk depends on a mix of health condition, lifestyle, and demographic factors.
* Note: Feature importance is model-specific and reflects contribution within this particular model, not direct causality.
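
The source does not say which model or importance measure produced this ranking. As one model-agnostic way to obtain such scores, the sketch below uses permutation importance: shuffle one column, re-score the model, and record the drop in accuracy. The toy model, data, and function names are all hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, column, repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value in the chosen column.
        permuted = [
            r[:column] + (v,) + r[column + 1:] for r, v in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / repeats


# Toy setup: the label depends only on column 0; column 1 is irrelevant.
rows = [(i % 2, i % 3) for i in range(60)]
labels = [r[0] for r in rows]


def toy_model(row):
    """Hypothetical stand-in for a trained model: predicts from column 0."""
    return row[0]


importance_used = permutation_importance(toy_model, rows, labels, column=0)
importance_unused = permutation_importance(toy_model, rows, labels, column=1)
```

A column the model never uses scores exactly zero, while a column it relies on shows a clear drop. Different measures (for example, impurity-based scores from tree models) can rank the same features differently, which is one reason the ranking above should be read as specific to its model.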

3. Key Insights