Tabular Data EDA - Group APlus

1. Dataset Overview

Total Samples: 253,680
Features: 22 columns (21 predictors plus the target)
Target variable: HeartDiseaseorAttack
Duplicates: 23,899 (~9.4%)

Feature Name	Type	Description
HighBP	Binary	Indicator for high blood pressure (0 = no, 1 = yes)
HighChol	Binary	Indicator for high cholesterol (0 = no, 1 = yes)
CholCheck	Binary	Had cholesterol checked (0 = no, 1 = yes)
BMI	Continuous	Body Mass Index of the individual
Smoker	Binary	Smoking status (0 = no, 1 = yes)
Stroke	Binary	History of stroke (0 = no, 1 = yes)
Diabetes	Ordinal	Diabetes status (0 = no, 1 = borderline, 2 = diabetes)
PhysActivity	Binary	Physical activity (0 = no, 1 = yes)
Fruits	Binary	Consumes fruits regularly (0 = no, 1 = yes)
Veggies	Binary	Consumes vegetables regularly (0 = no, 1 = yes)
HvyAlcoholConsump	Binary	Heavy alcohol consumption (0 = no, 1 = yes)
AnyHealthcare	Binary	Has healthcare coverage (0 = no, 1 = yes)
NoDocbcCost	Binary	Did not visit doctor due to cost (0 = no, 1 = yes)
DiffWalk	Binary	Difficulty walking (0 = no, 1 = yes)
Sex	Binary	Biological sex (0 = female, 1 = male)
GenHlth	Ordinal	Self-rated general health (1 = excellent, 5 = poor)
MentHlth	Continuous	Number of days mental health was not good in past month
PhysHlth	Continuous	Number of days physical health was not good in past month
Age	Ordinal	Age groups encoded (1–13)
Education	Ordinal	Education level encoded (1–6)
Income	Ordinal	Income level encoded (1–8)
HeartDiseaseorAttack	Binary	Target: had heart disease or a heart attack (0 = no heart disease, 1 = heart disease)
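
The report does not state how the duplicate count was obtained. A minimal sketch, assuming a duplicate is any row that exactly repeats an earlier row (the same convention as the default of pandas `DataFrame.duplicated()`); the toy rows are hypothetical and not taken from the dataset:

```python
from collections import Counter

def count_duplicate_rows(rows):
    """Count rows that repeat an earlier row (the first occurrence is not counted)."""
    counts = Counter(tuple(r) for r in rows)
    return sum(c - 1 for c in counts.values())

# Hypothetical toy rows: (HighBP, BMI, Smoker)
toy_rows = [
    (1, 28, 0),
    (0, 24, 1),
    (1, 28, 0),  # repeats the first row
    (1, 28, 0),  # repeats the first row again
    (0, 31, 0),
]
n_dup = count_duplicate_rows(toy_rows)
dup_share = n_dup / len(toy_rows)
print(n_dup, f"{dup_share:.1%}")

# The same arithmetic applied to the figures reported in the overview
reported_share = 23_899 / 253_680
print(f"{reported_share:.2%}")
```

With only binary and coarsely encoded columns, identical rows can come from different respondents, so whether to drop them is a modeling decision rather than a pure cleaning step.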

2. Visualizations

Heart Disease Target Distribution

Statistical Insights

The target classes are strongly imbalanced.
Most respondents are in the No Heart Disease group (90.58%).
Only 9.42% of respondents belong to the Heart Disease group.
This imbalance may cause models to favor the majority class.
Therefore, handling imbalance and choosing proper evaluation metrics are necessary.
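
To make the imbalance concrete, the sketch below uses the reported class shares to show two things: a baseline that always predicts "no heart disease" already reaches 90.58% accuracy while finding no positive cases, and inverse-frequency weights (the same heuristic scikit-learn applies for `class_weight="balanced"`) give the rare class roughly ten times the weight of the majority class. This is one possible way to handle imbalance, not necessarily the one used in this project.

```python
def balanced_class_weights(shares):
    """Weight each class by 1 / (n_classes * share), so rare classes weigh more."""
    k = len(shares)
    return {label: 1.0 / (k * share) for label, share in shares.items()}

# Class shares as reported above (0 = no heart disease, 1 = heart disease)
shares = {0: 0.9058, 1: 0.0942}
weights = balanced_class_weights(shares)

# A baseline that always predicts class 0: high accuracy, zero recall on class 1
baseline_accuracy = shares[0]
baseline_recall_positive = 0.0

print({label: round(w, 3) for label, w in weights.items()})
print(f"baseline accuracy = {baseline_accuracy:.2%}, "
      f"recall on heart disease = {baseline_recall_positive:.0%}")
```

Because accuracy rewards the majority baseline, metrics such as recall, precision, F1, or the area under the precision-recall curve are more informative for this target.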

Univariate analysis

Distributions of continuous and categorical features across the dataset.

1. Continuous Features

Continuous Features Insights

BMI is right-skewed, with most values concentrated in the normal-to-overweight range and a small number of very high values.
MentHlth and PhysHlth are heavily right-skewed and zero-inflated, as most respondents report 0 unhealthy days.
Both variables also show a spike at 30 days, indicating a subgroup with persistent health problems.
Overall, these variables are not normally distributed and may contain outliers, especially in the right tail.
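
The skew and zero-inflation described above can be quantified with two simple statistics: the moment-based skewness and the share of observations equal to a given value. The sketch below is illustrative only; `toy_days` is a hypothetical column shaped like MentHlth/PhysHlth (mostly 0, with a spike at 30), not actual data.

```python
def sample_skewness(values):
    """Moment-based skewness m3 / m2**1.5 (positive means a longer right tail)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def share_equal_to(values, target):
    """Fraction of observations equal to `target`."""
    return sum(1 for v in values if v == target) / len(values)

# Hypothetical "unhealthy days" column: mostly 0, some mid values, a spike at 30
toy_days = [0] * 70 + [2] * 10 + [5] * 8 + [15] * 4 + [30] * 8

skew = sample_skewness(toy_days)
zero_share = share_equal_to(toy_days, 0)
spike_share = share_equal_to(toy_days, 30)
print(f"skewness = {skew:.2f}, zeros = {zero_share:.0%}, at 30 days = {spike_share:.0%}")
```

A large zero share together with positive skewness suggests that transformations or binned encodings (for example 0 days, 1 to 29 days, 30 days) may describe these variables better than the raw counts.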

2. Binary Features

Binary Features Insights

HighBP, HighChol, and Smoker have substantial “Yes” counts, indicating these risk factors are fairly common.
CholCheck and AnyHealthcare are overwhelmingly “Yes”, showing that most respondents had healthcare access and cholesterol screening.
Stroke and HvyAlcoholConsump are strongly imbalanced toward “No”, so positive cases are rare.
PhysActivity, Fruits, and Veggies are mostly “Yes”, suggesting many respondents report healthy behaviors.
DiffWalk remains mostly “No”, but its positive cases are still notable.
Sex is relatively balanced, with slightly more females than males.

3. Ordinal Features

Ordinal Features Insights

Diabetes is dominated by the “No” category, while borderline diabetes is the least common group.
For GenHlth, most respondents report Good or Very Good health, while Poor health is relatively uncommon.
Age is more concentrated in the middle-to-older groups, suggesting strong representation of older adults.
Education and Income are skewed toward higher levels, with lower categories appearing less frequently.

Correlation Heatmap

Correlation Matrix Insights

Most correlations are weak to moderate, indicating there is no serious multicollinearity within the dataset.
GenHlth, Age, DiffWalk, HighBP, and Stroke show the strongest positive correlations with heart disease.
Income, Education, and PhysActivity are negatively correlated with the target variable.
The strongest inter-feature relationships appear between GenHlth–PhysHlth, PhysHlth–DiffWalk, and Education–Income.
Overall, heart disease risk seems to be related to a combination of health, lifestyle, and socioeconomic factors.
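
The report does not say which correlation coefficient the heatmap uses. A minimal sketch, assuming Pearson's r, that computes pairwise correlations and ranks them by absolute value, which is how the strongest inter-feature pairs can be screened; the toy columns are hypothetical:

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Hypothetical toy columns, not actual data
toy = {
    "GenHlth": [1, 2, 2, 3, 3, 4, 5, 5],
    "PhysHlth": [0, 0, 2, 5, 3, 15, 30, 20],
    "Income": [8, 7, 8, 6, 5, 4, 2, 3],
}

# Correlation for every pair of columns, strongest (by absolute value) first
pairs = {(a, b): pearson_r(toy[a], toy[b]) for a, b in combinations(toy, 2)}
ranked = sorted(pairs.items(), key=lambda kv: abs(kv[1]), reverse=True)
for (a, b), r in ranked:
    print(f"{a}-{b}: r = {r:+.2f}")
```

For ordinal features such as GenHlth or Income, a rank-based coefficient (Spearman) is a common alternative, since Pearson's r treats the category codes as equally spaced.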

Feature vs Target Analysis

Bivariate Analysis Insights

Heart disease is markedly more common among respondents with HighBP, HighChol, Stroke, Diabetes, and DiffWalk.
Healthier behaviors such as physical activity and fruit/vegetable consumption are associated with lower heart disease prevalence.
Older age groups and respondents with poorer self-rated general health show the strongest associations with heart disease.
Socioeconomic factors, specifically lower income and lower education, also appear related to higher heart disease risk.
Some variables like CholCheck and AnyHealthcare provide limited visual separation because the vast majority of respondents fall into a single "Yes" category.
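
The comparisons above come down to the heart disease rate within each level of a feature. A minimal sketch of that calculation, using hypothetical toy data rather than the actual dataset:

```python
from collections import defaultdict

def target_rate_by_level(feature, target):
    """Return {level: share of rows with target == 1} for each feature level."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for level, y in zip(feature, target):
        totals[level] += 1
        positives[level] += y
    return {level: positives[level] / totals[level] for level in sorted(totals)}

# Hypothetical toy data: 5 respondents with HighBP = 1, 10 with HighBP = 0
high_bp       = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
heart_disease = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

rates = target_rate_by_level(high_bp, heart_disease)
risk_ratio = rates[1] / rates[0]
print({level: f"{rate:.0%}" for level, rate in rates.items()})
print(f"risk ratio (HighBP = 1 vs 0): {risk_ratio:.1f}")
```

For features where almost everyone falls in one level (CholCheck, AnyHealthcare), the rate in the small group rests on few respondents, which is one reason those plots show little visual separation.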

Outlier Detection (IQR method)

Feature	Outlier Count	Percentage	Severity	Non-outlier Range [Q1 − 1.5·IQR, Q3 + 1.5·IQR]
HighBP	0	0.0%	✅ Low	[-1.50, 2.50]
HighChol	0	0.0%	✅ Low	[-1.50, 2.50]
CholCheck	9,470	3.73%	✅ Low	[1.00, 1.00]
BMI	9,847	3.88%	✅ Low	[13.50, 41.50]
Smoker	0	0.0%	✅ Low	[-1.50, 2.50]
Stroke	10,292	4.06%	✅ Low	[0.00, 0.00]
Diabetes	39,977	15.76%	❌ High	[0.00, 0.00]
PhysActivity	61,760	24.35%	❌ High	[1.00, 1.00]
Fruits	0	0.0%	✅ Low	[-1.50, 2.50]
Veggies	47,839	18.86%	❌ High	[1.00, 1.00]
HvyAlcoholConsump	14,256	5.62%	⚠️ Medium	[0.00, 0.00]
AnyHealthcare	12,417	4.89%	✅ Low	[1.00, 1.00]
NoDocbcCost	21,354	8.42%	⚠️ Medium	[0.00, 0.00]
GenHlth	12,081	4.76%	✅ Low	[0.50, 4.50]
MentHlth	36,208	14.27%	⚠️ Medium	[-3.00, 5.00]
PhysHlth	40,949	16.14%	❌ High	[-4.50, 7.50]
DiffWalk	42,675	16.82%	❌ High	[0.00, 0.00]
Sex	0	0.0%	✅ Low	[-1.50, 2.50]
Age	0	0.0%	✅ Low	[0.00, 16.00]
Education	0	0.0%	✅ Low	[1.00, 9.00]
Income	0	0.0%	✅ Low	[0.50, 12.50]

Outlier Analysis Insights

Outliers are concentrated in only a subset of features.
PhysActivity, Veggies, DiffWalk, PhysHlth, and Diabetes have the highest outlier proportions.
BMI, MentHlth, and PhysHlth show outliers consistent with their skewed distributions.
Several variables such as Age, Education, Income, and Sex have no detected outliers.
For binary/ordinal features, IQR-based outliers may reflect class imbalance rather than true anomalies.
Outlier treatment should focus mainly on continuous variables.
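
The ranges in the table are consistent with the standard 1.5 × IQR fences, for example [-1.50, 2.50] when Q1 = 0 and Q3 = 1, and [13.50, 41.50] when Q1 = 24 and Q3 = 31. The sketch below reproduces the effect noted for binary features: once more than 75% of values fall in one class, Q1 and Q3 coincide, the fences collapse to a single value, and the entire minority class is flagged. The quartile interpolation method is an assumption, and the toy columns are hypothetical.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return (Q1 - k*IQR, Q3 + k*IQR) using linearly interpolated quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def outlier_share(values, k=1.5):
    """Fraction of values falling outside the fences."""
    low, high = tukey_fences(values, k)
    return sum(1 for v in values if v < low or v > high) / len(values)

# Hypothetical binary columns
balanced = [0] * 50 + [1] * 50      # 50% "yes"
imbalanced = [0] * 20 + [1] * 80    # 80% "yes"

print(tukey_fences(balanced), outlier_share(balanced))
print(tukey_fences(imbalanced), outlier_share(imbalanced))
```

In the second case the "outlier" share equals the minority-class share (20%), which matches the pattern in the table where, for instance, the PhysActivity range is [1.00, 1.00] and 24.35% of rows are flagged.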

Feature Importance

Model Predictors Insights

BMI is the most influential feature in the Random Forest model, showing the highest relative importance score.
Age, Income, PhysHlth, GenHlth, MentHlth, and Education are also strong predictors of heart disease risk.
Several medical and lifestyle factors such as Diabetes, Stroke, HighBP, and PhysActivity have moderate importance.
CholCheck, AnyHealthcare, and HvyAlcoholConsump contribute the least to this specific model's performance.
Overall, the model suggests that heart disease risk depends on a mix of health condition, lifestyle, and demographic factors.

* Note: Feature importance is model-specific and reflects contribution within this particular model, not direct causality.

3. Key Insights

✓ The dataset contains 253,680 samples and 22 columns (21 features plus the target), all numeric; many binary features reflect health status and lifestyle.
✓ The target is highly imbalanced: 90.58% without heart disease, 9.42% with heart disease.
✓ Strong predictors include Age, Diabetes, GenHlth, PhysHlth, Income, and BMI. Individuals with heart disease tend to be older, have worse general health, and higher BMI.
✓ Binary features: HighBP, HighChol, Stroke, and DiffWalk are strongly associated with heart disease; PhysActivity is associated with lower prevalence, while Fruits, Veggies, and CholCheck are less discriminative.
✓ Outliers: BMI, MentHlth, and PhysHlth are right-skewed with outliers in the upper tail; Age, Education, and Income show no IQR outliers; for encoded categories such as Diabetes and GenHlth, IQR flags reflect class imbalance rather than true anomalies.
✓ Correlations: Most features show weak-to-moderate correlations; meaningful clusters exist such as GenHlth–PhysHlth–MentHlth and Education–Income.
✓ Feature importance (Random Forest): BMI, Age, Income, PhysHlth, GenHlth, and Education are the most important features, reflecting medical, lifestyle, and socioeconomic factors.
