The overarching purpose of our research project is to study the relationship between various risk factors and cardiovascular disease. We believe this is an important undertaking because understanding how lifestyle relates to health is an ongoing need. With science-backed findings, we want to help individuals make better decisions, communities promote healthier public health practices, and governments find the right balance when allocating resources between public healthcare and research and development.
For decades, cardiovascular disease has been the leading killer of Americans. In recent years, however, advances in biomedical research have improved emergency response systems and treatment, and public health programs have strengthened prevention efforts (Source: CVD: A costly burden for America). Even so, cardiovascular disease remains the leading cause of death, a major cause of disability, and a major contributor to productivity loss among Americans. In fact, an estimated 71.3 million Americans, about one in three, have one or more types of heart disease. This burden translates not only into loss of life but also into reduced quality of life and healthcare spending that could otherwise go toward other social programs. Prevention is usually much cheaper to invest in than treatment, so understanding risk factors has a large potential impact on reducing this burden (Source: An overview of CVD burden in the U.S.).
Data description: We will be working with a dataset on cardiovascular disease available on Kaggle, which consists of 70,000 patient records. We were particularly interested in exploring a research question in the healthcare domain: understanding the relationship between a set of risk factors and a given disease. The main outcome we are concerned with is the presence or absence of cardiovascular disease. Our search led us to this dataset, which had all the relevant information we could use to predict cardiovascular disease.
There are 3 types of input features:
Feature | Variable Type | Variable | Value Type |
---|---|---|---|
Age | Objective Feature | age | int (days) |
Height | Objective Feature | height | int (cm) |
Weight | Objective Feature | weight | float (kg) |
Gender | Objective Feature | gender | categorical code |
Systolic blood pressure | Examination Feature | ap_hi | int (mmHg) |
Diastolic blood pressure | Examination Feature | ap_lo | int (mmHg) |
Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
Smoking | Subjective Feature | smoke | binary |
Alcohol intake | Subjective Feature | alco | binary |
Physical activity | Subjective Feature | active | binary |
Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
import pandas as pd
import glob
import os
import matplotlib.pyplot as plt
import numpy as np
import math
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")
# styling
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
# read the data from csv file
df = pd.read_csv('data/cardio_train.csv', sep=';', index_col = "id")
# check if the dataframe has any missing values
df.isnull().values.any()
# convert the 'age' column from days to years
df['age_in_years'] = (df['age'] / 365).round().astype('int')
# use describe() to display statistics such as min, max, mean, and std for each feature
df.describe()
From the table above, we noticed some inconsistencies in the data. For example, the minimum height is 55 cm (about 1.8 feet) and the minimum weight is 10 kg, which initially made us think the dataset might include children. However, the minimum age is 29 years, so those data points were most likely recorded incorrectly.
To check this assumption, we inspected these potential outliers visually, looking at height, weight, and systolic and diastolic blood pressure.
# create a scatter plot that tells us which dots could be considered as outliers
height = df.plot.scatter(x='age_in_years', y='height', c='DarkBlue')
From the scatter plot above, we can see some points that fall outside the main cluster and appear to be outliers.
# create a histogram that displays the distribution of individuals' heights
df.height.hist()
From the histogram above, we see that the distribution is not normal.
To remove outliers, we decided to drop any rows with values above the 97.5th percentile or below the 2.5th percentile.
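As an aside, the same trimming rule could be wrapped in a small helper. The sketch below is illustrative only (the helper name is ours, not part of the original analysis); the notebook simply applies the rule column by column as shown next.
# illustrative helper (not used below): trim a column to its 2.5%-97.5% range
def drop_quantile_outliers(frame, column, lower=0.025, upper=0.975):
    lo, hi = frame[column].quantile(lower), frame[column].quantile(upper)
    return frame[(frame[column] >= lo) & (frame[column] <= hi)]
# example usage, equivalent to the per-column drops that follow:
# df = drop_quantile_outliers(df, 'height')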
# remove height outliers
df.drop(df[(df['height'] > df['height'].quantile(0.975)) | (df['height'] < df['height'].quantile(0.025))].index,inplace=True)
# create a scatter plot that shows the correlation between age and height after removing outliers
height_after_scat = df.plot.scatter(x='age_in_years', y='height', c='DarkBlue', title="Age vs Height")
plt.xlabel("Age")
plt.ylabel("Height (cm)")
plt.show()
# create a histogram that shows the distribution of height after removing outliers
height_after_hist = df.height.hist()
plt.title("Height distribution")
plt.xlabel("Height (cm)")
plt.ylabel("Count")
plt.show()
After removing the outliers, we can see that the scatter plot is more evenly spread and the histogram is more evenly distributed.
# create a scatter plot that tells us which dots could be considered as outliers
weight = df.plot.scatter(x='age_in_years', y='weight', c='DarkBlue')
plt.title("Age vs Weight")
plt.xlabel("Age")
plt.ylabel("Weight (kg)")
plt.show()
# create a histogram that displays the distribution of individuals' weights
df.weight.hist()
plt.title("Weight distribution")
plt.xlabel("Weight (kg)")
plt.ylabel("Count")
plt.show()
# remove outliers by dropping weight values below the 2.5th or above the 97.5th percentile
df.drop(df[(df['weight'] > df['weight'].quantile(0.975)) | (df['weight'] < df['weight'].quantile(0.025))].index,inplace=True)
# create a scatter plot that shows the correlation between age and weight after getting rid of outliers
weight2 = df.plot.scatter(x='age_in_years', y='weight', c='DarkBlue')
plt.title("Age vs Weight")
plt.xlabel("Age")
plt.ylabel("Weight (kg)")
plt.show()
# create a histogram that shows the distribution of weight after getting rid of outliers
df.weight.hist()
plt.title("Weight distribution")
plt.xlabel("Weight (kg)")
plt.ylabel("Count")
plt.show()
# create a scatter plot that tells us which dots could be considered as outliers for systolic blood pressure
blood_hi = df.plot.scatter(x='age_in_years', y='ap_hi', c='DarkBlue')
plt.title("Age vs High Blood Pressure")
plt.xlabel("Age")
plt.ylabel("High Blood Pressure (mmHg)")
plt.show()
# create a histogram that displays the distribution of each individual's systolic blood pressure
df.ap_hi.hist()
plt.title("High Blood Pressure Distribution")
plt.xlabel("High Blood Pressure (mmHg)")
plt.ylabel("Count")
plt.show()
# first remove rows where 'ap_lo' is higher than 'ap_hi', since a diastolic reading above the systolic reading is physiologically impossible
# then drop systolic blood pressure values below the 2.5th or above the 97.5th percentile
indexNames = df[df['ap_hi'] < df['ap_lo']].index
df.drop(indexNames, inplace=True)
df.drop(df[(df['ap_hi'] > df['ap_hi'].quantile(0.975)) | (df['ap_hi'] < df['ap_hi'].quantile(0.025))].index,inplace=True)
# Scatter after removing high blood pressure outliers
blood_hi = df.plot.scatter(x='age_in_years', y='ap_hi', c='DarkBlue')
plt.title("Age vs High Blood Pressure")
plt.xlabel("Age")
plt.ylabel("High Blood Pressure (mmHg)")
plt.show()
# Histogram after removing high blood pressure outliers
df.ap_hi.hist()
plt.title("High Blood Pressure Distribution")
plt.xlabel("High Blood Pressure (mmHg)")
plt.ylabel("Count")
plt.show()
# create a scatter plot that tells us which dots could be considered as outliers for diastolic blood pressure
blood_lo = df.plot.scatter(x='age_in_years', y='ap_lo', c='DarkBlue')
plt.title("Age vs Low Blood Pressure")
plt.xlabel("Age")
plt.ylabel("Low Blood Pressure (mmHg)")
plt.show()
# create a histogram that displays the distribution of each individual's diastolic blood pressure
df.ap_lo.hist()
plt.title("Low Blood Pressure Distribution")
plt.xlabel("Low Blood Pressure (mmHg)")
plt.ylabel("Count")
plt.show()
# remove outliers by dropping diastolic blood pressure values below the 2.5th or above the 97.5th percentile
df.drop(df[(df['ap_lo'] > df['ap_lo'].quantile(0.975)) | (df['ap_lo'] < df['ap_lo'].quantile(0.025))].index,inplace=True)
# Scatter after removing low blood pressure outliers
blood_lo = df.plot.scatter(x='age_in_years', y='ap_lo', c='DarkBlue')
plt.title("Age vs Low Blood Pressure")
plt.xlabel("Age")
plt.ylabel("Low Blood Pressure (mmHg)")
plt.show()
# Histogram after removing low blood pressure outliers
df.ap_lo.hist()
plt.title("Low Blood Pressure Distribution")
plt.xlabel("Low Blood Pressure (mmHg)")
plt.ylabel("Count")
plt.show()
In order to perform modeling on the data, we first have to make the dataset uniform so that it is in a form we can work with.
We convert the continuous variables (BMI and blood pressure) into ordinal ones, because we want to reduce the predictor types to just two: binary and ordinal.
# create a column called BMI
df['bmi'] = df['weight'] / ((df['height'] / 100) ** 2)
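As a quick check of the formula, a hypothetical person weighing 70 kg at 170 cm has a BMI of 70 / 1.70² ≈ 24.2, which falls in the "Healthy" band (18.5-24.9) used in the scale below.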
We then created a new ordinal BMI feature:
# create a new column called bmi_scale with a 1, 2, 3 scale indicating
# "Underweight", "Healthy", and "Overweight" respectively
conditions = [(df['bmi'] < 18.5), ((df['bmi'] >= 18.5) & (df['bmi'] <= 24.9)), (df['bmi'] > 24.9)]
choices = [1,2,3]
df['bmi_scale'] = np.select(conditions, choices, default=np.nan).astype('int')
# create a column called blood_pressure that categorizes systolic and diastolic blood pressure together
conditions = [\
((df['ap_hi'] <= 129) & (df['ap_lo'] < 80)), \
(((df['ap_hi'] >= 130) & (df['ap_hi'] < 140)) | ((df['ap_lo'] >= 80) & (df['ap_lo'] <= 90))), \
((df['ap_hi'] >= 140) | (df['ap_lo'] > 90))]
choices = [1, 2, 3] #["Normal", "High Blood Pressure", "Hypertensive crisis"]
df['blood_pressure'] = np.select(conditions, choices).astype('int')
df.groupby('blood_pressure').count()
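As a worked example of this coding, a hypothetical reading of 135/85 mmHg fails the first condition, matches the second, and is therefore coded 2 ("High Blood Pressure").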
# create a column called gender_new where we recode gender values 1-2 as 0-1
conditions = [(df['gender'] == 1), (df['gender'] == 2)]
choices = [0, 1]
df['gender_new'] = np.select(conditions, choices)
df = df.drop('gender', axis=1)
# create a plot that shows the distribution of people who do and do not have cardiovascular disease
df['cardio'].value_counts()
sns.countplot(x="cardio", data=df, palette="hls")
plt.show()
# make a function that creates a stacked bar chart of a given feature versus cardiovascular disease
def viz_stack(df, factor, title, xlabel, ylabel, xlabels, rot):
    # cross-tabulate the factor against the outcome and plot row-normalized proportions
    table = pd.crosstab(df[factor], df.cardio)
    table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(range(len(xlabels)), xlabels, rotation=rot)
# create a stacked bar of BMI scale versus proportion of Cardiovascular disease
bmi_xlabels=["Underweight", "Healthy", "Overweight"]
viz_stack(df, "bmi_scale", "Stacked Bar Chart of BMI vs Cardiovascular disease", "BMI Scale", "Proportion of Cardiovascular disease", bmi_xlabels, 45)
From the graph above, we can see that overweight people have the highest percentage of cardiovascular disease and underweight people have the lowest.
# create a stacked bar of Cholesterol versus proportion of Cardiovascular disease
cholesterol_xlabels = ["Normal", "Above Normal", "Way above Normal"]
viz_stack(df, "cholesterol", "Stacked Bar Chart of Cholesterol vs Cardiovascular disease", "Cholesterol", "Proportion of Cardiovascular disease", cholesterol_xlabels, 45)
From the graph above, we can see that people with cholesterol levels way above normal have the highest percentage of cardiovascular disease, and people with normal cholesterol levels have the lowest.
# create a stacked bar of Glucose versus proportion of Cardiovascular disease
gluc_xlabels = ["Normal", "Above Normal", "Way above Normal"]
viz_stack(df, "gluc", "Stacked Bar Chart of blood pressure vs Cardiovascular disease", "GLUCOSE", "Proportion of Cardiovascular disease", gluc_xlabels, 45)
From the above graph we can see that the people with glucose level of way above normal have highest percentage of cardiovascular disease, and people with glucose level of normal have lowest percentage of cardio vascular disease.
# creates a stacked bar of Cholesterol versus proportion of Cardiovascular disease
blood_xlabels = ["Normal", "High Blood Pressure", "Hypertensive crisis"]
viz_stack(df, "blood_pressure", "Stacked Bar Chart of blood pressure vs Cardiovascular disease", "BLOOD PRESSURE", "Proportion of Cardiovascular disease", blood_xlabels, 45)
From the above graph we can see that people with hypertensive crisis have highest cardiovascular disease percentage, and people with normal blood pressure have lowest cardiovascular disease percentage.
# make a function that plots the count of each level for a list of categorical columns
def viz_cat(df, cat, xlabel, title, labels):
    df_categorical = df.loc[:, cat]
    g = sns.countplot(x="variable", hue="value", data=pd.melt(df_categorical), palette='hls')
    g.set_xticklabels(cat, rotation=30)
    g.set_xlabel(xlabel)
    plt.legend(title=title, labels=labels, loc='upper right')
    plt.show()
labels=['normal', 'above normal', 'way above normal']
cat = ['cholesterol','gluc']
viz_cat(df, cat, 'risk factor', 'level', labels )
cat = ['smoke', 'alco', 'active']
labels=['no', 'yes']
viz_cat(df, cat, 'risk factor', 'level', labels )
cat = ['bmi_scale']
labels=['underweight', 'healthy', 'overweight']
viz_cat(df, cat, 'risk factor', 'level', labels )
cat = ['blood_pressure']
labels=['normal', 'high blood pressure', 'hypertensive crisis']
viz_cat(df, cat, 'risk factor', 'level', labels )
This is the distribution of our data across the different factors: cholesterol level, glucose level, smoking status, alcohol consumption, physical activity, BMI category, and blood pressure category.
From here, we can see that most individuals in the dataset have normal cholesterol and glucose levels. Most of them also do not smoke, do not drink, and lead an active lifestyle. Looking at the BMI distribution, the largest category is overweight, followed by healthy and underweight. Looking at the blood pressure distribution, the largest category is high blood pressure, followed by normal and hypertensive crisis.
Our dataset contains two different types of predictors: ordinal/categorical and binary.
Because these types are different, we decided to treat them separately when applying our logistic regression models.
# Logistic regression with categorical data
formula_cate = "cardio ~ bmi_scale + cholesterol + gluc + blood_pressure"
logistic_model_cate = smf.glm(formula=formula_cate, data=df, family=sm.families.Binomial()).fit()
logistic_model_cate.summary()
The estimated coefficients are in log odds. The odds of an event are the probability of that event divided by its complement: $$\frac{P}{1-P}$$
By exponentiating the coefficients, we can calculate the odds, which are easier to interpret.
# Odds ratio of ordinal data
np.exp(logistic_model_cate.params)
We can then do some interpretation with the data above.
Coefficient:
Standard Error:
P-value:
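The entries above can be read directly off the fitted statsmodels result. As an illustrative sketch (using cholesterol as the example predictor; not part of the original write-up), one way to pull a coefficient together with its standard error, p-value, and odds ratio is:
# illustrative sketch: read one predictor's estimates from the fitted GLM result
coef = logistic_model_cate.params['cholesterol']   # log-odds coefficient
se = logistic_model_cate.bse['cholesterol']        # standard error
pval = logistic_model_cate.pvalues['cholesterol']  # p-value
odds_ratio = np.exp(coef)                          # multiplicative change in odds
ci_low, ci_high = np.exp(logistic_model_cate.conf_int().loc['cholesterol'])
print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), p = {pval:.3g}")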
# Logistic regression with binary data
formula_bi = "cardio ~ gender_new + smoke + alco + active"
logistic_model_bi = smf.glm(formula=formula_bi, data=df, family=sm.families.Binomial()).fit()
logistic_model_bi.summary()
We apply the same interpretation logic as for the categorical model.
# Odds ratio of binary data
np.exp(logistic_model_bi.params)
Coefficient:
Standard Error:
P-value:
# For splitting data
from sklearn.model_selection import train_test_split
# For scaler/normalization
from sklearn.preprocessing import MinMaxScaler
# Preprocessor
from sklearn.preprocessing import PolynomialFeatures
# Feature selections
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import VarianceThreshold
# for making pipelines
from sklearn.pipeline import make_pipeline
# for grid search
from sklearn.model_selection import GridSearchCV
# split the data into training and testing sets (70% train / 30% test)
train_features, test_features, train_outcome, test_outcome = train_test_split(
    df.drop(["cardio"], axis=1),
    df['cardio'],
    test_size=0.30,
    random_state=11
)
# make a function to run multiple models through the same preprocessing pipeline
def run_model(model, param_grid, xtrain, ytrain, xtest, ytest, do_poly=False):
    # Create a scaler
    scaler = MinMaxScaler()
    # Create a percentile-based feature selector and a variance threshold
    selecter = SelectPercentile()
    threshold = VarianceThreshold(.1)
    # Optionally add a polynomial feature transformation at the start of the pipeline
    if do_poly:
        poly = PolynomialFeatures()
        pipe = make_pipeline(poly, threshold, selecter, scaler, model)
    else:
        pipe = make_pipeline(threshold, selecter, scaler, model)
    grid = GridSearchCV(pipe, param_grid)
    grid.fit(xtrain, ytrain)
    accuracy = grid.score(xtest, ytest)
    return grid, accuracy
We ran our training data through several machine learning models. For each one, we calculated an accuracy score, which reflects the fraction of predictions the model got right.
We used classifier models because our prediction outcome is categorical rather than continuous, and our data are labeled: each row is either a yes (cardiovascular disease present) or a no (cardiovascular disease absent).
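For reference, the accuracy reported by a classifier pipeline's .score() is the same quantity that sklearn.metrics.accuracy_score computes; a tiny illustrative example (not from our dataset):
# tiny illustration of accuracy: 3 of these 4 predictions match, so the score is 0.75
from sklearn.metrics import accuracy_score
accuracy_score([1, 0, 1, 1], [1, 0, 0, 1])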
We started with a simple regression baseline (scikit-learn's LinearRegression), because it is easy to implement and very basic. We weren't expecting it to score well, but we wanted a benchmark to compare our other models against.
# Model 1: baseline regression
# note: scikit-learn's LinearRegression (a regressor, not a classifier) is fitted here,
# so grid.score() below returns an R^2 value rather than a classification accuracy
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
param_grid_linear = {'polynomialfeatures__degree':range(1, 3),
'selectpercentile__percentile':range(10, 30, 5)}
grid_dtc, score = run_model(lr, param_grid_linear, train_features, train_outcome, test_features, test_outcome, do_poly = True)
score
Running this baseline gives a score of 23%. Because LinearRegression is a regressor rather than a classifier, this value is an R² score, not the fraction of correct predictions, so it serves only as a rough lower benchmark for the classifiers below.
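For completeness, a true logistic-regression classifier could be run through the same pipeline helper; the sketch below is only illustrative, and its score is not part of the results reported here.
# illustrative sketch (not part of the reported results): logistic regression classifier
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=1000)
param_grid_logreg = {'polynomialfeatures__degree': range(1, 3),
                     'selectpercentile__percentile': range(10, 30, 5)}
grid_logreg, logreg_score = run_model(logreg, param_grid_logreg, train_features,
                                      train_outcome, test_features, test_outcome, do_poly=True)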
We chose the k-Nearest Neighbors classifier because it is particularly useful for nonlinear data and makes no assumptions about the underlying distribution of the data.
# Model 2: k-Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier()
param_grid_knc = {'kneighborsclassifier__n_neighbors': np.arange(1,10),
'kneighborsclassifier__weights':["uniform", "distance"]}
grid_knc_a, score = run_model(knc, param_grid_knc, train_features, train_outcome, test_features, test_outcome, do_poly = False)
score
After running the K-Nearest Neighbors model, we get an accuracy of 69%. This means that when we ran our test data on the model, it predicted correctly 69% of the time.
The Decision Tree Classifier is a good fit because it splits the data on explicit decision rules, so each decision and its outcome can be traced directly; this reduces ambiguity in how a prediction is reached.
# Model 3: Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
param_grid_dtc = {'decisiontreeclassifier__max_depth': np.arange(1,10)}
grid_dtc, score = run_model(dtc, param_grid_dtc, train_features, train_outcome, test_features, test_outcome, do_poly = False)
score
After running the Decision Tree Classifier model, we get an accuracy of 71%. This means that when we ran our test data on the model, it was accurate 71% of the time.
Naive Bayes is easy to implement and requires relatively little training data. Moreover, our dataset contains both binary (0/1) and ordinal (1, 2, 3) features, which Naive Bayes models can handle well.
# Model 4: Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
param_grid_naive = {'polynomialfeatures__degree':range(1, 3),
'selectpercentile__percentile':range(10, 30, 5)}
grid_dtc, score = run_model(gnb, param_grid_naive, train_features, train_outcome, test_features, test_outcome, do_poly = True)
score
After running the Naive Bayes model, we get an accuracy of 71%. This means that when we ran our test data on the model, it was accurate 71% of the time.
From the accuracy scores calculated above, the Naive Bayes model performs best, essentially tied with the Decision Tree Classifier at about 71%. Naive Bayes also has the practical advantage of being highly scalable, since its training cost grows roughly linearly with the number of predictors and the size of the dataset.
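To put the reported numbers side by side (values as stated above; the baseline figure is an R² score rather than an accuracy):
# summary of the scores reported above
pd.Series({'Regression baseline (R^2)': 0.23,
           'k-Nearest Neighbors': 0.69,
           'Decision Tree': 0.71,
           'Naive Bayes': 0.71}).sort_values(ascending=False)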
Ultimately, our results are consistent with previous research findings: cardiovascular disease outcomes were associated with higher cholesterol levels, higher glucose levels, being overweight, smoking, drinking alcohol, and living an inactive lifestyle.