The overarching purpose of our research project is to study the relationship between various risk factors and cardiovascular disease. We believe this is an important undertaking because understanding how lifestyle relates to health is an ongoing need. With science-backed findings, we want to help individuals make better decisions, communities promote healthier public health practices, and governments find the right balance when allocating resources between public healthcare and research and development.
For decades, cardiovascular disease has been the leading killer of Americans. In recent years, however, advances in biomedical research have improved emergency response systems and treatment, and public health programs have strengthened prevention efforts (Source: CVD: A costly burden for America). Even so, cardiovascular disease remains the leading cause of death, a major cause of disability, and a major contributor to productivity loss among Americans. In fact, an estimated 71.3 million Americans, about one in three, have one or more types of heart disease. This burden translates not only into loss of life but also into reduced quality of life and healthcare spending that could otherwise go toward other social programs. Prevention is usually much cheaper to invest in than treatment, so understanding risk factors has a large potential impact on reducing this burden (Source: An overview of CVD burden in the U.S.).
Data description: We will be working with a dataset on cardiovascular disease available on Kaggle, which consists of 70,000 patient records. We were particularly interested in exploring a research question in the healthcare domain: understanding the relationship between a set of risk factors and a given disease. The main outcome we are concerned with is the presence or absence of cardiovascular disease. Our search led us to this dataset, which had all the relevant information we could use to predict cardiovascular disease.
There are 3 types of input features:
Feature | Variable Type | Variable | Value Type |
---|---|---|---|
Age | Objective Feature | age | int (days) |
Height | Objective Feature | height | int (cm) |
Weight | Objective Feature | weight | float (kg) |
Gender | Objective Feature | gender | categorical code |
Systolic blood pressure | Examination Feature | ap_hi | int (mmHg) |
Diastolic blood pressure | Examination Feature | ap_lo | int (mmHg) |
Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |
Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |
Smoking | Subjective Feature | smoke | binary |
Alcohol intake | Subjective Feature | alco | binary |
Physical activity | Subjective Feature | active | binary |
Presence or absence of cardiovascular disease | Target Variable | cardio | binary |
import pandas as pd
import glob
import os
import matplotlib.pyplot as plt
import numpy as np
import math
import seaborn as sns
import statsmodels.formula.api as smf
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")
# styling
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
# read the data from csv file
df = pd.read_csv('data/cardio_train.csv', sep=';', index_col = "id")
# check if the dataframe has any missing values
df.isnull().values.any()
# convert the 'age' column from days to years
df['age_in_years'] = (df['age'] / 365).round().astype('int')
# use describe() to display statistics such as min, max, mean, and std for each feature
df.describe()
From the table above, we noticed some inconsistencies in the data. For example, the minimum height is 55 cm (about 1.8 feet) and the minimum weight is 10 kg, which initially made us think the dataset might include children. However, the minimum age is 29 years, so those data points were most likely recorded incorrectly.
To check this assumption, we inspected these potential outliers visually, looking at height, weight, and systolic and diastolic blood pressure.
# create a scatter plot that tells us which dots could be considered as outliers
height = df.plot.scatter(x='age_in_years', y='height', c='DarkBlue')
From the scatter plot above, we can see some points that fall outside the main cluster and appear to be outliers.
# create a histogram that displays the distribution of individuals' heights
df.height.hist()
From the histogram above, we see that the distribution is not normal.
To remove outliers, we decided to drop any rows with values above the 97.5th percentile or below the 2.5th percentile.
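As an aside, the same trimming rule could be wrapped in a small helper. The sketch below is illustrative only (the helper name is ours, not part of the original analysis); the notebook simply applies the rule column by column as shown next.
# illustrative helper (not used below): trim a column to its 2.5%-97.5% range
def drop_quantile_outliers(frame, column, lower=0.025, upper=0.975):
    lo, hi = frame[column].quantile(lower), frame[column].quantile(upper)
    return frame[(frame[column] >= lo) & (frame[column] <= hi)]
# example usage, equivalent to the per-column drops that follow:
# df = drop_quantile_outliers(df, 'height')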
# remove height outliers
df.drop(df[(df['height'] > df['height'].quantile(0.975)) | (df['height'] < df['height'].quantile(0.025))].index,inplace=True)
# create a scatter plot that shows the correlation between age and height after removing outliers
height_after_scat = df.plot.scatter(x='age_in_years', y='height', c='DarkBlue', title="Age vs Height")
plt.xlabel("Age")
plt.ylabel("Height (cm)")
plt.show()
# create a histogram that shows the distribution of height after removing outliers
height_after_hist = df.height.hist()
plt.title("Height distribution")
plt.xlabel("Height (cm)")
plt.ylabel("Count")
plt.show()
After removing the outliers, we can see that the scatter plot is more evenly spread and the histogram is more evenly distributed.
# create a scatter plot that tells us which dots could be considered as outliers
weight = df.plot.scatter(x='age_in_years', y='weight', c='DarkBlue')
plt.title("Age vs Weight")
plt.xlabel("Age")
plt.ylabel("Weight (kg)")
plt.show()
# create a histogram that displays the distribution of individuals' weights
df.weight.hist()
plt.title("Weight distribution")
plt.xlabel("Weight (kg)")
plt.ylabel("Count")
plt.show()
# remove outliers by dropping weight values below the 2.5th or above the 97.5th percentile
df.drop(df[(df['weight'] > df['weight'].quantile(0.975)) | (df['weight'] < df['weight'].quantile(0.025))].index,inplace=True)
# create a scatter plot that shows the correlation between age and weight after getting rid of outliers
weight2 = df.plot.scatter(x='age_in_years', y='weight', c='DarkBlue')
plt.title("Age vs Weight")
plt.xlabel("Age")
plt.ylabel("Weight (kg)")
plt.show()
# create a histogram that shows the distribution of weight after getting rid of outliers
df.weight.hist()
plt.title("Weight distribution")
plt.xlabel("Weight (kg)")
plt.ylabel("Count")
plt.show()
# create a scatter plot that tells us which dots could be considered as outliers for systolic blood pressure
blood_hi = df.plot.scatter(x='age_in_years', y='ap_hi', c='DarkBlue')
plt.title("Age vs High Blood Pressure")
plt.xlabel("Age")
plt.ylabel("High Blood Pressure (mmHg)")
plt.show()
# create a histogram that displays the distribution of each individual's systolic blood pressure
df.ap_hi.hist()
plt.title("High Blood Pressure Distribution")
plt.xlabel("High Blood Pressure (mmHg)")
plt.ylabel("Count")
plt.show()
# first remove rows where 'ap_lo' is higher than 'ap_hi', since a diastolic reading above the systolic reading is physiologically impossible
# then drop systolic blood pressure values below the 2.5th or above the 97.5th percentile
indexNames = df[df['ap_hi'] < df['ap_lo']].index
df.drop(indexNames, inplace=True)
df.drop(df[(df['ap_hi'] > df['ap_hi'].quantile(0.975)) | (df['ap_hi'] < df['ap_hi'].quantile(0.025))].index,inplace=True)
# Scatter after removing high blood pressure outliers
blood_hi = df.plot.scatter(x='age_in_years', y='ap_hi', c='DarkBlue')
plt.title("Age vs High Blood Pressure")
plt.xlabel("Age")
plt.ylabel("High Blood Pressure (mmHg)")
plt.show()
# Histogram after removing high blood pressure outliers
df.ap_hi.hist()
plt.title("High Blood Pressure Distribution")
plt.xlabel("High Blood Pressure (mmHg)")
plt.ylabel("Count")
plt.show()
# create a scatter plot that tells us which dots could be considered as outliers for diastolic blood pressure
blood_lo = df.plot.scatter(x='age_in_years', y='ap_lo', c='DarkBlue')
plt.title("Age vs Low Blood Pressure")
plt.xlabel("Age")
plt.ylabel("Low Blood Pressure (mmHg)")
plt.show()
# create a histogram that displays the distribution of each individual's diastolic blood pressure
df.ap_lo.hist()
plt.title("Low Blood Pressure Distribution")
plt.xlabel("Low Blood Pressure (mmHg)")
plt.ylabel("Count")
plt.show()
# remove outliers by dropping diastolic blood pressure values below the 2.5th or above the 97.5th percentile
df.drop(df[(df['ap_lo'] > df['ap_lo'].quantile(0.975)) | (df['ap_lo'] < df['ap_lo'].quantile(0.025))].index,inplace=True)
# Scatter after removing low blood pressure outliers
blood_lo = df.plot.scatter(x='age_in_years', y='ap_lo', c='DarkBlue')
plt.title("Age vs Low Blood Pressure")
plt.xlabel("Age")
plt.ylabel("Low Blood Pressure (mmHg)")
plt.show()
# Histogram after removing low blood pressure outliers
df.ap_lo.hist()
plt.title("Low Blood Pressure Distribution")
plt.xlabel("Low Blood Pressure (mmHg)")
plt.ylabel("Count")
plt.show()
In order to perform modeling on the data, we first have to make the dataset uniform so that it is in a form we can work with.
We convert the continuous variables (BMI and blood pressure) into ordinal ones, because we want to reduce the predictor types to just two: binary and ordinal.
# create a column called BMI
df['bmi'] = df['weight'] / ((df['height'] / 100) ** 2)
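As a quick check of the formula, a hypothetical person weighing 70 kg at 170 cm has a BMI of 70 / 1.70² ≈ 24.2, which falls in the "Healthy" band (18.5-24.9) used in the scale below.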
We then created a new ordinal BMI feature:
# create a new column called bmi_scale with a 1, 2, 3 scale indicating
# "Underweight", "Healthy", and "Overweight" respectively
conditions = [(df['bmi'] < 18.5), ((df['bmi'] >= 18.5) & (df['bmi'] <= 24.9)), (df['bmi'] > 24.9)]
choices = [1,2,3]
df['bmi_scale'] = np.select(conditions, choices, default=np.nan).astype('int')
# create a column called blood_pressure that categorizes systolic and diastolic blood pressure together
conditions = [\
((df['ap_hi'] <= 129) & (df['ap_lo'] < 80)), \
(((df['ap_hi'] >= 130) & (df['ap_hi'] < 140)) | ((df['ap_lo'] >= 80) & (df['ap_lo'] <= 90))), \
((df['ap_hi'] >= 140) | (df['ap_lo'] > 90))]
choices = [1, 2, 3] #["Normal", "High Blood Pressure", "Hypertensive crisis"]
df['blood_pressure'] = np.select(conditions, choices).astype('int')
df.groupby('blood_pressure').count()
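As a worked example of this coding, a hypothetical reading of 135/85 mmHg fails the first condition, matches the second, and is therefore coded 2 ("High Blood Pressure").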
# create a column called gender_new where we recode gender values 1-2 as 0-1
conditions = [(df['gender'] == 1), (df['gender'] == 2)]
choices = [0, 1]
df['gender_new'] = np.select(conditions, choices)
df = df.drop('gender', axis=1)
# create a plot that shows the distribution of people who do and do not have cardiovascular disease
df['cardio'].value_counts()
sns.countplot(x="cardio", data=df, palette="hls")
plt.show()
# make a function that creates a stacked bar chart of a given feature versus cardiovascular disease
def viz_stack(df, factor, title, xlabel, ylabel, xlabels, rot):
    # cross-tabulate the factor against the outcome and plot row-normalized proportions
    table = pd.crosstab(df[factor], df.cardio)
    table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
    plt.title(title)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plt.xticks(range(len(xlabels)), xlabels, rotation=rot)
# create a stacked bar of BMI scale versus proportion of Cardiovascular disease
bmi_xlabels=["Underweight", "Healthy", "Overweight"]
viz_stack(df, "bmi_scale", "Stacked Bar Chart of BMI vs Cardiovascular disease", "BMI Scale", "Proportion of Cardiovascular disease", bmi_xlabels, 45)
From the graph above, we can see that overweight people have the highest percentage of cardiovascular disease and underweight people have the lowest.
# create a stacked bar of Cholesterol versus proportion of Cardiovascular disease
cholesterol_xlabels = ["Normal", "Above Normal", "Way above Normal"]
viz_stack(df, "cholesterol", "Stacked Bar Chart of Cholesterol vs Cardiovascular disease", "Cholesterol", "Proportion of Cardiovascular disease", cholesterol_xlabels, 45)
From the graph above, we can see that people with cholesterol levels way above normal have the highest percentage of cardiovascular disease, and people with normal cholesterol levels have the lowest.
# create a stacked bar of Glucose versus proportion of Cardiovascular disease
gluc_xlabels = ["Normal", "Above Normal", "Way above Normal"]
viz_stack(df, "gluc", "Stacked Bar Chart of blood pressure vs Cardiovascular disease", "GLUCOSE", "Proportion of Cardiovascular disease", gluc_xlabels, 45)
From the above graph we can see that the people with glucose level of way above normal have highest percentage of cardiovascular disease, and people with glucose level of normal have lowest percentage of cardio vascular disease.
# creates a stacked bar of Cholesterol versus proportion of Cardiovascular disease
blood_xlabels = ["Normal", "High Blood Pressure", "Hypertensive crisis"]
viz_stack(df, "blood_pressure", "Stacked Bar Chart of blood pressure vs Cardiovascular disease", "BLOOD PRESSURE", "Proportion of Cardiovascular disease", blood_xlabels, 45)
From the above graph we can see that people with hypertensive crisis have highest cardiovascular disease percentage, and people with normal blood pressure have lowest cardiovascular disease percentage.
# make a function that plots the count of each level for a list of categorical columns
def viz_cat(df, cat, xlabel, title, labels):
    df_categorical = df.loc[:, cat]
    g = sns.countplot(x="variable", hue="value", data=pd.melt(df_categorical), palette='hls')
    g.set_xticklabels(cat, rotation=30)
    g.set_xlabel(xlabel)
    plt.legend(title=title, labels=labels, loc='upper right')
    plt.show()
labels=['normal', 'above normal', 'way above normal']
cat = ['cholesterol','gluc']
viz_cat(df, cat, 'risk factor', 'level', labels )
cat = ['smoke', 'alco', 'active']
labels=['no', 'yes']
viz_cat(df, cat, 'risk factor', 'level', labels )
cat = ['bmi_scale']
labels=['underweight', 'healthy', 'overweight']
viz_cat(df, cat, 'risk factor', 'level', labels )
cat = ['blood_pressure']
labels=['normal', 'high blood pressure', 'hypertensive crisis']
viz_cat(df, cat, 'risk factor', 'level', labels )
This is the distribution of our data across the different factors: cholesterol level, glucose level, smoking status, alcohol consumption, physical activity, BMI category, and blood pressure category.
From here, we can see that most individuals in the dataset have normal cholesterol and glucose levels. Most of them also do not smoke, do not drink, and lead an active lifestyle. Looking at the BMI distribution, the largest category is overweight, followed by healthy and underweight. Looking at the blood pressure distribution, the largest category is high blood pressure, followed by normal and hypertensive crisis.
Our dataset contains two different types of predictors: ordinal/categorical and binary.
Because these types are different, we decided to treat them separately when applying our logistic regression models.
# Logistic regression with categorical data
formula_cate = "cardio ~ bmi_scale + cholesterol + gluc + blood_pressure"
logistic_model_cate = smf.glm(formula=formula_cate, data=df, family=sm.families.Binomial()).fit()
logistic_model_cate.summary()
The estimated coefficients are in log odds. The odds of an event are the probability of that event divided by its complement: $$\frac{P}{1-P}$$
By exponentiating the coefficients, we can calculate the odds, which are easier to interpret.
# Odds ratio of ordinal data
np.exp(logistic_model_cate.params)
We can then do some interpretation with the data above.
Coefficient:
Standard Error:
P-value:
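The entries above can be read directly off the fitted statsmodels result. As an illustrative sketch (using cholesterol as the example predictor; not part of the original write-up), one way to pull a coefficient together with its standard error, p-value, and odds ratio is:
# illustrative sketch: read one predictor's estimates from the fitted GLM result
coef = logistic_model_cate.params['cholesterol']   # log-odds coefficient
se = logistic_model_cate.bse['cholesterol']        # standard error
pval = logistic_model_cate.pvalues['cholesterol']  # p-value
odds_ratio = np.exp(coef)                          # multiplicative change in odds
ci_low, ci_high = np.exp(logistic_model_cate.conf_int().loc['cholesterol'])
print(f"OR = {odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f}), p = {pval:.3g}")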
# Logistic regression with binary data
formula_bi = "cardio ~ gender_new + smoke + alco + active"
logistic_model_bi = smf.glm(formula=formula_bi, data=df, family=sm.families.Binomial()).fit()
logistic_model_bi.summary()
We apply the same interpretation logic as for the categorical model.
# Odds ratio of binary data
np.exp(logistic_model_bi.params)
Coefficient:
Standard Error:
P-value:
# For splitting data
from sklearn.model_selection import train_test_split
# For scaler/normalization
from sklearn.preprocessing import MinMaxScaler
# Preprocessor
from sklearn.preprocessing import PolynomialFeatures
# Feature selections
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import VarianceThreshold
# for making pipelines
from sklearn.pipeline import make_pipeline
# for grid search
from sklearn.model_selection import GridSearchCV
# split the data into training and testing sets (70% train / 30% test)
train_features, test_features, train_outcome, test_outcome = train_test_split(
    df.drop(["cardio"], axis=1),
    df['cardio'],
    test_size=0.30,
    random_state=11
)
# make a function to run multiple models through the same preprocessing pipeline
def run_model(model, param_grid, xtrain, ytrain, xtest, ytest, do_poly=False):
    # Create a scaler
    scaler = MinMaxScaler()
    # Create a percentile-based feature selector and a variance threshold
    selecter = SelectPercentile()
    threshold = VarianceThreshold(.1)
    # Optionally add a polynomial feature transformation at the start of the pipeline
    if do_poly:
        poly = PolynomialFeatures()
        pipe = make_pipeline(poly, threshold, selecter, scaler, model)
    else:
        pipe = make_pipeline(threshold, selecter, scaler, model)
    grid = GridSearchCV(pipe, param_grid)
    grid.fit(xtrain, ytrain)
    accuracy = grid.score(xtest, ytest)
    return grid, accuracy
We ran our training data through several machine learning models. For each one, we calculated an accuracy score, which reflects the fraction of predictions the model got right.
We used classifier models because our prediction outcome is categorical rather than continuous, and our data are labeled: each row is either a yes (cardiovascular disease present) or a no (cardiovascular disease absent).
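For reference, the accuracy reported by a classifier pipeline's .score() is the same quantity that sklearn.metrics.accuracy_score computes; a tiny illustrative example (not from our dataset):
# tiny illustration of accuracy: 3 of these 4 predictions match, so the score is 0.75
from sklearn.metrics import accuracy_score
accuracy_score([1, 0, 1, 1], [1, 0, 0, 1])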
We started with a simple regression baseline (scikit-learn's LinearRegression), because it is easy to implement and very basic. We weren't expecting it to score well, but we wanted a benchmark to compare our other models against.
# Model 1: baseline regression
# note: scikit-learn's LinearRegression (a regressor, not a classifier) is fitted here,
# so grid.score() below returns an R^2 value rather than a classification accuracy
from sklearn.linear_model import LinearRegression
lr = LinearRegression(fit_intercept=False)
param_grid_linear = {'polynomialfeatures__degree':range(1, 3),
'selectpercentile__percentile':range(10, 30, 5)}
grid_dtc, score = run_model(lr, param_grid_linear, train_features, train_outcome, test_features, test_outcome, do_poly = True)
score
Running this baseline gives a score of 23%. Because LinearRegression is a regressor rather than a classifier, this value is an R² score, not the fraction of correct predictions, so it serves only as a rough lower benchmark for the classifiers below.
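For completeness, a true logistic-regression classifier could be run through the same pipeline helper; the sketch below is only illustrative, and its score is not part of the results reported here.
# illustrative sketch (not part of the reported results): logistic regression classifier
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(max_iter=1000)
param_grid_logreg = {'polynomialfeatures__degree': range(1, 3),
                     'selectpercentile__percentile': range(10, 30, 5)}
grid_logreg, logreg_score = run_model(logreg, param_grid_logreg, train_features,
                                      train_outcome, test_features, test_outcome, do_poly=True)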
We chose the k-Nearest Neighbors classifier because it is particularly useful for nonlinear data and makes no assumptions about the underlying distribution of the data.
# Model 2: k-Nearest Neighbors Classifier
from sklearn.neighbors import KNeighborsClassifier
knc = KNeighborsClassifier()
param_grid_knc = {'kneighborsclassifier__n_neighbors': np.arange(1,10),
'kneighborsclassifier__weights':["uniform", "distance"]}
grid_knc_a, score = run_model(knc, param_grid_knc, train_features, train_outcome, test_features, test_outcome, do_poly = False)
score
After running the K-Nearest Neighbors model, we get an accuracy of 69%. This means that when we ran our test data on the model, it predicted correctly 69% of the time.
The Decision Tree Classifier is a good fit because it splits the data on explicit decision rules, so each decision and its outcome can be traced directly; this reduces ambiguity in how a prediction is reached.
# Model 3: Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
param_grid_dtc = {'decisiontreeclassifier__max_depth': np.arange(1,10)}
grid_dtc, score = run_model(dtc, param_grid_dtc, train_features, train_outcome, test_features, test_outcome, do_poly = False)
score
After running the Decision Tree Classifier model, we get an accuracy of 71%. This means that when we ran our test data on the model, it was accurate 71% of the time.
Naive Bayes is easy to implement and requires relatively little training data. Moreover, our dataset contains both binary (0/1) and ordinal (1, 2, 3) features, which Naive Bayes models can handle well.
# Model 4: Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
param_grid_naive = {'polynomialfeatures__degree':range(1, 3),
'selectpercentile__percentile':range(10, 30, 5)}
grid_dtc, score = run_model(gnb, param_grid_naive, train_features, train_outcome, test_features, test_outcome, do_poly = True)
score
After running the Naive Bayes model, we get an accuracy of 71%. This means that when we ran our test data on the model, it was accurate 71% of the time.
From the accuracy scores calculated above, the Naive Bayes model performs best, essentially tied with the Decision Tree Classifier at about 71%. Naive Bayes also has the practical advantage of being highly scalable, since its training cost grows roughly linearly with the number of predictors and the size of the dataset.
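To put the reported numbers side by side (values as stated above; the baseline figure is an R² score rather than an accuracy):
# summary of the scores reported above
pd.Series({'Regression baseline (R^2)': 0.23,
           'k-Nearest Neighbors': 0.69,
           'Decision Tree': 0.71,
           'Naive Bayes': 0.71}).sort_values(ascending=False)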
Ultimately, our results are consistent with previous research findings: cardiovascular disease outcomes were associated with higher cholesterol levels, higher glucose levels, being overweight, smoking, drinking alcohol, and living an inactive lifestyle.