AI/ML · 2024

Heart Disease Risk Prediction

Lightweight clinical decision-support system that estimates cardiovascular risk from structured indicators (age, chest pain type, cholesterol, resting BP, max heart rate, ST depression, major vessels, resting ECG, exercise-induced angina). Boosted ensemble selected by comparative benchmarking against KNN (84.96%), SVM (87.47%), Random Forest (88.02%), and a voting ensemble (86.35%); the boosted model reaches 89.42% accuracy, and cross-validation is used to check that this result generalizes. Clean separation between the training notebook and a Streamlit serving entrypoint that loads a single Joblib artifact, assembles a fixed-order feature vector, and surfaces predict_proba as a tiered risk band rather than a binary label.

Technology stack

Python 3.10+, XGBoost, scikit-learn, Streamlit, Plotly, Joblib

Problem statement

Cardiovascular disease remains the leading cause of mortality worldwide. Risk stratification typically depends on the manual interpretation of structured indicators — age, blood pressure, cholesterol, resting ECG, exercise-induced angina, ST depression. The goal here was to package a trained boosted classifier into a maintainable, reproducible inference app that surfaces probabilistic risk (not a binary label), runs cheaply enough for low-cost deployment, and leaves a clean migration path to a hardened FastAPI service later.

Dataset & data

The project uses the standard heart-disease dataset, stored at data/heart.csv. It contains structured clinical features, including chest pain type, serum cholesterol, resting blood pressure, max heart rate achieved, ST depression induced by exercise, number of major vessels colored by fluoroscopy, resting ECG results, and exercise-induced angina. The data is cleaned and split into train/test partitions to measure out-of-sample generalization.
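The cleaning and splitting step might look roughly like the sketch below. It is not the notebook's actual code: the label column name "target", the specific cleaning steps (dropping duplicate rows and rows with missing values), the 80/20 ratio, and the seed are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed name of the label column; the real CSV may use a different one.
TARGET_COLUMN = "target"


def clean_and_split(frame: pd.DataFrame, test_size: float = 0.2, seed: int = 42):
    """Drop duplicate and incomplete rows, then make a stratified train/test split."""
    # Assumed cleaning steps: remove exact duplicates and rows with missing values.
    cleaned = frame.drop_duplicates().dropna()
    features = cleaned.drop(columns=[TARGET_COLUMN])
    labels = cleaned[TARGET_COLUMN]
    # Stratifying keeps the positive/negative ratio the same in both partitions.
    return train_test_split(
        features, labels, test_size=test_size, stratify=labels, random_state=seed
    )
```

Under these assumptions, the function would be called as clean_and_split(pd.read_csv("data/heart.csv")) and returns X_train, X_test, y_train, y_test in scikit-learn's usual order.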

Architecture & design

Single boosted classifier exported as a Joblib .pkl — the simplest unit of model versioning, trivial to swap, hash, audit, and load from a controlled inference boundary. Streamlit serves the inference path interactively; the app loads the artifact once, builds the feature vector explicitly (fixed column ordering, no dict-to-DataFrame coercion that could silently drift the schema), calls predict_proba, and surfaces a tiered low/moderate/high interpretation alongside a risk gauge. The training notebook and the serving app are kept mutually independent — no training imports leak into the inference runtime.
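A minimal sketch of the inference path described above: explicit fixed-order feature assembly, a predict_proba call, and a tiered interpretation. The feature names, their order, and the 0.33/0.66 band cut-offs are hypothetical; the real order has to match whatever the training notebook used, and the app's actual thresholds are not stated here. The sketch also assumes that column 1 of predict_proba is the disease class.

```python
from typing import Mapping

import numpy as np

# Hypothetical feature names and order; must mirror the training schema exactly.
FEATURE_ORDER = (
    "age", "chest_pain_type", "resting_bp", "cholesterol", "resting_ecg",
    "max_heart_rate", "exercise_angina", "st_depression", "major_vessels",
)

# Assumed cut-offs for the low/moderate/high bands.
LOW_MAX = 0.33
MODERATE_MAX = 0.66


def build_feature_vector(inputs: Mapping[str, float]) -> np.ndarray:
    """Return a (1, n_features) array with values placed in FEATURE_ORDER."""
    missing = [name for name in FEATURE_ORDER if name not in inputs]
    unexpected = [name for name in inputs if name not in FEATURE_ORDER]
    if missing or unexpected:
        # Fail loudly instead of letting the schema drift silently.
        raise ValueError(f"missing={missing}, unexpected={unexpected}")
    return np.asarray([[float(inputs[name]) for name in FEATURE_ORDER]])


def risk_band(probability: float) -> str:
    """Map a disease probability in [0, 1] to a tiered label."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError(f"probability out of range: {probability}")
    if probability < LOW_MAX:
        return "low"
    if probability < MODERATE_MAX:
        return "moderate"
    return "high"


def predict_risk(model, inputs: Mapping[str, float]) -> tuple[float, str]:
    """Score one patient with any model exposing predict_proba."""
    row = build_feature_vector(inputs)
    # Assumption: column 1 holds the probability of the positive (disease) class.
    probability = float(model.predict_proba(row)[0][1])
    return probability, risk_band(probability)
```

In a Streamlit app, the model passed to predict_risk would typically come from joblib.load wrapped in a cached loader so the artifact is read once per process. Because the vector is built by name from FEATURE_ORDER, the order of keys in the input mapping does not affect the result.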

Training pipeline

Five model families compared under the same split: KNN, SVM, Voting Ensemble, Random Forest, and a boosted ensemble. The boosted model is then refined and cross-validated to confirm that its accuracy on the held-out test set is not an artifact of one particular split.
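The comparison can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: it runs on synthetic data from make_classification rather than data/heart.csv, uses default hyperparameters, and swaps in scikit-learn's GradientBoostingClassifier as a stand-in for the XGBoost model so the block depends on scikit-learn only. The accuracies it produces are unrelated to the figures reported for this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

SEED = 42

# Synthetic stand-in for the heart-disease data (9 features, binary label).
X, y = make_classification(
    n_samples=400, n_features=9, n_informative=5, random_state=SEED
)
# One shared split so every model family is scored on identical test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=SEED
)


def make_candidates() -> dict:
    """Build the five model families with default (untuned) settings."""
    def knn():
        return make_pipeline(StandardScaler(), KNeighborsClassifier())

    def svm():
        return make_pipeline(StandardScaler(), SVC(probability=True, random_state=SEED))

    def forest():
        return RandomForestClassifier(n_estimators=100, random_state=SEED)

    voting = VotingClassifier(
        estimators=[("knn", knn()), ("svm", svm()), ("rf", forest())],
        voting="soft",  # average predicted probabilities across members
    )
    return {
        "knn": knn(),
        "svm": svm(),
        "voting": voting,
        "random_forest": forest(),
        # Stand-in for XGBoost; an xgboost.XGBClassifier could be used here instead.
        "boosted": GradientBoostingClassifier(random_state=SEED),
    }


candidates = make_candidates()
test_accuracy = {
    name: accuracy_score(y_test, model.fit(X_train, y_train).predict(X_test))
    for name, model in candidates.items()
}

# 5-fold cross-validation on the training partition for the boosted model.
cv_scores = cross_val_score(
    candidates["boosted"], X_train, y_train, cv=5, scoring="accuracy"
)

for name, score in test_accuracy.items():
    print(f"{name}: {score:.2%}")
print(f"boosted CV mean: {cv_scores.mean():.2%} (std {cv_scores.std():.2%})")
```

Comparing the boosted model's held-out accuracy with the mean and spread of its cross-validation scores is one way to check that the test-set figure is not a lucky split.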

Results & performance

Accuracy by model:

KNN: 84.96%
SVM: 87.47%
Voting Ensemble: 86.35%
Random Forest: 88.02%
Boosted model: 89.42% (promoted)

Cross-validation, comparative accuracy, and boosted-model behavior are captured as committed artifacts (Model accuracy.png, boosted model accuracy.png, cross validation.png). The app is live at heart-disease-clinical-ai.streamlit.app, with sample inference snapshots for healthy and high-risk patient inputs.