Diabetes Risk Prediction
Clinical decision-support system that estimates the probability of Type 2 Diabetes from eight structured patient features (age, hypertension, heart disease, BMI, HbA1c, blood glucose, gender, smoking history). Random Forest was selected by controlled benchmarking against Gradient Boosting, Logistic Regression, Decision Tree, and Gaussian Naive Bayes under SMOTE-balanced training and RandomizedSearchCV, reaching ROC-AUC 0.996 with stable scores across folds. The serving layer is a stateless Streamlit app with deterministic 13-column feature encoding, scikit-learn 1.3.2 pinned for pickle compatibility, and a lazy-loaded model cached via st.cache_resource. The serialized estimator is downloaded from Google Drive on cold start, keeping the Git history free of binary blobs.
Problem statement
Type 2 Diabetes is largely preventable yet routinely under-screened. A lightweight, reproducible inference layer over a high-recall classifier can act as a triage signal in primary-care settings, employer wellness programs, and population-health analytics, none of which require hospital-grade EHR integration to be useful. The engineering goal was to package that triage layer as a single deployable artifact with a stateless serving surface, deterministic inputs, and a serving cost low enough that nothing about the platform discourages redeploying it.
Dataset & data
~100,000 patient records with a binary diabetes target; one-hot expansion of the categorical fields yields 13 encoded input features. A stratified train/test split keeps the original class proportions on the held-out set. Class imbalance is significant, so SMOTE is applied to the training set only, before model selection, which brings the effective training size to ~175,000 samples.
Architecture & design
A single Random Forest classifier exported as a pickle artifact, downloaded from Google Drive on first request and pinned in process memory via st.cache_resource for the lifetime of the Streamlit container. UI inputs go through an explicit build_feature_array helper that encodes them into the exact 13-column ordering the model expects. The schema is grep-able rather than implicit in a pipeline object, which makes the system straightforward to port to FastAPI or ONNX later. Every prediction returns a class label, a calibrated probability, and a categorical risk band consumable by both the UI and any downstream API wrapper.
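A sketch of what an explicit encoder and risk-band mapping could look like. Only the name build_feature_array and the 13-column count come from the write-up. The column names, their order, the dropped reference levels ("Female", "No Info"), the category spellings, and the band thresholds are all assumptions made for illustration.

```python
from typing import List

# ASSUMED column order: six numeric/binary fields, then one-hot indicators
# with one reference level dropped per categorical field (6 + 2 + 5 = 13).
FEATURE_COLUMNS = [
    "age", "hypertension", "heart_disease", "bmi", "hba1c", "blood_glucose",
    "gender_Male", "gender_Other",
    "smoking_current", "smoking_ever", "smoking_former",
    "smoking_never", "smoking_not_current",
]
GENDERS = ("Female", "Male", "Other")  # "Female" assumed as reference level
SMOKING = ("No Info", "current", "ever", "former", "never", "not current")


def build_feature_array(age, hypertension, heart_disease, bmi, hba1c,
                        blood_glucose, gender, smoking_history) -> List[float]:
    """Encode raw inputs into a fixed-order row of 13 floats."""
    if gender not in GENDERS:
        raise ValueError(f"unknown gender: {gender!r}")
    if smoking_history not in SMOKING:
        raise ValueError(f"unknown smoking history: {smoking_history!r}")

    row = {name: 0.0 for name in FEATURE_COLUMNS}
    row.update(age=float(age), hypertension=float(hypertension),
               heart_disease=float(heart_disease), bmi=float(bmi),
               hba1c=float(hba1c), blood_glucose=float(blood_glucose))

    # Reference levels have no column, so they leave the indicators at zero.
    gender_col = f"gender_{gender}"
    if gender_col in row:
        row[gender_col] = 1.0
    smoking_col = "smoking_" + smoking_history.replace(" ", "_")
    if smoking_col in row:
        row[smoking_col] = 1.0

    return [row[name] for name in FEATURE_COLUMNS]


def risk_band(probability: float) -> str:
    """Map a probability to a band; the 0.30 / 0.70 cut-offs are illustrative."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    if probability < 0.30:
        return "low"
    if probability < 0.70:
        return "moderate"
    return "high"


features = build_feature_array(age=54, hypertension=1, heart_disease=0,
                               bmi=31.2, hba1c=6.4, blood_glucose=155,
                               gender="Male", smoking_history="former")
print(dict(zip(FEATURE_COLUMNS, features)))
print(risk_band(0.82))
```

Keeping the column list as a module-level constant is what makes the schema searchable: any other serving surface can import the same list rather than reverse-engineering a fitted transformer.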
Training pipeline
Five model families were benchmarked under identical splits, SMOTE-balanced training, and RandomizedSearchCV hyperparameter selection: Random Forest, Gradient Boosting, Logistic Regression, Decision Tree, and Gaussian Naive Bayes. Cross-validation scores accuracy, precision, recall, and ROC-AUC to confirm that generalization is stable across folds. scikit-learn 1.3.2 is pinned in both the training and serving environments: pickle compatibility across scikit-learn versions is not guaranteed, and the pin is the smallest amount of discipline that prevents the most common production failure for sklearn-based services.
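The benchmarking loop can be sketched roughly as follows. This is an illustration under assumptions, not the original training code: the data is a small balanced synthetic set standing in for the SMOTE-resampled training data, and the search spaces, fold count, and n_iter are hypothetical values chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (RandomizedSearchCV, StratifiedKFold,
                                     cross_validate)
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Balanced synthetic stand-in for the SMOTE-resampled training set.
X_bal, y_bal = make_classification(n_samples=600, n_features=13,
                                   n_informative=6, random_state=0)

# Hypothetical search spaces, one per model family.
candidates = {
    "random_forest": (RandomForestClassifier(random_state=0),
                      {"n_estimators": [25, 50], "max_depth": [4, 8, None]}),
    "gradient_boosting": (GradientBoostingClassifier(random_state=0),
                          {"n_estimators": [25, 50], "learning_rate": [0.05, 0.1]}),
    "logistic_regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.1, 1.0, 10.0]}),
    "decision_tree": (DecisionTreeClassifier(random_state=0),
                      {"max_depth": [3, 5, None]}),
    "gaussian_nb": (GaussianNB(), {"var_smoothing": [1e-9, 1e-8, 1e-7]}),
}

# One shared splitter so every family sees identical folds.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "roc_auc"]

results = {}
for name, (estimator, space) in candidates.items():
    search = RandomizedSearchCV(estimator, space, n_iter=3, scoring="roc_auc",
                                cv=cv, random_state=0)
    search.fit(X_bal, y_bal)
    # Re-score the tuned model on all four metrics; the per-fold standard
    # deviation is the stability signal.
    scores = cross_validate(search.best_estimator_, X_bal, y_bal,
                            cv=cv, scoring=metrics)
    results[name] = {m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std())
                     for m in metrics}

for name, summary in results.items():
    print(name, {m: f"{mean:.3f} +/- {std:.3f}" for m, (mean, std) in summary.items()})
```

A common variation is to wrap SMOTE and the estimator in an imbalanced-learn Pipeline so that resampling happens inside each fold. The sketch instead mirrors the order described above, where balancing precedes model selection.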
Results & performance
Random Forest: ROC-AUC 0.996. Gradient Boosting: 0.97. Logistic Regression: 0.96. Decision Tree: 0.95. Gaussian Naive Bayes: 0.93. Cross-validation was stable across folds. Random Forest also had the cleanest calibration curve and the lowest variance, which mattered more than the marginal AUC delta because the system surfaces a probability, not just a class label. Live at diabetes-risk-prediction-ai.streamlit.app; the cold-start download adds a few seconds once per container lifetime and no overhead on later requests.
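The once-per-container cost follows from a load-once pattern, sketched below with the standard library only so it runs anywhere. functools.lru_cache stands in for st.cache_resource, and fetch_model_file is a hypothetical placeholder for the Google Drive download; the file path and the placeholder artifact are likewise assumptions.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path

# Hypothetical local cache location inside a fresh temporary directory.
MODEL_PATH = Path(tempfile.mkdtemp()) / "model.pkl"
download_calls = 0  # counts how often the slow path runs


def fetch_model_file(destination: Path) -> None:
    """Placeholder for the remote download: writes a dummy pickle artifact."""
    global download_calls
    download_calls += 1
    with destination.open("wb") as fh:
        pickle.dump({"model": "placeholder"}, fh)


@lru_cache(maxsize=1)  # in the Streamlit app, st.cache_resource plays this role
def load_model():
    """Download the artifact if it is missing, then unpickle it once per process."""
    if not MODEL_PATH.exists():
        fetch_model_file(MODEL_PATH)
    with MODEL_PATH.open("rb") as fh:
        return pickle.load(fh)


first = load_model()   # cold start: download plus unpickle
second = load_model()  # warm path: returns the cached object
print(first is second, download_calls)
```

The second call returns the same in-memory object without touching disk or network, which is why only the first request in a container pays the download delay.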