Case study · Healthcare machine learning

Spotting risk early when minutes and information are both scarce

A machine learning study on 61,018 maternal and neonatal records from Kenya and Uganda, built to test whether real, messy health data can surface adverse outcome risk early enough to actually help a clinician at the bedside.

Python · Scikit-learn · Pandas · Random Forest · Logistic Regression · KNN · Healthcare analytics

The problem

In a busy low-resource maternity ward, the clinician walking into the room has minutes, not hours, to decide whether this mother needs closer monitoring, a different delivery plan, or an urgent referral. They are reading partial notes, incomplete vitals, and often a paper chart that started ten hours ago. The cost of missing a high-risk case is enormous; the cost of over-triaging everyone is a system that quietly collapses under its own caution.

The question I wanted to answer was practical: given the data a facility in Kenya or Uganda actually captures during antenatal care and delivery, can a model flag the mothers and newborns most likely to experience an adverse outcome such as preterm birth, low birth weight, or neonatal complications, early enough to change the decision? Not a research toy. A model that respects the data clinicians really have.

The data

The working dataset was 61,018 birth records pulled together from Kenyan and Ugandan maternal health sources, covering maternal demographics, gestational age, parity, antenatal visit history, delivery mode, birth weight, and a set of binary outcome indicators for the mother and the neonate. On paper that's a rich panel. In practice it had every problem real health data has.

Outcome classes were heavily imbalanced. The vast majority of births were uncomplicated, which is great news for mothers and terrible news for a naive classifier that learns to predict "fine" every time and reports 92% accuracy. Missingness was not random: certain tests were only ordered when a clinician was already suspicious, so a blank field often meant "no concern" rather than "unknown". A handful of birth weight entries were implausibly low (clearly transcription errors), and gestational age was sometimes recorded as a date difference and sometimes as a week count, depending on the facility.

The cleaning pipeline standardised units, reconciled the two country schemas into one frame, flagged but did not blindly drop biologically implausible values, and kept a separate missingness indicator alongside the imputed value so the model could learn from the fact that something was missing. That last decision turned out to matter more than any hyperparameter I touched later.
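A minimal sketch of the "flag, impute, and keep a missingness indicator" idea, not the project's actual pipeline. The column names, the toy values, and the 200 g plausibility cut-off are all assumptions made for illustration. Setting flagged values to missing before imputation is likewise only one possible choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy records; column names and values are hypothetical.
records = pd.DataFrame({
    "birth_weight_g": [3200.0, 45.0, np.nan, 2900.0, 1800.0],
    "anc_visits": [4.0, np.nan, 2.0, np.nan, 1.0],
})

# Flag implausible birth weights instead of dropping the row.
# The 200 g cut-off is an illustrative assumption, not a clinical rule.
records["bw_implausible"] = records["birth_weight_g"] < 200
records.loc[records["bw_implausible"], "birth_weight_g"] = np.nan

numeric_cols = ["birth_weight_g", "anc_visits"]

# add_indicator=True appends one 0/1 column per feature that had missing
# values at fit time, so the model can learn from the missingness itself.
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputed = imputer.fit_transform(records[numeric_cols])

indicator_names = [f"{numeric_cols[i]}_missing" for i in imputer.indicator_.features_]
cleaned = pd.DataFrame(imputed, columns=numeric_cols + indicator_names, index=records.index)
cleaned["bw_implausible"] = records["bw_implausible"]
print(cleaned)
```

Keeping the indicator next to the imputed value matters here because, as noted above, a blank field often meant "no concern" rather than "unknown".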

The approach

I deliberately started simple. Logistic regression first, with properly scaled features and a clear baseline on stratified cross-validation. Not because I expected it to win, but because without a baseline you can't tell whether a fancier model is actually learning the signal or just memorising noise. Logistic regression got me to a usable AUROC and, more importantly, gave me interpretable coefficients I could sanity-check against obstetric intuition. Gestational age and prior complications dominating the top of the list was a good sign.
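A baseline of this kind could look roughly like the sketch below. It runs on synthetic data from `make_classification` as a stand-in for the real records. The 10% positive rate, the feature count, and the five folds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real dataset (assumption).
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Scaling lives inside the pipeline so it is refit on each training fold,
# which avoids leaking test-fold statistics into the model.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
auroc_scores = cross_val_score(baseline, X, y, cv=cv, scoring="roc_auc")
print(f"Mean AUROC over {cv.get_n_splits()} folds: {auroc_scores.mean():.3f}")

# Coefficients on standardised features can be compared in magnitude
# and checked against domain intuition.
baseline.fit(X, y)
coefficients = baseline.named_steps["logisticregression"].coef_.ravel()
print(coefficients)
```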

From there I worked through K-Nearest Neighbors (with k tuned via grid search), a Decision Tree, and a Random Forest. Each model was evaluated on the same stratified splits, with the same preprocessing, so the comparison meant something. Random Forest came out ahead on AUROC and, crucially, on recall for the adverse outcome class, which was the metric that actually mattered. A model that catches more at-risk births at the cost of a few extra false positives is the right trade-off in a triage setting.
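The "same splits, same preprocessing" comparison can be sketched as follows, again on synthetic data. The hyperparameter grid for k, the tree depth, and the number of trees are placeholder assumptions, not the values used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the real dataset (assumption).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# One splitter object, reused for every model, so the folds are identical.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    # k is tuned by an inner grid search nested inside each outer fold.
    "knn": GridSearchCV(
        make_pipeline(StandardScaler(), KNeighborsClassifier()),
        param_grid={"kneighborsclassifier__n_neighbors": [3, 5, 11]},
        scoring="roc_auc", cv=3,
    ),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=["roc_auc", "recall"])
    results[name] = {
        "auroc": scores["test_roc_auc"].mean(),
        "recall": scores["test_recall"].mean(),  # recall on the positive (adverse) class
    }

for name, metrics in results.items():
    print(f"{name:20s} AUROC={metrics['auroc']:.3f}  recall={metrics['recall']:.3f}")
```

Reporting recall next to AUROC keeps the comparison anchored on the adverse class rather than on overall ranking quality alone.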

The mistake I corrected mid-project was reporting overall accuracy in the early write-up. On an imbalanced healthcare problem, accuracy is almost a lie. A do-nothing classifier scores in the high 80s. I switched the primary metric to recall on the adverse class plus AUROC, kept precision and F1 as secondary, and used a confusion matrix as the headline visualisation rather than a single number. That reframing changed which model I would actually recommend.
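To make the accuracy trap concrete, here is a small illustration with made-up labels. The 10% adverse rate is an assumption chosen for round numbers, not the study's actual prevalence.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# 90 uncomplicated births (0) and 10 adverse outcomes (1): illustrative only.
y_true = np.array([0] * 90 + [1] * 10)
X_placeholder = np.zeros((100, 1))  # features are irrelevant to a dummy model

# A "do-nothing" classifier that always predicts the majority class.
do_nothing = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_true)
y_pred = do_nothing.predict(X_placeholder)

accuracy = accuracy_score(y_true, y_pred)
adverse_recall = recall_score(y_true, y_pred)  # positive class is label 1
cm = confusion_matrix(y_true, y_pred)          # rows: true class, columns: predicted class

print(f"accuracy = {accuracy:.2f}")            # accuracy = 0.90
print(f"adverse recall = {adverse_recall:.2f}")  # adverse recall = 0.00
print(cm)                                      # [[90  0] [10  0]]
```

The accuracy looks respectable while every adverse case is missed, which is why recall on the adverse class and the confusion matrix carry the evaluation.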

Fig 1. AUROC comparison across four classifiers on stratified test data

The result

Random Forest delivered the strongest balance of AUROC and recall on the at-risk class, and its confusion matrix on held-out data told a clinically usable story: most adverse outcomes were caught, the false negative rate on the class that mattered most was meaningfully lower than the baseline, and the false positives that did occur clustered in cases that a clinician would already want a second look at.

The feature correlation work surfaced the variables doing the heavy lifting: gestational age, birth weight, prior pregnancy complications, and antenatal visit count repeatedly ranked highest, with a long tail of weaker but consistent signals from maternal age extremes and delivery mode. None of that should surprise an obstetrician, and that's the point. A model whose top features make clinical sense is a model that can be trusted in a workflow.

The birth weight and gestational age distributions made the stakes visible. The bulk of births clustered in the healthy range, but the left tail of preterm, low-weight births was thick enough and predictable enough from the antenatal features that an early warning layer is genuinely feasible. The story isn't "the model can replace a clinician". The story is "the model can flag the chart that should land on the clinician's desk first".

Fig 2. Random Forest confusion matrix on the held-out test set

Fig 3. Feature correlation heatmap across the modelled predictors

Variable legend

bba: born before arrival
doc_iufd: intrauterine fetal death
doc_abortion: pregnancy loss documented
c_birth_weight_g2_: birth weight category

Fig 4. Birth weight distribution across the cohort

Fig 5. Gestational age versus birth outcome

Stillbirth risk rises sharply below 28 weeks gestational age. The 37 weeks and above cohort shows the lowest adverse outcome rate, consistent with established clinical thresholds.
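A banded view like the one behind Fig 5 can be produced with `pd.cut`. The sketch below uses a tiny invented table, so the rates it prints say nothing about the real cohort. `gestational_age_weeks` is a hypothetical column name, and `doc_iufd` is borrowed from the variable legend as the outcome flag.

```python
import numpy as np
import pandas as pd

# Invented toy rows for illustration only; not taken from the study data.
toy = pd.DataFrame({
    "gestational_age_weeks": [25, 27, 30, 34, 36, 38, 39, 40, 41, 26],
    "doc_iufd": [1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
})

# Left-closed bands: [0, 28), [28, 37), [37, inf).
bands = pd.cut(
    toy["gestational_age_weeks"],
    bins=[0, 28, 37, np.inf],
    right=False,
    labels=["<28 weeks", "28-36 weeks", ">=37 weeks"],
)

# Mean of a 0/1 flag per band is the outcome rate in that band.
rate_by_band = toy.groupby(bands, observed=True)["doc_iufd"].mean()
print(rate_by_band)
```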

What's next

The biggest honest limitation is that recall on the at-risk class, while meaningfully better than baseline, is still not where it needs to be for unsupervised deployment. The next step is targeted resampling using SMOTE or class-weighted boosting, paired with a calibrated probability output, so the model produces a risk score rather than a hard label and a clinician sets the threshold based on the ward's capacity to investigate.
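One way that next step might look, sketched on synthetic data. To stay within scikit-learn, this example swaps in a class-weighted Random Forest rather than SMOTE or boosting. The isotonic calibration method and the 15% review capacity are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real dataset (assumption).
X, y = make_classification(
    n_samples=3000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.25, random_state=0
)

# class_weight="balanced" up-weights the rare adverse class during training.
forest = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)

# Calibration maps raw forest scores to probabilities that behave like risks.
calibrated = CalibratedClassifierCV(forest, method="isotonic", cv=3)
calibrated.fit(X_train, y_train)
risk_scores = calibrated.predict_proba(X_test)[:, 1]

# Hypothetical ward capacity: staff can review roughly the top 15% of charts.
review_capacity = 0.15
threshold = np.quantile(risk_scores, 1 - review_capacity)
flagged = risk_scores >= threshold

recall_at_capacity = y_test[flagged].sum() / y_test.sum()
print(f"threshold={threshold:.3f}  flagged={flagged.mean():.1%}  recall={recall_at_capacity:.1%}")
```

Choosing the threshold from capacity rather than fixing it at 0.5 keeps the decision with the clinician: a ward that can investigate more charts lowers the threshold and gains recall.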

I'd also push for facility-level features including staffing levels, distance to referral hospital, and equipment availability, because outcome risk is not purely biological. A borderline case in a well-staffed urban facility is a different decision from the same case in a rural clinic. Longer term, the interesting question is whether the model can be retrained on a single facility's data after deployment, so it adapts to local patient mix rather than assuming Kenya and Uganda look the same as each other.

Curious about the modelling code, the cleaning pipeline, or the evaluation notebooks?

View on GitHub