Machine-learning-based lead generation, modelling stress scenarios through multi-correlated GBMs and Brownian bridges, forecasting credit defaults with random forests… I never expected our first Data Science post to be about predicting the weight of babies!
Why this topic?
The reason is better than any of those topics: on October 7th Ebury’s Data Team won the Bill and Melinda Gates Foundation’s “Global Health in Numb3rs Hackathon”, organized by ODSC (Open Data Science Conference) London and focused on ‘Healthy Birth, Growth and Development’. From the event’s website:
[…] the Gates Foundation’s vision is to ensure a world where every person has the opportunity to live a healthy, productive life. The Foundation’s Healthy Birth, Growth, and Development knowledge integration (HBGDki) initiative is a global data-driven project. We have integrated many data sets about child growth and development into a large knowledge base. We are using these data sets to learn about factors that stop body and brain growth, and to develop optimal solutions […].
Our challenge was clear: how can ultrasonogram measurements be used to predict fetal weight at week 40? As for the data: over 17k measurements of head and abdominal circumference, biparietal diameter and femur length, taken between weeks 26 and 39 of gestation across 2.5k subjects in two different countries.
Our young team of data scientists seemed better equipped to launch a rocket than to deliver a baby (Inigo and Pedro are aeronautical engineers, while Antonio, Vicente and myself are electromechanical engineers). We rushed into the exploratory analysis, and two hours’ worth of wrangling later our best conclusions were: postnatal data from babies should be discarded, and sex doesn’t seem important – but we’ll see later…
We decided to call a timeout for lunch and tackle the problem with fresh eyes. Back in the game, we faced a key decision: should we approach the dataset in a wide format (that is, one baby per row, with 1 to 15 measurements per baby) or in a long format (one row per measurement)? We split the team in two and addressed the problem both ways.
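The long-to-wide reshaping is a one-liner in pandas. A minimal sketch on toy data (the column names `subject_id`, `week` and `abdominal_circ` are our illustration, not the actual hackathon schema):

```python
import pandas as pd

# Hypothetical long-format data: one row per ultrasound measurement.
long_df = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "week": [28, 34, 30],
    "abdominal_circ": [24.1, 30.5, 26.3],
})

# Long -> wide: one row per baby, one column per measurement week.
# Weeks a baby was not measured become NaN, which tree models can
# often tolerate after imputation.
wide_df = long_df.pivot(index="subject_id", columns="week",
                        values="abdominal_circ")
print(wide_df.shape)  # (2, 3): two babies, three distinct weeks
```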
For the first approach to make sense, gestational age at the time of measurement was added as a predictive feature, and all physical variables were weighted by time. This accounts for the fact that measurements taken closer to week 40 should be more relevant than those taken earlier.
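As a sketch of what such time-weighting could look like: scale each measurement by how close to term it was taken. The linear `week / term` weight below is our illustrative choice, not necessarily the exact scheme used at the hackathon.

```python
def time_weight(value, week, term=40):
    """Discount a measurement by its distance from full term (week 40).

    A reading at week 39 keeps almost its full value; the same reading
    at week 26 is discounted proportionally. Linear weighting is an
    assumption for illustration.
    """
    return value * (week / term)

late = time_weight(30.0, 39)   # measured near term
early = time_weight(30.0, 26)  # measured much earlier
print(late, early)  # 29.25 19.5
```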
Models and features
As for the model chosen: an ensemble of boosted trees seemed like a good option. While our dataset wasn’t massive, cross-validation showed that it outperformed other models such as Support Vector Machines and Random Forests – once the most appropriate regularisation and learning-rate parameters were found.
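A minimal sketch of that comparison with scikit-learn, on synthetic data (the real dataset isn’t reproduced here), scoring each model family by cross-validated mean absolute error:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the fetal-measurement dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

# The three model families we tried; hyperparameters here are defaults,
# not the tuned values from the hackathon.
models = {
    "boosted_trees": GradientBoostingRegressor(learning_rate=0.1, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "svm": SVR(),
}
for name, model in models.items():
    # scikit-learn returns the *negative* MAE, so higher is better.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_absolute_error")
    print(name, -scores.mean())
```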
Having chosen what we considered to be the optimal model, it was time to select the best features. Brute-forcing all variables into the model is always an option, but experience has shown us that engineering the right features is what really boosts a model’s performance. Well, if there’s one thing we learned back at uni, it’s rigid-body modelling, volume integrals and geometry parameterisation. That said, we decided to put these concepts to good use and approached the problem as a mechanical one: assuming that density rho is homogeneous across the body, can we not model mass as a direct function of volume?
Given that we had no Catia licenses around, we got hands-on with feature engineering and ended up with some curious yet relevant predictors: abdominal circumference squared times femur length (yes, something like the volume of a cylinder), head circumference over biparietal diameter (similar to the eccentricity of an ellipse, representing a zenithal view of the head) and femur length over abdominal circumference (which we defined as the aspect ratio of the fetus). We later discovered that the eccentricity proxy captures a degree of cephalic disorder, which may help discriminate fetuses with growth disorders; the aspect ratio, on the other hand, is a measure of slenderness. These features, together with gestational age and several other time-weighted and standalone features, soon put us at a mean absolute error of 0.23 on the out-of-sample test (an average error of 230g, considering the magnitudes of the problem).
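The three engineered predictors are simple arithmetic on the raw measurements. A sketch with hypothetical column names (`ac` = abdominal circumference, `fl` = femur length, `hc` = head circumference, `bpd` = biparietal diameter):

```python
import pandas as pd

# Toy values in cm; not real subjects.
df = pd.DataFrame({
    "ac": [25.0, 31.0],
    "fl": [5.0, 7.0],
    "hc": [26.0, 32.0],
    "bpd": [7.0, 8.5],
})

df["cylinder_volume"] = df["ac"] ** 2 * df["fl"]  # volume-like proxy for mass
df["eccentricity"] = df["hc"] / df["bpd"]         # zenithal head-shape ratio
df["aspect_ratio"] = df["fl"] / df["ac"]          # slenderness of the fetus
print(df[["cylinder_volume", "eccentricity", "aspect_ratio"]])
```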
One final touch – taking the logarithm of the target – brought the test mean percentage error down to 6.5% (mean absolute error of 0.2, mean squared error of 0.045). No credit to us for this one – after all, experts in the field do it with their linear models, so we borrowed the idea for our highly non-linear tree-based approach.
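The log-target trick is easy to wire up in scikit-learn: `TransformedTargetRegressor` fits the model on `log(y)` and exponentiates predictions back to the original scale. A minimal sketch on synthetic positive targets (not the fetal data):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.compose import TransformedTargetRegressor

# Fit on log(weight), predict, then invert with exp automatically.
model = TransformedTargetRegressor(
    regressor=GradientBoostingRegressor(random_state=0),
    func=np.log,
    inverse_func=np.exp,
)

rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(200, 3))
y = np.exp(X.sum(axis=1) / 10)  # strictly positive synthetic target

model.fit(X, y)
pred = model.predict(X[:1])
print(pred)  # back on the original (kg-like) scale
```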
We presented our approach and results to a jury of experts in the field: Chris Fregly (Research Specialist at PipelineIO, Netflix, Databricks and Spark), Amanda Schierz (Top Kaggler and Data Scientist at DataRobot), Ajit Jaokar (Director at the AI for Smart Cities Lab at Politecnica de Madrid) and Ankur Modi (CEO at StatusToday, an AI startup). They challenged the potential implementation of our model, but valued the thinking behind the feature engineering, the model’s overall performance on our sample data, and the possibility of using it for an expectant mother with no prior measurement records.
As winners of the competition, we will continue collaborating with the Foundation on this project, improving the model’s performance and generalisation and exploring the limits of a real-life implementation.