Ebury’s Data Science Team wins Gates Foundation Hacking Global Health Hackathon

Data science

by Enrique Colin on October 31, 2016

Machine learning based lead generation, modelling stress scenarios through multi-correlated GBMs and Brownian bridges, forecasting credit defaults using random forests… I never expected our first Data Science post to be about measuring the weight of babies!

Why this topic?

The reason is better than any of those topics: on October 7th, Ebury’s Data Team won the Bill and Melinda Gates Foundation’s “Global Health in Numb3rs Hackathon”, organized by ODSC (Open Data Science Conference) London and focused on ‘Healthy Birth, Growth and Development’. From the event’s website:

[…] the Gates Foundation’s vision is to ensure a world where every person has the opportunity to live a healthy, productive life. The Foundation’s Healthy Birth, Growth, and Development knowledge integration (HBGDki) initiative is a global data-driven project. We have integrated many data sets about child growth and development into a large knowledge base. We are using these data sets to learn about factors that stop body and brain growth, and to develop optimal solutions […].

The challenge

Our challenge was clear: how can ultrasonogram measurements be used to predict fetal weight at week 40? As for the data: over 17k measurements of head and abdominal circumference, biparietal diameter and femur length, carried out between weeks 26 and 39 of gestation across 2.5k subjects in two different countries.

Our young team of data scientists seemed better equipped to launch a rocket than to deliver a baby (Inigo and Pedro are aeronautical engineers, while Antonio, Vicente and I are electromechanical engineers). We rushed into the exploratory analysis and, two hours’ worth of wrangling later, our best conclusions were: postnatal data from babies should be discarded and sex doesn’t seem important, but we’ll see later…

Next steps

We decided to call a timeout for lunch and tackle the problem with fresh eyes. Back in the game, we faced a key decision: should we approach the dataset in a wide format (that is, one baby per row, 1 to 15 measurements per baby) or in a long format (one row per measurement)? We decided to split the team in two and address the problem both ways.
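To make the two layouts concrete, here is a minimal sketch of turning long-format records into wide-format rows. The column names (`subject_id`, `gest_week`, `abdominal_circ_cm`, `femur_length_cm`) and the sample values are invented for illustration; they are not the hackathon's actual schema or data.

```python
from collections import defaultdict

# Hypothetical long-format records: one row per ultrasound measurement.
long_rows = [
    {"subject_id": "A", "gest_week": 28, "abdominal_circ_cm": 24.0, "femur_length_cm": 5.3},
    {"subject_id": "A", "gest_week": 34, "abdominal_circ_cm": 30.1, "femur_length_cm": 6.5},
    {"subject_id": "B", "gest_week": 30, "abdominal_circ_cm": 26.2, "femur_length_cm": 5.8},
]


def long_to_wide(rows, id_key="subject_id", time_key="gest_week"):
    """Collapse one-row-per-measurement records into one row per subject."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[id_key]].append(row)

    wide = []
    for subject, visits in grouped.items():
        record = {id_key: subject}
        # Number each visit chronologically: gest_week_1, gest_week_2, ...
        for i, visit in enumerate(sorted(visits, key=lambda v: v[time_key]), start=1):
            for name, value in visit.items():
                if name != id_key:
                    record[f"{name}_{i}"] = value
        wide.append(record)
    return wide


wide_rows = long_to_wide(long_rows)
```

In the wide layout, subjects with fewer visits simply lack the later columns, which is one reason the wide format needs extra care (padding or aggregation) before modelling.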

For the first approach to make sense, gestational age at the time of measurement was added as a predictive feature and all physical variables were weighted by time. This would account for the fact that measurements conducted closer to week 40 should be more relevant than those conducted earlier.
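One possible reading of "weighted by time" is sketched below: each measurement gets a weight that grows linearly with gestational age, and the measurements of a subject are combined into a weighted mean. The linear weight `gest_week / 40` and the function names are assumptions made for this example, not the exact scheme used on the day.

```python
TERM_WEEK = 40  # target week for the weight prediction


def time_weight(gest_week, term_week=TERM_WEEK):
    """Assumed linear weight: measurements nearer to term count more."""
    return gest_week / term_week


def time_weighted_mean(measurements):
    """Combine (gest_week, value) pairs into a single time-weighted mean."""
    weights = [time_weight(week) for week, _ in measurements]
    weighted_sum = sum(w * value for w, (_, value) in zip(weights, measurements))
    return weighted_sum / sum(weights)


# Hypothetical abdominal circumferences (cm) at weeks 30 and 40.
example = [(30, 10.0), (40, 20.0)]
weighted = time_weighted_mean(example)
simple = sum(value for _, value in example) / len(example)
```

Here `weighted` lands above `simple` because the later, larger measurement carries more weight.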

Models and features

As for the model chosen: an ensemble of boosted trees seemed like a good option. Whilst our dataset wasn’t massive, cross validation showed that it outperformed other models such as Support Vector Machines or Random Forests, once the most appropriate regularisation and learning rate parameters were found.
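For readers unfamiliar with boosting, the toy sketch below shows the core idea in plain Python: each round fits a one-split tree (a "stump") to the current residuals and adds a fraction of it, the learning rate, to the running prediction. It is an illustration only, written from scratch on invented one-dimensional data; it is not the library, the hyperparameters or the cross-validation setup we used.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) for the best single split."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # drop the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds, learning_rate):
    """Gradient boosting for squared error: repeatedly fit stumps to residuals."""
    base = mean(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for p, x in zip(predictions, xs)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + sum(
        learning_rate * (left if x <= threshold else right)
        for threshold, left, right in stumps
    )


def mean_squared_error(model, xs, ys):
    return mean((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))


# Invented toy data, purely to exercise the code.
xs = [26.0, 28.0, 30.0, 32.0, 34.0, 36.0, 38.0, 39.0]
ys = [0.9, 1.1, 1.5, 1.9, 2.3, 2.7, 3.1, 3.3]

one_round = fit_boosted_stumps(xs, ys, n_rounds=1, learning_rate=0.1)
fifty_rounds = fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.1)
```

A smaller learning rate needs more rounds to reach the same training error but tends to generalise better, which is why it has to be tuned together with the regularisation parameters.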

Having chosen what we considered to be the optimal model, it was time to select the best features. Brute forcing all variables into the model is always an option, but experience has shown us that engineering the right features is what boosts a model’s performance. Well, if there’s one thing we learned back at uni it is rigid body modelling, volume integrals and geometry parameterisation. That said, we decided to make good use of these concepts and approached the problem as a mechanical one: assuming that density rho is homogeneous across the body, can we not model mass as a direct function of volume?

Given that we had no Catia licenses around, we got hands-on with feature engineering and ended up with some curious yet relevant predictors: abdominal circumference squared times femur length (yes, something like the volume of a cylinder), head circumference to biparietal diameter (similar to the eccentricity of an ellipse, represented by a zenithal view of the head) and femur length to abdominal circumference (which we defined as the aspect ratio of the fetus). We later discovered that the eccentricity proxy represents a magnitude of cephalic disorder, which may help discriminate fetuses with growth disorders. The aspect ratio, on the other hand, is a measure of slenderness. These features, together with gestational age and several other time-weighted and standalone features, soon put us at a mean absolute error of 0.23 on the out-of-sample test (roughly a 230 g average error, given that weight is measured in kilograms).
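The three geometric predictors can be written down in a few lines. The sketch below is illustrative: the function name, argument names and sample values are made up, and units only need to be consistent. The cylinder analogy follows from V = pi * r^2 * h with r = C / (2 * pi), which gives V = C^2 * h / (4 * pi), so C^2 * h is proportional to the volume.

```python
def engineered_features(head_circ, abdominal_circ, biparietal_diam, femur_length):
    """Compute the three geometry-inspired predictors described in the text."""
    return {
        # Proportional to the volume of a cylinder with circumference
        # `abdominal_circ` and height `femur_length` (constant 1/(4*pi) dropped).
        "cylinder_volume_proxy": abdominal_circ ** 2 * femur_length,
        # Ratio of head circumference to biparietal diameter, used as a
        # stand-in for the eccentricity of the head seen from above.
        "head_eccentricity_proxy": head_circ / biparietal_diam,
        # Femur length relative to abdominal circumference: slenderness.
        "aspect_ratio": femur_length / abdominal_circ,
    }


# Hypothetical measurements in centimetres, for illustration only.
features = engineered_features(
    head_circ=32.0, abdominal_circ=30.0, biparietal_diam=8.0, femur_length=6.0
)
```

Since tree-based models split on one feature at a time, handing them these products and ratios directly saves the trees from having to approximate them with many splits.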

One final touch – taking the logarithm of the target – and the test mean percentage error came down to 6.5% (mean absolute error of 0.2, mean squared error of 0.045). No credit on this – after all, experts in the field do this with their linear models so we decided to borrow the idea for our highly non-linear tree-based approach.
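The log-target trick amounts to training on the logarithm of the weight and exponentiating the model's output back to kilograms before scoring. The sketch below shows only that round trip and the error metrics; the numbers are invented, the "predictions" are hard-coded rather than produced by a model, and the percentage metric shown is the mean absolute percentage error, which is an assumption about how the 6.5% figure was computed.

```python
import math


def mean_absolute_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    return 100.0 * sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical birth weights in kg and the log-space targets a model would train on.
actual_kg = [3.0, 3.5, 2.8]
log_targets = [math.log(w) for w in actual_kg]

# Stand-in for model output in log space (hard-coded for illustration).
predicted_log = [math.log(3.2), math.log(3.3), math.log(2.9)]

# Map predictions back to kilograms before computing errors.
predicted_kg = [math.exp(p) for p in predicted_log]

mae = mean_absolute_error(actual_kg, predicted_kg)
mse = mean_squared_error(actual_kg, predicted_kg)
mape = mean_absolute_percentage_error(actual_kg, predicted_kg)
```

Working in log space makes the model minimise relative rather than absolute errors, which suits a target whose spread grows with its size.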

Results

We presented our approach and results to a jury formed by experts in the field: Chris Fregly (Research Specialist at PipelineIO, Netflix, Databricks and Spark), Amanda Schierz (Top Kaggler and Data Scientist at DataRobot), Ajit Jaokar (Director at AI for Smart Cities Lab at Politecnica de Madrid) and Ankur Modi (CEO at StatusToday, AI startup). They challenged the potential implementation of our model, but valued the thinking behind the feature engineering, the overall performance on our sample data and the possibility of using it for an expectant mother with no prior measurement records.

As winners of the competition, we will continue collaborating with the Foundation on this project by improving the model’s performance and generalisation and exploring the limits of a real life implementation.
