I’ve taken a little bit of time away from my personal research in the past couple of weeks to look at how I can apply the modeling skills that I’ve developed over the years to help with the COVID-19 outbreak. I have been working with my brother (a biostatistician) and some folks at Napier to try to build a model that forecasts the total number of cases in the short term, just by looking at the data. Our aim is not to replace the epidemiologists’ models or to produce a more accurate one (they’re certainly the experts in that), but rather to identify which variables matter most in predicting the number of cases from a purely machine learning perspective. Essentially, we are asking: what aspects of country X will be influential in determining the number of cases in country Y? And how can we use this data to support decision making in countries whose curves are further behind?
We are using three primary data types: (1) the number of cases and their spread, (2) non-pharmaceutical interventions such as physical distancing and closing borders, and (3) static socioeconomic data about the country such as per-capita GDP, healthcare ratings, population structure, etc. We then apply a random forest model to distill the 100+ variables into a 14-day forecast.
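To make the setup a little more concrete, here is a minimal sketch of how the three data types could be combined into training rows for a 14-day-ahead forecast. This is not our actual pipeline: the function name `build_rows`, the feature names, the number of lagged days, and the toy numbers are all assumptions made up for illustration.

```python
from typing import Dict, List, Tuple

HORIZON = 14  # forecast horizon in days (the 14-day forecast from the post)
N_LAGS = 3    # hypothetical choice: how many previous days of cases to use


def build_rows(
    cases: List[float],
    interventions: Dict[str, List[int]],
    static: Dict[str, float],
    n_lags: int = N_LAGS,
    horizon: int = HORIZON,
) -> List[Tuple[Dict[str, float], float]]:
    """Turn one country's data into (features, target) pairs.

    cases:         cumulative case count, one value per day
    interventions: name -> daily 0/1 flag (e.g. borders closed or not)
    static:        name -> fixed country-level value (e.g. per-capita GDP)
    """
    rows = []
    # Day t needs n_lags days of history behind it and `horizon` days ahead.
    for t in range(n_lags - 1, len(cases) - horizon):
        features: Dict[str, float] = {}
        # (1) case counts on day t and the days just before it
        for k in range(n_lags):
            features[f"cases_lag_{k}"] = float(cases[t - k])
        # (2) which interventions were active on day t
        for name, flags in interventions.items():
            features[f"npi_{name}"] = float(flags[t])
        # (3) static socioeconomic descriptors, repeated on every row
        for name, value in static.items():
            features[f"static_{name}"] = float(value)
        # Target: the case count `horizon` days after day t
        rows.append((features, float(cases[t + horizon])))
    return rows


# Toy data, invented purely for illustration (20 days of one country).
toy_cases = [float(c) for c in range(0, 200, 10)]
toy_npis = {"border_closed": [0] * 10 + [1] * 10}
toy_static = {"gdp_per_capita": 40000.0}

for features, target in build_rows(toy_cases, toy_npis, toy_static):
    print(features, "->", target)
```

Rows like these, pooled across the training countries, could then be handed to any random forest regressor, for example scikit-learn's `RandomForestRegressor`, whose `feature_importances_` attribute gives one possible ranking of the variables.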
We are still developing this project and tweaking things all the time. We hope to turn this into an interactive visualization soon, and I will certainly share that here when it’s ready. But in the meantime, you can see below a snapshot of how the model is performing. On the top left is all the country data we are using to train the model. On the top right is the country we are testing: the dashed line is the model’s prediction, and the solid line is the actual observed data. Below that is the importance of each variable in the model. Interestingly, after the number of cases in the previous ‘n’ days (as expected), other variables start to become important, such as quarantining cases, social distancing, and a country’s preparedness for an outbreak such as this. Note that the flat part at the top of Italy’s curve is a symptom of not having good data past that point, which is an area we are trying to improve this week.
While there has been lots of time spent indoors these days, I’ve been able to get my “once-a-day” exercise at Holyrood Park, which has been fantastic with the current Spring weather. There have been nearly full weeks of sun here in Scotland, making it a magical time, especially with the days getting longer. I just wish I could explore more than just outside my door…