This project is my entry to the DengAI: Predicting Disease Spread competition on DrivenData (https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/).
The competition asks you to predict the number of new dengue cases in two cities from a set of features, some of which depend on the target and some of which are independent of it.
The prediction is a forecast of future cases, given the known behaviour of all the features.
The features involved are related to temperature, vegetation density, humidity and rain.
Correlation between the features, correlation between features and target, autocorrelation and seasonality are taken into account.
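As a rough illustration of these checks, here is a minimal sketch with NumPy. The data is synthetic, not the competition data, and the 8-week delay between temperature and cases is purely an assumption made for the example. The helper names `pearson` and `autocorr` are hypothetical.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation between two equally long 1-D sequences."""
    return float(np.corrcoef(x, y)[0, 1])

def autocorr(series, lag):
    """Correlation between a series and itself shifted by `lag` steps."""
    s = np.asarray(series, dtype=float)
    return pearson(s[:-lag], s[lag:])

# Synthetic weekly data: four years with a 52-week cycle.
weeks = np.arange(208)
temperature = 27 + 3 * np.sin(2 * np.pi * weeks / 52)
# Assumption for the example only: cases follow temperature 8 weeks later.
cases = 20 + 15 * np.sin(2 * np.pi * (weeks - 8) / 52)

same_week = pearson(temperature, cases)           # feature vs target, no shift
shifted = pearson(temperature[:-8], cases[8:])    # feature vs target 8 weeks later
yearly = autocorr(cases, 52)                      # seasonality: one full cycle
half_year = autocorr(cases, 26)                   # half a cycle, opposite phase

print(f"same week: {same_week:.2f}, shifted 8 weeks: {shifted:.2f}")
print(f"autocorr lag 52: {yearly:.2f}, lag 26: {half_year:.2f}")
```

In this toy series the shifted correlation is much stronger than the same-week one, and the autocorrelation peaks at the seasonal lag. That pattern is the kind of signal that motivates lagged features and seasonal models.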
Differencing the features and adding lagged features are considered as well.
An example of the target and a few features is shown below:
For the solution, an ensemble of the following models was used:
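A minimal sketch of differencing and lagging with pandas follows. The column names, the lag choices and the helper `add_lags_and_diffs` are hypothetical and not taken from my repository. With two cities in the data, this would have to be applied to each city separately so that values do not leak from one series into the other.

```python
import pandas as pd

def add_lags_and_diffs(df, columns, lags=(1, 2, 4)):
    """Return a copy of df with first differences and lagged copies of `columns`."""
    out = df.copy()
    for col in columns:
        # First difference: change with respect to the previous week.
        out[f"{col}_diff1"] = out[col].diff()
        for lag in lags:
            # Value observed `lag` weeks earlier.
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
    # The first max(lags) rows have no complete history, so drop them.
    return out.dropna().reset_index(drop=True)

# Made-up weekly values for illustration only.
weekly = pd.DataFrame({
    "temp_c": [26.1, 26.8, 27.4, 28.0, 27.9, 27.2, 26.5, 26.0],
    "rain_mm": [12.0, 0.0, 35.5, 80.2, 41.0, 5.3, 0.0, 18.7],
})
features = add_lags_and_diffs(weekly, ["temp_c", "rain_mm"])
print(features.head())
```

Dropping the incomplete rows is only one option. Tree-based models such as XGBoost can also handle the missing values directly.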
DNN LSTM (Long Short-Term Memory)
Prophet
VARMAX (Vector Autoregressive Moving Average with eXogenous regressors)
SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors)
Bagging Regressor
Random Forest Regressor
XGB Regressor
Negative Binomial
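This description does not say how the individual forecasts are combined. One common option, shown here only as an assumption, is a (weighted) average of the per-model predictions, rounded and clipped so the result is a valid case count. The function `ensemble_average` and the numbers are hypothetical.

```python
import numpy as np

def ensemble_average(predictions, weights=None):
    """Blend per-model forecasts into non-negative integer case counts.

    predictions: dict mapping model name -> array of forecasts (same length).
    weights:     optional dict mapping model name -> weight; equal if omitted.
    """
    names = sorted(predictions)
    stacked = np.vstack([predictions[n] for n in names])  # (n_models, n_weeks)
    if weights is None:
        w = np.ones(len(names))
    else:
        w = np.array([weights[n] for n in names], dtype=float)
    blended = np.average(stacked, axis=0, weights=w)
    # Case counts cannot be negative or fractional.
    return np.clip(np.rint(blended), 0, None).astype(int)

# Made-up forecasts for three weeks from three of the models.
preds = {
    "xgb": np.array([10.0, 14.0, 30.0]),
    "random_forest": np.array([12.0, 16.0, 24.0]),
    "sarimax": np.array([8.0, 18.0, -3.0]),
}
combined = ensemble_average(preds)
only_xgb = ensemble_average(preds, {"xgb": 1, "random_forest": 0, "sarimax": 0})
print(combined)
```

Weights could be set from each model's validation MAE, so that better models count more.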
Training was done either on the entire dataset or with a rolling window; the better score of the two approaches was kept for the final ensemble.
Below is an example of a prediction with the XGBRegressor, without modifying or adding features:
Prediction of new dengue cases for a single city
My result was submitted to the competition:
Ranked 371 out of 9000 participants (MAE 21.2236)
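The comparison between the two training schemes can be sketched as a walk-forward evaluation. In this sketch the "model" is just the mean of the training window, a deliberately simple stand-in for the real regressors. The window size, the toy series and the function name `walk_forward_mae` are assumptions for illustration.

```python
from statistics import mean

def walk_forward_mae(series, start, window=None):
    """One-step-ahead mean absolute error from index `start` onwards.

    window=None trains on all history before each step (entire dataset so far);
    window=k trains only on the last k observations (rolling window).
    """
    errors = []
    for t in range(start, len(series)):
        history = series[:t] if window is None else series[max(0, t - window):t]
        forecast = mean(history)  # stand-in for fitting and predicting with a real model
        errors.append(abs(series[t] - forecast))
    return mean(errors)

# Made-up weekly case counts with an outbreak-like rise and fall.
cases = [4, 5, 6, 8, 12, 20, 31, 40, 44, 41, 35, 30]
scores = {
    "full": walk_forward_mae(cases, start=4),
    "rolling_3": walk_forward_mae(cases, start=4, window=3),
}
best = min(scores, key=scores.get)  # keep the scheme with the lower MAE
print(scores, best)
```

On this toy series the rolling window wins because it tracks the recent level. On real data the outcome can differ per model, which is why both schemes are scored.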
The code can be found in my GitHub repository: