Time Series Malaria Commodity Prediction

Theophilus Owiti

Nairobi, Nakuru County

Malaria commodities are currently tracked and restocked on the basis of 6-month averages that MoH officials compute from KHIS data. These averages do not capture consumption trends well, so XGBoost is used to predict monthly consumption as a time series.

Project status: Under Development

Artificial Intelligence

Overview / Usage

Malaria commodities are currently tracked and restocked on the basis of 6-month averages that MoH officials compute from KHIS data. These averages do not give a clear picture of consumption trends such as seasonality, increases or decreases, and other confounding factors. A KHIS representative described the problems they have faced when using the averages to restock Malaria commodities. The representative demonstrated a KHIS-based system and asked for predictions in place of the projected averages, to create an efficient way of deciding when to restock Malaria commodities as demand changes.

Need of client

KHIS therefore needs a time series predictive model that can track common patterns and trends and give out projections of expected consumption for areas of low demand and areas of high demand, so that Malaria commodities can be restocked efficiently. The desired machine learning model is XGBoost.
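To make the limitation of the averaging approach concrete, the following is a minimal sketch of a 6-month-average forecast on a seasonal series. The data is synthetic and every name and value in it (`consumption`, `six_month_avg`, the amplitude of the yearly cycle) is an assumption for illustration, not KHIS data or the MoH procedure.

```python
import numpy as np
import pandas as pd

# Synthetic monthly consumption with a yearly cycle (hypothetical values, not KHIS data).
index = pd.date_range("2019-01-01", periods=36, freq="MS")
months = index.month.to_numpy()
consumption = pd.Series(1000 + 300 * np.sin(2 * np.pi * months / 12), index=index)

# Forecast for each month = mean of the previous six months.
# shift(1) ensures the month being forecast is not part of its own average.
six_month_avg = consumption.rolling(window=6).mean().shift(1)

# Forecast error for the months where six earlier observations exist.
errors = (consumption - six_month_avg).dropna()
baseline_rmse = float(np.sqrt((errors ** 2).mean()))
print(f"6-month-average baseline RMSE: {baseline_rmse:.1f}")
```

Because the average smooths over half of the yearly cycle, it lags behind seasonal peaks and troughs, which is the behaviour a time series model is expected to improve on.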

Methodology / Approach

Feature Engineering

In a trial Python notebook, the data is analysed to find out whether the series is stationary. The rolling mean and rolling standard deviation are plotted to give a rough visual indication of stationarity, that is, whether the trends are consistent across seasons. The augmented Dickey-Fuller test from Python's statsmodels library is then used to obtain the test statistic and p-value, which indicate whether the series can be treated as stationary. In the official notebook, the data is then split into training and test sets. A random split is not appropriate for a time series, so the most recent years, from 2021 onwards, are used as the test set and the earlier years as the training set. The split is done with the pandas loc indexer.

Model Development

Several models were tested on the new features derived from the indexed dataset. The machine learning models of choice were: Linear Regression built with TensorFlow neural networks using Sequential layers, Gradient Boosting Regressor, and lastly Extreme Gradient Boosting (XGBoost). Linear Regression produced a large MSE, did not fit any of the curves well, and generalized very poorly to new data. Gradient Boosting Regressor consistently underfit, produced large RMSE and MSE values, and generalized poorly, projecting impossible values even after the learning rate and the number of boosted trees were tuned. Extreme Gradient Boosting produced acceptable predictions with an RMSE of about 500, although overfitting was observed on the cross-validation data set. The number of trees was therefore reduced and a more appropriate learning rate was adopted. After these checks, it was concluded that the model fit the curve as intended. The algorithm comes from the xgboost library in Python and was wrapped in a function with the relevant parameters for optimal performance.

Summary: The metrics used for the model were the Root Mean Squared Error (RMSE) and the Mean Squared Error (MSE). Note that the regressor reports only RMSE by default. The feature importances indicate that the quarter feature is a poor predictor of commodity consumption, while the month feature has the highest importance. pandas was used to tabulate the feature importances for plotting. XGBR is a regression model that builds an ensemble of decision trees sequentially, where each new tree is trained to correct the errors of the trees before it. The final prediction is the sum of the contributions from all the trees in the ensemble, each scaled by the learning rate. It was chosen because it performed better than the other models and because overfitting was easier to identify using the cross-validation function.

Technologies Used

  • XGBoost
  • Jupyter Notebooks
  • Scikit-learn
  • SciPy
  • Pandas
  • NumPy
  • statsmodels
  • Python

Repository

https://github.com/tiprock-network/Malaria-Commodities-Demand-Prediction-Model
