Neural networks like Long Short-Term Memory (LSTM) recurrent neural networks are able to almost seamlessly model problems with multiple input variables. This is a great benefit in time series forecasting, where classical linear methods can be difficult to adapt to multivariate or multiple input forecasting problems.

In this tutorial, you will discover how you can develop an LSTM model for multivariate time series forecasting in the Keras deep learning library.

After completing this tutorial, you will know:

- How to transform a raw dataset into something we can use for time series forecasting.
- How to prepare data and fit an LSTM for a multivariate time series forecasting problem.
- How to make a forecast and rescale the result back into the original units.

Let’s get started.

**Updated Aug/2017**: Fixed a bug where yhat was compared to obs at the previous time step when calculating the final RMSE. Thanks, Songbin Xu and David Righart.

## Tutorial Overview

This tutorial is divided into 3 parts; they are:

- Air Pollution Forecasting
- Basic Data Preparation
- Multivariate LSTM Forecast Model

## Python Environment

This tutorial assumes you have a Python SciPy environment installed. You can use either Python 2 or 3 with this tutorial.

You must have Keras (2.0 or higher) installed with either the TensorFlow or Theano backend. The tutorial also assumes you have scikit-learn, Pandas, NumPy and Matplotlib installed.

If you need help with your environment, see this post:

## 1. Air Pollution Forecasting

In this tutorial, we are going to use the Air Quality dataset. This dataset reports the weather and the level of pollution each hour for five years at the US embassy in Beijing, China.

The data includes the date-time, the pollution level measured as PM2.5 concentration, and weather information including dew point, temperature, pressure, wind direction, wind speed and the cumulative number of hours of snow and rain. The complete feature list in the raw data is as follows:

- **No**: row number
- **year**: year of data in this row
- **month**: month of data in this row
- **day**: day of data in this row
- **hour**: hour of data in this row
- **pm2.5**: PM2.5 concentration
- **DEWP**: Dew Point
- **TEMP**: Temperature
- **PRES**: Pressure
- **cbwd**: Combined wind direction
- **Iws**: Cumulated wind speed
- **Is**: Cumulated hours of snow
- **Ir**: Cumulated hours of rain
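
As a rough illustration of how a file with this column layout might be read, the sketch below parses a tiny in-memory sample and combines the four date columns into a single Pandas index. The sample rows are synthetic and the `load_raw` helper is a hypothetical name; neither comes from the real dataset or from this tutorial's listings.

```python
from io import StringIO

import pandas as pd

# Synthetic rows that only mimic the column layout listed above;
# the values are invented for illustration and are not real measurements.
SAMPLE = """No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
1,2010,1,1,0,NA,-21,-11.0,1021.0,NW,1.79,0,0
2,2010,1,1,1,NA,-21,-12.0,1020.0,NW,4.92,0,0
3,2010,1,1,2,120.0,-20,-11.0,1019.0,SE,6.71,0,0
"""


def load_raw(source):
    """Hypothetical helper: read the raw layout and build a date-time index."""
    df = pd.read_csv(source)
    # Combine year/month/day/hour into one timestamp per row.
    df.index = pd.to_datetime(df[["year", "month", "day", "hour"]])
    df.index.name = "date"
    # The separate date parts and the row counter are no longer needed.
    return df.drop(columns=["No", "year", "month", "day", "hour"])


frame = load_raw(StringIO(SAMPLE))
print(frame.head())
```

Passing a real file path instead of the `StringIO` object should work the same way, assuming the file uses the same header row.
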

We can use this data and frame a forecasting problem where, given the weather conditions and pollution for prior hours, we forecast the pollution at the next hour.

This dataset can be used to frame other forecasting problems.

You can download the dataset from the UCI Machine Learning Repository.

Download the dataset and place it in your current working directory with the filename “

## 2. Basic Data Preparation

The data is not ready to use. We must prepare it first.

Below are the first few rows of the raw dataset.

The first step is to consolidate the date-time information into a single date-time so that we can use it as an index in Pandas.

A quick check reveals NA values for pm2.5 for the first 24 hours. We will, therefore, need to remove the first 24 rows of data. There are also a few scattered “NA” values later in the dataset; we can mark them with 0 values for now.

The script below loads the raw dataset and parses the date-time information as the Pandas DataFrame index. The “No” column is dropped and then clearer names are specified for each column. Finally, the NA values are replaced with “0” values and the first 24 hours are removed.

Running the example prints the first 5 rows of the transformed dataset and saves the dataset to “

Now that we have the data in an easy-to-use form, we can create a quick plot of each series and see what we have.

The code below loads the new “

Running the example creates a plot with 7 subplots showing the 5 years of data for each variable.

Line Plots of Air Pollution Time Series

## 3. Multivariate LSTM Forecast Model

In this section, we will fit an LSTM to the problem.

## LSTM Data Preparation

The first step is to prepare the pollution dataset for the LSTM.

This involves framing the dataset as a supervised learning problem and normalizing the input variables.
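
To make the normalization step concrete, here is a minimal sketch that assumes min-max scaling to the range [0, 1] with scikit-learn's `MinMaxScaler`. The choice of scaler and the toy values are assumptions made for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy matrix: 4 hourly rows, 3 features on very different scales (made-up values).
values = np.array([
    [129.0, -16.0, 1020.0],
    [148.0, -15.0, 1021.0],
    [159.0, -11.0, 1022.0],
    [181.0, -7.0, 1024.0],
])

# Rescale every column independently to the range [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(values)
print(scaled)

# The fitted scaler remembers each column's min and max, so the
# transformation can be undone later to recover the original units.
restored = scaler.inverse_transform(scaled)
```

Keeping the fitted `scaler` object around matters, because the same object is needed later to convert forecasts back into the original units.
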

We will frame the supervised learning problem as predicting the pollution at the current hour (t) given the pollution measurement and weather conditions at the prior time step.

This formulation is straightforward and just for this demonstration. Some alternate formulations you could explore include:

- Predict the pollution for the next hour based on the weather conditions and pollution over the last 24 hours.
- Predict the pollution for the next hour as above and given the “expected” weather conditions for the next hour.
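
The basic one-step framing (all inputs at t-1, pollution at t) can be sketched with the Pandas `shift()` method. The `frame_one_step` helper, the column names and the toy values below are hypothetical and only illustrate the idea.

```python
import pandas as pd


def frame_one_step(df, target):
    """Hypothetical helper: pair every feature at t-1 with the target at t."""
    # shift(1) moves each row down one step, so row t holds the values from t-1.
    lagged = df.shift(1).add_suffix("(t-1)")
    current = df[[target]].add_suffix("(t)")
    framed = pd.concat([lagged, current], axis=1)
    # The first row has no prior hour, so it contains NaN and is dropped.
    return framed.dropna()


# Toy data: made-up values for two of the series.
toy = pd.DataFrame({
    "pollution": [10.0, 20.0, 30.0, 40.0],
    "temp": [1.0, 2.0, 3.0, 4.0],
})
framed = frame_one_step(toy, "pollution")
print(framed)
```

The 24-hour formulation in the list above would follow the same pattern, concatenating `df.shift(k)` for each lag `k` from 1 to 24 instead of a single shift.
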

We can transform the dataset using the

First, the “

Next, all features are normalized, then the dataset is transformed into a supervised learning problem. The weather variables for the hour to be predicted (t) are then removed.

The complete code listing is provided below.

Running the example prints the first 5 rows of the transformed dataset. We can see the 8 input variables (input series) and the 1 output variable (pollution level at the current hour).

This data preparation is simple and there is more we could explore. Some ideas you could look at include:

- One-hot encoding wind direction.
- Making all series stationary with differencing and seasonal adjustment.
- Providing more than 1 hour of input time steps.

This last point is perhaps the most important, given that LSTMs use backpropagation through time when learning sequence prediction problems.

## Define and Fit Model

In this section, we will fit an LSTM on the multivariate input data.

First, we must split the prepared dataset into train and test sets. To speed up the training of the model for this demonstration, we will only fit the model on the first year of data, then evaluate it on the remaining 4 years of data. If you have time, consider exploring the inverted version of this test harness.

The example below splits the dataset into train and test sets, then splits the train and test sets into input and output variables. Finally, the inputs (X) are reshaped into the 3D format expected by LSTMs, namely [samples, timesteps, features].

Running this example prints the shape of the train and test input and output sets, with about 9K hours of data for training and about 35K hours for testing.

Now we can define and fit our LSTM model.

We will define the LSTM with 50 neurons in the first hidden layer and 1 neuron in the output layer for predicting pollution. The input shape will be 1 time step with 8 features.

We will use the Mean Absolute Error (MAE) loss function and the efficient Adam version of stochastic gradient descent.

The model will be fit for 50 training epochs with a batch size of 72. Remember that the internal state of the LSTM in Keras is reset at the end of each batch, so an internal state that is a function of a number of days may be helpful (try testing this).

Finally, we keep track of both the training and test loss during training by setting the

## Evaluate Model

After the model is fit, we can forecast for the entire test dataset.

We combine the forecast with the test dataset and invert the scaling. We also invert scaling on the test dataset with the expected pollution numbers.

With forecasts and actual values in their original scale, we can then calculate an error score for the model.
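
The reason the forecast must be combined with the other columns before inverting can be seen in a small sketch. It assumes the features were scaled together with scikit-learn's `MinMaxScaler` and that pollution is the first column; the data and the forecast values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up data: column 0 plays the role of pollution, columns 1-2 are other features.
raw = np.array([
    [100.0, 5.0, 1010.0],
    [150.0, 7.0, 1015.0],
    [200.0, 9.0, 1020.0],
    [300.0, 11.0, 1030.0],
])
scaler = MinMaxScaler(feature_range=(0, 1)).fit(raw)
scaled = scaler.transform(raw)

# Pretend these are model forecasts for the pollution column, in scaled units.
yhat_scaled = np.array([0.0, 0.3, 0.5, 0.9])

# The scaler expects all three columns, so the forecast is placed in column 0
# next to the scaled values of the remaining features before inverting.
combined = np.column_stack([yhat_scaled, scaled[:, 1:]])
inv_yhat = scaler.inverse_transform(combined)[:, 0]

# Invert the expected values the same way.
inv_y = scaler.inverse_transform(scaled)[:, 0]

# Root mean squared error, now in the original pollution units.
rmse = np.sqrt(np.mean((inv_y - inv_yhat) ** 2))
print(inv_yhat, rmse)
```

Only the first column of each inverted array is kept; the other columns act purely as padding so that the array has the shape the scaler was fitted on.
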

In this case, we calculate the Root Mean Squared Error (RMSE), which gives error in the same units as the variable itself.

## Complete Example

The complete example is listed below.

Running the example first creates a plot showing the train and test loss during training.

Interestingly, we can see that test loss drops below training loss. This is not the usual signature of overfitting, where test loss rises above training loss; it may instead reflect differences between the single training year and the four test years. Measuring and plotting RMSE during training may shed more light on this.

Line Plot of Train and Test Loss from the Multivariate LSTM During Training

The train and test loss are printed at the end of each training epoch. At the end of the run, the final RMSE of the model on the test dataset is printed.

We can see that the model achieves a respectable RMSE of 26.496, which is lower than an RMSE of 30 found with a persistence model.

This model is not tuned. Can you do better?

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

- Beijing PM2.5 Data Set on the UCI Machine Learning Repository
- The 5 Step Life-Cycle for Long Short-Term Memory Models in Keras
- Time Series Forecasting with the Long Short-Term Memory Network in Python
- Multi-step Time Series Forecasting with Long Short-Term Memory Networks in Python

## Summary

In this tutorial, you discovered how to fit an LSTM to a multivariate time series forecasting problem.

Specifically, you learned:

- How to transform a raw dataset into something we can use for time series forecasting.
- How to prepare data and fit an LSTM for a multivariate time series forecasting problem.
- How to make a forecast and rescale the result back into the original units.

Do you have any questions?