I applied advanced time series techniques and models to gain insight into the export prices of gold from the USA.
Below is an overview of the project's key components and methodologies.
The motivation for embarking on this project stemmed from my fascination with the predictive power inherent in time series data. Witnessing the ability to extrapolate future trends and patterns from historical data ignited my curiosity and propelled me to delve deeper into time series analytics.
Pandas: It provides data structures and functions for working with structured data, making it ideal for loading, cleaning, and preprocessing time series data.
NumPy: NumPy supports multidimensional arrays and mathematical functions, essential for handling numerical data in time series analysis.
Matplotlib or Seaborn: They provide tools for creating various plots and charts to visualize time series data, such as line plots, scatter plots, and histograms.
DateTime: It parses date-time data from time series datasets and performs time-based operations.
Statsmodels: Statsmodels is a library for statistical modeling and hypothesis testing in Python. I used its tools for fitting ARIMA models, conducting statistical tests, and performing time series decomposition.
scikit-learn: I used it for train-test splitting and evaluating forecast models.
pmdarima: The auto_arima function in pmdarima simplifies the process of fitting ARIMA models with different parameters and selecting the best model based on information criteria.
Data Type Conversion: The code converts the values in a specific column (column_name) of the DataFrame df to numeric format using the pd.to_numeric() function. Errors encountered during conversion are coerced to NaN (Not a Number) using the errors='coerce' parameter.
Analysis of Missing Values: The code prints the count of missing values (NaN) in the specified column using the .isna().value_counts() method.
Visualization of Missing Data: Two plots are created to visualize the missing data and the export prices over time.
Data Cleaning: Missing values are dropped from the DataFrame using the .dropna(axis=0, inplace=True) method, removing rows containing NaN values.
Data Truncation: Data is truncated to retain only records from 1995 onwards. Records before this date are discarded from the DataFrame.
Imputation (Not Applicable): Although there is a comment regarding imputation with the mean, the corresponding code is not included in this snippet, indicating that imputation was not performed in this specific case.
Missing data represented in red
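The cleaning steps above can be sketched as follows. This is a minimal illustration on made-up data, not the project's actual code: the `price` column name, the sample values, and the non-inplace `dropna` call are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical raw data: monthly prices stored as strings, with gaps.
dates = pd.date_range("1993-01-01", periods=48, freq="MS")
raw = pd.DataFrame({"price": ["100.5", "n/a", "102.0", ""] * 12}, index=dates)

# Convert to numeric; anything that cannot be parsed becomes NaN.
raw["price"] = pd.to_numeric(raw["price"], errors="coerce")

# Count missing (True) versus present (False) values.
print(raw["price"].isna().value_counts())

# Drop rows containing NaN, then keep only records from 1995 onwards.
clean = raw.dropna(axis=0)
clean = clean.loc["1995-01-01":]
print(clean.head())
```

Slicing with a date string works here because the DataFrame has a sorted `DatetimeIndex`.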
Implemented down-sampling to calculate yearly mean and up-sampling using linear interpolation for weekly data.
Leveraged pandas resample() function for time-based aggregation and interpolation.
Upsampled/downsampled time-series data
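A small sketch of both resampling directions, assuming a monthly input series (the source does not state the original frequency, so the data and variable names here are illustrative). The up-sampling goes through a daily grid first, because month-start dates rarely coincide with the week-end dates of a weekly grid.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly series covering two years.
idx = pd.date_range("2000-01-01", periods=24, freq="MS")
monthly = pd.Series(np.arange(24, dtype=float), index=idx, name="price")

# Down-sampling: aggregate to one mean value per year.
yearly = monthly.resample("YS").mean()

# Up-sampling: interpolate linearly onto a daily grid, then
# aggregate the daily values into weekly means.
daily = monthly.resample("D").interpolate(method="linear")
weekly = daily.resample("W").mean()

print(yearly)
print(weekly.head())
```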
The check for stationarity can be done via three different approaches:
- visually: plot the time series and check for trends or seasonality
- basic statistics: split the time series and compare the mean and variance of each partition
- statistical test: Augmented Dickey-Fuller test
Trend (non-stationary) time series plot
Augmented Dickey-Fuller test
The augmented Dickey-Fuller (ADF) test is a statistical test used to determine whether a unit root is present in a time series, which would make the series non-stationary. It tests the null hypothesis that the series has a unit root against the alternative hypothesis that the series is stationary. In simpler terms, it helps determine whether the series fluctuates around a stable level or drifts over time as random shocks accumulate.
The augmented Dickey-Fuller test is a key check before fitting ARIMA models: it confirms that the (differenced) time series is stationary, a prerequisite for accurate modeling and forecasting with ARIMA.
p value is 0.9727644640822224
- Therefore we fail to reject the null hypothesis
- Meaning the time series has a unit root.
- The time series has a trend (non-stationary).
Non-stationary
p value is 2.166809681553704e-30
- Therefore we reject the null hypothesis
- Meaning the time series does not have a unit root.
- The time series is stationary.
Stationary (DIFF = 1)
To avoid data leakage, I split the dataset into distinct training and testing sets at a predefined break date.
Observations after the break date are reserved exclusively for testing, so no future data can enter the training process and the model evaluation stays reliable.
I used the following evaluation metrics:
Mean Squared Error (MSE)
Mean Absolute Error (MAE)
Mean Absolute Percentage Error (MAPE)
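A minimal sketch of a date-based split and the three metrics, written with NumPy so the formulas are visible. The function names, the break date, and the choice to report MAPE in percent are assumptions for illustration; scikit-learn's `mean_squared_error` and `mean_absolute_error` compute the same MSE and MAE.

```python
import numpy as np
import pandas as pd


def split_by_date(series, break_date):
    """Train on observations before break_date, test on the rest."""
    break_ts = pd.Timestamp(break_date)
    train = series[series.index < break_ts]
    test = series[series.index >= break_ts]
    return train, test


def evaluate(actual, predicted):
    """Return MSE, MAE, and MAPE (in percent) for a forecast."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    return {
        "MSE": float(np.mean(errors ** 2)),
        "MAE": float(np.mean(np.abs(errors))),
        "MAPE": float(np.mean(np.abs(errors / actual)) * 100),
    }


# Hypothetical monthly series and break date.
idx = pd.date_range("2020-01-01", periods=36, freq="MS")
prices = pd.Series(np.linspace(100.0, 135.0, 36), index=idx)
train, test = split_by_date(prices, "2022-01-01")

# Naive baseline: repeat the last training value over the test period.
naive_forecast = np.full(len(test), train.iloc[-1])
print(evaluate(test, naive_forecast))
```

Because the split is by date rather than random, every test observation lies strictly after every training observation.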
Decomposed the time series into its trend, seasonality, and residual components to uncover underlying patterns and trends.
Decomposition: Analyzed the trend and seasonal patterns within the time series data.
Prediction: Forecasted future values based on the decomposed components.
Results: Evaluated the accuracy of the decomposition-based forecasts.
Results of time series decomposition
On the right, I have decomposed the time series into its seasonality, trend, and residual components. Later, I use these three components to forecast the future.
Below, you see the seasonality of gold export prices by month and the forecasts for 25 months.
{'MSE': 1683.32, 'MAE': 34.5, 'MAPE': 5.25}
Exponential smoothing gradually decreases the influence of older data points, which lets me capture underlying trends and make predictions for data that changes over time.
I selected a multiplicative trend and additive seasonality.
Results
{'MSE': 1015.75, 'MAE': 25.11, 'MAPE': 3.93}
ARIMA modeling: The Autoregressive (AR) component models the relationship between an observation and several lagged observations; its order is typically read from the partial autocorrelation function (PACF). The Moving Average (MA) component models the relationship between an observation and the residual errors of a moving average model applied to lagged observations; its order is typically read from the autocorrelation function (ACF).
To select the hyperparameters for the ARMA model, I use the PACF and ACF plots to identify candidate values for the AR and MA orders, respectively. The ACF plot shows the correlation between the series and its own lagged values at increasing lag distances. In contrast, the PACF plot shows the correlation at a given lag after removing the relationships explained by the shorter lags.
Additionally, I have selected the differencing parameter equal to one to make the time series stationary. Stationarity is essential for ARIMA modeling as it ensures that the statistical properties of the time series, such as mean and variance, remain constant over time, facilitating accurate forecasting and modeling of trends and patterns. By carefully selecting hyperparameters and ensuring stationarity, I enhance the predictive capabilities of the ARIMA model in capturing the dynamics of the time series data.
Results
{'MSE': 2844.25, 'MAE': 45.25, 'MAPE': 6.7}
The ACF tapers off at lag 25
The PACF quickly drops off after 2 lags
The sub-par results here come from applying a non-seasonal ARIMA model to seasonal data. To account for the seasonality, I use SARIMA, which models the seasonal component of the time series.
Seasonal Autoregressive Integrated Moving Average (SARIMA) is an extension of the ARIMA model that incorporates seasonality into the forecasting process.
In SARIMA, the "S" stands for seasonal, indicating that the model accounts for periodic fluctuations in the data. The rest is similar to ARIMA.
The SARIMA model is defined by three main sets of parameters: (p, d, q) for the non-seasonal components, (P, D, Q) for the seasonal components, and the seasonal period (s).
The (p, d, q) parameters correspond to the AR, differencing, and MA components of the non-seasonal part of the model, respectively.
The (P, D, Q) parameters represent the same components but for the seasonal part of the model.
The seasonal period (s) defines the length of the seasonal cycle in the data.
By effectively modeling both the seasonal and non-seasonal components of the time series data, SARIMA enables me to generate accurate forecasts that capture both short-term fluctuations and long-term trends.
The ACF tapers off at lag 25
Results
{'MSE': 1177.43, 'MAE': 29.45, 'MAPE': 4.47}