Time series forecasting model
Microsoft Machine Learning - Use Case "The Prophet" model
Time Series Forecasting is a statistical technique used to predict future values based on past observations. It is extensively applied in fields such as finance, economics, and environmental science. This method involves examining trends, patterns, and seasonality in time series data to forecast future events accurately.
Effective forecasting enables organizations to plan and make decisions by offering insights into future trends and possible outcomes. Techniques such as ARIMA, Exponential Smoothing, and the Prophet model are frequently utilized for forecasting.
Some key points to consider for any Time Series Forecasting:
Trend Analysis/Time Series Data: A time series is a sequence of data points indexed in chronological order, often collected at regular intervals (daily, hourly, etc.); examples include stock prices, sales figures, and weather measurements. This is the foundation of forecasting. Trend analysis identifies long-term movements in the data over time, such as values that increase, decrease, or plateau. Capturing these trends is crucial for making reliable forecasts.
Seasonality: Understanding and adjusting for regular patterns that repeat over a known period. Many time series data exhibit seasonal patterns that repeat over time (e.g., monthly sales cycles, yearly temperature variations). Forecasting models need to account for these patterns for accurate predictions.
Cyclical Patterns: Recognizing and accounting for fluctuations that are not of a fixed period.
Error, Noise, or Randomness: Distinguishing unpredictable variations that cannot be modeled.
Time series forecasting has a wide range of applications across various domains:
Business: Predicting future sales, demand, and inventory levels to optimize resource allocation and marketing strategies.
Finance: Forecasting stock prices, exchange rates, and other financial metrics to make informed investment decisions.
Supply Chain Management: Predicting future demand for products to optimize inventory management and logistics.
Science & Engineering: Forecasting weather patterns, energy consumption, or equipment failures for proactive planning and maintenance.
There are numerous forecasting techniques, each with its strengths and weaknesses. Here are some of the most common ones:
Statistical Methods: These methods use statistical analysis to identify patterns in historical data and make predictions. Examples include ARIMA (Autoregressive Integrated Moving Average) models for stationary data (a brief sketch follows this list).
Machine Learning Methods: These methods leverage machine learning algorithms to learn complex patterns from historical data and make predictions. Popular choices include Prophet (discussed in detail here) and LSTMs (Long Short-Term Memory networks) for complex time series data.
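As a quick illustration of the statistical approach, here is a minimal ARIMA sketch. It assumes the statsmodels package is available; the synthetic series and the (1, 1, 1) order are placeholders, not tuned values.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

series = np.cumsum(np.random.randn(120))      # synthetic random-walk style series
model = ARIMA(series, order=(1, 1, 1)).fit()  # AR(1), first-order differencing, MA(1)
print(model.forecast(steps=12))               # forecast the next 12 steps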
Choosing the Right Technique:
The best forecasting technique depends on the specific characteristics of your data and the desired level of accuracy. Here are some factors to consider:
Data Characteristics: Is the data stationary (constant mean and variance)? Does it exhibit seasonality or trends?
Forecast Horizon: How far into the future do you need to predict?
Accuracy Requirements: How important is it to have highly accurate forecasts?
The Prophet time series model is a forecasting tool developed by Facebook. It's designed for forecasting at scale, handling the common issues of time series data like missing values, outliers, and dramatic changes in a given time series.
Key features:
Additive Model: Prophet uses an additive model where non-linear trends are fit with yearly, weekly, and daily seasonality, plus holiday effects.
Automatic Seasonality and Trend Handling: Prophet automatically detects and models seasonality in your data, including weekly, monthly, and yearly patterns. It can also capture trends, whether linear or non-linear.
Holiday Effects: Prophet allows you to incorporate the effects of holidays on your time series data. You can specify holidays and their expected impact on forecasts, which can be helpful for scenarios like retail sales where holidays significantly influence demand.
Robust to Noise: It's robust to missing data and shifts in the trend, and typically handles outliers well.
Tunable Forecasts: While it provides automated forecasts, it also allows for manual tuning to incorporate domain knowledge.
Easy to Use: Prophet is designed to be easy to use and requires minimal input from the user to get started.
Open Source: It's open source and implemented in both R and Python, sharing the same underlying Stan code for fitting. The Python library (prophet, formerly fbprophet) offers a simple API for data input, model fitting, and forecast generation. This makes it accessible even for those without a strong machine learning background.
Interpretability: Prophet provides interpretable results. The model outputs don't just provide forecasts, they also highlight the contributions of trend, seasonality, and holidays to the predictions. This can be valuable for understanding the reasoning behind the forecasts.
Prophet is particularly effective for data with strong seasonal effects and several seasons of historical data. It's used in many applications across Facebook and by various organizations for reliable forecasts for planning and goal setting.
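To illustrate the simple API described above, here is a minimal Prophet sketch; the sample data is purely illustrative, and ds/y are the column names Prophet expects.
import pandas as pd
from prophet import Prophet

# Illustrative history: 36 monthly observations with the required ds/y columns
history = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=36, freq="MS"),
    "y": range(36),
})

m = Prophet()
m.fit(history)
future = m.make_future_dataframe(periods=12, freq="MS")  # extend 12 months into the future
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())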
Some limitations to keep in mind with Prophet:
Univariate: Prophet is designed for univariate forecasting, meaning it can only handle one target variable at a time.
Complexities or Non-Stationarity: While Prophet can handle trends and seasonality, it might struggle with highly complex patterns or data with significant non-stationarity (where the statistical properties change over time).
Accuracy: Prophet may not always outperform other forecasting models in terms of raw accuracy.
Here we will cover Microsoft Fabric end-to-end data science workflow for a time series forecasting model. This scenario uses historic sales data to predict the total monthly sales of properties in New York City. Time series forecasting predicts future values, based on historical data. This is a common, important part of business operations. The workflow discussed here can apply to other forecasting tasks: weather, sales numbers, stock prices, capacity planning, etc.
Below we cover these topics:
Install custom library resources
Load the data
Examine and process the data through exploratory data analysis
Train a machine learning model with Prophet - an open source software package - and track experiments using MLflow and the Fabric Autologging feature
Save the final machine learning model and make predictions.
Dataset
The notebook in the workflow uses the NYC Property Sales dataset. It covers data from 2003 to 2015, published by the NYC Department of Finance. The dataset includes a record of every building sale in the New York City property market within a thirteen-year period.
Goal
The goal is to build a model that forecasts the monthly total sales, based on historical data. For this, we will use Prophet.
Prophet uses a decomposable time series model, consisting of three components:
trend: Prophet assumes a piece-wise constant rate of growth, with automatic change point selection
seasonality: By default, Prophet uses Fourier Series to fit weekly and yearly seasonality
holidays: Prophet requires all past and future occurrences of holidays. If a holiday doesn't repeat in the future, Prophet will not include it in the forecast.
This notebook aggregates the data on a monthly basis, so it ignores the holidays.
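For reference, the standard Prophet formulation combines these components as an additive decomposition, where g(t) is the trend, s(t) the seasonality, h(t) the holiday effects, and \varepsilon_t the error term:

y(t) = g(t) + s(t) + h(t) + \varepsilon_t

With multiplicative seasonality (used later in this notebook), the seasonal component scales with the trend instead of adding to it.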
The data source consists of fifteen .csv files. These files contain property sales records from five boroughs in New York, between 2003 and 2015. For convenience, the nyc_property_sales.tar file holds all of these .csv files, compressing them into one file. A publicly-available blob storage hosts this .tar file.
With the parameters used in this code cell, you can easily apply this notebook to different datasets.
URL = "https://synapseaisolutionsa.blob.core.windows.net/public/NYC_Property_Sales_Dataset/"
TAR_FILE_NAME = "nyc_property_sales.tar"
DATA_FOLDER = "Files/NYC_Property_Sales_Dataset"
TAR_FILE_PATH = f"/lakehouse/default/{DATA_FOLDER}/tar/"
CSV_FILE_PATH = f"/lakehouse/default/{DATA_FOLDER}/csv/"
EXPERIMENT_NAME = "aisample-timeseries" # MLflow experiment name
import os
if not os.path.exists("/lakehouse/default"):
    # Add a lakehouse if the notebook has no default lakehouse
    # A new notebook will not link to any lakehouse by default
    raise FileNotFoundError(
        "Default lakehouse not found, please add a lakehouse for the notebook."
    )
else:
    # Verify whether or not the required files are already in the lakehouse, and if not, download and unzip
    if not os.path.exists(f"{TAR_FILE_PATH}{TAR_FILE_NAME}"):
        os.makedirs(TAR_FILE_PATH, exist_ok=True)
        os.system(f"wget {URL}{TAR_FILE_NAME} -O {TAR_FILE_PATH}{TAR_FILE_NAME}")

    os.makedirs(CSV_FILE_PATH, exist_ok=True)
    os.system(f"tar -zxvf {TAR_FILE_PATH}{TAR_FILE_NAME} -C {CSV_FILE_PATH}")
Microsoft Fabric extends the MLflow logging capabilities: autologging automatically captures the values of input parameters and output metrics of a machine learning model during its training.
This information is then logged to the workspace, where the MLflow APIs or the corresponding experiment in the workspace can access and visualize it.
# Set up the MLflow experiment
import mlflow
mlflow.set_experiment(EXPERIMENT_NAME)
mlflow.autolog(disable=True) # Disable MLflow autologging
Note: If you want to disable Microsoft Fabric autologging in a notebook session, call mlflow.autolog() and set disable=True.
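With autologging disabled, runs can still be recorded explicitly through the standard MLflow tracking API. A minimal sketch; the run name, parameter, and metric below are illustrative, not prescribed by this workflow.
with mlflow.start_run(run_name="prophet-baseline"):
    mlflow.log_param("seasonality_mode", "multiplicative")  # example parameter
    mlflow.log_metric("mape", 0.0)                          # example metric placeholder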
df = (
    spark.read.format("csv")
    .option("header", "true")
    .load("Files/NYC_Property_Sales_Dataset/csv")
)
A manual review of the dataset leads to some early observations:
Instances of $0.00 sale prices. According to the Glossary of Terms, this implies a transfer of ownership without a cash consideration. In other words, no cash flowed in the transaction. We need to remove sales with $0.00 sale_price values from the dataset.
The dataset covers different building classes.
However, this notebook focuses only on residential buildings, which, according to the Glossary of Terms, are marked as type "A".
We need to filter the dataset to include only residential buildings.
To do this, we could use either the building_class_at_time_of_sale or the building_class_at_present column; this notebook uses only the building_class_at_time_of_sale data.
The dataset includes instances where total_units values equal 0, or gross_square_feet values equal 0. We will remove all the instances where total_units or gross_square_feet values equal 0.
Some columns - for example, apartment_number, tax_class, building_class_at_present, etc. - have missing or NULL values. Assume that the missing data involves clerical errors, or non-existent data. The analysis does not depend on these missing values, so we will ignore them.
The sale_price column is stored as a string, with a prepended "$" character. To proceed with the analysis, represent this column as a number: cast the sale_price column as an integer.
To resolve some of the identified issues, import the required libraries.
# Import libraries
import pyspark.sql.functions as F
from pyspark.sql.types import *
Cast the sales data from string to integer
Use regular expressions to separate the numeric portion of the string from the dollar sign (for example, in the string "$300,000", split "$" and "300,000"), and then cast the numeric portion as an integer.
Next, filter the data to only include instances that meet all of these conditions:
The sale_price is greater than 0
The total_units is greater than 0
The gross_square_feet is greater than 0
The building_class_at_time_of_sale is of type A
df = df.withColumn(
    "sale_price", F.regexp_replace("sale_price", "[$,]", "").cast(IntegerType())
)
df = df.select("*").where(
    'sale_price > 0 and total_units > 0 and gross_square_feet > 0 and building_class_at_time_of_sale like "A%"'
)
The data resource tracks property sales on a daily basis, but this approach is too granular for this notebook. Instead, aggregate the data on a monthly basis.
First, convert the date values to show only the year and month. The values still include the year, so you can still distinguish between, for example, December 2005 and December 2006.
Additionally, only keep the columns relevant to the analysis: sale_price, total_units, gross_square_feet, and sale_date. You must also rename sale_date to month.
monthly_sale_df = df.select(
    "sale_price",
    "total_units",
    "gross_square_feet",
    F.date_format("sale_date", "yyyy-MM").alias("month"),
)
display(monthly_sale_df)
Aggregate the sale_price, total_units and gross_square_feet values by month: group the data by month, and sum the values within each group.
summary_df = (
    monthly_sale_df.groupBy("month")
    .agg(
        F.sum("sale_price").alias("total_sales"),
        F.sum("total_units").alias("units"),
        F.sum("gross_square_feet").alias("square_feet"),
    )
    .orderBy("month")
)
display(summary_df)
PySpark DataFrames handle large datasets well. However, after the aggregation, this DataFrame is much smaller, so you can now use a pandas DataFrame.
This code casts the dataset from a pyspark DataFrame to a pandas DataFrame.
import pandas as pd
df_pandas = summary_df.toPandas()
display(df_pandas)
You can examine the property trade trend of New York City to better understand the data. This leads to insights into potential patterns and seasonality trends.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
f, (ax1, ax2) = plt.subplots(2, 1, figsize=(35, 10))
plt.sca(ax1)
plt.xticks(np.arange(0, 15 * 12, step=12))
plt.ticklabel_format(style="plain", axis="y")
sns.lineplot(x="month", y="total_sales", data=df_pandas)
plt.ylabel("Total Sales")
plt.xlabel("Time")
plt.title("Total Property Sales by Month")
plt.sca(ax2)
plt.xticks(np.arange(0, 15 * 12, step=12))
plt.ticklabel_format(style="plain", axis="y")
sns.lineplot(x="month", y="square_feet", data=df_pandas)
plt.ylabel("Total Square Feet")
plt.xlabel("Time")
plt.title("Total Property Square Feet Sold by Month")
plt.show()
The data shows a clear recurring pattern on a yearly cadence; this means the data has a yearly seasonality.
The summer months seem to have higher sales volumes than the winter months.
Comparing years with high sales volumes and years with low sales volumes: in absolute terms, the revenue difference between the high-sales months and the low-sales months is larger in high-sales years than in low-sales years.
For example, in 2004, the revenue difference between the highest sales month and the lowest sales month is about
$900,000,000 - $500,000,000 = $400,000,000
and for 2011, that revenue difference calculation is about
$400,000,000 - $300,000,000 = $100,000,000
This becomes important later, when you must decide between multiplicative and additive seasonality effects.
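One way to sanity-check this observation is to compute the spread between the highest- and lowest-selling months within each year, directly from the aggregated pandas DataFrame. This sketch assumes the df_pandas layout produced above (a month column formatted as yyyy-MM and a total_sales column).
# Per-year spread between the highest- and lowest-selling months
yearly_spread = (
    df_pandas.assign(year=df_pandas["month"].str[:4])
    .groupby("year")["total_sales"]
    .agg(lambda s: s.max() - s.min())
)
print(yearly_spread)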
Prophet input is always a two-column DataFrame.
One input column is a time column named ds, and
one input column is a value column named y.
The time column should have a date, time, or datetime data format (e.g., YYYY-MM). The dataset here meets that condition.
The value column must be a numerical data format.
For the model fitting, we must only
rename the time column to ds and
the value column to y,
and then pass the data to Prophet.
df_pandas["ds"] = pd.to_datetime(df_pandas["month"])
df_pandas["y"] = df_pandas["total_sales"]
Prophet follows the scikit-learn convention.
First, create a new instance of Prophet,
set certain parameters (e.g., seasonality_mode), and then
fit that instance to the dataset.
Although additive seasonality is the default for Prophet, you should use 'multiplicative' seasonality for the seasonality_mode parameter.
The analysis in the previous section showed that, because the seasonality amplitude changes over time, a simple additive seasonality won't fit the data well.
Set the weekly_seasonality parameter to off, because the data was aggregated by month. As a result, weekly data is not available.
Use Markov Chain Monte Carlo (MCMC) methods to capture the seasonality uncertainty estimates.
By default, Prophet can provide uncertainty estimates on the trend and observation noise, but not for the seasonality.
MCMC methods require more processing time, but they allow the algorithm to provide uncertainty estimates on the seasonality, as well as on the trend and observation noise.
Tune the sensitivity of the automatic change point detection through the changepoint_prior_scale parameter.
The Prophet algorithm automatically tries to find instances in the data where the trajectory changes abruptly. It can be difficult to find the correct value for this parameter.
To resolve this, you can try different values and then select the model with the best performance.
from prophet import Prophet

def fit_model(dataframe, seasonality_mode, weekly_seasonality, chpt_prior, mcmc_samples):
    m = Prophet(
        seasonality_mode=seasonality_mode,
        weekly_seasonality=weekly_seasonality,
        changepoint_prior_scale=chpt_prior,
        mcmc_samples=mcmc_samples,
    )
    m.fit(dataframe)
    return m
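For example, you might fit several candidate models that differ only in their change point sensitivity. The changepoint_prior_scale values and the mcmc_samples count below are illustrative assumptions, not values prescribed by this workflow.
# Three candidate models with increasing change point sensitivity (illustrative values)
model_1 = fit_model(df_pandas, "multiplicative", False, 0.05, 100)
model_2 = fit_model(df_pandas, "multiplicative", False, 0.1, 100)
model_3 = fit_model(df_pandas, "multiplicative", False, 0.5, 100)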
Prophet has a built-in cross-validation tool. This tool can estimate the forecasting error, and find the model with the best performance.
The cross-validation technique can validate model efficiency.
This technique trains the model on a subset of the dataset, and runs tests on a previously-unseen subset of the dataset.
This technique can check how well a statistical model generalizes to an independent dataset.
For cross-validation, reserve a particular sample of the dataset that was not part of the training data, and test the trained model on that sample prior to deployment. However, this approach does not work for time series data: if the model has seen data from January 2005 and March 2005, and you try to predict February 2005, the model can essentially cheat, because it could see where the data trend leads. In real applications, the aim is to forecast the future as an unseen region.
To handle this, and make the test reliable, split the dataset based on the dates. Use the dataset up to a certain date (e.g., the first eleven years of data) for training, and then use the remaining unseen data for prediction.
In this scenario, start with eleven years of training data, and then make monthly predictions using a one-year horizon.
Specifically, the training data contains everything from 2003 through 2013.
Then, the first run will handle predictions for January 2014 through January 2015.
The next run handles predictions for February 2014 through February 2015, and so on.
Repeat this process for each of the three trained models, to see which model performs the best.
Then, compare these predictions with real-world values, to establish the prediction quality of the best model.
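Prophet's built-in diagnostics implement exactly this rolling-origin scheme. A minimal sketch, assuming one of the fitted models from above and approximate window sizes (the day counts are rough conversions of eleven years of training data, a one-month step, and a one-year horizon):
from prophet.diagnostics import cross_validation, performance_metrics

df_cv = cross_validation(
    model_1,
    initial="4017 days",  # roughly eleven years of training data
    period="30 days",     # move the cutoff forward about one month per run
    horizon="365 days",   # forecast one year ahead on each run
)
df_metrics = performance_metrics(df_cv)  # MSE, RMSE, MAPE, and other errors by horizon
display(df_metrics)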