Handling outliers

An outlier is a data point whose value differs substantially from the other data points in a sample. Outliers can cause large errors in estimates of regime statistics and substantially affect the timing of detected regime shifts. When dealing with outliers, it is desirable to leave a data point intact if it falls within a “normal” range of variation, and to assign it a small weight if it falls outside that range (Huber, 1981). In STARS, each data point (x_i) is assigned a Huber-type weight (w_i), which is defined as

w_i = min(1, h / |x_i*|),    (1)

where h is a tuning constant and x_i* is the deviation from the regime mean (x̄) normalized by the scale s:

x_i* = (x_i - x̄) / s.    (2)

Equation (1) shows that a data point x_i is assigned the weight w_i = 1 if its normalized value x_i* lies within the range [-h, h]. Outside that range, the weight decreases in inverse proportion to |x_i*|, the normalized distance from the regime mean. A typical value for h is 2. For a standard normal distribution, P(|X| > 2) ≈ 0.05; that is, only about 5% of the data points would be treated as outliers purely by chance.
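The weighting rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the STARS code: the function names huber_weight and two_sided_tail are hypothetical, and the rule is taken to be the standard Huber weight (1 inside [-h, h], h/|x*| outside).

```python
import math

def huber_weight(x_norm, h=2.0):
    """Huber-type weight for a normalized deviation x_norm.

    Returns 1 inside [-h, h] and h / |x_norm| outside that range.
    """
    a = abs(x_norm)
    return 1.0 if a <= h else h / a

def two_sided_tail(h):
    """P(|X| > h) for a standard normal variable X."""
    return math.erfc(h / math.sqrt(2.0))

# Normalized deviations: two inside the range, two outside it.
weights = [huber_weight(z) for z in (0.5, -2.0, 4.0, -8.0)]
print(weights)                        # [1.0, 1.0, 0.5, 0.25]
print(round(two_sided_tail(2.0), 4))  # 0.0455
```

With h = 2, a point four scale units from the regime mean keeps half of its weight, and a point eight units away keeps a quarter.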

In the case of shifts in the mean, the scale s is the average standard deviation over all subsamples whose size equals the cut-off length, and the regime mean is

x̄ = (Σ_{i=1..n} w_i x_i) / (Σ_{i=1..n} w_i),    (3)

where n is the number of observations in the regime.

Here we have a sort of catch-22 situation. To calculate the weighted regime mean, we need to know the weights, but the weights depend on the distance from that same weighted regime mean. To resolve this issue, the weights are first calculated using the median as a robust estimator of regime location. Those weights are then used to calculate the weighted mean, which in turn is used to recalculate the weights. This cycle is repeated twice to increase the accuracy of the estimates.
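The iteration described above can be sketched as follows. This is an illustration under stated assumptions, not the STARS implementation: the function names are hypothetical, the scale is passed in as a known value, and "repeated twice" is read here as two passes of the weights-then-mean cycle.

```python
import statistics

def huber_weight(x_norm, h=2.0):
    """1 inside [-h, h], h / |x_norm| outside."""
    a = abs(x_norm)
    return 1.0 if a <= h else h / a

def weighted_regime_mean(values, scale, h=2.0, n_cycles=2):
    """Iteratively reweighted regime mean, started from the median.

    n_cycles=2 is one reading of "repeated twice" (an assumption).
    Returns the final mean and the weights used to compute it.
    """
    center = statistics.median(values)  # robust starting location
    weights = [1.0] * len(values)
    for _ in range(n_cycles):
        # Weights depend on the distance from the current center ...
        weights = [huber_weight((x - center) / scale, h) for x in values]
        # ... and the center is then updated as the weighted mean.
        center = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    return center, weights

values = [1.0, 1.2, 0.8, 1.1, 0.9, 10.0]  # made-up regime with one outlier
robust_mean, weights = weighted_regime_mean(values, scale=1.0)
plain_mean = statistics.mean(values)
print(robust_mean, plain_mean)
```

In this made-up sample the outlier at 10.0 is the only point that is down-weighted, so the weighted mean stays close to the bulk of the data while the ordinary mean is pulled up to 2.5.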

Outliers can wreak even more havoc on the detection of shifts in variance. Because the deviations from the regime means are squared, a single outlier can contribute disproportionately to the variance estimate and distort the timing of regime shifts. To reduce the influence of outliers, the regime variance is calculated as

where x_i’ is a residual after removing a stepwise trend (regime means) from the data, and

As with regime means, regime variances are calculated using a recursive procedure. First, the weights are calculated using the median absolute deviation (MAD) as a robust estimate of regime scale

This yields a first approximation of the regime variance using Eq. 4. The cycle is then repeated twice, using increasingly accurate estimates of the weights and the regime scale.
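The procedure for the variance can be sketched along the same lines. Several details below are assumptions made for illustration only: the MAD is taken as the median of the absolute residuals multiplied by 1.4826 (the usual factor that makes it comparable to a standard deviation for normal data), and the weighted variance is taken as Σ w_i x_i'² / Σ w_i, which may differ from Eq. 4. The function names are hypothetical.

```python
import math
import statistics

def huber_weight(x_norm, h=2.0):
    """1 inside [-h, h], h / |x_norm| outside."""
    a = abs(x_norm)
    return 1.0 if a <= h else h / a

def robust_regime_variance(residuals, h=2.0, n_repeats=2):
    """Iteratively reweighted variance of residuals about the regime means.

    Assumed forms (not taken from the source): MAD scaled by 1.4826 as the
    starting scale, and variance = sum(w * r^2) / sum(w).
    """
    scale = 1.4826 * statistics.median([abs(r) for r in residuals])
    if scale == 0.0:
        return 0.0  # more than half of the residuals are zero
    variance = scale ** 2
    # One initial pass plus n_repeats refinements.
    for _ in range(1 + n_repeats):
        weights = [huber_weight(r / scale, h) for r in residuals]
        variance = sum(w * r * r for w, r in zip(weights, residuals)) / sum(weights)
        scale = math.sqrt(variance)
    return variance

# Made-up residuals with a single large outlier.
residuals = [0.1, -0.2, 0.15, -0.1, 0.05, -0.15, 0.2, 5.0]
robust_var = robust_regime_variance(residuals)
plain_var = sum(r * r for r in residuals) / len(residuals)
print(robust_var, plain_var)
```

For these residuals the down-weighted estimate is well below the ordinary mean square, which is dominated by the single value of 5.0.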

Since the detection of regime shifts in correlation is performed for the time series of sums x + y and differences x - y of the normalized anomalies in the dependent and independent variables, the weights for those series are combined as w_i = (w_xi + w_yi) / 2. The formula for the weighted correlation coefficient is

Since x_i* and y_i* are normalized anomalies, there is no need for a recursive procedure to estimate r. The weights are calculated directly using Eq. 1.
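As an illustration of the combined weights, the sketch below averages the two Huber weights as w_i = (w_xi + w_yi) / 2 and feeds them into a weighted correlation. The correlation used here is the standard weighted Pearson formula, chosen as an assumption for the example; it may differ from the formula referred to above. All function names are hypothetical.

```python
import math

def huber_weight(x_norm, h=2.0):
    """1 inside [-h, h], h / |x_norm| outside."""
    a = abs(x_norm)
    return 1.0 if a <= h else h / a

def combined_weights(x_norm, y_norm, h=2.0):
    """w_i = (w_xi + w_yi) / 2 for paired normalized anomalies."""
    return [(huber_weight(x, h) + huber_weight(y, h)) / 2.0
            for x, y in zip(x_norm, y_norm)]

def weighted_correlation(x, y, w):
    """Standard weighted Pearson correlation (assumed form)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / sw
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x)) / sw
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y)) / sw
    return cov / math.sqrt(vx * vy)

# Made-up normalized anomalies: five well-correlated pairs and one outlier pair.
x = [-1.0, -0.5, 0.0, 0.5, 1.0, 4.0]
y = [-0.9, -0.6, 0.1, 0.4, 1.1, -4.0]
w = combined_weights(x, y)
r_weighted = weighted_correlation(x, y, w)
r_plain = weighted_correlation(x, y, [1.0] * len(x))
print(w)  # [1.0, 1.0, 1.0, 1.0, 1.0, 0.5]
print(r_weighted, r_plain)
```

Only the last pair is down-weighted, which reduces (but does not remove) the pull of that pair on the correlation estimate.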

It is important to underscore that there is no rigid mathematical definition of what constitutes an outlier. Ultimately, it is up to the researcher to decide whether an observation is an outlier and what to do about it. The software can only provide a flexible way to treat such observations. Here are a few examples illustrating the role of outliers in regime shift detection:
