time series outlier detection

TIME SERIES OUTLIER DETECTION (WASHER)

These remarks are based on the following work:

Author: Venturini, Andrea. Title: Time series outlier detection: a new non parametric methodology (washer)
Journal: Statistica, Year: 2011, Volume: 71, Issue: 3, pp. 329-344

Fig. 1 (left: Temperature; right: Rain)

In the figures above you can see 20 time series, with t = 1, 2, 3, 4, of temperature and rain measurements respectively. Each series represents a place where the measurements were made. On the left the trajectories are almost linear, while on the right there is a peak at time 2 and a downturn at time 3. (By the way, the data are completely invented!)

Now we add an outlier to one of the Temperature series and another one to one of the Rain series (Fig. 2).

A graphical analysis of the time series in Fig. 2 makes it easy to spot the two outliers.

We can apply the “washer” methodology in R by running the code in the file “esempio.R” (it runs on all versions after R 2.8.1; earlier versions may work too).


Fig. 2 (the series of Fig. 1 with one outlier added for Temperature and one for Rain)

The data are stored in the data.frame “dati” in the “long” structure typical of relational databases: the phenomenon is the first column, time the second (an ordered sequence of numbers), the zone the third, and the values are in the last column.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

> dati

         phen time zone value
1 Temperature    1  a01   2.0
2 Temperature    1  a02  20.0
3 Temperature    1  a03  25.0
4 Temperature    1  a04   7.0
5 Temperature    1  a05  16.0
6 Temperature    1  a06  20.0
7 Temperature    1  a07  17.0


----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
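As an illustration of the “long” structure, the sketch below turns a small wide table (one row per zone, one column per time) into a data.frame with the columns phen, time, zone and value. The object `wide`, its column names `t1` to `t4` and every value after time 1 are invented for the example; only the time-1 values of a01, a02 and a03 are taken from the listing above.

```r
# Hypothetical wide table: 3 zones x 4 time points for one phenomenon.
# Values at t1 come from the listing of "dati"; the others are invented.
wide <- data.frame(
  zone = c("a01", "a02", "a03"),
  t1 = c(2.0, 20.0, 25.0),
  t2 = c(2.5, 20.5, 25.5),
  t3 = c(3.0, 21.0, 26.0),
  t4 = c(3.5, 21.5, 26.5),
  stringsAsFactors = FALSE
)

value_cols <- c("t1", "t2", "t3", "t4")

# Stack the time columns one after another: phenomenon, time, zone, value.
dati <- data.frame(
  phen  = "Temperature",
  time  = rep(seq_along(value_cols), each = nrow(wide)),
  zone  = rep(wide$zone, times = length(value_cols)),
  value = unlist(wide[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(head(dati))
```

Each row of `dati` is one measurement, so adding a second phenomenon only means appending more rows with a different value of `phen`.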


In general we can analyze any number of phenomena (p = 1, ..., P, with P = 1, 2, ...), time series longer than two periods (t = 1, ..., T, with T ≥ 3) and, finally, a sufficiently large number of time series (i = 1, ..., n, with n ≥ 20-25).

The data set {y_pit} must contain only positive values (if some are negative, you must translate the whole data set!). It is analyzed by means of a measure of linearity computed on three values at a time, (y_p,i,t-1 , y_p,i,t , y_p,i,t+1), with a rolling window whose central time goes from t = 2 to t = T-1. Missing values are handled by dropping the triplet (y_p,i,t-1 , y_p,i,t , y_p,i,t+1) whenever at least one of its three values is missing.
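The rolling scheme can be sketched as follows. Both helpers, `rolling_triplets()` and `shift_positive()`, are hypothetical names written for this illustration and are not taken from esempio.R; the `margin` added after the shift is likewise an assumption. The example series is the Rain series a01 as it can be read from Table 1.

```r
# Hypothetical helper: build the triplets (y[t-1], y[t], y[t+1])
# for one series, with the central time t running from 2 to T-1.
rolling_triplets <- function(y) {
  T_len <- length(y)
  stopifnot(T_len >= 3)
  t_mid <- 2:(T_len - 1)
  trip <- data.frame(
    t.2 = t_mid,
    y.1 = y[t_mid - 1],
    y.2 = y[t_mid],
    y.3 = y[t_mid + 1]
  )
  # Drop a triplet when at least one of its three values is missing.
  trip[complete.cases(trip), , drop = FALSE]
}

# Hypothetical helper: translate the data so that every value is positive.
shift_positive <- function(y, margin = 1) {
  lowest <- min(y, na.rm = TRUE)
  if (lowest <= 0) y - lowest + margin else y
}

y <- c(1.4, 3.4, 2.4, 3.3)          # Rain, series a01 (from Table 1)
trip <- rolling_triplets(y)
print(trip)

# With a missing last value only the triplet centred at t = 2 survives.
trip_na <- rolling_triplets(c(1.4, 3.4, 2.4, NA))
print(trip_na)
```

When several series are shifted, the same constant should be applied to the whole data set, as the text requires.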

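For a fixed p and a fixed t = 2, we have n measures of linearity, one for each i = 1, …, n. With S_i = y_i1 + y_i2 + y_i3:

$$AV_{i} = 100 \cdot \frac{2\,y_{i2} - y_{i1} - y_{i3}}{\mathrm{median}_{j=1,\dots,n}(S_{j}) + S_{i}}$$

(this form reproduces the AV values reported in Table 1).

The AV index measures how linear or non-linear the three points are: it is zero when they lie on a straight line.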
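A minimal sketch of the index, assuming it has the form AV_i = 100 (2 y_i2 - y_i1 - y_i3) / (median(S) + S_i) with S_i = y_i1 + y_i2 + y_i3. The function name `av_index()` and its argument `s_median` are made up for this example. The five triplets are the Rain rows at t.2 = 2 printed in Table 1; the value 15.75 for the median of S over all 20 series is not printed anywhere and was back-solved from the AV column of that table, so treat it as an assumption.

```r
# Sketch of the AV index under the assumed formula.
# s_median lets the caller supply the median of S over all n series;
# when omitted it is computed from the triplets that were passed in.
av_index <- function(y1, y2, y3, s_median = NULL) {
  s <- y1 + y2 + y3
  if (is.null(s_median)) s_median <- median(s)
  100 * (2 * y2 - y1 - y3) / (s_median + s)
}

# First five Rain triplets at t.2 = 2, copied from Table 1.
y1 <- c(1.4, 7.1, 2.8, 10.6, 0.5)
y2 <- c(3.4, 8.7, 4.8, 13.5, 2.6)
y3 <- c(2.4, 7.6, 3.5, 10.8, 1.1)

# 15.75: median of S over all 20 Rain series, back-solved from Table 1.
av <- av_index(y1, y2, y3, s_median = 15.75)
print(round(av, 4))

# Three points on a straight line give AV = 0.
av_line <- av_index(1, 2, 3)
print(av_line)
```

Under these assumptions the computed values agree with the AV column of Table 1 to the printed precision.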

The R function washer.AV() returns the output shown in Table 1 (rows 1 to 5 and rows 21 to 25):

TABLE 1

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

           fen t.2 series  y.1  y.2  y.3     test.AV          AV  n  median.AV   mad.AV madindex.AV
1         Rain   2    a01  1.4  3.4  2.4  1.00079003  13.0718954 20  7.5798114 5.487749    36.58499
2         Rain   2    a02  7.1  8.7  7.6  0.12450637   6.8965517 20  7.5798114 5.487749    36.58499
3         Rain   2    a03  2.8  4.8  3.5  0.85840147  12.2905028 20  7.5798114 5.487749    36.58499
4         Rain   2    a04 10.6 13.5 10.8  0.63349425  11.0562685 20  7.5798114 5.487749    36.58499
5         Rain   2    a05  0.5  2.6  1.1  1.90703006  18.0451128 20  7.5798114 5.487749    36.58499


21        Rain   3    a01  3.4  2.4  3.3  1.06974472  -7.2796935 20 -4.9781600 2.151479    14.34319
22        Rain   3    a02  8.7  7.6  8.0  0.62570850  -3.6319613 20 -4.9781600 2.151479    14.34319
23        Rain   3    a03  4.8  3.5  3.9  0.39217567  -5.8219178 20 -4.9781600 2.151479    14.34319
24        Rain   3    a04 13.5 10.8 11.1  0.34721731  -5.7251908 20 -4.9781600 2.151479    14.34319
25        Rain   3    a05  2.6  1.1  1.9  2.41639858 -10.1769912 20 -4.9781600 2.151479    14.34319

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

If you look at Fig. 2, you can see that for Rain the three points centered at time 2 (t.2 = 2) form a peak, while those centered at time 3 (t.2 = 3) form a drop. This behavior produces a positive AV in the first case and a negative one in the second (Table 1).

The method works because time series tend to behave alike in terms of linearity or non-linearity. This tendency is generally independent of whether the overall trend has a positive or a negative slope.

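The next step looks at the distribution of the AV values, so that outliers can be searched for in one-dimensional data. The non-parametric test is that of Sprent:

$$\mathrm{test.AV}_{it} = \frac{\left|\widehat{AV}_{it} - \mathrm{median}_{i}(\widehat{AV}_{it})\right|}{MAD_{t}} > MAX$$

where $\widehat{AV}_{it}$ are the observed values of the index $AV_{it}$, $MAD_{t} = \mathrm{median}_{i=1,\dots,n}\left|\widehat{AV}_{it} - \mathrm{median}_{i}(\widehat{AV}_{it})\right|$ and MAX = 5.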

So if test.AV_it > 5, there is probably an outlier among (y_p,i,t-1 , y_p,i,t , y_p,i,t+1).
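The test can be sketched in a few lines of R. The function `sprent_test()` is a hypothetical name, the vector `av` is invented for the illustration, and the score is assumed to be |AV - median(AV)| / MAD with the raw (unscaled) median absolute deviation, which is what the mad.AV and test.AV columns of the tables suggest.

```r
# Sketch of the Sprent-type score for one phenomenon and one time t.
sprent_test <- function(av, max_score = 5) {
  centre <- median(av, na.rm = TRUE)
  # constant = 1 returns the raw MAD, without the 1.4826 normal scaling.
  spread <- mad(av, constant = 1, na.rm = TRUE)
  score <- abs(av - centre) / spread
  data.frame(AV = av, test.AV = score, flagged = score > max_score)
}

# Invented AV values: five ordinary series and one anomalous one.
av <- c(6.9, 11.1, 12.3, 13.1, 18.0, 47.2)
res <- sprent_test(av)
print(res)

# Same score recomputed from the figures printed for row 38 of Table 2
# (AV, median.AV and mad.AV), as a consistency check.
row38_score <- abs(47.18615 - (-4.9781600)) / 2.151479
print(row38_score)
```

Only the last invented series exceeds the threshold MAX = 5. A real run needs the AV values of all n series at the same time t, since the median and the MAD are taken across series.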

In terms of p-values, test.AV_it > 5 means that the null hypothesis “no AV^_it is an outlier” has a p-value below 4 per cent, while test.AV_it > 10 means a p-value below 1 per cent.

So a test value greater than 10 is a good indication that an outlier is present. In the example there are 4 rows in which test.AV > 5:

TABLE 2

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

> ## let's take a look at anomalous time series
> out[out[,7]>5,]
           fen t.2 series  y.1  y.2  y.3   test.AV        AV  n  median.AV   mad.AV madindex.AV
18        Rain   2    a18  5.5  6.3 17.0  5.430649 -22.22222 20  7.5798114 5.487749    36.58499
38        Rain   3    a18  6.3 17.0  5.9 24.245788  47.18615 20 -4.9781600 2.151479    14.34319
59 Temperature   2    a19 22.0 21.0  9.0  5.247867  10.73171 20  0.0000000 2.044966    13.63310
79 Temperature   3    a19 21.0  9.0 18.0 14.920553 -21.21212 20 -0.9174312 1.360183     9.06789
>

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Row 18 identifies the points (5.5; 6.3; 17.0), where the outlier is the last of the three points. In this position the AV test is less sensitive, while in row 38, with the three points (6.3; 17.0; 5.9), the outlier is the middle point and the test reaches its full sensitivity (test.AV = 24.2).
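One way to see the difference in sensitivity is to look at the quantity 2 y2 - y1 - y3, assuming this is the numerator of the AV index: the middle point enters with weight 2 and each outer point with weight -1. The helper `av_numerator()`, the flat triplet `base` and the anomaly size `delta` are all invented for this sketch.

```r
# Assumed numerator of the AV index: 2*y2 - y1 - y3.
av_numerator <- function(y) 2 * y[2] - y[1] - y[3]

base  <- c(6.0, 6.0, 6.0)   # hypothetical flat triplet
delta <- 10                 # hypothetical size of the anomaly

at_end    <- av_numerator(base + c(0, 0, delta))  # anomaly on the last point
at_middle <- av_numerator(base + c(0, delta, 0))  # anomaly on the middle point
print(c(at_end = at_end, at_middle = at_middle))

# The two triplets of series a18 quoted from Table 2.
row18 <- av_numerator(c(5.5, 6.3, 17.0))
row38 <- av_numerator(c(6.3, 17.0, 5.9))
print(c(row18 = row18, row38 = row38))
```

The same anomaly moves the numerator twice as far when it sits in the middle, which is consistent with the larger test.AV of row 38.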

The value 17.0 is too large because the other series do not behave “like that”. If all the other Rain series had a peak at time 3, the anomaly of time series “a18” would disappear altogether.

Last but not least, the output contains a measure of the tendency of the time series to behave “like that” with respect to one another: madindex! In Tables 1 and 2, madindex.AV equals mad.AV rescaled by a factor of 1000/150 (about 6.67).

The rule of thumb is the following:

If madindex is lower than 50, the series behave similarly enough to one another for outliers to be detected. If madindex is greater than 50, the values of AV are not very informative for outlier detection.
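The rule of thumb can be checked on the figures printed in the tables. The sketch assumes madindex = MAD × 1000 / 150, a scaling inferred from the mad.AV and madindex.AV columns rather than stated in the text; the function name `madindex()` is hypothetical.

```r
# Assumption: madindex is the raw MAD of AV rescaled by 1000 / 150.
madindex <- function(mad_av) mad_av * 1000 / 150

# mad.AV values printed in Tables 1 and 2:
# Rain t.2 = 2, Rain t.2 = 3, Temperature t.2 = 2, Temperature t.2 = 3.
mad_values <- c(5.487749, 2.151479, 2.044966, 1.360183)

mi <- madindex(mad_values)
informative <- mi < 50      # rule of thumb: below 50 the AV values are usable
print(data.frame(mad.AV = mad_values, madindex.AV = mi, informative))
```

All four values stay below 50, so in this example the AV values can be used for outlier detection at every time and for both phenomena.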






Attachment: esempio.R (7k), Andrea Venturini, 5 March 2012