# TIME SERIES OUTLIER DETECTION (WASHER)

These remarks are based on the following paper:

Venturini, Andrea (2011). Time series outlier detection: a new non parametric methodology (washer). Statistica, volume 71, issue 3, pp. 329-344.

Graph 1

In the graphs above you can see two sets of 20 time series, with t = 1, 2, 3, 4, of temperature and rain measurements respectively. Every series represents a place where the measurements were taken. On the left the trajectories are almost linear, while on the right there is a peak at time 2 and a downturn at time 3. (By the way, the data are completely invented!)

Now we insert an outlier into one of the temperature series and another one into one of the rain series (graph 2).

A graphical inspection of the time series in graph 2 makes the two outliers easy to spot.

We can apply the “washer” methodology in R by running the code in the file “esempio.R”. It works with every version after R 2.8.1, and perhaps with earlier versions too.

A newer and faster version of the "washer" R function is in the file esempio2.r. Its last example shows how to use "washer" on a single time series.

Graph 2

The data are stored in the data.frame “dati” in the “long” structure typical of relational databases: the phenomenon is in the first column, time in the second (an ordered sequence of numbers), the zone in the third and the values in the last column.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

> dati

         phen time zone value
1 Temperature    1  a01   2.0
2 Temperature    1  a02  20.0
3 Temperature    1  a03  25.0
4 Temperature    1  a04   7.0
5 Temperature    1  a05  16.0
6 Temperature    1  a06  20.0
7 Temperature    1  a07  17.0

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
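As a rough illustration of the “long” layout, the following Python sketch (not the author's R code) groups long-format rows into one time-ordered series per (phenomenon, zone) pair. The function name `to_series` is hypothetical; the rows are the Rain values of zones a01 and a02 that can be read off table 1.

```python
from collections import defaultdict

def to_series(rows):
    """Group long-format rows (phen, time, zone, value) into time-ordered series."""
    grouped = defaultdict(list)
    for phen, time, zone, value in rows:
        grouped[(phen, zone)].append((time, value))
    # Sort each series by time and keep only the values.
    return {key: [v for _, v in sorted(points)] for key, points in grouped.items()}

# Rain values for zones a01 and a02, deliberately listed out of time order.
rows = [
    ("Rain", 2, "a01", 3.4), ("Rain", 1, "a01", 1.4),
    ("Rain", 3, "a01", 2.4), ("Rain", 4, "a01", 3.3),
    ("Rain", 1, "a02", 7.1), ("Rain", 2, "a02", 8.7),
    ("Rain", 3, "a02", 7.6), ("Rain", 4, "a02", 8.0),
]
series = to_series(rows)
print(series[("Rain", "a01")])  # [1.4, 3.4, 2.4, 3.3]
```

Keeping the data in long form and grouping only when needed means that several phenomena and zones can share one table.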

So in general we can analyze any number of phenomena (p = 1, ..., P, with P ≥ 1), time series of at least three periods (t = 1, ..., T, with T ≥ 3) and, finally, a sufficiently large number of time series (i = 1, ..., n, with n ≥ 20-25).

The data set {ypit} must contain only positive values (if some values are negative you must translate the whole data set!). It is analyzed by means of a measure of linearity of three values at a time (yp,i,t-1 , yp,i,t , yp,i,t+1), with a rolling window that starts at t = 2 and ends at t = T-1. Missing values are handled by dropping the triple (yp,i,t-1 , yp,i,t , yp,i,t+1) if at least one of its three values is missing.
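The rolling window can be sketched in Python as follows. This is only an illustration, not the author's R implementation: the function name `rolling_triples` is hypothetical and `None` is assumed to stand for a missing value.

```python
def rolling_triples(values):
    """Return (t, (y_prev, y_mid, y_next)) for t = 2 .. T-1.

    Triples that contain a missing value (None) are dropped.
    """
    triples = []
    for idx in range(1, len(values) - 1):  # idx is 0-based, so t = idx + 1
        triple = (values[idx - 1], values[idx], values[idx + 1])
        if any(v is None for v in triple):
            continue
        triples.append((idx + 1, triple))
    return triples

complete = [1.4, 3.4, 2.4, 3.3]   # Rain, zone a01, t = 1..4
with_gap = [None, 3.4, 2.4, 3.3]  # same series with the first value missing

print(rolling_triples(complete))  # [(2, (1.4, 3.4, 2.4)), (3, (3.4, 2.4, 3.3))]
print(rolling_triples(with_gap))  # [(3, (3.4, 2.4, 3.3))]
```

A series of length T yields at most T-2 triples, which is why at least three periods are needed.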

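For a fixed p and a fixed t = 2, we have n measures of linearity, one for each i = 1, …, n, with Si = yi1 + yi2 + yi3. The AV values reported in table 1 correspond to:

$$AV_i = 100 \cdot \frac{2\,y_{i2} - y_{i1} - y_{i3}}{\mathrm{median}_{j=1,\dots,n}(S_j) + S_i}$$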

The AV index measures the linearity of the three points: it is zero when they lie on a straight line and grows in absolute value as they depart from linearity.
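A minimal Python sketch of the index follows. It assumes the form AV_i = 100 · (2·y_i2 − y_i1 − y_i3) / (median(S) + S_i), which reproduces the AV column of table 1 but is not copied from the author's R code. The function name `av_index` and the optional `median_sum` argument are hypothetical; the value 15.75 is the median of S back-solved from table 1, since only some of the 20 series are listed.

```python
from statistics import median

def av_index(triples, median_sum=None):
    """AV for each triple (y1, y2, y3) observed at the same time across series.

    Assumed formula: 100 * (2*y2 - y1 - y3) / (median of all S + S_i),
    where S_i = y1 + y2 + y3.
    """
    sums = [y1 + y2 + y3 for y1, y2, y3 in triples]
    if median_sum is None:
        median_sum = median(sums)
    return [100 * (2 * y2 - y1 - y3) / (median_sum + s)
            for (y1, y2, y3), s in zip(triples, sums)]

# Rows 1 and 2 of table 1 (Rain, t.2 = 2), with the back-solved median of S.
rain_t2 = [(1.4, 3.4, 2.4), (7.1, 8.7, 7.6)]
av = av_index(rain_t2, median_sum=15.75)
print([round(v, 4) for v in av])  # [13.0719, 6.8966]

# Three points on a straight line give AV = 0.
print(av_index([(1.0, 2.0, 3.0)]))  # [0.0]
```

The median term in the denominator keeps the index from exploding for series whose values are close to zero.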

The R function washer.AV() returns the output shown in table 1 (first 5 rows and rows 21 to 25):

TABLE 1

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

           fen t.2 series  y.1  y.2  y.3     test.AV          AV  n  median.AV   mad.AV madindex.AV
1         Rain   2    a01  1.4  3.4  2.4  1.00079003  13.0718954 20  7.5798114 5.487749    36.58499
2         Rain   2    a02  7.1  8.7  7.6  0.12450637   6.8965517 20  7.5798114 5.487749    36.58499
3         Rain   2    a03  2.8  4.8  3.5  0.85840147  12.2905028 20  7.5798114 5.487749    36.58499
4         Rain   2    a04 10.6 13.5 10.8  0.63349425  11.0562685 20  7.5798114 5.487749    36.58499
5         Rain   2    a05  0.5  2.6  1.1  1.90703006  18.0451128 20  7.5798114 5.487749    36.58499

21        Rain   3    a01  3.4  2.4  3.3  1.06974472  -7.2796935 20 -4.9781600 2.151479    14.34319
22        Rain   3    a02  8.7  7.6  8.0  0.62570850  -3.6319613 20 -4.9781600 2.151479    14.34319
23        Rain   3    a03  4.8  3.5  3.9  0.39217567  -5.8219178 20 -4.9781600 2.151479    14.34319
24        Rain   3    a04 13.5 10.8 11.1  0.34721731  -5.7251908 20 -4.9781600 2.151479    14.34319
25        Rain   3    a05  2.6  1.1  1.9  2.41639858 -10.1769912 20 -4.9781600 2.151479    14.34319

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

If you look at graph 2, you can see that at time 2 (t.2 = 2) the three points form a peak, while at time 3 (t.2 = 3) there is a drop. This behavior produces a positive value of AV in the first case and a negative one in the second (table 1).

The method works because the time series tend to behave in the same way in terms of linearity or non-linearity. This tendency is, in general, independent of whether the overall trend has a positive or a negative slope.

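The next step examines the distribution of the AV values in order to find outliers in one-dimensional data. The non-parametric test is that of Sprent:

$$\mathrm{test.AV}_{it} = \frac{\left|\widehat{AV}_{it} - \mathrm{median}_{i=1,\dots,n}\left(\widehat{AV}_{it}\right)\right|}{MAD_t} > MAX$$

where AV^it are the observed values of the index AVit, MADt = median(i=1,…,n)|AV^it – median(AV^it)| and MAX = 5.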

So if test.AVit > 5 then probably there is an outlier in (yp,i,t-1 , yp,i,t , yp,i,t+1).
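The test can be sketched in Python as below, assuming the statistic is |AV − median(AV)| / MAD, which is consistent with the test.AV, median.AV and mad.AV columns of table 1. The function name `sprent_test` and the input values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import median

def sprent_test(av_values, max_threshold=5.0):
    """Return a list of (test statistic, flagged) pairs, one per AV value."""
    med = median(av_values)
    mad = median([abs(v - med) for v in av_values])
    if mad == 0:
        raise ValueError("MAD is zero: the statistic is undefined")
    stats = [abs(v - med) / mad for v in av_values]
    return [(s, s > max_threshold) for s in stats]

# Invented AV values: median = 3, MAD = 1.
result = sprent_test([1.0, 2.0, 3.0, 4.0, 100.0])
print(result)
# [(2.0, False), (1.0, False), (0.0, False), (1.0, False), (97.0, True)]
```

Because both the median and the MAD are robust to a few extreme values, the single anomalous value does not mask itself.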

In terms of p-values, test.AVit > 5 means that the null hypothesis “No AV^it is an outlier” can be rejected with a p-value < 4 per cent, while test.AVit > 10 corresponds to a p-value < 1 per cent.

So a test value greater than 10 is a strong indication of the presence of an outlier. In the example there are 4 rows in which test.AV > 5:

TABLE 2

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

> ## let's take a look at anomalous time series
> out[out[,7]>5,]
           fen t.2 series  y.1  y.2  y.3   test.AV        AV  n  median.AV   mad.AV madindex.AV
18        Rain   2    a18  5.5  6.3 17.0  5.430649 -22.22222 20  7.5798114 5.487749    36.58499
38        Rain   3    a18  6.3 17.0  5.9 24.245788  47.18615 20 -4.9781600 2.151479    14.34319
59 Temperature   2    a19 22.0 21.0  9.0  5.247867  10.73171 20  0.0000000 2.044966    13.63310
79 Temperature   3    a19 21.0  9.0 18.0 14.920553 -21.21212 20 -0.9174312 1.360183     9.06789
>

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Row 18 identifies the points (5.5; 6.3; 17.0), where the outlier is the last of the three points. In this position the AV test is less sensitive, while in row 38, with the three points (6.3; 17.0; 5.9), the outlier is in the middle and the test is at full sensitivity (test.AV = 24.2).
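The difference in sensitivity can be checked with a short Python calculation on rows 18 and 38. The sketch assumes AV = 100 · (2·y2 − y1 − y3) / (median(S) + S) and test.AV = |AV − median.AV| / mad.AV, which reproduce the table values; the function names are hypothetical, and 15.75 and 17.0 are medians of S back-solved from table 1. Under that assumed form the middle point enters the numerator with weight 2 and the end points with weight 1, so an anomaly in the middle moves AV twice as much.

```python
def av_single(y1, y2, y3, median_sum):
    """Assumed AV formula for one triple, given the median of S across series."""
    return 100 * (2 * y2 - y1 - y3) / (median_sum + y1 + y2 + y3)

def sprent_statistic(av, median_av, mad_av):
    """Distance of AV from the median, measured in MAD units."""
    return abs(av - median_av) / mad_av

# Row 18 (Rain, t.2 = 2): the anomalous 17.0 is the last point of the triple.
av_end = av_single(5.5, 6.3, 17.0, median_sum=15.75)
stat_end = sprent_statistic(av_end, 7.5798114, 5.487749)

# Row 38 (Rain, t.2 = 3): the anomalous 17.0 is the middle point.
av_mid = av_single(6.3, 17.0, 5.9, median_sum=17.0)
stat_mid = sprent_statistic(av_mid, -4.9781600, 2.151479)

print(round(av_end, 5), round(stat_end, 2))  # -22.22222 5.43
print(round(av_mid, 5), round(stat_mid, 2))  # 47.18615 24.25
```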

The value 17.0 is too large because the other series do not behave “like that”. If all the other rain series had a peak at time 3, the anomaly of time series “a18” would disappear completely.

Last but not least, the output contains a measure of the tendency of the time series to behave “like that” with respect to one another: madindex!

The rule of thumb is the following:

If madindex is lower than 50, the series behave similarly enough to one another for outliers to be detected. If madindex is greater than 50, the values of AV are not very informative for outlier detection.
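The rule of thumb can be written as a small Python check. The relation madindex.AV = mad.AV · 100 / 15 is an assumption: it is not stated in the text but matches every row of tables 1 and 2. The function names are hypothetical, and the case madindex = 50, which the rule leaves open, is treated here as not informative.

```python
def madindex_from_mad(mad_av):
    """Assumed relation observed in tables 1 and 2: madindex.AV = mad.AV * 100 / 15."""
    return mad_av * 100 / 15

def av_is_informative(madindex, limit=50):
    """Rule of thumb: AV values are informative when madindex is below the limit."""
    return madindex < limit

rain_t2 = madindex_from_mad(5.487749)  # mad.AV for Rain at t.2 = 2
print(round(rain_t2, 5), av_is_informative(rain_t2))  # 36.58499 True
print(av_is_informative(60.0))                        # False
```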

Citations:

- Network intrusion detection and visualization using aggregations in a cyber security data warehouse. BD Czejdo, EM Ferragut, JR Goodall… - International Journal of …, 2012 - search.proquest.com

- An individual-based model for the migration of pike (Esox lucius) in the river Yser, Belgium. JM Baetens, S Van Nieuland, IS Pauwels… - Ecological …, 2013 - Elsevier

- A dynamical systems approach to the discrimination of the modes of operation of cryptographic systems. J Machicao, JM Baetens, AG Marco, B De Baets… - … in Nonlinear Science …, 2015 - Elsevier

- Applied Data Mining. Guandong Xu, Yu Zong, Zhenglu Yang - CRC Press, 17 June 2013 - 284 pp., p. 52

- Time series outlier detection (a simple R function). Tal Galili - R-bloggers, July 8, 2015

- Algorithms for Time Series Anomaly Detection - How to do things

Attachments:

- esempio.R (7k) - Andrea Venturini, 5 Mar 2012, 01:34

- esempio2.r (7k) - Andrea Venturini, 17 Oct 2016, 03:37