[4] dos Reis, D., Maletzke, A., Batista, G.: Unsupervised context switch for classification tasks on data streams with recurrent concepts. In: ACM/SIGAPP, vol. 33. ACM, Pau, France (2018)
Algorithms
Maletzke et al. (2017)¹ proposed the algorithms described below, which we used in our experimental comparison.
- Baseline: this algorithm ignores the non-stationarity present in data stream environments. It builds a quantifier from the first examples of the stream and never updates it, issuing a quantification every w events. It therefore requires only a small portion of labeled data from the beginning of the stream, and it is very efficient because the quantifier is built only once - code in R.
- Topline: this approach updates the quantifier every w events, attempting to track the most recent changes in the stream. Once the quantification for the w events has been issued, their actual labels become available, allowing the quantifier to be updated. This setting corresponds to a scenario with null verification latency - code in R.
- SQSI: as an initial step, it learns a classifier δ from a labeled training set. After this initialization, the method issues a quantification whenever the pool of recent events reaches w events. To this end, SQSI uses δ to generate a classification score for each event in the pool. It then applies a statistical test to check whether these scores and the training set scores (obtained with cross-validation) come from the same distribution; that is, drift is assessed only between the training set scores and the scores of the pool of recent instances. If the null hypothesis is not rejected (i.e., the test finds no evidence that the two samples come from different distributions), the quantification method is applied and the result is issued. If the null hypothesis is rejected, SQSI applies a linear transformation to the recent pool, generates new scores for the transformed pool, and repeats the test on them. If the null hypothesis is rejected once again, the algorithm requests the labels of the events in the pool and updates the classifier δ. Otherwise, the quantification is performed and the result is issued. The statistical test is the Kolmogorov-Smirnov test with a significance level of 0.001. Such a low significance level is necessary to limit the number of false positives caused by consecutive reapplications of the test - code in R.
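The SQSI decision procedure for one full pool can be sketched as follows. This is a minimal Python illustration, not the authors' R code, and it relies on several assumptions that the description above does not fix: the linear transformation is taken to be a per-feature shift and scale that matches the training mean and standard deviation (`match_moments`), the quantification method is taken to be Classify & Count (`classify_and_count`), the classifier δ is represented by a hypothetical scoring function `score_fn`, and the Kolmogorov-Smirnov p-value uses the standard asymptotic series rather than an exact computation. All function names are illustrative.

```python
import math
from statistics import mean, pstdev

ALPHA = 0.001  # significance level stated in the text


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov series (large-sample approximation)."""
    ne = n * m / (n + m)
    lam = (math.sqrt(ne) + 0.12 + 0.11 / math.sqrt(ne)) * d
    if lam < 0.3:  # the series converges slowly here and the p-value is ~1
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * s))


def same_distribution(train_scores, pool_scores, alpha=ALPHA):
    """True when the KS test does not reject the null hypothesis."""
    d = ks_statistic(train_scores, pool_scores)
    return ks_pvalue(d, len(train_scores), len(pool_scores)) >= alpha


def match_moments(pool_x, train_x):
    """ASSUMED linear transformation: per feature, shift and scale the pool
    so that its mean and standard deviation match the training set."""
    cols = []
    for p_col, t_col in zip(zip(*pool_x), zip(*train_x)):
        p_mu, p_sd = mean(p_col), pstdev(p_col)
        t_mu, t_sd = mean(t_col), pstdev(t_col)
        scale = t_sd / p_sd if p_sd > 0 else 1.0
        cols.append([(v - p_mu) * scale + t_mu for v in p_col])
    return [list(row) for row in zip(*cols)]


def classify_and_count(scores, threshold=0.5):
    """ASSUMED quantifier: fraction of events scored at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def sqsi_step(score_fn, train_x, train_scores, pool_x):
    """Process one pool of w events; returns (prevalence, needs_labels)."""
    scores = [score_fn(x) for x in pool_x]
    if same_distribution(train_scores, scores):
        return classify_and_count(scores), False
    # Drift suspected: transform the pool, rescore, and test once more.
    scores = [score_fn(x) for x in match_moments(pool_x, train_x)]
    if same_distribution(train_scores, scores):
        return classify_and_count(scores), False
    # Rejected twice: the caller should request labels and update the classifier.
    return None, True
```

A caller would accumulate events until the pool holds w of them, call `sqsi_step`, and retrain the classifier on the newly labeled pool whenever `needs_labels` is true. Returning a flag instead of retraining inside the function keeps the label request, which is the costly part, explicit.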
¹ Maletzke, A.G., dos Reis, D.M., Batista, G.E.A.P.A.: Quantification in data streams: initial results. In: 2017 Brazilian Conference on Intelligent Systems (BRACIS), p. 43. Uberlândia (2017)
Proposal
Algorithm - SQSI with Instance Selection (SQSI-IS) - code in R