Combining instance selection and self-training to improve data stream quantification

This repository contains supplementary material for the paper.

Datasets

  1. Bike [1] contains hourly records of a bicycle-sharing system, with the corresponding weather and seasonal information, between 2011 and 2012. The goal is to predict whether demand is high or low. Thus, we expect concept drift due to seasonality. It contains 17,379 instances; Download
  2. Mosquitoes has laboratory data on mosquitoes passing through a photosensitive sensor. The sensor data comprise seven features for each event: the Wing-Beat Frequency (WBF) and the frequencies of the first six harmonics. The temperature varies during the stream, influencing the insects' metabolism and, consequently, changing their wing-beat frequency. We consider the temperature a latent variable for the quantification and classification tasks. The objective is to distinguish events of female Aedes aegypti and Aedes albopictus mosquitoes from those of Culex quinquefasciatus and Anopheles aquasalis. This dataset contains 13,410 instances; Download
  3. Insects contains events generated by the same insect sensor. Unlike the Mosquitoes dataset, the features are the wing-beat frequency and the first 92 coefficients of the frequency spectrum obtained with a 1024-point Fast Fourier Transform. The objective is to differentiate Aedes aegypti mosquitoes from the insects Musca domestica, Culex quinquefasciatus, Culex tarsalis, and Drosophila melanogaster. It contains 83,339 instances; Download
  4. NOAA [2] is composed of meteorological conditions registered by the U.S. National Oceanic and Atmospheric Administration (Bellevue, Nebraska) over 50 years. This dataset contains eight features and 18,159 daily records; Download
  5. Arabic-Digit [3] is a modified version of Arabic-Digit, described by a fixed number of MFCC values for human speech of Arabic digits (among 10). The spoken digit defines the context, and the task is to predict the sex of the speaker. The dataset contains 26 features and 14,380 records; Download
  6. QG [4] is a version of the Handwritten dataset, restricted to the handwritten letters g and q. The context is defined by the author (among 10), and the objective is to predict the letter. This dataset is composed of 63 features and 13,279 records. Download

References

[1] Fanaee-T, H., Gama, J.: Event labeling combining ensemble detectors and background knowledge. Prog Artif Intell 2(2), 113–127 (2014). doi:10.1007/s13748-013-0040-3

[2] Dyer, K.B., Capo, R., Polikar, R.: Compose: A semisupervised learning framework for initially labeled nonstationary streaming data. IEEE Trans Neural Netw Learn Syst 25(1), 12–26 (2014).

[3] Hammami, N., Bedda, M.: Improved tree model for arabic speech recognition. In: ICCSIT, vol. 5, pp. 521–526 (2010)

[4] dos Reis, D., Maletzke, A., Batista, G.: Unsupervised context switch for classification tasks on data streams with recurrent concepts. In: ACM/SIGAPP, vol. 33. ACM, Pau, France (2018)

Algorithms

Maletzke et al. (2017)1 proposed the algorithms described below, which we used in our experimental comparison.

  • Baseline: this algorithm ignores the non-stationarity present in data stream environments. It builds a quantifier with the first examples from the stream and does not update it over time, issuing a quantification every w events. Thus, it requires only a small amount of labeled data from the beginning of the stream. It is also very efficient, since it builds the quantifier only once - code in R.
  • Topline: in this approach, we regularly update the quantifier every w events, attempting to track the most recent changes in the stream. After the quantification of the w events is issued, their actual labels become available, allowing the quantifier to be updated. This setting represents a null verification-latency scenario - code in R.
  • SQSI: as an initial step, it learns a classifier δ from a labeled training set. After this initialization, the method issues a quantification whenever the pool reaches w events. To this end, SQSI generates classification scores for each event in the pool using the classifier δ. Then, we verify with a statistical test whether these scores and the ones estimated on the training set (calculated with cross-validation) come from the same distribution, i.e., we sense the presence or absence of a drift only between the training-set scores and the scores of the pool of recent instances. If the null hypothesis is not rejected (i.e., both samples come from the same distribution), we apply the quantification method and issue the result. However, if the null hypothesis is rejected, we perform a linear transformation on the recent pool. After applying the linear transformation, SQSI generates new scores for the transformed pool and applies the statistical test again. If the null hypothesis is rejected once more, the algorithm requests the labels of the events in the pool and updates the classifier δ; otherwise, the quantification is performed and the result is issued. The statistical test used in our algorithm is the Kolmogorov-Smirnov test with a significance level of 0.001. Such a low significance level is necessary to minimize the number of false positives due to consecutive reapplications of the test - code in R.
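The SQSI loop above can be sketched in a few lines. This is a minimal Python illustration, not the authors' R implementation: the score function, the classify-and-count quantifier, and the `transform`, `request_labels`, and `retrain` callbacks are placeholders for the paper's concrete choices, and the KS decision uses the asymptotic critical value rather than an exact p-value.

```python
# Minimal sketch of one SQSI iteration over a pool of w events.
# The scorer, quantifier, and callbacks are assumptions; the authors'
# R code may differ in these details.
import math

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum distance
    between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_reject(a, b, alpha=0.001):
    """Reject the null hypothesis (same distribution) at level alpha,
    using the asymptotic critical value c(alpha) * sqrt((n+m)/(n*m))."""
    c = math.sqrt(-0.5 * math.log(alpha / 2.0))
    n, m = len(a), len(b)
    return ks_statistic(a, b) > c * math.sqrt((n + m) / (n * m))

def classify_and_count(scores, threshold=0.5):
    """Simplest quantifier: fraction of scores at or above the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)

def sqsi_step(score_fn, train_scores, pool, transform, request_labels, retrain):
    """Process one pool; return its estimated positive-class prevalence."""
    scores = [score_fn(x) for x in pool]
    if not ks_reject(train_scores, scores):            # no drift sensed
        return classify_and_count(scores)
    t_scores = [score_fn(x) for x in transform(pool)]  # linear transformation
    if not ks_reject(train_scores, t_scores):
        return classify_and_count(t_scores)
    retrain(pool, request_labels(pool))                # drift persists: relearn δ
    return classify_and_count([score_fn(x) for x in pool])
```

With a score function whose pool scores match the training distribution, `sqsi_step` reduces to classify-and-count on the pool; the transform and relabeling paths are only exercised when the KS test rejects twice.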

1Maletzke, A.G., dos Reis, D.M., Batista, G.E.A.P.A.: Quantification in data streams: Initial results. In: 2017 Brazilian Conference on Intelligent Systems (BRACIS), Uberlândia, p. 43 (2017). link

Proposal

Algorithm - SQSI with Instance Selection (SQSI-IS) - code in R