Results

Graceful Degradation

Our initial approach to visualizing streaming big data used a concept called graceful degradation, as seen below in the prototype called VisMillion. The key idea is to present data in several modules, each showing information at a different aggregation level. Information flows from right to left as it gets older, offering various insights at a glance. The leftmost visual idiom aggregates all the data since the beginning of the stream, and each module to its right shows data for a specific range using a different aggregation technique.
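As a minimal illustration of this idea, the sketch below routes incoming points to modules by age. The boundaries, module names, and the route helper are hypothetical, not the actual VisMillion implementation.

```python
import time

# Hypothetical age boundaries (seconds): the newest data stays raw,
# older data falls into increasingly aggregated modules.
MODULES = [
    ("scatter_plot", 10),   # raw points, last 10 s
    ("heat_map", 60),       # binned density, 10-60 s old
    ("line_chart", 300),    # per-interval summaries, 1-5 min old
]
# Everything older goes to the leftmost, fully aggregated idiom.

def route(points, now=None):
    """Assign each (timestamp, value) point to the module for its age."""
    now = time.time() if now is None else now
    buckets = {name: [] for name, _ in MODULES}
    buckets["aggregate_all"] = []
    for ts, value in points:
        age = now - ts
        for name, limit in MODULES:
            if age <= limit:
                buckets[name].append((ts, value))
                break
        else:
            buckets["aggregate_all"].append((ts, value))
    return buckets
```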

In parallel with VisMillion, we also applied graceful degradation to geolocated data. We conducted a user study to understand whether people can perceive geographic data across three different time periods. In this example, we can see the traffic density in the city of Porto, Portugal. As data gets older, the encoding changes to let people see which streets have had more traffic throughout the day. Our results indicated that participants accurately distinguished the three periods and were able to identify trends.
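The sketch below illustrates the age-based re-encoding idea with three hypothetical periods; the actual cut-offs and visual channels used in the study differ.

```python
# Hypothetical encodings for three time periods of street-traffic data;
# the period boundaries and visual channels are illustrative only.
PERIODS = [
    (15 * 60, {"channel": "color", "style": "saturated"}),      # recent
    (60 * 60, {"channel": "color", "style": "faded"}),          # mid
    (float("inf"), {"channel": "width", "style": "thin grey"}), # old
]

def encode_segment(age_seconds):
    """Return the visual encoding for a street segment of a given age."""
    for limit, encoding in PERIODS:
        if age_seconds <= limit:
            return encoding
```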

Quantitative Horizontal Transitions

After creating the first prototype, our next step was to find a way to depict how data gets aggregated between modules. Although we could already see data evolving at different aggregation levels, abrupt cuts still separated the modules. Therefore, we decided to study horizontal transitions: animation techniques applied between two visual idioms that depict how data flows from one into the other. Our results allowed us to propose specific transitions between the scatter plot and several idioms.
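As an illustration of what such a transition computes, the sketch below moves scatter-plot points toward the centres of their target heat-map bins. The normalised coordinates and linear easing are assumptions, not the transitions we actually proposed.

```python
import numpy as np

def scatter_to_heatmap_frame(points, grid, t):
    """One animation frame of a hypothetical scatter-to-heat-map
    transition: each point drifts toward the centre of its target bin.

    points : (n, 2) array of x, y positions, normalised to [0, 1]
    grid   : (gx, gy) number of bins per axis
    t      : transition progress in [0, 1]
    """
    points = np.asarray(points, dtype=float)
    bins = np.floor(points * grid).clip(0, np.array(grid) - 1)
    centres = (bins + 0.5) / grid          # bin centres in [0, 1]
    return (1 - t) * points + t * centres  # linear interpolation
```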

Between the Scatter Plot and the Heat Map

Between the Scatter Plot and the Line Chart

Between the Scatter Plot and the Bar Chart

Between the Scatter Plot and the Stream Graph

Quantitative Vertical Transitions

After proposing horizontal transitions, we started to design vertical transitions. As stated above, horizontal transitions depict how data flows between idioms, whereas vertical transitions are used to allow people to perceive specific changes in the data by switching the idiom used for the same portion of the stream. For example, a line chart is well suited to detecting a trend, while a heat map is better at conveying flow variations.
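A minimal sketch of the mechanics, assuming a hypothetical render(data, opacity=...) idiom API: the same data window is cross-faded between the outgoing and incoming idiom.

```python
def vertical_transition(data_window, from_idiom, to_idiom, steps=30):
    """Sketch of a hypothetical vertical transition: the same data
    window is re-rendered while opacity is cross-faded between the
    outgoing and the incoming idiom.  render() is an assumed API.
    """
    for step in range(steps + 1):
        t = step / steps
        from_idiom.render(data_window, opacity=1 - t)
        to_idiom.render(data_window, opacity=t)
        yield t  # caller advances the animation clock between frames
```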

Between the Line Chart and the Heat Map

Between the Line Chart and the Stream Graph

Between the Heat Map and the Line Chart

Between the Stream Graph and the Line Chart

Between the Heat Map and the Stream Graph

Between the Stream Graph and the Heat Map

Streaming of Ordinal Big Data and Ordinal Horizontal Transitions

Having proposed horizontal transitions for streaming quantitative big data, we moved to the next phase, now with ordinal data. This led to two implementation phases. First, we adapted VisMillion to support ordinal data by selecting visual idioms suited to this type of data; the final version can be seen below. Then, we designed ordinal horizontal transitions between the chosen idioms.
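One adaptation that ordinal data requires is mapping ordered categories onto discrete positions; the sketch below shows this with hypothetical category names.

```python
# Hypothetical ordered categories; an ordinal scale preserves order
# but not distance, so each category maps to a discrete lane (row).
CATEGORIES = ["very low", "low", "medium", "high", "very high"]
LANE = {c: i for i, c in enumerate(CATEGORIES)}

def to_lane(record):
    """Map a (timestamp, category) record onto its vertical lane, so
    idioms like the ordinal scatter plot or histogram can bin it."""
    ts, category = record
    return ts, LANE[category]
```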

Between the Ordinal Scatter Plot and the Histogram

Between the Ordinal Scatter Plot and the Ordinal Line Chart

Between the Ordinal Line Chart and the Histogram

Between the Ordinal Line Chart and the Heat Map

Between the Heat Map and the Histogram

Between the Heat Map and the Ordinal Line Chart

Streaming Analysis Engine

Despite the advances in data science and the plethora of data analysis tools, unsupervised analysis techniques continue to pose challenges, which become even harder when dealing with data streams.

Contributions were made at two levels: the identification of patterns and anomalies, and the automation of the discovery process.

Pattern discovery and management were performed using the matrix profile framework, chosen for its innovative nature and its ability to handle both patterns and anomalies in an integrated manner. The first contribution was a new unsupervised methodology to manage the discovered elements over time, allowing for their continuous update.
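For intuition, the sketch below is a deliberately naive matrix profile computation: low profile values flag recurring patterns (motifs), high values flag anomalies (discords). Production implementations (e.g. the STOMP family of algorithms) and our continuous-update methodology are considerably more involved.

```python
import numpy as np

def matrix_profile(T, m):
    """Naive matrix profile: for each length-m subsequence of T, the
    z-normalised Euclidean distance to its nearest non-trivial match.
    O(n^2 * m); for illustration only.
    """
    T = np.asarray(T, dtype=float)
    n = len(T) - m + 1
    subs = np.array([T[i:i + m] for i in range(n)])
    # z-normalise each subsequence (epsilon guards constant windows)
    subs = (subs - subs.mean(axis=1, keepdims=True)) / \
           (subs.std(axis=1, keepdims=True) + 1e-12)
    P = np.full(n, np.inf)
    for i in range(n):
        d = np.linalg.norm(subs - subs[i], axis=1)
        d[max(0, i - m // 2):i + m // 2 + 1] = np.inf  # exclusion zone
        P[i] = d.min()
    return P
```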

This line of work was then extended to build classification models over time series, in particular models able to detect anomalous events. This new approach used modern machine learning tools, namely CNNs, transfer learning, and active learning.
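A minimal sketch of such a classifier, using Keras as a stand-in; the architecture and hyperparameters are illustrative, not the project's actual model. In a transfer-learning setting, the convolutional base would be pre-trained and only the dense head fine-tuned on labels obtained through active learning.

```python
import tensorflow as tf

def build_anomaly_classifier(window_len, n_channels=1):
    """Minimal 1-D CNN classifying fixed-length time-series windows
    as normal vs. anomalous."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(window_len, n_channels)),
        tf.keras.layers.Conv1D(32, 7, activation="relu"),
        tf.keras.layers.MaxPooling1D(2),
        tf.keras.layers.Conv1D(64, 5, activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(anomaly)
    ])

model = build_anomaly_classifier(window_len=128)
model.compile(optimizer="adam", loss="binary_crossentropy")
```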

In the automation line of work, we developed a framework for processing time series and training forecasting models, providing an integrated environment to identify common behaviors and to predict future values with minimal human intervention. The framework supports the exploratory analysis of time series and the application of the most common preparation techniques, along with the possibility of training LSTMs for forecasting.
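A stripped-down sketch of the forecasting step of such a pipeline, using Keras and synthetic data as stand-ins; the framework's exploratory-analysis and preparation stages are not shown.

```python
import numpy as np
import tensorflow as tf

def make_windows(series, lookback):
    """Slice a 1-D series into (lookback -> next value) training pairs."""
    X = np.array([series[i:i + lookback]
                  for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X[..., None], y  # add a channel axis for the LSTM

series = np.sin(np.linspace(0, 60, 2000))  # placeholder data
X, y = make_windows(series, lookback=48)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(48, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),              # one-step-ahead forecast
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
```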

The final contribution was a new algorithm for the automatic generation of features based on domain knowledge (the DANKFE algorithm). This algorithm uses ER diagrams as the formalism for expressing existing domain knowledge. From each relation in those diagrams, it generates a new variable, doing so more efficiently and effectively than existing automation mechanisms such as the one in the auto-sklearn package.
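The core idea can be sketched as follows, with a hypothetical Customer 1-N Order relation; the names, aggregation choice, and feature_from_relation helper are illustrative, not the DANKFE algorithm itself.

```python
import pandas as pd

# Hypothetical ER relation: each Order belongs to one Customer.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount": [10.0, 25.0, 5.0, 7.5, 12.0, 40.0],
})
customers = pd.DataFrame({"customer_id": [1, 2, 3]})

def feature_from_relation(parent, child, key, column, agg="mean"):
    """Generate one new variable on the parent entity from a 1-N
    relation, following the idea of one feature per ER relation."""
    name = f"{column}_{agg}_of_related"
    feature = (child.groupby(key)[column].agg(agg)
                    .rename(name).reset_index())
    return parent.merge(feature, on=key, how="left")

customers = feature_from_relation(customers, orders,
                                  "customer_id", "amount")
```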


VisMillion Final Framework

VisMillion is divided into several modules. The Data Collector/Processor is the component responsible for collecting the data; during this project, it was fed data from the Pentaho Data Integration (PDI) tool from Hitachi Vantara. The Data Streamer is a Python server that acts as a middleman between the gathering and processing of the data and the visualization system. Although it was designed to work with PDI, it can be easily customized to support any other data source. The AI Module suggests the idiom that best represents the arriving data, prompting a change in the system. The final component is the Dashboard, which serves as the entry point to the VisMillion system for the data processed in the Data Collector/Processor and sent through the Data Streamer, as well as for the recommendations sent by the AI Module.
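As a rough illustration of the Data Streamer's middleman role, the sketch below relays newline-delimited JSON from a source port to every connected visualization client. The ports, wire format, and use of plain TCP are assumptions, not the actual implementation.

```python
import asyncio
import json

clients = set()  # writers for connected visualization clients

async def handle_client(reader, writer):
    """A visualization client subscribing to the relayed stream."""
    clients.add(writer)
    try:
        await reader.read()  # keep the connection open until it closes
    finally:
        clients.discard(writer)

async def handle_source(reader, writer):
    """The data source (e.g. a PDI step) pushing JSON records."""
    async for line in reader:
        record = json.loads(line)  # validate before relaying
        payload = (json.dumps(record) + "\n").encode()
        for client in list(clients):
            client.write(payload)

async def main():
    source = await asyncio.start_server(handle_source, "127.0.0.1", 9000)
    sinks = await asyncio.start_server(handle_client, "127.0.0.1", 9001)
    async with source, sinks:
        await asyncio.gather(source.serve_forever(),
                             sinks.serve_forever())

asyncio.run(main())
```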