We have a sequence of events, where each event is a nested JSON object of predominantly categorical and ordinal key-value pairs.
Each event has a name, and there are 100 possible event names.
Each event has a type, and there are 5 possible types.
Sequences vary in length: some are just 1 event long, others are 1000 events long.
Events within a sequence can repeat.
Sequences are available as data streams in a Kafka topic, and the rate can vary with the time of day or the day of the week.
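To make the data description above concrete, here is a minimal sketch of turning one nested event into flat categorical tokens. The field names (`name`, `type`, `attrs`) and the `flatten_event` helper are assumptions for illustration only; the actual event schema is not specified here.

```python
def flatten_event(event: dict, prefix: str = "") -> list:
    """Return sorted 'path=value' tokens, one per leaf of a nested JSON object.

    Assumption: leaves are scalars. Lists are not expanded; they are
    rendered with str() like any other leaf value.
    """
    tokens = []
    for key in sorted(event):
        value = event[key]
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            tokens.extend(flatten_event(value, path))
        else:
            tokens.append(f"{path}={value}")
    return tokens


# Hypothetical event; the real schema may differ.
example_event = {
    "name": "login",
    "type": "auth",
    "attrs": {"device": "mobile", "risk": 2},
}
example_tokens = flatten_event(example_event)
```

Each token is a categorical feature, so a sequence becomes a list of token lists that set-based or count-based similarity measures can work with.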
Problem-1
Sequences are emitted to a Kafka topic. The Kafka consumer must process each sequence and determine whether it is similar to a sequence seen in the past hour. The algorithm must also explain why it considers the sequence similar to one observed earlier.
Questions
This looks like similarity matching rather than exact matching. How do we define similarity mathematically for sequences with the characteristics described above?
Once similarity is defined, how do we represent it computationally?
We need to find out whether the sequence occurred in the last hour. How do we store the last hour of data, and in what form? How does the storage choice affect the similarity matching process?
How do we build an explanation module for the sequence similarity problem that can scale to the data rate?
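One possible way to approach these questions, offered as a sketch rather than a recommended answer: represent each sequence by the set of its event-name bigrams, define similarity as the Jaccard index of two such sets, keep the last hour in an in-memory deque, and use the shared bigrams as the explanation. The class name, the threshold of 0.5, and the choice of bigrams are all assumptions.

```python
from collections import deque

WINDOW_SECONDS = 3600  # "the past hour" from the problem statement


def shingles(names, n=2):
    """Set of n-grams of consecutive event names (order-aware features)."""
    if len(names) < n:
        return {tuple(names)}  # short sequences become a single shingle
    return {tuple(names[i:i + n]) for i in range(len(names) - n + 1)}


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class RecentSequences:
    """Hypothetical one-hour store with a linear scan for the best match."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.window = deque()  # (timestamp, seq_id, shingle_set), oldest first

    def check_and_add(self, seq_id, names, now):
        # Evict entries that are more than one hour old.
        while self.window and now - self.window[0][0] > WINDOW_SECONDS:
            self.window.popleft()
        current = shingles(names)
        best = None
        for _, other_id, other in self.window:
            score = jaccard(current, other)
            if score >= self.threshold and (best is None or score > best["score"]):
                # The shared shingles double as the explanation.
                best = {"match": other_id, "score": score,
                        "shared": sorted(current & other)}
        self.window.append((now, seq_id, current))
        return best


store = RecentSequences(threshold=0.5)
first = store.check_and_add("s1", ["login", "view", "buy", "logout"], now=0)
second = store.check_and_add("s2", ["login", "view", "buy", "pay"], now=60)
late = store.check_and_add("s3", ["login", "view", "buy", "logout"], now=4000)
```

The linear scan costs time proportional to the number of sequences in the window, so at high data rates an approximate index such as MinHash with locality-sensitive hashing would probably be needed; that part is not shown.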
Problem-2
The Kafka consumer writes sequences to a persistent store. Every hour, on the hour, a cron job must group these sequences by some similarity characteristic.
Questions
This looks like a clustering problem. Which clustering algorithm can we use given the characteristics mentioned above?
New sequences are added every hour, so the clustering job must handle progressively more data. How can we scale the clustering job? Will the choice of clustering algorithm and similarity function affect the solution?
Do we need a different similarity function for clustering, or can we reuse the function from Problem-1?
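As one illustration of how the similarity function and the scaling concern interact, the sketch below uses leader clustering (a single-pass, threshold-based method) with a Jaccard index over feature sets. This is an assumption chosen for simplicity, not a claim that it is the right algorithm here; the threshold and the use of plain event-name sets as features are also assumptions.

```python
def jaccard(a, b):
    """Jaccard index of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leader_cluster(items, threshold=0.5):
    """Single-pass leader clustering.

    items: dict mapping sequence id -> set of features.
    Each sequence joins the first cluster whose leader is similar enough,
    otherwise it starts a new cluster and becomes its leader.
    """
    leaders = []  # list of (leader_features, member_ids)
    for seq_id, features in items.items():
        for leader_features, members in leaders:
            if jaccard(features, leader_features) >= threshold:
                members.append(seq_id)
                break
        else:
            # No leader was close enough: open a new cluster.
            leaders.append((features, [seq_id]))
    return [members for _, members in leaders]


# Hypothetical sequences reduced to sets of event names.
example_items = {
    "a": {"login", "view", "buy"},
    "b": {"login", "view", "buy", "pay"},
    "c": {"search", "filter"},
    "d": {"search", "filter", "sort"},
}
example_clusters = leader_cluster(example_items, threshold=0.5)
```

Because each new sequence is compared only against the cluster leaders, an hourly job could in principle keep the leaders from previous runs and assign only the new sequences. The result depends on arrival order, which is a known weakness of this method.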
Problem-3
We have a visualization screen where we need to plot each sequence in a 3D coordinate space, so each sequence becomes a point in a cube. Any time a user reloads the screen, the last hour of sequences should be rendered in this 3D space with minimal refresh delay.
Questions
How can we convert sequences into points in 3D space, especially when each sequence can have a different length and contain different events?
Does this tie back to Problem-1 and Problem-2?
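A very rough sketch of one length-independent mapping: give every event name a fixed pseudo-random 3D direction derived from a hash, then place a sequence at the frequency-weighted average of its events' directions. This is a hashed random projection of the event-frequency vector, chosen here only because it is deterministic and can be computed per sequence at ingestion time; the function names are hypothetical, and three dimensions cannot preserve similarity faithfully.

```python
import hashlib
from collections import Counter


def name_direction(name):
    """Deterministic pseudo-random point in [-1, 1]^3 for an event name."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Map the first three bytes (0..255) onto the range [-1, 1].
    return tuple(digest[i] / 127.5 - 1.0 for i in range(3))


def to_point(names):
    """Frequency-weighted average of per-name directions.

    Ignores event order, and an empty sequence maps to the origin.
    """
    counts = Counter(names)
    total = sum(counts.values())
    point = [0.0, 0.0, 0.0]
    for name, count in counts.items():
        direction = name_direction(name)
        for axis in range(3):
            point[axis] += direction[axis] * count / total
    return tuple(point)


short_point = to_point(["login", "view"])
long_point = to_point(["login", "view", "login", "view"])
```

Sequences with the same mix of events land on the same point regardless of length, so the point can be stored alongside the sequence and a screen reload only has to read the last hour of precomputed coordinates. A projection fitted to the data, for example one derived from the Problem-1 similarity, would likely give a more meaningful layout.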
Problem-4
Another Kafka topic emits event streams (not sequences). We need to subscribe to this topic and, at time t, predict what the event will be at t+1, t+2.
Questions
This feels like a time-series prediction problem over discrete events. The input tensors are likely to be not only huge but also sparse. Is there an alternative to standard time-series models?
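One simple alternative that avoids dense tensors is a first-order Markov chain stored as sparse transition counts. The sketch below is a baseline under that assumption, not a proposed solution; the class name and the greedy rollout used for t+2 are illustrative choices.

```python
from collections import Counter, defaultdict


class MarkovPredictor:
    """First-order Markov chain over event names with sparse counts."""

    def __init__(self):
        # previous event -> Counter of the events that followed it
        self.transitions = defaultdict(Counter)
        self.last = None

    def observe(self, event):
        """Update the counts with the next event from the stream."""
        if self.last is not None:
            self.transitions[self.last][event] += 1
        self.last = event

    def predict(self, steps=2):
        """Greedy rollout: follow the most frequent transition `steps` times.

        Returns fewer than `steps` predictions if an unseen state is reached.
        """
        predictions = []
        current = self.last
        for _ in range(steps):
            followers = self.transitions.get(current)
            if not followers:
                break
            current = followers.most_common(1)[0][0]
            predictions.append(current)
        return predictions


predictor = MarkovPredictor()
for event_name in ["a", "b", "c", "a", "b", "c", "a"]:
    predictor.observe(event_name)
next_two = predictor.predict(steps=2)
```

Only transitions that were actually observed are stored, so memory grows with the number of distinct event pairs seen (at most 100 x 100 for 100 event names) rather than with the length of the stream. Higher-order context or event attributes would need a richer model.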
Generalizing Machine Representation of Sequential Information