Natural language is unique in the sense that it allows us to express the same fact in multiple ways while remaining bound by an underlying grammar. Although it is easy for humans to recognize that different articulations of the same fact are indeed referring to the same fact, it is very hard for computational algorithms to do so. There are modern techniques under the deep learning paradigm that are able to learn these variations. However, they require enormous amounts of data for training. In fact, if all examples of articulations are available, machine learning algorithms are merely committing everything to 'memory', so to speak. It is just that the memory is built and retrieved using additions and multiplications.
In 2018, my co-researcher Mrityunjay Kumar and I published an article in which the problem of understanding question articulation was formulated as an objective-driven optimization problem under conditions where examples of complementary objectives were not available. We showed that the optimization problem can be solved as a fingerprinting technique using auto-encoders. We were exploring ideas in which multiple question articulation styles map to the same answer generation system. Our plan was to focus on a particular answer generation logic and then determine what types of question articulations would support it. As the answer generation logic grew more complex, we could add more classes of articulations to the training data set. This was a strategy to incrementally build an answer generation system without very large datasets or expensive deep learning infrastructure. During this research work, we discovered that there may be techniques to generalize sequence representation in a form favourable to machine learning algorithms.
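To make the fingerprinting idea concrete, here is a minimal sketch; the data, vectorizer, and shallow architecture are illustrative assumptions, not the exact setup from the article. An auto-encoder trained only on in-class articulations tends to reconstruct similar articulations well, so reconstruction error can act as a fingerprint for class membership.

```python
# Hypothetical sketch: reconstruction-error fingerprinting with a shallow
# auto-encoder (an MLP trained to reproduce its own input).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPRegressor

# Articulations of the one question class we have examples for.
in_class = [
    "what is the price of this item",
    "how much does this item cost",
    "tell me the cost of this item",
    "what would i pay for this item",
]

vec = TfidfVectorizer()
X = vec.fit_transform(in_class).toarray()

# Fitting an MLP with X as both input and target makes it an auto-encoder.
ae = MLPRegressor(hidden_layer_sizes=(4,), max_iter=5000, random_state=0)
ae.fit(X, X)

def reconstruction_error(texts):
    Z = vec.transform(texts).toarray()
    return np.mean((ae.predict(Z) - Z) ** 2, axis=1)

# Threshold derived from in-class errors; the 1.5 margin is arbitrary.
threshold = reconstruction_error(in_class).max() * 1.5
print(reconstruction_error(["how much is this item"]) <= threshold)      # expected: True
print(reconstruction_error(["where is the nearest store"]) <= threshold) # expected: False
```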
It is interesting to note that the principles used in the article referenced above in the context of natural language can also be extended to use-cases involving other forms of data. Many forms of data exhibit characteristics of a grammar, yet they cannot be codified the way we have codified grammar for natural language. There are computational techniques that can process such sequential data without the elaborate tooling common in natural language processing. In the following sections we elaborate on this thought.
In the question articulation case, the underlying rules of composition were grammar. In general, sequential data is made up of a sequence of building blocks, where each block has a specific function or meaning. As these building blocks are organized into a sequence, new meaning or functionality can be accorded to that sequence. Sometimes building blocks take on a contextual meaning when combined with other building blocks. We see similar parallels in natural language as well; however, the underlying rules of composition vary with the domain. Let us elaborate on this thought. Think of a security event log. Each log statement is the output of a logger whose format was determined by the engineering team. This means there is an implicit grammar, invented (perhaps subconsciously) by the engineering team, whose output is the log statement. Each security vendor may have its own logging format and hence its own grammar. Unlike natural language, there is no formal standard or structure.
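As an illustration, consider how such an implicit grammar might be reverse-engineered. The log format, field names, and values below are entirely made up; the point is that a single regular expression can capture the 'grammar' implied by one vendor's logger.

```python
import re

# A made-up vendor log line; the format is purely illustrative.
log_line = "2023-05-14T10:22:31Z ACME-FW deny src=10.0.0.5 dst=192.168.1.9 port=443"

# The engineering team's implicit grammar, reverse-engineered as a regex.
pattern = re.compile(
    r"(?P<ts>\S+) (?P<vendor>\S+) (?P<action>\w+) "
    r"src=(?P<src>\S+) dst=(?P<dst>\S+) port=(?P<port>\d+)"
)

match = pattern.match(log_line)
if match:
    print(match.groupdict())
    # {'ts': '2023-05-14T10:22:31Z', 'vendor': 'ACME-FW', 'action': 'deny', ...}
```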
In real-world applications, algorithms are expected to automatically discover and represent these rules of composition; this is a data-driven exercise that can be executed using machine learning algorithms. It is worth noting that we cannot blindly use standard machine learning algorithms. We need to get into the guts of these algorithms and mix and match various techniques so as to build a proprietary approach relevant to the data and the use-cases of that particular domain.
The sequence of building blocks is actually a sequence of hierarchical basic blocks. In natural language these basic blocks can be considered words, and the hierarchy is nothing but the different types of phrases. Note, however, that there are compositional rules governing how these basic blocks and building blocks can be put together. This concept of building blocks and basic blocks exists in many domains (a uniform encoding is sketched after the examples below):
The nitrogen bases, specific sugar structures, and phosphates are building blocks. When organized in a specific sequence of inter-connectivity, they result in the genetic code. The genetic code is akin to the 'meaning' accorded to that domain.
System-call sequences accord specific functionality to a software program by virtue of being called in a specific order with input/output linking them. The sequence accords meaning to that software, or in other words, reveals a 'business logic'.
In distributed, micro-services-based applications, APIs are the basic building blocks. A sequence of such API calls, governed by an underlying software glue, causes a specific meaning to be revealed, which we refer to as 'business logic'.
In the logistics world, way-points form basic building blocks, a specific sequence of which can lead to meaningful and optimized routing of goods and services.
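Despite the differences in domain, each of the examples above reduces to the same computational object: a sequence of discrete tokens. The sketch below (with invented sequences) shows one uniform encoding, mapping each building block to an integer id over a per-domain vocabulary.

```python
# Hedged sketch: encode building blocks from any domain as integer token ids.
def encode(seq, vocab):
    # setdefault assigns a fresh id the first time a block is seen
    return [vocab.setdefault(block, len(vocab)) for block in seq]

dna_vocab, sys_vocab, api_vocab = {}, {}, {}
print(encode(["A", "C", "G", "T", "A", "C"], dna_vocab))                  # [0, 1, 2, 3, 0, 1]
print(encode(["open", "read", "read", "write", "close"], sys_vocab))     # [0, 1, 1, 2, 3]
print(encode(["POST /login", "GET /cart", "POST /checkout"], api_vocab)) # [0, 1, 2]
```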
It is tricky to find a representation of sequential information that not only captures the semantics but is also friendly to machine learning. There is certainly no single best practice or approach for such a representation; one needs to discover the representation best suited to the specific nature of the sequential data. Having to deal with sequential information of a discrete and categorical nature compounds this challenge. Nevertheless, there are certain approaches that can generalize across different types of discrete sequences. For that we need to spell out the characteristics of real-world sequential data (a data structure reflecting these traits is sketched after the list), and here are some of them:
Sequences are of variable length
Each event in a sequence is a data-point, and all events need not have the same features (or attributes)
There can be a hierarchical overlay of relationships between events in a sequence; for example, an event may occur only when a particular sub-sequence has preceded it
There can be meaningful inter-event time intervals
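To see what these traits imply for tooling, here is a minimal container that accommodates them; the event names and fields are hypothetical.

```python
# Hedged sketch of a sequence container: variable length, per-event feature
# dictionaries (events need not share attributes), and timestamps for
# inter-event intervals.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Event:
    name: str
    timestamp: float                                        # seconds
    features: dict[str, Any] = field(default_factory=dict)  # may vary per event

@dataclass
class Sequence:
    events: list[Event] = field(default_factory=list)

    def intervals(self) -> list[float]:
        """Inter-event time gaps."""
        ts = [e.timestamp for e in self.events]
        return [b - a for a, b in zip(ts, ts[1:])]

seq = Sequence([
    Event("login",  0.0,  {"user": "alice"}),
    Event("read",   2.5,  {"path": "/etc/passwd", "bytes": 512}),
    Event("logout", 10.0),                 # no features at all
])
print(len(seq.events), seq.intervals())    # 3 [2.5, 7.5]
```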
It is enticing to model a sequence as a tensor and use any of the modern neural-network-based techniques. However, there are serious pitfalls. Tensors for variable-length sequences with a variable number of features become huge and sparse. Training on sparse tensors is a significant effort that strains the underlying optimization algorithms, and it is very expensive in terms of both computational infrastructure and the time spent tuning and understanding the best-suited model architectures.
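A back-of-the-envelope calculation (with invented but plausible numbers) shows how quickly the dense representation degenerates into mostly zeros:

```python
# Illustrative numbers: 1,000 sequences padded to the longest length and the
# full feature set, while a typical sequence is far shorter and far narrower.
n_seqs, max_len, max_feats = 1000, 500, 200
avg_len, avg_feats = 50, 5

dense_cells  = n_seqs * max_len * max_feats   # what the padded tensor stores
filled_cells = n_seqs * avg_len * avg_feats   # what actually carries signal
print(f"sparsity: {1 - filled_cells / dense_cells:.4%}")  # sparsity: 99.7500%
```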
Building product features around sequential information is not like a data-science challenge where prediction accuracy is the ultimate metric. We never get a nicely formatted spreadsheet to work with, and most problems cannot be formulated as MODEL.fit() followed by MODEL.predict(). One needs to find the simplest approach that gets to the product feature, even if that approach does not give the most accurate solution. Occam's razor needs to be applied: at every step we should build a product feature with just enough machine learning for that stage of the product. In addition, we need to integrate with different data sources, data formats, and use-case nuances. Sometimes user experience dictates which algorithms can or cannot be used. Further, we should be able to pinpoint the sections of the sequential data that led to the current decision (this is not the same as feature importance). The starting point is to list a generalized set of problem statements involving sequential data. Then we look into data characteristics, issues of scale, and expected UX. Finally, we investigate algorithms that satisfy these conditions. In the following section we shall see some problem statements involving sequential data.
Problem Statements Involving Sequential Data