"Machine learning allows computers to learn and discern patterns without actually being programmed. When Statistical techniques and machine learning are combined together they are a powerful tool for analysing various kinds of data in many computer science/engineering areas including, image processing, speech processing, natural language processing, robot control, as well as in fundamental sciences such as biology, medicine, astronomy, physics, and materials"
2016, Masashi Sugiyama, Introduction to Statistical Machine Learning
In this project I used advanced statistical methods, machine learning and artificial intelligence to try to answer the following questions:
1) Are there regularities in historical financial data?
2) If so, is it possible to express historical financial data as a series of regularities or small, simple patterns? And what would be the right way to represent them (in length and shape): mathematical models or computational ones?
3) If patterns repeat over time, is there a chance of making predictions?
4) If predictions are possible, how far into the future can they reach?
In the background of this project are philosophical questions that are addressed by research on complexity and the neural sciences, such as:
1) How do regularities arise in phenomena that seem random? Can these regularities be represented as a pattern of regularities, and then summarised in a single pattern?
2) What kind of computational and mathematical models realize such patterns?
3) If it is possible to model such behavior, how much can we trust the resulting predictions?
4) Is there a better way to approximate such complex behaviors?
Code is here
I will start with a statistical description of the data, such as correlation. Then I will study the possibility of finding patterns using machine learning techniques such as clustering. If I find them, I will look for an "agile" way to represent them, such as mathematical models or parametric functions (neural networks). If all goes well, I will try to make predictions with them.
Figure 1 shows the global behavior of the ask and bid prices. The difference between them is shown in grey.
At first glance, over very short times, if we cut the whole plot into small pieces, we can note repeated patterns with a high degree of similarity. Over longer times, there is moderate variation around an evident trend (up-down, up-down), whereas the ask-bid difference remains mostly constant.
Fig. 1 Global data behaviour. Ask prices in brown, bid prices in blue. Difference bid-ask in grey.
First, "naively", I will partition the data into windows of different lengths (5, 6, ..., 10, 15 and 30 days) and compare the correlations between the "n" components of these windows. I will also use K-means for unsupervised clustering, trying to distinguish any grouping of the data by the average at the end of a period.
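The partitioning step can be sketched as follows. This is only a minimal illustration, not the project's actual code: it assumes the mid price (average of ask and bid) is available as a 1-D array, uses a synthetic random walk in its place, and the helper names `make_windows` and `component_correlation` are hypothetical.

```python
import numpy as np

def make_windows(prices, length):
    """Cut a 1-D series into consecutive, non-overlapping windows of `length` points."""
    prices = np.asarray(prices, dtype=float)
    n_windows = len(prices) // length
    # Drop the incomplete tail so the series reshapes cleanly
    return prices[: n_windows * length].reshape(n_windows, length)

def component_correlation(windows):
    """Correlation matrix between the positions (components) of the windows."""
    # rowvar=False: each column (position inside the window) is treated as a variable
    return np.corrcoef(windows, rowvar=False)

# Synthetic stand-in for the mid price (ask + bid) / 2; NOT the project's data
rng = np.random.default_rng(0)
mid = 100 + np.cumsum(rng.normal(0, 0.1, size=3000))

windows = make_windows(mid, 10)        # one row per pattern, 10 components each
corr = component_correlation(windows)  # 10 x 10 matrix
print(np.round(corr, 2))
```

Non-overlapping windows are used here only for simplicity; sliding windows would give more samples, but neighbouring windows would share most of their points.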
Figure 2 compares the components of a pattern of arbitrary length, in this specific case length = 10. It is worth saying that very similar plots are obtained regardless of the pattern length (from 3 to 30).
Figure 2(a) shows a strong correlation between the components of a pattern. This is somewhat obvious, since a pattern here is built from the average of the ask and bid prices. More interesting is the clustering, which shows what could be considered one single spatial cluster despite the variation in colors (average of the period).
This just corroborates visually a first set of suppositions:
Since in general a pattern is clustered, there is no strong variability; in other words, the behavior of the data is, in short, monotonous, and
Over long periods of time, since there is a strong correlation between components, the same behavior is repeated within the same patterns.
A partial conclusion, or a first thesis based on the above two statements, is that there are not many different patterns in the data, and that patterns should be defined as fairly short rather than long. In this sense, I will arbitrarily consider patterns of 30 components as very large, while 10 components seems, at first glance, a good size.
Fig. 2 Histogram of pattern of size 10 and arbitrary clustering
Now, let's let machine learning suggest and find clusters by itself. Let's use the elbow method and dendrograms to find how many clusters there are.
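To make the elbow idea concrete, here is a rough sketch using a small hand-written K-means in NumPy. It is an illustration under assumptions, not the code used for the figures: the data are four synthetic blobs standing in for the 10-component windows, and `kmeans_inertia` and `best_inertia` are hypothetical helpers. The dendrogram part is not covered.

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50, seed=0):
    """Run plain Lloyd's K-means once and return the within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

def best_inertia(X, k, n_init=5):
    """Keep the best of a few random initialisations to reduce bad local minima."""
    return min(kmeans_inertia(X, k, seed=s) for s in range(n_init))

# Synthetic data: 4 well-separated blobs in 10 dimensions (NOT the project's data)
rng = np.random.default_rng(1)
true_centers = rng.normal(0, 10, size=(4, 10))
X = np.vstack([c + rng.normal(0, 1, size=(100, 10)) for c in true_centers])

inertias = {k: best_inertia(X, k) for k in range(1, 9)}
for k, value in inertias.items():
    print(k, round(value, 1))
```

The "elbow" is the value of k after which the inertia stops dropping sharply; reading it off is a visual judgment, which is why the text compares 4 and 5 clusters.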
Fig. 3 Elbow method and dendrogram for finding the number of clusters
Figure 3 suggests 4 or 5 clusters. Let's plot both in figure 4.
Fig. 4. Clustering in 4 and 5 clusters.
Figure 4 shows the clustering into 4 and 5 clusters. Visually, the fifth cluster, in yellow, does not seem to be representative, so I will work under the idea that my data set can be divided into 4 groups. Let's try artificial intelligence techniques to corroborate this supposition.
First, for comparison, I will create a heat map of the distribution of values. A heat map, like a map where land elevations are represented, uses light colours for high elevations (high values of similarity) and darker colours for valleys, or low elevations (lower values of similarity). In general, the same colour indicates a group, a set, or a cluster.
Figure 5 shows roughly 4 different colours in different shades: red, orange, yellow and blue. Does this mean that there are only 4 different patterns in the data set? Let's explore a bit more, but once again we have another hint pointing to 4 possible clusters.
Fig. 5. Heat map of data set
Figure 6 corroborates the presence of 4 patterns using a SOM (self-organizing map). The SOM was trained with a deliberately high number of possible clusters (19, chosen arbitrarily, only to exaggerate the search).
The SOM finds classes by similarity and groups them by color.
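For readers unfamiliar with SOMs, the following is a bare-bones sketch of the idea. It assumes a 1-D map with 19 units, linearly decaying learning rate and neighbourhood width, and synthetic data; the project's actual SOM (its library, map shape and parameters) is not stated, so every name and setting here is hypothetical.

```python
import numpy as np

def train_som(X, n_units=19, n_epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    """Train a 1-D self-organizing map and return the weight vector of each unit."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise each unit with a randomly chosen sample
    weights = X[rng.choice(len(X), size=n_units)].copy()
    positions = np.arange(n_units)
    total_steps = n_epochs * len(X)
    step = 0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            decay = 1.0 - step / total_steps       # shrinks from 1 towards 0
            lr = lr0 * decay
            sigma = max(sigma0 * decay, 0.5)
            # Best matching unit: the unit whose weights are closest to the sample
            bmu = np.argmin(((weights - X[i]) ** 2).sum(axis=1))
            # Units near the BMU on the map are pulled towards the sample too
            h = np.exp(-((positions - bmu) ** 2) / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (X[i] - weights)
            step += 1
    return weights

def assign_units(X, weights):
    """Label each sample with the index of its best matching unit."""
    d2 = ((np.asarray(X, dtype=float)[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

# Synthetic data: 4 blobs in 10 dimensions (NOT the project's data)
rng = np.random.default_rng(1)
centers = rng.normal(0, 10, size=(4, 10))
X = np.vstack([c + rng.normal(0, 1, size=(100, 10)) for c in centers])

weights = train_som(X)
labels = assign_units(X, weights)
print(np.bincount(labels, minlength=19))  # how many samples land on each unit
```

Giving the map more units than the expected number of groups mirrors the "exaggerated" search described above: units that attract few or no samples suggest that fewer groups are really present.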
Fig. 6 SOM classification
The SOM roughly shows 4 big and consistent groups, and maybe two more lightly spread over the whole classification. I will take for granted that there are 4 groups.
Now I know that there is a small number of possible patterns. Next I will train my computer to find them automatically. In other words, I want to know what these patterns look like, to get an idea of how to express or define them (mathematically? as a neural network?).
For this I will use as the criterion the variations over time in periods of 30 data points. The idea is to let the computer calculate these patterns automatically and find them exhaustively.
Fig. 7 Finding patterns by variations
Figure 7 corroborates that there are patterns of behavior. The lines on the right side show how similar the patterns grouped in each plot are. The cyan plot is the base pattern, and the other colored plots are all the similar patterns around the base.
The line on the right shows that most of the patterns lie between -0.2 and 0.2 in similarity, which means they are acceptably similar.
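One possible way to search for patterns "by variations" is sketched below. The exact similarity measure behind figure 7 is not given, so this is an assumption: each 30-point window is expressed as its change relative to its first value, and a window counts as similar when its mean signed difference from the base lies within ±0.2. The names `to_variations` and `similar_windows` are hypothetical.

```python
import numpy as np

def to_variations(window):
    """Express a window as changes relative to its first value."""
    window = np.asarray(window, dtype=float)
    return window - window[0]

def similar_windows(series, base_start, length=30, tol=0.2):
    """Return (start, deviation) for every window close to the base pattern."""
    series = np.asarray(series, dtype=float)
    base = to_variations(series[base_start:base_start + length])
    matches = []
    for start in range(0, len(series) - length + 1, length):
        if start == base_start:
            continue  # skip the base itself
        candidate = to_variations(series[start:start + length])
        deviation = float(np.mean(candidate - base))  # signed mean difference
        if abs(deviation) <= tol:
            matches.append((start, deviation))
    return matches

# Synthetic series: one shape repeated 20 times with small noise (NOT the project's data)
rng = np.random.default_rng(0)
shape = np.sin(np.linspace(0, 2 * np.pi, 30))
series = np.tile(shape, 20) + rng.normal(0, 0.01, size=600)

matches = similar_windows(series, base_start=0)
print(len(matches), "windows are similar to the base pattern")
```

Repeating the search with each unmatched window as a new base would give the exhaustive grouping the text describes, at the cost of comparing windows one by one.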
As a second partial conclusion I might say: yes, patterns do repeat, and there can be a way (an average, for example) to represent them.
I'm going to go a little further with the statistical analysis before making predictions.
When samples are grouped by day or by hour, periodicity and seasonality are clear. This can be observed in figure 8.
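Grouping by period can be sketched like this, assuming evenly spaced samples (for example one per hour) so that the position within the period is just the sample index modulo the period. The helper `mean_by_period` and the synthetic series are illustrative only.

```python
import numpy as np

def mean_by_period(values, period):
    """Average all samples that fall on the same phase of a repeating period."""
    values = np.asarray(values, dtype=float)
    phase = np.arange(len(values)) % period            # e.g. hour of day for period=24
    sums = np.bincount(phase, weights=values, minlength=period)
    counts = np.bincount(phase, minlength=period)
    return sums / counts                               # assumes len(values) >= period

# Synthetic hourly series: a daily cycle plus noise (NOT the project's data)
rng = np.random.default_rng(0)
hours = np.arange(24 * 60)
prices = 100 + np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.2, size=hours.size)

profile = mean_by_period(prices, 24)
print(np.round(profile, 2))  # one average value per hour of the day
```

If the averaged profile is clearly not flat, the series has a component that repeats with that period.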
Fig. 8. Repetitive patterns in seasonal analysis
In figure 9 we can see that the correlation between a pattern and the pattern taken as the base (see the cyan plots in fig. 7) fluctuates from positive to negative. This means that if we 'sit' at one point and 'walk' forward through the data looking for correlations, the correlation is positive when later data follow the same trend as the base and negative when they follow the opposite trend. On the other hand, a correlation of 0 means there is no relation between the data, or no shared trend. Even here, a repetitive pattern can be seen in the correlation.
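The "walking forward" idea corresponds to the sample autocorrelation, which can be sketched as below. This uses the standard estimator on a synthetic periodic signal; it is not necessarily the exact computation behind the figure, and `autocorrelation` is a hypothetical helper name.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation of a series for lags 0..max_lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    # Correlate the series with a copy of itself shifted forward by `lag` samples
    return np.array([
        np.dot(x[: len(x) - lag], x[lag:]) / denom
        for lag in range(max_lag + 1)
    ])

# Synthetic signal with a period of 50 samples (NOT the project's data)
t = np.arange(1000)
signal = np.sin(2 * np.pi * t / 50)

acf = autocorrelation(signal, max_lag=100)
print(np.round(acf[[0, 25, 50]], 3))
```

For a periodic series the autocorrelation is itself periodic: strongly positive at lags equal to a full period, strongly negative at half a period, and near zero in between.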
Fig. 12. Autocorrelation analysis
Up to this point, I'm more confident that patterns exist, based on machine learning techniques and statistical analysis. Now I will try to model these regularities: first by finding mathematical expressions for each pattern, and second by using neural networks, to finally make predictions.
The most naive way to build a formula that mathematically reproduces a pattern is regression. So I will simply take each of the possible patterns found by my computer (shown partially in figure 7) and fit a regression to it. Here are the results.
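As an illustration of turning a pattern into a formula, the sketch below fits a polynomial by least squares. The type of regression actually used for the figures is not stated, so the polynomial form, the degree and the helper name `fit_pattern` are assumptions.

```python
import numpy as np

def fit_pattern(pattern, degree=3):
    """Fit a polynomial y = f(step) to one pattern and return it as a callable."""
    y = np.asarray(pattern, dtype=float)
    x = np.arange(len(y))
    coeffs = np.polyfit(x, y, degree)   # least-squares fit, highest power first
    return np.poly1d(coeffs)

# Synthetic 30-point pattern following a known quadratic (NOT the project's data)
x = np.arange(30)
pattern = 0.01 * x**2 - 0.2 * x + 1.0

model = fit_pattern(pattern, degree=2)
print(model)             # the fitted equation
print(model(np.arange(30, 33)))  # extrapolating 3 steps beyond the pattern
```

The fitted object is both the equation (its coefficients) and a function that can be evaluated at future steps, although a polynomial extrapolated far beyond the fitted range quickly becomes unreliable.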
Fig. 13. Mathematical representation of all possible patterns
In figure 13, on the left, every pattern found is shown in cyan together with the set of its most similar patterns in my data set. The trends of each pattern are shown on the right. In the middle, each single pattern is expressed as an equation obtained by regression. Each equation is shown in figure 14.
Fig. 14. Equations of patterns
Figure 15 shows predictions made using Facebook's Prophet library. Figure 16 shows predictions made using a long short-term memory (LSTM) recurrent neural network. Both give very good results.
Fig. 15. Predictions with Facebook's Prophet. Black dots are true data, and blue lines are predicted trends
Fig. 16. Predictions by LSTM. True data in orange, predictions in blue
In the following images, past mathematical models are compared and used to predict future patterns of behaviour. The past models correspond to those shown in figures 8 and 9.
In general, the mathematical models turned out to be a very fine-grained approach that shows the detail of the behaviors with a high degree of accuracy. One disadvantage is that we NEED to search for the best (past) model, the one that best describes the future. What if there is none? Moreover, this is a task that requires brute force (one-by-one comparison), which may not be the best option when the set grows.
A somewhat more feasible approach to describing the future from the past turned out to be the AI implementations: they were faster and relatively easier to implement. The disadvantage is that they are like black boxes, where we cannot see (at all) the criteria behind the predictions or their likelihood.
Both representations, mathematical and parametric, have a high degree of accuracy in the validation phase (79% and 84% respectively) for short-term predictions (3 future steps).