Introduction
Machine learning in finance and insurance
The recent history of finance and banking can be split into two eras: before and after the 2008 financial crisis. In the pre-crisis era the financial system mostly witnessed innovation in new financial instruments and financialization (e.g., exotic options, CDSs, CDOs), whereas in the post-2008 era financial institutions were busy dealing with the new regulations introduced after the crisis. After the 2008 financial crisis, also known as the “Credit Crisis”, supervision increased heavily and regulators required tougher monitoring measures. This gave rise to a family of valuation adjustments (XVA, e.g., CVA) on the one hand and to systemic risk assessment on the other. In addition, financial institutions recognized the importance of behavioral finance, including phenomena such as herding and moral hazard. As a result, the introduction of financial innovations slowed down for almost ten years after 2008.
At the same time, technology has been growing at an exponential rate, which made banks and insurers think about investing more in new technologies: they developed online services, introduced new applications, and invested in AI and data. This is an area where machine learning (ML) has made a great impact, as a new toolbox for analyzing (big) data more efficiently. As a result, ML became a popular subject in finance, opening room for ML applications that solve problems and propose new concepts beyond big data.
In recent years, a recurring question has been: can ML significantly impact finance and insurance? There is no doubt that ML has already made a substantial impact on many areas, including natural language processing (NLP), image processing (e.g., face recognition), self-driving cars, translation, and many others. But in finance and insurance we need to be more careful when we talk about ML's impact, since these industries are heavily regulated and any change needs to be carefully assessed. The impact of ML on finance can be grouped into the following two categories:
Using ML to improve the current banking or insurance business
Introducing new useful problems that would not be possible to answer without ML
Examples of the first category include better hedging strategies, portfolio management, trading or investment algorithms, improved CVA methods, etc.
Examples of the second category include the application of NLP to organizing financial news, sentiment analysis to make sense of market news, using chatbots to better manage an insurance company's customers, analyzing big data on customer characteristics, and managing demand elasticity.
ML is where computer science and statistics meet. Even though some identify ML as part of statistics (or statistical learning), ML would not be as popular as it is today without the progress in computer science. ML has many subgroups, but the best known are the following (a minimal code sketch follows the list):
Supervised learning: In supervised learning, there are inputs and outputs, where each input is associated with an output. The main task is to learn, from the existing data, a mapping that assigns the "correct" output to a new input.
Unsupervised learning: In unsupervised learning, there are no outputs, only inputs. The aim of unsupervised learning is therefore to find patterns in the data that explain it well.
Reinforcement learning (RL): Reinforcement learning tries to find good actions when facing new situations in a running environment. The power of RL is that it is largely a model-free approach that does not require binding or unrealistic assumptions.
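As an illustration, here is a minimal sketch of the first two categories using scikit-learn; the synthetic data and the choice of models (logistic regression, k-means) are assumptions made purely for illustration.

```python
# Minimal sketch: supervised vs. unsupervised learning on synthetic data.
# Assumes scikit-learn and NumPy are installed; all data here is simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Supervised learning: inputs X with known outputs y (e.g., default / no default).
X = rng.normal(size=(500, 3))                  # three client features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)  # synthetic labels
clf = LogisticRegression().fit(X, y)
print("predicted label for a new client:", clf.predict(X[:1]))

# Unsupervised learning: only inputs; the goal is to find structure (clusters).
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignment for the same client:", km.labels_[0])
```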
ML takes a different standpoint towards data than statistics: in order not to reduce the forecasting power, we need to let the data speak for itself. This is unlike what we usually see in statistics, where modeling is the major task.
Here we present two tables from an example of classifying movie ratings based on the words in the reviews. On the left-hand side are words that one would presumably think make a good classifier, while on the right-hand side are words that a data-driven algorithm has identified as making a good classifier. The accuracy of a classifier based on the hand-picked words is only 60 percent (close to the 50 percent of random guessing), whereas the data-selected words give an accuracy of 95 percent.
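As a hedged sketch of this idea (the toy reviews, labels, and word lists below are illustrative assumptions, not the example from the text), one can compare a classifier restricted to a hand-picked vocabulary with one whose words are selected from the data:

```python
# Sketch: hand-picked words vs. data-selected words for review classification.
# The reviews and labels below are toy placeholders, not the book's data set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["great movie loved it", "terrible plot boring acting",
           "wonderful cast great story", "awful waste of time"]
labels = [1, 0, 1, 0]  # 1 = positive rating, 0 = negative rating

# Classifier A: vocabulary fixed in advance by intuition.
clf_a = make_pipeline(CountVectorizer(vocabulary=["good", "bad", "nice"]),
                      LogisticRegression())

# Classifier B: let the data choose the most informative words.
clf_b = make_pipeline(CountVectorizer(),
                      SelectKBest(chi2, k=5),
                      LogisticRegression())

clf_a.fit(reviews, labels)
clf_b.fit(reviews, labels)
```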
This example essentially shows that in ML the data is used to find the best model: the starting point is the data, not the model. This is unlike what is usually done in statistics, where the model comes first and the data then follows to fit the parameters of the model. The process of choosing a good model is called validation.
ML is mainly about TECH and not only about statistics. However, this is not the first time that TECH has gained popularity in finance and insurance: simulation and Monte Carlo methods have been widely applied in economic scenario generators (insurance), valuation, and credit risk (finance). It is important to note that the rise of ML happened at the same time as Big Data technologies emerged. This is no accident, since ML methods can efficiently manage big data. So the new pieces that make things different are TECH and big data.
Conceptually, ML's closest counterpart is statistics, which is why ML is sometimes identified as part of statistics, also called statistical learning. However, it is important to note that ML, and in particular supervised learning, is more about making good forecasts. We can therefore propose the following differences between statistics and ML:
Unlike statistics, ML is about forecasting: the goal is not necessarily "inference" but "generalization".
In ML, unlike inferential statistics, a high dimension is not a major concern. The reason is that ML is about forecasting, and the estimated parameters are not a priority.
One can increase the number of parameters in ML, which is extremely helpful when dealing with large data sets with complex multidimensional features. One can see this effect in the so-called adjusted R-squared (recalled after this list), where the penalty due to an increasing number of parameters can be canceled out by a larger data set.
In ML we are more concerned with overfitting/underfitting, regularization, and the bias/variance trade-off, all of which we will discuss later, whereas in statistics we are more concerned with correct statistical inference.
While in statistics a model frames the scope of the modeling, in ML the scope can be made flexible: it is increased by introducing new parameters and reduced by regularization.
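For reference, the standard form of the adjusted R-squared mentioned in the list, for $n$ observations and $p$ regressors, is

$$
\bar{R}^2 \;=\; 1 - \bigl(1 - R^2\bigr)\,\frac{n-1}{n-p-1},
$$

so for a fixed $R^2$, increasing $p$ lowers $\bar{R}^2$, while a larger $n$ dampens this penalty.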
An increasing number of parameters happens naturally in ML models. Here is an example. Consider a credit classification problem based on three characteristics of the clients: employment, marital status, and education. A linear regression uses these factors as the independent variables, so the model is three-dimensional. In ML there is a tendency to look at models with a much higher number of effective variables. Consider a simple decision tree on the same three features. For the first question there are three options: asking whether the person is married, educated, or employed. If we first ask about marital status, then on each of the two resulting branches (married and unmarried) there are two remaining questions to choose from. So the number of possible question combinations easily reaches 3×2×2=12: we started with three features, but we now have 12 different cases to consider. Now imagine the number of features is 10 (which is still small); the number of possible combinations easily reaches the millions.
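Here is a minimal, hedged sketch of such a tree on synthetic credit data; the feature encoding, the simulated labels, and the tree depth are assumptions for illustration only.

```python
# Sketch: a small decision tree on three binary client features.
# All data is simulated; this is not a real credit data set.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)
n = 1000
# Features: employed, married, educated (0/1 each).
X = rng.integers(0, 2, size=(n, 3))
# Synthetic rule for "good credit", plus some noise.
y = ((X[:, 0] + X[:, 2] >= 1) & (rng.random(n) > 0.1)).astype(int)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["employed", "married", "educated"]))
```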
As mentioned earlier, the estimated parameters in ML are not the top priority (unlike in statistics). Indeed, the top priority in ML is the efficiency of the forecast. For that reason, there may be more than one model that makes efficient predictions, and the solutions to the estimation problem do not need to be unique, as only the forecasting result matters.
Now let us give a bigger picture of statistics vs ML at three levels.
Parametric statistics: the number of parameters cannot increase a lot, as the estimated parameters and inference matter. So the dimension is usually kept low.
Non-parametric statistics: we deal with functions rather than parameters. This can be considered infinite-dimensional statistics.
ML: In between, and more towards infinity, we have ML, which is the statistics of many dimensions. That is why it is not the estimated parameters but the value of the dimension that is essential. The question is: what is the best dimension for our model? If we take the dimension to represent the level of complexity, we can ask instead: what is the best level of complexity for the model? As a result, a complex high-dimensional problem reduces to deciding on the value of one dimension, which is essentially a one-dimensional problem! This is the beauty of ML.
To summarize:
In ML the dimension is essential, not the parameters.
Decisions are made about the dimension value, rather than about parameter values.
A multi-dimensional problem is reduced to a one-dimensional problem.
As we discussed earlier, we take the model complexity (i.e., the model dimension) as the main parameter that needs to be identified optimally. The value of the dimension needs to be tuned so that two types of errors reach a good balance: the training error (blue curve), which is the error we make when constructing (training) the model, and the validation error (red curve), which is the error we make when validating the model on unseen data. The total error is the sum of the two, and it needs to be minimized.
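The sketch below illustrates this trade-off by varying one complexity parameter (the polynomial degree) and recording training and validation errors; the simulated data and the choice of polynomial regression are assumptions for illustration, not part of the text.

```python
# Sketch: training vs. validation error as model complexity (degree) grows.
# Synthetic data; the complexity with the lowest validation error is "best".
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=200).reshape(-1, 1)
y = np.sin(x).ravel() + 0.3 * rng.normal(size=200)
x_tr, x_va, y_tr, y_va = train_test_split(x, y, test_size=0.5, random_state=0)

for degree in [1, 3, 5, 10, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    err_tr = mean_squared_error(y_tr, model.predict(x_tr))
    err_va = mean_squared_error(y_va, model.predict(x_va))
    print(f"degree={degree:2d}  train MSE={err_tr:.3f}  validation MSE={err_va:.3f}")
```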
Despite all the positive points about ML, one needs to be careful when it is used on real data. It has been shown that in forecasting daily commodity prices, for the copper price in particular, ML (specifically an RNN, or recurrent neural network) cannot give a better forecast than an autoregressive model. There are many reasons for that, among which one can mention the size of the data: when the data set is not large, classical statistical models can perform better.
Here are the forecast time series for copper prices produced by the RNN (left) and by ARIMA (right), which clearly show no advantage in using ML here.
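As a hedged sketch of how such a classical baseline can be fitted (on simulated prices, not the copper data from the text), one can use an ARIMA model from statsmodels:

```python
# Sketch: fitting an ARIMA baseline to a simulated price series and
# producing a short out-of-sample forecast. Not the book's copper data.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(3)
# Simulated log-price as a random walk with drift.
log_prices = np.cumsum(0.001 + 0.02 * rng.normal(size=500))

model = ARIMA(log_prices, order=(1, 1, 0))  # AR(1) on first differences
fitted = model.fit()
forecast = fitted.forecast(steps=5)
print("5-step-ahead forecast of the log price:", forecast)
```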
Since in this online book we are interested in finance and insurance applications, it is important to understand how a model can be interpreted. Several classes of models are commonly considered interpretable:
Decision trees
Fuzzy classifiers
Linear regression
Logistic regression
Given a set of features, the contribution of any feature to the difference between the actual prediction and the mean prediction is its estimated Shapley value. The idea of the Shapley interpreter comes from the game-theoretic concept of cooperative games.
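For reference, the Shapley value $\phi_i$ of feature $i$ for a model $f$ with feature set $F$, evaluated at a point $x$, is

$$
\phi_i \;=\; \sum_{S \subseteq F\setminus\{i\}} \frac{|S|!\,\bigl(|F|-|S|-1\bigr)!}{|F|!}\,
\Bigl( f_{S\cup\{i\}}\bigl(x_{S\cup\{i\}}\bigr) - f_S\bigl(x_S\bigr) \Bigr),
$$

where $f_S$ denotes the model evaluated using only the features in $S$ (the others being averaged out); the values $\phi_i$ sum to the difference between the actual prediction and the mean prediction.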
Let us ask an important question: can any ML method be interpreted? This question can be answered as follows. While interpretability in general means that the features can explain a phenomenon, an ML model does not need to have this characteristic globally; instead, the interpretation can come after a forecast is made. For instance, if we use a classifier to identify a particular disease, the classifier can explain to a particular patient why she/he has been identified as ill, based on her/his test results. That can be an authentic reason for the patient and those closest to her/him.
Following the same idea, one can think of a procedure to interpret any algorithm as follows (a sketch is given after the list):
A good classifier or regression needs to fit the data.
For any given point, we use the classifier or regression to generate new (or fake) data around that point. We then fit an interpretable method (from the list above, e.g., linear regression) to this new (fake) data for that local area only, and use this local model for interpretation.
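Here is a minimal, hedged sketch of such a local surrogate: the black-box model, the perturbation scheme, and the kernel weights are all illustrative assumptions.

```python
# Sketch: interpret one prediction of a black-box model with a local
# linear surrogate fitted on perturbed ("fake") points around it.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + 0.1 * rng.normal(size=500)

black_box = RandomForestRegressor(random_state=0).fit(X, y)

x0 = X[0]                                         # point to explain
Z = x0 + 0.3 * rng.normal(size=(200, 3))          # fake data around x0
f_Z = black_box.predict(Z)                        # black-box predictions
weights = np.exp(-np.sum((Z - x0) ** 2, axis=1))  # closer points weigh more

local = LinearRegression().fit(Z, f_Z, sample_weight=weights)
print("local feature contributions (coefficients):", local.coef_)
```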
In this table, we list the most popular supervised and unsupervised methods. On the left-hand side is the list of the most popular supervised learning methods: Linear regression, Logistic regression, Support Vector Machines, Classification and Regression Trees, Random Forest, Linear Discriminant Analysis, k-Nearest Neighbors (kNN), Naïve Bayes, Gradient boosting, Artificial Neural Networks, and Recurrent Neural Networks. In the middle is the list of the most popular unsupervised learning methods: Principal Component Analysis, Hierarchical Clustering, and k-means.
Here is a list of tasks one needs to do in a data science job (a compressed code sketch follows the list):
Data gathering
Feature engineering
Model building
Model evaluation
Model deployment
Model serving
Model monitoring
Model maintenance
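Below is a minimal, hedged sketch of the model-building, evaluation, and deployment steps of this workflow; the data, the model choice, and the file name are placeholders, and steps such as serving, monitoring, and maintenance are only indicated in comments.

```python
# Sketch: a compressed data-science workflow on synthetic data.
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Data gathering (here: simulated) and simple feature engineering (scaling).
rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] - X[:, 3] > 0).astype(int)

# Model building and evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
print("test accuracy:", accuracy_score(y_te, model.predict(X_te)))

# Deployment: persist the fitted pipeline; serving, monitoring, and maintenance
# would load it, expose predictions, and track performance over time.
joblib.dump(model, "credit_model.joblib")
```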