Fitting a model means learning, from the data, a representation of the rules that generated that data in the first place. From a mathematical perspective, fitting a model is analogous to guessing an unknown function of the kind you faced in high school, such as y = 4x^2 + 2x, just by observing its output values. Therefore, under the hood, you expect machine learning algorithms to generate mathematical formulations that capture how reality works, based on the examples provided. Demonstrating whether such formulations are real is beyond the scope of data science; what matters most is that they work by producing accurate predictions. For example, even though you can describe much of the physical world using mathematical functions, you often can't describe social and economic dynamics this way, but people try guessing them anyway.
To summarize, as a data scientist, you should always strive to approximate the real, unknown functions underlying the problems you face using the best information available. The result of your work is evaluated based on your capacity to predict specific outcomes (the target outcome) given certain premises (the data), thanks to a useful range of algorithms.
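As an illustration of this idea, the following sketch (my own, not from the lesson) recovers the high-school-style function y = 4x^2 + 2x purely from observed (x, y) pairs, using NumPy's polynomial least-squares fit:

```python
import numpy as np

# Sample the "hidden" rule y = 4x^2 + 2x at some points,
# as if we only had the observations, not the formula.
x = np.linspace(-5, 5, 50)
y = 4 * x**2 + 2 * x

# Fit a degree-2 polynomial to the observations alone.
coeffs = np.polyfit(x, y, deg=2)
print(np.round(coeffs, 6))  # approximately [4, 2, 0]: the original rule recovered
```

With clean data and the right formulation, the fit recovers the generating rule almost exactly; real problems are harder because the data is noisy and the right formulation is unknown.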
Earlier in the lesson, you saw something akin to a real function or law when the book presented linear regression, which has its own formulation. The linear formula y = bx + a, which mathematically represents a line on a plane, can often approximate training data well, even if the data doesn't represent a line or something similar to one. As with linear regression, all other machine learning algorithms have an internal formulation of their own. Linear regression's formulation is one of the simplest; formulations of other learning algorithms can appear quite complex. You don't need to know exactly how they work, but you do need an idea of how complex they are, whether they represent a line or a curve, and whether they can withstand outliers or noisy data. When planning to learn from data, you should address these problematic aspects based on the formulation you intend to use:
Whether the learning algorithm is the best one to approximate the unknown function that you imagine behind the data you are using. To make such a decision, you must consider the learning algorithm's performance on the data at hand and compare it with that of alternative formulations from other algorithms.
Whether the specific formulation of the learning algorithm is too simple, with respect to the hidden function, to make a good estimate (this is called a bias problem).
Whether the specific formulation of the learning algorithm is too complex, with respect to the hidden function to be guessed (leading to a variance problem).
Not all algorithms are suitable for every data problem. If you don't have enough data or the data is full of erroneous information, it may be too difficult for some formulations to figure out the real function.
If your chosen learning algorithm can't learn properly from data and is not performing well, the cost is bias or variance in its estimates.
Bias: Given the simplicity of its formulation, your algorithm tends to overestimate or underestimate the real rules behind the data and is systematically wrong in certain situations. Simple algorithms have high bias; having few internal parameters, they tend to represent only simple formulations well.
Variance: Given the complexity of its formulation, your algorithm tends to learn too much information from the data and detect rules that don't exist, which causes its predictions to be erratic when faced with new data. You can think of variance as a problem connected to memorization. Complex algorithms can memorize data features thanks to their high number of internal parameters. However, memorization doesn't imply any understanding of the rules.
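The bias and variance problems can be sketched with polynomial fits of different complexity; the data, the noise level, and the degrees below are illustrative assumptions, not taken from the lesson:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy observations of a hidden quadratic rule.
x = np.linspace(-3, 3, 30)
y = x**2 + rng.normal(scale=1.0, size=x.size)

# Unseen points from the same rule, used to judge the fits.
x_new = np.linspace(-3, 3, 61)
y_new_true = x_new**2

def fit_error(degree):
    """Train a polynomial of the given degree; return (train MSE, new-data MSE)."""
    coeffs = np.polyfit(x, y, deg=degree)
    train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_new) - y_new_true) ** 2)
    return train_err, test_err

# Degree 1 is too simple (bias): it misses the curve even on training data.
# Degree 15 is very complex (variance): it chases the noise in training data,
# which tends to make its predictions erratic on new points.
for deg in (1, 2, 15):
    train_err, test_err = fit_error(deg)
    print(f"degree {deg:2d}: train MSE {train_err:7.3f}, new-data MSE {test_err:7.3f}")
```

The too-simple model has a high error even on the data it was trained on, while the over-complex model drives its training error down by memorizing noise, without that implying a better grasp of the underlying rule.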
The StratifiedKFold class provides a simple way to control the risk of building malformed samples during cross-validation procedures. It can control the sampling so that certain features, or even certain outcomes (when the target classes are extremely unbalanced), are always present in your folds in the right proportion. You just need to specify the variable you want to control by using the y parameter, as shown in the following code.
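The code itself isn't reproduced at this point, so the following is a minimal sketch using scikit-learn; the 90/10 class split and the number of folds are illustrative choices of mine:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Illustrative unbalanced dataset: 90 samples of class 0, 10 of class 1.
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5)

# Passing the target as the y argument of split() makes every fold
# preserve the 90/10 class proportion.
for train_idx, test_idx in skf.split(X, y):
    print(np.bincount(y[test_idx]))  # 18 samples of class 0 and 2 of class 1 per fold
```

Without stratification, a plain KFold split of such unbalanced data could easily produce folds containing no examples of the rare class at all.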
SVD on Homes Database
Using homes.csv, try to do the following:
Set the matrix A to be all the columns in homes. (You can use .values to make it a NumPy array.) Then print it.
Perform SVD on matrix A. Then print out U, s, and Vh.
Try deleting the last 3 columns of matrix U. Adjust s and Vh accordingly. Then multiply them all together and see how the result differs from the original homes table.
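The steps above can be sketched as follows. Since homes.csv isn't reproduced here, a small random matrix stands in for the homes columns; with the real file you could build A from the loaded DataFrame's .values instead:

```python
import numpy as np

# Stand-in for matrix A (homes.csv isn't available in this sketch).
rng = np.random.default_rng(42)
A = rng.random((6, 5))
print(A)

# Economy-size SVD: A = U @ diag(s) @ Vh, with U 6x5, s length 5, Vh 5x5.
U, s, Vh = np.linalg.svd(A, full_matrices=False)
print(U, s, Vh)

# Delete the last 3 columns of U, and trim s and Vh to match,
# which yields a low-rank approximation of A.
k = s.size - 3
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]

# The approximation has the same shape as A but is no longer exact;
# the largest entry-wise difference shows the information discarded.
print(np.abs(A - A_approx).max())
```

Keeping only the largest singular values preserves most of the matrix's structure, which is the idea behind using SVD for dimensionality reduction and compression.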