1. Concepts & Definitions
1.2. Continuous random distribution of probability
1.2. Normal distribution of probability
1.3. Standard normal distribution of probability
1.4. Inverse standard normal distribution
1.5. Student's T distribution
1.6. Inverse Student's T distribution
2. Problem & Solution
2.1. Weight, dimension, and value per HS6
2.2. How to fit a distribution
2.3. Employing standard deviation
2.4. Total time spent in a system
To better understand the pattern that emerges when a data set is a composition of two normal distributions, the following Python commands fit a bimodal curve to such a dataset using the Gaussian function.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# Build a bimodal dataset: 2500 samples centered at 1 and 5000 centered at 2
data = np.concatenate((np.random.normal(1, .2, 2500),
                       np.random.normal(2, .2, 5000)))
y, x, _ = plt.hist(data, 100, alpha=.3, label='data')
x = (x[1:] + x[:-1]) / 2  # bin midpoints, so that len(x) == len(y)

def gauss(x, mu, sigma, A):
    return A * np.exp(-(x - mu)**2 / 2 / sigma**2)

def bimodal(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return gauss(x, mu1, sigma1, A1) + gauss(x, mu2, sigma2, A2)

# Initial guess (mu1, sigma1, A1, mu2, sigma2, A2); the second mode has twice
# as many samples, so its amplitude guess is the larger one
expected = (1, .2, 125, 2, .2, 250)
params, cov = curve_fit(bimodal, x, y, expected)
sigma = np.sqrt(np.diag(cov))  # standard errors of the fitted parameters

x_fit = np.linspace(x.min(), x.max(), 500)
# plot the combined model over the histogram
plt.plot(x_fit, bimodal(x_fit, *params), color='red', lw=3, label='model')
plt.legend()
plt.show()
Here is a step-by-step explanation of the code:
The data variable is created by concatenating samples from two normal distributions using np.concatenate().
The first normal distribution has a mean of 1, a standard deviation of 0.2, and consists of 2500 samples. The second normal distribution has a mean of 2, a standard deviation of 0.2, and consists of 5000 samples. This dataset represents a bimodal distribution.
The plt.hist() function creates a histogram of the data. It returns three values: y (the bin counts), x (the bin edges), and _ (the drawn bar container, which is ignored here). The histogram is displayed with transparency (alpha=.3) and labeled as 'data'.
The next line calculates the midpoint of each bin by averaging the consecutive bin edges. This is done to ensure that len(x) == len(y) for the subsequent curve fitting.
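To see why the midpoints are needed, here is a minimal standalone sketch (not part of the original script): a histogram with n bins returns n + 1 edges, and averaging adjacent edges yields exactly n midpoints.
edges = np.linspace(0, 1, 6)         # 5 bins -> 6 edges
mids = (edges[1:] + edges[:-1]) / 2  # 5 midpoints, one per bin
print(len(edges), len(mids))         # prints: 6 5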
The gauss() function is defined, which represents a Gaussian curve. It takes the input values x and three parameters: mu (the mean), sigma (the standard deviation), and A (an amplitude factor). It returns the value of the Gaussian function at each x value.
The bimodal() function is defined, which represents a bimodal distribution as the sum of two Gaussian curves built with gauss(). It takes the input values x and six parameters: mu1 and mu2 (the means of the two components), sigma1 and sigma2 (their standard deviations), and A1 and A2 (their amplitude factors). It returns the sum of the two Gaussian functions evaluated at each x value.
The expected variable is set to a tuple representing the initial guess for the parameters of the bimodal function. A reasonable starting point helps the curve-fitting algorithm converge faster and avoid poor local minima.
The curve_fit() function is called to perform the actual curve fitting. It takes four arguments: the function to fit (bimodal), the x values (x), the y values (y), and the initial guess for the parameters (expected). It returns two values: params (the optimized values for the parameters) and cov (the estimated covariance of the optimized parameters).
The standard deviations (sigma) of the optimized parameters are obtained by applying np.sqrt() to the diagonal elements of the covariance matrix (np.diag(cov)); they are the standard errors of the fitted parameters, and a quick way to read them out is sketched below.
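An optional way to print the result, pairing each fitted value with its standard error (the names here are only printing labels, not part of the original script):
names = ('mu1', 'sigma1', 'A1', 'mu2', 'sigma2', 'A2')
for name, value, err in zip(names, params, sigma):
    print(f'{name} = {value:.3f} +/- {err:.3f}')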
A new array x_fit is created to represent a dense range of x values for plotting the fitted curve. It is generated with np.linspace(), which produces 500 evenly spaced values between the minimum and maximum of x.
The fitted curve is plotted using plt.plot() with x_fit as the x values and bimodal(x_fit, *params) as the corresponding y values. The color is set to 'red', line width to 3 (lw=3), and it is labeled as 'model'.
The legend is displayed using plt.legend() to show the labels 'data' and 'model'.
Finally, the plot is shown using plt.show().
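As an optional extension of the script above, the two fitted components can also be drawn individually by reusing gauss() with the corresponding slices of params; a minimal sketch:
plt.plot(x_fit, gauss(x_fit, *params[:3]), color='orange', lw=2, label='N_1 fit')
plt.plot(x_fit, gauss(x_fit, *params[3:]), color='green', lw=2, label='N_2 fit')
plt.legend()
plt.show()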
The following code helps classify the data inside the variable x.
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(x.reshape(-1, 1))
target_class = gmm.predict(x.reshape(-1, 1))
target_class
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Here is a step-by-step explanation of the code:
An instance of the GaussianMixture class is created with n_components=2 and random_state=42.
n_components specifies the number of components (clusters) to fit in the GMM. In this case, n_components=2 indicates that the algorithm will try to find two clusters in the data.
random_state sets the random seed for reproducibility. It ensures that the random initialization of the algorithm is the same each time the code is run.
The fit() method is called on the gmm object, passing x.reshape(-1, 1) as the input.
x.reshape(-1, 1) reshapes the 1-dimensional array x into a 2-dimensional array with a single column. GMM expects a 2D array as input, where each row represents a sample and each column represents a feature. In this case, there is only one feature (x), so reshaping is necessary.
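A quick standalone check of what the reshape does (assuming x has 100 bin midpoints, as produced above):
print(x.shape)                 # (100,): one midpoint per histogram bin
print(x.reshape(-1, 1).shape)  # (100, 1): one sample per row, one feature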
The fit() method fits the GMM to the data, estimating the parameters of the Gaussian components based on the input x. It learns the means, covariances, and weights of the Gaussian components that best represent the data distribution.
predict() assigns each sample in x to one of the learned Gaussian components (clusters) based on the highest posterior probability. It returns an array of labels indicating the assigned component for each sample.
The resulting array of labels is assigned to the variable target_class. Each label represents the predicted cluster (component) to which the corresponding sample in x belongs.
Finally, the code outputs target_class, which contains the predicted cluster labels for each sample in x.
In summary, the code fits a Gaussian Mixture Model with two components to the data x and assigns cluster labels to each sample based on the learned model.
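The learned mixture can also be inspected directly. This optional sketch uses the standard GaussianMixture attributes means_, covariances_, and weights_, plus predict_proba() for the posterior probabilities behind the labels:
print(gmm.means_.ravel())                 # estimated component means
print(np.sqrt(gmm.covariances_).ravel())  # estimated component standard deviations
print(gmm.weights_)                       # estimated mixing weights
posteriors = gmm.predict_proba(x.reshape(-1, 1))
print(posteriors[:3])                     # per-sample posterior probabilities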
The following code plots the data according to the classification stored in target_class by GaussianMixture.
x1, y1 = [], []
x2, y2 = [], []
# Split the bin midpoints and counts by their predicted cluster
for k, elem in enumerate(target_class):
    if elem == 0:
        x1.append(x[k])
        y1.append(y[k])
    else:
        x2.append(x[k])
        y2.append(y[k])

#plt.figure(figsize=(10,6))
plt.scatter(x1, y1, color='red', label='N_1')
plt.scatter(x2, y2, color='blue', label='N_2')
plt.legend()
plt.show()
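Equivalently, because target_class, x, and y are NumPy arrays, the same plot can be produced without an explicit loop by using boolean masks; a minimal sketch:
mask = target_class == 0
plt.scatter(x[mask], y[mask], color='red', label='N_1')
plt.scatter(x[~mask], y[~mask], color='blue', label='N_2')
plt.legend()
plt.show()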
The complete code is available at the following link:
https://colab.research.google.com/drive/12oQwv7mvUQ7Bz5a789eI_8IsRlBcY_y-?usp=sharing