Implicit in our previous developments is the assumption that the uncertainties in the SPDEs have been accurately characterized as random variables through a Karhunen-Loeve (K-L) expansion, which is also known as linear principal component analysis (PCA). Computing the K-L expansion requires an analytic expression for the covariance of the underlying random field. In addition, the probability distribution of the random variables is always assumed known a priori. These two assumptions rarely hold in realistic engineering problems. In most cases, only a few experimentally obtained realizations of the random field are available. Reconstruction techniques are then applied to generate samples of the random field after extracting its statistical properties from these limited experimental measurements. This process is expensive and numerically demanding if thousands of samples are needed. This leads to the problem of probabilistic model identification, or stochastic reduced-order modeling, where the purpose is to find a parametric representation of the random field from only limited experimental data.
To this end, PCA is the most popular model reduction method due to its uniform mean-square convergence. However, it only projects the samples onto an optimal linear subspace, which yields a poor representation of the original data if they are nonlinearly related to each other. In other words, it only preserves up to the second-order statistics (covariance) of a random field, which is insufficient for reproducing complex structures. In this project, we apply kernel principal component analysis (KPCA) to construct a reduced-order stochastic input model for the material property variation in heterogeneous media. KPCA can be considered a nonlinear version of PCA. Through the use of kernel functions, KPCA preserves higher-order statistics of the random field, instead of just the two-point statistics captured by standard PCA. Thus, this method can model non-Gaussian, non-stationary random fields. In this work, we also propose a new approach to solve the pre-image problem involved in KPCA. In addition, a polynomial chaos (PC) expansion is used to represent the random coefficients in KPCA, which provides a parametric stochastic input model. Thus, realizations that are statistically consistent with the experimental data can be generated efficiently. Our problem is to find a reduced-order polynomial chaos representation of the random field in the form y = f(x), where the vector x, of dimension much smaller than the original input stochastic dimension M, is a set of independent random variables with a specified distribution. Therefore, by drawing samples of x in this low-dimensional space, we obtain different realizations of the underlying random field.
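To make this construction concrete, the following minimal sketch uses scikit-learn's KernelPCA with a Gaussian (RBF) kernel as a stand-in for our implementation; the array shape, file name, and kernel width are illustrative assumptions rather than the exact settings used in this work.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Hypothetical training set: 1000 flattened 45 x 45 permeability realizations
# stored as an array of shape (1000, 2025); the file name is an assumption.
Y = np.load("snesim_realizations.npy")

# Kernel PCA with a Gaussian (RBF) kernel, keeping r = 30 components
# (the reduced dimension used later in this report).
kpca = KernelPCA(n_components=30, kernel="rbf", gamma=1.0 / Y.shape[1])
X = kpca.fit_transform(Y)   # low-dimensional coordinates x, shape (1000, 30)

# A polynomial chaos expansion fitted to these coordinates then provides the
# parametric map y = f(x), so new realizations follow from sampling x.
```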
Fig. 1. Basic idea of KPCA
Fig. 1 demonstrates the basic idea behind nonlinear kernel PCA. Consider a random field y = (y1, y2)T. If y is non-Gaussian, y1 and y2 can be nonlinearly related to each other. In this case, linear PCA attempts to fit a linear surface such that the reconstruction error is minimized, which clearly results in a poor representation of the original data. Now, consider a nonlinear mapping that relates the input space to the feature space F. In the right panel of Fig. 1, the realizations that are nonlinearly related in the original space become linearly related in the feature space F. Linear PCA can now be performed in F in order to determine the principal directions in this space. For more details, please refer to our paper.
The simulated realizations of the random field are in the feature space F. However, we are interested in obtaining realizations in the physical input space, so an inverse mapping is required. This is known as the pre-image problem. We find the pre-image using local linear interpolation among the n nearest neighbors. The basic idea of the method is shown in Fig. 2. For an arbitrary realization Yr in F, we first calculate its distances to the nearest neighbors in the feature space. Then the corresponding distances in the input space are recovered. Finally, the input-space distances are used as the local linear interpolation weights; a sketch of this step is given after Fig. 2.
Fig. 2. Basic idea of the proposed pre-image method.
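The sketch below shows one possible realization of this pre-image step, assuming a Gaussian kernel so that input-space distances can be recovered from feature-space distances; the inverse-distance weighting and the function signature are illustrative choices, not the exact formula of the proposed method.

```python
import numpy as np

def preimage_local_interp(k_new, Y_train, sigma, n_neighbors=10):
    """Approximate the pre-image of a feature-space point by local linear
    interpolation of its n nearest training realizations.

    k_new   : Gaussian-kernel values k(y_new, y_i) between the new point and
              each of the N training samples (assumed available from KPCA).
    Y_train : (N, d) array of training realizations in the input space.
    sigma   : width of the Gaussian kernel exp(-||y - y'||^2 / (2 sigma^2)).
    """
    # Squared feature-space distances; for a Gaussian kernel k(y, y) = 1, so
    # ||phi(y_new) - phi(y_i)||^2 is approximately 2 - 2 k(y_new, y_i).
    d_feat2 = 2.0 - 2.0 * k_new

    # Recover the corresponding input-space distances by inverting the kernel.
    d_in2 = -2.0 * sigma**2 * np.log(np.clip(1.0 - 0.5 * d_feat2, 1e-12, None))

    # Keep the n nearest neighbors and form normalized inverse-distance weights
    # (one simple instance of "local linear interpolation").
    idx = np.argsort(d_in2)[:n_neighbors]
    w = 1.0 / (np.sqrt(d_in2[idx]) + 1e-12)
    w /= w.sum()

    # Pre-image as a weighted combination of the neighboring realizations.
    return w @ Y_train[idx]
```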
We apply KPCA to model the random permeability field of complex channelized geological structures. This structure is characterized by multipoint statistics (MPS), which express the joint variability at more than two locations. Therefore, two-point statistics are not enough to capture the underlying uncertainty of the random field, and linear PCA is expected to fail. Using the 'snesim' algorithm, 1000 realizations of a channelized permeability field, of dimension 45 x 45, are created from the training image in Fig. 3. These serve as the training samples (experimental data) of the random field for the construction of KPCA. An additional set of realizations, not included in the training samples, serves as the test samples to verify the accuracy of the reduced-order model constructed with the KPCA algorithm. Some of the generated realizations are also shown on the right of Fig. 3.
Fig. 3. Left: Large-scale training image for the 'snesim' algorithm. It is a binary image where 1 (black region) designates channel sand while 0 (white region) designates background mud. Right: Some of the realizations generated.
The number of retained eigenvectors (those with the largest eigenvalues) is r = 30 for both linear PCA and KPCA. The reconstruction results are shown in Fig. 4.
Fig. 4. Reconstruction results of different test samples (top row) using linear PCA (middle row) and kernel PCA (bottom row) with r = 30.
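For reference, the linear PCA reconstruction with r = 30 can be sketched as follows; the file names are placeholders, and the kernel PCA counterpart would replace the inverse transform with the pre-image step sketched above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: flattened 45 x 45 training and held-out test realizations.
Y_train = np.load("snesim_realizations.npy")   # assumed file name
Y_test = np.load("test_realizations.npy")      # assumed file name

# Linear PCA keeping the r = 30 dominant eigenvectors.
pca = PCA(n_components=30).fit(Y_train)
Y_test_recon = pca.inverse_transform(pca.transform(Y_test))

# Relative reconstruction error over the test set.
err = np.linalg.norm(Y_test - Y_test_recon) / np.linalg.norm(Y_test)
print(f"relative reconstruction error: {err:.3f}")
```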
In Fig. 4, linear PCA does not reproduce the correct permeability values. Kernel PCA, on the other hand, consistently captures the geological pattern, and its reconstructed samples look much more like the original binary images. Although linear PCA is not optimal, it still roughly recovers the desired geological structure. However, what we are really interested in is the generation of new realizations from the underlying random field. Next, we construct the PC representation of the random field for both linear PCA and kernel PCA. The results are shown in Fig. 5 and Fig. 6.
Fig. 5. Realizations of the random field by sampling the corresponding PC representation using linear PCA.
Fig. 6. Realizations of the random field by sampling the corresponding PC representation using kernel PCA.
From Fig. 5, the failure of linear PCA is even more pronounced in this case: the realizations do not reflect the original channelized structure of the permeability field, and the permeability values are not correctly predicted. Using kernel PCA, by contrast, Fig. 6 shows that the generated realizations clearly exhibit the channelized geological structure with the correct permeability values.
Fig. 7. Marginal PDF of the permeability at grid blocks (11, 24) and (25, 25) from data (left), KPCA (middle), and linear PCA (right), respectively.
To better understand these results, we also plot in Fig. 7 the marginal PDF (relative frequency diagram) of the permeability at two arbitrarily chosen grid blocks, (11, 24) and (25, 25), from the data, KPCA, and linear PCA. The PDF is normalized such that the total area is 1. As expected, the marginal PDF of the binary data has only two bars, at 0 and 1. Although the result from KPCA is not exactly the same as that of the original data, it does capture the main characteristic, namely that most of the values are concentrated near 0, which indicates the binary nature of the image. This partially explains why the KPCA reconstruction looks more like a binary image than the linear PCA reconstruction does. On the other hand, as expected, the marginal PDF from linear PCA looks Gaussian, since linear PCA captures only the mean and variance of the random field, which determine a Gaussian distribution.
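The unit-area normalization mentioned above corresponds to a standard density histogram; a minimal sketch with placeholder data is:

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder for the permeability values collected at one grid block,
# e.g. (11, 24), over all available realizations.
perm_samples = np.random.rand(1000)

# density=True rescales the bar heights so the total histogram area is 1,
# giving the relative-frequency estimate of the marginal PDF.
plt.hist(perm_samples, bins=30, density=True)
plt.xlabel("permeability")
plt.ylabel("marginal PDF")
plt.show()
```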
Next, the generated stochastic input model is used as an input to the single-phase flow problem we used in Project 5. Monte Carlo (MC) simulation is conducted both with the experimental samples directly and with PC realizations generated from linear PCA and kernel PCA. 50,000 PC realizations are generated by sampling the low-dimensional space. For a fairer comparison, 8,000 experimental samples are generated and used in the MC simulation, while the stochastic input models remain the ones constructed above, i.e. only the previous 1,000 data samples are used in their construction. In this way, we can verify whether our model resembles the statistics of the underlying random field even when built from less data, which is usually the case in real engineering problems.
Fig. 8. Contours of skewness (top row) and kurtosis (bottom row) of saturation at 0.2 PVI from experimental data (left), kernel PCA (middle), and linear PCA (right).
The contour plots of the saturation statistics at 0.2 PVI are given in Fig. 8. As expected from the discussion above, the contours of skewness and kurtosis from KPCA compare well with those obtained using the experimental samples directly, even though only 1,000 data samples were used in the nonlinear model construction. On the other hand, linear PCA gives completely wrong results.
Fig. 9. Comparison of marginal PDF of saturation at grid block (11, 24).
We plot in Fig. 9 the marginal PDF of the saturation at grid block (11, 24), where the standard deviation is highest. Similar to Fig. 7, the PDF from KPCA resembles the PDF of the experimental data much more closely than that from linear PCA. This again shows that linear PCA fails to capture the higher-order statistics of the underlying saturation field.
The paper and PowerPoint presentation can be downloaded from here.