C-MAPSS Dataset Download

I was studying a dual-speed, high-bypass-ratio turbofan engine dataset that I obtained from NASA's website. The dataset was generated with the C-MAPSS simulator and comes with nominal and fault files. I studied the user guide and some other related documents, but I failed to understand the difference between these two files (i.e., at the value/parameter level).

The PHM08 Challenge Dataset is now publicly available at the NASA Prognostics Repository. An online evaluation utility is also provided to let users evaluate their results and get feedback on the test dataset.

According to the README that comes with the dataset, the format of the training and test sequences (one per engine, each with a separate id) looks like the schematic below. This also implies that we have more observations of systems that are critically close to failure than of those farther away from it. This could be a curse or a blessing, depending on what we are interested in. If we are interested in spotting a failing system as early as possible (possibly at the cost of a wider error margin), this is not how we want our training and test sets to look. However, if we are interested in learning the exact behavior of the system after the first fault and how the failure grows, we could embrace this training set, although the test set may then not be perfect for us. This will get clearer if we look at the actual distribution of RUL values (in cycles) for the training and test sets.
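As a rough illustration of how RUL labels can be derived for the training sequences, here is a minimal sketch. It assumes each training engine is observed until failure, so the RUL at its last recorded cycle is 0. The function name `add_rul` and the toy arrays are hypothetical and not taken from the dataset.

```python
import numpy as np

def add_rul(unit_ids, cycles):
    """Return RUL per row: last observed cycle of that engine minus the current cycle."""
    unit_ids = np.asarray(unit_ids)
    cycles = np.asarray(cycles)
    rul = np.empty_like(cycles)
    for unit in np.unique(unit_ids):
        mask = unit_ids == unit
        # Assumption: the engine fails at its last recorded cycle.
        rul[mask] = cycles[mask].max() - cycles[mask]
    return rul

# Synthetic toy data (not real C-MAPSS values): two engines run to failure.
unit_ids = [1, 1, 1, 1, 2, 2, 2]
cycles = [1, 2, 3, 4, 1, 2, 3]
rul = add_rul(unit_ids, cycles)
print(rul.tolist())  # [3, 2, 1, 0, 2, 1, 0]
```

Under this labeling, every engine contributes exactly one row for each RUL value from 0 up to its lifetime minus one, which is why small RUL values appear in every sequence while large ones appear only in long-lived engines.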

I am trying to get my hands on some real aircraft engine data to analyze. From the glimpse I have had, the format the real data come in is already hard to figure out. Besides, I have not yet figured out how to use manually collected logs to account for maintenance and possible part replacements in the engine, or how to use data that do not include failure events! As for this dataset, I would like to get back to it and try out a seq2seq LSTM on it, now that I feel comfortable with it. But who knows when a newer, cooler, shinier dataset will come my way and distract me?

Remaining useful life (RUL) estimation is a key technology in prognostics and health management (PHM) systems for a new generation of aircraft engines. The growing volume of monitoring data brings new opportunities to improve prediction with deep learning. We therefore propose a novel joint deep learning architecture composed of two main parts: a transformer encoder, which uses scaled dot-product attention to extract dependencies across distances in the time series, and a temporal convolution neural network (TCNN), which is constructed to compensate for the insensitivity of the self-attention mechanism to local features. Both parts are jointly trained within a regression module, which means the proposed approach differs from traditional ensemble learning models. It is applied to the Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset from the Prognostics Center of Excellence at NASA Ames, and satisfactory results are obtained, especially under complex working conditions.
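For readers unfamiliar with scaled dot-product attention, the following is a minimal NumPy sketch of the generic operation, softmax(QK^T / sqrt(d_k)) V. It is not the architecture proposed above: learned projections, multiple heads, and the TCNN branch are omitted, and the input is synthetic.

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Compute softmax(q k^T / sqrt(d_k)) v for 2-D arrays of shape (time, d_k)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # Subtract the row maximum before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Synthetic window: 5 time steps with 4 features each (not real sensor data).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape, weights.shape)  # (5, 4) (5, 5)
```

Each row of `weights` sums to 1 and says how strongly one time step attends to every other step, regardless of how far apart they are in the window.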

The Commercial Modular Aero-Propulsion System Simulation (C-MAPSS) dataset is a widely used benchmark in the field of RUL estimation and prognostics [14]. It consists of four sub-datasets (FD001, FD002, FD003, and FD004) of multivariate time series data from simulated aircraft engines, and it poses a challenging problem because of the complex nature of the data and the presence of multiple operating conditions and failure modes. Feature engineering and feature selection are essential steps in preprocessing the C-MAPSS dataset, which is also used in this article, to obtain accurate RUL predictions [15].

Feature engineering techniques have been widely explored in the context of RUL estimation. Heimes et al. [29] proposed a technique to extract features from time series data using time-domain, frequency-domain, and wavelet-domain methods. Similarly, Lei et al. [30] conducted a systematic review of machinery health prognostics, focusing on the entire process from data acquisition to RUL prediction. Their review covered various methods, including wavelet packet decomposition, artificial neural networks, and other machine learning techniques applied to different types of machinery and datasets, such as rolling bearings [31] and the C-MAPSS dataset.

Wang et al. [32] introduced a similarity-based prognostics approach for estimating the remaining useful life (RUL) of engineered systems. The authors proposed a method that calculates the similarity between the current health state of the system and the historical health states of other systems by using a weighted dynamic time warping algorithm. The RUL is then estimated based on the most similar historical health states. The proposed approach was validated using the C-MAPSS dataset, demonstrating its effectiveness in predicting the RUL of systems based on the similarities between their health states.
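To make the similarity idea concrete, here is a simplified sketch that swaps in plain (unweighted) dynamic time warping and a nearest-neighbor rule. It is not the weighted algorithm of the cited work. The helper `estimate_rul`, the engine names, and the health-index values are hypothetical.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic (unweighted) dynamic time warping distance between two 1-D series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])

def estimate_rul(query, library):
    """Find the historical run whose prefix best matches the query.

    The RUL estimate is the number of cycles the matched run lasted
    beyond the length of the query (a deliberately simple rule).
    """
    best_name, best_dist = None, np.inf
    for name, run in library.items():
        dist = dtw_distance(query, run[:len(query)])
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, len(library[best_name]) - len(query)

# Hypothetical health-index trajectories of two engines that ran to failure.
library = {"engine_a": [0, 1, 2, 3, 4, 5], "engine_b": [0, 2, 4, 6, 8]}
query = [0.0, 1.1, 1.9, 3.0]
match, rul = estimate_rul(query, library)
print(match, rul)  # engine_a 2
```

A more faithful version would weight the alignment cost and combine several similar runs rather than trusting a single nearest match.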

Sheng et al. [35] presented a concise self-adapting deep learning network (CSDLN) for predicting the RUL of aero-engines and wind turbines, aiming to overcome limitations of prior methods such as inflexible feature learning patterns and low prediction accuracy. The model integrates a multi-branch 1D involution neural network (MINN) for feature extraction and a trend recognition unit (TRU) for degradation trend identification, and it demonstrated superior prediction accuracy and generalization in comparative and ablation experiments on their confidential WTGB dataset.

In this article, the impact of various feature engineering and feature selection methods on RUL estimation performance was investigated using the C-MAPSS dataset. Specifically, the effectiveness of rolling window aggregation and TSFresh feature extraction was explored, along with GA, RFE, Lasso, and FIRF for selecting the most informative features for RUL prediction. Furthermore, the performance of several machine learning algorithms, including neural networks, was evaluated on the selected features to determine the optimal combination of preprocessing techniques and prediction models for RUL estimation. Additionally, a novel interpretation of Principal Component Analysis (PCA) loadings was introduced. This approach illuminates the intricate relationships between sensor readings, uncovering new narratives in the data that contribute to our understanding of engine behavior.
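Rolling window aggregation can be sketched as follows. This is a minimal illustration only: the window length and the four statistics (mean, standard deviation, minimum, maximum) are assumptions, since the exact aggregates are not listed here.

```python
import numpy as np

def rolling_features(signal, window):
    """Return one row of [mean, std, min, max] per full window over a 1-D signal."""
    signal = np.asarray(signal, dtype=float)
    # Each row of `windows` is one contiguous slice of length `window`.
    windows = np.lib.stride_tricks.sliding_window_view(signal, window)
    return np.column_stack([
        windows.mean(axis=1),
        windows.std(axis=1),
        windows.min(axis=1),
        windows.max(axis=1),
    ])

# Synthetic sensor trace for a single engine (not real data).
features = rolling_features([1, 2, 3, 4, 5], window=3)
print(features[:, 0].tolist())  # [2.0, 3.0, 4.0]
```

In practice the function would be applied to each engine and each sensor separately, so that no window spans two different engines.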

The methodology employed in this study, illustrated in Fig. 1, follows a structured data processing pipeline encompassing several stages. The first step is Data Preprocessing, involving the removal of low variance features and scaling of sensor data. This is followed by the Rolling Time Series Windows phase, wherein time-series features are extracted. Principal Component Analysis (PCA) is then applied to reduce the dimensionality of the feature set. Subsequently, Feature Selection is performed using five distinct techniques. The pipeline ends with a training and testing phase: the machine learning models are trained on a substantial portion of the dataset and then tested on unseen data. Model performance is evaluated using RMSE and the coefficient of determination (R² score), which together indicate prediction accuracy and generalizability.
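The two evaluation metrics can be written out directly. The sketch below uses the standard definitions of RMSE and R² with made-up RUL values. It is meant only to show how the numbers are computed, not to reproduce any result from the study.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up RUL values in cycles, for illustration only.
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 39]
print(round(rmse(y_true, y_pred), 4))      # 2.1213
print(round(r2_score(y_true, y_pred), 3))  # 0.964
```

RMSE is expressed in cycles, so it reads as a typical prediction error, while R² reports the share of RUL variance the model explains.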

Preprocessing starts by removing features with low variance. Features with variance below a specified threshold are considered less informative for the model and are removed from the dataset. This step helps reduce noise and computational complexity in the subsequent steps.
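A low-variance filter of this kind takes only a few lines. In the sketch below, the threshold value and the function name `drop_low_variance` are assumptions, since the threshold used in the study is not stated.

```python
import numpy as np

def drop_low_variance(X, threshold=1e-8):
    """Keep only the columns of X whose variance exceeds `threshold`."""
    X = np.asarray(X, dtype=float)
    keep = X.var(axis=0) > threshold
    return X[:, keep], keep

# Synthetic matrix: the first column is constant and carries no information.
X = [[1.0, 5.0, 0.1],
     [1.0, 6.0, 0.2],
     [1.0, 7.0, 0.3]]
X_reduced, keep = drop_low_variance(X)
print(keep.tolist())    # [False, True, True]
print(X_reduced.shape)  # (3, 2)
```

Because variance depends on the units of each sensor, the filter is usually applied either before scaling with a very small threshold, as here, or after a scaling that preserves relative spread.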

PCA loadings [47] provide insight into how each original feature in the dataset contributes to the newly created principal components. Specifically, a loading represents the correlation between a particular original feature and a principal component, thereby informing us about the degree and direction of the influence of each original feature on each component.
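The correlation interpretation holds when the features are standardized and each eigenvector is scaled by the square root of its eigenvalue. The following sketch, on synthetic data and with a hypothetical function name, computes loadings that way and checks one of them against the directly computed correlation.

```python
import numpy as np

def pca_loadings(X):
    """Return (loadings, scores, Z) for PCA on standardized features.

    loadings[j, k] is the correlation between feature j and component k.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.cov(Z, rowvar=False)  # equals the correlation matrix of X
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]  # sort components by explained variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    scores = Z @ eigvecs
    return loadings, scores, Z

# Synthetic data: three correlated "sensors" (not real C-MAPSS readings).
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(50, 1)),
               -base + 0.3 * rng.normal(size=(50, 1)),
               rng.normal(size=(50, 1))])
loadings, scores, Z = pca_loadings(X)
direct = np.corrcoef(Z[:, 0], scores[:, 0])[0, 1]
print(np.isclose(loadings[0, 0], direct))  # True
```

The sign of a loading gives the direction of the relationship and its magnitude gives the strength, so two sensors with large loadings of opposite sign on the same component move against each other along it.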

The C-MAPSS (Commercial Modular Aero-Propulsion System Simulation) dataset is a comprehensive collection of data that simulates the performance and degradation of large commercial turbofan aircraft engines. It is composed of four distinct sub-datasets (FD001, FD002, FD003, and FD004), each containing readings from 21 sensors, as provided by NASA [48]. The dataset covers various engine flight conditions and fault modes, and each sub-dataset contains both a training and a test set. Table 3 outlines the specific composition of the C-MAPSS dataset.
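A sub-dataset file can be read with a few lines of NumPy. The sketch below assumes the whitespace-separated layout of 26 columns per row (engine id, cycle, three operational settings, then the 21 sensor readings). The column names are hypothetical labels, and the rows are synthetic stand-ins rather than real file contents.

```python
import io
import numpy as np

# Hypothetical column labels for the assumed 26-column layout.
COLUMNS = (["unit", "cycle"]
           + [f"setting_{i}" for i in range(1, 4)]
           + [f"sensor_{i}" for i in range(1, 22)])

def load_cmapss(file_like):
    """Parse a whitespace-separated file into a dict of column arrays."""
    data = np.atleast_2d(np.loadtxt(file_like))
    if data.shape[1] != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {data.shape[1]}")
    return {name: data[:, i] for i, name in enumerate(COLUMNS)}

# Synthetic rows in the assumed layout: one engine observed for three cycles.
rows = [[1, c] + [0.0] * 3 + [float(s) for s in range(1, 22)] for c in (1, 2, 3)]
text = "\n".join(" ".join(str(v) for v in row) for row in rows)
table = load_cmapss(io.StringIO(text))
print(len(COLUMNS), table["cycle"].tolist())  # 26 [1.0, 2.0, 3.0]
```

Passing a path to one of the downloaded files instead of the `StringIO` object should work the same way, provided the file follows this layout.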

Feature selection is critical for achieving accurate predictions while minimizing computational time, and the appropriate choice depends on the specific dataset being used. For the FD001 dataset, there are 15 principal components that could potentially serve as features. Table 4 lists the features chosen by each technique, including GA, RFE, LASSO, FIRF, and AFICv.

The five feature selection techniques employed in this study perform quite similarly when applied to the FD001 sub-dataset. To streamline the presentation of results for the other sub-datasets (FD002, FD003, and FD004), as shown in Table 8, only the GA technique, which selected the highest number of features, was compared with the AFICv technique, which selected the lowest number. This comparison provides valuable insights into the impact of feature selection on model performance.

This observation suggests that AFICv can effectively identify a smaller subset of the most important features, leading to more efficient and potentially more interpretable models without sacrificing prediction accuracy. In practical applications, the reduced number of features may lead to faster training and prediction times, as well as lower computational costs. The comparable performance between the two methods, despite the difference in the number of selected features, highlights the effectiveness of AFICv as a feature selection technique in the context of the C-MAPSS datasets and the four machine learning models considered in this study.