1] Data Understanding
Identifying and differentiating between types of data such as categorical, numerical, and time-series. Interpreting data formats including CSV, JSON, and databases, and understanding their respective advantages and use cases.
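As a rough illustration of the difference between data types and data formats, the sketch below reads a small CSV sample and guesses whether each column is numerical, datetime-like (a candidate time index), or categorical. The sample data, the `infer_kind` helper, and its three labels are assumptions made for this example, not part of the original workflow.

```python
import csv
import io
import json
from datetime import datetime


def infer_kind(values):
    """Classify a column of strings as 'numerical', 'datetime', or 'categorical'."""
    def all_parse(parser):
        try:
            for value in values:
                parser(value)
            return True
        except ValueError:
            return False

    if all_parse(float):
        return "numerical"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    return "categorical"


# Hypothetical sample: CSV stores every field as text, so types must be inferred.
CSV_TEXT = "date,city,temp\n2024-01-01,Oslo,-3.5\n2024-01-02,Rome,11.0\n"
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
kinds = {col: infer_kind([row[col] for row in rows]) for col in rows[0]}

# The same kind of record as JSON: numbers keep their type after parsing.
records = json.loads('[{"date": "2024-01-01", "city": "Oslo", "temp": -3.5}]')

print(kinds)
print(type(records[0]["temp"]).__name__)
```

One trade-off this makes visible: CSV is compact and tabular but untyped, while JSON preserves basic types and nesting at the cost of more verbose files.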
2] Data Cleaning and Preprocessing
Handling missing values, outliers, and inconsistencies so that raw data is transformed into a format suitable for analysis. Techniques such as imputation, scaling, and normalization help ensure the data is ready for further exploration and modeling.
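The cleaning steps named above could be sketched as follows, using only the standard library. Median imputation, the 1.5 × IQR outlier rule, and the helper names are choices made for this illustration; the original text does not say which specific methods were used.

```python
from statistics import mean, median, pstdev, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def flag_outliers_iqr(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


print(impute_median([1.0, None, 3.0]))
print(flag_outliers_iqr([10, 12, 11, 13, 12, 11, 100]))
print(min_max_scale([2.0, 4.0, 6.0]))
```

In a modeling workflow, the fill value and scaling parameters would normally be computed on the training data only and then reused on validation and test data.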
3] Descriptive Statistics and Visualization
Applying measures of central tendency (mean, median, mode) and measures of dispersion (range, variance, standard deviation). Using visualization tools such as histograms, box plots, scatter plots, and bar charts to create informative visual representations of the data.
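A minimal sketch of these summary measures, using Python's `statistics` module. The `summarize` and `histogram_counts` helpers are hypothetical names; the text histogram is a dependency-free stand-in for a plotting library, and the population forms of variance and standard deviation are an assumption (swap in `variance` and `stdev` for sample estimates).

```python
from statistics import mean, median, mode, pstdev, pvariance


def summarize(values):
    """Return measures of central tendency and dispersion for a numeric list."""
    return {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "variance": pvariance(values),  # population variance
        "std_dev": pstdev(values),      # population standard deviation
    }


def histogram_counts(values, bins):
    """Count values in equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    counts = [0] * bins
    if hi == lo:
        counts[0] = len(values)
        return counts
    width = (hi - lo) / bins
    for v in values:
        # The maximum value falls on the right edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(summarize(data))
for i, count in enumerate(histogram_counts(data, 3)):
    print(f"bin {i}: {'#' * count}")
```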
4] Pattern and Relationship Identification
Using statistical techniques to identify correlations, trends, seasonality, and anomalies within the data. Techniques such as correlation matrices, autocorrelation plots, and time-series decomposition help uncover these patterns and relationships.
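To make two of these techniques concrete, the sketch below computes a Pearson correlation coefficient and a lag-k autocorrelation by hand. The function names and the synthetic period-4 series are assumptions for illustration; an autocorrelation plot is simply this value drawn for a range of lags.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def autocorrelation(series, lag):
    """Sample autocorrelation of a series with itself shifted by `lag` steps."""
    m = mean(series)
    denom = sum((x - m) ** 2 for x in series)
    if denom == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    num = sum((series[t] - m) * (series[t + lag] - m)
              for t in range(len(series) - lag))
    return num / denom


# Synthetic series that repeats every 4 steps (a stand-in for seasonality).
seasonal = [0, 1, 0, -1] * 6

print(pearson([1, 2, 3], [2, 4, 6]))
print(autocorrelation(seasonal, 4))  # strongly positive at the seasonal period
print(autocorrelation(seasonal, 2))  # strongly negative at half the period
```

A peak in autocorrelation at a particular lag is one common hint of seasonality with that period.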
5] Feature Selection
Implementing feature selection techniques to identify the most relevant features for model building. Methods such as recursive feature elimination (RFE), feature importance from tree-based models, and mutual information help in selecting features that contribute most to the predictive power of the model.
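As one example of the methods listed, mutual information between a discrete feature and a discrete target can be computed directly from counts. This is a hand-rolled sketch with hypothetical names and toy data; libraries such as scikit-learn provide `RFE` and `mutual_info_classif` for real use, and the original text does not say which implementation was used.

```python
from collections import Counter
from math import log


def mutual_information(feature, target):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(feature)
    joint = Counter(zip(feature, target))
    p_feature = Counter(feature)
    p_target = Counter(target)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * log(p_xy / ((p_feature[x] / n) * (p_target[y] / n)))
    return mi


def rank_features(features, target):
    """Return (name, score) pairs sorted from most to least informative."""
    scores = {name: mutual_information(column, target)
              for name, column in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: "signal" determines the target exactly, "noise" is independent of it.
target = [0, 0, 1, 1]
features = {
    "noise": [0, 1, 0, 1],
    "signal": ["a", "a", "b", "b"],
}
ranking = rank_features(features, target)
print(ranking)
```

Features with scores near zero carry little information about the target on their own, although they may still matter in combination with others.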
6] Target Encoding
Using target encoding to replace each category of a categorical variable with a numerical value derived from the target variable, typically the mean target for that category. This technique can improve model performance, especially for high-cardinality categorical features.
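A minimal sketch of mean target encoding follows. The additive smoothing toward the global mean and the fallback for unseen categories are assumptions added for this example, as are the function names; the original text does not specify a particular variant.

```python
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=0.0):
    """Learn a category -> encoded value mapping from training data.

    Each category maps to a blend of its own target mean and the global mean;
    `smoothing` acts like a number of pseudo-observations at the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    mapping = {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, default):
    """Encode categories, using `default` for categories not seen in training."""
    return [mapping.get(category, default) for category in categories]


train_cats = ["a", "a", "b"]
train_y = [1, 0, 1]
mapping, global_mean = fit_target_encoding(train_cats, train_y)
encoded = apply_target_encoding(["a", "b", "z"], mapping, default=global_mean)
print(mapping)
print(encoded)
```

Because the encoding is derived from the target, fitting it on the same rows it is later applied to can leak label information; fitting on training folds only is a common precaution.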
7] Model Building
We built and tuned CatBoost, XGBoost, Random Forest, Decision Tree, and LightGBM models. Hyperparameter tuning used techniques such as grid search with cross-validation, optimizing parameters such as tree depth, learning rate, and number of estimators to improve model accuracy, precision, and overall performance.
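The mechanics of grid search with k-fold cross-validation can be sketched without any third-party library. The models named above come from external packages, so this example substitutes a toy one-dimensional nearest-neighbour classifier as a stand-in; the helper names, the data, and the `n_neighbors` grid are all assumptions for illustration. In practice a utility such as scikit-learn's `GridSearchCV` would typically wrap the real estimators.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, n_folds):
    """Yield (train, validation) index lists; sample i goes to fold i % n_folds."""
    for fold in range(n_folds):
        val = [i for i in range(n_samples) if i % n_folds == fold]
        train = [i for i in range(n_samples) if i % n_folds != fold]
        yield train, val


def knn_predict(train_x, train_y, x, n_neighbors):
    """Toy stand-in model: majority vote among the nearest 1-D neighbours."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    votes = [train_y[i] for i in order[:n_neighbors]]
    # Sorting the labels makes tie-breaking deterministic (smallest label wins).
    return max(sorted(set(votes)), key=votes.count)


def grid_search_cv(xs, ys, param_grid, n_folds=3):
    """Try every parameter combination; keep the best mean validation accuracy."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        fold_scores = []
        for train, val in k_fold_indices(len(xs), n_folds):
            train_x = [xs[i] for i in train]
            train_y = [ys[i] for i in train]
            hits = sum(knn_predict(train_x, train_y, xs[i], **params) == ys[i]
                       for i in val)
            fold_scores.append(hits / len(val))
        score = mean(fold_scores)
        if score > best_score:  # strict: earlier combinations win ties
            best_params, best_score = params, score
    return best_params, best_score


# Two well-separated clusters, so every candidate should classify them perfectly.
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5]
ys = [0] * 6 + [1] * 6
best_params, best_score = grid_search_cv(xs, ys, {"n_neighbors": [1, 3, 5]})
print(best_params, best_score)
```

The same loop structure applies to parameters such as depth, learning rate, and number of estimators: each combination is scored on held-out folds, and the combination with the best average score is kept.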