MILESTONE 3
MODEL IMPLEMENTATION
In Milestone 2, we web-scraped, cleaned, and preprocessed the data, and then downloaded the resulting dataset for model implementation. The data contains categorical columns (prev, curr, type) and a numerical column (n).
Drive link to Dataset: https://drive.google.com/file/d/1uVnp5R53D11NwXas5wvC2xbR4hCU-w9X/view?usp=sharing
Colab link: https://colab.research.google.com/drive/1ivd35mTSK3QMqT-D7QJwkQugKobtbtAa?usp=sharing
Following are the machine learning models and preprocessing steps:
Use case: We aim to classify relationships based on the interaction type (type), or to predict whether a given interaction (prev, curr) exists, based on the other features.
Variables:
Features: The categorical columns prev, curr, and type can be one-hot encoded and used as features.
Target: Depending on the specific task, the target can be binary (e.g., whether an interaction exists); in our experiments the target is the interaction type (type).
Model details:
Training and testing data split: 80% training, 20% testing.
One-hot encoding of the categorical columns prev, curr, and type.
The count column n may be used as an additional feature (e.g., as interaction strength).
Logistic Regression
Random Forest
XGBoost
SGD Classifier
Neural network (MLP Classifier)
Step 1: Imported the necessary libraries, loaded the data, and performed initial data exploration.
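An illustrative sketch of this step, assuming a tab-separated file with the columns prev, curr, type, and n (the file name and separator are placeholders; they should match the dataset linked above):

import pandas as pd

# Load the cleaned clickstream dataset produced in Milestone 2.
# The file name and separator below are assumptions; adjust them to the actual file.
df = pd.read_csv("clickstream_clean.tsv", sep="\t",
                 names=["prev", "curr", "type", "n"])

# Basic data exploration.
print(df.shape)                    # number of rows and columns
print(df.head())                   # first few records
print(df.dtypes)                   # column data types
print(df["type"].value_counts())   # distribution of interaction types
print(df.isna().sum())             # missing values per column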
Step 2: Finding frequent itemsets using Apriori or FP-growth. This reveals patterns in Wikipedia article transitions.
We convert the dataset into transactions (pairs of the prev and curr columns) so that frequent pattern mining can be applied, e.g., finding common combinations of articles visited by users.
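A minimal sketch of the transaction conversion and frequent-itemset mining, assuming the mlxtend library is used; the min_support value is illustrative, and on the full dataset a sample may be needed to keep the one-hot transaction matrix manageable:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, fpgrowth

# Each (prev, curr) pair becomes one transaction of two "items".
transactions = df[["prev", "curr"]].astype(str).values.tolist()

# One-hot encode the transactions.
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
trans_df = pd.DataFrame(te_array, columns=te.columns_)

# Mine frequent itemsets; FP-growth is a faster alternative on large data.
frequent_items = apriori(trans_df, min_support=0.01, use_colnames=True)
# frequent_items = fpgrowth(trans_df, min_support=0.01, use_colnames=True)
print(frequent_items.sort_values("support", ascending=False).head())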
Step 3: Transition Probabilities
Calculating the transition probabilities between articles (prev to curr). By determining the probability of each transition, we can analyze the flow of Wikipedia user behavior, i.e., how likely users are to move from one article to another. This is a common technique in predictive modeling and recommendation systems.
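A sketch of how the transition probabilities can be computed with pandas, where the probability of moving from prev to curr is n divided by the total count of all transitions leaving prev (the example article name is illustrative):

# Total number of transitions leaving each source article.
out_totals = df.groupby("prev")["n"].transform("sum")

# P(curr | prev) = n(prev, curr) / sum of n over all transitions out of prev.
df["transition_prob"] = df["n"] / out_totals

# Most likely next articles for a given source page.
example = df[df["prev"] == "2024_in_film"]
print(example.sort_values("transition_prob", ascending=False)[["curr", "transition_prob"]].head())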
Step 4: Dividing training and testing data and preprocessing
Label encoding converts the categorical data (the prev column) into numeric labels, and standard scaling normalizes the numerical data (the n column) to a standard scale (mean = 0, standard deviation = 1), which is essential for many machine learning models.
Encoding transforms non-numeric data into a numeric format, and scaling ensures all features contribute equally to the model, preventing bias toward larger-scale features. We then split the dataset into training (80%) and testing (20%) subsets, where X contains the features and y is the target variable, i.e., type.
StandardScaler(): Standardizes the data by removing the mean and scaling to unit variance, which is particularly important when using models like neural networks.
fit_transform(): Fits the scaler to the 'n' column and then applies the scaling transformation.
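A sketch of the encoding, scaling, and 80/20 split described above; the exact feature set may differ in the notebook, but the steps follow the description:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Encode the categorical prev column and the target type into numeric labels.
le_prev = LabelEncoder()
df["prev_enc"] = le_prev.fit_transform(df["prev"])
le_type = LabelEncoder()
df["type_enc"] = le_type.fit_transform(df["type"])

# Standardize the numeric column n (mean = 0, standard deviation = 1).
scaler = StandardScaler()
df["n_scaled"] = scaler.fit_transform(df[["n"]]).ravel()

# X contains the features, y is the target variable (type).
X = df[["prev_enc", "n_scaled"]]
y = df["type_enc"]

# 80% training, 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)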
Step 5: Model 1: Logistic regression
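A minimal sketch of training and evaluating the Logistic Regression model (default hyperparameters, with max_iter raised so the solver converges):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))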
Findings: Logistic Regression yields an accuracy of 76.36%, meaning it correctly predicts the class for approximately 76% of the test samples, which is not extremely high but indicates reasonably strong performance. Most of this accuracy comes from the largest class, Class 0; the model performs poorly on the minority classes, Classes 1 and 2. This imbalance could be addressed through resampling, class weighting, or alternative algorithms.
Step 5: Model 2: Random Forest
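A sketch of the Random Forest model; n_estimators is an illustrative choice, and class_weight="balanced" is one optional way to counter the class imbalance noted below:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))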
Findings: The overall accuracy is very high, indicating that the model correctly predicted the class labels for about 99% of the test samples. Accuracy alone may not capture class imbalance well, but precision is also high, showing good performance across classes in correctly identifying positive samples. In summary, although the Random Forest model achieves high accuracy and performs well on most classes, addressing the class imbalance affecting Class 2 would most likely improve overall performance further.
Step 5: Model 3: XGBoost
A highly efficient and popular boosting algorithm for classification tasks.
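A sketch of the XGBoost classifier; the parameter values shown are common illustrative settings, not necessarily those used in the notebook:

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, max_depth=6,
                    eval_metric="mlogloss")
xgb.fit(X_train, y_train)

y_pred = xgb.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))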
Findings: The accuracy of the XGBoost model is 98.86%, indicating a very high level of overall performance; the model correctly predicts around 99% of the test data points. It performs very well for the large classes, Class 0 and Class 1, for which both precision and recall are very high. However, recall drops significantly for Class 2: the model is inefficient at correctly identifying Class 2 instances. Overall, XGBoost works very well for most classes, especially the dominant Classes 0 and 1, with near-perfect classification metrics.
Step 5: Model 4: SGD Classifier (SVM)
This is a linear classifier that uses Stochastic Gradient Descent to fit the model. It's efficient for large datasets and supports various loss functions, such as 'hinge' for linear SVMs (Support Vector Machines).
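A sketch of the SGD classifier with hinge loss, which corresponds to a linear SVM trained with stochastic gradient descent:

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score, classification_report

sgd = SGDClassifier(loss="hinge", max_iter=1000, tol=1e-3, random_state=42)
sgd.fit(X_train, y_train)

y_pred = sgd.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))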
Findings: The high precision indicates that the model performs well overall, correctly classifying most instances in the test set. While the model does well in terms of accuracy and classification of the majority class, its handling of the minority classes remains the main limitation.
Step 5: Model 5: Neural Network (MLP)
We additionally took steps to balance the dataset for this model.
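A sketch of the MLP classifier; the hidden-layer sizes and iteration count are illustrative and may differ from the notebook:

from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, classification_report

# Scaled inputs (from Step 4) matter here: MLPs are sensitive to feature scale.
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), activation="relu",
                    max_iter=300, random_state=42)
mlp.fit(X_train, y_train)

y_pred = mlp.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))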
Findings: The accuracy of the Neural Network (MLP) model is 72.81%. This means that roughly 73% of the test samples were classified correctly by the model. The macro average gives the unweighted mean performance across all classes. The values are relatively low because of poor performance in class 2. Since the model performs well on the majority class, class 0, with a high recall of 1.00, it suggests that almost all instances of class 0 were correctly identified by the model, capturing the pattern of the dominant class effectively.
Step 6: Comparing model performance: evaluating the trained models on key classification metrics and visualizing the results.
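A sketch of the comparison step, assuming the fitted model objects from the sketches above (log_reg, rf, xgb, sgd, and mlp are the hypothetical names used there):

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

models = {"Logistic Regression": log_reg, "Random Forest": rf,
          "XGBoost": xgb, "SGD (SVM)": sgd, "Neural Network (MLP)": mlp}

rows = []
for name, model in models.items():
    pred = model.predict(X_test)
    rows.append({"model": name,
                 "accuracy": accuracy_score(y_test, pred),
                 "precision": precision_score(y_test, pred, average="macro", zero_division=0),
                 "recall": recall_score(y_test, pred, average="macro", zero_division=0),
                 "f1": f1_score(y_test, pred, average="macro", zero_division=0)})

results = pd.DataFrame(rows).set_index("model")
print(results)
results.plot(kind="bar", figsize=(10, 5), title="Model comparison")
plt.tight_layout()
plt.show()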
Step 7: Hyperparameter Tuning
We applied GridSearchCV and RandomizedSearchCV to the Logistic Regression, SGD, and MLP models.
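A sketch of the grid search for Logistic Regression; the parameter grid is illustrative, and the SGD and MLP searches follow the same pattern (RandomizedSearchCV can replace GridSearchCV for larger grids):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {"C": [0.01, 0.1, 1, 10],
              "solver": ["lbfgs", "liblinear"]}

grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring="accuracy", n_jobs=-1)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print("Best cross-validation accuracy:", grid.best_score_)
print("Test accuracy:", grid.best_estimator_.score(X_test, y_test))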
Step 8: Hyperparameter Tuning: Model 1: Logistic Regression
Findings: The tuned Logistic regression significantly improves in accuracy, increasing from 76.3% to 82.2%. This improvement indicates that hyperparameter tuning helps the model converge better and perform more accurately.
Step 8: Hyperparameter Tuning: Model 2: Neural Network (MLP)
Findings: The tuned Neural Network (MLP) significantly improves in accuracy, increasing from 72.81% to 83.18%. This improvement indicates that hyperparameter tuning helps the model converge better and perform more accurately.
Step 9: Data Preprocessing and K-Means Clustering
Findings: The clustering results show that the features 'prev', 'curr', 'type', and 'n' have been separated into 5 clusters based on their similarities. The feature values for Clusters 0, 1, 2, and 4 all appear very similar; however, each cluster has certain unique characteristics, especially regarding the 'prev', 'curr', and 'type' features, while the remaining cluster has much lower values across most of these features.
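A sketch of the preprocessing and K-Means clustering step, using the five clusters described in the findings (the encoding and scaling choices are assumptions consistent with Step 4):

from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Build a numeric feature matrix from 'prev', 'curr', 'type', and 'n'.
cluster_df = df[["prev", "curr", "type", "n"]].copy()
for col in ["prev", "curr", "type"]:
    cluster_df[col] = LabelEncoder().fit_transform(cluster_df[col])
X_cluster = StandardScaler().fit_transform(cluster_df)

# Cluster into k = 5 groups.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
df["cluster"] = kmeans.fit_predict(X_cluster)

# Inspect the average feature values per cluster.
print(cluster_df.assign(cluster=df["cluster"]).groupby("cluster").mean())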
Step 10: 3D PCA Visualization of K-Means Clustering
PCA projects the data into fewer dimensions (3 in this case) while retaining as much variance as possible. It helps visualize high-dimensional data in a 3D space.
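A sketch of the 3D PCA visualization, continuing from the scaled matrix X_cluster and the cluster labels of the previous sketch:

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older matplotlib)
from sklearn.decomposition import PCA

# Project the scaled clustering features onto 3 principal components.
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_cluster)
print("Explained variance ratio:", pca.explained_variance_ratio_)

fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")
scatter = ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2],
                     c=df["cluster"], cmap="viridis", s=5)
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_zlabel("PC3")
plt.title("3D PCA projection of the K-Means clusters")
plt.colorbar(scatter, label="cluster")
plt.show()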
Step 11: Network Analysis with Centrality Measures
nx.degree_centrality(subgraph): Calculates the degree centrality for each node in the subgraph. Degree centrality measures how many direct connections (edges) a node has. A higher value means the node is connected to more other nodes.
nx.pagerank(subgraph, alpha=0.85): Calculates the PageRank of each node in the subgraph. PageRank is an algorithm that ranks nodes based on the number and quality of links (edges) that point to them, with the alpha parameter controlling the damping factor (probability of jumping to a random node).
nx.betweenness_centrality(subgraph): Calculates betweenness centrality for each node. Betweenness centrality measures how often a node acts as a bridge along the shortest path between two other nodes. A higher value indicates that a node is an important connector in the network.
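A sketch of the network construction and centrality computations, assuming the subgraph is restricted to the highest-degree nodes (as in the visualizations below) to keep the computations tractable:

import networkx as nx

# Directed graph: edges run prev -> curr, weighted by the count n.
G = nx.from_pandas_edgelist(df, source="prev", target="curr",
                            edge_attr="n", create_using=nx.DiGraph())

# Restrict to the top 5000 nodes by degree.
top_nodes = [node for node, _ in sorted(G.degree, key=lambda x: x[1], reverse=True)[:5000]]
subgraph = G.subgraph(top_nodes)

degree_centrality = nx.degree_centrality(subgraph)
pagerank = nx.pagerank(subgraph, alpha=0.85)
# Exact betweenness is slow on large graphs; the k parameter approximates it.
betweenness = nx.betweenness_centrality(subgraph, k=500, seed=42)
avg_clustering = nx.average_clustering(subgraph.to_undirected())

print("Top node by degree centrality:", max(degree_centrality, key=degree_centrality.get))
print("Top node by PageRank:", max(pagerank, key=pagerank.get))
print("Top node by betweenness centrality:", max(betweenness, key=betweenness.get))
print("Average clustering coefficient:", avg_clustering)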
Degree Centrality identifies 'other-empty' as the most well-connected node in terms of direct links. PageRank identifies 'Margaret_Qualley' as the most important node based on the quality of its incoming links.
Betweenness Centrality indicates '2024_in_film' as the most important connector node.
The overall low clustering coefficient reflects that the network has a low clustering level, meaning the nodes are somewhat sparsely interconnected in forming triangles.
These different centrality measures provide insight into the structure and relative importance of nodes within a network, enabling the identification of key pages and their relationships in the data.
Step 12: Generating a basic visualization of a subgraph of the Wikipedia clickstream network, focusing on the top 5000 nodes.
The resulting plot displays the top 5000 nodes of the Wikipedia clickstream network, arranged using the spectral layout. Key nodes (based on degree centrality) are labeled, and the graph shows the interconnections between nodes, helping visualize the network's structure.
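A sketch of the subgraph visualization with the spectral layout; it reuses the subgraph and degree_centrality objects from the sketch above, the number of labelled nodes is an illustrative choice, and nx.spring_layout gives the alternative spring-layout view described next:

import matplotlib.pyplot as plt
import networkx as nx

pos = nx.spectral_layout(subgraph)

plt.figure(figsize=(12, 12))
nx.draw_networkx_nodes(subgraph, pos, node_size=10, node_color="steelblue", alpha=0.6)
nx.draw_networkx_edges(subgraph, pos, alpha=0.2, arrows=False)

# Label only the top nodes by degree centrality to keep the plot readable.
top_labels = sorted(degree_centrality, key=degree_centrality.get, reverse=True)[:10]
nx.draw_networkx_labels(subgraph, pos, labels={n: n for n in top_labels}, font_size=8)

plt.title("Wikipedia clickstream subgraph (top 5000 nodes, spectral layout)")
plt.axis("off")
plt.show()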
We also visualized the top 5000 nodes of the graph G using a spring layout for better clarity; the resulting plot shows the Wikipedia clickstream network of 5000 nodes arranged with this layout.
The graph is connected, and most nodes in the subgraph have high degree centrality, indicating that they are well connected within the network. However, the PageRank values are low, meaning the nodes are not highly important in terms of network influence, and betweenness centrality is minimal, suggesting the subgraph does not contain significant intermediary nodes. These findings help characterize the structural properties of the graph and where the most influential or important nodes lie in terms of centrality.

In the sample output, the degree_centrality value is 0.982298 for all rows, indicating that these nodes have a relatively high number of connections in the subgraph with respect to the total number of nodes; they are among the more connected nodes in the network. The PageRank value is very low (0.000064) for all nodes in the sample, which is typical for nodes that are not highly influential, i.e., that receive few important incoming connections from other nodes. The betweenness_centrality value is 0.0 for all nodes, showing that none of these nodes acts as a significant bridge or intermediary between other nodes in this subgraph. This implies that the subgraph is relatively decentralized, or that the nodes are not placed in strategically connective positions.