Link to notebook : https://colab.research.google.com/drive/1RN0QElyvRCGYW_4qOv2EbTJTAVqu1yRb#scrollTo=3ktycpeirce0
Data Collection/Preparation
For this analysis, we used a dataset derived from the public clickstream data published by Wikimedia, the Wikipedia Clickstream data dump. These clickstream datasets are organized by month and by language and were fetched using web-scraping tools.
Data Sources:
Wikipedia clickstream dumps from 2017 and 2024.
Language code mappings, used to identify the language of each dump where available in the dataset.
Challenges:
The full dataset is large (~69 GB), so to make the analysis feasible we had to down-sample it to data pertaining only to the English language.
The literal page title "NaN" was being interpreted as a missing value, an anomaly that had to be addressed.
Data Cleaning: To ensure data quality, several cleaning steps were applied, including removing duplicate entries and replacing missing values in critical columns, such as page names, with suitable representations (for example, converting "NaN" into a valid string for processing).
Outlier Detection: Z-score analysis was used to identify outliers in page traffic, and scatter and box plots were used to visualize them. In total, 219 outliers were identified and removed to improve the accuracy of the analyses that followed.
Data Cleaning Steps involved:
Step 1: Importing necessary libraries:
For data cleaning and exploration we need Python libraries such as numpy, pandas, and powerlaw, as well as sklearn for normalization checks.
For web scraping we used Python libraries such as re, requests, bs4, and xml.
For visualization we used matplotlib and seaborn.
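A minimal import cell matching the libraries listed above might look like the sketch below; the powerlaw package name and the use of scipy.stats (for the z-scores and Q-Q plot used later) are assumptions, not taken from the notebook.

```python
# Core data handling
import numpy as np
import pandas as pd

# Web scraping
import re
import requests
from bs4 import BeautifulSoup
import xml.etree.ElementTree as ET   # only needed if the listing is parsed as XML

# Distribution fitting and normalization checks
import powerlaw                       # assumed package for the power-law analysis
from sklearn.preprocessing import MinMaxScaler
from scipy import stats               # assumed: used for z-scores and the Q-Q plot

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
```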
Step 2: Web scraping
We web-scraped the entire Wikipedia clickstream dataset using Python web-scraping libraries.
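A rough sketch of how the dump listing can be scraped is shown below; it assumes the standard Wikimedia clickstream index at https://dumps.wikimedia.org/other/clickstream/, and the exact notebook code may differ.

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://dumps.wikimedia.org/other/clickstream/"

def list_monthly_dirs():
    """Return the monthly sub-directories (e.g. '2024-01/') listed on the index page."""
    html = requests.get(BASE_URL, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].rstrip("/").replace("-", "").isdigit()]

def list_dump_files(month_dir):
    """Return the .tsv.gz dump file names inside one monthly directory."""
    html = requests.get(BASE_URL + month_dir, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].endswith(".tsv.gz")]

months = list_monthly_dirs()
print(months[:3], list_dump_files(months[0])[:3])
```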
Step 3: We noticed that the web-scraped data is huge, so we wanted to explore narrowing the dataset down to a specific niche. In the snippet below you can see that we analyzed the whole dataset to get an overall picture before cleaning.
Snippet before data cleaning
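The actual snippet is in the notebook; a simplified sketch of that pre-cleaning survey (counting available dump files per language code, relying on the hypothetical months list and list_dump_files() helper from the scraping sketch above) could look like this:

```python
# Pre-cleaning survey: how many dump files exist per wiki/language code.
import re
from collections import Counter

lang_counts = Counter()
for month in months:
    for fname in list_dump_files(month):
        # File names follow the pattern clickstream-<wiki>-<YYYY-MM>.tsv.gz
        match = re.match(r"clickstream-(\w+)-\d{4}-\d{2}\.tsv\.gz", fname)
        if match:
            lang_counts[match.group(1)] += 1

print(lang_counts.most_common(10))  # e.g. enwiki, dewiki, ...
```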
Step 4: In this step we decided to move forward with the English Wikipedia dataset. Exploration of the English dataset.
Step 5: Data cleaning/exploration of the English dataset
Web scraping the English dataset and getting a rough idea of the data through counts, description, and info
Finding data types and null values
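A sketch of this first look, assuming a downloaded English dump file such as clickstream-enwiki-2024-01.tsv.gz (the file name is illustrative, not the one used in the notebook):

```python
import pandas as pd

# Clickstream TSVs are tab-separated with no header row: prev, curr, type, n
cols = ["prev", "curr", "type", "n"]
df = pd.read_csv("clickstream-enwiki-2024-01.tsv.gz", sep="\t", names=cols)

print(df.shape)        # rough row/column count
df.info()              # data types per column
print(df.describe())   # summary statistics for the numeric column n
print(df.isna().sum()) # null values per column
```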
Inference: The string "NaN" in the raw TSV data indicates a legitimate Wikipedia article title rather than a missing value in these columns: the "NaN" wiki page is real. We must convert NaN to a string, because pandas does not treat it as a normal category value, which means it would be excluded from counts, group-bys, and similar operations.
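One way to keep the literal article title "NaN" as a string rather than a pandas missing value is sketched below (either option achieves the conversion described above):

```python
# Option 1: stop pandas from treating the literal title "NaN" as missing when reading.
df = pd.read_csv("clickstream-enwiki-2024-01.tsv.gz", sep="\t",
                 names=["prev", "curr", "type", "n"], keep_default_na=False)

# Option 2: if the file is already loaded, convert parsed missing values back
# into the string "NaN" so the page is counted as a normal category value.
df["prev"] = df["prev"].fillna("NaN")
df["curr"] = df["curr"].fillna("NaN")
```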
Dropping duplicates from the dataset.
Data Reduction: The data was reduced from X million rows (per clickstream update) to 1 million rows because of limited computational power; otherwise the system crashed from RAM usage. Since system memory was not sufficient to handle the entire dataset, only 1 million rows of data were used for further analysis.
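A sketch of the de-duplication and the reduction to 1 million rows; the random-sample approach is an assumption, and the notebook's exact method may differ:

```python
# Remove exact duplicate (prev, curr, type, n) rows.
df = df.drop_duplicates()

# Down-sample to 1 million rows to stay within available RAM. A random sample
# keeps the traffic distribution roughly representative; passing
# nrows=1_000_000 to read_csv would achieve a similar reduction at load time.
df_small = df.sample(n=1_000_000, random_state=42).reset_index(drop=True)
```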
Format of the dataset: The current data includes the following 4 fields:
prev: the result of mapping the referrer URL to a fixed set of values
curr: the title of the article the client requested
type: describes the (prev, curr) pair; one of:
link: if the referrer and request are both articles and the referrer links to the request
external: if the referrer host is not en(.m)?.wikipedia.org
other: if the referrer and request are both articles but the referrer does not link to the request. This can happen when clients search or spoof their referrer.
n: the number of occurrences of the (referrer, resource) pair. This column contains numerical counts (the number of clicks or navigations between pages).
Outlier Detection: We found very few outliers in column "n" using a box plot and a scatter plot, and removed them.
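A minimal sketch of this step on the 1-million-row sample; the |z| > 3 cutoff is a common convention and an assumption here, not necessarily the threshold used in the notebook:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

# Visual inspection of the click counts in n
sns.boxplot(x=df_small["n"])
plt.show()
plt.scatter(range(len(df_small)), df_small["n"], s=2)
plt.show()

# Z-score based removal (the report above mentions 219 outliers in total).
z = np.abs(stats.zscore(df_small["n"]))
print((z > 3).sum(), "outliers found")
df_clean = df_small[z <= 3].reset_index(drop=True)
```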
We checked the frequency distribution of unique values in the curr column.
Then we counted the occurrences of unique values in the prev column.
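These frequency checks amount to simple value counts, sketched below:

```python
# Frequency distribution of unique values in curr, then prev, and the type categories.
print(df_clean["curr"].value_counts().head(10))
print(df_clean["prev"].value_counts().head(10))
print(df_clean["type"].value_counts())  # should show link / external / other
```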
Inference: The descriptive statistics and visualizations above align well with the documented format of the data: no missing values, a substantial number of unique values in both the prev and curr columns, three variants of the type category (link, external, other), and a numerical field for link traffic volume (n) with a minimum threshold. Finally, we will check the distribution of the link traffic volumes to make sure they are sensible and to find any anomalies in the data. Here we will identify attribute types (categorical, numerical, etc.), develop a deeper understanding of the data (central tendency, dispersion, Q-Q plot), and look at data similarity, normalization, etc.
Step 6: Deeper Understanding of the Data:
Here we cover the following (a sketch follows the list):
Checking the Categorical and Numerical columns
Central Tendency: Calculate the average and median of n
Dispersion: Calculate the variance and standard deviation of n
Q-Q plot: Assess whether n follows a normal distribution
Data Similarity: The Jaccard similarity between prev and curr
Normalization: Normalize n to values within the range of 0 to 1
Transformation: Log Transformation of n.
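A compact sketch of these computations on the cleaned n column (variable names such as df_clean are carried over from the sketches above and are illustrative):

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

n = df_clean["n"]

# Central tendency
print("mean:", n.mean(), "median:", n.median())

# Dispersion
print("variance:", n.var(), "std:", n.std())

# Q-Q plot of n against a normal distribution
stats.probplot(n, dist="norm", plot=plt)
plt.show()

# Min-max normalization of n into the [0, 1] range
df_clean["n_norm"] = MinMaxScaler().fit_transform(df_clean[["n"]]).ravel()

# Log transformation (log1p handles the heavy right skew of click counts)
df_clean["n_log"] = np.log1p(n)
```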
We could compare categorical variables for similarity using metrics such as cosine similarity or the Jaccard index. To simplify the operation, we focus on the unique values of the prev and curr columns and compute their Jaccard similarity, as sketched below.
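```python
# Jaccard similarity between the sets of unique prev and curr values:
# |intersection| / |union|
prev_set = set(df_clean["prev"].unique())
curr_set = set(df_clean["curr"].unique())

jaccard = len(prev_set & curr_set) / len(prev_set | curr_set)
print("Jaccard similarity between prev and curr:", round(jaccard, 4))
```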
Q-Q plot
Step 7: Ten data visualizations for understanding the data across different columns, together with their inferences, are included in the notebook; one illustrative example follows the list:
Heatmap
Histogram
Bar plots (2) of different columns
Wordcloud
Piechart
Lineplot
Scatterplot
Barchart
Stacked plot
Violin plot
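As a hedged example of the kind of plots listed above (not the exact notebook code), a histogram of the log-transformed traffic counts and a violin plot of log(1 + n) per link type, using the n_log column from the transformation sketch earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Histogram of the log-transformed click counts
sns.histplot(df_clean["n_log"], bins=50, ax=axes[0])
axes[0].set_title("Histogram of log(1 + n)")

# Violin plot of log(1 + n) split by link type (link / external / other)
sns.violinplot(x="type", y="n_log", data=df_clean, ax=axes[1])
axes[1].set_title("log(1 + n) by link type")

plt.tight_layout()
plt.show()
```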