Wikipedia is used by millions of people each day, making it one of the largest and most-visited websites in the world. Less widely known is that the way visitors navigate from one Wikipedia page to another forms a network of its own.
Wikipedia regularly releases clickstream data, which records aggregated page-to-page user visits and captures this interconnected structure. The data lets us study not only visits to single pages but also how users navigate within Wikipedia's huge information network. Analyzing these linkages can give a deeper understanding of information flow, user behavior, and information discovery on one of the largest knowledge systems in the world. These datasets are substantial, and while traditional statistical methods can report traffic volumes and identify the most visited articles, they overlook the deeper insights found in the relationships between pages.
Clickstream analysis is a well-established technique for monitoring and analyzing the behavior of users on websites. Businesses use clickstream data to gain deeper insight into user behavior, preferences, and navigation patterns. By supporting data-driven decisions, this information plays a large role in enhancing user experience, optimizing digital platforms, and ultimately raising engagement and conversion rates.
Clickstream analysis records every customer contact, creating a virtual trail known as a "clickstream." It records the pages visited, the links clicked and content viewed, the amount of time a customer spends on each page, and additional activities such as downloads or form submissions. This detailed data allows companies to understand exactly how people interact with their websites and to identify areas that need improvement.
The main objective of clickstream analysis is to provide useful insights about user preferences, trends, and navigation throughout the user journey. It reveals the most used pages, the path a user takes to complete a task, and how much time the user spends on certain pages. Using this information, companies improve the technical setup of their websites to make the user experience more effective and interactive, thereby attracting more users.
For this project, we use a Wikipedia clickstream dataset to show how clickstream analysis works and how it can benefit businesses. We apply network analysis to uncover insights from the connections between pages: we represent the clickstream data as a network, analyze the structure and key nodes of the graph, perform community detection and natural language processing to discover themes or topics, and use shell decomposition to explore more hidden patterns of browsing behavior on Wikipedia.
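As a minimal sketch of the first and last of these steps, the Python below builds a weighted directed graph from clickstream rows and computes a shell index for each article. It assumes the published tab-separated layout of `prev`, `curr`, `type`, and `n`. The sample rows, the function names, and the choice to drop `external` rows are illustrative assumptions, not the project's actual pipeline.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in the assumed layout: prev, curr, type, n
SAMPLE_TSV = (
    "other-search\tPhysics\texternal\t900\n"
    "Physics\tQuantum_mechanics\tlink\t120\n"
    "Physics\tEnergy\tlink\t80\n"
    "Quantum_mechanics\tEnergy\tlink\t40\n"
    "Energy\tPhysics\tlink\t30\n"
    "Energy\tJoule\tlink\t15\n"
)


def load_edges(tsv_text):
    """Return {(prev, curr): clicks}, keeping article-to-article rows only."""
    edges = defaultdict(int)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for prev, curr, link_type, n in reader:
        if link_type == "external":  # assumption: skip non-article sources
            continue
        edges[(prev, curr)] += int(n)
    return dict(edges)


def incoming_clicks(edges):
    """Total clicks arriving at each article (weighted in-degree)."""
    totals = defaultdict(int)
    for (_, curr), n in edges.items():
        totals[curr] += n
    return dict(totals)


def shell_index(edges):
    """k-shell (core number) of each node, ignoring direction and weights."""
    neighbours = defaultdict(set)
    for prev, curr in edges:
        if prev != curr:
            neighbours[prev].add(curr)
            neighbours[curr].add(prev)
    degree = {node: len(adj) for node, adj in neighbours.items()}
    remaining = set(neighbours)
    shell = {}
    k = 0
    while remaining:
        # Peel off the node with the smallest remaining degree.
        node = min(remaining, key=degree.__getitem__)
        k = max(k, degree[node])
        shell[node] = k
        remaining.remove(node)
        for other in neighbours[node]:
            if other in remaining:
                degree[other] -= 1
    return shell


edges = load_edges(SAMPLE_TSV)
print(incoming_clicks(edges))
print(shell_index(edges))
```

The peeling loop is quadratic in the number of nodes, which is fine for a toy example. For a full monthly dump, a graph library with a linear-time core decomposition would be a more practical choice.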
The 10 questions we will answer using the Wikipedia Clickstream Dataset:
1. Which Wikipedia pages are linked to the most by other pages?
2. How many readers shift from broad topics to specific topics?
3. What patterns of behavior do users follow when navigating between pages related to the same topic?
4. How many outbound links does a page have, and how likely is it that users would follow those links to continue their navigation?
5. Are subject communities - understood here as clusters of articles that are routinely read together - observable in Wikipedia's clickstream data?
6. In the clickstream network, how do specialist or scholarly topics (such as quantum physics or philosophy) compare with more broadly covered subjects (such as sports or entertainment)?
7. What role do the less prominent, rarely visited pages play in the overall Wikipedia network?
8. When many readers read about the same topics, do users switch between the language versions of Wikipedia in predictable ways?
9. What knowledge is most readily available through the accumulated pool of external links in Wikipedia, and how do external sources, primarily search engines, direct users to the site?
10. How deep does a user's session on Wikipedia go? How often does a user start at a highly visited topic and drill down into less visited pages?
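Questions 4 and 9 can be approximated directly from the clickstream rows. The sketch below is a rough illustration under stated assumptions: each row is a `(prev, curr, type, n)` tuple, a page's arrivals are approximated by the sum of `n` over rows where it is `curr`, and the sample rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (prev, curr, type, n)
ROWS = [
    ("other-search", "Physics", "external", 900),
    ("other-empty", "Physics", "external", 100),
    ("Physics", "Energy", "link", 150),
    ("Physics", "Quantum_mechanics", "link", 50),
    ("other-search", "Energy", "external", 50),
]


def continuation_rate(rows, page):
    """Clicks leaving `page` divided by clicks arriving at it (an approximation)."""
    arrivals = sum(n for _, curr, _, n in rows if curr == page)
    departures = sum(n for prev, _, _, n in rows if prev == page)
    return departures / arrivals if arrivals else 0.0


def source_shares(rows, page):
    """Share of arrivals at `page` per external source, with articles pooled as 'internal'."""
    totals = defaultdict(int)
    for prev, curr, link_type, n in rows:
        if curr == page:
            key = prev if link_type == "external" else "internal"
            totals[key] += n
    grand = sum(totals.values())
    return {key: value / grand for key, value in totals.items()} if grand else {}


print(continuation_rate(ROWS, "Physics"))
print(source_shares(ROWS, "Energy"))
```

Because the data is aggregated, these figures describe traffic between page pairs and not individual sessions, so they should be read as estimates only.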