Conclusion:
1. Key Findings:
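Most Connected Node: "other-empty" and "other-search" carry the highest traffic, but they are referrer categories (visits with no referrer and visits arriving from search engines) rather than articles, so they mark entry points into Wikipedia rather than navigation hubs within it.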
Important Nodes: "Margaret Qualley" emerged as the most influential page by quality of incoming links (PageRank).
Key Connector Pages: The "2024 in film" page acts as a critical bridge between topics (Betweenness Centrality).
Sparse Connections: The low clustering coefficient (0.121) indicates that the neighbours of a page are seldom linked to one another, so the network forms few tightly interconnected groups.
In addition to these centrality measures, the word cloud of current pages highlights the prominence of certain topics, particularly films, TV series, and actors. The frequency of terms like "film," "actor," "TV series," and "Main Page" reflects the strong focus on entertainment-related content within the dataset.
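The PageRank score mentioned in the key findings can be illustrated with a minimal power-iteration sketch. It assumes the clickstream is available as (source, target, clicks) tuples; the function name, the damping factor of 0.85, and the click counts below are illustrative assumptions, not the values or the library used in the project.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration over (source, target, clicks) tuples."""
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    out_weight = {n: 0.0 for n in nodes}
    for s, _, w in edges:
        out_weight[s] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by pages with no outgoing clicks is spread evenly over all pages.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for s, t, w in edges:
            # Each page passes its rank along its outgoing edges, in proportion to clicks.
            new_rank[t] += damping * rank[s] * w / out_weight[s]
        rank = new_rank
    return rank

# Hypothetical click counts, for illustration only.
toy_edges = [
    ("Main_Page", "2024_in_film", 10),
    ("2024_in_film", "Margaret_Qualley", 30),
    ("2024_in_film", "List_of_American_films_of_2024", 20),
    ("List_of_American_films_of_2024", "Margaret_Qualley", 5),
]
ranks = pagerank(toy_edges)
top_page = max(ranks, key=ranks.get)
```

In this toy graph the page that receives clicks from two well-ranked pages ends up with the highest score, which is the sense in which PageRank rewards the quality of incoming links rather than their number.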
2. Visualization Insights:
- The network visualization reinforces the centrality analysis. Large nodes, such as `2024_in_film`, `other-empty`, and `Donald_Trump`, dominate the center due to their high connectivity and influence, forming a densely connected hub.
- Peripheral nodes are less connected, reflecting their minor influence on the network structure.
- The sparse interconnections observed in the visualisation align with the low clustering coefficient, pointing to a radial and hierarchical network structure.
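For reference, an average clustering coefficient such as the 0.121 reported above can be computed along the following lines. This is a sketch that assumes an undirected, unweighted view of the graph; the function name and the toy edges are hypothetical, and the project may have used a library routine instead.

```python
from itertools import combinations

def average_clustering(edges):
    """Average local clustering coefficient of an undirected, unweighted graph."""
    neighbours = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    if not neighbours:
        return 0.0
    total = 0.0
    for nbrs in neighbours.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbours contribute 0
        # Count pairs of neighbours that are themselves connected.
        linked = sum(1 for u, v in combinations(nbrs, 2) if v in neighbours[u])
        total += 2 * linked / (k * (k - 1))
    return total / len(neighbours)

# Hypothetical toy graph: one triangle (A, B, C) plus a pendant node D.
toy_edges = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")]
score = average_clustering(toy_edges)
```

A hub-and-spoke graph, where many pages attach to a hub but not to each other, drives this value towards zero, which is consistent with the radial structure seen in the plot.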
3. Significance of Findings:
- These insights provide a deeper understanding of the dataset's structure, helping identify key entities and their relationships within the network.
- The analysis can be applied to various real-world scenarios, such as improving navigation on websites, enhancing recommendation systems, and optimizing the structure of online platforms.
4. Future Improvements:
- Further analysis could incorporate dynamic metrics to explore changes in the network over time.
- Introducing additional datasets or attributes could provide more nuanced insights into the interactions within the network.
5. Potential Use Cases:
- Optimising website structure to improve user experience and engagement.
- Developing more efficient search algorithms using PageRank-style ranking.
- Identifying influential nodes in social networks for targeted marketing or information dissemination.
This comprehensive analysis, coupled with visualization, underscores the utility of network analysis in uncovering key patterns and insights, ultimately contributing to the achievement of the project objectives.
RESULTS
WordCloud
Words such as “film,” “2024,” and “Margaret Qualley,” along with keywords such as “presidential election,” are highlighted, showing that these are central themes in the data. This could indicate high activity around these films or individuals.
Smaller Words: These represent less frequent but still significant topics, potentially niche or emerging issues.
Visual Patterns: Words that appear close together in the word cloud may suggest related themes, such as films, celebrities, and political events, pointing to key areas of interest within the network.
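The word sizes in a word cloud come from term frequencies. A minimal sketch of that counting step is shown here, assuming the input is a list of page titles with underscores between words; the stopword list and function name are illustrative assumptions, and the counts are not weighted by clicks.

```python
import re
from collections import Counter

def title_word_counts(titles, stopwords=frozenset({"in", "of", "the", "a", "and"})):
    """Count lower-cased words across page titles, skipping common stopwords."""
    counts = Counter()
    for title in titles:
        # Split on underscores, punctuation, and whitespace.
        words = re.findall(r"[^\W_]+", title.lower())
        counts.update(w for w in words if w not in stopwords)
    return counts

titles = [
    "2024_in_film",
    "List_of_American_films_of_2024",
    "Margaret_Qualley",
    "Main_Page",
]
counts = title_word_counts(titles)
most_common_word, frequency = counts.most_common(1)[0]
```

Note that "film" and "films" are counted separately in this sketch; merging such variants would require an extra normalisation step.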
As we can see, the trending pages are Main Page, United States, UK, YouTube, India, World War II, China, Donald Trump, and New York City.
'other-empty' appears as the most significant node, with the longest bar, indicating it has the highest degree centrality in the dataset.
Other notable nodes include 'other-search' and 'other-internal', which are also important, though with significantly lower centrality.
Key Insights:
The plot shows which pages are highly connected, with 'other-empty' being the most central node in terms of direct connections.
Pages like 'Main_Page', 'List_of_American_films_of_2024', and 'Joker:_Folie_à_Deux' also appear, though with much lower scores.
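The scores behind a bar chart of this kind can be sketched as follows. The example assumes degree centrality is taken as the number of distinct neighbours divided by (n - 1), ignoring edge direction and click counts; the function name and the toy edges are hypothetical.

```python
def degree_centrality(edges):
    """Degree centrality: number of distinct neighbours divided by (n - 1)."""
    neighbours = {}
    for source, target in edges:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)
    n = len(neighbours)
    return {node: len(nbrs) / max(n - 1, 1) for node, nbrs in neighbours.items()}

# Hypothetical edges, for illustration only.
toy_edges = [
    ("other-empty", "Main_Page"),
    ("other-empty", "2024_in_film"),
    ("other-empty", "List_of_American_films_of_2024"),
    ("other-search", "2024_in_film"),
]
centrality = degree_centrality(toy_edges)
most_connected = max(centrality, key=centrality.get)
```

Because referrer categories such as 'other-empty' connect to almost every article, they will dominate a ranking like this unless they are filtered out first.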
Network analysis graph (node-to-node connections)
High Connectivity: Pages within the blue clusters are highly interconnected, suggesting they share similar themes, topics, or importance in the network.
Page Grouping: The clusters could show how certain pages are related to each other, such as those within specific topics (e.g., movies, political events) or having shared characteristics like frequent referencing or similar categories.
Central Nodes: Within the blue cluster, some nodes will likely be more central, meaning they play a more significant role in connecting other nodes in the network.
Clusters Reflecting Themes: The graph could represent groups of pages related to similar subjects or categories, like movies, events, or people. For example, all pages related to 2024 films or US elections may form one cluster.
PieChart
According to the pie chart, links account for the largest share of interactions in the dataset, at 62.6%. This implies that visitors are more likely to move between pages inside Wikipedia than to interact with other websites. External connections make up a significant share of the dataset after internal links, which underlines the role of outside sources in the traffic around Wikipedia articles.
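The shares shown in a pie chart like this one can be derived with a short aggregation. The sketch assumes rows in the four-column layout of the public clickstream dumps (prev, curr, type, n) and weights each row by its click count; whether the project weighted by clicks or by row count is not stated, and the numbers below are hypothetical.

```python
from collections import defaultdict

def type_shares(rows):
    """Percentage of total clicks per link type from (prev, curr, type, n) rows."""
    totals = defaultdict(int)
    for _prev, _curr, link_type, clicks in rows:
        totals[link_type] += clicks
    grand_total = sum(totals.values())
    return {t: 100 * c / grand_total for t, c in totals.items()}

# Hypothetical rows, for illustration only.
rows = [
    ("2024_in_film", "Margaret_Qualley", "link", 40),
    ("Main_Page", "2024_in_film", "link", 20),
    ("other-search", "2024_in_film", "external", 30),
    ("Main_Page", "Donald_Trump", "other", 10),
]
shares = type_shares(rows)
```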
Model Comparison:
- Random Forest is the best-performing model, with an accuracy of 99.28%. It also achieves near-perfect precision, recall, and F1 score, making it the most reliable for this classification task.
- XGBoost comes second, with values very close to those of Random Forest; the slight drop in performance makes it the second choice.
- The SGD classifier also performs very well but lags slightly behind Random Forest and XGBoost in recall, so it may miss some of the true positives.
- The Neural Network is by far the worst performer, with significantly lower accuracy, F1 score, and recall. It may require further tuning or a different approach; in this comparison it was outperformed by all the other models.
- Overall, Random Forest is the recommended model, given its accuracy and its balance between precision and recall. XGBoost is a good alternative, particularly for experimenting with gradient boosting, which may yield better results on other tasks.
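For clarity, the four metrics used in this comparison can be computed from true and predicted labels as sketched here. The function name and the toy labels are hypothetical and a binary task with positive class 1 is assumed; the project's own figures were presumably produced with a library routine.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: one false negative and one false positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

A model with lower recall, as noted for the SGD classifier, has more false negatives (the `fn` count), which is why it may miss true positives even when its accuracy is high.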
Answers to questions of milestone 1 (10 questions)
1. Which Wikipedia pages are linked to the most by other pages?
- This can be identified from degree centrality in a network analysis, specifically in-degree, which counts incoming links. Pages with the highest in-degree are the ones most frequently linked to by other pages in Wikipedia. These pages serve as major hubs in the network of knowledge.
- Example Insight: Pages like “Main Page” or “2024 United States Presidential Election” might be highly linked, as they cover broad topics with widespread relevance.
2. How many readers are shifting from broad topics to specific topics?
- This behaviour can be observed by tracking clickstream data where users start on broad pages (e.g., “Sports”) and navigate to more specific ones (e.g., “2024 Summer Olympics”).
- Example Insight: A user starting on the general “Films” page and clicking through to specific film articles like “Joker: Folie à Deux” indicates a shift from broad to specific.
3. What pattern of behaviour are users following when navigating between sites related to the same topics?
- User behaviour can be analyzed by path analysis. This reveals how users move between pages within specific topics, like transitioning from an article about a country to articles about its major cities or history.
- Example Insight: Users may often follow a predictable pattern, like moving from a general event overview (e.g., “2024 in film”) to specific films or actors within that category.
4. How many outbound links does the page have, and how likely is it that users would follow those links to continue their navigation?
- This can be measured by examining the outbound links on a page and the likelihood of those links being clicked. High PageRank and betweenness centrality nodes may have more attractive outbound links that guide user navigation.
- Example Insight: A page about “Quantum Physics” may have fewer outbound links, but highly specific and trusted links to specialized content, which users are likely to follow.
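One way to put numbers on this question is to count the observed outbound targets of each page and the share of outgoing clicks each target receives. The sketch below assumes (source, target, clicks) tuples with hypothetical counts. The shares are relative to observed outgoing clicks only, so they say nothing about readers who left the page without clicking, and links that were never followed do not appear at all.

```python
from collections import defaultdict

def outbound_profile(edges):
    """Map each source page to {target: share of that page's outgoing clicks}."""
    clicks = defaultdict(dict)
    for source, target, n in edges:
        clicks[source][target] = clicks[source].get(target, 0) + n
    profile = {}
    for source, targets in clicks.items():
        total = sum(targets.values())
        profile[source] = {t: n / total for t, n in targets.items()}
    return profile

# Hypothetical click counts, for illustration only.
toy_edges = [
    ("2024_in_film", "List_of_American_films_of_2024", 60),
    ("2024_in_film", "Margaret_Qualley", 40),
    ("Main_Page", "2024_in_film", 25),
]
profile = outbound_profile(toy_edges)
outbound_count = len(profile["2024_in_film"])  # number of observed outbound targets
```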
5. Are subject communities - understood here as clusters of articles that are routinely read together - observable in Wikipedia's clickstream data?
- Cluster analysis of the clickstream data will reveal subject communities, where pages on related topics are grouped together based on how often they are read in succession. For example, a cluster might be formed by all pages about a specific actor or film series.
- Example Insight: A cluster of articles around “Hollywood” or “Oscar-winning films” might be frequently navigated together.
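As a rough illustration of how such communities might be extracted, the sketch below keeps only edges with at least a minimum number of clicks and returns the connected components of what remains. This is a deliberately simple stand-in for a proper community-detection method (the report does not name one); the threshold, function name, and click counts are assumptions.

```python
def communities(edges, min_clicks=10):
    """Connected components of the undirected graph of edges with >= min_clicks."""
    neighbours = {}
    for a, b, n in edges:
        if n < min_clicks:
            continue  # drop weak connections before grouping
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    seen = set()
    groups = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        # Depth-first traversal collecting every page reachable from start.
        stack, group = [start], set()
        seen.add(start)
        while stack:
            node = stack.pop()
            group.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(group)
    return groups

# Hypothetical click counts, for illustration only.
toy_edges = [
    ("2024_in_film", "List_of_American_films_of_2024", 50),
    ("2024_in_film", "Margaret_Qualley", 40),
    ("Donald_Trump", "2024_United_States_presidential_election", 80),
    ("Margaret_Qualley", "Donald_Trump", 3),  # weak link, below the threshold
]
groups = communities(toy_edges)
```

With the weak cross-topic edge filtered out, the film pages and the election pages fall into separate groups.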
6. How does the clickstream network compare specialist or scholarly topics - quantum physics, philosophy - to those subjects that are more broadly covered - sports or entertainment?
- The network structure for specialist topics like “quantum physics” or “philosophy” tends to have lower connectivity and fewer links between pages, while broad topics like “sports” and “entertainment” often show dense interconnections with high degree centrality and frequent user navigation between articles.
- Example Insight: Specialist topics tend to have fewer cross-links, whereas entertainment topics often have a more interconnected web of pages, reflecting the broad interest in these subjects.
7. What is the function of the non-prominent, not highly visited sites in the whole network of Wikipedia?
- Peripheral pages may still serve an important function in connecting less popular topics to more widely read pages. These pages help maintain the integrity of the network, providing connections that ensure the flow of information across diverse topics.
- Example Insight: Less popular pages might link to niche topics or articles about specific sub-genres of entertainment or obscure scientific concepts, serving as bridges in the network.
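Whether a low-traffic page really acts as a bridge can be checked with betweenness centrality, which counts the shortest paths that pass through a node. The sketch below implements Brandes' algorithm for an undirected, unweighted graph and returns unnormalised scores; the page "Niche_page" and the edges are hypothetical.

```python
from collections import deque

def betweenness(edges):
    """Unnormalised betweenness centrality (Brandes), undirected and unweighted."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths to each node (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies, starting from the nodes farthest from s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected path was counted once from each endpoint.
    return {v: c / 2 for v, c in score.items()}

# Hypothetical chain in which "Niche_page" joins a film group to a politics group.
toy_edges = [
    ("List_of_American_films_of_2024", "2024_in_film"),
    ("2024_in_film", "Margaret_Qualley"),
    ("Margaret_Qualley", "Niche_page"),
    ("Niche_page", "Donald_Trump"),
    ("Donald_Trump", "2024_United_States_presidential_election"),
]
scores = betweenness(toy_edges)
```

In this toy chain "Niche_page" has only two neighbours, yet every path between the two groups runs through it, so its betweenness is among the highest in the graph.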
8. When many readers are reading about the same topics, is predictable switching by users taking place between the language versions of Wikipedia?
- Switching patterns between different language versions can be tracked by looking at users’ transitions from one language version of an article to another. This often happens when users want to explore a topic in more detail or from a different cultural perspective.
- Example Insight: Users reading about global events, such as the “2024 U.S. presidential election,” may switch between the English, Spanish, and French versions of the article based on their language preferences.
9. What knowledge is most readily available through the accumulated pool of external links in Wikipedia, and how do other search sources—primarily search engines—direct users to access the site?
- External links provide pathways to reliable sources and detailed information. Search engines direct users to specific articles, often based on popular topics or high-quality content.
- Example Insight: External links from news outlets, journals, or academic sources can drive users to pages like “Artificial Intelligence” or “2024 Elections,” showing how knowledge is disseminated outside Wikipedia.
10. How deep does a consumer’s session with Wikipedia go? How often does a user start at a topic and drill down from highly visited pages into less-visited ones?
- Session depth can be analysed by tracking how far users dig into Wikipedia after initially visiting a topic. Users might start with general pages and gradually explore more specialized content, sometimes digging deep into less-visited articles.
- Example Insight: A user reading about “Philosophy” may drill down into articles about “Philosophy of Science” or specific philosophers, often visiting pages that are less popular but highly specialised.
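Session depth could be measured as sketched below, provided per-session page sequences are available. That is an assumption: aggregated referrer-target counts alone do not identify individual sessions, so the session lists and function name here are hypothetical.

```python
from statistics import mean

def session_depths(sessions):
    """Depth of each session: number of page-to-page clicks after the landing page."""
    return [max(len(pages) - 1, 0) for pages in sessions]

# Hypothetical sessions, each an ordered list of visited pages.
sessions = [
    ["Philosophy", "Philosophy_of_science", "Karl_Popper"],
    ["Main_Page"],
    ["2024_in_film", "Margaret_Qualley"],
]
depths = session_depths(sessions)
average_depth = mean(depths)
deepest = max(depths)
```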