Measuring the Gender Gap: Attribute-based Class Completeness Estimation
Led by Gianluca Demartini and Lei Han, The University of Queensland, Australia
Implications of ChatGPT for knowledge integrity on Wikipedia
Led by Assoc Prof Heather Ford, University of Technology Sydney (UTS) and colleagues
Wiki Histories Project, UTS
Led by Assoc Prof Heather Ford, University of Technology Sydney (UTS) and colleagues
Research: Analyzing sources on Wikipedia
Led by Isaac Johnson, Wikimedia Foundation
Research:Language-Agnostic Topic Classification
Led by Isaac Johnson, Wikimedia Foundation
Research:Machine Learning Assisted Wikipedia Editing
Sebastian Riedel, Facebook AI Research and UCL, January 2022 – December 2023
Our core hypotheses are that a) for language models to be useful in the editing process, humans will need fine-grained control over the behavior of these models, and b) language models will need to be able to retrieve relevant information from the web (that is, we need retrieval-augmented models). We are currently working on several work streams that test and develop these hypotheses.
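As a toy illustration of hypothesis b) (not the project's actual system), the Python sketch below shows the retrieval-augmented pattern: score a small corpus against the query, retrieve the top passages, and prepend them to the prompt a model sees. The corpus, the bag-of-words scorer, and the generate() stub are all invented for illustration.

```python
# Toy retrieval-augmented generation: hypothetical corpus and generate() stub.
from collections import Counter

CORPUS = [
    "Wikipedia is a free online encyclopedia maintained by volunteers.",
    "Retrieval-augmented models fetch supporting text before generating.",
    "Citations on Wikipedia point to external reliable sources.",
]

def score(query: str, passage: str) -> int:
    """Bag-of-words overlap; real systems use dense or sparse retrievers."""
    q, p = Counter(query.lower().split()), Counter(passage.lower().split())
    return sum((q & p).values())

def retrieve(query: str, k: int = 2) -> list:
    """Return the k passages that best match the query."""
    return sorted(CORPUS, key=lambda p: score(query, p), reverse=True)[:k]

def generate(prompt: str) -> str:
    """Stand-in for a language model call."""
    return f"[model output conditioned on {len(prompt)} prompt characters]"

query = "How do retrieval-augmented models work?"
context = "\n".join(retrieve(query))
print(generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"))
```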
Project on article quality in multiple language Wikipedias based on large-scale reference analysis - see BestRef for an interactive dataset
2022, Wenceslao Arroyo-Machado, Daniel Torres-Salinas, Rodrigo Costas
WikiProject source reliability, aka the Credibility Ratings + Assessments Project, is an effort to identify and aggregate online sources of assessments of the reliability and credibility of sources. These assessments include estimates of bias, verifiability of claims, level of editorial oversight or peer review, and expertise in specific topics. Assessments may be of individual documents, or aggregate assessments of authors or domains; a sketch of what one aggregated record could look like follows the list below.
Types of assessments include:
Compilations of self-assessments. Examples include the TRANSPOSE project for collating self-reported peer-review practices of journals.
Evaluations by groups created specifically to evaluate source reliability. Examples include Media Bias Fact Check, other fact checking sites, and sites like Ad Fontes Media that produce visuals referenced by others.
Evaluations by communities of practice, as a by-product of their work reviewing or sourcing information to others. Examples include Perennial Sources lists on various language Wikipedias, topical Reliable Sources lists from individual WikiProjects, and newsrooms that publish their internal measures of source reliability.
Compilations of secondary assessments, including the above. Examples include Iffy.news.
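To make the aggregation idea concrete, here is one hypothetical shape such an assessment record could take in Python; the field names and example values are illustrative, not a schema used by the project.

```python
# Hypothetical record for an aggregated source assessment (illustrative only).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SourceAssessment:
    source: str                # a domain, author, or individual document
    assessor: str              # e.g. "Media Bias/Fact Check", "enwiki Perennial Sources"
    assessment_type: str       # "self-assessment", "dedicated evaluator", ...
    reliability: str           # e.g. "generally reliable", "deprecated"
    bias: Optional[str] = None # e.g. "left-center"; not all assessors rate bias
    topics: List[str] = field(default_factory=list)  # topic-specific expertise

example = SourceAssessment(
    source="example-news.org",
    assessor="Media Bias/Fact Check",
    assessment_type="dedicated evaluator",
    reliability="mostly factual",
    bias="left-center",
)
print(example)
```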
Credibility Coalition is a research community that fosters collaborative approaches to understanding the veracity, quality and credibility of online information. We incubate activities and initiatives that bring together people and institutions from a variety of backgrounds. *Doesn't seem to have been active since 2019.
The Iffy Index of Unreliable Sources compiles credibility ratings by Media Bias/Fact Check. Mainly media outlets and websites. Last updated January 2023.
"The goal of Abstract Wikipedia is to let more people share more knowledge in more languages. Abstract Wikipedia is a conceptual extension of Wikidata.[1] In Abstract Wikipedia, people can create and maintain Wikipedia articles in a language-independent way. A particular language Wikipedia can translate this language-independent article into its language. Code does the translation. Wikifunctions is a new Wikimedia project that allows anyone to create and maintain code. This is useful in many different ways. It provides a catalog of all kinds of functions that anyone can call, write, maintain, and use. It also provides code that translates the language-independent article from Abstract Wikipedia into the language of a Wikipedia. This allows everyone to read the article in their language. Wikifunctions will use knowledge about words and entities from Wikidata. This will get us closer to a world where everyone can share in the sum of all knowledge."
In 2023, The Wellcome Trust awarded funds to build the Open Global Data Citation Corpus to dramatically transform the data citation landscape. Through this award, DataCite has partnered with Chan Zuckerberg Initiative, EMBL-EBI, and other organizations that scrape and assert data citations.
Wikimedia Research https://research.wikimedia.org/
Pages on Wikipedia and Meta about Wikimedia Research
Research Projects: The canonical directory of Wikimedia research projects that are planned, underway or have recently been completed.
Research Index: A list of current research projects and research resources
Zotero Library - the reference library for this project. PDFs aren't hosted, but most references should have links. Email me if you can't find an article or report.
Other lists and resources of interest
Wiki Research Bibliography - a bibliography of research publications.
Wikipedia in academic studies (en) on Wikipedia
Wikipedia research and tools: Review and comments, review by Finn Årup Nielsen[1]
WikiPapers, a wiki research literature compilation (conference papers, journal articles, theses, datasets and tools)
Works about Wikimedia projects known to Wikidata
Mapping Wikipedia (en) - various maps using geocoded Wikipedia pages by floatingsheep
*New From hell to HTML: releasing a Python package to easily work with Wikimedia HTML dumps - Feb 2023
Announcing mwparserfromhtml, a new library that makes it easy to parse the HTML content of Wikipedia articles
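For orientation, Enterprise HTML dumps ship as tar.gz archives of newline-delimited JSON with one article object per line; mwparserfromhtml wraps this iteration and adds HTML parsing on top. The standard-library sketch below streams such a file; the filename and the field names ("name", "article_body.html") are assumptions to verify against the dump documentation.

```python
# Minimal sketch: stream articles out of a Wikimedia Enterprise HTML dump
# without loading it into memory. Field names ("name", "article_body.html")
# are assumptions; check the dump documentation before relying on them.
import json
import tarfile

DUMP_PATH = "simplewiki-NS0-ENTERPRISE-HTML.json.tar.gz"  # example filename

with tarfile.open(DUMP_PATH, mode="r:gz") as tar:
    for member in tar:
        fh = tar.extractfile(member)
        if fh is None:           # skip directories and other non-file entries
            continue
        for line in fh:          # one JSON object per article
            article = json.loads(line)
            title = article.get("name")
            html = article.get("article_body", {}).get("html", "")
            print(title, len(html))
            break                # demo: stop after the first article
        break
```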
Beyond normal reading of pages, access to Wikimedia content by reusers is currently achieved through:
Scraping of web pages
Wikimedia data dumps: Dumps are produced monthly for a specific set of namespaces and wikis, and then made available for public download (a streaming-parse sketch appears after this list).
Wikimedia Enterprise: Enterprise-grade APIs built for search, voice assistants, AI, and more. The Wikimedia Enterprise API is a service introduced in 2022 for high-volume, for-profit reusers of Wikimedia projects, who can use it at scale and are charged for the service.
Wikimedia Enterprise HTML Dumps: This partial mirror of Wikimedia Enterprise HTML dumps is an experimental service.
API Portal - currently available as a proof of concept (an example call appears after this list).
API gateway is in its alpha release.
Analytics Datasets: Clickstream: a clickstream, generated monthly, for Wikipedia in English, Russian, German, Spanish, and Japanese (a loading sketch follows below).
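First, as referenced above, a sketch of streaming a compressed pages-articles XML dump with only the Python standard library; the filename is an example, and the page/title/revision/text element layout should be verified against your dump's export schema version.

```python
# Stream pages out of a compressed pages-articles XML dump without loading
# the whole file into memory. Filename is an example; adjust to your dump.
import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-latest-pages-articles.xml.bz2"

with bz2.open(DUMP, "rb") as fh:
    for _, elem in ET.iterparse(fh, events=("end",)):
        if elem.tag.endswith("}page"):           # tags carry an XML namespace
            title = elem.find("./{*}title").text
            text_el = elem.find("./{*}revision/{*}text")
            wikitext = (text_el.text or "") if text_el is not None else ""
            print(title, len(wikitext))
            elem.clear()                          # free memory as we go
```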
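Second, an example call through the API Portal: the Core REST API path and returned field names below follow the portal's documentation as best understood, so verify them before relying on this, and put your own contact details in the User-Agent.

```python
# Fetch page metadata from the Wikimedia API Portal's Core REST API.
# Verify the endpoint path against the portal docs; set a real contact
# address in the User-Agent, per Wikimedia's API etiquette.
import requests

url = "https://api.wikimedia.org/core/v1/wikipedia/en/page/Earth/bare"
headers = {"User-Agent": "research-notes-demo/0.1 (you@example.org)"}

resp = requests.get(url, headers=headers, timeout=30)
resp.raise_for_status()
page = resp.json()
print(page["title"], page["latest"]["id"])  # field names per the Core REST API
```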
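Third, the clickstream files are headerless TSVs with four columns (prev, curr, type, n); a minimal pandas load, assuming you have downloaded a monthly file:

```python
# Load a monthly Wikipedia clickstream file: TSV with columns
# prev, curr, type ("link", "external", "other"), and n (count).
import pandas as pd

df = pd.read_csv(
    "clickstream-enwiki-2023-01.tsv.gz",  # example filename
    sep="\t",
    names=["prev", "curr", "type", "n"],
    quoting=3,                             # csv.QUOTE_NONE: titles may contain quotes
)
top = df.groupby("curr")["n"].sum().nlargest(10)  # ten most-visited targets
print(top)
```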
BestRef (Lewoniewski, 2020-2021):
Shows popularity and reliability scores for sources cited in references of Wikipedia articles in different languages. Data extraction is based on a complex method using Wikimedia dumps from July 2020. To find the most popular and reliable sources, we used information about over 200 million references in Wikipedia articles. More details are in the paper "Modeling Popularity and Reliability of Sources in Multilingual Wikipedia". Values for the PR-score and AR-score were additionally multiplied by 100 (to distinguish smaller values in the ranking).
Wikipedia Knowledge Graph dataset, 2022, Arroyo-Machado, Wenceslao; Torres-Salinas, Daniel; Costas, Rodrigo, Zenodo
To reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, we collected data from various sources, processed them, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, here limited to the English edition. We share this Knowledge Graph dataset openly, aiming for it to be useful to a wide range of researchers, such as informetricians, sociologists, or data scientists.
Wikinfometrics: informetric analysis of the English Wikipedia
This R Shiny app provides interactive visualizations of the top Wikipedia articles by indicator to better understand the analytical dimension of these metrics.
Wikitech: Wikitech is the home of technical documentation for Wikimedia Foundation infrastructure and services. This includes production clusters, Wikimedia Cloud Services, Toolforge hosting, and the Beta Cluster.
Data services: includes services that allow for direct access to databases and dumps, as well as web interfaces for querying and programmatic access to data stores. Data services currently include: Wiki Replicas, ToolsDB, Wikilabels Postgres, Wikimedia Dumps, Shared Storage, CirrusSearch Elasticsearch replicas, Quarry and PAWS.
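As an example of the database side, a tool running on Toolforge or PAWS can query the Wiki Replicas over MySQL. The hostname and credentials-file conventions below follow the Wiki Replicas documentation as best understood; verify them (and the replica.my.cnf path) before use.

```python
# Query the enwiki Wiki Replica from a Toolforge tool or PAWS notebook.
# Hostname and credential conventions are assumptions to check against
# the current Wiki Replicas documentation.
import pymysql

conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    database="enwiki_p",
    read_default_file="~/replica.my.cnf",  # credentials provisioned per tool
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT page_title FROM page "
        "WHERE page_namespace = 0 ORDER BY page_touched DESC LIMIT 5"
    )
    for (title,) in cur.fetchall():
        print(title.decode("utf-8"))  # page_title is stored as binary
conn.close()
```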
List of Wiki software - software that uses the Wiki format
Wikidata tools - See also the list of Wikidata tagged tools on Toolhub.
Pywikibot is a Python library and collection of scripts that automate work on MediaWiki sites. Originally designed for Wikipedia, it is now used throughout the Wikimedia Foundation's projects and on many other wikis.
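A minimal Pywikibot example, assuming Pywikibot is installed and configured with a user-config.py (reading a page requires no login):

```python
# Read a page's wikitext and list a few of its links with Pywikibot.
import pywikibot

site = pywikibot.Site("en", "wikipedia")
page = pywikibot.Page(site, "Wikipedia:Reliable sources")
print(page.title(), "-", len(page.text), "characters of wikitext")
for linked in list(page.linkedPages())[:5]:  # first few pages it links to
    print("links to:", linked.title())
```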
NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks (NetworkX, 2022).
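For instance, a handful of article-to-article links can be loaded into a directed graph and ranked with PageRank; the edges below are made up for illustration.

```python
# Build a small directed link graph and rank nodes with PageRank.
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("Earth", "Planet"), ("Earth", "Moon"),
    ("Moon", "Earth"), ("Planet", "Solar System"),
])
for node, score in sorted(nx.pagerank(G).items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```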