Artificial Intelligence

Background

Artificial intelligence (AI) refers to computer systems designed to perform tasks that typically require human intelligence, such as recognizing images, understanding language, making decisions, and generating text or other content. Although the field dates to the 1950s, AI has experienced rapid growth in recent years, driven by advances in machine learning — particularly deep learning — and the availability of large-scale datasets and computing power (LeCun et al., 2015). Today, AI systems are embedded in a wide range of technologies, from medical diagnosis and language translation to recommendation systems and autonomous vehicles.

Research on AI spans many disciplines, including computer science, cognitive science, economics, sociology, law, and ethics. Researchers study not only how AI systems work technically, but also how they are adopted across different industries, how they affect employment, whether they replicate or amplify human biases, and what their long-term societal implications may be. The rapid pace of AI development has made open data particularly important: shared benchmarks, datasets, and research outputs allow independent researchers to evaluate AI systems and hold developers accountable (Jobin et al., 2019; Floridi et al., 2020).

References:

Floridi, L., Cowls, J., King, T. C., & Taddeo, M. (2020). How to design AI for social good: Seven essential factors. Science and Engineering Ethics, 26(3), 1771–1796. https://doi.org/10.1007/s11948-020-00213-5

Jobin, A., Ienca, M., & Vayena, E. (2019). The global landscape of AI ethics guidelines. Nature Machine Intelligence, 1(9), 389–399. https://doi.org/10.1038/s42256-019-0088-2

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539

Data Sources

AI Research and Publication Data

1. Papers With Code https://paperswithcode.com/

Tracks machine learning research papers alongside their code and benchmark results
Includes a large collection of publicly available datasets used in AI research
Organized by task area (image recognition, natural language processing, robotics, etc.)
Benchmark leaderboards show how AI models compare over time
Good for: Understanding progress in AI capabilities; finding datasets used in academic research

2. Semantic Scholar https://www.semanticscholar.org/

Free academic search engine covering AI and computer science literature
Open Research Corpus provides bulk download of tens of millions of papers with metadata
Good for: Bibliometric research on AI publishing trends, citations, and collaboration patterns

3. arXiv (cs.AI and cs.LG sections) https://arxiv.org/list/cs.AI/recent https://arxiv.org/list/cs.LG/recent

Open-access preprint server widely used in AI and machine learning research
Much AI research is posted here before or alongside journal or conference publication
Metadata (titles, abstracts, authors, dates) available for bulk download via API
Good for: Tracking research trends, identifying active research areas, citation analysis
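
As a minimal sketch of how arXiv metadata can be retrieved, the Python example below builds a query URL for the public arXiv API and parses the Atom feed it returns. The helper names (build_query_url, parse_feed) and the sample feed are illustrative assumptions, not part of arXiv itself. The sample feed is hand-written so that the example runs without a network connection.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # XML namespace used by Atom feeds


def build_query_url(category, start=0, max_results=10):
    """Build an arXiv API URL for the newest papers in one category."""
    params = {
        "search_query": f"cat:{category}",
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"


def parse_feed(atom_xml):
    """Extract title, submission date, and authors from an Atom feed."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.findall(f"{ATOM}entry"):
        raw_title = entry.findtext(f"{ATOM}title", default="")
        papers.append({
            # Titles can wrap across lines; collapse the whitespace.
            "title": " ".join(raw_title.split()),
            "published": entry.findtext(f"{ATOM}published", default=""),
            "authors": [a.findtext(f"{ATOM}name", default="")
                        for a in entry.findall(f"{ATOM}author")],
        })
    return papers


# Hand-written sample feed (NOT real arXiv output) so the sketch runs offline.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An Illustrative
      Paper Title</title>
    <published>2024-01-15T00:00:00Z</published>
    <author><name>A. Researcher</name></author>
    <author><name>B. Student</name></author>
  </entry>
</feed>"""

print(build_query_url("cs.AI", max_results=5))
for paper in parse_feed(SAMPLE_FEED):
    print(paper["published"][:4], paper["title"], paper["authors"])
```

To work with live data, the URL could be fetched with urllib.request.urlopen and the response body passed to parse_feed. For large harvests, check arXiv's usage guidelines first and leave a pause between requests.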

AI Use, Adoption, and Industry Trends

4. Stanford HAI – AI Index Report https://aiindex.stanford.edu/report/

Annual report produced by the Stanford Institute for Human-Centered Artificial Intelligence (HAI)
Covers AI research output, investment, adoption, policy, and ethical incidents worldwide
Accompanying datasets are available for download
Good for: Country-level comparisons of AI activity; trends in AI investment and deployment

5. Our World in Data – Artificial Intelligence https://ourworldindata.org/artificial-intelligence

Data visualizations and downloadable datasets on AI capabilities, research output, and societal impacts
Topics include AI performance benchmarks, compute trends, and adoption rates
Good for: Historical trend analysis; understanding growth in AI capabilities over time

6. OECD.AI Policy Observatory https://oecd.ai/en/data

Data on AI policies, strategies, and national investments across OECD member countries
Tracks government AI strategies and the adoption of AI principles
Good for: Comparing national AI policy approaches and public sector investment

AI and Society: Employment, Bias, and Ethics

7. AI Incident Database https://incidentdatabase.ai/

Searchable database of real-world incidents involving AI systems causing harm
Each incident is documented with sources, descriptions, and affected groups
Covers bias, privacy violations, safety failures, and more
Good for: Research on AI risks, AI governance, and ethical implications of AI deployment

8. Gender Shades / Algorithmic Bias Datasets https://www.media.mit.edu/projects/gender-shades/overview/

Research project examining bias in commercial facial recognition systems
Accompanying paper and findings freely available
Related datasets on algorithmic bias are available through academic repositories
Good for: Research on fairness, discrimination, and demographic gaps in AI performance

9. World Economic Forum – Future of Jobs Report Data https://www.weforum.org/publications/the-future-of-jobs-report-2025/

Data on the expected impact of AI and automation on employment across industries
Survey data from employers in 55+ countries
Reports available for download; some underlying data accessible
Good for: Studying how AI is expected to affect labor markets in different sectors

AI in Japan and Asia

10. Ministry of Economy, Trade and Industry (METI) – AI Research and Policy Data https://www.meti.go.jp/english/policy/mono_info_service/joho/index.html

Reports and data from Japan's government on AI strategy and industry adoption
AI-related policy documents and business surveys available in English and Japanese
Good for: Understanding Japan's national AI policy and industry trends

11. National Institute of Informatics (NII) – Research Data https://www.nii.ac.jp/en/

Japanese academic data infrastructure supporting open science
NII provides datasets and tools used in AI and information science research in Japan
Good for: Japan-based students seeking datasets linked to Japanese academic contexts

Benchmark and Model Evaluation Data

12. Hugging Face Datasets https://huggingface.co/datasets

Large repository of open datasets used to train and evaluate AI models
Includes datasets for natural language processing, computer vision, audio, and more
Many datasets are accompanied by evaluation metrics and leaderboards
Good for: Understanding what data AI systems are trained on; evaluating model performance across tasks

13. UCI Machine Learning Repository https://archive.ics.uci.edu/

Long-standing repository of datasets used in machine learning research
Over 600 datasets covering many topic areas (health, economics, environment)
Good for: Finding structured, well-documented datasets suitable for introductory AI analysis

Example Research Questions

To answer some of these questions, you might need to combine AI datasets with other data sources (e.g., country economic data, population figures, employment statistics, or education data).

How has the number of AI research publications changed over time, and which countries or institutions are most active?
Is there a relationship between a country's investment in AI and its output of AI research papers?
How does AI performance on benchmark tasks compare across different demographic groups, and what does this reveal about bias?
How have reported AI incidents changed over time, and what types of harms are most frequently documented?
How do different countries compare in terms of their national AI strategies and stated ethical principles?
What is the relationship between AI adoption in an industry and expected changes in employment in that sector?
How has the computational power required to train large AI models changed over the past decade?
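
Several of the questions above (publications over time, incidents over time, most frequent harm types) come down to counting records by year or by category. The Python sketch below shows one way to do this. The records, field names, and helper functions are hypothetical and do not reflect the schema of any real database.

```python
from collections import Counter
from datetime import date

# Hypothetical records with made-up dates and categories.
records = [
    {"id": 1, "date": "2019-03-02", "harm_type": "bias"},
    {"id": 2, "date": "2020-07-19", "harm_type": "privacy"},
    {"id": 3, "date": "2020-11-05", "harm_type": "bias"},
    {"id": 4, "date": "2021-01-23", "harm_type": "safety"},
    {"id": 5, "date": "2021-06-30", "harm_type": "bias"},
    {"id": 6, "date": "2021-09-14", "harm_type": "privacy"},
]


def count_by_year(rows, date_field="date"):
    """Count records per calendar year, returned in chronological order."""
    counts = Counter(date.fromisoformat(r[date_field]).year for r in rows)
    return dict(sorted(counts.items()))


def count_by_field(rows, field):
    """Count records per category, most frequent first."""
    return Counter(r[field] for r in rows).most_common()


print(count_by_year(records))
print(count_by_field(records, "harm_type"))
```

A rising count in a reporting-based source may reflect more reporting rather than more incidents, so counts like these are best read alongside the source's documentation.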

Tips for Using AI Data

Getting Started:

Start with Our World in Data or the Stanford AI Index — both provide clear, pre-processed data with good documentation
The AI Incident Database is a good choice for mixed-methods (qualitative and quantitative) research on AI ethics and governance
Define your research question first: AI data is diverse, and the right dataset depends heavily on what you want to study
For publication or research trend analysis, arXiv metadata and Semantic Scholar are strong starting points

Understanding the Data:

Benchmark: A standardized test used to compare the performance of AI systems (e.g., accuracy on image recognition tasks)
Parameters: The numerical values (weights) that an AI model learns during training; more parameters often (but not always) mean more capability
Training data: The dataset used to build an AI model; its characteristics strongly affect model behavior
Compute (FLOP): The number of floating-point operations performed; a measure of computational work, often used to describe the cost of training AI systems
Incident: A documented case in which an AI system caused or contributed to harm
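
To make the parameter and compute terms above more concrete, the Python sketch below applies a commonly cited rule of thumb for dense transformer language models: training compute is roughly 6 × parameters × training tokens. This approximation is not taken from the sources listed in this guide, and the model size and token count are hypothetical. Treat the result as an order-of-magnitude estimate only.

```python
def estimate_training_flop(n_parameters, n_training_tokens):
    """Rough training compute using the 6 * N * D rule of thumb.

    This is an approximation for dense transformer models, not an
    exact count of the operations a real training run performs.
    """
    return 6 * n_parameters * n_training_tokens


# Hypothetical model: 1 billion parameters, 20 billion training tokens.
flop = estimate_training_flop(1e9, 20e9)
print(f"{flop:.2e} FLOP")  # prints 1.20e+20 FLOP
```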

Data Quality Considerations:

Benchmark performance does not always reflect real-world AI behavior — context matters
AI incident databases rely on public reporting, so under-reporting is a significant issue
Survey-based data (such as employer surveys) reflect stated intentions and perceptions, not observed behavior
Research output data may overrepresent English-language publications and underrepresent work published in other languages

Making Comparisons:

Normalize research output data by country population or GDP to make international comparisons fair
When comparing AI system performance, ensure benchmarks were applied under the same conditions
Be cautious about making causal claims from trend data — correlation between AI investment and economic performance may reflect many factors
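
The normalization advice above can be sketched in Python as follows. The countries and all figures are hypothetical and chosen only to show how a ranking can change once raw counts are scaled by population or GDP.

```python
def per_million_people(count, population):
    """Scale a raw count to a rate per one million inhabitants."""
    return count * 1_000_000 / population


def per_billion_gdp(count, gdp_usd):
    """Scale a raw count to a rate per one billion US dollars of GDP."""
    return count * 1_000_000_000 / gdp_usd


# Hypothetical countries with made-up values.
countries = {
    "Country A": {"papers": 50_000, "population": 300_000_000,
                  "gdp_usd": 20_000_000_000_000},
    "Country B": {"papers": 5_000, "population": 10_000_000,
                  "gdp_usd": 500_000_000_000},
}

for name, d in countries.items():
    by_pop = per_million_people(d["papers"], d["population"])
    by_gdp = per_billion_gdp(d["papers"], d["gdp_usd"])
    print(f"{name}: {d['papers']} papers, "
          f"{by_pop:.1f} per million people, "
          f"{by_gdp:.1f} per billion USD of GDP")
```

In this made-up example, Country A publishes ten times as many papers as Country B, but Country B has the higher output both per person and per unit of GDP.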

Combining Datasets:

AI research often benefits from combining multiple sources:

Research publication data + country R&D spending + GDP
AI incident data + industry sector + country regulation
Benchmark performance + training dataset demographics + error rates by group
Employment projections + AI adoption rates + education attainment levels
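
Combinations like these usually require joining tables on a shared key, such as a country code. The Python sketch below shows one simple approach using only the standard library. The three-letter codes are placeholders, the values are made up, and the helper names are illustrative.

```python
# Hypothetical data keyed by placeholder three-letter country codes.
paper_counts = {"AAA": 1200, "BBB": 9000, "CCC": 800}
rd_spending_pct_gdp = {"AAA": 3.3, "BBB": 3.5, "DDD": 3.1}


def inner_join(left, right):
    """Keep only the keys present in both datasets, pairing their values."""
    shared = sorted(left.keys() & right.keys())
    return {key: (left[key], right[key]) for key in shared}


def unmatched(left, right):
    """List the keys that appear in only one of the two datasets."""
    return sorted(left.keys() ^ right.keys())


combined = inner_join(paper_counts, rd_spending_pct_gdp)
print(combined)
print("Dropped by the join:", unmatched(paper_counts, rd_spending_pct_gdp))
```

Checking which keys were dropped is worthwhile: sources often differ in country coverage or naming, and silently losing rows can bias a comparison.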

Useful Additional Data Sources

When studying AI topics, you may also want to use:

Economic data: GDP, R&D investment, employment by sector (OECD, World Bank)
Education data: STEM enrollment, computer science graduation rates (UNESCO, OECD)
Technology adoption data: Internet access, digital infrastructure (ITU, World Bank)
Policy and legal data: AI laws and regulations (EU AI Act documentation, OECD.AI)
Labor market data: Employment by occupation, wage data (ILO, national statistics agencies)

Questions? Need Help?

Start with the Stanford AI Index or Our World in Data for accessible, well-structured AI data
The AI Incident Database is especially useful for ethics and governance research topics
For Japan-specific AI topics, METI and NII are good starting points
Hugging Face Datasets is a strong source for finding data used in machine learning model training and evaluation
Remember that AI data changes rapidly — check when your source was last updated
When analyzing research publications, be aware that arXiv is widely used in AI but not in all fields; coverage may be uneven