Sources for Text-Mining

Due to license agreements, users are NOT allowed to download an excessive amount of content from library-subscribed resources, regardless of the downloading method.

  • Content includes articles, book chapters, images, among other materials

  • Violations will trigger automatic lockouts and prevent other users from accessing the same database

However, some library-subscribed and open access resources DO allow data or text mining, subject to certain terms and conditions.

  • Some resources (mostly open access ones) allow you to directly harvest their data

  • Some require you to use the data mining tools they provide

  • Some allow mining only if they conduct the process for you

The Library also develops an increasing number of digital scholarship projects with a view to facilitating public access to the original research data and materials collected or created by HKBU faculty. In most cases, we can share these materials in a way that makes data / text mining possible. Contact us if you are interested in this.

Library Subscribed Resources (Text-mining Allowed)

The following resources allow data mining, with or without asking users to seek approval in advance. Please take note of the terms of use, which come from either the license agreement with the Library or the resources' corresponding websites.

Open Access Resources

Book Data

HathiTrust Digital Library FREE

HathiTrust makes the texts of public domain works in its corpus available for research purposes. The works fall into two categories: non-Google-digitized volumes, which are freely available, and Google-digitized volumes, which are available through an agreement with Google.

HathiTrust+Bookworm FREE

A tool for visualizing and analyzing word usage trends in the HathiTrust Digital Library. No login is required.

Google Books FREE

Search full text of books in many languages. Download books in the public domain. The Advanced Search allows you to filter for "full-view". Texts are in American English, British English, French, German, Italian, Spanish, Russian, Hebrew, and Chinese.

Google Books Ngram Viewer FREE

Charts the frequencies of any word or short phrase using yearly counts of n-grams found in sources printed between 1500 and the present. If you are interested in performing a large-scale analysis on the underlying data, the corpora are available for download.
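
To illustrate what a yearly n-gram count is, here is a minimal sketch in Python. It is not the Ngram Viewer's own method: the toy corpus, the whitespace tokenizer, and the function name yearly_ngram_counts are all assumptions made for illustration, and the downloadable corpora use their own file format.

```python
from collections import Counter, defaultdict


def yearly_ngram_counts(docs, n):
    """Count n-grams per publication year.

    docs: iterable of (year, text) pairs (a hypothetical toy corpus).
    Returns a dict mapping year -> Counter of n-gram string -> count.
    """
    counts = defaultdict(Counter)
    for year, text in docs:
        # Naive lowercase whitespace tokenizer (an assumption, not Google's)
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[year][" ".join(tokens[i:i + n])] += 1
    return dict(counts)


# Toy corpus invented for illustration only
toy_corpus = [
    (1900, "the steam engine and the steam ship"),
    (1900, "a steam engine"),
    (1950, "the jet engine"),
]

bigrams = yearly_ngram_counts(toy_corpus, 2)
print(bigrams[1900]["steam engine"])
```

Dividing each count by the total number of n-grams for that year would give a relative frequency, which is the kind of quantity a trend chart plots.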

Culturomics Bookworm Viewer FREE

Developed by Culturomics at Harvard, it is an interface tool for queries in the Google Books corpus. Users can run queries in highly selective corpora based on subject (books on world history, American books on science, etc.) though these corpora are much smaller than those in the full Google Books collection.

Chinese Text Project FREE

With over thirty thousand titles and more than five billion characters, the Chinese Text Project is the largest database of pre-modern Chinese texts in existence. The system also provides an API, text tools, and more to facilitate online text mining.

Internet Archive: eBooks and Texts FREE

Offers over 10,000,000 fully accessible books and texts. The wider collection also includes audio, moving images, and software as well as archived web pages. Instructions for downloading in bulk are available.

Project Gutenberg FREE

The first producer of free electronic books, Project Gutenberg currently provides over 60,000 titles. Here is the Project's Terms of Use.

Online Books Page FREE

Lists over 3 million free books on the internet (including Project Gutenberg, HathiTrust, Google Books, publisher and institutional archives, etc.). Provides a section on non-English language texts.


Wikidata FREE

Wikidata acts as central storage for the structured data of its Wikimedia sister projects, including Wikipedia, Wikivoyage, Wiktionary, Wikisource, and others. The Statistics page shows what types of information are available, and the Data Access page provides instructions for downloading data.
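
As one possible way to retrieve a small slice of Wikidata, the sketch below builds a request for the public SPARQL endpoint at query.wikidata.org. This endpoint is not mentioned in the text above, so treat it as an assumption and check the Data Access page for the recommended method and usage limits. The helper names, the example query, and the User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint (assumption: see the Data Access page)
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def build_query_url(sparql):
    """Return a GET URL asking the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": sparql, "format": "json"})
    return f"{WDQS_ENDPOINT}?{params}"


def run_query(sparql, user_agent="text-mining-example/0.1 (hypothetical contact)"):
    """Send the query and return the list of result bindings.

    Requires network access; identify your script with a descriptive
    User-Agent and keep result sets small.
    """
    request = urllib.request.Request(
        build_query_url(sparql), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["results"]["bindings"]


# Illustrative query: five items that are an instance of (P31) book (Q571)
EXAMPLE_QUERY = "SELECT ?item WHERE { ?item wdt:P31 wd:Q571 } LIMIT 5"

url = build_query_url(EXAMPLE_QUERY)
print(url)
```

Calling run_query(EXAMPLE_QUERY) would send the request. For large extractions, the full data dumps described on the Data Access page are likely a better fit than repeated queries.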

Regional Based

Australian Data Archive FREE

ADA provides a national service for the collection and preservation of digital research data, and makes these data available for secondary analysis by academic researchers and other users. Downloading data requires approval.

Digital Public Library of America FREE

DPLA offers a single point of access to millions of items from libraries, archives, and museums around the United States. Data is available for bulk download in JSON files.

Chronicling America: Historical American Newspapers FREE

Collection of digitized historical newspapers from 1789 to 1924. OCR batch downloads are available.

Europeana APIs FREE

Europeana is a digital library with millions of books, paintings, films, museum objects and archival records that have been digitized by more than 2,000 institutions across Europe.

Taiwan History Digital Library FREE

THDL provides tagged full text of primary historical material on Taiwan, focusing on the Qing dynasty. The system also provides sophisticated context discovery tools for users to analyze chronological, geographic, and source information.

Taiwan Biographical Ontology FREE

TBO stores biographical information on 19,372 Taiwanese elites and people closely associated with them. Registered users can visualize the data, mine it with more advanced online features, or download the dataset for further analysis.

Subject Based

University of Oxford Text Archive FREE

OTA develops, collects, catalogs and preserves electronic literary and linguistic resources for use in Higher Education, in research, teaching and learning. Materials include Shakespeare's plays, public speeches, books, and more.

arXiv FREE

Open access to 1,600,000 e-prints in physics, mathematics, computer science, quantitative biology, quantitative finance and statistics. Bulk access available.

BioMed Central FREE

Over 403,000 full-text, peer-reviewed science, technology and medicine articles are available for text and data mining.


Public Library of Science (PLOS) provides access to its peer-reviewed articles, including a specific Text Mining Collection.

PubMed Central Databases and Text Mining Tools FREE

Multiple text mining tools to analyze not only scholarly publications, but also other types of biomedical resources, such as Electronic Health Records.

Other Resources