Text and Data Mining

Open Access / Freely Available Resources

The following open access and freely available resources support text and data mining (TDM) activities. The list is not exhaustive, and we will add more resources as we become aware of them.

Crossref

Crossref offers a REST API that enables researchers to harvest metadata and access full-text links from participating publishers. The API supports both open access and subscription content. Open access materials can be retrieved directly, while subscription-based content is delivered through the publisher's access control systems.
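As a rough illustration of the harvesting workflow described above, the following Python sketch builds a query against the public `https://api.crossref.org/works` endpoint and pulls out any links a publisher has flagged for text mining. The function names, the `has-full-text` filter choice, and the `mailto` address are assumptions made for this example; whether a given link actually downloads depends on the publisher's access controls.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_works_url(query, rows=5, mailto=None):
    """Build a Crossref /works query URL (helper name is illustrative)."""
    params = {"query": query, "rows": rows, "filter": "has-full-text:true"}
    if mailto:
        # Crossref asks callers to identify themselves with a contact address.
        params["mailto"] = mailto
    return CROSSREF_WORKS + "?" + urllib.parse.urlencode(params)


def text_mining_links(work):
    """Return the full-text URLs a work record marks as intended for text mining."""
    return [
        link["URL"]
        for link in work.get("link", [])
        if link.get("intended-application") == "text-mining"
    ]


def fetch_works(query, rows=5, mailto=None):
    """Call the API and return the list of work records (requires network access)."""
    url = build_works_url(query, rows, mailto)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)["message"]["items"]


if __name__ == "__main__":
    # Replace the placeholder address with your own before running.
    for work in fetch_works("text mining", rows=3, mailto="you@example.org"):
        print(work.get("DOI"), text_mining_links(work))
```

Records without a `link` field simply yield an empty list, so the same loop works for metadata-only results.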

More info: Text and Data Mining for Researchers

HathiTrust Research Center

The HathiTrust Research Center (HTRC) offers tools for large-scale analysis of the HathiTrust Digital Library (HTDL), a vast collection of fiction, nonfiction, and scholarly works in various languages, to support non-profit research and education. An HTRC Analytics account is required to use most tools.

More info: HathiTrust Research Center Analytics

New York Times Developer Network

The New York Times Developer Network provides access to ten public APIs: Archive, Article Search, Books, Community, Geographic, Most Popular, Semantic, Times Newswire, TimesTags, and Top Stories.
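To give a sense of what a request looks like, here is a minimal Python sketch for the Article Search API. It assumes the `articlesearch.json` endpoint under `https://api.nytimes.com/svc/search/v2/` and an API key obtained by registering on the developer site; the helper names and the `YOUR_API_KEY` placeholder are hypothetical, and the available APIs and rate limits should be checked against the current developer documentation.

```python
import json
import urllib.parse
import urllib.request

ARTICLE_SEARCH = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, api_key, page=0):
    """Build an Article Search request URL (helper name is illustrative)."""
    params = {"q": query, "page": page, "api-key": api_key}
    return ARTICLE_SEARCH + "?" + urllib.parse.urlencode(params)


def headlines(payload):
    """Extract (publication date, headline, URL) tuples from a search response."""
    docs = payload.get("response", {}).get("docs", [])
    return [
        (d.get("pub_date"), d.get("headline", {}).get("main"), d.get("web_url"))
        for d in docs
    ]


def search(query, api_key, page=0):
    """Run one search request (requires network access and a valid key)."""
    with urllib.request.urlopen(build_search_url(query, api_key, page), timeout=30) as resp:
        return headlines(json.load(resp))


if __name__ == "__main__":
    # YOUR_API_KEY is a placeholder; supply the key issued to your account.
    for row in search("data mining", "YOUR_API_KEY"):
        print(row)
```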

More info: The New York Times Developer Network

Project Gutenberg

Text and Data Mining (TDM) is permitted on public domain texts from Project Gutenberg. These include novels, poetry, reference works, and more, available in multiple languages and formats.

Project Gutenberg has no official API, but bulk downloads are available through its mirror sites. To avoid server overload, Project Gutenberg discourages scraping its main website. For guidance on bulk access and respectful use, see the Robot Access Policy.
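One way to keep a bulk download respectful is to pause between requests. The Python sketch below shows that pattern only; the function names, the two-second delay, and the mirror URL in the usage section are hypothetical placeholders, not values taken from Project Gutenberg. Consult the Robot Access Policy for the mirrors and request rates actually permitted.

```python
import time
import urllib.request


def fetch_bytes(url):
    """Download a single file and return its raw bytes (default fetcher)."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return resp.read()


def download_politely(urls, delay_seconds=2.0, fetch=fetch_bytes, sleep=time.sleep):
    """Fetch each URL in order, waiting delay_seconds between consecutive requests.

    fetch and sleep are injectable so the pacing logic can be tested offline.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results[url] = fetch(url)
    return results


if __name__ == "__main__":
    # Hypothetical mirror path; substitute a real mirror listed by Project Gutenberg.
    files = download_politely(["https://mirror.example.org/gutenberg/1/2/3/123.txt"])
    for url, data in files.items():
        print(url, len(data), "bytes")
```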

More info: See the Terms of Use and the Permissions, Licensing and other Common Requests pages.

PubMed

PubMed Central (PMC) offers automated retrieval of articles in machine-readable formats for several large datasets in PMC and NCBI Bookshelf, including the PMC Open Access Subset and the Author Manuscript Dataset. However, not all articles are available for text mining and other reuse. Users are directly and solely responsible for complying with copyright restrictions and are expected to adhere to the terms and conditions defined by the copyright holder.
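Because reuse rights vary by article, it can help to check an article's license before downloading it. The Python sketch below assumes the PMC OA Web Service at `https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi` and an XML response made of `record` elements with `license` attributes and `link` children. The sample XML is hand-written for illustration, with a placeholder identifier and host, and the response shape should be verified against the For Developers page.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

OA_SERVICE = "https://www.ncbi.nlm.nih.gov/pmc/utils/oa/oa.fcgi"

# Hand-written illustration of the assumed response shape, not real service output.
SAMPLE_RESPONSE = """<OA><records>
  <record id="PMC0000000" license="CC BY">
    <link format="tgz" href="ftp://example.org/PMC0000000.tar.gz"/>
  </record>
</records></OA>"""


def build_oa_url(pmcid):
    """Build an OA Web Service lookup URL for one PMC identifier."""
    return OA_SERVICE + "?" + urllib.parse.urlencode({"id": pmcid})


def parse_oa_records(xml_text):
    """Return one dict per record: id, license, and download links keyed by format."""
    root = ET.fromstring(xml_text)
    return [
        {
            "id": rec.get("id"),
            "license": rec.get("license"),
            "links": {link.get("format"): link.get("href") for link in rec.iter("link")},
        }
        for rec in root.iter("record")
    ]


def lookup(pmcid):
    """Query the service for one article (requires network access)."""
    with urllib.request.urlopen(build_oa_url(pmcid), timeout=30) as resp:
        return parse_oa_records(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(parse_oa_records(SAMPLE_RESPONSE))
```

An empty result means the response contained no records, which is what this sketch would return for an article outside the open access datasets.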

More info: See the For Developers page for details on the Cloud Service and APIs, and the Text Mining Tools page for the available tools.