
Text and Data Mining

Open Access / Freely Available Resources

The following Open Access and freely available resources support text and data mining (TDM) activities. The list is not exhaustive, and we will add more resources as we become aware of them.

HathiTrust Research Center

The HathiTrust Research Center (HTRC) offers tools for large-scale analysis of the HathiTrust Digital Library (HTDL), a vast collection of fiction, nonfiction, and scholarly works in various languages, to support non-profit research and education. An HTRC Analytics account is required to use most tools.

More info: HathiTrust Research Center Analytics

New York Times Developer Network

Provides access to ten public APIs: Archive, Article Search, Books, Community, Geographic, Most Popular, Semantic, Times Newswire, TimesTags, and Top Stories.

More info: The New York Times Developer Network
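As an illustration of how one of these APIs might be queried, here is a minimal Python sketch that builds a request URL for the Article Search API. The base URL and the `q`, `page`, and `api-key` parameter names are assumptions based on the commonly documented form of the API and should be checked against the Developer Network documentation. The function name `build_article_search_url` is hypothetical, and `YOUR_API_KEY` is a placeholder for a key issued by the Developer Network.

```python
from urllib.parse import urlencode

# Assumed endpoint for the Article Search API; confirm it in the
# Developer Network documentation before relying on it.
ARTICLE_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_article_search_url(query, api_key, page=0):
    """Return a request URL for one page of Article Search results.

    The parameter names ("q", "page", "api-key") are assumptions.
    """
    params = {"q": query, "page": page, "api-key": api_key}
    return ARTICLE_SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    # Prints the URL only; no request is sent.
    print(build_article_search_url("text mining", "YOUR_API_KEY"))
```

The sketch only constructs the URL. Sending the request, for example with `urllib.request.urlopen`, is left out so that the example runs without a key or a network connection.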

Project Gutenberg

Text and Data Mining (TDM) is permitted on public domain texts from Project Gutenberg. These include novels, poetry, reference works, and more, available in multiple languages and formats.

Project Gutenberg has no official API, but bulk downloads are available through its mirror sites. To avoid server overload, Project Gutenberg discourages scraping its main website. For guidance on bulk access and respectful use, see the Robot Access Policy.

More info: See Project Gutenberg's Terms of Use page and its Permissions, Licensing and other Common Requests page.
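For respectful bulk access, the following Python sketch shows one possible way to build a harvest URL and to pause between successive downloads. The harvest endpoint and its `filetypes[]` and `langs[]` parameters are assumptions based on the Robot Access Policy and should be verified there. The two-second delay is likewise an assumed value, and the helper names `build_harvest_url` and `throttled` are hypothetical.

```python
import time
from urllib.parse import urlencode

# Assumed harvest endpoint; verify it against the Robot Access Policy.
HARVEST_URL = "https://www.gutenberg.org/robot/harvest"


def build_harvest_url(filetype="txt", lang="en"):
    """Return a harvest URL for one file type and one language.

    The "filetypes[]" and "langs[]" parameter names are assumptions.
    """
    # safe="[]" keeps the square brackets in the parameter names unescaped.
    query = urlencode({"filetypes[]": filetype, "langs[]": lang}, safe="[]")
    return f"{HARVEST_URL}?{query}"


def throttled(items, delay_seconds=2.0, sleep=time.sleep):
    """Yield items one at a time, pausing between consecutive items."""
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay_seconds)
        yield item


if __name__ == "__main__":
    print(build_harvest_url())
    # Placeholder names stand in for download URLs; nothing is fetched here.
    for name in throttled(["first", "second"], delay_seconds=2.0):
        print("would fetch:", name)
```

Wrapping a list of download URLs in `throttled` spaces out the requests, which is the behaviour the policy asks of automated clients. The actual download step is omitted so that the sketch runs without network access.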