
Text and Data Mining

Introduction to Text and Data Mining and API use at Boston University.

Open Access TDM

Source About Details/API Access Links
arXiv arXiv is a freely accessible digital archive hosting over two million scholarly articles in various STEM fields including physics, mathematics, and computer science. The arXiv API allows for programmatic access to arXiv e-print content and metadata. Query results are returned in the XML-based Atom 1.0 format. arXiv API Basics
BioMed Central API BMC has an evolving portfolio of some 300 peer-reviewed journals, sharing discoveries from research communities in science, technology, engineering and medicine. BMC offers a RESTful API for retrieving the open access content it publishes. Resources are represented in JSON and PRISM Aggregator Message (PAM) formats. BMC Indexing, archiving and access to data
CORE CORE is an index of open access research papers from a global network of repositories and journals. The CORE API provides users with free access to metadata and full text content from a network of repositories. Data is delivered in JSON format. CORE API - How it works
HathiTrust HathiTrust provides a bibliographic API as well as searchable datasets for researchers. This API returns bibliographic, rights, and volume information when given one or more standard identifiers (ISBN, LCCN, OCLC, etc.). It is intended for retrieving information about small numbers of items at a time. The HathiTrust Bibliographic API is not a search API (e.g., where you use a keyword to search across the collection). API Documentation
JSTOR Data for Research  JSTOR’s Data for Research (DfR) program accommodates text analysis and digital humanities research by providing datasets for the journals, books, research reports, and pamphlets in the digital library. Users can create datasets based on custom queries. This is not a true API, but allows computational analysis and selection of JSTOR's scholarly journal and primary resource collections. Data for Research Request form.
Public Library of Science PLOS is an open science publisher of content across the life science, health, and sustainability disciplines. PLOS provides a search API that allows article metadata to be downloaded via Solr queries. Bulk and non-bulk article downloading is also permitted, though discouraged. Use of the search API is the preferred TDM method. Information about available PLOS APIs
PubMed Central PubMed Central is a free digital repository that archives open access full-text scholarly articles that have been published in biomedical and life sciences journals. PMC offers several APIs that give programmatic access to services for PMC literature content, including file validation tools, Open Access web services, and an ID converter that interconverts PMCIDs, PMIDs, Manuscript IDs, and DOIs. PubMed API Documentation
The World Bank API The World Bank is an international financial institution which provides free and open access global development data. The World Bank offers an API that allows for the search and retrieval of the public Bank documents available in the Documents & Reports site. Records can be retrieved in a format useful for research and for inclusion in web sites outside of Documents & Reports and the World Bank. The World Bank Documents & Report API
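As an illustration of the arXiv entry above, the following is a minimal Python sketch of building a query URL and reading the Atom 1.0 response with the standard library. The endpoint http://export.arxiv.org/api/query and the search_query, start, and max_results parameters come from the arXiv API documentation; the function names and the choice of fields to extract are assumptions made for this example, so check the linked documentation before relying on them.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Atom 1.0 namespace used by arXiv query results.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}
ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"


def build_arxiv_url(search_query, start=0, max_results=5):
    """Return a query URL for the arXiv API (helper name is illustrative)."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
    }
    return ARXIV_ENDPOINT + "?" + urllib.parse.urlencode(params)


def parse_atom_entries(xml_data):
    """Extract id, title, and author names from an Atom feed (str or bytes)."""
    root = ET.fromstring(xml_data)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        entries.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS).strip(),
            # Titles may be wrapped across lines, so collapse whitespace.
            "title": " ".join(title.split()),
            "authors": [
                author.findtext("atom:name", default="", namespaces=ATOM_NS)
                for author in entry.findall("atom:author", ATOM_NS)
            ],
        })
    return entries


def fetch_arxiv(search_query, max_results=5):
    """Fetch and parse one page of results (requires network access)."""
    url = build_arxiv_url(search_query, max_results=max_results)
    with urllib.request.urlopen(url, timeout=30) as response:
        return parse_atom_entries(response.read())
```

Calling fetch_arxiv("all:electron") would request the first five matching records. For larger harvests, page through results with the start parameter and pause between requests, following the usage guidance in the arXiv documentation.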
Source About Details/API Access Links
Caselaw Access Project The Caselaw Access Project offers free, public access to over 6.5 million decisions published by state and federal courts throughout U.S. history. CourtListener provides a robust API for accessing CAP data, as well as a user-friendly interface for searching and browsing the data. Also available via bulk data download. CAP Documentation
The World Bank API The World Bank is an international financial institution which provides free and open access global development data. The World Bank offers an API that allows for the search and retrieval of the public Bank documents available in the Documents & Reports site. Records can be retrieved in a format useful for research and for inclusion in web sites outside of Documents & Reports and the World Bank. The World Bank Documents & Report API
Source About Details/API Access Links
Congress.gov The Congress.gov API allows for public download of congressional data including bills, amendments, hearings, committee reports, and communications from the House and Senate. Users must register an API key for access. Responses are returned in XML or JSON formats. Congress.gov API
Library of Congress The LoC collection consists of 164 million items, most of which can be browsed in its digital catalog. LoC has made datasets available for bulk download, and its collections can also be queried with the Library of Congress API. This allows users to download collection content files and structured data (JSON/YAML) about collections. Accessing Digital Materials
MBTA Open Data Portal Detailed, historical data including ridership and performance data, schedules, and other system information. Datasets are downloadable and searchable. The V3 API uses the JSON API format, so you can get started quickly using any of the available libraries. MBTA Data Sources
NYC Open Data Open Data is free public data published by New York City agencies and other partners.  Data is available to browse and download. For API users, Open Data APIs provide rich query functionality through the “Socrata Query Language” (SoQL), which borrows heavily from Structured Query Language (SQL). Getting Started with Open Data
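To make the SoQL description in the NYC Open Data entry concrete, here is a minimal Python sketch that assembles a query URL. It assumes the usual Socrata pattern of a /resource/<dataset-id>.json endpoint with $select, $where, $order, and $limit parameters; the function name is made up for this example, and the dataset identifier and column name in the usage note are hypothetical placeholders, not real datasets.

```python
from urllib.parse import urlencode


def build_soql_url(domain, dataset_id, select=None, where=None,
                   order=None, limit=None):
    """Assemble a Socrata (SoQL) query URL; only supplied clauses are added."""
    params = {}
    if select:
        params["$select"] = select
    if where:
        params["$where"] = where
    if order:
        params["$order"] = order
    if limit is not None:
        params["$limit"] = limit

    base = f"https://{domain}/resource/{dataset_id}.json"
    if not params:
        return base
    # Keep "$" readable in parameter names; everything else is percent-encoded.
    return base + "?" + urlencode(params, safe="$")
```

For example, build_soql_url("data.cityofnewyork.us", "abcd-1234", where="borough = 'BROOKLYN'", limit=10) produces a URL that filters rows much like a SQL WHERE clause would. Replace abcd-1234 and borough with the identifier and a column from an actual dataset page.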
Source About Details/API Access Links
Adam Matthew Digital Adam Matthew is a digital publisher of unique primary source collections from archives around the world. Data mining and text analysis may be performed by "Authorised Users" for fair use/academic research. Permission is required and is granted by submitting a request form. AM Text and Datamining
Art Institute of Chicago API The Art Institute of Chicago is one of the oldest and largest art museums in the United States, with a collection of nearly 300,000 works of art. The Art Institute of Chicago's API provides JSON-formatted data as a REST-style service that allows developers to explore and integrate the museum’s public data into their projects. Art Institute of Chicago API
Digital Public Library of America The DPLA archives content from libraries, archives, museums, and cultural heritage institutions from across the country. DPLA's API utilizes a RESTful approach to deliver search results. Users can query metadata at both the collection and item level. DPLA - API Basics
The Metropolitan Museum of Art API The Met is the largest museum in North America, with over 1.5 million artworks. Since 2017 the Met collection has been Open Access. The Met’s Open Access datasets are available through its API. The API (a RESTful web service that returns JSON) gives access to all of The Met’s Open Access data and to corresponding high-resolution images (JPEG format) that are in the public domain. Use the Met's API
The New York Times APIs NYT provides access to bibliometric data through several open-access APIs.

APIs use a RESTful style and a resource-oriented architecture. Calls are made via HTTPS requests. Your request URIs should be patterned after the examples in the API documentation, and you should always include your API key in a query string. See the documentation for each API for more details on request parameters and URI structure. 

NYT's public APIs include Archive, Article Search, Books, Most Popular, Semantic, Times Newswire, TimesTags, and Top Stories.

NYT Developer
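The request pattern described for the NYT APIs (an HTTPS call with the API key in the query string) can be sketched in Python as follows. The Article Search endpoint and the api-key parameter follow the NYT developer documentation; the function names and the NYT_API_KEY environment variable are assumptions made for this example.

```python
import json
import os
import urllib.request
from urllib.parse import urlencode

ARTICLE_SEARCH_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_nyt_search_url(query, api_key, page=0):
    """Return an Article Search URL with the API key in the query string."""
    params = {"q": query, "page": page, "api-key": api_key}
    return ARTICLE_SEARCH_URL + "?" + urlencode(params)


def search_articles(query, page=0):
    """Run one search request (requires network access and a valid key)."""
    # Reading the key from an environment variable keeps it out of scripts.
    api_key = os.environ["NYT_API_KEY"]
    url = build_nyt_search_url(query, api_key, page=page)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Each NYT API has its own URI structure and rate limits, so treat this as a starting point and consult the documentation for the specific API before collecting data at scale.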
