Unpaywall

“Unpaywall is a project of Our Research, a nonprofit building tools to help make scholarly research more open, accessible, and reusable.” … they “harvest Open Access content from over 50,000 publishers and repositories, and make it easy to find, track, and use.” – Unpaywall website.

Unpaywall is an “open database of free scholarly articles.” It includes “data from open indexes like Crossref and DOAJ where it exists.” Data comes from “monitoring over 50,000 unique online content hosting locations, including Gold OA journals, Hybrid journals, institutional repositories, and disciplinary repositories.” “Unpaywall assigns an OA Status to every article.” “There are five possible values: closed, green, gold, hybrid, and bronze.” - source: Unpaywall and data details

This telescope uses the Unpaywall Data Feed service. If you wish to ingest Unpaywall using the free, non-subscription snapshots, use the Unpaywall Snapshot telescope instead.

The free Unpaywall snapshot service is only updated a few times a year. It is also difficult to find changes from snapshot to snapshot. The Data Feed service rectifies this by providing daily or weekly changefiles. To use the Data Feed:

  1. Get the current snapshot from the API service. The snapshot is updated daily and is available with the Data Feed subscription.

  2. Get all changefiles, starting with the one whose timestamp is the latest just before the snapshot date.

  3. Apply the changefiles to the snapshot in date order from oldest to newest.
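The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the telescope's implementation: the `Changefile` tuple, the helper names, and the in-memory dict keyed by `doi` are all hypothetical, and the check on the `updated` field (so that an older overlapping record never replaces a newer one) is an assumption of this sketch rather than something Unpaywall prescribes.

```python
from datetime import date
from typing import Dict, Iterable, List, NamedTuple


class Changefile(NamedTuple):
    """Hypothetical in-memory stand-in for one downloaded changefile."""
    day: date
    records: List[dict]


def select_changefiles(changefiles: Iterable[Changefile], snapshot_day: date) -> List[Changefile]:
    """Step 2: keep the latest changefile dated before the snapshot, plus all later ones."""
    ordered = sorted(changefiles, key=lambda c: c.day)
    earlier = [c for c in ordered if c.day < snapshot_day]
    start = earlier[-1].day if earlier else snapshot_day
    return [c for c in ordered if c.day >= start]


def apply_changefiles(snapshot: Dict[str, dict], changefiles: Iterable[Changefile]) -> Dict[str, dict]:
    """Step 3: upsert records by DOI, oldest changefile first."""
    table = dict(snapshot)  # copy so the caller's snapshot is left untouched
    for changefile in sorted(changefiles, key=lambda c: c.day):
        for record in changefile.records:
            current = table.get(record["doi"])
            # Assumption: 'updated' is an ISO 8601 string in one consistent
            # format, so plain string comparison orders records by time.
            if current is None or record.get("updated", "") >= current.get("updated", ""):
                table[record["doi"]] = record
    return table
```

Starting one changefile before the snapshot means some records arrive twice. Because the upsert is keyed on DOI, the overlap costs some extra download but does not create duplicate rows.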

The Data Feed service requires an API key in order to access the changefiles. See the data feed page for more information on obtaining a key.

On first run, this telescope tries to pull the snapshot on the telescope’s start date. Note that the snapshot hosted at https://api.unpaywall.org/feed/snapshot?api_key=YOUR_API_KEY is currently updated daily by Unpaywall, so make sure you set the telescope’s start date to be equal to that snapshot date.

Subsequent scheduled runs download the daily changefile from TWO DAYS PRIOR to the scheduled execution date to update the dataset. You should set the scheduled_interval to “@daily”, or an equivalent interval that results in scheduled runs on a daily basis. If runs are interrupted, the telescope attempts to catch up on missed scheduled runs from the start date to the current execution date.

The changefile applied is from two days prior to each scheduled execution date so that we can guarantee data integrity after applying the snapshot.

Unpaywall recommends applying changefiles starting from a timestamp just before the snapshot timestamp. This telescope uses daily changefiles to minimise the amount of overlapping data downloaded. To simplify the update process, we apply daily changefiles to the snapshot starting from one day before the snapshot timestamp. For example, if the first execution date is (2021,7,2), the snapshot from (2021,7,2) is downloaded. The next execution date, (2021,7,3), downloads the daily changefile from (2021,7,1). The execution date (2021,7,4) downloads the daily changefile from (2021,7,2), the execution date (2021,7,5) downloads the daily changefile from (2021,7,3), and so on.
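The date arithmetic in this example can be written down as a short function. This is a sketch for illustration only; the name `changefile_day` and its signature are hypothetical and are not taken from the telescope's code.

```python
from datetime import date, timedelta
from typing import Optional


def changefile_day(execution_day: date, first_execution_day: date) -> Optional[date]:
    """Return the changefile date for a run, or None for the first run,
    which downloads the snapshot instead of a changefile."""
    if execution_day < first_execution_day:
        raise ValueError("execution date precedes the first execution date")
    if execution_day == first_execution_day:
        return None
    # Every later run applies the changefile from two days earlier.
    return execution_day - timedelta(days=2)


start = date(2021, 7, 2)
for offset in range(4):
    run = start + timedelta(days=offset)
    source = changefile_day(run, start)
    label = "snapshot" if source is None else f"changefile {source.isoformat()}"
    print(f"{run.isoformat()}: {label}")
```

Running the loop prints the same sequence as the example: the snapshot on 2021-07-02, then the changefiles for 2021-07-01, 2021-07-02 and 2021-07-03 on the following three execution dates.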

The telescope maintains a single BigQuery table, which is kept current to two days before the latest scheduled execution date.

Airflow Connection

The telescope requires an Airflow connection named unpaywall with the password set to the API key from Unpaywall for accessing the Data Feed service. For example, the corresponding observatory config.yaml entry could be:

unpaywall: http://:API_KEY@localhost

The connection must be a valid URI supported by Airflow, but only the password component (the API key) is used by this telescope.
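To illustrate how only the password component matters, the sketch below pulls the API key out of a connection URI with the Python standard library and builds the snapshot URL quoted earlier. The function names are hypothetical, and the telescope itself reads the connection through Airflow rather than parsing the string this way.

```python
from urllib.parse import unquote, urlencode, urlparse

# Snapshot endpoint as quoted in the text above.
SNAPSHOT_URL = "https://api.unpaywall.org/feed/snapshot"


def api_key_from_connection(uri: str) -> str:
    """Return the password component of a connection URI, percent-decoded."""
    password = urlparse(uri).password
    if not password:
        raise ValueError("connection URI has no password component")
    return unquote(password)


def snapshot_url(api_key: str) -> str:
    """Build the snapshot download URL for the given API key."""
    return f"{SNAPSHOT_URL}?{urlencode({'api_key': api_key})}"


key = api_key_from_connection("http://:API_KEY@localhost")
print(snapshot_url(key))
```

The scheme, host and empty username in `http://:API_KEY@localhost` are placeholders that make the string a valid URI; the sketch ignores them, just as the telescope does.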

Summary

Average runtime: 15 min

Average download size: 100 MB

Harvest Type: URL

Harvest Frequency: Daily

Runs on remote worker: False

Catchup missed runs: True

Table Write Disposition: Append

Update Frequency: Daily

Credentials Required: Yes

Uses Telescope Template: Stream

Latest schema

name

type

mode

description

best_oa_location

RECORD

NULLABLE

The best OA Location Object we could find for this DOI. The “best” location is determined using an algorithm that prioritizes publisher-hosted content first (eg Hybrid or Gold), then prioritizes versions closer to the version of record (PublishedVersion over AcceptedVersion), then more authoritative repositories (PubMed Central over CiteSeerX). Returns null if we couldn’t find any OA Locations.

best_oa_location.evidence

STRING

NULLABLE

How we found this OA location. Used for debugging. Don’t depend on the exact contents of this for anything, because values are subject to change without warning.

best_oa_location.host_type

STRING

NULLABLE

The type of host that serves this OA location. There are two possible values: ‘publisher’ means this location is served by the article’s publisher (in practice, this usually means it is hosted on the same domain the DOI resolves to). ‘repository’ means this location is served by an Open Access repository. Preprint servers are considered repositories even if the DOI resolves there.

best_oa_location.is_best

BOOLEAN

NULLABLE

Is this location the best_oa_location for its resource. See the DOI object’s best_oa_location description for more on how we select which location is “best.”

best_oa_location.license

STRING

NULLABLE

The license under which this copy is published. We return several types of licenses: Creative Commons licenses are uniformly abbreviated and lowercased. Example: ‘cc-by-nc’. Publisher-specific licenses are normalized using this format: ‘acs-specific: authorchoice/editors choice usage agreement’. When we have evidence that an OA license of some kind was used, but it’s not reported directly on the webpage at this location, this field returns ‘implied-oa’

best_oa_location.oa_date

DATE

NULLABLE

When this document first became available at this location. oa_date is calculated differently for different host types and is not available for all oa_locations. See https://support.unpaywall.org/a/solutions/articles/44002063719 for details.

best_oa_location.pmh_id

STRING

NULLABLE

OAI-PMH endpoint where we found this location. This is primarily for internal debugging. It’s null for locations that weren’t found using OAI-PMH.

best_oa_location.updated

TIMESTAMP

NULLABLE

Time when the data for this location was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663

best_oa_location.url

STRING

NULLABLE

The url_for_pdf if there is one; otherwise landing page URL. When we can’t find a url_for_pdf (or there isn’t one), this field uses the url_for_landing_page, which is a useful fallback for some use cases.

best_oa_location.url_for_landing_page

STRING

NULLABLE

The URL for a landing page describing this OA copy. When the host_type is “publisher” the landing page usually includes HTML fulltext.

best_oa_location.url_for_pdf

STRING

NULLABLE

The URL with a PDF version of this OA copy.

best_oa_location.version

STRING

NULLABLE

The content version accessible at this location. We use the DRIVER Guidelines v2.0 VERSION standard (https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings) to define versions of a given article; see those docs for complete definitions of terms.

best_oa_location.repository_institution

STRING

NULLABLE

best_oa_location.endpoint_id

STRING

NULLABLE

best_oa_location.id

STRING

NULLABLE

data_standard

INTEGER

NULLABLE

Indicates the data collection approaches used for this resource. Possible values: ‘1’ First-generation hybrid detection. Uses only data from the Crossref API to determine hybrid status. Does a good job for Elsevier articles and a few other publishers, but most publishers are not checked for hybrid. ‘2’ Second-generation hybrid detection. Uses additional sources, checks all publishers for hybrid. Gets about 10x as much hybrid. data_standard==2 is the version used in the paper we wrote about the dataset.

doi

STRING

REQUIRED

The DOI of this resource. This is always lowercase.

doi_url

STRING

NULLABLE

The DOI in hyperlink form. This field simply contains “https://doi.org/” prepended to the doi field. It expresses the DOI in its correct format according to the Crossref DOI display guidelines.

genre

STRING

NULLABLE

The type of resource. Currently the genre is identical to the Crossref-reported type of a given resource. The “journal-article” type is most common, but there are many others.

is_paratext

BOOLEAN

NULLABLE

Is the item an ancillary part of a journal, like a table of contents? See here for more information on how we determine whether an article is paratext: https://support.unpaywall.org/support/solutions/articles/44001894783.

is_oa

BOOLEAN

NULLABLE

Is there an OA copy of this resource. Convenience attribute; returns true when best_oa_location is not null.

journal_is_in_doaj

BOOLEAN

NULLABLE

Is this resource published in a DOAJ-indexed journal. Useful for defining whether a resource is Gold OA (depending on your definition, see also journal_is_oa).

journal_is_oa

BOOLEAN

NULLABLE

Is this resource published in a completely OA journal. Useful for defining whether a resource is Gold OA. Includes any fully-OA journal, regardless of inclusion in DOAJ. This includes journals by all-OA publishers and journals that would otherwise be all Hybrid or Bronze OA.

journal_issns

STRING

NULLABLE

Any ISSNs assigned to the journal publishing this resource. Separate ISSNs are sometimes assigned to print and electronic versions of the same journal. If there are multiple ISSNs, they are separated by commas. Example: 1232-1203,1532-6203

journal_issn_l

STRING

NULLABLE

A single ISSN for the journal publishing this resource. An ISSN-L can be used as a primary key for a journal when more than one ISSN is assigned to it. Resources’ journal_issns are mapped to ISSN-Ls using the issn.org table, with some manual corrections.

journal_name

STRING

NULLABLE

The name of the journal publishing this resource. The same journal may have multiple name strings (eg, “J. Foo”, “Journal of Foo”, “JOURNAL OF FOO”, etc). These have not been fully normalized within our database, so use with care.

oa_locations

RECORD

REPEATED

List of all the OA Location objects associated with this resource. This list is unnecessary for the vast majority of use-cases, since you probably just want the best_oa_location. It’s included primarily for research purposes.

oa_locations.evidence

STRING

NULLABLE

How we found this OA location. Used for debugging. Don’t depend on the exact contents of this for anything, because values are subject to change without warning.

oa_locations.host_type

STRING

NULLABLE

The type of host that serves this OA location. There are two possible values: ‘publisher’ means this location is served by the article’s publisher (in practice, this usually means it is hosted on the same domain the DOI resolves to). ‘repository’ means this location is served by an Open Access repository. Preprint servers are considered repositories even if the DOI resolves there.

oa_locations.is_best

BOOLEAN

NULLABLE

Is this location the best_oa_location for its resource. See the DOI object’s best_oa_location description for more on how we select which location is “best.”

oa_locations.license

STRING

NULLABLE

The license under which this copy is published. We return several types of licenses: Creative Commons licenses are uniformly abbreviated and lowercased. Example: ‘cc-by-nc’. Publisher-specific licenses are normalized using this format: ‘acs-specific: authorchoice/editors choice usage agreement’. When we have evidence that an OA license of some kind was used, but it’s not reported directly on the webpage at this location, this field returns ‘implied-oa’

oa_locations.oa_date

DATE

NULLABLE

When this document first became available at this location. oa_date is calculated differently for different host types and is not available for all oa_locations. See https://support.unpaywall.org/a/solutions/articles/44002063719 for details.

oa_locations.pmh_id

STRING

NULLABLE

OAI-PMH endpoint where we found this location. This is primarily for internal debugging. It’s null for locations that weren’t found using OAI-PMH.

oa_locations.updated

TIMESTAMP

NULLABLE

Time when the data for this location was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663

oa_locations.url

STRING

NULLABLE

The url_for_pdf if there is one; otherwise landing page URL. When we can’t find a url_for_pdf (or there isn’t one), this field uses the url_for_landing_page, which is a useful fallback for some use cases.

oa_locations.url_for_landing_page

STRING

NULLABLE

The URL for a landing page describing this OA copy. When the host_type is “publisher” the landing page usually includes HTML fulltext.

oa_locations.url_for_pdf

STRING

NULLABLE

The URL with a PDF version of this OA copy.

oa_locations.version

STRING

NULLABLE

The content version accessible at this location. We use the DRIVER Guidelines v2.0 VERSION standard (https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings) to define versions of a given article; see those docs for complete definitions of terms.

oa_locations.repository_institution

STRING

NULLABLE

oa_locations.endpoint_id

STRING

NULLABLE

oa_locations.id

STRING

NULLABLE

first_oa_location

RECORD

NULLABLE

The OA Location Object with the earliest oa_date. Returns null if we couldn’t find any OA Locations.

first_oa_location.evidence

STRING

NULLABLE

How we found this OA location. Used for debugging. Don’t depend on the exact contents of this for anything, because values are subject to change without warning.

first_oa_location.host_type

STRING

NULLABLE

The type of host that serves this OA location. There are two possible values: ‘publisher’ means this location is served by the article’s publisher (in practice, this usually means it is hosted on the same domain the DOI resolves to). ‘repository’ means this location is served by an Open Access repository. Preprint servers are considered repositories even if the DOI resolves there.

first_oa_location.is_best

BOOLEAN

NULLABLE

Is this location the best_oa_location for its resource. See the DOI object’s best_oa_location description for more on how we select which location is “best.”

first_oa_location.license

STRING

NULLABLE

The license under which this copy is published. We return several types of licenses: Creative Commons licenses are uniformly abbreviated and lowercased. Example: ‘cc-by-nc’. Publisher-specific licenses are normalized using this format: ‘acs-specific: authorchoice/editors choice usage agreement’. When we have evidence that an OA license of some kind was used, but it’s not reported directly on the webpage at this location, this field returns ‘implied-oa’

first_oa_location.oa_date

DATE

NULLABLE

When this document first became available at this location. oa_date is calculated differently for different host types and is not available for all oa_locations. See https://support.unpaywall.org/a/solutions/articles/44002063719 for details.

first_oa_location.pmh_id

STRING

NULLABLE

OAI-PMH endpoint where we found this location. This is primarily for internal debugging. It’s null for locations that weren’t found using OAI-PMH.

first_oa_location.updated

TIMESTAMP

NULLABLE

Time when the data for this location was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663

first_oa_location.url

STRING

NULLABLE

The url_for_pdf if there is one; otherwise landing page URL. When we can’t find a url_for_pdf (or there isn’t one), this field uses the url_for_landing_page, which is a useful fallback for some use cases.

first_oa_location.url_for_landing_page

STRING

NULLABLE

The URL for a landing page describing this OA copy. When the host_type is “publisher” the landing page usually includes HTML fulltext.

first_oa_location.url_for_pdf

STRING

NULLABLE

The URL with a PDF version of this OA copy.

first_oa_location.version

STRING

NULLABLE

The content version accessible at this location. We use the DRIVER Guidelines v2.0 VERSION standard (https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings) to define versions of a given article; see those docs for complete definitions of terms.

first_oa_location.repository_institution

STRING

NULLABLE

first_oa_location.endpoint_id

STRING

NULLABLE

first_oa_location.id

STRING

NULLABLE

oa_status

STRING

NULLABLE

The OA status, or color, of this resource. Classifies OA resources by location and license terms as one of: gold, hybrid, bronze, green or closed. See here for more information on how we assign an oa_status: https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-

published_date

DATE

NULLABLE

The date this resource was published. As reported by the publishers, who unfortunately have inconsistent definitions of what counts as officially “published.” Returned as an ISO8601-formatted timestamp, generally with only year-month-day.

publisher

STRING

NULLABLE

The name of this resource’s publisher. Keep in mind that publisher name strings change over time, particularly as publishers are acquired or split up.

title

STRING

NULLABLE

The title of this resource.

updated

TIMESTAMP

NULLABLE

Time when the data for this resource was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663

year

INTEGER

NULLABLE

The year this resource was published. Just the year part of the published_date

z_authors

RECORD

REPEATED

The authors of this resource. These are formatted as a list of Crossref Contributor objects, which are described in the Crossref API docs here: https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md#contributor

z_authors.family

STRING

NULLABLE

z_authors.given

STRING

NULLABLE

z_authors.ORCID

STRING

NULLABLE

URL-form of an ORCID identifier

z_authors.authenticated_orcid

BOOLEAN

NULLABLE

If true, record owner asserts that the ORCID user completed ORCID OAuth authentication

z_authors.affiliation

RECORD

REPEATED

z_authors.affiliation.name

STRING

NULLABLE

z_authors.affiliation.id

RECORD

REPEATED

z_authors.affiliation.id.id

STRING

NULLABLE

z_authors.affiliation.id.id-type

STRING

NULLABLE

z_authors.affiliation.id.asserted-by

STRING

NULLABLE

z_authors.affiliation.place

STRING

REPEATED

z_authors.affiliation.department

STRING

REPEATED

z_authors.affiliation.acronym

STRING

REPEATED

z_authors.sequence

STRING

NULLABLE

z_authors.suffix

STRING

NULLABLE

z_authors.name

STRING

NULLABLE

z_authors.raw

STRING

NULLABLE

has_repository_copy

BOOLEAN

NULLABLE

Is a full-text available in a repository?

issn_l

STRING

NULLABLE

x_reported_noncompliant_copies

RECORD

REPEATED

x_reported_noncompliant_copies.blank

STRING

NULLABLE

x_error

BOOLEAN

NULLABLE

oa_locations_embargoed

RECORD

REPEATED

List of OA Location objects associated with this resource that are not yet available. This list includes locations that we expect to be available in the future based on information like license metadata and journals’ delayed OA policies. They do not affect the resource’s oa_status and cannot be the best_oa_location or first_oa_location.

oa_locations_embargoed.evidence

STRING

NULLABLE

How we found this OA location. Used for debugging. Don’t depend on the exact contents of this for anything, because values are subject to change without warning.

oa_locations_embargoed.host_type

STRING

NULLABLE

The type of host that serves this OA location. There are two possible values: ‘publisher’ means this location is served by the article’s publisher (in practice, this usually means it is hosted on the same domain the DOI resolves to). ‘repository’ means this location is served by an Open Access repository. Preprint servers are considered repositories even if the DOI resolves there.

oa_locations_embargoed.is_best

BOOLEAN

NULLABLE

Is this location the best_oa_location for its resource. See the DOI object’s best_oa_location description for more on how we select which location is “best.”

oa_locations_embargoed.license

STRING

NULLABLE

The license under which this copy is published. We return several types of licenses: Creative Commons licenses are uniformly abbreviated and lowercased. Example: ‘cc-by-nc’. Publisher-specific licenses are normalized using this format: ‘acs-specific: authorchoice/editors choice usage agreement’. When we have evidence that an OA license of some kind was used, but it’s not reported directly on the webpage at this location, this field returns ‘implied-oa’

oa_locations_embargoed.oa_date

DATE

NULLABLE

When this document first became available at this location. oa_date is calculated differently for different host types and is not available for all oa_locations. See https://support.unpaywall.org/a/solutions/articles/44002063719 for details.

oa_locations_embargoed.pmh_id

STRING

NULLABLE

OAI-PMH endpoint where we found this location. This is primarily for internal debugging. It’s null for locations that weren’t found using OAI-PMH.

oa_locations_embargoed.updated

TIMESTAMP

NULLABLE

Time when the data for this location was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663

oa_locations_embargoed.url

STRING

NULLABLE

The url_for_pdf if there is one; otherwise landing page URL. When we can’t find a url_for_pdf (or there isn’t one), this field uses the url_for_landing_page, which is a useful fallback for some use cases.

oa_locations_embargoed.url_for_landing_page

STRING

NULLABLE

The URL for a landing page describing this OA copy. When the host_type is “publisher” the landing page usually includes HTML fulltext.

oa_locations_embargoed.url_for_pdf

STRING

NULLABLE

The URL with a PDF version of this OA copy.

oa_locations_embargoed.version

STRING

NULLABLE

The content version accessible at this location. We use the DRIVER Guidelines v2.0 VERSION standard (https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings) to define versions of a given article; see those docs for complete definitions of terms.

oa_locations_embargoed.repository_institution

STRING

NULLABLE

oa_locations_embargoed.endpoint_id

STRING

NULLABLE