Unpaywall
“Unpaywall is a project of Our Research, a nonprofit building tools to help make scholarly research more open, accessible, and reusable.” … they “harvest Open Access content from over 50,000 publishers and repositories, and make it easy to find, track, and use.” – Unpaywall website.
Unpaywall is an “open database of free scholarly articles.” It includes “data from open indexes like Crossref and DOAJ where it exists.” Data comes from “monitoring over 50,000 unique online content hosting locations, including Gold OA journals, Hybrid journals, institutional repositories, and disciplinary repositories.” “Unpaywall assigns an OA Status to every article.” “There are five possible values: closed, green, gold, hybrid, and bronze.” - source: Unpaywall and data details
This telescope uses the Unpaywall Data Feed service. To ingest Unpaywall from the free, non-subscription snapshots, use the Unpaywall Snapshot telescope instead.
The free Unpaywall snapshot service is only updated a few times a year. It is also difficult to find changes from snapshot to snapshot. The Data Feed service rectifies this by providing daily or weekly changefiles. To use the Data Feed:
1. Get the current snapshot from the API service. The snapshot is updated daily and is available with the Data Feed subscription.
2. Get all changefiles, starting with the one whose timestamp is the latest just before the snapshot date.
3. Apply the changefiles to the snapshot in date order, from oldest to newest.
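The steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the telescope's implementation: the function name `apply_changefiles` is hypothetical, and it assumes that every record carries a `doi` key and that a changefile record fully replaces any earlier record with the same DOI.

```python
from datetime import date
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, object]


def apply_changefiles(
    snapshot: Iterable[Record],
    changefiles: List[Tuple[date, List[Record]]],
) -> Dict[str, Record]:
    """Merge changefiles into a snapshot, oldest changefile first.

    Assumption: records are keyed by DOI and a newer record replaces an older one.
    """
    # Index the snapshot by DOI so later records can replace earlier ones.
    merged = {rec["doi"]: rec for rec in snapshot}
    # Sort by changefile date so the newest version of each record wins.
    for _, records in sorted(changefiles, key=lambda item: item[0]):
        for rec in records:
            merged[rec["doi"]] = rec
    return merged
```

Sorting inside the function means the caller can pass changefiles in any order and still get the oldest-to-newest application that the Data Feed procedure requires.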
The Data Feed service requires an API key in order to access the changefiles. See the data feed page for more information on obtaining a key.
On first run, this telescope tries to pull the snapshot for the telescope’s start date. Note that the snapshot hosted at https://api.unpaywall.org/feed/snapshot?api_key=YOUR_API_KEY is currently updated daily by Unpaywall, so make sure you set the telescope’s start date to be equal to that snapshot date.
Subsequent scheduled runs will download the daily changefile from TWO DAYS PRIOR to the scheduled execution date to update the dataset. You should set the scheduled_interval to “@daily” or some other equivalent interval that results in scheduled runs on a daily basis. If runs are interrupted, the telescope attempts to catch up on missed scheduled runs from the start date to the current execution date.
The changefile applied is from two days prior to each scheduled execution date so that we can guarantee data integrity after applying the snapshot.
Unpaywall recommends applying changefiles starting from a timestamp just before the snapshot timestamp. This telescope uses daily changefiles to minimise the amount of overlapping data downloaded. To simplify the update process, we apply daily changefiles to the snapshot starting from one day before the snapshot timestamp. For example, if the first execution date was (2021,7,2), the snapshot from (2021,7,2) is downloaded. The next execution date, (2021,7,3), downloads the daily changefile from (2021,7,1). The execution date after that downloads the daily changefile from (2021,7,2), the following one downloads the daily changefile from (2021,7,3), and so on.
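The date arithmetic in this example can be written down as a small function. This is a sketch of the rule as described here, not the telescope's own code: the name `release_for_run` and the returned tuple shape are hypothetical.

```python
from datetime import date, timedelta
from typing import Tuple


def release_for_run(execution_date: date, start_date: date) -> Tuple[str, date]:
    """Return which file a scheduled run fetches and the date of that file.

    The first run (execution date equal to the start date) fetches the snapshot.
    Every later run fetches the daily changefile from two days earlier.
    """
    if execution_date < start_date:
        raise ValueError("execution_date is before the telescope start date")
    if execution_date == start_date:
        return ("snapshot", execution_date)
    return ("changefile", execution_date - timedelta(days=2))
```

With a start date of 2021-07-02, the run on 2021-07-03 maps to the changefile dated 2021-07-01, which is one day before the snapshot, so the first changefile overlaps the snapshot rather than leaving a gap.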
The telescope maintains a single BigQuery table, which is kept current to two days before the latest scheduled execution date.
Airflow Connection
The telescope requires an Airflow connection named unpaywall with the password set to the API key from Unpaywall for accessing the Data Feed service. For example, the corresponding observatory config.yaml entry could be:
unpaywall: http://:API_KEY@localhost
The connection must be a valid URI supported by Airflow, but only the password component (the API key) is used by this telescope.
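To illustrate which part of the connection URI matters, the sketch below pulls the password component out of a URI of the form shown above and builds the snapshot URL from it. It uses only the Python standard library so that it runs on its own. The function names are hypothetical, and inside Airflow the connection would normally be looked up through Airflow's own connection handling rather than parsed by hand.

```python
from urllib.parse import urlencode, urlparse

# Snapshot endpoint as given in this document.
SNAPSHOT_URL = "https://api.unpaywall.org/feed/snapshot"


def api_key_from_conn_uri(conn_uri: str) -> str:
    """Extract the password component (the API key) from a connection URI."""
    password = urlparse(conn_uri).password
    if not password:
        raise ValueError("connection URI has no password component")
    return password


def snapshot_url(api_key: str) -> str:
    """Build the snapshot download URL for the given API key."""
    return f"{SNAPSHOT_URL}?{urlencode({'api_key': api_key})}"
```

For the example entry, `api_key_from_conn_uri("http://:API_KEY@localhost")` yields the placeholder string `API_KEY`; the scheme and host are ignored, which is why any valid URI works as long as the password is set.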
Summary | |
---|---|
Average runtime | 15 min |
Average download size | 100 MB |
Harvest Type | URL |
Harvest Frequency | Daily |
Runs on remote worker | False |
Catchup missed runs | True |
Table Write Disposition | Append |
Update Frequency | Daily |
Credentials Required | Yes |
Uses Telescope Template | Stream |
Latest schema
name | type | mode | description |
---|---|---|---|
best_oa_location | RECORD | NULLABLE | The best OA Location Object we could find for this DOI. The “best” location is determined using an algorithm that prioritizes publisher-hosted content first (eg Hybrid or Gold), then prioritizes versions closer to the version of record (PublishedVersion over AcceptedVersion), then more authoritative repositories (PubMed Central over CiteSeerX). Returns null if we couldn’t find any OA Locations. |
best_oa_location.evidence | STRING | NULLABLE | How we found this OA location. Used for debugging. Don’t depend on the exact contents of this for anything, because values are subject to change without warning. |
best_oa_location.host_type | STRING | NULLABLE | The type of host that serves this OA location. There are two possible values: ‘publisher’ means this location is served by the article’s publisher (in practice, this usually means it is hosted on the same domain the DOI resolves to). ‘repository’ means this location is served by an Open Access repository. Preprint servers are considered repositories even if the DOI resolves there. |
best_oa_location.is_best | BOOLEAN | NULLABLE | Is this location the best_oa_location for its resource. See the DOI object’s best_oa_location description for more on how we select which location is “best.” |
best_oa_location.license | STRING | NULLABLE | The license under which this copy is published. We return several types of licenses: Creative Commons licenses are uniformly abbreviated and lowercased. Example: ‘cc-by-nc’. Publisher-specific licenses are normalized using this format: ‘acs-specific: authorchoice/editors choice usage agreement’. When we have evidence that an OA license of some kind was used, but it’s not reported directly on the webpage at this location, this field returns ‘implied-oa’ |
best_oa_location.oa_date | DATE | NULLABLE | When this document first became available at this location. oa_date is calculated differently for different host types and is not available for all oa_locations. See https://support.unpaywall.org/a/solutions/articles/44002063719 for details. |
best_oa_location.pmh_id | STRING | NULLABLE | OAI-PMH endpoint where we found this location. This is primarily for internal debugging. It’s null for locations that weren’t found using OAI-PMH. |
best_oa_location.updated | TIMESTAMP | NULLABLE | Time when the data for this location was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663 |
best_oa_location.url | STRING | NULLABLE | The url_for_pdf if there is one; otherwise landing page URL. When we can’t find a url_for_pdf (or there isn’t one), this field uses the url_for_landing_page, which is a useful fallback for some use cases. |
best_oa_location.url_for_landing_page | STRING | NULLABLE | The URL for a landing page describing this OA copy. When the host_type is “publisher” the landing page usually includes HTML fulltext. |
best_oa_location.url_for_pdf | STRING | NULLABLE | The URL with a PDF version of this OA copy. |
best_oa_location.version | STRING | NULLABLE | The content version accessible at this location. We use the DRIVER Guidelines v2.0 VERSION standard (https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings) to define versions of a given article; see those docs for complete definitions of terms. |
best_oa_location.repository_institution | STRING | NULLABLE | |
best_oa_location.endpoint_id | STRING | NULLABLE | |
best_oa_location.id | STRING | NULLABLE | |
data_standard | INTEGER | NULLABLE | Indicates the data collection approaches used for this resource. Possible values: ‘1’ First-generation hybrid detection. Uses only data from the Crossref API to determine hybrid status. Does a good job for Elsevier articles and a few other publishers, but most publishers are not checked for hybrid. ‘2’ Second-generation hybrid detection. Uses additional sources, checks all publishers for hybrid. Gets about 10x as much hybrid. data_standard==2 is the version used in the paper we wrote about the dataset. |
doi | STRING | REQUIRED | The DOI of this resource. This is always lowercase. |
doi_url | STRING | NULLABLE | The DOI in hyperlink form. This field simply contains “https://doi.org/” prepended to the doi field. It expresses the DOI in its correct format according to the Crossref DOI display guidelines. |
genre | STRING | NULLABLE | The type of resource. Currently the genre is identical to the Crossref-reported type of a given resource. The “journal-article” type is most common, but there are many others. |
is_paratext | BOOLEAN | NULLABLE | Is the item an ancillary part of a journal, like a table of contents? See here for more information on how we determine whether an article is paratext: https://support.unpaywall.org/support/solutions/articles/44001894783. |
is_oa | BOOLEAN | NULLABLE | Is there an OA copy of this resource. Convenience attribute; returns true when best_oa_location is not null. |
journal_is_in_doaj | BOOLEAN | NULLABLE | Is this resource published in a DOAJ-indexed journal. Useful for defining whether a resource is Gold OA (depending on your definition, see also journal_is_oa). |
journal_is_oa | BOOLEAN | NULLABLE | Is this resource published in a completely OA journal. Useful for defining whether a resource is Gold OA. Includes any fully-OA journal, regardless of inclusion in DOAJ. This includes journals by all-OA publishers and journals that would otherwise be all Hybrid or Bronze OA. |
journal_issns | STRING | NULLABLE | Any ISSNs assigned to the journal publishing this resource. Separate ISSNs are sometimes assigned to print and electronic versions of the same journal. If there are multiple ISSNs, they are separated by commas. Example: 1232-1203,1532-6203 |
journal_issn_l | STRING | NULLABLE | A single ISSN for the journal publishing this resource. An ISSN-L can be used as a primary key for a journal when more than one ISSN is assigned to it. Resources’ journal_issns are mapped to ISSN-Ls using the issn.org table, with some manual corrections. |
journal_name | STRING | NULLABLE | The name of the journal publishing this resource. The same journal may have multiple name strings (eg, “J. Foo”, “Journal of Foo”, “JOURNAL OF FOO”, etc). These have not been fully normalized within our database, so use with care. |
oa_locations | RECORD | REPEATED | List of all the OA Location objects associated with this resource. This list is unnecessary for the vast majority of use-cases, since you probably just want the best_oa_location. It’s included primarily for research purposes. |
oa_locations.evidence | STRING | NULLABLE | How we found this OA location. Used for debugging. Don’t depend on the exact contents of this for anything, because values are subject to change without warning. |
oa_locations.host_type | STRING | NULLABLE | The type of host that serves this OA location. There are two possible values: ‘publisher’ means this location is served by the article’s publisher (in practice, this usually means it is hosted on the same domain the DOI resolves to). ‘repository’ means this location is served by an Open Access repository. Preprint servers are considered repositories even if the DOI resolves there. |
oa_locations.is_best | BOOLEAN | NULLABLE | Is this location the best_oa_location for its resource. See the DOI object’s best_oa_location description for more on how we select which location is “best.” |
oa_locations.license | STRING | NULLABLE | The license under which this copy is published. We return several types of licenses: Creative Commons licenses are uniformly abbreviated and lowercased. Example: ‘cc-by-nc’. Publisher-specific licenses are normalized using this format: ‘acs-specific: authorchoice/editors choice usage agreement’. When we have evidence that an OA license of some kind was used, but it’s not reported directly on the webpage at this location, this field returns ‘implied-oa’ |
oa_locations.oa_date | DATE | NULLABLE | When this document first became available at this location. oa_date is calculated differently for different host types and is not available for all oa_locations. See https://support.unpaywall.org/a/solutions/articles/44002063719 for details. |
oa_locations.pmh_id | STRING | NULLABLE | OAI-PMH endpoint where we found this location. This is primarily for internal debugging. It’s null for locations that weren’t found using OAI-PMH. |
oa_locations.updated | TIMESTAMP | NULLABLE | Time when the data for this location was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663 |
oa_locations.url | STRING | NULLABLE | The url_for_pdf if there is one; otherwise landing page URL. When we can’t find a url_for_pdf (or there isn’t one), this field uses the url_for_landing_page, which is a useful fallback for some use cases. |
oa_locations.url_for_landing_page | STRING | NULLABLE | The URL for a landing page describing this OA copy. When the host_type is “publisher” the landing page usually includes HTML fulltext. |
oa_locations.url_for_pdf | STRING | NULLABLE | The URL with a PDF version of this OA copy. |
oa_locations.version | STRING | NULLABLE | The content version accessible at this location. We use the DRIVER Guidelines v2.0 VERSION standard (https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings) to define versions of a given article; see those docs for complete definitions of terms. |
oa_locations.repository_institution | STRING | NULLABLE | |
oa_locations.endpoint_id | STRING | NULLABLE | |
oa_locations.id | STRING | NULLABLE | |
first_oa_location | RECORD | NULLABLE | The OA Location Object with the earliest oa_date. Returns null if we couldn’t find any OA Locations. |
first_oa_location.evidence | STRING | NULLABLE | How we found this OA location. Used for debugging. Don’t depend on the exact contents of this for anything, because values are subject to change without warning. |
first_oa_location.host_type | STRING | NULLABLE | The type of host that serves this OA location. There are two possible values: ‘publisher’ means this location is served by the article’s publisher (in practice, this usually means it is hosted on the same domain the DOI resolves to). ‘repository’ means this location is served by an Open Access repository. Preprint servers are considered repositories even if the DOI resolves there. |
first_oa_location.is_best | BOOLEAN | NULLABLE | Is this location the best_oa_location for its resource. See the DOI object’s best_oa_location description for more on how we select which location is “best.” |
first_oa_location.license | STRING | NULLABLE | The license under which this copy is published. We return several types of licenses: Creative Commons licenses are uniformly abbreviated and lowercased. Example: ‘cc-by-nc’. Publisher-specific licenses are normalized using this format: ‘acs-specific: authorchoice/editors choice usage agreement’. When we have evidence that an OA license of some kind was used, but it’s not reported directly on the webpage at this location, this field returns ‘implied-oa’ |
first_oa_location.oa_date | DATE | NULLABLE | When this document first became available at this location. oa_date is calculated differently for different host types and is not available for all oa_locations. See https://support.unpaywall.org/a/solutions/articles/44002063719 for details. |
first_oa_location.pmh_id | STRING | NULLABLE | OAI-PMH endpoint where we found this location. This is primarily for internal debugging. It’s null for locations that weren’t found using OAI-PMH. |
first_oa_location.updated | TIMESTAMP | NULLABLE | Time when the data for this location was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663 |
first_oa_location.url | STRING | NULLABLE | The url_for_pdf if there is one; otherwise landing page URL. When we can’t find a url_for_pdf (or there isn’t one), this field uses the url_for_landing_page, which is a useful fallback for some use cases. |
first_oa_location.url_for_landing_page | STRING | NULLABLE | The URL for a landing page describing this OA copy. When the host_type is “publisher” the landing page usually includes HTML fulltext. |
first_oa_location.url_for_pdf | STRING | NULLABLE | The URL with a PDF version of this OA copy. |
first_oa_location.version | STRING | NULLABLE | The content version accessible at this location. We use the DRIVER Guidelines v2.0 VERSION standard (https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings) to define versions of a given article; see those docs for complete definitions of terms. |
first_oa_location.repository_institution | STRING | NULLABLE | |
first_oa_location.endpoint_id | STRING | NULLABLE | |
first_oa_location.id | STRING | NULLABLE | |
oa_status | STRING | NULLABLE | The OA status, or color, of this resource. Classifies OA resources by location and license terms as one of: gold, hybrid, bronze, green or closed. See here for more information on how we assign an oa_status: https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean- |
published_date | DATE | NULLABLE | The date this resource was published. As reported by the publishers, who unfortunately have inconsistent definitions of what counts as officially “published.” Returned as an ISO8601-formatted timestamp, generally with only year-month-day. |
publisher | STRING | NULLABLE | The name of this resource’s publisher. Keep in mind that publisher name strings change over time, particularly as publishers are acquired or split up. |
title | STRING | NULLABLE | The title of this resource. |
updated | TIMESTAMP | NULLABLE | Time when the data for this resource was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663 |
year | INTEGER | NULLABLE | The year this resource was published. Just the year part of the published_date |
z_authors | RECORD | REPEATED | The authors of this resource. These are formatted as a list of Crossref Contributor objects, which are described in the Crossref API docs here: https://github.com/CrossRef/rest-api-doc/blob/master/api_format.md#contributor |
z_authors.family | STRING | NULLABLE | |
z_authors.given | STRING | NULLABLE | |
z_authors.ORCID | STRING | NULLABLE | URL-form of an ORCID identifier |
z_authors.authenticated_orcid | BOOLEAN | NULLABLE | If true, record owner asserts that the ORCID user completed ORCID OAuth authentication |
z_authors.affiliation | RECORD | REPEATED | |
z_authors.affiliation.name | STRING | NULLABLE | |
z_authors.affiliation.id | RECORD | REPEATED | |
z_authors.affiliation.id.id | STRING | NULLABLE | |
z_authors.affiliation.id.id-type | STRING | NULLABLE | |
z_authors.affiliation.id.asserted-by | STRING | NULLABLE | |
z_authors.affiliation.place | STRING | REPEATED | |
z_authors.affiliation.department | STRING | REPEATED | |
z_authors.affiliation.acronym | STRING | REPEATED | |
z_authors.sequence | STRING | NULLABLE | |
z_authors.suffix | STRING | NULLABLE | |
z_authors.name | STRING | NULLABLE | |
z_authors.raw | STRING | NULLABLE | |
has_repository_copy | BOOLEAN | NULLABLE | Is a full-text available in a repository? |
issn_l | STRING | NULLABLE | |
x_reported_noncompliant_copies | RECORD | REPEATED | |
x_reported_noncompliant_copies.blank | STRING | NULLABLE | |
x_error | BOOLEAN | NULLABLE | |
oa_locations_embargoed | RECORD | REPEATED | List of OA Location objects associated with this resource that are not yet available. This list includes locations that we expect to be available in the future based on information like license metadata and journals’ delayed OA policies. They do not affect the resource’s oa_status and cannot be the best_oa_location or first_oa_location. |
oa_locations_embargoed.evidence | STRING | NULLABLE | How we found this OA location. Used for debugging. Don’t depend on the exact contents of this for anything, because values are subject to change without warning. |
oa_locations_embargoed.host_type | STRING | NULLABLE | The type of host that serves this OA location. There are two possible values: ‘publisher’ means this location is served by the article’s publisher (in practice, this usually means it is hosted on the same domain the DOI resolves to). ‘repository’ means this location is served by an Open Access repository. Preprint servers are considered repositories even if the DOI resolves there. |
oa_locations_embargoed.is_best | BOOLEAN | NULLABLE | Is this location the best_oa_location for its resource. See the DOI object’s best_oa_location description for more on how we select which location is “best.” |
oa_locations_embargoed.license | STRING | NULLABLE | The license under which this copy is published. We return several types of licenses: Creative Commons licenses are uniformly abbreviated and lowercased. Example: ‘cc-by-nc’. Publisher-specific licenses are normalized using this format: ‘acs-specific: authorchoice/editors choice usage agreement’. When we have evidence that an OA license of some kind was used, but it’s not reported directly on the webpage at this location, this field returns ‘implied-oa’ |
oa_locations_embargoed.oa_date | DATE | NULLABLE | When this document first became available at this location. oa_date is calculated differently for different host types and is not available for all oa_locations. See https://support.unpaywall.org/a/solutions/articles/44002063719 for details. |
oa_locations_embargoed.pmh_id | STRING | NULLABLE | OAI-PMH endpoint where we found this location. This is primarily for internal debugging. It’s null for locations that weren’t found using OAI-PMH. |
oa_locations_embargoed.updated | TIMESTAMP | NULLABLE | Time when the data for this location was last updated. Returned as an ISO8601-formatted timestamp. Example: 2017-08-17T23:43:27.753663 |
oa_locations_embargoed.url | STRING | NULLABLE | The url_for_pdf if there is one; otherwise landing page URL. When we can’t find a url_for_pdf (or there isn’t one), this field uses the url_for_landing_page, which is a useful fallback for some use cases. |
oa_locations_embargoed.url_for_landing_page | STRING | NULLABLE | The URL for a landing page describing this OA copy. When the host_type is “publisher” the landing page usually includes HTML fulltext. |
oa_locations_embargoed.url_for_pdf | STRING | NULLABLE | The URL with a PDF version of this OA copy. |
oa_locations_embargoed.version | STRING | NULLABLE | The content version accessible at this location. We use the DRIVER Guidelines v2.0 VERSION standard (https://wiki.surfnet.nl/display/DRIVERguidelines/DRIVER-VERSION+Mappings) to define versions of a given article; see those docs for complete definitions of terms. |
oa_locations_embargoed.repository_institution | STRING | NULLABLE | |
oa_locations_embargoed.endpoint_id | STRING | NULLABLE |