OpenAlex

OpenAlex is a fully open catalog of the global research system. It’s named after the ancient Library of Alexandria.

The OpenAlex dataset describes scholarly entities and how those entities are connected to each other. There are five types of entities:

  • Works are papers, books, datasets, etc.; they cite other works

  • Authors are people who create works

  • Venues are journals and repositories that host works

  • Institutions are universities and other orgs that are affiliated with works (via authors)

  • Concepts tag Works with a topic

Together, these make a huge web (or more technically, heterogeneous directed graph) of hundreds of millions of entities and over a billion connections between them all.

See https://docs.openalex.org/ for more information.

This telescope transfers OpenAlex data from an AWS S3 bucket and loads it into multiple tables in BigQuery, with one table for each entity (Works, Authors, Venues, Institutions, Concepts).
The first run will process all the files that are available in the S3 bucket. A manifest file is used for later runs to keep track of which files have changed since the last run. Only the files that have changed will then be processed in this telescope.
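
The manifest comparison can be sketched as follows. This is a minimal illustration, not the telescope's actual code: it assumes a manifest shaped as a dictionary with an `entries` list, where each entry has a `url` and a `meta` object (for example `content_length` and `record_count`). The function name `changed_entries` and the rule "new URL or different metadata means changed" are hypothetical.

```python
from typing import Dict, List


def changed_entries(previous: Dict, current: Dict) -> List[str]:
    """Return the URLs in `current` that are new or whose metadata differs from `previous`.

    Assumed (hypothetical) manifest shape:
    {"entries": [{"url": "...", "meta": {"content_length": 1, "record_count": 1}}]}
    """
    # Index the previous manifest by URL so each lookup is O(1).
    seen = {entry["url"]: entry.get("meta") for entry in previous.get("entries", [])}
    return [
        entry["url"]
        for entry in current.get("entries", [])
        if entry["url"] not in seen or seen[entry["url"]] != entry.get("meta")
    ]
```

On a first run there is no previous manifest, so passing an empty dictionary as `previous` returns every file in `current`.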

The data for the Authors and Venues entities do not require any transformations before loading into BigQuery. This means that the files for these entities are directly transferred to the transform bucket.

The other entities do require some transformation and those files are transferred to the download bucket. After transforming the data the resulting files are then uploaded to the transform bucket.

The transformation is needed because two fields have nested fields with dynamic field names, which makes it impossible to define a schema beforehand and load the data straight into BigQuery. The two fields are ‘abstract_inverted_index’ (present in the Work entity only) and ‘international’ (present in the Concept and Institution entities).

As a workaround, these fields are transformed into a RECORD of two arrays of the same length. The first array contains all the original field names and the second array the corresponding values.
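
The workaround can be illustrated with a short sketch. The helper name `to_keys_values` and the sample record are made up for illustration; the telescope's own implementation may differ, for example in how it encodes the values.

```python
from typing import Any, Dict, List


def to_keys_values(field: Dict[str, Any]) -> Dict[str, List[Any]]:
    """Split a dict with dynamic keys into two parallel arrays of equal length."""
    keys = list(field.keys())
    # Look values up by key so both arrays are guaranteed to share one order.
    values = [field[key] for key in keys]
    return {"keys": keys, "values": values}


# Hypothetical Concept record: the language codes are the dynamic field names.
concept = {"international": {"display_name": {"en": "Physics", "fr": "Physique"}}}
concept["international"]["display_name"] = to_keys_values(
    concept["international"]["display_name"]
)
print(concept)
```

After the transformation the field names are fixed (`keys` and `values`), so a static schema can describe the record regardless of which languages are present.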

Summary

  • Average runtime: 12-24h

  • Average download size: >100GB

  • Harvest Type: AWS transfer

  • Workflow Update Frequency: Weekly

  • Runs on remote worker: True

  • Catchup missed runs: False

  • Table Write Disposition: Append

  • Provider Update Frequency: Weekly

  • Credentials Required: No

  • Uses Workflow Template: Stream

  • Each shard includes all data: No

Using the transfer service

The files in the AWS bucket are transferred to a separate Google Cloud Storage bucket using the Storage Transfer Service. To use the transfer service, the Storage Transfer API must be enabled and the correct permissions must be set on both the Google Cloud Storage bucket and the AWS bucket.

Enabling the Storage Transfer API

The API should already be enabled from the Terraform set-up. If this is not the case, see the Google support answer for info on how to enable an API. Search for the Storage Transfer API and enable it.

Setting permissions on Google Cloud bucket

The data is transferred to the standard download bucket and the following permissions are required on this Google Cloud bucket for the transfer service to work:

  • storage.buckets.get

  • storage.objects.list

  • storage.objects.get

  • storage.objects.create

The roles/storage.objectViewer and roles/storage.legacyBucketWriter roles together contain the permissions that are always required. These roles or permissions need to be assigned on the specific bucket to the service account performing the transfer.

The Storage Transfer Service uses the project-[$PROJECT_NUMBER]@storage-transfer-service.iam.gserviceaccount.com service account.
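
As a rough sketch of what needs to be granted, the snippet below builds the service account member string and the two role bindings in the shape used by IAM policies (`role` plus `members`). The function names and the sample project number are hypothetical, and the bindings still have to be applied to the bucket (for example through the Cloud Console or a client library), which is not shown here.

```python
from typing import Dict, List

# The two roles named above that together cover the required permissions.
REQUIRED_ROLES = ["roles/storage.objectViewer", "roles/storage.legacyBucketWriter"]


def transfer_service_member(project_number: str) -> str:
    """Build the IAM member string for the Storage Transfer Service account."""
    return (
        f"serviceAccount:project-{project_number}"
        "@storage-transfer-service.iam.gserviceaccount.com"
    )


def bucket_bindings(project_number: str) -> List[Dict]:
    """Return one IAM binding per required role for the transfer service account."""
    member = transfer_service_member(project_number)
    return [{"role": role, "members": [member]} for role in REQUIRED_ROLES]


# "123456789" is a placeholder project number, not a real project.
for binding in bucket_bindings("123456789"):
    print(binding)
```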

Setting permissions on AWS bucket

The AWS bucket is managed by OpenAlex; the bucket that is used is s3://openalex. The data in this bucket is publicly available and no permissions are required to download or inspect the data using the AWS S3 CLI.

However, the transfer service in GCP does require permissions to transfer the data, so it is required to create a user from the AWS console with programmatic access (using a key id and secret key).

The key id and secret access key that are created can then be used for the Airflow connection that is described below.

The required policy that needs to be assigned to this user is:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::openalex"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::openalex/*"
            ]
        }
    ]
}

Airflow connections

Note that all values need to be URL-encoded. In the config.yaml file, the following Airflow connections are required:

openalex

This connection contains the AWS access key id and secret access key that are used to access data in the AWS buckets. Make sure to URL encode each of the fields ‘access_key_id’ and ‘secret_access_key’.

openalex: aws://<access_key_id>:<secret_access_key>@
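
A small sketch of how the connection value could be built with Python's standard library. The function name `build_openalex_connection` and the sample credentials are hypothetical; only `urllib.parse.quote` is a real library call.

```python
from urllib.parse import quote


def build_openalex_connection(access_key_id: str, secret_access_key: str) -> str:
    """Return the connection URI with both credential fields URL-encoded."""
    # safe="" makes quote() also encode "/", which can occur in AWS secret keys.
    key_id = quote(access_key_id, safe="")
    secret = quote(secret_access_key, safe="")
    return f"aws://{key_id}:{secret}@"


# Placeholder credentials for illustration only.
print(build_openalex_connection("AKIAEXAMPLE", "abc/def+ghi"))
```

Encoding each field separately matters because characters such as "/", "+", ":" and "@" would otherwise be read as part of the URI structure.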

Latest schema

Author

name

type

mode

description

cited_by_count

INTEGER

NULLABLE

The total number of Works that cite a work this author has created.

counts_by_year

RECORD

REPEATED

Author.works_count and Author.cited_by_count for each of the last ten years, binned by year. To put it another way: each year, you can see how many works this author published, and how many times they got cited. Any works or citations more than ten years old aren’t included.

counts_by_year.cited_by_count

INTEGER

NULLABLE

The total number of Works that cite a work this author has created.

counts_by_year.oa_works_count

INTEGER

NULLABLE

counts_by_year.works_count

INTEGER

NULLABLE

The number of Works this author has created.

counts_by_year.year

INTEGER

NULLABLE

The year.

created_date

DATE

NULLABLE

The date this Author object was created in the OpenAlex dataset, expressed as an ISO 8601 date string.

display_name

STRING

NULLABLE

The name of the author as a single string.

display_name_alternatives

STRING

REPEATED

Other ways that we’ve found this author’s name displayed.

id

STRING

NULLABLE

The OpenAlex ID for this author.

ids

RECORD

NULLABLE

All the persistent identifiers (PIDs) that we know about for this author, as key: value pairs, where key is the PID namespace, and value is the PID. IDs are expressed as URIs where possible. The openalex ID is the same one you’ll find at Author.id. All the IDs are strings except for mag, which is an integer.

ids.mag

INTEGER

NULLABLE

this author’s Microsoft Academic Graph ID

ids.openalex

STRING

NULLABLE

this author’s OpenAlex ID. Same as Author.id

ids.orcid

STRING

NULLABLE

this author’s ORCID ID. Same as Author.orcid

ids.scopus

STRING

NULLABLE

this author’s Scopus author ID

ids.twitter

STRING

NULLABLE

this author’s Twitter handle

ids.wikipedia

STRING

NULLABLE

this author’s Wikipedia page

last_known_institution

RECORD

NULLABLE

This author’s last known institutional affiliation. In this context “last known” means that we took all the Works where this author has an institutional affiliation, sorted them by publication date, and selected the most recent one. This is a dehydrated Institution object, and you can find more documentation on the Institution page.

last_known_institution.country_code

STRING

NULLABLE

The country where this institution is located, represented as an ISO two-letter country code.

last_known_institution.display_name

STRING

NULLABLE

The primary name of the institution.

last_known_institution.id

STRING

NULLABLE

The OpenAlex ID for this institution.

last_known_institution.lineage

STRING

REPEATED

OpenAlex IDs of institutions. The list will include this institution’s ID, as well as any parent institutions. If this institution has no parent institutions, this list will only contain its own ID.

last_known_institution.ror

STRING

NULLABLE

The ROR ID for this institution. The ROR (Research Organization Registry) identifier is a globally unique ID for research organizations. ROR is the successor to GRID, which is no longer being updated.

last_known_institution.type

STRING

NULLABLE

The institution’s primary type, using the ROR “type” controlled vocabulary. Possible values are: Education, Healthcare, Company, Archive, Nonprofit, Government, Facility, and Other.

most_cited_work

STRING

NULLABLE

The title of the most cited work.

orcid

STRING

NULLABLE

The ORCID for this author. ORCID is a global and unique ID for authors.

summary_stats

RECORD

NULLABLE

Citation metrics for this author.

summary_stats.2yr_cited_by_count

INTEGER

NULLABLE

summary_stats.2yr_h_index

INTEGER

NULLABLE

summary_stats.2yr_i10_index

INTEGER

NULLABLE

summary_stats.2yr_mean_citedness

FLOAT

NULLABLE

summary_stats.2yr_works_count

INTEGER

NULLABLE

summary_stats.cited_by_count

INTEGER

NULLABLE

summary_stats.h_index

INTEGER

NULLABLE

summary_stats.i10_index

INTEGER

NULLABLE

summary_stats.oa_percent

FLOAT

NULLABLE

summary_stats.works_count

INTEGER

NULLABLE

updated_date

TIMESTAMP

NULLABLE

The last time anything in this author object changed, expressed as an ISO 8601 date string. This date is updated for any change at all, including increases in various counts.

works_api_url

STRING

NULLABLE

A URL that will get you a list of all this author’s works. We express this as an API URL (instead of just listing the works themselves) because sometimes an author’s publication list is too long to reasonably fit into a single author object.

works_count

INTEGER

NULLABLE

The number of Works this author has created.

x_concepts

RECORD

REPEATED

The “x” in x_concepts is because it’s experimental and subject to removal with very little warning. We plan to replace it with a custom link to the Concepts API endpoint.

x_concepts.display_name

STRING

NULLABLE

The English-language label of the concept.

x_concepts.id

STRING

NULLABLE

The OpenAlex ID for this concept.

x_concepts.level

INTEGER

NULLABLE

The level in the concept tree where this concept lives.

x_concepts.score

FLOAT

NULLABLE

The strength of association between this author and the listed concept, from 0-100.

x_concepts.wikidata

STRING

NULLABLE

The Wikidata ID for this concept.

Concept

name

type

mode

description

ancestors

RECORD

REPEATED

List of concepts that this concept descends from, as dehydrated Concept objects. See the concept tree section for more details on how the different layers of concepts work together.

ancestors.display_name

STRING

NULLABLE

The English-language label of the concept.

ancestors.id

STRING

NULLABLE

The OpenAlex ID for this concept.

ancestors.level

INTEGER

NULLABLE

The level in the concept tree where this concept lives.

ancestors.wikidata

STRING

NULLABLE

The Wikidata ID for this concept.

cited_by_count

INTEGER

NULLABLE

The number of citations to works that have been tagged with this concept. Or less formally: the number of citations to this concept. For example, if there are just two works tagged with this concept and one of them has been cited 10 times, and the other has been cited 1 time, cited_by_count for this concept would be 11.

counts_by_year

RECORD

REPEATED

The values of works_count and cited_by_count for each of the last ten years, binned by year. To put it another way: for every listed year, you can see how many new works were tagged with this concept, and how many times any work tagged with this concept got cited.

counts_by_year.cited_by_count

INTEGER

NULLABLE

The number of citations to works that have been tagged with this concept. Or less formally: the number of citations to this concept. For example, if there are just two works tagged with this concept and one of them has been cited 10 times, and the other has been cited 1 time, cited_by_count for this concept would be 11.

counts_by_year.oa_works_count

INTEGER

NULLABLE

counts_by_year.works_count

INTEGER

NULLABLE

The number of works tagged with this concept.

counts_by_year.year

INTEGER

NULLABLE

The year.

created_date

DATE

NULLABLE

The date this Concept object was created in the OpenAlex dataset, expressed as an ISO 8601 date string.

description

STRING

NULLABLE

A brief description of this concept.

display_name

STRING

NULLABLE

The English-language label of the concept.

id

STRING

NULLABLE

The OpenAlex ID for this concept.

ids

RECORD

NULLABLE

All the persistent identifiers (PIDs) that we know about for this concept, as key: value pairs, where key is the PID namespace, and value is the PID. IDs are expressed as URIs where possible. umls_aui and umls_cui refer to the Unified Medical Language System Atom Unique Identifier and Concept Unique Identifier respectively. These are lists. The other IDs are all strings, except for mag, which is a long integer.

ids.mag

INTEGER

NULLABLE

this concept’s Microsoft Academic Graph ID

ids.openalex

STRING

NULLABLE

this concept’s OpenAlex ID. Same as Concept.id

ids.umls_aui

STRING

REPEATED

this concept’s Unified Medical Language System Atom Unique Identifiers

ids.umls_cui

STRING

REPEATED

this concept’s Unified Medical Language System Concept Unique Identifiers

ids.wikidata

STRING

NULLABLE

this concept’s Wikidata ID. Same as Concept.wikidata

ids.wikipedia

STRING

NULLABLE

this concept’s Wikipedia page URL

image_thumbnail_url

STRING

NULLABLE

Same as image_url, but it’s a smaller image.

image_url

STRING

NULLABLE

URL where you can get an image representing this concept, where available. Usually this is hosted on Wikipedia.

international

RECORD

NULLABLE

Translation of the display_name and description into multiple languages.

international.description

RECORD

NULLABLE

This concept’s description in many languages, derived from article titles on each language’s wikipedia.

international.description.keys

STRING

REPEATED

The language codes in wikidata language code format.

international.description.values

STRING

REPEATED

The translated descriptions in each language.

international.display_name

RECORD

NULLABLE

This concept’s display name in many languages, derived from article titles on each language’s wikipedia.

international.display_name.keys

STRING

REPEATED

The language codes in wikidata language code format.

international.display_name.values

STRING

REPEATED

The translated display_names in each language.

level

INTEGER

NULLABLE

The level in the concept tree where this concept lives. Lower-level concepts are more general, and higher-level concepts are more specific. Computer Science has a level of 0; Java Bytecode has a level of 5. Level 0 concepts have no ancestors and level 5 concepts have no descendants.

related_concepts

RECORD

REPEATED

Concepts that are similar to this one. Each listed concept is a dehydrated Concept object, with one additional attribute: score.

related_concepts.display_name

STRING

NULLABLE

The English-language label of the concept.

related_concepts.id

STRING

NULLABLE

The OpenAlex ID for this concept.

related_concepts.level

INTEGER

NULLABLE

The level in the concept tree where this concept lives.

related_concepts.score

FLOAT

NULLABLE

The strength of association between this concept and the listed concept, on a scale of 0-100.

related_concepts.wikidata

STRING

NULLABLE

The Wikidata ID for this concept.

summary_stats

RECORD

NULLABLE

Citation metrics for this concept.

summary_stats.2yr_cited_by_count

INTEGER

NULLABLE

summary_stats.2yr_h_index

INTEGER

NULLABLE

summary_stats.2yr_i10_index

INTEGER

NULLABLE

summary_stats.2yr_mean_citedness

FLOAT

NULLABLE

summary_stats.2yr_works_count

INTEGER

NULLABLE

summary_stats.cited_by_count

INTEGER

NULLABLE

summary_stats.h_index

INTEGER

NULLABLE

summary_stats.i10_index

INTEGER

NULLABLE

summary_stats.oa_percent

FLOAT

NULLABLE

summary_stats.works_count

INTEGER

NULLABLE

updated_date

TIMESTAMP

NULLABLE

The last time anything in this concept object changed, expressed as an ISO 8601 date string. This date is updated for any change at all, including increases in various counts.

wikidata

STRING

NULLABLE

The Wikidata ID for this concept. This is the Canonical External ID for concepts.

works_api_url

STRING

NULLABLE

A URL that will get you a list of all the works tagged with this concept. We express this as an API URL (instead of just listing the works themselves) because there might be millions of works tagged with this concept, and that’s too many to fit here.

works_count

INTEGER

NULLABLE

The number of works tagged with this concept.

Institution

name

type

mode

description

associated_institutions

RECORD

REPEATED

associated_institutions.country_code

STRING

NULLABLE

The country where this institution is located, represented as an ISO two-letter country code.

associated_institutions.display_name

STRING

NULLABLE

The primary name of the institution.

associated_institutions.id

STRING

NULLABLE

The OpenAlex ID for this institution.

associated_institutions.relationship

STRING

NULLABLE

The type of relationship between this institution and the listed institution. Possible values: parent, child, and related.

associated_institutions.ror

STRING

NULLABLE

The ROR ID for this institution. The ROR (Research Organization Registry) identifier is a globally unique ID for research organizations. ROR is the successor to GRID, which is no longer being updated.

associated_institutions.type

STRING

NULLABLE

The institution’s primary type, using the ROR “type” controlled vocabulary. Possible values are: Education, Healthcare, Company, Archive, Nonprofit, Government, Facility, and Other.

cited_by_count

INTEGER

NULLABLE

The total number of Works that cite a work created by an author affiliated with this institution. Or less formally: the number of citations this institution has collected.

country_code

STRING

NULLABLE

The country where this institution is located, represented as an ISO two-letter country code.

counts_by_year

RECORD

REPEATED

works_count and cited_by_count for each of the last ten years, binned by year. To put it another way: each year, you can see how many new works this institution put out, and how many times any work affiliated with this institution got cited.

counts_by_year.cited_by_count

INTEGER

NULLABLE

The total number of Works that cite a work created by an author affiliated with this institution. Or less formally: the number of citations this institution has collected.

counts_by_year.oa_works_count

INTEGER

NULLABLE

counts_by_year.works_count

INTEGER

NULLABLE

The number of Works created by authors affiliated with this institution. Or less formally: the number of works coming out of this institution.

counts_by_year.year

INTEGER

NULLABLE

The year.

created_date

DATE

NULLABLE

The date this Institution object was created in the OpenAlex dataset, expressed as an ISO 8601 date string.

display_name

STRING

NULLABLE

The primary name of the institution.

display_name_acronyms

STRING

REPEATED

Acronyms or initialisms that people sometimes use instead of the full display_name.

display_name_alternatives

STRING

REPEATED

Other names people may use for this institution.

geo

RECORD

NULLABLE

A bunch of stuff we know about the location of this institution

geo.city

STRING

NULLABLE

The city where this institution lives.

geo.country

STRING

NULLABLE

The country where this institution lives.

geo.country_code

STRING

NULLABLE

The country where this institution lives, represented as an ISO two-letter country code.

geo.geonames_city_id

STRING

NULLABLE

The city where this institution lives, as a GeoNames database ID.

geo.latitude

FLOAT

NULLABLE

Does what it says.

geo.longitude

FLOAT

NULLABLE

Does what it says.

geo.region

STRING

NULLABLE

The sub-national region (state, province) where this institution lives.

homepage_url

STRING

NULLABLE

The URL for this institution’s primary homepage.

id

STRING

NULLABLE

The OpenAlex ID for this institution.

ids

RECORD

NULLABLE

All the persistent identifiers (PIDs) that we know about for this institution, as key: value pairs, where key is the PID namespace, and value is the PID. IDs are expressed as URIs where possible. They’re all strings except for mag, which is a long integer.

ids.grid

STRING

NULLABLE

this institution’s GRID ID

ids.mag

INTEGER

NULLABLE

this institution’s Microsoft Academic Graph ID

ids.openalex

STRING

NULLABLE

this institution’s OpenAlex ID. Same as Institution.id

ids.ror

STRING

NULLABLE

this institution’s ROR ID. Same as Institution.ror

ids.wikidata

STRING

NULLABLE

this institution’s Wikidata ID

ids.wikipedia

STRING

NULLABLE

this institution’s Wikipedia page URL

image_thumbnail_url

STRING

NULLABLE

Same as image_url, but it’s a smaller image.

image_url

STRING

NULLABLE

URL where you can get an image representing this institution. Usually this is hosted on Wikipedia, and usually it’s a seal or logo.

international

RECORD

NULLABLE

Translation of the display_name and description into multiple languages.

international.display_name

RECORD

NULLABLE

The institution’s display name in different languages. Derived from the wikipedia page for the institution in the given language.

international.display_name.keys

STRING

REPEATED

The language codes in wikidata language code format.

international.display_name.values

STRING

REPEATED

The translated display_names in each language.

lineage

STRING

REPEATED

OpenAlex IDs of institutions. The list will include this institution’s ID, as well as any parent institutions. If this institution has no parent institutions, this list will only contain its own ID.

repositories

RECORD

REPEATED

Repositories (Sources with type: repository) that have this institution as their host_organization

repositories.display_name

STRING

NULLABLE

The repository’s display name.

repositories.host_organization

STRING

NULLABLE

The OpenAlex ID of the host organisation.

repositories.host_organization_lineage

STRING

REPEATED

OpenAlex IDs — See Publisher.lineage. This will only be included if the host_organization is a publisher (and not if the host_organization is an institution).

repositories.host_organization_lineage_names

STRING

REPEATED

The names of the organisations in host_organization_lineage.

repositories.host_organization_name

STRING

NULLABLE

The display_name from the host_organization, shown for convenience.

repositories.id

STRING

NULLABLE

The OpenAlex ID of the repository.

repositories.issn

STRING

REPEATED

repositories.issn_l

STRING

NULLABLE

repositories.publisher

STRING

NULLABLE

repositories.publisher_id

STRING

NULLABLE

repositories.type

STRING

NULLABLE

roles

RECORD

REPEATED

roles.id

STRING

NULLABLE

roles.role

STRING

NULLABLE

roles.works_count

INTEGER

NULLABLE

ror

STRING

NULLABLE

The ROR ID for this institution. The ROR (Research Organization Registry) identifier is a globally unique ID for research organizations. ROR is the successor to GRID, which is no longer being updated.

summary_stats

RECORD

NULLABLE

Citation metrics for this institution.

summary_stats.2yr_cited_by_count

INTEGER

NULLABLE

summary_stats.2yr_h_index

INTEGER

NULLABLE

summary_stats.2yr_i10_index

INTEGER

NULLABLE

summary_stats.2yr_mean_citedness

FLOAT

NULLABLE

summary_stats.2yr_works_count

INTEGER

NULLABLE

summary_stats.cited_by_count

INTEGER

NULLABLE

summary_stats.h_index

INTEGER

NULLABLE

summary_stats.i10_index

INTEGER

NULLABLE

summary_stats.oa_percent

FLOAT

NULLABLE

summary_stats.works_count

INTEGER

NULLABLE

type

STRING

NULLABLE

The institution’s primary type, using the ROR “type” controlled vocabulary. Possible values are: Education, Healthcare, Company, Archive, Nonprofit, Government, Facility, and Other.

updated_date

TIMESTAMP

NULLABLE

The last time anything in this Institution changed, expressed as an ISO 8601 date string. This date is updated for any change at all, including increases in various counts.

works_api_url

STRING

NULLABLE

A URL that will get you a list of all the Works affiliated with this institution. We express this as an API URL (instead of just listing the Works themselves) because most institutions have way too many works to reasonably fit into a single return object.

works_count

INTEGER

NULLABLE

The number of Works created by authors affiliated with this institution. Or less formally: the number of works coming out of this institution.

x_concepts

RECORD

REPEATED

The “x” in x_concepts is because it’s experimental and subject to removal with very little warning. We plan to replace it with a custom link to the Concepts API endpoint. The Concepts most frequently applied to works affiliated with this institution. Each is represented as a dehydrated Concept object, with one additional attribute: score.

x_concepts.display_name

STRING

NULLABLE

The English-language label of the concept.

x_concepts.id

STRING

NULLABLE

The OpenAlex ID for this concept.

x_concepts.level

INTEGER

NULLABLE

The level in the concept tree where this concept lives.

x_concepts.score

FLOAT

NULLABLE

The strength of association between this institution and the listed concept, from 0-100.

x_concepts.wikidata

STRING

NULLABLE

The Wikidata ID for this concept.

Funders

name

type

mode

description

alternate_titles

STRING

REPEATED

A list of alternate titles for this funder.

cited_by_count

INTEGER

NULLABLE

The total number of Works that cite a work linked to this funder.

country_code

STRING

NULLABLE

The country where this funder is located, represented as an ISO two-letter country code.

counts_by_year

RECORD

REPEATED

The values of works_count and cited_by_count for each of the last ten years, binned by year. To put it another way: for every listed year, you can see how many new works are linked to this funder, and how many times any work linked to this funder was cited. Years with zero citations and zero works have been removed so you will need to add those back in if you need them.

counts_by_year.cited_by_count

INTEGER

NULLABLE

counts_by_year.oa_works_count

INTEGER

NULLABLE

counts_by_year.works_count

INTEGER

NULLABLE

counts_by_year.year

INTEGER

NULLABLE

created_date

DATE

NULLABLE

The date this Funder object was created in the OpenAlex dataset, expressed as an ISO 8601 date string.

description

STRING

NULLABLE

A short description of this funder, taken from Wikidata.

display_name

STRING

NULLABLE

The primary name of the funder.

homepage_url

STRING

NULLABLE

The URL for this funder’s primary homepage.

id

STRING

NULLABLE

The OpenAlex ID for this funder.

ids

RECORD

NULLABLE

All the external identifiers that we know about for this funder. IDs are expressed as URIs whenever possible.

ids.openalex

STRING

NULLABLE

this funder’s OpenAlex ID

ids.ror

STRING

NULLABLE

this funder’s ROR ID

ids.wikidata

STRING

NULLABLE

this funder’s Wikidata ID

ids.crossref

STRING

NULLABLE

this funder’s Crossref ID

ids.doi

STRING

NULLABLE

this funder’s DOI

image_thumbnail_url

STRING

NULLABLE

Same as image_url, but it’s a smaller image. This is usually a hotlink to a wikimedia image. You can change the width=300 parameter in the URL if you want a different thumbnail size.

image_url

STRING

NULLABLE

URL where you can get an image representing this funder. Usually this is a hotlink to a Wikimedia image, and usually it’s a seal or logo.

roles

RECORD

REPEATED

List of role objects, which include the role (one of institution, funder, or publisher), the id (OpenAlex ID), and the works_count. In many cases, a single organization does not fit neatly into one role. For example, Yale University is a single organization that is a research university, funds research studies, and publishes an academic journal. The roles property links the OpenAlex entities together for a single organization, and includes counts for the works associated with each role. The roles list of an entity (Funder, Publisher, or Institution) always includes itself. In the case where an organization only has one role, the roles will be a list of length one, with itself as the only item.

roles.id

STRING

NULLABLE

roles.role

STRING

NULLABLE

roles.works_count

INTEGER

NULLABLE

summary_stats

RECORD

NULLABLE

Citation metrics for this funder. While the h-index and the i-10 index are normally author-level metrics and the 2-year mean citedness is normally a journal-level metric, they can be calculated for any set of papers, so we include them for funders.

summary_stats.2yr_cited_by_count

INTEGER

NULLABLE

summary_stats.2yr_h_index

INTEGER

NULLABLE

summary_stats.2yr_i10_index

INTEGER

NULLABLE

summary_stats.2yr_mean_citedness

FLOAT

NULLABLE

The 2-year mean citedness for this funder. Also known as impact factor.

summary_stats.2yr_works_count

INTEGER

NULLABLE

summary_stats.cited_by_count

INTEGER

NULLABLE

summary_stats.h_index

INTEGER

NULLABLE

The h-index for this funder.

summary_stats.i10_index

INTEGER

NULLABLE

The i-10 index for this funder.

summary_stats.oa_percent

FLOAT

NULLABLE

summary_stats.works_count

INTEGER

NULLABLE

updated_date

TIMESTAMP

NULLABLE

The last time anything in this funder object changed, expressed as an ISO 8601 date string. This date is updated for any change at all, including increases in various counts.

works_count

INTEGER

NULLABLE

The number of works linked to this funder.

Institutions

name

type

mode

description

associated_institutions

RECORD

REPEATED

associated_institutions.country_code

STRING

NULLABLE

The country where this institution is located, represented as an ISO two-letter country code.

associated_institutions.display_name

STRING

NULLABLE

The primary name of the institution.

associated_institutions.id

STRING

NULLABLE

The OpenAlex ID for this institution.

associated_institutions.relationship

STRING

NULLABLE

The type of relationship between this institution and the listed institution. Possible values: parent, child, and related.

associated_institutions.ror

STRING

NULLABLE

The ROR ID for this institution. The ROR (Research Organization Registry) identifier is a globally unique ID for research organizations. ROR is the successor to GRID, which is no longer being updated.

associated_institutions.type

STRING

NULLABLE

The institution’s primary type, using the ROR “type” controlled vocabulary. Possible values are: Education, Healthcare, Company, Archive, Nonprofit, Government, Facility, and Other.

cited_by_count

INTEGER

NULLABLE

The total number of Works that cite a work created by an author affiliated with this institution. Or less formally: the number of citations this institution has collected.

country_code

STRING

NULLABLE

The country where this institution is located, represented as an ISO two-letter country code.

counts_by_year

RECORD

REPEATED

works_count and cited_by_count for each of the last ten years, binned by year. To put it another way: each year, you can see how many new works this institution put out, and how many times any work affiliated with this institution got cited.

counts_by_year.cited_by_count

INTEGER

NULLABLE

The total number of Works that cite a work created by an author affiliated with this institution. Or less formally: the number of citations this institution has collected.

counts_by_year.oa_works_count

INTEGER

NULLABLE

counts_by_year.works_count

INTEGER

NULLABLE

The number of Works created by authors affiliated with this institution. Or less formally: the number of works coming out of this institution.

counts_by_year.year

INTEGER

NULLABLE

The year.

created_date

DATE

NULLABLE

The date this Institution object was created in the OpenAlex dataset, expressed as an ISO 8601 date string.

display_name

STRING

NULLABLE

The primary name of the institution.

display_name_acronyms

STRING

REPEATED

Acronyms or initialisms that people sometimes use instead of the full display_name.

display_name_alternatives

STRING

REPEATED

Other names people may use for this institution.

geo

RECORD

NULLABLE

A bunch of stuff we know about the location of this institution

geo.city

STRING

NULLABLE

The city where this institution lives.

geo.country

STRING

NULLABLE

The country where this institution lives.

geo.country_code

STRING

NULLABLE

The country where this institution lives, represented as an ISO two-letter country code.

geo.geonames_city_id

STRING

NULLABLE

The city where this institution lives, as a GeoNames database ID.

geo.latitude

FLOAT

NULLABLE

Does what it says.

geo.longitude

FLOAT

NULLABLE

Does what it says.

geo.region

STRING

NULLABLE

The sub-national region (state, province) where this institution lives.

homepage_url

STRING

NULLABLE

The URL for the institution’s primary homepage.

id

STRING

NULLABLE

The OpenAlex ID for this institution.

ids

RECORD

NULLABLE

All the persistent identifiers (PIDs) that we know about for this institution, as key: value pairs, where key is the PID namespace, and value is the PID. IDs are expressed as URIs where possible. They’re all strings except for mag, which is a long integer.

ids.grid

STRING

NULLABLE

this institution’s GRID ID

ids.mag

INTEGER

NULLABLE

this institution’s Microsoft Academic Graph ID

ids.openalex

STRING

NULLABLE

this institution’s OpenAlex ID. Same as Institution.id

ids.ror

STRING

NULLABLE

this institution’s ROR ID. Same as Institution.ror

ids.wikidata

STRING

NULLABLE

this institution’s Wikidata ID

ids.wikipedia

STRING

NULLABLE

this institution’s Wikipedia page URL

image_thumbnail_url

STRING

NULLABLE

Same as image_url, but it’s a smaller image.

image_url

STRING

NULLABLE

URL where you can get an image representing this institution. Usually this is hosted on Wikipedia, and usually it’s a seal or logo.

international

RECORD

NULLABLE

Translation of the display_name and description into multiple languages.

international.display_name

RECORD

NULLABLE

The institution’s display name in different languages. Derived from the wikipedia page for the institution in the given language.

international.display_name.keys

STRING

REPEATED

The language codes in wikidata language code format.

international.display_name.values

STRING

REPEATED

The translated display_names in each language.

lineage

STRING

REPEATED

OpenAlex IDs of institutions. The list will include this institution’s ID, as well as any parent institutions. If this institution has no parent institutions, this list will only contain its own ID.

repositories

RECORD

REPEATED

Repositories (Sources with type: repository) that have this institution as their host_organization

repositories.display_name

STRING

NULLABLE

The repository’s display name.

repositories.host_organization

STRING

NULLABLE

The OpenAlex ID of the host organisation.

repositories.host_organization_lineage

STRING

REPEATED

OpenAlex IDs — See Publisher.lineage. This will only be included if the host_organization is a publisher (and not if the host_organization is an institution).

repositories.host_organization_lineage_names

STRING

REPEATED

The names of the organisations in host_organization_lineage.

repositories.host_organization_name

STRING

NULLABLE

The display_name from the host_organization, shown for convenience.

repositories.id

STRING

NULLABLE

The OpenAlex ID of the repository.

repositories.issn

STRING

REPEATED

repositories.issn_l

STRING

NULLABLE

repositories.publisher

STRING

NULLABLE

repositories.publisher_id

STRING

NULLABLE

repositories.type

STRING

NULLABLE

roles

RECORD

REPEATED

roles.id

STRING

NULLABLE

roles.role

STRING

NULLABLE

roles.works_count

INTEGER

NULLABLE

ror

STRING

NULLABLE

The ROR ID for this institution. The ROR (Research Organization Registry) identifier is a globally unique ID for research organizations. ROR is the successor to GRID, which is no longer being updated.

summary_stats

RECORD

NULLABLE

Citation metrics for this institution.

summary_stats.2yr_cited_by_count

INTEGER

NULLABLE

summary_stats.2yr_h_index

INTEGER

NULLABLE

summary_stats.2yr_i10_index

INTEGER

NULLABLE

summary_stats.2yr_mean_citedness

FLOAT

NULLABLE

summary_stats.2yr_works_count

INTEGER

NULLABLE

summary_stats.cited_by_count

INTEGER

NULLABLE

summary_stats.h_index

INTEGER

NULLABLE

summary_stats.i10_index

INTEGER

NULLABLE

summary_stats.oa_percent

FLOAT

NULLABLE

summary_stats.works_count

INTEGER

NULLABLE

type

STRING

NULLABLE

The institution’s primary type, using the ROR “type” controlled vocabulary. Possible values are: Education, Healthcare, Company, Archive, Nonprofit, Government, Facility, and Other.

updated_date

TIMESTAMP

NULLABLE

The last time anything in this Institution changed, expressed as an ISO 8601 date string. This date is updated for any change at all, including increases in various counts.

works_api_url

STRING

NULLABLE

A URL that will get you a list of all the Works affiliated with this institution. We express this as an API URL (instead of just listing the Works themselves) because most institutions have way too many works to reasonably fit into a single return object.

works_count

INTEGER

NULLABLE

The number of Works created by authors affiliated with this institution. Or less formally: the number of works coming out of this institution.

x_concepts

RECORD

REPEATED

The “x” in x_concepts is because it’s experimental and subject to removal with very little warning. We plan to replace it with a custom link to the Concepts API endpoint. The Concepts most frequently applied to works affiliated with this institution. Each is represented as a dehydrated Concept object, with one additional attribute: score.

x_concepts.display_name

STRING

NULLABLE

The English-language label of the concept.

x_concepts.id

STRING

NULLABLE

The OpenAlex ID for this concept.

x_concepts.level

INTEGER

NULLABLE

The level in the concept tree where this concept lives.

x_concepts.score

FLOAT

NULLABLE

The strength of association between this institution and the listed concept, from 0-100.

x_concepts.wikidata

STRING

NULLABLE

The Wikidata ID for this concept.
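
The keys and values arrays in international.display_name are parallel: the entry at position i in values belongs to the language code at position i in keys. As a minimal sketch of how such a record could be turned back into a lookup table after querying, consider the Python snippet below. The helper name to_lookup and the sample row are made up for illustration and are not part of the telescope.

```python
from typing import Dict, List, Optional


def to_lookup(record: Optional[Dict[str, List[str]]]) -> Dict[str, str]:
    """Rebuild a {language_code: display_name} dict from a keys/values RECORD."""
    if not record:
        return {}
    keys = record.get("keys") or []
    values = record.get("values") or []
    # The two arrays are expected to have the same length; zip pairs them by position.
    return dict(zip(keys, values))


# Hypothetical row, shaped like the international.display_name field.
row = {
    "keys": ["en", "fr"],
    "values": ["University of Example", "Université d'Exemple"],
}
names = to_lookup(row)
print(names.get("fr"))
```

Pairing by position only works as long as both arrays keep their original order, so avoid sorting or de-duplicating one array without the other.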

Publishers

name

type

mode

description

alternate_titles

STRING

REPEATED

A list of alternate titles for this publisher.

cited_by_count

INTEGER

NULLABLE

The number of citations to works that are linked to this publisher through journals or other sources. For example, if a publisher publishes 27 journals and those 27 journals have 3,050 works, this number is the sum of the cited_by_count values for all of those 3,050 works.

country_codes

STRING

REPEATED

The countries where the publisher is primarily located, as an ISO two-letter country code.

counts_by_year

RECORD

REPEATED

The values of works_count and cited_by_count for each of the last ten years, binned by year. To put it another way: for every listed year, you can see how many new works are linked to this publisher, and how many times any work linked to this publisher was cited. Years with zero citations and zero works have been removed so you will need to add those back in if you need them.

counts_by_year.cited_by_count

INTEGER

NULLABLE

The total number of Works that cite a Work published by this publisher.

counts_by_year.oa_works_count

INTEGER

NULLABLE

counts_by_year.works_count

INTEGER

NULLABLE

The total number of Works that are published by this publisher.

counts_by_year.year

INTEGER

NULLABLE

The year.

created_date

DATE

NULLABLE

The date this Publisher object was created in the OpenAlex dataset, expressed as an ISO 8601 date string.

display_name

STRING

NULLABLE

The primary name of the publisher.

hierarchy_level

INTEGER

NULLABLE

The hierarchy level for this publisher. A publisher with hierarchy level 0 has no parent publishers. A hierarchy level 1 publisher has one parent above it, and so on.

id

STRING

NULLABLE

The OpenAlex ID for this publisher.

ids

RECORD

NULLABLE

All the external identifiers that we know about for this publisher. IDs are expressed as URIs whenever possible.

ids.openalex

STRING

NULLABLE

this publisher’s OpenAlex ID

ids.ror

STRING

NULLABLE

this publisher’s ROR ID

ids.wikidata

STRING

NULLABLE

this publisher’s Wikidata ID

image_thumbnail_url

STRING

NULLABLE

This is usually a hotlink to a Wikimedia image. You can change the width=300 parameter in the URL if you want a different thumbnail size.

image_url

STRING

NULLABLE

URL where you can get an image representing this publisher. Usually this is a hotlink to a Wikimedia image, and usually it’s a seal or logo.

lineage

STRING

REPEATED

OpenAlex IDs of publishers. The list will include this publisher’s ID, as well as any parent publishers. If this publisher’s hierarchy_level is 0, this list will only contain its own ID.

parent_publisher

RECORD

NULLABLE

An OpenAlex ID linking to the direct parent of the publisher and display name. This will be null if the publisher’s hierarchy_level is 0.

parent_publisher.display_name

STRING

NULLABLE

parent_publisher.id

STRING

NULLABLE

roles

RECORD

REPEATED

roles.id

STRING

NULLABLE

roles.role

STRING

NULLABLE

roles.works_count

INTEGER

NULLABLE

sources_api_url

STRING

NULLABLE

A URL that will get you a list of all the sources published by this publisher. We express this as an API URL (instead of just listing the sources themselves) because there might be thousands of sources linked to a publisher, and that’s too many to fit here.

summary_stats

RECORD

NULLABLE

Citation metrics for this publisher

summary_stats.2yr_cited_by_count

INTEGER

NULLABLE

summary_stats.2yr_h_index

INTEGER

NULLABLE

summary_stats.2yr_i10_index

INTEGER

NULLABLE

summary_stats.2yr_mean_citedness

FLOAT

NULLABLE

summary_stats.2yr_works_count

INTEGER

NULLABLE

summary_stats.cited_by_count

INTEGER

NULLABLE

summary_stats.h_index

INTEGER

NULLABLE

summary_stats.i10_index

INTEGER

NULLABLE

summary_stats.oa_percent

FLOAT

NULLABLE

summary_stats.works_count

INTEGER

NULLABLE

updated_date

TIMESTAMP

NULLABLE

The last time anything in this publisher object changed, expressed as an ISO 8601 date string. This date is updated for any change at all, including increases in various counts.

works_count

INTEGER

NULLABLE

The number of works published by this publisher.
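
The counts_by_year description above notes that years with zero citations and zero works are removed. If a continuous series is needed, the gaps can be filled in after querying. The following Python sketch shows one way to do that; the function name fill_missing_years, the year range and the sample data are assumptions for illustration only, and oa_works_count is left out of the zero entries for brevity.

```python
from typing import Dict, List


def fill_missing_years(
    counts: List[Dict[str, int]], first_year: int, last_year: int
) -> List[Dict[str, int]]:
    """Return one entry per year in [first_year, last_year], newest first.

    Years that are absent from `counts` get zero works and zero citations.
    """
    by_year = {entry["year"]: entry for entry in counts}
    filled = []
    for year in range(last_year, first_year - 1, -1):
        default = {"year": year, "works_count": 0, "cited_by_count": 0}
        filled.append(by_year.get(year, default))
    return filled


# Hypothetical counts_by_year value in which 2021 was dropped.
counts = [
    {"year": 2022, "works_count": 12, "cited_by_count": 40},
    {"year": 2020, "works_count": 7, "cited_by_count": 15},
]
for entry in fill_missing_years(counts, 2020, 2022):
    print(entry)
```

The same approach applies to the counts_by_year field of the other entities, since they share the year, works_count and cited_by_count sub-fields.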

Sources

name

type

mode

description

abbreviated_title

STRING

NULLABLE

An abbreviated title obtained from the ISSN Centre.

alternate_titles

STRING

REPEATED

Alternate titles for this source, as obtained from the ISSN Centre and individual work records, like Crossref DOIs, that carry the source name as a string. These are commonly abbreviations or translations of the source’s canonical name.

apc_prices

RECORD

REPEATED

List of objects, each with price (Integer) and currency (String). Article processing charge information, taken directly from DOAJ.

apc_prices.currency

STRING

NULLABLE

Currency.

apc_prices.price

INTEGER

NULLABLE

Price.

apc_usd

INTEGER

NULLABLE

The source’s article processing charge in US Dollars, if available from DOAJ. The apc_usd value is calculated by taking the APC price (see apc_prices) with a currency of USD if it is available. If it’s not available, we convert the first available value from apc_prices into USD, using recent exchange rates.

cited_by_count

INTEGER

NULLABLE

The total number of Works that cite a Work hosted in this source.

country_code

STRING

NULLABLE

The country that this source is associated with, represented as an ISO two-letter country code.

counts_by_year

RECORD

REPEATED

works_count and cited_by_count for each of the last ten years, binned by year. To put it another way: each year, you can see how many new works this source started hosting, and how many times any work in this source got cited. If the source was founded less than ten years ago, there will naturally be fewer than ten years in this list. Years with zero citations and zero works have been removed so you will need to add those in if you need them.

counts_by_year.cited_by_count

INTEGER

NULLABLE

The total number of Works that cite a Work hosted in this source.

counts_by_year.oa_works_count

INTEGER

NULLABLE

counts_by_year.works_count

INTEGER

NULLABLE

The number of Works this source hosts.

counts_by_year.year

INTEGER

NULLABLE

The year.

created_date

DATE

NULLABLE

The date this Source object was created in the OpenAlex dataset, expressed as an ISO 8601 date string.

display_name

STRING

NULLABLE

The name of the source.

homepage_url

STRING

NULLABLE

The starting page for navigating the contents of this source; the homepage for this source’s website.

host_organization

STRING

NULLABLE

The host organization for this source as an OpenAlex ID. This will be an Institution.id if the source is a repository, and a Publisher.id if the source is a journal, conference, or eBook platform (based on the type field).

host_organization_lineage

STRING

REPEATED

OpenAlex IDs — See Publisher.lineage. This will only be included if the host_organization is a publisher (and not if the host_organization is an institution).

host_organization_lineage_names

STRING

REPEATED

The names of the organisations in host_organization_lineage.

host_organization_name

STRING

NULLABLE

The display_name from the host_organization, shown for convenience.

id

STRING

NULLABLE

The OpenAlex ID for this source.

ids

RECORD

NULLABLE

All the external identifiers that we know about for this source. IDs are expressed as URIs whenever possible.

ids.fatcat

STRING

NULLABLE

this source’s Fatcat ID

ids.issn

STRING

REPEATED

a list of this source’s ISSNs. Same as Source.issn

ids.issn_l

STRING

NULLABLE

this source’s ISSN-L. Same as Source.issn_l

ids.mag

INTEGER

NULLABLE

this source’s Microsoft Academic Graph ID

ids.openalex

STRING

NULLABLE

this source’s OpenAlex ID. Same as Source.id

ids.wikidata

STRING

NULLABLE

this source’s Wikidata ID

is_in_doaj

BOOLEAN

NULLABLE

Whether this is a journal listed in the Directory of Open Access Journals (DOAJ).

is_oa

BOOLEAN

NULLABLE

Whether this is currently a fully open-access source. This could be true for a preprint repository where everything uploaded is free to read, or for a Gold or Diamond open access journal, where all newly published Works are available for free under an open license. We say “currently” because the status of a source can change over time. It’s common for journals to “flip” to Gold OA, after which they may make only future articles open or also open their back catalogs. It’s entirely possible for a source to say is_oa: true, but for an article from last year to require a subscription.

issn

STRING

REPEATED

The ISSNs used by this source. Many publications have multiple ISSNs (see issn_l), so ISSN-L should be used when possible.

issn_l

STRING

NULLABLE

The ISSN-L identifying this source. ISSN is a global and unique ID for serial publications. However, different media versions of a given publication (e.g., print and electronic) often have different ISSNs. This is why we can’t have nice things. The ISSN-L or Linking ISSN solves the problem by designating a single canonical ISSN for all media versions of the title. It’s usually the same as the print ISSN.

publisher

STRING

NULLABLE

The name of this source’s publisher. Publisher is a tricky category, as journals often change publishers, publishers merge, publishers have subsidiaries (“imprints”), and of course no one is consistent in their naming. In the future, we plan to roll out support for a more structured publisher field, but for now it’s just a string.

publisher_id

STRING

NULLABLE

societies

RECORD

REPEATED

Societies on whose behalf the source is published and maintained, obtained from our crowdsourced list. Thanks!

societies.organization

STRING

NULLABLE

The society organisation name.

societies.url

STRING

NULLABLE

The society URL.

summary_stats

RECORD

NULLABLE

Citation metrics for this source.

summary_stats.2yr_cited_by_count

INTEGER

NULLABLE

summary_stats.2yr_h_index

INTEGER

NULLABLE

summary_stats.2yr_i10_index

INTEGER

NULLABLE

summary_stats.2yr_mean_citedness

FLOAT

NULLABLE

summary_stats.2yr_works_count

INTEGER

NULLABLE

summary_stats.cited_by_count

INTEGER

NULLABLE

summary_stats.h_index

INTEGER

NULLABLE

summary_stats.i10_index

INTEGER

NULLABLE

summary_stats.oa_percent

FLOAT

NULLABLE

summary_stats.works_count

INTEGER

NULLABLE

type

STRING

NULLABLE

The type of source, which will be one of the following: journal, repository, conference, ebook platform.

updated_date

TIMESTAMP

NULLABLE

The last time anything in this Source object changed, expressed as an ISO 8601 date string. This date is updated for any change at all, including increases in various counts.

works_api_url

STRING

NULLABLE

A URL that will get you a list of all this Source’s Works. We express this as an API URL (instead of just listing the works themselves) because sometimes a source’s publication list is too long to reasonably fit into a single Source object.

works_count

INTEGER

NULLABLE

The number of Works this Source hosts.

x_concepts

RECORD

REPEATED

The “x” in x_concepts is because it’s experimental and subject to removal with very little warning. We plan to replace it with a custom link to the Concepts API endpoint. The Concepts most frequently applied to works hosted by this source. Each is represented as a dehydrated Concept object, with one additional attribute: score.

x_concepts.display_name

STRING

NULLABLE

The English-language label of the concept.

x_concepts.id

STRING

NULLABLE

The OpenAlex ID for this concept.

x_concepts.level

INTEGER

NULLABLE

The level in the concept tree where this concept lives.

x_concepts.score

FLOAT

NULLABLE

The strength of association between this source and the listed concept, from 0-100.

x_concepts.wikidata

STRING

NULLABLE

The Wikidata ID for this concept.
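
The apc_usd description above states a simple rule: use the USD entry from apc_prices when there is one, otherwise convert the first available price into USD. The Python sketch below illustrates that rule only. It is not OpenAlex's code, and the exchange rates, the ASSUMED_USD_RATES table and the function name estimate_apc_usd are invented for the example.

```python
from typing import Dict, List, Optional

# ASSUMPTION: made-up rates (USD per one unit of the currency), for illustration only.
ASSUMED_USD_RATES: Dict[str, float] = {"EUR": 1.10, "GBP": 1.25}


def estimate_apc_usd(
    apc_prices: List[Dict], rates: Optional[Dict[str, float]] = None
) -> Optional[int]:
    """Apply the rule from the apc_usd description to a list of price objects."""
    rates = ASSUMED_USD_RATES if rates is None else rates
    # 1. Prefer a price that is already listed in USD.
    for item in apc_prices:
        if item.get("currency") == "USD":
            return item["price"]
    # 2. Otherwise convert the first available price, if its rate is known.
    if apc_prices:
        first = apc_prices[0]
        rate = rates.get(first.get("currency"))
        if rate is not None:
            return round(first["price"] * rate)
    return None


# Hypothetical apc_prices values.
print(estimate_apc_usd([{"currency": "EUR", "price": 1000}, {"currency": "USD", "price": 1200}]))
print(estimate_apc_usd([{"currency": "EUR", "price": 1000}]))
```

Because the converted figure depends on which exchange rates are used, a value recomputed this way will not necessarily match the stored apc_usd.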

Work

name

type

mode

description

abstract_inverted_index

RECORD

NULLABLE

The abstract of the work, as an inverted index, which encodes information about the abstract’s words and their positions within the text. Like Microsoft Academic Graph, OpenAlex doesn’t include plaintext abstracts due to legal constraints.

abstract_inverted_index.keys

STRING

REPEATED

Custom field created by COKI. Originally each word in the abstract was a key and the indices of where this word occurred inside the abstract the corresponding value. This array holds the words (the original keys).

abstract_inverted_index.values

STRING

REPEATED

Custom field created by COKI. Originally each word in the abstract was a key and the indices of where this word occurred inside the abstract the corresponding value. This array holds the indices (the original values), in the same order as the words in keys.
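
Since the two arrays still carry the full inverted index, the word order of an abstract can be recovered from them. The Python sketch below shows the idea. It assumes that each entry in values is a comma-separated string of word positions (optionally wrapped in square brackets), which is an assumption about the serialised format and should be checked against real rows; the function name rebuild_abstract and the sample record are hypothetical.

```python
from typing import Dict, List


def rebuild_abstract(record: Dict[str, List[str]]) -> str:
    """Place every word at each of its positions and join them in order."""
    words_by_position: Dict[int, str] = {}
    for word, raw_positions in zip(record["keys"], record["values"]):
        # ASSUMPTION: positions are stored as text such as "0, 3" or "[0, 3]".
        for token in raw_positions.strip("[]").split(","):
            token = token.strip()
            if token:
                words_by_position[int(token)] = word
    return " ".join(words_by_position[i] for i in sorted(words_by_position))


# Hypothetical abstract_inverted_index value.
record = {
    "keys": ["the", "cat", "saw", "dog"],
    "values": ["0, 3", "1", "2", "4"],
}
print(rebuild_abstract(record))  # the cat saw the dog
```

Positions missing from the index are skipped rather than padded, so the result is a best-effort reconstruction of the text.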

apc_list

RECORD

NULLABLE

Objects containing information about the APC (article processing charge) for this work. This value is the APC list price–the price as listed by the journal’s publisher. That’s not always the price actually paid, because publishers may offer various discounts to authors. Unfortunately we don’t always know this discounted price, but when we do you can find it in apc_paid. Currently our only source for this data is DOAJ, and so doaj is the only value for apc_list.provenance, but we’ll add other sources over time.

apc_list.currency

STRING

NULLABLE

apc_list.price

INTEGER

NULLABLE

apc_list.price_usd

INTEGER

NULLABLE

APC converted to USD

apc_list.value

INTEGER

NULLABLE

apc_list.value_usd

INTEGER

NULLABLE

apc_list.provenance

STRING

NULLABLE

apc_paid

RECORD

NULLABLE

Object: Information about the paid APC (article processing charge) for this work. You can find the listed APC price (when we know it) for a given work using apc_list. However, authors don’t always pay the listed price; often they get a discounted price from publishers. So it’s useful to know the APC actually paid by authors, as distinct from the list price. This is our effort to provide this. Our best source for the actually paid price is the OpenAPC project. Where available, we use that data, and so apc_paid.provenance is openapc. Where OpenAPC data is unavailable (and unfortunately this is common) we make our best guess by assuming the author paid the APC list price, and apc_paid.provenance will be set to wherever we got the list price from.

apc_paid.currency

STRING

NULLABLE

apc_paid.price

INTEGER

NULLABLE

apc_paid.price_usd

INTEGER

NULLABLE

APC converted to USD

apc_paid.value

INTEGER

NULLABLE

apc_paid.value_usd

INTEGER

NULLABLE

apc_paid.provenance

STRING

NULLABLE

authors_count

INTEGER

NULLABLE

authorships_truncated

BOOLEAN

NULLABLE

authorships

RECORD

REPEATED

List of Authorship objects, each representing an author and their institution.

authorships.author

RECORD

NULLABLE

An author of this work, as a dehydrated Author object.

authorships.author.display_name

STRING

NULLABLE

The name of the author as a single string.

authorships.author.id

STRING

NULLABLE

The OpenAlex ID for this author.

authorships.author.orcid

STRING

NULLABLE

The ORCID for this author. ORCID is a global and unique ID for authors.

authorships.author_position

STRING

NULLABLE

A summarized description of this author’s position in the work’s author list. Possible values are first, middle, and last. It’s not strictly necessary, because author order is already implicitly recorded by the list order of Authorship objects; however it’s useful in some contexts to have this as a categorical value.

authorships.countries

STRING

REPEATED

authorships.institutions

RECORD

REPEATED

The institutional affiliations this author claimed in the context of this work, as dehydrated Institution objects.

authorships.institutions.country_code

STRING

NULLABLE

The country where this institution is located, represented as an ISO two-letter country code.

authorships.institutions.display_name

STRING

NULLABLE

The primary name of the institution.

authorships.institutions.id

STRING

NULLABLE

The OpenAlex ID for this institution.

authorships.institutions.lineage

STRING

REPEATED

OpenAlex IDs of institutions. The list will include this institution’s ID, as well as any parent institutions. If this institution has no parent institutions, this list will only contain its own ID.

authorships.institutions.ror

STRING

NULLABLE

The ROR ID for this institution. The ROR (Research Organization Registry) identifier is a globally unique ID for research organizations. ROR is the successor to GRID, which is no longer being updated.

authorships.institutions.type

STRING

NULLABLE

The institution’s primary type, using the ROR “type” controlled vocabulary. Possible values are: Education, Healthcare, Company, Archive, Nonprofit, Government, Facility, and Other.

authorships.is_corresponding

BOOLEAN

NULLABLE

authorships.raw_affiliation_string

STRING

NULLABLE

This author’s affiliation as it originally came to us (on a webpage or in an API), as a raw unformatted string.

authorships.raw_affiliation_strings

STRING

REPEATED

authorships.raw_author_name

STRING

NULLABLE

This author’s name as it originally came to us (on a webpage or in an API), as a raw unformatted string.

best_oa_location

RECORD

NULLABLE

A Location object with the best available open access location for this work.

best_oa_location.doi

STRING

NULLABLE

best_oa_location.is_accepted

BOOLEAN

NULLABLE

best_oa_location.is_oa

BOOLEAN

NULLABLE

True if this work is Open Access (OA).

best_oa_location.is_published

BOOLEAN

NULLABLE

best_oa_location.landing_page_url

STRING

NULLABLE

The landing page URL for this location.

best_oa_location.license

STRING

NULLABLE

The location’s publishing license. This can be a Creative Commons license such as cc0 or cc-by, a publisher-specific license, or null, which means we are not able to determine a license for this location.

best_oa_location.pdf_url

STRING

NULLABLE

A URL where you can find this location as a PDF.

best_oa_location.source

RECORD

NULLABLE

Information about the source of this location, as a DehydratedSource object.

best_oa_location.source.display_name

STRING

NULLABLE

The name of the source.

best_oa_location.source.host_institution_lineage

STRING

REPEATED

best_oa_location.source.host_institution_lineage_names

STRING

REPEATED

best_oa_location.source.host_organization

STRING

NULLABLE

The host organization for this source as an OpenAlex ID. This will be an Institution.id if the source is a repository, and a Publisher.id if the source is a journal, conference, or eBook platform (based on the type field).

best_oa_location.source.host_organization_lineage

STRING

REPEATED

OpenAlex IDs — See Publisher.lineage. This will only be included if the host_organization is a publisher (and not if the host_organization is an institution).

best_oa_location.source.host_organization_lineage_names

STRING

REPEATED

The names of the organisations in host_organization_lineage.

best_oa_location.source.host_organization_name

STRING

NULLABLE

The display_name from the host_organization, shown for convenience.

best_oa_location.source.id

STRING

NULLABLE

The OpenAlex ID for this source.

best_oa_location.source.is_in_doaj

BOOLEAN

NULLABLE

Whether this is a journal listed in the Directory of Open Access Journals (DOAJ).

best_oa_location.source.is_oa

BOOLEAN

NULLABLE

best_oa_location.source.issn

STRING

REPEATED

The ISSNs used by this source. Many publications have multiple ISSNs (see issn_l), so ISSN-L should be used when possible.

best_oa_location.source.issn_l

STRING

NULLABLE

The ISSN-L identifying this source. This is the Canonical External ID for sources.

best_oa_location.source.publisher

STRING

NULLABLE

The publisher name.

best_oa_location.source.publisher_id

STRING

NULLABLE

The OpenAlex ID of the publisher.

best_oa_location.source.publisher_lineage

STRING

REPEATED

best_oa_location.source.publisher_lineage_names

STRING

REPEATED

best_oa_location.source.type

STRING

NULLABLE

The type of source.

best_oa_location.version

STRING

NULLABLE

The version of the work, based on the DRIVER Guidelines versioning scheme.

biblio

RECORD

NULLABLE

Old-timey bibliographic info for this work. This is mostly useful only in citation/reference contexts. These are all strings because sometimes you’ll get fun values like “Spring” and “Inside cover.”

biblio.first_page

STRING

NULLABLE

biblio.issue

STRING

NULLABLE

biblio.last_page

STRING

NULLABLE

biblio.volume

STRING

NULLABLE

cited_by_api_url

STRING

NULLABLE

cited_by_count

INTEGER

NULLABLE

The number of citations to this work. These are the times that other works have cited this work: Other works ➞ This work.

concepts

RECORD

REPEATED

List of dehydrated Concept objects. Each Concept object in the list also has one additional property

concepts.display_name

STRING

NULLABLE

The English-language label of the concept.

concepts.id

STRING

NULLABLE

The OpenAlex ID for this concept.

concepts.level

INTEGER

NULLABLE

The level in the concept tree where this concept lives.

concepts.score

FLOAT

NULLABLE

The strength of the connection between the work and this concept (higher is stronger).

concepts.wikidata

STRING

NULLABLE

The Wikidata ID for this concept.

concepts_count

INTEGER

NULLABLE

corresponding_author_ids

STRING

REPEATED

OpenAlex IDs of any authors for which authorships.is_corresponding is true.

corresponding_institution_ids

STRING

REPEATED

OpenAlex IDs of any institutions found within an authorship for which authorships.is_corresponding is true.

countries_distinct_count

INTEGER

NULLABLE

Number of distinct country_codes among the authorships for this work.

counts_by_year

RECORD

REPEATED

Works.cited_by_count for each of the last ten years, binned by year. To put it another way: each year, you can see how many times this work was cited.

counts_by_year.cited_by_count

INTEGER

NULLABLE

The number of times this work is cited in this year.

counts_by_year.oa_works_count

INTEGER

NULLABLE

counts_by_year.year

INTEGER

NULLABLE

The year.

created_date

DATE

NULLABLE

The date this Work object was created in the OpenAlex dataset, expressed as an ISO 8601 date string.

display_name

STRING

NULLABLE

Exactly the same as Work.title. It’s useful for Works to include a display_name property, since all the other entities have one.

doi

STRING

NULLABLE

The DOI for the work. This is the Canonical External ID for works. Occasionally, a work has more than one DOI–for example, there might be one DOI for a preprint version hosted on bioRxiv, and another DOI for the published version. However, this field always has just one DOI, the DOI for the published work.

doi_registration_agency

STRING

NULLABLE

fulltext_origin

STRING

NULLABLE

grants

RECORD

REPEATED

List of grant objects, which include the Funder and the award ID, if available. Our grants data comes from Crossref, and is currently fairly limited.

grants.award_id

STRING

NULLABLE

grants.funder

STRING

NULLABLE

grants.funder_display_name

STRING

NULLABLE

has_fulltext

BOOLEAN

NULLABLE

id

STRING

NULLABLE

The OpenAlex ID for this work.

ids

RECORD

NULLABLE

All the persistent identifiers (PIDs) that we know about for this work, as key: value pairs, where key is the PID namespace, and value is the PID. IDs are expressed as URIs where possible.

ids.arxiv_id

STRING

NULLABLE

ids.doi

STRING

NULLABLE

The DOI. Same as Work.doi

ids.mag

INTEGER

NULLABLE

The Microsoft Academic Graph ID

ids.openalex

STRING

NULLABLE

The OpenAlex ID. Same as Work.id

ids.pmcid

STRING

NULLABLE

the Pubmed Central identifier

ids.pmid

STRING

NULLABLE

The Pubmed Identifier

institutions_distinct_count

INTEGER

NULLABLE

Number of distinct institutions among the authorships for this work.

is_paratext

BOOLEAN

NULLABLE

True if we think this work is paratext. In our context, paratext is stuff that’s in a scholarly venue (like a journal) but is about the venue rather than a scholarly work properly speaking. Examples of paratext: front cover, back cover, table of contents, editorial board listing, issue information, masthead. Not paratext: research paper, dataset, letters to the editor, figures. It turns out there is a lot of paratext in registries like Crossref. That’s not a bad thing, but we’ve found that it’s good to have a way to filter it out. We determine is_paratext algorithmically using title heuristics.

is_retracted

BOOLEAN

NULLABLE

True if we know this work has been retracted. This field has high precision but low recall. In other words, if is_retracted is true, the article is definitely retracted. But if is_retracted is false, it might still be retracted; we just don’t know. This is because, unfortunately, the open sources for retraction data aren’t currently very comprehensive, and the more comprehensive ones aren’t sufficiently open for us to use here.

language

STRING

NULLABLE

The language of the work in ISO 639-1 format. The language is automatically detected using the information we have about the work. We use the langdetect software library on the words in the work’s abstract, or the title if we do not have the abstract. The source code for this procedure is here. Keep in mind that this method is not perfect, and that in some cases the language of the title or abstract could be different from the body of the work.

license

STRING

NULLABLE

The license applied to this work at this host. Most toll-access works don’t have an explicit license (they’re under “all rights reserved” copyright), so this field generally has content only if is_oa is true.

locations

RECORD

REPEATED

A list of Location objects describing all unique places where this work lives.

locations.doi

STRING

NULLABLE

locations.is_accepted

BOOLEAN

NULLABLE

True if this location’s version is either acceptedVersion or publishedVersion; otherwise false.

locations.is_oa

BOOLEAN

NULLABLE

True if this work is Open Access (OA).

locations.is_published

BOOLEAN

NULLABLE

True if this location’s version is publishedVersion; otherwise false.
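The is_accepted and is_published flags follow directly from the location's version value. A minimal sketch of that relationship, using a hypothetical helper name:

```python
def version_flags(version):
    """Derive (is_accepted, is_published) from a location's version string."""
    # Only the version of record counts as published.
    is_published = version == "publishedVersion"
    # Accepted covers both the accepted manuscript and the version of record.
    is_accepted = version in ("acceptedVersion", "publishedVersion")
    return is_accepted, is_published
```

For example, `version_flags("acceptedVersion")` gives `(True, False)`, while a submittedVersion or a missing version gives `(False, False)`.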

locations.landing_page_url

STRING

NULLABLE

The landing page URL for this location.

locations.license

STRING

NULLABLE

The location’s publishing license. This can be a Creative Commons license such as cc0 or cc-by, a publisher-specific license, or null, which means we are not able to determine a license for this location.

locations.pdf_url

STRING

NULLABLE

A URL where you can find this location as a PDF.

locations.source

RECORD

NULLABLE

locations.source.display_name

STRING

NULLABLE

The name of the source.

locations.source.host_institution_lineage

STRING

REPEATED

locations.source.host_institution_lineage_names

STRING

REPEATED

locations.source.host_organization

STRING

NULLABLE

The host organization for this source as an OpenAlex ID. This will be an Institution.id if the source is a repository, and a Publisher.id if the source is a journal, conference, or eBook platform (based on the type field).

locations.source.host_organization_lineage

STRING

REPEATED

The OpenAlex IDs in the lineage of the host organization; see Publisher.lineage. This is only included if the host_organization is a publisher (and not if the host_organization is an institution).

locations.source.host_organization_lineage_names

STRING

REPEATED

The names of the organisations in host_organization_lineage.

locations.source.host_organization_name

STRING

NULLABLE

The display_name from the host_organization, shown for convenience.

locations.source.id

STRING

NULLABLE

The OpenAlex ID for this source.

locations.source.is_in_doaj

BOOLEAN

NULLABLE

Whether this is a journal listed in the Directory of Open Access Journals (DOAJ).

locations.source.is_oa

BOOLEAN

NULLABLE

locations.source.issn

STRING

REPEATED

The ISSNs used by this source. Many publications have multiple ISSNs, so ISSN-L should be used when possible.

locations.source.issn_l

STRING

NULLABLE

The ISSN-L identifying this source. This is the Canonical External ID for sources.

locations.source.publisher

STRING

NULLABLE

The publisher name.

locations.source.publisher_id

STRING

NULLABLE

The OpenAlex ID of the publisher.

locations.source.publisher_lineage

STRING

REPEATED

locations.source.publisher_lineage_names

STRING

REPEATED

locations.source.type

STRING

NULLABLE

The type of source.

locations.version

STRING

NULLABLE

The version of the work, based on the DRIVER Guidelines versioning scheme. Possible values are: publishedVersion: The document’s version of record. This is the most authoritative version. acceptedVersion: The document after having completed peer review and being officially accepted for publication. It will lack publisher formatting, but the content should be interchangeable with that of the publishedVersion. submittedVersion: The document as submitted to the publisher by the authors, but before peer review. Its content may differ significantly from that of the accepted article.

locations_count

INTEGER

NULLABLE

Number of locations for this work.

mesh

RECORD

REPEATED

List of MeSH tag objects. Only works found in PubMed have MeSH tags; for all other works, this is an empty list.

mesh.descriptor_name

STRING

NULLABLE

mesh.descriptor_ui

STRING

NULLABLE

mesh.is_major_topic

BOOLEAN

NULLABLE

mesh.qualifier_name

STRING

NULLABLE

mesh.qualifier_ui

STRING

NULLABLE

open_access

RECORD

NULLABLE

Information about the access status of this work, as an OpenAccess object.

open_access.any_repository_has_fulltext

BOOLEAN

NULLABLE

open_access.is_oa

BOOLEAN

NULLABLE

True if this work is Open Access (OA). There are many ways to define OA. OpenAlex uses a broad definition: having a URL where you can read the fulltext of this work without needing to pay money or log in. You can use the locations and oa_status fields to narrow your results further, accommodating any definition of OA you like.

open_access.oa_status

STRING

NULLABLE

The Open Access (OA) status of this work. Possible values are: gold: published in an OA journal that is indexed by the DOAJ; green: toll-access on the publisher landing page, but there is a free copy in an OA repository; hybrid: free under an open license in a toll-access journal; bronze: free to read on the publisher landing page, but without any identifiable license; closed: all other articles.
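The five status values can be read as a small decision rule. The sketch below is an illustration only: the function, its boolean parameters and the order in which the rules are checked are assumptions, not the procedure OpenAlex uses.

```python
def classify_oa_status(journal_in_doaj, free_on_publisher_page,
                       has_open_license, free_in_repository):
    """Map four facts about a work to one of the oa_status values."""
    if journal_in_doaj:
        return "gold"  # published in an OA journal indexed by the DOAJ
    if free_on_publisher_page:
        # Free at the publisher: an open license separates hybrid from bronze.
        return "hybrid" if has_open_license else "bronze"
    if free_in_repository:
        return "green"  # toll-access at the publisher, free copy in a repository
    return "closed"  # all other articles
```

For instance, a work that is paywalled at the publisher but deposited in an OA repository is classified as green under this rule.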

open_access.oa_url

STRING

NULLABLE

The best Open Access (OA) URL for this work. Although there are many ways to define OA, in this context an OA URL is one where you can read the fulltext of this work without needing to pay money or log in. The “best” such URL is the one closest to the version of record. This URL might be a direct link to a PDF, or it might point to a landing page that links to the free PDF.
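One way to read "closest to the version of record" is to rank OA locations by their version value and take the best-ranked one. The sketch below does that on plain dictionaries shaped like the locations records. The ranking table, the function name and the preference for pdf_url over landing_page_url are assumptions made for the example.

```python
# Assumed ranking: lower numbers are closer to the version of record.
VERSION_RANK = {"publishedVersion": 0, "acceptedVersion": 1, "submittedVersion": 2}


def best_oa_url(locations):
    """Return a URL for the OA location closest to the version of record."""
    oa_locations = [loc for loc in locations if loc.get("is_oa")]
    if not oa_locations:
        return None
    # Unknown or missing versions rank after all known versions.
    best = min(
        oa_locations,
        key=lambda loc: VERSION_RANK.get(loc.get("version"), len(VERSION_RANK)),
    )
    # Assumption: prefer a direct PDF link, else fall back to the landing page.
    return best.get("pdf_url") or best.get("landing_page_url")
```

Locations that are not OA are ignored even when they hold the published version, so a toll-access version of record never wins over a free accepted manuscript.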

primary_location

RECORD

NULLABLE

A Location object with the primary location of this work.

primary_location.doi

STRING

NULLABLE

primary_location.is_accepted

BOOLEAN

NULLABLE

primary_location.is_oa

BOOLEAN

NULLABLE

True if this work is Open Access (OA).

primary_location.is_published

BOOLEAN

NULLABLE

primary_location.landing_page_url

STRING

NULLABLE

The landing page URL for this location.

primary_location.license

STRING

NULLABLE

The location’s publishing license. This can be a Creative Commons license such as cc0 or cc-by, a publisher-specific license, or null, which means we are not able to determine a license for this location.

primary_location.pdf_url

STRING

NULLABLE

A URL where you can find this location as a PDF.

primary_location.source

RECORD

NULLABLE

primary_location.source.display_name

STRING

NULLABLE

The name of the source.

primary_location.source.host_institution_lineage

STRING

REPEATED

primary_location.source.host_institution_lineage_names

STRING

REPEATED

primary_location.source.host_organization

STRING

NULLABLE

The host organization for this source as an OpenAlex ID. This will be an Institution.id if the source is a repository, and a Publisher.id if the source is a journal, conference, or eBook platform (based on the type field).

primary_location.source.host_organization_lineage

STRING

REPEATED

The OpenAlex IDs in the lineage of the host organization; see Publisher.lineage. This is only included if the host_organization is a publisher (and not if the host_organization is an institution).

primary_location.source.host_organization_lineage_names

STRING

REPEATED

The names of the organisations in host_organization_lineage.

primary_location.source.host_organization_name

STRING

NULLABLE

The display_name from the host_organization, shown for convenience.

primary_location.source.id

STRING

NULLABLE

The OpenAlex ID for this source.

primary_location.source.is_in_doaj

BOOLEAN

NULLABLE

Whether this is a journal listed in the Directory of Open Access Journals (DOAJ).

primary_location.source.is_oa

BOOLEAN

NULLABLE

primary_location.source.issn

STRING

REPEATED

The ISSNs used by this source. Many publications have multiple ISSNs, so ISSN-L should be used when possible.

primary_location.source.issn_l

STRING

NULLABLE

The ISSN-L identifying this source. This is the Canonical External ID for sources.

primary_location.source.publisher

STRING

NULLABLE

The publisher name.

primary_location.source.publisher_id

STRING

NULLABLE

The OpenAlex ID of the publisher.

primary_location.source.publisher_lineage

STRING

REPEATED

primary_location.source.publisher_lineage_names

STRING

REPEATED

primary_location.source.type

STRING

NULLABLE

The type of source.

primary_location.version

STRING

NULLABLE

The version of the work, based on the DRIVER Guidelines versioning scheme. Possible values are: publishedVersion: The document’s version of record. This is the most authoritative version. acceptedVersion: The document after having completed peer review and being officially accepted for publication. It will lack publisher formatting, but the content should be interchangeable with that of the publishedVersion. submittedVersion: The document as submitted to the publisher by the authors, but before peer review. Its content may differ significantly from that of the accepted article.

publication_date

DATE

NULLABLE

The day when this work was published, formatted as an ISO 8601 date. Where different publication dates exist, we select the earliest available date of electronic publication. This date applies to the version found at Work.url. The other versions, found in Work.locations, may have been published at different (earlier) dates.
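The selection rule for publication_date can be sketched as taking the minimum over the known electronic publication dates. The function name and the input shape (a list of ISO 8601 date strings, possibly containing missing values) are assumptions for this example.

```python
from datetime import date


def select_publication_date(candidate_dates):
    """Return the earliest ISO 8601 date among the candidates, or None."""
    # Skip missing values and parse the rest into date objects.
    parsed = [date.fromisoformat(value) for value in candidate_dates if value]
    return min(parsed).isoformat() if parsed else None
```

Parsing into `date` objects before comparing keeps the comparison explicit and rejects strings that are not valid dates.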

publication_year

INTEGER

NULLABLE

The year this work was published.

referenced_works

STRING

REPEATED

OpenAlex IDs for works that this work cites. These are citations that go from this work out to another work: This work ➞ Other works.
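Because referenced_works only stores outgoing citations, incoming citations have to be derived by inverting the relation. A minimal sketch on plain dictionaries follows. The function name and the short IDs in the usage note are placeholders, not real OpenAlex IDs.

```python
from collections import defaultdict


def build_cited_by_index(works):
    """Invert referenced_works: map each cited work ID to the IDs citing it."""
    cited_by = defaultdict(list)
    for work in works:
        # Each entry is an edge: this work -> cited work.
        for cited_id in work.get("referenced_works") or []:
            cited_by[cited_id].append(work["id"])
    return dict(cited_by)
```

With works `W1` citing `W2` and `W3`, and `W2` citing `W3`, the index maps `W3` to `["W1", "W2"]` and `W2` to `["W1"]`.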

referenced_works_count

INTEGER

NULLABLE

related_works

STRING

REPEATED

OpenAlex IDs for works related to this work.

summary_stats

RECORD

NULLABLE

summary_stats.2yr_cited_by_count

INTEGER

NULLABLE

summary_stats.2yr_h_index

INTEGER

NULLABLE

summary_stats.2yr_i10_index

INTEGER

NULLABLE

summary_stats.2yr_mean_citedness

FLOAT

NULLABLE

summary_stats.cited_by_count

INTEGER

NULLABLE

summary_stats.h_index

INTEGER

NULLABLE

summary_stats.i10_index

INTEGER

NULLABLE

summary_stats.oa_percent

FLOAT

NULLABLE

sustainable_development_goals

RECORD

REPEATED

List of Sustainable Development Goal (SDG) objects. The United Nations’ 17 Sustainable Development Goals are a collection of goals at the heart of a global “shared blueprint for peace and prosperity for people and the planet.” We tag works with their relevance to these goals using the OpenAlex SDG Classifier, an mBERT machine learning model developed by the Aurora Universities Network. The score represents the model’s predicted probability of the work’s relevance for a particular goal.

sustainable_development_goals.display_name

STRING

NULLABLE

sustainable_development_goals.id

STRING

NULLABLE

sustainable_development_goals.score

FLOAT

NULLABLE

The prediction score for this goal. Only SDGs with a prediction score higher than 0.1 are included.
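The effect of the 0.1 threshold can be sketched as a simple filter over the model's predictions. The function name, the record shape and the example IDs are assumptions for illustration.

```python
SDG_SCORE_THRESHOLD = 0.1  # threshold stated in the field description


def filter_sdgs(predictions, threshold=SDG_SCORE_THRESHOLD):
    """Keep only SDG predictions whose score is strictly above the threshold."""
    return [sdg for sdg in predictions if sdg["score"] > threshold]
```

The comparison is strict, so a prediction scoring exactly 0.1 is dropped, matching "higher than 0.1".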

title

STRING

NULLABLE

The title of this work.

type

STRING

NULLABLE

The type or genre of the work. This field uses Crossref’s “type” controlled vocabulary; you can see all possible values via the Crossref API here: https://api.crossref.org/types. Where possible, we just pass along Crossref’s type value for each work. When that’s impossible (e.g., the work isn’t in Crossref), we do our best to figure out the type ourselves. Unfortunately, the accuracy of Crossref’s data for this isn’t great, and ours isn’t much better. We’re working to develop better type classification.

type_crossref

STRING

NULLABLE

Legacy type information, using Crossref’s “type” controlled vocabulary.

updated

TIMESTAMP

NULLABLE

updated_date

TIMESTAMP

NULLABLE

The last time anything in this Work object changed, expressed as an ISO 8601 date string. This date is updated for any change at all, including increases in various counts.

url

STRING

NULLABLE

The URL where you can access this work.

version

STRING

NULLABLE

The version of the work, based on the DRIVER Guidelines versioning scheme. Possible values are: publishedVersion, acceptedVersion or submittedVersion.

keywords

RECORD

REPEATED

keywords.keyword

STRING

NULLABLE

keywords.score

FLOAT

NULLABLE

cited_by_percentile_year

RECORD

NULLABLE

cited_by_percentile_year.min

FLOAT

NULLABLE

cited_by_percentile_year.max

FLOAT

NULLABLE