ORCID

ORCID (Open Researcher and Contributor ID) is a nonproprietary alphanumeric code that uniquely identifies authors and contributors of scholarly communication. The name also refers to ORCID’s website and services, which can be used to look up authors and their bibliographic output (and other user-supplied pieces of information). For more information, see: https://orcid.org/

This telescope syncs records from the ORCID AWS bucket and stores the up-to-date records in BigQuery.

The corresponding tables created in BigQuery are orcid.orcid and orcid.orcid_partitions.

Summary

Average runtime: 15 hours
Average download size: 70 GB
Harvest Type: AWS transfer
Harvest Frequency: Weekly
Runs on remote worker: True
Catchup missed runs: False
Table Write Disposition: Append
Update Frequency: Daily
Credentials Required: Yes
Uses Telescope Template: Stream

Using the transfer service

The files in the AWS bucket are transferred to a separate Google Cloud Storage bucket using the Storage Transfer Service. The transfer service cannot be used with a specified directory inside a Google Cloud bucket, so a separate bucket needs to be created to sync the data. To use the transfer service, the Storage Transfer API must be enabled and the correct permissions must be set on both the Google Cloud Storage bucket and the AWS bucket.
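As a rough illustration of what such a transfer looks like, the sketch below builds a request body for a one-off transfer job from an AWS bucket to a Google Cloud Storage bucket. The field names follow the Storage Transfer API's transferJobs resource as commonly documented; the function name, its parameters and the example values are hypothetical, and this is not necessarily how the telescope itself creates the job.

```python
import datetime


def build_transfer_job_body(
    project_id: str,
    aws_bucket: str,
    gcs_bucket: str,
    aws_access_key_id: str,
    aws_secret_access_key: str,
    run_date: datetime.date,
) -> dict:
    """Build a request body for a single-run AWS S3 -> GCS transfer job.

    Hypothetical helper: it only assembles a dictionary and does not call any API.
    """
    # Using the same start and end date makes the job run once.
    day = {"year": run_date.year, "month": run_date.month, "day": run_date.day}
    return {
        "description": f"Transfer {aws_bucket} to {gcs_bucket}",
        "status": "ENABLED",
        "projectId": project_id,
        "schedule": {"scheduleStartDate": day, "scheduleEndDate": day},
        "transferSpec": {
            "awsS3DataSource": {
                "bucketName": aws_bucket,
                "awsAccessKey": {
                    "accessKeyId": aws_access_key_id,
                    "secretAccessKey": aws_secret_access_key,
                },
            },
            # The sink is a whole bucket, which is why a separate bucket is needed.
            "gcsDataSink": {"bucketName": gcs_bucket},
        },
    }


# Example with placeholder values (assumptions, not real credentials).
body = build_transfer_job_body(
    project_id="my-project",
    aws_bucket="v2.0-summaries",
    gcs_bucket="my-orcid-bucket",
    aws_access_key_id="AKIAEXAMPLE",
    aws_secret_access_key="example-secret",
    run_date=datetime.date(2021, 1, 3),
)
```

The resulting dictionary could then be passed to whichever client is used to create the transfer job.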

Enabling the Storage Transfer API

The API should already be enabled from the Terraform set-up. If this is not the case, see the Google support answer for information on how to enable an API, then search for the Storage Transfer API and enable it.

Setting permissions on Google Cloud bucket

A separate bucket needs to be created for the ORCID records. The following permissions are required on this Google Cloud bucket:

  • storage.buckets.get

  • storage.objects.list

  • storage.objects.get

  • storage.objects.create

The roles/storage.objectViewer and roles/storage.legacyBucketWriter roles together contain the permissions that are always required. These roles or permissions need to be assigned on the specific bucket to the service account performing the transfer.

The Storage Transfer Service uses the project-[$PROJECT_NUMBER]@storage-transfer-service.iam.gserviceaccount.com service account.

Additionally, the Airflow service account requires the storage.buckets.get permission on the ORCID bucket so that it can check whether the bucket exists before starting the telescope. The Storage Admin role contains this permission. The Airflow service account has the format <project_id>@<project_id>.iam.gserviceaccount.com
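The account names and role assignments described above can be summarised in a short sketch. It only builds strings and plain dictionaries in the shape of IAM policy bindings and does not call any Google Cloud API. The function names and example project values are hypothetical, and roles/storage.admin is assumed here to be the identifier of the Storage Admin role.

```python
def transfer_service_account(project_number: str) -> str:
    """Service account used by the Storage Transfer Service for a project."""
    return f"project-{project_number}@storage-transfer-service.iam.gserviceaccount.com"


def airflow_service_account(project_id: str) -> str:
    """Airflow service account, following the format given in this document."""
    return f"{project_id}@{project_id}.iam.gserviceaccount.com"


def orcid_bucket_bindings(project_number: str, project_id: str) -> list:
    """Role bindings to set on the ORCID bucket (hypothetical helper)."""
    transfer = f"serviceAccount:{transfer_service_account(project_number)}"
    airflow = f"serviceAccount:{airflow_service_account(project_id)}"
    return [
        # Together these two roles cover the four permissions listed above.
        {"role": "roles/storage.objectViewer", "members": [transfer]},
        {"role": "roles/storage.legacyBucketWriter", "members": [transfer]},
        # Assumed role ID for Storage Admin, which includes storage.buckets.get.
        {"role": "roles/storage.admin", "members": [airflow]},
    ]


# Example with placeholder project values.
bindings = orcid_bucket_bindings("123456789", "my-project")
```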

Setting permissions on AWS bucket

The AWS buckets are managed by ORCID. There are three different buckets:

  • orcid-lambda-file

  • v2.0-summaries

  • v2.0-activities

In this telescope only the first two are used. The required policy for these buckets is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Resource": [
        "arn:aws:s3:::orcid-lambda-file",
        "arn:aws:s3:::v2.0-activities",
        "arn:aws:s3:::v2.0-summaries"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::orcid-lambda-file/*",
        "arn:aws:s3:::v2.0-activities/*",
        "arn:aws:s3:::v2.0-summaries/*"
      ]
    }
  ]
}
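The policy above repeats each bucket name in two places, once as a bucket ARN and once as an object ARN. A minimal sketch that generates the same document from a list of bucket names is shown here; the function name is hypothetical and only the standard library is used.

```python
import json


def build_orcid_policy(buckets: list) -> dict:
    """Build the read-only S3 policy for the given bucket names."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions apply to the bucket ARN itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": [f"arn:aws:s3:::{name}" for name in buckets],
            },
            {
                # Object-level actions apply to every key inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{name}/*" for name in buckets],
            },
        ],
    }


policy = build_orcid_policy(["orcid-lambda-file", "v2.0-activities", "v2.0-summaries"])
policy_json = json.dumps(policy, indent=2)
```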

Airflow connections

Note that all values need to be URL-encoded. The following Airflow connections are required in the config.yaml file:

orcid

This connection contains the AWS access key id and secret access key that are used to access data in the AWS buckets. Make sure to URL encode each of the fields ‘access_key_id’ and ‘secret_access_key’.

orcid: aws://<access_key_id>:<secret_access_key>@
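AWS secret access keys often contain characters such as / and +, which must be percent-encoded before they are placed in the connection string. One way to produce the value is sketched below, using urllib.parse.quote from the Python standard library; the function name and the example credentials are hypothetical.

```python
from urllib.parse import quote


def orcid_connection_uri(access_key_id: str, secret_access_key: str) -> str:
    """Build the 'orcid' connection value with both fields URL-encoded."""
    # safe="" makes quote() encode '/' as well, which it leaves alone by default.
    encoded_id = quote(access_key_id, safe="")
    encoded_secret = quote(secret_access_key, safe="")
    return f"aws://{encoded_id}:{encoded_secret}@"


# Placeholder credentials, not real keys.
uri = orcid_connection_uri("AKIAEXAMPLE", "ab/cd+ef=")
# uri == "aws://AKIAEXAMPLE:ab%2Fcd%2Bef%3D@"
```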

Airflow variables

In the config.yaml file, the following Airflow variables are required (the bucket name is given without the gs:// prefix):

orcid_bucket

orcid_bucket: <orcid_bucket_name>
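If the bucket name is copied from a gs:// URL, the prefix has to be removed before it is used as the variable value. A small sketch of that clean-up follows; the helper name is hypothetical and it is not part of the telescope.

```python
def normalise_bucket_name(value: str) -> str:
    """Return a bare bucket name, without gs:// prefix or trailing slash."""
    name = value.strip()
    if name.startswith("gs://"):
        name = name[len("gs://"):]
    return name.rstrip("/")


bucket = normalise_bucket_name("gs://my-orcid-bucket/")
# bucket == "my-orcid-bucket"
```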