academic_observatory_workflows.orcid_telescope.telescope

Classes

DagParams

:param dag_id: the id of the DAG.

Functions

create_dag(→ airflow.DAG)

Construct an ORCID telescope instance.

Module Contents

class academic_observatory_workflows.orcid_telescope.telescope.DagParams(dag_id: str, cloud_workspace: observatory_platform.airflow.workflow.CloudWorkspace, orcid_bucket: str = 'ao-orcid', orcid_summaries_prefix: str = 'orcid_summaries', bq_dataset_id: str = 'orcid', api_bq_dataset_id: str = 'dataset_api', bq_main_table_name: str = 'orcid', bq_upsert_table_name: str = 'orcid_upsert', bq_delete_table_name: str = 'orcid_delete', dataset_description: str = 'The ORCID dataset and supporting tables', snapshot_expiry_days: int = 31, schema_file_path: str = project_path('orcid_telescope', 'schema', 'orcid.json'), delete_schema_file_path: str = project_path('orcid_telescope', 'schema', 'orcid_delete.json'), transfer_attempts: int = 5, max_workers: int | None = None, observatory_api_conn_id: str = AirflowConns.OBSERVATORY_API, aws_orcid_conn_id: str = 'aws_orcid', start_date: pendulum.DateTime = pendulum.datetime(2023, 6, 1), schedule: str = '0 0 * * 0', max_active_runs: int = 1, retries: int = 3, test_run: bool = False, gke_volume_size: str = '1000Gi', gke_namespace: str = 'coki-astro', gke_volume_name: str = 'orcid', **kwargs)[source]
Parameters:
  • dag_id – the id of the DAG.

  • cloud_workspace – the cloud workspace settings.

  • orcid_bucket – the Google Cloud Storage bucket where the ORCID files are stored.

  • orcid_summaries_prefix – the base folder containing the ORCID summaries.

  • bq_dataset_id – BigQuery dataset ID.

  • bq_main_table_name – BigQuery main table name for the ORCID table.

  • bq_upsert_table_name – BigQuery table name for the ORCID upsert table.

  • bq_delete_table_name – BigQuery table name for the ORCID delete table.

  • dataset_description – BigQuery dataset description.

  • snapshot_expiry_days – the number of days that a snapshot of each entity’s main table will take to expire, which is set to 31 days so there is some time to rollback after an update.

  • schema_file_path – the path to the schema file for the records produced by this workflow.

  • delete_schema_file_path – the path to the delete schema file for the records produced by this workflow.

  • transfer_attempts – the number of AWS to GCP transfer attempts.

  • max_workers – maximum processes to use when transforming files.

  • observatory_api_conn_id – the Observatory API Airflow Connection ID.

  • aws_orcid_conn_id – Airflow Connection ID for the AWS ORCID bucket.

  • start_date – the Apache Airflow DAG start date.

  • schedule – the Apache Airflow schedule interval.

  • max_active_runs – the maximum number of DAG runs that can be run at once.

  • test_run – Whether this is a test run or not.

  • gke_namespace – The cluster namespace to use.

  • gke_volume_name – The name of the persistent volume to create.

  • gke_volume_size – The amount of storage to request for the persistent volume.

  • kwargs – Takes kwargs for building a GkeParams object.
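Two of the defaults above are easier to read with a small worked example. The sketch below uses only the Python standard library and does not call the workflow itself. The helper names `snapshot_expiry` and `describe_cron` are hypothetical and exist only for illustration. It assumes the expiry is counted in whole days from the moment the snapshot is taken, which the parameter description suggests but does not state.

```python
from datetime import datetime, timedelta, timezone

# Defaults copied from the DagParams signature.
SNAPSHOT_EXPIRY_DAYS = 31
SCHEDULE = "0 0 * * 0"


def snapshot_expiry(snapshot_date: datetime, expiry_days: int = SNAPSHOT_EXPIRY_DAYS) -> datetime:
    """Hypothetical helper: the time at which a snapshot taken at snapshot_date would expire."""
    return snapshot_date + timedelta(days=expiry_days)


def describe_cron(expression: str) -> dict:
    """Hypothetical helper: split a five-field cron expression into named fields."""
    minute, hour, day_of_month, month, day_of_week = expression.split()
    return {
        "minute": minute,
        "hour": hour,
        "day_of_month": day_of_month,
        "month": month,
        "day_of_week": day_of_week,  # in cron, 0 means Sunday
    }


if __name__ == "__main__":
    snapshot = datetime(2023, 6, 4, tzinfo=timezone.utc)
    print(snapshot_expiry(snapshot))
    print(describe_cron(SCHEDULE))
```

Read this way, the default schedule '0 0 * * 0' fires at 00:00 every Sunday, so a 31-day expiry keeps a snapshot available across roughly four weekly runs.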

dag_id[source]
cloud_workspace[source]
orcid_bucket = 'ao-orcid'[source]
orcid_summaries_prefix = 'orcid_summaries'[source]
bq_dataset_id = 'orcid'[source]
api_bq_dataset_id = 'dataset_api'[source]
bq_main_table_name = 'orcid'[source]
bq_upsert_table_name = 'orcid_upsert'[source]
bq_delete_table_name = 'orcid_delete'[source]
dataset_description = 'The ORCID dataset and supporting tables'[source]
snapshot_expiry_days = 31[source]
schema_file_path[source]
delete_schema_file_path[source]
transfer_attempts = 5[source]
max_workers = None[source]
observatory_api_conn_id[source]
aws_orcid_conn_id = 'aws_orcid'[source]
start_date[source]
schedule = '0 0 * * 0'[source]
max_active_runs = 1[source]
retries = 3[source]
test_run = False[source]
gke_params[source]
academic_observatory_workflows.orcid_telescope.telescope.create_dag(dag_params: DagParams) → airflow.DAG[source]

Construct an ORCID telescope instance.
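A minimal usage sketch, based only on the signatures documented on this page: build a DagParams object, then pass it to create_dag. The wrapper `build_orcid_dag`, the `overrides` dictionary, and the DAG id `orcid_test` are hypothetical. The `cloud_workspace` argument is assumed to be a CloudWorkspace created elsewhere, since its constructor is not documented here.

```python
# Hypothetical overrides of the documented defaults; every key is a DagParams keyword argument.
overrides = {
    "dag_id": "orcid_test",  # hypothetical DAG id
    "test_run": True,
    "max_workers": 4,
    "transfer_attempts": 3,
}


def build_orcid_dag(cloud_workspace):
    """Hypothetical wrapper: build the ORCID DAG from a CloudWorkspace created elsewhere."""
    # Imported lazily so this sketch can be loaded without the package installed.
    from academic_observatory_workflows.orcid_telescope.telescope import (
        DagParams,
        create_dag,
    )

    dag_params = DagParams(cloud_workspace=cloud_workspace, **overrides)
    return create_dag(dag_params=dag_params)
```

Any parameter left out of `overrides` keeps the default shown in the DagParams signature.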