academic_observatory_workflows.orcid_telescope.telescope
Classes
Functions
Construct an ORCID telescope instance.
Module Contents
- class academic_observatory_workflows.orcid_telescope.telescope.DagParams(dag_id: str, cloud_workspace: observatory_platform.airflow.workflow.CloudWorkspace, orcid_bucket: str = 'ao-orcid', orcid_summaries_prefix: str = 'orcid_summaries', bq_dataset_id: str = 'orcid', api_bq_dataset_id: str = 'dataset_api', bq_main_table_name: str = 'orcid', bq_upsert_table_name: str = 'orcid_upsert', bq_delete_table_name: str = 'orcid_delete', dataset_description: str = 'The ORCID dataset and supporting tables', snapshot_expiry_days: int = 31, schema_file_path: str = project_path('orcid_telescope', 'schema', 'orcid.json'), delete_schema_file_path: str = project_path('orcid_telescope', 'schema', 'orcid_delete.json'), transfer_attempts: int = 5, max_workers: int | None = None, observatory_api_conn_id: str = AirflowConns.OBSERVATORY_API, aws_orcid_conn_id: str = 'aws_orcid', start_date: pendulum.DateTime = pendulum.datetime(2023, 6, 1), schedule: str = '0 0 * * 0', max_active_runs: int = 1, retries: int = 3, test_run: bool = False, gke_volume_size: str = '1000Gi', gke_namespace: str = 'coki-astro', gke_volume_name: str = 'orcid', **kwargs)
- Parameters:
dag_id – the id of the DAG.
cloud_workspace – the cloud workspace settings.
orcid_bucket – the Google Cloud Storage bucket where the ORCID files are stored.
orcid_summaries_prefix – the base folder containing the ORCID summaries.
bq_dataset_id – BigQuery dataset ID.
bq_main_table_name – BigQuery main table name for the ORCID table.
bq_upsert_table_name – BigQuery table name for the ORCID upsert table.
bq_delete_table_name – BigQuery table name for the ORCID delete table.
dataset_description – BigQuery dataset description.
snapshot_expiry_days – the number of days that a snapshot of each entity’s main table will take to expire, which is set to 31 days so there is some time to roll back after an update.
schema_file_path – the path to the schema file for the records produced by this workflow.
delete_schema_file_path – the path to the delete schema file for the records produced by this workflow.
transfer_attempts – the number of AWS to GCP transfer attempts.
max_workers – maximum processes to use when transforming files.
observatory_api_conn_id – the Observatory API Airflow Connection ID.
aws_orcid_conn_id – Airflow Connection ID for the AWS ORCID bucket.
start_date – the Apache Airflow DAG start date.
schedule – the Apache Airflow schedule interval.
max_active_runs – the maximum number of DAG runs that can be run at once.
test_run – whether this is a test run or not.
gke_namespace – the cluster namespace to use.
gke_volume_name – the name of the persistent volume to create.
gke_volume_size – the amount of storage to request for the persistent volume.
kwargs – takes kwargs for building a GkeParams object.
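Two of the defaults in the signature are easier to read once unpacked: the schedule '0 0 * * 0' and the 31-day snapshot expiry. The following is a minimal sketch using only the Python standard library. It does not import or call the workflow package; describe_cron and snapshot_expiry are hypothetical helpers written for illustration, and the assumption that expiry is counted from the snapshot's creation time is ours, not something this page states.

```python
from datetime import datetime, timedelta, timezone

# Default values copied from the DagParams signature.
SCHEDULE = "0 0 * * 0"
SNAPSHOT_EXPIRY_DAYS = 31
# The signature uses pendulum.datetime(2023, 6, 1); UTC is assumed here.
START_DATE = datetime(2023, 6, 1, tzinfo=timezone.utc)

# Field order of a standard five-field cron expression.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict[str, str]:
    """Hypothetical helper: split a five-field cron expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 cron fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def snapshot_expiry(created: datetime, expiry_days: int = SNAPSHOT_EXPIRY_DAYS) -> datetime:
    """Hypothetical helper: expiry time, assuming it is counted from creation."""
    return created + timedelta(days=expiry_days)


fields = describe_cron(SCHEDULE)
expiry = snapshot_expiry(START_DATE)

print(fields)
print(expiry.date())
```

In standard cron syntax, minute 0, hour 0 and day-of-week 0 mean the DAG is scheduled weekly at midnight on Sunday. Under the stated assumption, a snapshot created on the default start date would remain available until 2 July 2023.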