academic_observatory_workflows.unpaywall_telescope.telescope

Classes

DagParams

param dag_id: the id of the DAG.

Functions

create_dag(dag_params: DagParams) → airflow.DAG

The Unpaywall Data Feed Telescope.

Module Contents

class academic_observatory_workflows.unpaywall_telescope.telescope.DagParams(dag_id: str, cloud_workspace: observatory_platform.airflow.workflow.CloudWorkspace, bq_dataset_id: str = 'unpaywall', bq_table_name: str = 'unpaywall', api_bq_dataset_id: str = 'dataset_api', schema_folder: str = project_path('unpaywall_telescope', 'schema'), dataset_description: str = 'Unpaywall Data Feed: https://unpaywall.org/products/data-feed', table_description: str = 'Unpaywall Data Feed: https://unpaywall.org/products/data-feed', primary_key: str = 'doi', unpaywall_base_url: str = 'https://api.unpaywall.org', snapshot_expiry_days: int = 7, http_header: str = None, unpaywall_conn_id: str = 'unpaywall', start_date: pendulum.DateTime = pendulum.datetime(2021, 7, 2), schedule: str = '@daily', max_active_runs: int = 1, retries: int = 3, test_run: bool = False, gke_volume_size: str = '1000Gi', gke_namespace: str = 'coki-astro', gke_volume_name: str = 'unpaywall', **kwargs)
Parameters:
  • dag_id – the id of the DAG.

  • cloud_workspace – the cloud workspace settings.

  • bq_dataset_id – the BigQuery dataset id.

  • bq_table_name – the BigQuery table name.

  • api_bq_dataset_id – the API dataset id.

  • schema_folder – the schema folder.

  • dataset_description – a description for the BigQuery dataset.

  • table_description – a description for the table.

  • primary_key – the primary key to use for merging changefiles.

  • unpaywall_base_url – the Unpaywall API base URL.

  • snapshot_expiry_days – the number of days to keep snapshots.

  • http_header – the HTTP header to use when making requests to Unpaywall.

  • unpaywall_conn_id – the Unpaywall connection key.

  • observatory_api_conn_id – the Observatory API connection key.

  • start_date – the start date of the DAG.

  • schedule – the schedule interval of the DAG.

  • max_active_runs – the maximum number of DAG runs that can be run at once.

  • retries – the number of times to retry a task.

  • gke_namespace – the cluster namespace to use.

  • gke_volume_name – the name of the persistent volume to create.

  • gke_volume_size – the amount of storage to request for the persistent volume, given as a quantity string in GiB (for example '1000Gi').

  • kwargs – additional keyword arguments used to build a GkeParams object.
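The gke_volume_size default is the string '1000Gi', where the Gi suffix denotes gibibytes (2**30 bytes). The sketch below shows what that quantity amounts to in bytes. The helper quantity_to_bytes and the BINARY_SUFFIXES table are hypothetical names introduced for illustration only; they are not part of academic_observatory_workflows, and the sketch handles only whole-number binary suffixes.

```python
# Hypothetical helper for illustration; not part of academic_observatory_workflows.
# Binary (power-of-two) suffixes and their multipliers in bytes.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def quantity_to_bytes(quantity: str) -> int:
    """Convert a binary quantity string such as '1000Gi' to a number of bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            # Everything before the suffix is treated as a whole number.
            return int(quantity[: -len(suffix)]) * factor
    raise ValueError(f"Unsupported quantity: {quantity!r}")


# The documented default volume size, expressed in bytes.
default_volume_bytes = quantity_to_bytes("1000Gi")
print(default_volume_bytes)
```

A value such as '1000G' (decimal gigabytes) is deliberately rejected by this sketch, since it is a different unit from the documented '1000Gi'.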

dag_id
cloud_workspace
bq_dataset_id = 'unpaywall'
bq_table_name = 'unpaywall'
api_bq_dataset_id = 'dataset_api'
schema_folder
schema_file_path
dataset_description = 'Unpaywall Data Feed: https://unpaywall.org/products/data-feed'
table_description = 'Unpaywall Data Feed: https://unpaywall.org/products/data-feed'
primary_key = 'doi'
unpaywall_base_url = 'https://api.unpaywall.org'
snapshot_expiry_days = 7
http_header = None
unpaywall_conn_id = 'unpaywall'
start_date
schedule = '@daily'
max_active_runs = 1
retries = 3
test_run = False
gke_volume_size = '1000Gi'
gke_namespace = 'coki-astro'
gke_volume_name = 'unpaywall'
gke_params
academic_observatory_workflows.unpaywall_telescope.telescope.create_dag(dag_params: DagParams) → airflow.DAG

The Unpaywall Data Feed Telescope.
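
As a minimal sketch of how the two pieces documented here might fit together, the wrapper below builds a DagParams and passes it to create_dag. The wrapper name build_unpaywall_dag and its default dag_id of 'unpaywall' are assumptions made for illustration; only the DagParams keyword names and the create_dag(dag_params) call come from the signatures above. The cloud_workspace argument is expected to be an already constructed CloudWorkspace, whose own constructor is not covered on this page.

```python
def build_unpaywall_dag(cloud_workspace, dag_id: str = "unpaywall", snapshot_expiry_days: int = 7):
    """Build the Unpaywall DAG from an existing CloudWorkspace (illustrative wrapper)."""
    # Imported lazily so this function can be defined without the package installed;
    # calling it requires academic_observatory_workflows and its dependencies.
    from academic_observatory_workflows.unpaywall_telescope.telescope import (
        DagParams,
        create_dag,
    )

    # Only the required arguments and one override are passed; every other
    # parameter keeps the default listed in the DagParams signature.
    dag_params = DagParams(
        dag_id=dag_id,
        cloud_workspace=cloud_workspace,
        snapshot_expiry_days=snapshot_expiry_days,
    )
    return create_dag(dag_params)
```

In an Airflow deployment, the returned DAG object would typically be assigned to a module-level name in a file under the DAGs folder so the scheduler can discover it.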