academic_observatory_workflows.crossref_metadata_telescope.telescope

Classes

DagParams

Parameters for the Crossref Metadata Telescope

Functions

create_dag(→ airflow.DAG)

The Crossref Metadata DAG

Module Contents

class academic_observatory_workflows.crossref_metadata_telescope.telescope.DagParams(dag_id: str, cloud_workspace: observatory_platform.airflow.workflow.CloudWorkspace, bq_dataset_id: str = 'crossref_metadata', bq_table_name: str = 'crossref_metadata', api_bq_dataset_id: str = 'dataset_api', schema_folder: str = project_path('crossref_metadata_telescope', 'schema'), dataset_description: str = 'The Crossref Metadata Plus dataset: https://www.crossref.org/services/metadata-retrieval/metadata-plus/', table_description: str = 'The Crossref Metadata Plus dataset: https://www.crossref.org/services/metadata-retrieval/metadata-plus/', crossref_metadata_conn_id: str = 'crossref_metadata', crossref_base_url: str = 'https://api.crossref.org', max_processes: int | None = None, batch_size: int = 20, start_date: pendulum.DateTime = pendulum.datetime(2020, 6, 7), schedule: str = '0 0 7 * *', catchup: bool = True, max_active_runs: int = 1, retries: int = 3, test_run: bool = False, gke_volume_size: str = '3000Gi', gke_namespace: str = 'coki-astro', gke_volume_name: str = 'crossref-metadata', **kwargs)[source]

Parameters for the Crossref Metadata Telescope

Parameters:

dag_id – the id of the DAG.
cloud_workspace – the cloud workspace settings.
bq_dataset_id – the BigQuery dataset id.
bq_table_name – the BigQuery table name.
api_bq_dataset_id – the BigQuery dataset id to use when storing releases.
schema_folder – the SQL schema path.
dataset_description – description for the BigQuery dataset.
table_description – description for the BigQuery table.
crossref_metadata_conn_id – the Crossref Metadata Airflow connection key.
crossref_base_url – the Crossref Metadata API base URL.
observatory_api_conn_id – the Observatory API connection key. This parameter is documented here but does not appear in the constructor signature above.
max_processes – the maximum number of parallel processes. If None, it is determined at task runtime from the CPU count.
batch_size – the number of files to send to ProcessPoolExecutor at one time.
start_date – the start date of the DAG.
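schedule – the schedule interval of the DAG, given as a cron expression. The default, '0 0 7 * *', fires at 00:00 on the 7th day of each month.
catchup – whether to catch up on DAG runs missed since start_date.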
max_active_runs – the maximum number of DAG runs that can be run at once.
retries – the number of times to retry a task.
test_run – whether this is a test run or not.
gke_namespace – the cluster namespace to use.
gke_volume_name – the name of the persistent volume to create.
gke_volume_size – the amount of storage to request for the persistent volume.
kwargs – additional keyword arguments used to build the GkeParams object.
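
The interaction between batch_size and max_processes described above can be illustrated with a small standalone sketch. It is not the telescope's actual implementation: the helper names resolve_max_processes, make_batches and process_in_batches are hypothetical, and the sketch only assumes what the parameter descriptions state (files are handed to a ProcessPoolExecutor in groups of batch_size, and a max_processes of None falls back to the CPU count).

```python
import os
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Iterator, List, Optional, Sequence


def resolve_max_processes(max_processes: Optional[int] = None) -> int:
    """Return the worker count, falling back to the CPU count when None."""
    if max_processes is not None:
        return max_processes
    # os.cpu_count() can return None on some platforms, so default to 1.
    return os.cpu_count() or 1


def make_batches(files: Sequence[str], batch_size: int = 20) -> Iterator[List[str]]:
    """Yield consecutive slices of `files`, each at most `batch_size` long."""
    for start in range(0, len(files), batch_size):
        yield list(files[start:start + batch_size])


def process_in_batches(
    files: Sequence[str],
    worker: Callable[[str], object],
    batch_size: int = 20,
    max_processes: Optional[int] = None,
) -> list:
    """Hypothetical driver: submit one batch of files at a time to the pool."""
    results = []
    workers = resolve_max_processes(max_processes)
    with ProcessPoolExecutor(max_workers=workers) as executor:
        for batch in make_batches(files, batch_size):
            # `worker` must be picklable, e.g. a module-level function.
            results.extend(executor.map(worker, batch))
    return results
```

With the default batch_size of 20, a list of 45 files would be split into batches of 20, 20 and 5. Limiting how many files are submitted at once keeps the number of pending tasks bounded, which is presumably why the parameter exists.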

dag_id[source]

cloud_workspace[source]

bq_dataset_id = 'crossref_metadata'[source]

bq_table_name = 'crossref_metadata'[source]

api_bq_dataset_id = 'dataset_api'[source]

schema_folder[source]

dataset_description = 'The Crossref Metadata Plus dataset: https://www.crossref.org/services/metadata-retrieval/metadata-plus/'[source]

table_description = 'The Crossref Metadata Plus dataset: https://www.crossref.org/services/metadata-retrieval/metadata-plus/'[source]

crossref_metadata_conn_id = 'crossref_metadata'[source]

crossref_base_url = 'https://api.crossref.org'[source]

max_processes = None[source]

batch_size = 20[source]

start_date[source]

schedule = '0 0 7 * *'[source]
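
To make the default schedule concrete, the sketch below lists the points in time matched by the cron expression '0 0 7 * *' starting from the default start_date of 2020-06-07. It uses only the standard library rather than Airflow's own scheduler, assumes UTC, and the function name monthly_ticks is hypothetical. It shows the cron ticks only; how Airflow maps ticks to data intervals and run times is not modelled here.

```python
from datetime import datetime, timezone
from typing import Iterator


def monthly_ticks(start: datetime, until: datetime, day: int = 7) -> Iterator[datetime]:
    """Yield midnight UTC on `day` of each month within [start, until]."""
    year, month = start.year, start.month
    while True:
        tick = datetime(year, month, day, tzinfo=timezone.utc)
        if tick > until:
            return
        if tick >= start:
            yield tick
        # Advance to the next calendar month, rolling over the year.
        month += 1
        if month > 12:
            year, month = year + 1, 1


start_date = datetime(2020, 6, 7, tzinfo=timezone.utc)
ticks = list(monthly_ticks(start_date, datetime(2020, 9, 30, tzinfo=timezone.utc)))
for tick in ticks:
    print(tick.isoformat())
```

With catchup set to True, a DAG deployed long after start_date would be expected to schedule a run for each of these monthly intervals that has already elapsed, subject to max_active_runs.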

catchup = True[source]

max_active_runs = 1[source]

retries = 3[source]

test_run = False[source]

gke_params[source]

academic_observatory_workflows.crossref_metadata_telescope.telescope.create_dag(dag_params: DagParams) → airflow.DAG[source]: The Crossref Metadata DAG
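
A minimal usage sketch, based only on the signatures documented above: build a DagParams instance and pass it to create_dag. The wrapper function build_crossref_metadata_dag and the dag_id value are hypothetical, and the CloudWorkspace instance is assumed to be constructed elsewhere, since its constructor is not documented on this page. Extra keyword arguments needed to build the GkeParams object, if any, are omitted here.

```python
def build_crossref_metadata_dag(cloud_workspace, dag_id: str = "crossref_metadata"):
    """Hypothetical helper that wires DagParams into create_dag.

    `cloud_workspace` is assumed to be an
    observatory_platform.airflow.workflow.CloudWorkspace instance.
    """
    # Imported inside the function so this sketch can be loaded without
    # Airflow or academic_observatory_workflows installed.
    from academic_observatory_workflows.crossref_metadata_telescope.telescope import (
        DagParams,
        create_dag,
    )

    dag_params = DagParams(
        dag_id=dag_id,
        cloud_workspace=cloud_workspace,
        batch_size=20,        # documented default
        max_processes=None,   # resolved from the CPU count at task runtime
        test_run=True,        # documented flag; its exact effect is not described here
    )
    return create_dag(dag_params)
```

In an Airflow deployment the returned DAG object would typically be assigned to a module-level variable in a file inside the DAGs folder so that the scheduler can discover it.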