academic_observatory_workflows.pubmed_telescope.telescope
Classes
Functions
Construct a PubMed Telescope instance.
Module Contents
- class academic_observatory_workflows.pubmed_telescope.telescope.DagParams(dag_id: str, cloud_workspace: observatory_platform.airflow.workflow.CloudWorkspace, bq_dataset_id: str = 'pubmed', api_bq_dataset_id: str = 'dataset_api', bq_main_table_name: str = 'pubmed', bq_upsert_table_name: str = 'pubmed_upsert', bq_delete_table_name: str = 'pubmed_delete', bq_dataset_description: str = 'Pubmed Medline database, only PubmedArticle records: https://pubmed.ncbi.nlm.nih.gov/about/', start_date: pendulum.DateTime = pendulum.datetime(year=2021, month=1, day=1), schedule: str = '@weekly', ftp_server_url: str = 'ftp.ncbi.nlm.nih.gov', ftp_port: int = 21, reset_ftp_counter: int = 40, max_download_attempt: int = 5, snapshot_expiry_days: int = 31, max_processes: int | None = None, max_active_runs: int = 1, retries: int = 3, baseline_table_description="Pubmed's main table of PubmedArticle reocrds - Includes all the metadata associated with a journal article citation, both the metadata to describe the published article, i.e. <MedlineCitation>, and additional metadata often pertaining to the publication's history or processing at NLM, i.e. <PubMedData>.", upsert_table_description="PubmedArticle upserts - Includes all the metadata associated with a journal article citation, both the metadata to describe the published article, i.e. <MedlineCitation>, and additional metadata often pertaining to the publication's history or processing at NLM, i.e. <PubMedData>.", delete_table_description='PubmedArticle deletes - Indicates one or more <PubmedArticle> or <PubmedBookArticle> that have been deleted. PMIDs in DeleteCitation will typically have been found to be duplicate citations, or citations to content that was determined to be out-of-scope for PubMed. It is possible that a PMID would appear in DeleteCitation without having been distributed in a previous file. 
This would happen if the creation and deletion of the record take place on the same day.', test_run: bool = False, gke_volume_size: str = '1000Gi', gke_namespace: str = 'coki-astro', gke_volume_name: str = 'pubmed', **kwargs)[source]
- Parameters:
dag_id – The ID of the DAG.
cloud_workspace – Cloud settings.
bq_dataset_id – Dataset name for final tables.
api_bq_dataset_id – The dataset ID of the BigQuery API.
bq_main_table_name – Table name of the final Pubmed table.
bq_upsert_table_name – Table name of the Pubmed upsert table.
bq_delete_table_name – Table name of the Pubmed delete table.
bq_dataset_description – Description of the Pubmed dataset.
start_date – The start date of the DAG.
schedule – How often the DAG should run.
ftp_server_url – Server address of Pubmed’s FTP server.
ftp_port – Port for connecting to Pubmed’s FTP server.
reset_ftp_counter – Number of files to download before the FTP connection is reset.
max_download_attempt – Maximum number of download attempts of a single Pubmed file from the FTP server before throwing an error.
snapshot_expiry_days – Number of days the backup snapshot of the Pubmed table (taken before this release’s upserts and deletes are applied) is kept in BQ before it expires.
max_processes – Maximum number of parallel processes. If None, it is determined at task runtime from the CPU count.
max_active_runs – The maximum number of DAG runs that can run at once.
retries – The number of times to retry a task.
test_run – Whether this is a test run or not.
gke_namespace – The cluster namespace to use.
gke_volume_name – The name of the persistent volume to create.
gke_volume_size – The amount of storage to request for the persistent volume.
kwargs – Takes kwargs for building a GkeParams object.
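The interplay between max_processes, reset_ftp_counter and max_download_attempt can be sketched as below. This is a minimal illustration under stated assumptions, not the telescope's actual implementation: resolve_max_processes, download_all, connect and fetch are hypothetical names introduced here, and the FTP connection is replaced by injected callables so the sketch runs without a network.

```python
import os
from typing import Callable, Iterable, List, Optional


def resolve_max_processes(max_processes: Optional[int]) -> int:
    """Hypothetical helper: use the configured value, else fall back to the CPU count."""
    if max_processes is not None:
        return max_processes
    # os.cpu_count() can return None, so fall back to a single process.
    return os.cpu_count() or 1


def download_all(
    files: Iterable[str],
    connect: Callable[[], object],
    fetch: Callable[[object, str], None],
    reset_ftp_counter: int = 40,
    max_download_attempt: int = 5,
) -> List[str]:
    """Hypothetical download loop (assumed behaviour, not the library's code).

    connect() opens a fresh connection; fetch(conn, name) downloads one file
    and raises OSError on failure.
    """
    conn = connect()
    since_reset = 0  # files downloaded on the current connection
    downloaded: List[str] = []
    for name in files:
        # Open a fresh connection after reset_ftp_counter files.
        if since_reset >= reset_ftp_counter:
            conn = connect()
            since_reset = 0
        for attempt in range(1, max_download_attempt + 1):
            try:
                fetch(conn, name)
                break
            except OSError:
                # Give up once the attempt budget for this file is spent.
                if attempt == max_download_attempt:
                    raise
                # Otherwise reconnect and try the same file again.
                conn = connect()
                since_reset = 0
        downloaded.append(name)
        since_reset += 1
    return downloaded
```

With the defaults documented above, such a loop would reconnect after every 40 files and raise once a single file has failed 5 times in a row. In a real run, connect and fetch would wrap an FTP client pointed at ftp_server_url and ftp_port.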
- bq_dataset_description = 'Pubmed Medline database, only PubmedArticle records: https://pubmed.ncbi.nlm.nih.gov/about/'[source]
- baseline_table_description = "Pubmed's main table of PubmedArticle reocrds - Includes all the metadata associated with a...[source]
- upsert_table_description = "PubmedArticle upserts - Includes all the metadata associated with a journal article citation,...[source]