academic_observatory_workflows.pubmed_telescope.telescope
=========================================================

.. py:module:: academic_observatory_workflows.pubmed_telescope.telescope


Classes
-------

.. autoapisummary::

   academic_observatory_workflows.pubmed_telescope.telescope.DagParams


Functions
---------

.. autoapisummary::

   academic_observatory_workflows.pubmed_telescope.telescope.create_dag


Module Contents
---------------

.. py:class:: DagParams(dag_id: str, cloud_workspace: observatory_platform.airflow.workflow.CloudWorkspace, bq_dataset_id: str = 'pubmed', api_bq_dataset_id: str = 'dataset_api', bq_main_table_name: str = 'pubmed', bq_upsert_table_name: str = 'pubmed_upsert', bq_delete_table_name: str = 'pubmed_delete', bq_dataset_description: str = 'Pubmed Medline database, only PubmedArticle records: https://pubmed.ncbi.nlm.nih.gov/about/', start_date: pendulum.DateTime = pendulum.datetime(year=2021, month=1, day=1), schedule: str = '@weekly', ftp_server_url: str = 'ftp.ncbi.nlm.nih.gov', ftp_port: int = 21, reset_ftp_counter: int = 40, max_download_attempt: int = 5, snapshot_expiry_days: int = 31, max_processes: Optional[int] = None, max_active_runs: int = 1, retries: int = 3, baseline_table_description="Pubmed's main table of PubmedArticle reocrds - Includes all the metadata associated with a journal article citation, both the metadata to describe the published article, i.e. , and additional metadata often pertaining to the publication's history or processing at NLM, i.e. .", upsert_table_description="PubmedArticle upserts - Includes all the metadata associated with a journal article citation, both the metadata to describe the published article, i.e. , and additional metadata often pertaining to the publication's history or processing at NLM, i.e. .", delete_table_description='PubmedArticle deletes - Indicates one or more or that have been deleted. PMIDs in DeleteCitation will typically have been found to be duplicate citations, or citations to content that was determined to be out-of-scope for PubMed. It is possible that a PMID would appear in DeleteCitation without having been distributed in a previous file. This would happen if the creation and deletion of the record take place on the same day.', test_run: bool = False, gke_volume_size: str = '1000Gi', gke_namespace: str = 'coki-astro', gke_volume_name: str = 'pubmed', **kwargs)

   :param dag_id: The ID of the DAG.
   :param cloud_workspace: Cloud settings.
   :param bq_dataset_id: Dataset name for the final tables.
   :param api_bq_dataset_id: The dataset ID of the BigQuery API.
   :param bq_main_table_name: Table name of the final Pubmed table.
   :param bq_upsert_table_name: Table name of the Pubmed upsert table.
   :param bq_delete_table_name: Table name of the Pubmed delete table.
   :param bq_dataset_description: Description of the Pubmed dataset.
   :param start_date: The start date of the DAG.
   :param schedule: How often the DAG should run.
   :param ftp_server_url: Server address of Pubmed's FTP server.
   :param ftp_port: Port for connecting to Pubmed's FTP server.
   :param reset_ftp_counter: Number of files to download before the FTP connection is reset.
   :param max_download_attempt: Maximum number of attempts to download a single Pubmed file from the FTP server before an error is raised.
   :param snapshot_expiry_days: Number of days that the backup snapshot of the Pubmed table (taken before this release's upserts and deletes) is kept in BQ before it expires.
   :param max_processes: Maximum number of parallel processes. If None, it is determined at task runtime from the CPU count.
   :param max_active_runs: The maximum number of DAG runs that can run at once.
   :param retries: The number of times to retry a task.
   :param test_run: Whether this is a test run or not.
   :param gke_namespace: The cluster namespace to use.
   :param gke_volume_name: The name of the persistent volume to create.
   :param gke_volume_size: The amount of storage to request for the persistent volume.
   :param kwargs: Keyword arguments used to build a GkeParams object.


   .. py:attribute:: dag_id


   .. py:attribute:: cloud_workspace


   .. py:attribute:: bq_dataset_id
      :value: 'pubmed'


   .. py:attribute:: api_bq_dataset_id
      :value: 'dataset_api'


   .. py:attribute:: bq_main_table_name
      :value: 'pubmed'


   .. py:attribute:: bq_upsert_table_name
      :value: 'pubmed_upsert'


   .. py:attribute:: bq_delete_table_name
      :value: 'pubmed_delete'


   .. py:attribute:: bq_dataset_description
      :value: 'Pubmed Medline database, only PubmedArticle records: https://pubmed.ncbi.nlm.nih.gov/about/'


   .. py:attribute:: baseline_table_description
      :value: "Pubmed's main table of PubmedArticle reocrds - Includes all the metadata associated with a...


   .. py:attribute:: upsert_table_description
      :value: "PubmedArticle upserts - Includes all the metadata associated with a journal article citation,...


   .. py:attribute:: delete_table_description
      :value: 'PubmedArticle deletes - Indicates one or more or that have...


   .. py:attribute:: start_date


   .. py:attribute:: schedule
      :value: '@weekly'


   .. py:attribute:: ftp_server_url
      :value: 'ftp.ncbi.nlm.nih.gov'


   .. py:attribute:: ftp_port
      :value: 21


   .. py:attribute:: reset_ftp_counter
      :value: 40


   .. py:attribute:: max_download_attempt
      :value: 5


   .. py:attribute:: snapshot_expiry_days
      :value: 31


   .. py:attribute:: max_processes
      :value: None


   .. py:attribute:: max_active_runs
      :value: 1


   .. py:attribute:: retries
      :value: 3


   .. py:attribute:: test_run
      :value: False


   .. py:attribute:: gke_params


.. py:function:: create_dag(dag_params: DagParams) -> airflow.DAG

   Construct a PubMed Telescope instance.
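
The sketch below illustrates how the ``reset_ftp_counter``, ``max_download_attempt`` and ``max_processes`` parameters described above might interact. It is not the telescope's actual implementation: the helper names ``resolve_max_processes`` and ``download_all`` and the injected ``connect`` and ``fetch`` callables are hypothetical stand-ins for the real FTP calls. Catching ``OSError`` and reconnecting before each retry are also assumptions of this example.

```python
import os
from typing import Callable, Iterable, List, Optional


def resolve_max_processes(max_processes: Optional[int]) -> int:
    """Hypothetical helper: fall back to the CPU count when no limit is given."""
    if max_processes is not None:
        return max_processes
    # os.cpu_count() may return None, so default to a single process.
    return os.cpu_count() or 1


def download_all(
    filenames: Iterable[str],
    connect: Callable[[], object],
    fetch: Callable[[object, str], None],
    reset_ftp_counter: int = 40,
    max_download_attempt: int = 5,
) -> List[str]:
    """Hypothetical helper: download files with periodic reconnects and retries.

    `connect` opens a connection and `fetch` downloads one file over it.
    Both are placeholders for real FTP calls.
    """
    conn = connect()
    downloaded: List[str] = []
    for index, name in enumerate(filenames):
        # Open a fresh connection after every `reset_ftp_counter` files.
        if index > 0 and index % reset_ftp_counter == 0:
            conn = connect()
        for attempt in range(1, max_download_attempt + 1):
            try:
                fetch(conn, name)
                downloaded.append(name)
                break
            except OSError:
                # Give up once the final allowed attempt has failed.
                if attempt == max_download_attempt:
                    raise
                # Assumption of this sketch: reconnect before retrying.
                conn = connect()
    return downloaded
```

Passing the connection logic in as callables keeps the sketch testable without network access. A real task would presumably wrap ``ftplib.FTP`` calls that use ``ftp_server_url`` and ``ftp_port``.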