academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope

Module Contents

Classes

Changefile

UnpaywallRelease

Functions

create_dag(*, dag_id, cloud_workspace, ...)

The Unpaywall Data Feed Telescope.

snapshot_url(→ str)

Returns the Unpaywall snapshot URL for the given API key.

get_snapshot_file_name(→ str)

Get the Unpaywall snapshot filename.

changefiles_url(→ str)

Returns the Unpaywall Data Feed changefiles URL for the given API key.

changefile_download_url(filename, api_key)

get_unpaywall_changefiles(→ List[Changefile])

unpaywall_filename_to_datetime(→ pendulum.DateTime)

Parses a release date from a file name.

Attributes

SNAPSHOT_URL

CHANGEFILES_URL

CHANGEFILES_DOWNLOAD_URL

academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.SNAPSHOT_URL = 'https://api.unpaywall.org/feed/snapshot'
academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.CHANGEFILES_URL = 'https://api.unpaywall.org/feed/changefiles'
academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.CHANGEFILES_DOWNLOAD_URL = 'https://api.unpaywall.org/daily-feed/changefile'
class academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.Changefile(filename: str, changefile_date: pendulum.DateTime, changefile_release: observatory.platform.workflows.workflow.ChangefileRelease = None)
property download_file_path
property extract_file_path
property transform_file_path
__eq__(other)

Return self==value.

static from_dict(dict_: Dict) → Changefile
to_dict() → Dict
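
The to_dict / from_dict pair suggests that a changefile can be serialised to a plain dictionary and rebuilt from it. A minimal sketch of that round trip follows. It assumes the date is stored as an ISO 8601 string, uses a hypothetical ChangefileSketch dataclass, and substitutes the standard library datetime for pendulum.DateTime; the real class may use different keys and also carries a changefile_release reference that is omitted here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict


@dataclass(frozen=True)
class ChangefileSketch:
    """Hypothetical stand-in for Changefile; not the library's class."""

    filename: str
    changefile_date: datetime

    def to_dict(self) -> Dict:
        # Assumption: the date is serialised as an ISO 8601 string.
        return {
            "filename": self.filename,
            "changefile_date": self.changefile_date.isoformat(),
        }

    @staticmethod
    def from_dict(dict_: Dict) -> "ChangefileSketch":
        return ChangefileSketch(
            filename=dict_["filename"],
            changefile_date=datetime.fromisoformat(dict_["changefile_date"]),
        )


# Illustrative file name and date, not taken from a real feed.
original = ChangefileSketch(
    filename="changed_dois_with_versions_2020-03-11T005336.jsonl.gz",
    changefile_date=datetime(2020, 3, 11, 0, 53, 36, tzinfo=timezone.utc),
)
restored = ChangefileSketch.from_dict(original.to_dict())
```

Because the dataclass generates __eq__ from its fields, restored compares equal to original, which is the property a serialisation round trip should preserve.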
class academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.UnpaywallRelease(*, dag_id: str, run_id: str, cloud_workspace: observatory.platform.observatory_config.CloudWorkspace, bq_dataset_id: str, bq_table_name: str, is_first_run: bool, snapshot_date: pendulum.DateTime, changefiles: List[Changefile], prev_end_date: pendulum.DateTime)

Bases: observatory.platform.workflows.workflow.Release

static from_dict(dict_: dict) → UnpaywallRelease
to_dict() → dict
academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.create_dag(*, dag_id: str, cloud_workspace: observatory.platform.observatory_config.CloudWorkspace, bq_dataset_id: str = 'unpaywall', bq_table_name: str = 'unpaywall', api_dataset_id: str = 'unpaywall', schema_folder: str = project_path('unpaywall_telescope', 'schema'), dataset_description: str = 'Unpaywall Data Feed: https://unpaywall.org/products/data-feed', table_description: str = 'Unpaywall Data Feed: https://unpaywall.org/products/data-feed', primary_key: str = 'doi', snapshot_expiry_days: int = 7, http_header: str = None, unpaywall_conn_id: str = 'unpaywall', observatory_api_conn_id: str = AirflowConns.OBSERVATORY_API, start_date: pendulum.DateTime = pendulum.datetime(2021, 7, 2), schedule: str = '@daily', max_active_runs: int = 1, retries: int = 3) → airflow.DAG

The Unpaywall Data Feed Telescope.

Parameters:
  • dag_id – the id of the DAG.

  • cloud_workspace – the cloud workspace settings.

  • bq_dataset_id – the BigQuery dataset id.

  • bq_table_name – the BigQuery table name.

  • api_dataset_id – the API dataset id.

  • schema_folder – the schema folder.

  • dataset_description – a description for the BigQuery dataset.

  • table_description – a description for the table.

  • primary_key – the primary key to use for merging changefiles.

  • snapshot_expiry_days – the number of days to keep snapshots.

  • http_header – the http header to use when making requests to Unpaywall.

  • unpaywall_conn_id – the Unpaywall connection key.

  • observatory_api_conn_id – the Observatory API connection key.

  • start_date – the start date of the DAG.

  • schedule – the schedule interval of the DAG.

  • max_active_runs – the maximum number of DAG runs that can be run at once.

  • retries – the number of times to retry a task.
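
The primary_key parameter is described as the key used for merging changefiles. Conceptually that is an upsert keyed on doi: a changed record replaces the existing record with the same key, and records with unseen keys are added. The sketch below illustrates the idea in plain Python with a hypothetical merge_changefile helper and made-up records. The workflow itself targets BigQuery, and its actual merge logic is not shown in this reference.

```python
from typing import Dict, Iterable, List


def merge_changefile(
    main: List[Dict], changes: Iterable[Dict], primary_key: str = "doi"
) -> List[Dict]:
    """Upsert `changes` into `main`, matching rows on `primary_key` (hypothetical helper)."""
    merged = {row[primary_key]: row for row in main}
    for row in changes:
        # A later record with the same key overwrites the earlier one.
        merged[row[primary_key]] = row
    return list(merged.values())


# Made-up records for illustration only.
main_table = [
    {"doi": "10.1000/a", "is_oa": False},
    {"doi": "10.1000/b", "is_oa": True},
]
changefile = [
    {"doi": "10.1000/a", "is_oa": True},   # updated record
    {"doi": "10.1000/c", "is_oa": False},  # new record
]
result = merge_changefile(main_table, changefile)
```

Applying changefiles in date order matters under this scheme, since the last record seen for a given key wins.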

academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.snapshot_url(api_key: str) → str

Returns the Unpaywall snapshot URL for the given API key.

academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.get_snapshot_file_name(api_key: str) → str

Get the Unpaywall snapshot filename.

Parameters:

api_key – the Unpaywall API key.

Returns:

the snapshot file name.

academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.changefiles_url(api_key: str) → str

Returns the Unpaywall Data Feed changefiles URL for the given API key.

academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.changefile_download_url(filename: str, api_key: str)
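
The three URL helpers (snapshot_url, changefiles_url and changefile_download_url) presumably combine the module-level constants with the caller's API key. A minimal sketch, assuming the key is passed as an api_key query parameter and the changefile name is appended as a path segment. The build_* names are hypothetical, and the real functions may add further query parameters.

```python
from urllib.parse import quote, urlencode

# Constants as documented in the Attributes section of this module.
SNAPSHOT_URL = "https://api.unpaywall.org/feed/snapshot"
CHANGEFILES_URL = "https://api.unpaywall.org/feed/changefiles"
CHANGEFILES_DOWNLOAD_URL = "https://api.unpaywall.org/daily-feed/changefile"


def build_snapshot_url(api_key: str) -> str:
    # Assumption: the API key travels as the `api_key` query parameter.
    return f"{SNAPSHOT_URL}?{urlencode({'api_key': api_key})}"


def build_changefiles_url(api_key: str) -> str:
    return f"{CHANGEFILES_URL}?{urlencode({'api_key': api_key})}"


def build_changefile_download_url(filename: str, api_key: str) -> str:
    # Assumption: the file name is a path segment after the base URL.
    return f"{CHANGEFILES_DOWNLOAD_URL}/{quote(filename)}?{urlencode({'api_key': api_key})}"


# "my-key" is a placeholder, not a real credential.
snapshot = build_snapshot_url("my-key")
download = build_changefile_download_url(
    "changed_dois_with_versions_2020-03-11T005336.jsonl.gz", "my-key"
)
```

Using urlencode and quote rather than plain string concatenation keeps the URL valid if a key or file name ever contains reserved characters.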
academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.get_unpaywall_changefiles(api_key: str) → List[Changefile]
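
get_unpaywall_changefiles returns a list of Changefile objects for the given API key, which suggests it fetches the changefile listing and converts each entry. The sketch below covers only the conversion step, on an in-memory payload. The payload shape (a list array whose entries carry filename, filetype and date fields) and the list_jsonl_changefiles helper are assumptions for illustration, not the documented API response.

```python
from datetime import date
from typing import Dict, List


def list_jsonl_changefiles(payload: Dict) -> List[str]:
    """Return jsonl changefile names, oldest first (hypothetical helper)."""
    # Assumption: entries are tagged with a "filetype" and only jsonl is wanted.
    entries = [
        item for item in payload.get("list", []) if item.get("filetype") == "jsonl"
    ]
    # Assumption: each entry has a "date" field in YYYY-MM-DD form.
    entries.sort(key=lambda item: date.fromisoformat(item["date"]))
    return [item["filename"] for item in entries]


# Made-up payload for illustration only.
payload = {
    "list": [
        {
            "filename": "changed_dois_with_versions_2020-03-12T005336.jsonl.gz",
            "filetype": "jsonl",
            "date": "2020-03-12",
        },
        {
            "filename": "changed_dois_with_versions_2020-03-12T005336.csv.gz",
            "filetype": "csv",
            "date": "2020-03-12",
        },
        {
            "filename": "changed_dois_with_versions_2020-03-11T005336.jsonl.gz",
            "filetype": "jsonl",
            "date": "2020-03-11",
        },
    ]
}
filenames = list_jsonl_changefiles(payload)
```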
academic_observatory_workflows.unpaywall_telescope.unpaywall_telescope.unpaywall_filename_to_datetime(file_name: str) → pendulum.DateTime

Parses a release date from a file name.

Parameters:

file_name – Unpaywall release file name (contains date string).

Returns:

the release date parsed from the file name.
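
As an illustration of parsing a release date from a file name, the sketch below extracts a timestamp with a regular expression. It assumes the date string has the form YYYY-MM-DDTHHMMSS, uses a hypothetical filename_to_datetime helper, and returns a standard library datetime rather than pendulum.DateTime; the real function may match a different pattern.

```python
import re
from datetime import datetime

# Assumption: file names embed a timestamp such as 2023-04-25T083002.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}T\d{6}")


def filename_to_datetime(file_name: str) -> datetime:
    """Extract the embedded timestamp from a release file name (hypothetical helper)."""
    match = DATE_PATTERN.search(file_name)
    if match is None:
        raise ValueError(f"no date string found in {file_name!r}")
    return datetime.strptime(match.group(0), "%Y-%m-%dT%H%M%S")


# Illustrative file name, not taken from a real feed.
parsed = filename_to_datetime("unpaywall_snapshot_2023-04-25T083002.jsonl.gz")
```

Raising ValueError on a missing match makes a malformed file name fail loudly instead of surfacing later as an AttributeError on None.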