academic_observatory_workflows.workflows.open_citations_telescope

Module Contents

Classes

OpenCitationsRelease

OpenCitationsTelescope

A telescope that harvests the Open Citations COCI CSV dataset: http://opencitations.net/index/coci

Functions

list_releases(→ List[Dict[str, str]])

List available releases from figshare between the start and end date. Semi-open interval [start, end).

Attributes

VERSION_URL

academic_observatory_workflows.workflows.open_citations_telescope.VERSION_URL = 'https://api.figshare.com/v2/articles/6741422/versions'

class academic_observatory_workflows.workflows.open_citations_telescope.OpenCitationsRelease(*, dag_id: str, run_id: str, snapshot_date: pendulum.DateTime, files: List[observatory.platform.utils.http_download.DownloadInfo])

Bases: observatory.platform.workflows.workflow.SnapshotRelease

class academic_observatory_workflows.workflows.open_citations_telescope.OpenCitationsTelescope(*, dag_id: str, cloud_workspace: observatory.platform.observatory_config.CloudWorkspace, bq_dataset_id: str = 'open_citations', bq_table_name: str = 'open_citations', api_dataset_id: str = 'open_citations', schema_folder: str = os.path.join(default_schema_folder(), 'open_citations'), dataset_description: str = 'The OpenCitations Indexes: http://opencitations.net/', table_description: str = 'The OpenCitations COCI CSV table: http://opencitations.net/', observatory_api_conn_id: str = AirflowConns.OBSERVATORY_API, start_date: pendulum.DateTime = pendulum.datetime(2018, 7, 1), schedule: str = '@weekly', catchup: bool = True, queue: str = 'remote_queue')

Bases: observatory.platform.workflows.workflow.Workflow

A telescope that harvests the Open Citations COCI CSV dataset: http://opencitations.net/index/coci

process_release(release: Dict[str, str]) → bool

Indicates whether this release should be processed. A release is skipped if it has no files or if its BigQuery table already exists.

Parameters:

release – the release instance.

Returns:

Whether to process the release.
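
The decision rule described above can be sketched as follows. This is a minimal illustration, not the telescope's actual implementation: `should_process`, the `files` list, and the `table_exists` callable are hypothetical names standing in for whatever the release dictionary and the BigQuery lookup provide.

```python
from typing import Callable, Dict, List


def should_process(files: List[Dict[str, str]], table_exists: Callable[[], bool]) -> bool:
    """Return True only when the release has files and has not been loaded yet.

    Both arguments are hypothetical stand-ins: `files` for the release's file
    list, `table_exists` for a check against the destination BigQuery table.
    """
    if not files:
        # Nothing to download, so there is nothing to process.
        return False
    if table_exists():
        # The table is already present, so the release was loaded earlier.
        return False
    return True
```

Passing the table check as a callable means the lookup only runs when the release actually has files.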

get_release_info(**kwargs)

Calculate which releases require processing, and push the info to an XCom.

make_release(**kwargs) → List[OpenCitationsRelease]

Make release instances. The release is passed as an argument to the function (TelescopeFunction) that is called in ‘task_callable’.

Parameters:

kwargs – the context passed from the BranchPythonOperator. See https://airflow.apache.org/docs/stable/macros-ref.html for a list of the keyword arguments that are passed to this argument.

Returns:

list of OpenCitationsRelease instances.

download(releases: List[OpenCitationsRelease], **kwargs)

Task to download the data.

upload_downloaded(releases: List[OpenCitationsRelease], **kwargs)

Upload the data to Cloud Storage.

extract(releases: List[OpenCitationsRelease], **kwargs)

Task to extract the data.

upload_transformed(releases: List[OpenCitationsRelease], **kwargs) → None

Upload the transformed data to Cloud Storage.

bq_load(releases: List[OpenCitationsRelease], **kwargs) → None

Load the data into BigQuery.

add_new_dataset_releases(releases: List[OpenCitationsRelease], **kwargs) → None

Adds release information to the API.

cleanup(releases: List[OpenCitationsRelease], **kwargs) → None

Delete all files, folders and XComs associated with this release.

academic_observatory_workflows.workflows.open_citations_telescope.list_releases(start_date: pendulum.DateTime, end_date: pendulum.DateTime) → List[Dict[str, str]]

List available releases from figshare between the start and end date. Semi-open interval [start, end).

Parameters:
  • start_date – Start date.

  • end_date – End date.

Returns:

List of dictionaries containing release info.
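
The semi-open interval [start, end) means a release dated exactly at the start is included, while one dated exactly at the end is excluded. The sketch below illustrates only that filtering rule. It is not the module's implementation: `filter_releases` is a hypothetical helper, the `"date"` and `"version"` keys are assumed for illustration, and the standard library `datetime` is used in place of `pendulum.DateTime` so the example runs without extra dependencies.

```python
from datetime import datetime, timezone
from typing import Dict, List


def filter_releases(
    releases: List[Dict[str, str]],
    start_date: datetime,
    end_date: datetime,
) -> List[Dict[str, str]]:
    """Keep releases whose date falls in the semi-open interval [start_date, end_date)."""
    selected = []
    for release in releases:
        # "date" is a hypothetical key holding an ISO 8601 timestamp.
        release_date = datetime.fromisoformat(release["date"])
        # Inclusive lower bound, exclusive upper bound.
        if start_date <= release_date < end_date:
            selected.append(release)
    return selected


# Made-up sample records, used only to exercise the boundary behaviour.
candidates = [
    {"date": "2018-07-01T00:00:00+00:00", "version": "1"},  # equals start: kept
    {"date": "2018-11-12T00:00:00+00:00", "version": "2"},  # inside: kept
    {"date": "2019-01-01T00:00:00+00:00", "version": "3"},  # equals end: dropped
]
start = datetime(2018, 7, 1, tzinfo=timezone.utc)
end = datetime(2019, 1, 1, tzinfo=timezone.utc)
selected = filter_releases(candidates, start, end)
```

Because the upper bound is exclusive, consecutive scheduled runs whose intervals share an endpoint never pick up the same release twice.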