academic_observatory_workflows.workflows.open_citations_telescope
Module Contents
Classes
A telescope that harvests the Open Citations COCI CSV dataset: http://opencitations.net/index/coci
Functions
List available releases from figshare between the start and end date. Semi-open interval [start, end).
Attributes
- academic_observatory_workflows.workflows.open_citations_telescope.VERSION_URL = 'https://api.figshare.com/v2/articles/6741422/versions'
- class academic_observatory_workflows.workflows.open_citations_telescope.OpenCitationsRelease(*, dag_id: str, run_id: str, snapshot_date: pendulum.DateTime, files: List[observatory.platform.utils.http_download.DownloadInfo])
Bases:
observatory.platform.workflows.workflow.SnapshotRelease
- class academic_observatory_workflows.workflows.open_citations_telescope.OpenCitationsTelescope(*, dag_id: str, cloud_workspace: observatory.platform.observatory_config.CloudWorkspace, bq_dataset_id: str = 'open_citations', bq_table_name: str = 'open_citations', api_dataset_id: str = 'open_citations', schema_folder: str = os.path.join(default_schema_folder(), 'open_citations'), dataset_description: str = 'The OpenCitations Indexes: http://opencitations.net/', table_description: str = 'The OpenCitations COCI CSV table: http://opencitations.net/', observatory_api_conn_id: str = AirflowConns.OBSERVATORY_API, start_date: pendulum.DateTime = pendulum.datetime(2018, 7, 1), schedule: str = '@weekly', catchup: bool = True, queue: str = 'remote_queue')
Bases:
observatory.platform.workflows.workflow.Workflow
A telescope that harvests the Open Citations COCI CSV dataset: http://opencitations.net/index/coci
- process_release(release: Dict[str, str]) -> bool
Indicates whether we should process this release. If there are no files, or if the BigQuery table exists, we will not process this release.
- Parameters:
release – the release instance.
- Returns:
Whether to process the release.
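The rule described for process_release can be sketched as a small predicate. This is a minimal illustration, not the library's implementation: the function name should_process and its two parameters (a list of file names and a flag saying whether the BigQuery table already exists) are hypothetical stand-ins for whatever the method derives from the release dictionary.

```python
from typing import List


def should_process(files: List[str], table_exists: bool) -> bool:
    """Hypothetical helper mirroring the documented rule.

    A release is skipped when it has no files, or when its
    BigQuery table already exists; otherwise it is processed.
    """
    if not files:
        # Nothing to download, so there is nothing to process.
        return False
    if table_exists:
        # The release has already been loaded into BigQuery.
        return False
    return True
```

Under these assumptions, a release with at least one file and no existing table is the only case that returns True.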
- get_release_info(**kwargs)
Calculate which releases require processing, and push the info to an XCom.
- make_release(**kwargs) -> List[OpenCitationsRelease]
Make release instances. The release is passed as an argument to the function (TelescopeFunction) that is called in ‘task_callable’.
- Parameters:
kwargs – the context passed from the BranchPythonOperator. See https://airflow.apache.org/docs/stable/macros-ref.html for a list of the keyword arguments that are passed to this argument.
- Returns:
list of OpenCitationsRelease instances.
- download(releases: List[OpenCitationsRelease], **kwargs)
Task to download the data.
- upload_downloaded(releases: List[OpenCitationsRelease], **kwargs)
Upload the data to Cloud Storage.
- extract(releases: List[OpenCitationsRelease], **kwargs)
Task to extract the data.
- upload_transformed(releases: List[OpenCitationsRelease], **kwargs) -> None
Upload the transformed data to Cloud Storage.
- bq_load(releases: List[OpenCitationsRelease], **kwargs) -> None
Load the data into BigQuery.
- add_new_dataset_releases(releases: List[OpenCitationsRelease], **kwargs) -> None
Add release information to the API.
- cleanup(releases: List[OpenCitationsRelease], **kwargs) -> None
Delete all files, folders and XComs associated with this release.
- academic_observatory_workflows.workflows.open_citations_telescope.list_releases(start_date: pendulum.DateTime, end_date: pendulum.DateTime) -> List[Dict[str, str]]
List available releases from figshare between the start and end date. Semi-open interval [start, end).
- Parameters:
start_date – Start date.
end_date – End date.
- Returns:
List of dictionaries containing release info.
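The semi-open interval [start, end) means a release dated exactly on start_date is included, while one dated exactly on end_date is excluded. The sketch below illustrates only that filtering step, under stated assumptions: it uses the standard library datetime instead of pendulum to stay self-contained, the helper name filter_releases is hypothetical, and the "date" key and sample records are invented for illustration rather than taken from the figshare response.

```python
from datetime import datetime
from typing import Dict, List


def filter_releases(
    releases: List[Dict[str, str]], start_date: datetime, end_date: datetime
) -> List[Dict[str, str]]:
    """Keep releases whose date falls in the semi-open interval [start, end).

    Assumes each release dict carries an ISO 8601 timestamp under the
    hypothetical key "date".
    """
    selected = []
    for release in releases:
        released_at = datetime.fromisoformat(release["date"])
        # Inclusive lower bound, exclusive upper bound.
        if start_date <= released_at < end_date:
            selected.append(release)
    return selected


# Invented sample data, not real figshare output.
sample = [
    {"date": "2018-07-01T00:00:00"},
    {"date": "2018-11-12T00:00:00"},
    {"date": "2019-01-01T00:00:00"},
]
in_2018 = filter_releases(sample, datetime(2018, 7, 1), datetime(2019, 1, 1))
```

With this sample, the first two records are kept and the record dated exactly on the end date is dropped. Using a semi-open interval lets consecutive scheduled runs cover adjacent date ranges without selecting the same release twice.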