academic_observatory_workflows.workflows.crossref_fundref_telescope

Module Contents

Classes

CrossrefFundrefRelease

CrossrefFundrefTelescope

Functions

list_releases(→ List[dict])

List all available CrossrefFundref releases between the start and end date.

new_funder_template()

Helper function for creating a new funder.

parse_fundref_registry_rdf(→ Tuple[List, Dict])

Helper function that parses a Fundref registry RDF file and returns a Python list containing each funder.

add_funders_relationships(→ List)

Adds any children/parent relationships to funder instances in the funders list.

recursive_funders(→ Tuple[List, int])

Recursively goes through a funder/sub_funder dict. The funder properties can be looked up with the funders_by_key dictionary.

Attributes

RELEASES_URL

academic_observatory_workflows.workflows.crossref_fundref_telescope.RELEASES_URL = 'https://gitlab.com/api/v4/projects/crossref%2Fopen_funder_registry/releases'

class academic_observatory_workflows.workflows.crossref_fundref_telescope.CrossrefFundrefRelease(*, dag_id: str, run_id: str, snapshot_date: pendulum.DateTime, url: str)

Bases: observatory.platform.workflows.workflow.SnapshotRelease

class academic_observatory_workflows.workflows.crossref_fundref_telescope.CrossrefFundrefTelescope(*, dag_id: str, cloud_workspace: observatory.platform.observatory_config.CloudWorkspace, bq_dataset_id: str = 'crossref_fundref', bq_table_name: str = 'crossref_fundref', api_dataset_id: str = 'crossref_fundref', schema_folder: str = os.path.join(default_schema_folder(), 'crossref_fundref'), dataset_description: str = 'The Crossref Funder Registry dataset: https://www.crossref.org/services/funder-registry/', table_description: str = 'The Crossref Funder Registry dataset: https://www.crossref.org/services/funder-registry/', observatory_api_conn_id: str = AirflowConns.OBSERVATORY_API, start_date: pendulum.DateTime = pendulum.datetime(2014, 2, 23), schedule: str = '@weekly', catchup: bool = True, gitlab_pool_name: str = 'gitlab_pool', gitlab_pool_slots: int = 2, gitlab_pool_description: str = 'A pool to limit the connections to Gitlab')

Bases: observatory.platform.workflows.workflow.Workflow

make_release(**kwargs) → List[CrossrefFundrefRelease]

Make release instances. The release is passed as an argument to the function (TelescopeFunction) that is called in ‘task_callable’.

Parameters:

kwargs – the context passed from the PythonOperator. See https://airflow.apache.org/docs/stable/macros-ref.html for a list of the keyword arguments that are passed to this argument.

Returns:

a list of CrossrefFundrefRelease instances.

get_release_info(**kwargs) → bool

Based on a list of all releases, checks which ones were released between the previous and the current execution date of the DAG. For each release that falls within this period, checks that a BigQuery table does not yet exist for it. The releases that pass both checks are passed to the next tasks. If the list is empty, the workflow stops.
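The two checks described above can be sketched as follows. This is only an illustration, not the workflow's code: the `select_new_releases` helper, the `date` key, the half-open date window, and the set of already-loaded dates (standing in for the BigQuery table lookup) are all assumptions.

```python
from datetime import datetime
from typing import Dict, List, Set


def select_new_releases(
    releases: List[Dict],
    prev_date: datetime,
    this_date: datetime,
    loaded_dates: Set[datetime],
) -> List[Dict]:
    """Hypothetical helper: keep releases inside the window that are not loaded yet."""
    selected = []
    for release in releases:
        # Assumed half-open window [prev_date, this_date)
        in_window = prev_date <= release["date"] < this_date
        # Stand-in for "a BigQuery table already exists for this release"
        already_loaded = release["date"] in loaded_dates
        if in_window and not already_loaded:
            selected.append(release)
    return selected


releases = [
    {"url": "https://example.com/v1.tar.gz", "date": datetime(2021, 1, 5)},
    {"url": "https://example.com/v2.tar.gz", "date": datetime(2021, 1, 12)},
    {"url": "https://example.com/v3.tar.gz", "date": datetime(2021, 2, 1)},
]
new_releases = select_new_releases(
    releases, datetime(2021, 1, 1), datetime(2021, 1, 31), {datetime(2021, 1, 5)}
)
```

An empty result corresponds to the case in which the workflow stops.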

download(releases: List[CrossrefFundrefRelease], **kwargs)

Downloads the release tar.gz file from its URL.

upload_downloaded(releases: List[CrossrefFundrefRelease], **kwargs)

Upload the data to Cloud Storage.

extract(releases: List[CrossrefFundrefRelease], **kwargs)

Extract release from gzipped tar file.

transform(releases: List[CrossrefFundrefRelease], **kwargs)

Transforms the release by storing the file content in gzipped JSON format. Relationships between funders are added.
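As a rough illustration of storing records as gzipped JSON, the sketch below writes one JSON object per line. The JSON Lines layout, the helper names, and the sample funder records are assumptions made for the example; the workflow's actual output layout is not specified here.

```python
import gzip
import json
import os
import tempfile
from typing import Dict, List


def write_gzipped_jsonl(path: str, records: List[Dict]) -> None:
    """Write each record as one JSON line into a gzip-compressed text file."""
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_gzipped_jsonl(path: str) -> List[Dict]:
    """Read the records back, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical funder records, used only to exercise the round trip
funders = [
    {"funder": "funder-1", "children": [{"funder": "funder-2", "children": []}]},
    {"funder": "funder-2", "children": []},
]
out_path = os.path.join(tempfile.mkdtemp(), "crossref_fundref.jsonl.gz")
write_gzipped_jsonl(out_path, funders)
round_trip = read_gzipped_jsonl(out_path)
```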

upload_transformed(releases: List[CrossrefFundrefRelease], **kwargs) → None

Upload the transformed data to Cloud Storage.

bq_load(releases: List[CrossrefFundrefRelease], **kwargs) → None

Load the data into BigQuery.

add_new_dataset_releases(releases: List[CrossrefFundrefRelease], **kwargs) → None

Adds release information to the API.

cleanup(releases: List[CrossrefFundrefRelease], **kwargs) → None

Delete all files, folders and XComs associated with this release.

academic_observatory_workflows.workflows.crossref_fundref_telescope.list_releases(start_date: pendulum.DateTime, end_date: pendulum.DateTime) → List[dict]

List all available CrossrefFundref releases between the start and end date.

Parameters:
  • start_date – The start date of the period to look for releases

  • end_date – The end date of the period to look for releases

Returns:

list with dictionaries of release info (url and release date)
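A minimal sketch of the date filtering, run against a hand-written payload rather than a live request to RELEASES_URL. The `tag_name` and `released_at` fields, the archive URL pattern, the inclusive bounds, and the `filter_releases` helper are assumptions for illustration and may differ from what the function actually does.

```python
from datetime import datetime, timezone
from typing import Dict, List


def filter_releases(payload: List[Dict], start_date: datetime, end_date: datetime) -> List[Dict]:
    """Hypothetical helper: map release entries to {url, date} dicts within the period."""
    releases = []
    for item in payload:
        # Normalise a trailing "Z" so datetime.fromisoformat accepts it on older Pythons
        released = datetime.fromisoformat(item["released_at"].replace("Z", "+00:00"))
        if start_date <= released <= end_date:  # inclusive bounds are an assumption
            tag = item["tag_name"]
            # Assumed archive URL pattern for the tagged source tarball
            url = (
                "https://gitlab.com/crossref/open_funder_registry/-/archive/"
                f"{tag}/open_funder_registry-{tag}.tar.gz"
            )
            releases.append({"url": url, "date": released})
    return releases


# Hand-written sample payload, not real release data
sample_payload = [
    {"tag_name": "v0.1", "released_at": "2021-01-05T00:00:00Z"},
    {"tag_name": "v0.2", "released_at": "2021-03-10T00:00:00Z"},
    {"tag_name": "v0.3", "released_at": "2021-06-20T00:00:00Z"},
]
found = filter_releases(
    sample_payload,
    datetime(2021, 2, 1, tzinfo=timezone.utc),
    datetime(2021, 4, 1, tzinfo=timezone.utc),
)
```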

academic_observatory_workflows.workflows.crossref_fundref_telescope.new_funder_template()

Helper function for creating a new funder.

Returns:

a blank funder object.

academic_observatory_workflows.workflows.crossref_fundref_telescope.parse_fundref_registry_rdf(registry_file_path: str) → Tuple[List, Dict]

Helper function that parses a Fundref registry RDF file and returns a Python list containing each funder.

Parameters:

registry_file_path – the filename of the registry.rdf file to be parsed.

Returns:

a list containing all the funders parsed from the input RDF, and a dictionary of funders with their id as key.
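To make the two return values concrete, here is a simplified sketch that parses a tiny inline RDF snippet with the standard library. It assumes funders appear as SKOS concepts with `narrower` and `broader` links, reads from a string instead of a file path, and uses hypothetical ids and a hypothetical `parse_funders` helper; the real registry file carries many more properties per funder.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
NS = {"rdf": RDF_NS, "skos": "http://www.w3.org/2004/02/skos/core#"}
RDF_ABOUT = f"{{{RDF_NS}}}about"
RDF_RESOURCE = f"{{{RDF_NS}}}resource"

# Hand-written sample, not taken from the real registry
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="funder-1">
    <skos:narrower rdf:resource="funder-2"/>
  </skos:Concept>
  <skos:Concept rdf:about="funder-2">
    <skos:broader rdf:resource="funder-1"/>
  </skos:Concept>
</rdf:RDF>
"""


def parse_funders(rdf_text: str) -> Tuple[List[Dict], Dict[str, Dict]]:
    """Hypothetical helper: return (funders, funders_by_key) for the snippet above."""
    root = ET.fromstring(rdf_text.strip())
    funders: List[Dict] = []
    funders_by_key: Dict[str, Dict] = {}
    for concept in root.findall("skos:Concept", NS):
        funder = {
            "funder": concept.get(RDF_ABOUT),
            "narrower": [e.get(RDF_RESOURCE) for e in concept.findall("skos:narrower", NS)],
            "broader": [e.get(RDF_RESOURCE) for e in concept.findall("skos:broader", NS)],
        }
        funders.append(funder)
        funders_by_key[funder["funder"]] = funder  # same dict object, indexed by id
    return funders, funders_by_key


funders, funders_by_key = parse_funders(SAMPLE_RDF)
```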

academic_observatory_workflows.workflows.crossref_fundref_telescope.add_funders_relationships(funders: List, funders_by_key: Dict) → List

Adds any children/parent relationships to funder instances in the funders list.

Parameters:
  • funders – List of funders

  • funders_by_key – Dictionary of funders with their id as key.

Returns:

funders with added relationships.

academic_observatory_workflows.workflows.crossref_fundref_telescope.recursive_funders(funders_by_key: Dict, funder: Dict, depth: int, direction: str, sub_funders: List) → Tuple[List, int]

Recursively goes through a funder/sub_funder dict. The funder properties can be looked up with the funders_by_key dictionary that stores the properties per funder id. Any children/parents for the funder are already given in the xml element with the ‘narrower’ and ‘broader’ tags. For each funder in the list, it will recursively add any children/parents for those funders in ‘narrower’/‘broader’ and their funder properties.

Parameters:
  • funders_by_key – dictionary with id as key and funders object as value

  • funder – dictionary of a given funder containing ‘narrower’ and ‘broader’ info

  • depth – keeping track of nested depth

  • direction – either ‘narrower’ or ‘broader’ to get ‘children’ or ‘parents’

  • sub_funders – list to keep track of which funder ids are parents

Returns:

list of children and current depth
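The recursion can be sketched along these lines. This is an illustrative reimplementation under assumptions, not the module's code: the `collect_related` name, the `children`/`parents` output keys, the cycle guard, and the choice to return the depth of the current call are all hypothetical.

```python
from typing import Dict, List, Tuple


def collect_related(
    funders_by_key: Dict[str, Dict],
    funder: Dict,
    depth: int,
    direction: str,
    seen: List[str],
) -> Tuple[List[Dict], int]:
    """Hypothetical helper: follow 'narrower' or 'broader' links recursively."""
    key = "children" if direction == "narrower" else "parents"
    related = []
    for funder_id in funder.get(direction, []):
        if funder_id in seen:  # skip ids already on the current path to avoid cycles
            continue
        other = funders_by_key[funder_id]
        nested, _ = collect_related(
            funders_by_key, other, depth + 1, direction, seen + [funder_id]
        )
        related.append({"funder": funder_id, key: nested})
    return related, depth


# Three-level hierarchy with hypothetical ids: a -> b -> c
by_key = {
    "a": {"funder": "a", "narrower": ["b"], "broader": []},
    "b": {"funder": "b", "narrower": ["c"], "broader": ["a"]},
    "c": {"funder": "c", "narrower": [], "broader": ["b"]},
}
children, depth = collect_related(by_key, by_key["a"], 0, "narrower", ["a"])
parents, _ = collect_related(by_key, by_key["c"], 0, "broader", ["c"])
```

Passing a new `seen` list into each recursive call keeps the guard local to one path, so a funder reachable through two different branches is still reported under both.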