academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow

Module Contents

Classes

OaDashboardRelease

ZenodoVersion

Histogram

EntityHistograms

EntityStats

Stats

Functions

create_dag(*, dag_id, cloud_workspace, data_bucket, ...)

Create the OaDashboardWorkflow, which generates data files for the COKI Open Access Dashboard.

bq_query_to_gcs(→ bool)

Run a BigQuery query and save the results on Google Cloud Storage.

save_oa_dashboard_dataset(download_folder, ...)

save_zenodo_dataset(download_folder, dataset_path, ...)

Save the COKI Open Access Dataset to a zip file.

oa_dashboard_subset(→ Dict)

zenodo_subset(item)

save_json(path, data)

Save data to JSON.

data_file_pattern(download_folder, entity_type)

yield_data_glob(→ List[Dict])

Load country or institution data files and return them as a list of dicts.

make_entity_stats(→ EntityStats)

Calculate stats for entities.

make_logo_url(→ str)

Make a logo url.

fetch_institution_logo(→ Tuple[str, str])

Get the path to the logo for an institution.

clean_url(→ str)

Remove path and query from URL.

fetch_institution_logos(→ List[Dict])

Update the index with logos, downloading logos if they don't exist.

Attributes

INCLUSION_THRESHOLD

MAX_REPOSITORIES

START_YEAR

END_YEAR

README

academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.INCLUSION_THRESHOLD[source]
academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.MAX_REPOSITORIES = 200[source]
academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.START_YEAR = 2000[source]
academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.END_YEAR[source]
academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.README = Multiline-String[source]
"""# COKI Open Access Dataset
The COKI Open Access Dataset measures open access performance for {{ n_countries }} countries and {{ n_institutions }} institutions
and is available in JSON Lines format. The data is visualised at the COKI Open Access Dashboard: https://open.coki.ac/.

## Licence
[COKI Open Access Dataset](https://open.coki.ac/data/) © {{ year }} by [Curtin University](https://www.curtin.edu.au/)
is licenced under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)

## Citing
To cite the COKI Open Access Dashboard please use the following citation:
> Diprose, J., Hosking, R., Rigoni, R., Roelofs, A., Chien, T., Napier, K., Wilson, K., Huang, C., Handcock, R., Montgomery, L., & Neylon, C. (2023). A User-Friendly Dashboard for Tracking Global Open Access Performance. The Journal of Electronic Publishing 26(1). doi: https://doi.org/10.3998/jep.3398

If you use the website code, please cite it as below:
> James P. Diprose, Richard Hosking, Richard Rigoni, Aniek Roelofs, Kathryn R. Napier, Tuan-Yow Chien, Alex Massen-Hane, Katie S. Wilson, Lucy Montgomery, & Cameron Neylon. (2022). COKI Open Access Website. Zenodo. https://doi.org/10.5281/zenodo.6374486

If you use this dataset, please cite it as below:
> Richard Hosking, James P. Diprose, Aniek Roelofs, Tuan-Yow Chien, Lucy Montgomery, & Cameron Neylon. (2022). COKI Open Access Dataset [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6399463

## Attributions
The COKI Open Access Dataset contains information from:
* [Open Alex](https://openalex.org/) which is made available under a [CC0 licence](https://creativecommons.org/publicdomain/zero/1.0/).
* [Crossref Metadata](https://www.crossref.org/documentation/metadata-plus/) via the Metadata Plus program. Bibliographic metadata is made available without copyright restriction and Crossref generated data with a [CC0 licence](https://creativecommons.org/share-your-work/public-domain/cc0/). See [metadata licence information](https://www.crossref.org/documentation/retrieve-metadata/rest-api/rest-api-metadata-license-information/) for more details.
* [Unpaywall](https://unpaywall.org/). The [Unpaywall Data Feed](https://unpaywall.org/products/data-feed) is used under license. Data is freely available from Unpaywall via the API, data dumps and as a data feed.
* [Research Organization Registry](https://ror.org/) which is made available under a [CC0 licence](https://creativecommons.org/share-your-work/public-domain/cc0/).
"""
class academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.OaDashboardRelease(*, dag_id: str, run_id: str, snapshot_date: pendulum.DateTime, input_project_id: str, output_project_id: str, bq_agg_dataset_id: str, bq_ror_dataset_id: str, bq_settings_dataset_id: str, bq_oa_dashboard_dataset_id: str)[source]

Bases: observatory.platform.workflows.workflow.SnapshotRelease

property build_path[source]
property intermediate_path[source]
property out_path[source]
ror_table_id()[source]
country_table_id()[source]
observatory_agg_table_id(table_name: str)[source]
institution_ids_table_id()[source]
oa_dashboard_table_id(table_name: str)[source]
descriptions_table_id(table_name: str)[source]
logos_table_id(table_name: str)[source]
static from_dict(dict_: dict)[source]
to_dict() dict[source]
academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.create_dag(*, dag_id: str, cloud_workspace: observatory.platform.observatory_config.CloudWorkspace, data_bucket: str, conceptrecid: int, doi_dag_id: str = 'doi', entity_types: List[str] = None, bq_agg_dataset_id: str = 'observatory', bq_ror_dataset_id: str = 'ror', bq_settings_dataset_id: str = 'settings', bq_oa_dashboard_dataset_id: str = 'oa_dashboard', version: str = 'v10', zenodo_host: str = 'https://zenodo.org', github_conn_id='oa_dashboard_github_token', zenodo_conn_id='oa_dashboard_zenodo_token', start_date: pendulum.DateTime | None = pendulum.datetime(2021, 5, 2), schedule: str | None = '@weekly', max_active_runs: int = 1, retries: int = 3) airflow.DAG[source]

Create the OaDashboardWorkflow, which generates data files for the COKI Open Access Dashboard.

Parameters:
  • dag_id – the DAG id.

  • cloud_workspace – The CloudWorkspace.

  • data_bucket – the Google Cloud Storage bucket where image data should be stored.

  • conceptrecid – the Zenodo Concept Record ID for the COKI Open Access Dataset. The Concept Record ID is the last set of numbers from the Concept DOI.

  • doi_dag_id – the DAG id to wait for.

  • entity_types – the table names.

  • bq_agg_dataset_id – the id of the BigQuery dataset where the Academic Observatory aggregated data lives.

  • bq_ror_dataset_id – the id of the BigQuery dataset containing the ROR table.

  • bq_settings_dataset_id – the id of the BigQuery settings dataset, which contains the country table.

  • bq_oa_dashboard_dataset_id – the id of the BigQuery dataset where the tables produced by this workflow will be created.

  • version – the dataset version published by this workflow. The Github Action pulls from a specific dataset version: https://github.com/The-Academic-Observatory/coki-oa-web/blob/develop/.github/workflows/build-on-data-update.yml#L68-L74. This is so that when breaking changes are made to the schema, the web application won’t break.

  • zenodo_host – the Zenodo hostname, which can be changed to https://sandbox.zenodo.org for testing.

  • github_conn_id – the Github Token Airflow Connection ID.

  • zenodo_conn_id – the Zenodo Token Airflow Connection ID.

  • start_date – the start date.

  • schedule – the schedule interval.

  • max_active_runs – the maximum number of DAG runs that can be run at once.

  • retries – the number of times to retry a task.

The figure below illustrates the data files produced by this workflow:

.
├── data: data
│   ├── index.json: used by the Cloudflare Worker search and filtering API.
│   ├── country: individual entity statistics files for countries. Used to build each country page.
│   │   ├── ALB.json
│   │   ├── ARE.json
│   │   └── ARG.json
│   ├── country.json: used to create the country table. First 18 countries used to build first page of country table
│   │   and then this file is included in the public folder and downloaded by the client to enable the
│   │   other pages of the table to be displayed. Copied into public/data folder.
│   ├── institution: individual entity statistics files for institutions. Used to build each institution page.
│   │   ├── 05ykr0121.json
│   │   ├── 05ym42410.json
│   │   └── 05ynxx418.json
│   ├── institution.json: used to create the institution table. First 18 institutions used to build first page of institution table
│   │   and then this file is included in the public folder and downloaded by the client to enable the
│   │   other pages of the table to be displayed. Copied into public/data folder.
│   └── stats.json: global statistics, e.g. the minimum and maximum date for the dataset, when it was last updated etc.
└── images:
    └── logos: country and institution logos.
        ├── country
        │   ├── md: medium logos displayed on country pages.
        │   │   ├── ALB.svg
        │   │   ├── ARE.svg
        │   │   └── ARG.svg
        │   └── sm: small logos displayed in country table.
        │       ├── ALB.svg
        │       ├── ARE.svg
        │       └── ARG.svg
        └── institution
            ├── lg: large logos used for social media cards.
            │   ├── 05ykr0121.png
            │   ├── 05ym42410.png
            │   └── 05ynxx418.png
            ├── md: medium logos displayed on institution pages.
            │   ├── 05ykr0121.jpg
            │   ├── 05ym42410.jpg
            │   └── 05ynxx418.jpg
            └── sm: small logos displayed in institution table.
                ├── 05ykr0121.jpg
                ├── 05ym42410.jpg
                └── 05ynxx418.jpg
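
A minimal sketch of how this DAG factory might be called from an Airflow DAG file. The CloudWorkspace field names and all values below are illustrative assumptions, not values taken from this module:

    # Hypothetical DAG file: every value below is a placeholder.
    from academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow import create_dag
    from observatory.platform.observatory_config import CloudWorkspace

    cloud_workspace = CloudWorkspace(
        project_id="my-gcp-project",            # assumed field names and example values
        download_bucket="my-download-bucket",
        transform_bucket="my-transform-bucket",
        data_location="us",
    )

    dag = create_dag(
        dag_id="oa_dashboard_workflow",
        cloud_workspace=cloud_workspace,
        data_bucket="my-oa-dashboard-images",   # bucket where image data is stored
        conceptrecid=1234567,                   # placeholder Zenodo Concept Record ID
    )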

academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.bq_query_to_gcs(*, query: str, project_id: str, destination_uri: str, location: str = 'us') bool[source]

Run a BigQuery query and save the results on Google Cloud Storage.

Parameters:
  • query – the query string.

  • project_id – the Google Cloud project id.

  • destination_uri – the Google Cloud Storage destination uri.

  • location – the BigQuery dataset location.

Returns:

the status of the job.
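
The implementation is not shown here; a plausible sketch with the google-cloud-bigquery client is to run the query and then extract its temporary destination table to Cloud Storage as newline-delimited JSON:

    # Sketch only, not necessarily the exact implementation used by this workflow.
    from google.cloud import bigquery

    def bq_query_to_gcs_sketch(*, query: str, project_id: str, destination_uri: str, location: str = "us") -> bool:
        client = bigquery.Client(project=project_id)

        # Run the query; the results land in a temporary destination table.
        query_job = client.query(query, location=location)
        query_job.result()

        # Export the destination table to GCS as newline-delimited JSON.
        extract_config = bigquery.ExtractJobConfig(
            destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
        )
        extract_job = client.extract_table(
            query_job.destination, destination_uri, job_config=extract_config, location=location
        )
        extract_job.result()

        return query_job.errors is None and extract_job.errors is None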

academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.save_oa_dashboard_dataset(download_folder: str, build_data_path: str, entity_types: List[str], zenodo_versions: List[ZenodoVersion])[source]
academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.save_zenodo_dataset(download_folder: str, dataset_path: str, entity_types: List[str])[source]

Save the COKI Open Access Dataset to a zip file.

Parameters:
  • download_folder – the path where the downloaded data files can be found.

  • dataset_path – the path to the folder where the dataset should be saved.

  • entity_types – the entity types.

Returns:

None.
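
A minimal sketch of one way this could work, assuming each entity type has a single JSON Lines file in the download folder (the file naming is an assumption):

    # Sketch only: the per-entity file names (e.g. country.jsonl) are assumptions.
    import os
    import shutil
    from typing import List

    def save_zenodo_dataset_sketch(download_folder: str, dataset_path: str, entity_types: List[str]):
        os.makedirs(dataset_path, exist_ok=True)
        for entity_type in entity_types:
            src = os.path.join(download_folder, f"{entity_type}.jsonl")
            shutil.copy(src, os.path.join(dataset_path, f"{entity_type}.jsonl"))
        # Zip the dataset folder, e.g. producing <dataset_path>.zip alongside it.
        shutil.make_archive(dataset_path, "zip", dataset_path)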

academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.oa_dashboard_subset(item: Dict) Dict[source]
academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.zenodo_subset(item: Dict)[source]
class academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.ZenodoVersion[source]
release_date: pendulum.DateTime[source]
download_url: str[source]
to_dict() Dict[source]
class academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.Histogram[source]
data: List[int][source]
bins: List[float][source]
to_dict() Dict[source]
class academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.EntityHistograms[source]
p_outputs_open: Histogram[source]
n_outputs: Histogram[source]
n_outputs_open: Histogram[source]
to_dict() Dict[source]
class academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.EntityStats[source]
n_items: int[source]
min: Dict[source]
max: Dict[source]
median: Dict[source]
histograms: EntityHistograms[source]
to_dict() Dict[source]
class academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.Stats[source]
start_year: int[source]
end_year: int[source]
last_updated: str[source]
zenodo_versions: List[ZenodoVersion][source]
country: EntityStats[source]
institution: EntityStats[source]
to_dict() Dict[source]
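
A minimal sketch of how these containers fit together, assuming they are dataclass-style and accept their listed fields as keyword arguments; all values are illustrative placeholders:

    # Sketch only: all values below are placeholders.
    import json
    import pendulum

    hist = Histogram(data=[5, 10, 3], bins=[0.0, 25.0, 50.0, 75.0])
    entity_stats = EntityStats(
        n_items=3,
        min={"p_outputs_open": 1.2},
        max={"p_outputs_open": 98.7},
        median={"p_outputs_open": 46.5},
        histograms=EntityHistograms(p_outputs_open=hist, n_outputs=hist, n_outputs_open=hist),
    )
    stats = Stats(
        start_year=START_YEAR,
        end_year=END_YEAR,
        last_updated="2024-01-07",
        zenodo_versions=[
            ZenodoVersion(
                release_date=pendulum.datetime(2024, 1, 7),
                download_url="https://zenodo.org/record/1234567/files/coki-oa-dataset.zip",
            )
        ],
        country=entity_stats,
        institution=entity_stats,
    )
    print(json.dumps(stats.to_dict(), indent=2))
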
academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.save_json(path: str, data: Dict | List)[source]

Save data to JSON.

Parameters:
  • path – the output path.

  • data – the data to save.

Returns:

None.

academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.data_file_pattern(download_folder: str, entity_type: str)[source]
academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.yield_data_glob(pattern: str) List[Dict][source]

Load country or institution data files and return them as a list of dicts.

Parameters:

pattern – the file path including a glob pattern.

Returns:

the list of dicts.
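
A minimal sketch of a loader like this, assuming the matched files are newline-delimited JSON exports:

    # Sketch only: assumes the matched files are newline-delimited JSON.
    import glob
    import json
    from typing import Dict, List

    def yield_data_glob_sketch(pattern: str) -> List[Dict]:
        data = []
        for file_path in sorted(glob.glob(pattern)):
            with open(file_path) as f:
                for line in f:
                    line = line.strip()
                    if line:
                        data.append(json.loads(line))
        return data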

academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.make_entity_stats(entities: List[Dict]) EntityStats[source]

Calculate stats for entities.

Parameters:

entities – a list of entities.

Returns:

the entity stats object.
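
A minimal sketch of the kind of computation implied by EntityStats and EntityHistograms; the field read from each entity and the bin edges are assumptions:

    # Sketch only: the "p_outputs_open" field access and the bin edges are assumptions,
    # and the same histogram is reused for all three fields for brevity.
    from typing import Dict, List
    import numpy as np

    def make_entity_stats_sketch(entities: List[Dict]) -> EntityStats:
        p_outputs_open = np.array([e["stats"]["p_outputs_open"] for e in entities])
        data, bins = np.histogram(p_outputs_open, bins=np.linspace(0, 100, 11))
        hist = Histogram(data=data.tolist(), bins=bins.tolist())
        return EntityStats(
            n_items=len(entities),
            min={"p_outputs_open": float(np.min(p_outputs_open))},
            max={"p_outputs_open": float(np.max(p_outputs_open))},
            median={"p_outputs_open": float(np.median(p_outputs_open))},
            histograms=EntityHistograms(p_outputs_open=hist, n_outputs=hist, n_outputs_open=hist),
        )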

academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.make_logo_url(*, entity_type: str, entity_id: str, size: str, fmt: str) str[source]

Make a logo url.

Parameters:
  • entity_type – the entity type: country or institution.

  • entity_id – the entity id.

  • size – the size of the logo: s or l.

  • fmt – the format of the logo.

Returns:

the logo url.
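
A minimal sketch of the URL shape this suggests; the "logos/" prefix and path layout are assumptions based on the images/logos tree shown under create_dag:

    # Sketch only: the "logos/" prefix is an assumption based on the tree above.
    def make_logo_url_sketch(*, entity_type: str, entity_id: str, size: str, fmt: str) -> str:
        return f"logos/{entity_type}/{size}/{entity_id}.{fmt}"

    # e.g. make_logo_url_sketch(entity_type="country", entity_id="ALB", size="sm", fmt="svg")
    # returns "logos/country/sm/ALB.svg"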

academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.fetch_institution_logo(ror_id, url, size, width, fmt, build_path) Tuple[str, str][source]

Get the path to the logo for an institution. If the logo does not exist in the build path yet, download it from the Clearbit Logo API. If the logo does not exist and fails to download, the path defaults to “unknown.svg”.

Parameters:
  • ror_id – the institution’s ROR id

  • url – the URL of the company domain + suffix e.g. spotify.com

  • size – the image size of the small logo for tables etc.

  • width – the width of the image.

  • fmt – the image format.

  • build_path – the build path for files of this workflow

Returns:

The ROR id and relative path (from build path) to the logo
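
A minimal sketch of the download step, using the Clearbit Logo API (https://logo.clearbit.com/<domain>) with size and format query parameters; the on-disk layout is an assumption based on the directory tree above:

    # Sketch only: the on-disk layout and the query parameters sent to Clearbit are assumptions.
    import os
    from typing import Tuple
    import requests

    def fetch_institution_logo_sketch(ror_id: str, url: str, size: str, width: int, fmt: str, build_path: str) -> Tuple[str, str]:
        rel_path = f"logos/institution/{size}/{ror_id}.{fmt}"
        file_path = os.path.join(build_path, "images", rel_path)
        if not os.path.isfile(file_path):
            os.makedirs(os.path.dirname(file_path), exist_ok=True)
            resp = requests.get(f"https://logo.clearbit.com/{url}", params={"size": width, "format": fmt})
            if resp.status_code != 200:
                return ror_id, "unknown.svg"
            with open(file_path, "wb") as f:
                f.write(resp.content)
        return ror_id, rel_path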

academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.clean_url(url: str) str[source]

Remove path and query from URL.

Parameters:

url – the url.

Returns:

the cleaned url.
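
A minimal sketch with urllib.parse, keeping only the scheme and host:

    # Sketch only.
    from urllib.parse import urlparse

    def clean_url_sketch(url: str) -> str:
        parts = urlparse(url)
        return f"{parts.scheme}://{parts.netloc}/"

    # e.g. clean_url_sketch("https://www.spotify.com/about?x=1") returns "https://www.spotify.com/"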

academic_observatory_workflows.oa_dashboard_workflow.oa_dashboard_workflow.fetch_institution_logos(build_path: str, entities: List[Tuple[str, str]]) List[Dict][source]

Update the index with logos, downloading logos if they don’t exist.

Parameters:
  • build_path – the path to the build folder.

  • entities – the entities to process consisting of their id and url.

Returns:

the updated entities with their logo paths.
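
A minimal sketch of how the batch version might fan out over a per-institution helper (such as the fetch_institution_logo sketch above) with a thread pool; the result key names are assumptions:

    # Sketch only: the result key names ("id", "logo_sm") and the size/width/format choices are assumptions.
    from concurrent.futures import ThreadPoolExecutor
    from typing import Dict, List, Tuple

    def fetch_institution_logos_sketch(build_path: str, entities: List[Tuple[str, str]]) -> List[Dict]:
        results = []
        with ThreadPoolExecutor() as executor:
            futures = [
                executor.submit(fetch_institution_logo_sketch, ror_id, url, "sm", 32, "jpg", build_path)
                for ror_id, url in entities
            ]
            for future in futures:
                ror_id, logo_path = future.result()
                results.append({"id": ror_id, "logo_sm": logo_path})
        return results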