academic_observatory_workflows.model

Module Contents

Classes

Repository

A repository.

Institution

An institution.

Paper

A paper.

AccessType

The access type of a paper.

COKIOpenAccess

The COKI Open Access types.

PublisherCategories

The publisher open subcategories.

OtherPlatformCategories

The other platform open subcategories.

Author

An author.

Funder

A research funder.

Publisher

A publisher.

FieldOfStudy

A field of study.

Journal

A journal.

Event

An event.

ObservatoryDataset

The generated observatory dataset.

Functions

date_between_dates(→ pendulum.DateTime)

Return a datetime between two timestamps.

make_doi(doi_prefix)

Makes a randomised DOI given a DOI prefix.

make_observatory_dataset(→ ObservatoryDataset)

Generate an observatory dataset.

make_funders(→ FunderList)

Make the funders ground truth dataset.

make_publishers(→ PublisherList)

Make publishers ground truth dataset.

make_fields_of_study(→ FieldOfStudyList)

Generate the fields of study for the ground truth dataset.

make_authors(→ AuthorList)

Generate the authors ground truth dataset.

make_papers(→ PaperList)

Generate the list of ground truth papers.

make_open_citations(→ List[Dict])

Generate an Open Citations table from an ObservatoryDataset instance.

make_crossref_events(→ List[Dict])

Generate the Crossref Events table from an ObservatoryDataset instance.

make_scihub(→ List[Dict])

Generate the SciHub table from an ObservatoryDataset instance.

make_unpaywall(→ List[Dict])

Generate the Unpaywall table from an ObservatoryDataset instance.

make_openalex_dataset(→ List[dict])

Generate the OpenAlex table data from an ObservatoryDataset instance.

make_orcid(→ List[Dict])

make_pubmed(→ List[Dict])

Generate the Pubmed table from an ObservatoryDataset instance.

make_crossref_fundref(→ List[Dict])

Generate the Crossref Fundref table from an ObservatoryDataset instance.

make_crossref_metadata(→ List[Dict])

Generate the Crossref Metadata table from an ObservatoryDataset instance.

bq_load_observatory_dataset(observatory_dataset, ...)

Load the fake Observatory Dataset in BigQuery.

aggregate_events(→ Tuple[List[Dict], List[Dict], ...)

Aggregate events by source into total events for all time, monthly and yearly counts.

sort_events(events, months, years)

Sort events in-place.

make_doi_table(→ List[Dict])

Generate the DOI table from an ObservatoryDataset instance.

make_doi_events(→ Dict)

Make the events for a DOI table row.

make_doi_funders(→ List[Dict])

Make a DOI table row funders affiliation list.

make_doi_journals(→ List[Dict])

Make the journal affiliation list for a DOI table row.

to_affiliations_list(dict_)

Convert affiliation dict into a list.

make_doi_publishers(→ List[Dict])

Make the publisher affiliations for a DOI table row.

make_doi_institutions(→ List[Dict])

Make the institution affiliations for a DOI table row.

make_doi_countries(author_list)

Make the countries affiliations for a DOI table row.

make_doi_regions(author_list)

Make the regions affiliations for a DOI table row.

make_doi_subregions(author_list)

Make the subregions affiliations for a DOI table row.

calc_percent(→ float)

Calculate a percentage and round to 2dp.

make_aggregate_table(→ List[Dict])

Generate an aggregate table from an ObservatoryDataset instance.

Attributes

LICENSES

EVENT_TYPES

OUTPUT_TYPES

FUNDREF_COUNTRY_CODES

FUNDREF_REGIONS

FUNDING_BODY_TYPES

FUNDING_BODY_SUBTYPES

InstitutionList

AuthorList

FunderList

PublisherList

PaperList

FieldOfStudyList

EventsList

RepositoryList

academic_observatory_workflows.model.LICENSES = ['cc-by', None][source]
academic_observatory_workflows.model.EVENT_TYPES = ['f1000', 'stackexchange', 'datacite', 'twitter', 'reddit-links', 'wordpressdotcom', 'plaudit',...[source]
academic_observatory_workflows.model.OUTPUT_TYPES = ['journal_articles', 'book_sections', 'authored_books', 'edited_volumes', 'reports', 'datasets',...[source]
academic_observatory_workflows.model.FUNDREF_COUNTRY_CODES = ['usa', 'gbr', 'aus', 'can'][source]
academic_observatory_workflows.model.FUNDREF_REGIONS[source]
academic_observatory_workflows.model.FUNDING_BODY_TYPES = ['For-profit companies (industry)', 'Trusts, charities, foundations (both public and private)',...[source]
academic_observatory_workflows.model.FUNDING_BODY_SUBTYPES[source]
class academic_observatory_workflows.model.Repository[source]

A repository.

name: str[source]
endpoint_id: str[source]
pmh_domain: str[source]
url_domain: str[source]
category: str[source]
ror_id: str[source]
_key()[source]
__eq__(other)[source]

Return self==value.

__hash__()[source]

Return hash(self).

static from_dict(dict_: Dict)[source]
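
The _key(), __eq__() and __hash__() members listed above suggest key-based equality, which lets repositories be de-duplicated in sets and used as dict keys. The following is a minimal sketch of that pattern, not the library's implementation: RepositorySketch is a hypothetical stand-in, and both the choice of fields in the key and the None defaults in from_dict are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(eq=False)  # eq=False so the hand-written __eq__ and __hash__ are used
class RepositorySketch:
    """Hypothetical stand-in for Repository."""

    name: str
    endpoint_id: Optional[str] = None
    pmh_domain: Optional[str] = None
    url_domain: Optional[str] = None
    category: Optional[str] = None
    ror_id: Optional[str] = None

    def _key(self):
        # Assumption: identity is the tuple of all fields.
        return (
            self.name,
            self.endpoint_id,
            self.pmh_domain,
            self.url_domain,
            self.category,
            self.ror_id,
        )

    def __eq__(self, other):
        if not isinstance(other, RepositorySketch):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())

    @staticmethod
    def from_dict(dict_: Dict) -> "RepositorySketch":
        # Assumption: keys missing from the dict default to None.
        return RepositorySketch(
            name=dict_.get("name"),
            endpoint_id=dict_.get("endpoint_id"),
            pmh_domain=dict_.get("pmh_domain"),
            url_domain=dict_.get("url_domain"),
            category=dict_.get("category"),
            ror_id=dict_.get("ror_id"),
        )


a = RepositorySketch.from_dict({"name": "Example Repo", "category": "Institution"})
b = RepositorySketch.from_dict({"name": "Example Repo", "category": "Institution"})
unique = {a, b}  # equal keys collapse to a single set member
```

Because equality and hashing share one key function, two objects that compare equal always hash equal, which is what sets and dicts require.
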
class academic_observatory_workflows.model.Institution[source]

An institution.

Parameters:
  • id – unique identifier.

  • name – the institution’s name.

  • grid_id – the institution’s GRID id.

  • ror_id – the institution’s ROR id.

  • country_code – the institution’s country code.

  • country_code_2 – the institution’s country code.

  • subregion – the institution’s subregion.

  • papers – the papers published by the institution.

  • types – the institution type.

  • country – the institution country name.

  • coordinates – the institution’s coordinates.

id: int[source]
name: str[source]
grid_id: str[source]
ror_id: str[source]
country_code: str[source]
country_code_2: str[source]
region: str[source]
subregion: str[source]
papers: List[Paper][source]
types: str[source]
country: str[source]
coordinates: str[source]
repository: Repository[source]
academic_observatory_workflows.model.date_between_dates(start_ts: int, end_ts: int) pendulum.DateTime[source]

Return a datetime between two timestamps.

Parameters:
  • start_ts – the start timestamp.

  • end_ts – the end timestamp.

Returns:

a pendulum.DateTime between the two timestamps.
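
As an illustration of the idea, the sketch below picks a random instant between two timestamps. It is not the library's code: date_between_timestamps is a hypothetical name, the timestamps are assumed to be Unix seconds, the sampling is assumed to be uniform, and the standard library datetime stands in for pendulum.DateTime.

```python
import random
from datetime import datetime, timezone


def date_between_timestamps(start_ts: int, end_ts: int) -> datetime:
    """Return a UTC datetime chosen at random between two Unix timestamps (inclusive)."""
    if start_ts > end_ts:
        raise ValueError("start_ts must not be after end_ts")
    # Assumption: every whole second in the range is equally likely.
    ts = random.randint(start_ts, end_ts)
    return datetime.fromtimestamp(ts, tz=timezone.utc)


start = 1483228800  # 2017-01-01T00:00:00Z
end = 1640995199  # 2021-12-31T23:59:59Z
sample = date_between_timestamps(start, end)
```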

class academic_observatory_workflows.model.Paper[source]

A paper.

Parameters:
  • id – unique identifier.

  • doi – the DOI of the paper.

  • title – the title of the paper.

  • published_date – the date the paper was published.

  • output_type – the output type, see OUTPUT_TYPES.

  • authors – the authors of the paper.

  • funders – the funders of the research published in the paper.

  • journal – the journal this paper is published in.

  • publisher – the publisher of this paper (the owner of the journal).

  • events – a list of events related to this paper.

  • cited_by – a list of papers that this paper is cited by.

  • fields_of_study – a list of the fields of study of the paper.

  • license – the paper’s license at the publisher.

  • is_free_to_read_at_publisher – whether the paper is free to read at the publisher.

  • repositories – the list of repositories where the paper can be read.

property access_type: AccessType[source]

Return the access type for the paper.

Returns:

AccessType.

property oa_coki: COKIOpenAccess[source]

Return the COKI Open Access types for the paper.

Returns:

COKIOpenAccess.

id: int[source]
doi: str[source]
title: str[source]
type: str[source]
published_date: pendulum.Date[source]
output_type: str[source]
authors: List[Author][source]
funders: List[Funder][source]
journal: Journal[source]
publisher: Publisher[source]
events: List[Event][source]
cited_by: List[Paper][source]
fields_of_study: List[FieldOfStudy][source]
publisher_license: str[source]
publisher_is_free_to_read: bool = False[source]
repositories: List[Repository][source]
in_scihub: bool = False[source]
in_unpaywall: bool = True[source]
class academic_observatory_workflows.model.AccessType[source]

The access type of a paper.

Parameters:
  • oa – whether the paper is open access or not.

  • green – when the paper is available in an institutional repository.

  • gold – when the paper is in an open access journal, or it is not in an open access journal but is free to read at the publisher and has an open access license.

  • gold_doaj – when the paper is in an open access journal.

  • hybrid – when the paper is free to read at the publisher, has an open access license, and the journal is not open access.

  • bronze – when the paper is free to read at the publisher website but has no license.

  • green_only – when the paper is not free to read at the publisher but is available at an institutional repository.

  • black – when the paper is available at SciHub.

oa: bool[source]
green: bool[source]
gold: bool[source]
gold_doaj: bool[source]
hybrid: bool[source]
bronze: bool[source]
green_only: bool[source]
black: bool[source]
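
The parameter descriptions above amount to a small set of boolean rules. The sketch below encodes them directly; it is an illustration rather than the library's logic. AccessFlags and derive_access_flags are hypothetical names, the five inputs are assumed summaries of a paper's state, and the definition of oa (open at the publisher or in a repository) and the exclusion of open access journals from bronze are assumptions.

```python
from dataclasses import dataclass


@dataclass
class AccessFlags:
    """Hypothetical stand-in mirroring the AccessType fields."""

    oa: bool
    green: bool
    gold: bool
    gold_doaj: bool
    hybrid: bool
    bronze: bool
    green_only: bool
    black: bool


def derive_access_flags(
    journal_is_oa: bool,
    free_to_read_at_publisher: bool,
    has_oa_license: bool,
    in_repository: bool,
    in_scihub: bool,
) -> AccessFlags:
    gold_doaj = journal_is_oa
    hybrid = free_to_read_at_publisher and has_oa_license and not journal_is_oa
    gold = gold_doaj or hybrid
    # Assumption: bronze applies only outside open access journals.
    bronze = free_to_read_at_publisher and not has_oa_license and not journal_is_oa
    green = in_repository
    publisher_open = gold or bronze
    green_only = green and not publisher_open
    return AccessFlags(
        oa=publisher_open or green,  # assumption: any open route counts as open access
        green=green,
        gold=gold,
        gold_doaj=gold_doaj,
        hybrid=hybrid,
        bronze=bronze,
        green_only=green_only,
        black=in_scihub,
    )


hybrid_paper = derive_access_flags(
    journal_is_oa=False,
    free_to_read_at_publisher=True,
    has_oa_license=True,
    in_repository=True,
    in_scihub=False,
)
repo_only_paper = derive_access_flags(
    journal_is_oa=False,
    free_to_read_at_publisher=False,
    has_oa_license=False,
    in_repository=True,
    in_scihub=True,
)
```
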
class academic_observatory_workflows.model.COKIOpenAccess[source]

The COKI Open Access types.

Parameters:
  • open

  • closed

  • publisher

  • other_platform

  • publisher_only

  • both

  • other_platform_only

  • publisher_categories

  • other_platform_categories

open: bool[source]
closed: bool[source]
publisher: bool[source]
other_platform: bool[source]
publisher_only: bool[source]
both: bool[source]
other_platform_only: bool[source]
publisher_categories: PublisherCategories[source]
other_platform_categories: OtherPlatformCategories[source]
class academic_observatory_workflows.model.PublisherCategories[source]

The publisher open subcategories.

Parameters:
  • oa_journal

  • hybrid

  • no_guarantees

oa_journal: bool[source]
hybrid: bool[source]
no_guarantees: bool[source]
class academic_observatory_workflows.model.OtherPlatformCategories[source]

The other platform open subcategories.

Parameters:
  • preprint

  • domain

  • institution

  • public

  • aggregator

  • other_internet

  • unknown

preprint: bool[source]
domain: bool[source]
institution: bool[source]
public: bool[source]
aggregator: bool[source]
other_internet: bool[source]
unknown: bool[source]
class academic_observatory_workflows.model.Author[source]

An author.

Parameters:
  • id – unique identifier.

  • name – the name of the author.

  • institution – the author’s institution.

id: int[source]
name: str[source]
institution: Institution[source]
class academic_observatory_workflows.model.Funder[source]

A research funder.

Parameters:
  • id – unique identifier.

  • name – the name of the funder.

  • doi – the DOI of the funder.

  • country_code – the country code of the funder.

  • region – the region the funder is located in.

  • funding_body_type – the funding body type, see FUNDING_BODY_TYPES.

  • funding_body_subtype – the funding body subtype, see FUNDING_BODY_SUBTYPES.

id: int[source]
name: str[source]
doi: str[source]
country_code: str[source]
region: str[source]
funding_body_type: str[source]
funding_body_subtype: str[source]
class academic_observatory_workflows.model.Publisher[source]

A publisher.

Parameters:
  • id – unique identifier.

  • name – the name of the publisher.

  • doi_prefix – the publisher DOI prefix.

  • journals – the journals owned by the publisher.

id: int[source]
name: str[source]
doi_prefix: int[source]
journals: List[Journal][source]
class academic_observatory_workflows.model.FieldOfStudy[source]

A field of study.

Parameters:
  • id – unique identifier.

  • name – the field of study name.

  • level – the field of study level.

id: int[source]
name: str[source]
level: int[source]
class academic_observatory_workflows.model.Journal[source]

A journal.

Parameters:
  • id – unique identifier.

  • name – the journal name.

  • license – the license that articles are published under by the journal.

id: int[source]
name: str[source]
license: str[source]
class academic_observatory_workflows.model.Event[source]

An event.

Parameters:
  • source – the source of the event, see EVENT_TYPES.

  • event_date – the date of the event.

source: str[source]
event_date: pendulum.DateTime[source]
academic_observatory_workflows.model.InstitutionList[source]
academic_observatory_workflows.model.AuthorList[source]
academic_observatory_workflows.model.FunderList[source]
academic_observatory_workflows.model.PublisherList[source]
academic_observatory_workflows.model.PaperList[source]
academic_observatory_workflows.model.FieldOfStudyList[source]
academic_observatory_workflows.model.EventsList[source]
academic_observatory_workflows.model.RepositoryList[source]
class academic_observatory_workflows.model.ObservatoryDataset[source]

The generated observatory dataset.

Parameters:
  • institutions – list of institutions.

  • authors – list of authors.

  • funders – list of funders.

  • publishers – list of publishers.

  • papers – list of papers.

  • fields_of_study – list of fields of study.

  • repositories – list of repositories.

institutions: InstitutionList[source]
authors: AuthorList[source]
funders: FunderList[source]
publishers: PublisherList[source]
papers: PaperList[source]
fields_of_study: FieldOfStudyList[source]
repositories: RepositoryList[source]
academic_observatory_workflows.model.make_doi(doi_prefix: int)[source]

Makes a randomised DOI given a DOI prefix.

Parameters:

doi_prefix – the DOI prefix.

Returns:

the DOI.
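
A rough sketch of what a randomised DOI generator could look like follows. The exact scheme used by make_doi is not documented here, so the "10.<prefix>/<suffix>" layout, the suffix alphabet, and the suffix_length parameter are all assumptions, and make_random_doi is a hypothetical name.

```python
import random
import string


def make_random_doi(doi_prefix: int, suffix_length: int = 10) -> str:
    """Build a DOI-like string from a numeric prefix and a random suffix."""
    # Assumption: the suffix is drawn from lowercase letters and digits.
    alphabet = string.ascii_lowercase + string.digits
    suffix = "".join(random.choices(alphabet, k=suffix_length))
    return f"10.{doi_prefix}/{suffix}"


doi = make_random_doi(1000)
```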

academic_observatory_workflows.model.make_observatory_dataset(institutions: List[Institution], repositories: List[Repository], n_funders: int = 5, n_publishers: int = 5, n_authors: int = 10, n_papers: int = 100, n_fields_of_study_per_level: int = 5) ObservatoryDataset[source]

Generate an observatory dataset.

Parameters:
  • institutions – a list of institutions.

  • repositories – a list of known repositories.

  • n_funders – the number of funders to generate.

  • n_publishers – the number of publishers to generate.

  • n_authors – the number of authors to generate.

  • n_papers – the number of papers to generate.

  • n_fields_of_study_per_level – the number of fields of study to generate per level.

Returns:

the observatory dataset.

academic_observatory_workflows.model.make_funders(*, n_funders: int, doi_prefix: int, faker: faker.Faker) FunderList[source]

Make the funders ground truth dataset.

Parameters:
  • n_funders – number of funders to generate.

  • doi_prefix – the DOI prefix for the funders.

  • faker – the faker instance.

Returns:

a list of funders.

academic_observatory_workflows.model.make_publishers(*, n_publishers: int, doi_prefix: int, faker: faker.Faker, min_journals_per_publisher: int = 1, max_journals_per_publisher: int = 3) PublisherList[source]

Make publishers ground truth dataset.

Parameters:
  • n_publishers – number of publishers.

  • doi_prefix – the publisher DOI prefix.

  • faker – the faker instance.

  • min_journals_per_publisher – the min number of journals to generate per publisher.

  • max_journals_per_publisher – the max number of journals to generate per publisher.

Returns:

a list of publishers.

academic_observatory_workflows.model.make_fields_of_study(*, n_fields_of_study_per_level: int, faker: faker.Faker, n_levels: int = 6, min_title_length: int = 1, max_title_length: int = 3) FieldOfStudyList[source]

Generate the fields of study for the ground truth dataset.

Parameters:
  • n_fields_of_study_per_level – the number of fields of study per level.

  • faker – the faker instance.

  • n_levels – the number of levels.

  • min_title_length – the minimum field of study title length (words).

  • max_title_length – the maximum field of study title length (words).

Returns:

a list of the fields of study.
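
To make the parameters concrete, the sketch below generates n_fields_of_study_per_level entries for each of n_levels levels, so the defaults of 5 per level and 6 levels would give 30 fields. It is only an illustration: FieldOfStudySketch and make_fields_sketch are hypothetical, a fixed word list stands in for the faker instance, and zero-based levels and sequential ids are assumptions.

```python
import random
from dataclasses import dataclass
from typing import List

# Stand-in vocabulary; the real function draws words from a faker instance.
WORDS = ["quantum", "marine", "social", "applied", "neural", "urban", "policy", "ecology"]


@dataclass
class FieldOfStudySketch:
    id: int
    name: str
    level: int


def make_fields_sketch(
    n_fields_of_study_per_level: int,
    n_levels: int = 6,
    min_title_length: int = 1,
    max_title_length: int = 3,
) -> List[FieldOfStudySketch]:
    fields = []
    next_id = 0
    for level in range(n_levels):  # assumption: levels are numbered from 0
        for _ in range(n_fields_of_study_per_level):
            n_words = random.randint(min_title_length, max_title_length)
            name = " ".join(random.choices(WORDS, k=n_words)).capitalize()
            fields.append(FieldOfStudySketch(id=next_id, name=name, level=level))
            next_id += 1
    return fields


fields = make_fields_sketch(5)
```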

academic_observatory_workflows.model.make_authors(*, n_authors: int, institutions: InstitutionList, faker: faker.Faker) AuthorList[source]

Generate the authors ground truth dataset.

Parameters:
  • n_authors – the number of authors to generate.

  • institutions – the institutions.

  • faker – the faker instance.

Returns:

a list of authors.

academic_observatory_workflows.model.make_papers(*, n_papers: int, authors: AuthorList, funders: FunderList, publishers: PublisherList, fields_of_study: List, repositories: List[Repository], faker: faker.Faker, min_title_length: int = 2, max_title_length: int = 10, min_authors: int = 1, max_authors: int = 10, min_funders: int = 0, max_funders: int = 3, min_events: int = 0, max_events: int = 100, min_fields_of_study: int = 1, max_fields_of_study: int = 20, min_repos: int = 1, max_repos: int = 10, min_year: int = 2017, max_year: int = 2021) PaperList[source]

Generate the list of ground truth papers.

Parameters:
  • n_papers – the number of papers to generate.

  • authors – the authors list.

  • funders – the funders list.

  • publishers – the publishers list.

  • fields_of_study – the fields of study list.

  • repositories – the repositories.

  • faker – the faker instance.

  • min_title_length – the min paper title length.

  • max_title_length – the max paper title length.

  • min_authors – the min number of authors for each paper.

  • max_authors – the max number of authors for each paper.

  • min_funders – the min number of funders for each paper.

  • max_funders – the max number of funders for each paper.

  • min_events – the min number of events per paper.

  • max_events – the max number of events per paper.

  • min_fields_of_study – the min fields of study per paper.

  • max_fields_of_study – the max fields of study per paper.

  • min_repos – the min repos per paper when green.

  • max_repos – the max repos per paper when green.

  • min_year – the min year.

  • max_year – the max year.

Returns:

the list of papers.

academic_observatory_workflows.model.make_open_citations(dataset: ObservatoryDataset) List[Dict][source]

Generate an Open Citations table from an ObservatoryDataset instance.

Parameters:

dataset – the Observatory Dataset.

Returns:

table rows.
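
Since each Paper carries a cited_by list, a citations table can be produced by emitting one row per (citing, cited) pair. The sketch below shows that flattening under stated assumptions: PaperSketch and open_citations_rows are hypothetical, and the "citing" and "cited" column names are illustrative rather than the documented schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PaperSketch:
    """Hypothetical stand-in holding only the fields this example needs."""

    doi: str
    cited_by: List["PaperSketch"] = field(default_factory=list)


def open_citations_rows(papers: List[PaperSketch]) -> List[Dict]:
    rows = []
    for paper in papers:
        for citing_paper in paper.cited_by:
            # One row per citation link; column names are assumptions.
            rows.append({"citing": citing_paper.doi, "cited": paper.doi})
    return rows


paper_a = PaperSketch("10.1000/a")
paper_b = PaperSketch("10.1000/b", cited_by=[paper_a])
rows = open_citations_rows([paper_a, paper_b])
```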

academic_observatory_workflows.model.make_crossref_events(dataset: ObservatoryDataset) List[Dict][source]

Generate the Crossref Events table from an ObservatoryDataset instance.

Parameters:

dataset – the Observatory Dataset.

Returns:

table rows.

academic_observatory_workflows.model.make_scihub(dataset: ObservatoryDataset) List[Dict][source]

Generate the SciHub table from an ObservatoryDataset instance.

Parameters:

dataset – the Observatory Dataset.

Returns:

table rows.

academic_observatory_workflows.model.make_unpaywall(dataset: ObservatoryDataset) List[Dict][source]

Generate the Unpaywall table from an ObservatoryDataset instance.

Parameters:

dataset – the Observatory Dataset.

Returns:

table rows.

academic_observatory_workflows.model.make_openalex_dataset(dataset: ObservatoryDataset) List[dict][source]

Generate the OpenAlex table data from an ObservatoryDataset instance.

Parameters:

dataset – the Observatory Dataset.

Returns:

OpenAlex table data.

academic_observatory_workflows.model.make_orcid(dataset: ObservatoryDataset) List[Dict][source]
academic_observatory_workflows.model.make_pubmed(dataset: ObservatoryDataset) List[Dict][source]

Generate the Pubmed table from an ObservatoryDataset instance.

Parameters:

dataset – the Observatory Dataset.

Returns:

table rows.

academic_observatory_workflows.model.make_crossref_fundref(dataset: ObservatoryDataset) List[Dict][source]

Generate the Crossref Fundref table from an ObservatoryDataset instance.

Parameters:

dataset – the Observatory Dataset.

Returns:

table rows.

academic_observatory_workflows.model.make_crossref_metadata(dataset: ObservatoryDataset) List[Dict][source]

Generate the Crossref Metadata table from an ObservatoryDataset instance.

Parameters:

dataset – the Observatory Dataset.

Returns:

table rows.

academic_observatory_workflows.model.bq_load_observatory_dataset(observatory_dataset: ObservatoryDataset, repository: List[Dict], bucket_name: str, dataset_id_all: str, dataset_id_settings: str, snapshot_date: pendulum.DateTime, project_id: str)[source]

Load the fake Observatory Dataset in BigQuery.

Parameters:
  • observatory_dataset – the Observatory Dataset.

  • repository – the repository table data.

  • bucket_name – the Google Cloud Storage bucket name.

  • dataset_id_all – the dataset id for all data tables.

  • dataset_id_settings – the dataset id for settings tables.

  • snapshot_date – the release date for the observatory dataset.

  • project_id – api project id.

Returns:

None.

academic_observatory_workflows.model.aggregate_events(events: List[Event]) Tuple[List[Dict], List[Dict], List[Dict]][source]

Aggregate events by source into total events for all time, monthly and yearly counts.

Parameters:

events – list of events.

Returns:

list of events for each source aggregated by all time, months and years.
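
The three-way aggregation described above can be sketched with counters keyed by source, by (source, month) and by (source, year). This is an illustration only: EventSketch and aggregate_events_sketch are hypothetical, the standard library datetime stands in for pendulum.DateTime, and the output dict keys are assumptions.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple


@dataclass
class EventSketch:
    source: str
    event_date: datetime


def aggregate_events_sketch(
    events: List[EventSketch],
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    totals, months, years = Counter(), Counter(), Counter()
    for event in events:
        totals[event.source] += 1
        months[(event.source, event.event_date.strftime("%Y-%m"))] += 1
        years[(event.source, event.event_date.year)] += 1
    # Output field names are assumptions made for this sketch.
    all_time = [{"source": s, "count": c} for s, c in totals.items()]
    by_month = [{"source": s, "month": m, "count": c} for (s, m), c in months.items()]
    by_year = [{"source": s, "year": y, "count": c} for (s, y), c in years.items()]
    return all_time, by_month, by_year


events = [
    EventSketch("twitter", datetime(2020, 1, 5)),
    EventSketch("twitter", datetime(2020, 2, 1)),
    EventSketch("reddit-links", datetime(2021, 3, 1)),
]
all_time, by_month, by_year = aggregate_events_sketch(events)
```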

academic_observatory_workflows.model.sort_events(events: List[Dict], months: List[Dict], years: List[Dict])[source]

Sort events in-place.

Parameters:
  • events – events all time.

  • months – events by month.

  • years – events by year.

Returns:

None.
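
In-place sorting means the three lists are reordered through list.sort and nothing is returned. A minimal sketch follows; the sort keys ("source", "month", "year") are assumptions about the row layout, and sort_events_sketch is a hypothetical name.

```python
from typing import Dict, List


def sort_events_sketch(events: List[Dict], months: List[Dict], years: List[Dict]) -> None:
    # list.sort mutates the caller's lists, so there is no return value.
    events.sort(key=lambda row: row["source"])
    months.sort(key=lambda row: (row["source"], row["month"]))
    years.sort(key=lambda row: (row["source"], row["year"]))


all_time_rows = [{"source": "twitter", "count": 2}, {"source": "datacite", "count": 1}]
month_rows = [
    {"source": "twitter", "month": "2020-02", "count": 1},
    {"source": "twitter", "month": "2020-01", "count": 1},
]
year_rows = [
    {"source": "twitter", "year": 2021, "count": 1},
    {"source": "datacite", "year": 2021, "count": 1},
]
result = sort_events_sketch(all_time_rows, month_rows, year_rows)
```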

academic_observatory_workflows.model.make_doi_table(dataset: ObservatoryDataset) List[Dict][source]

Generate the DOI table from an ObservatoryDataset instance.

Parameters:

dataset – the Observatory Dataset.

Returns:

table rows.

academic_observatory_workflows.model.make_doi_events(doi: str, event_list: EventsList) Dict[source]

Make the events for a DOI table row.

Parameters:
  • doi – the DOI.

  • event_list – a list of events for the paper.

Returns:

the events for the DOI table.

academic_observatory_workflows.model.make_doi_funders(funder_list: FunderList) List[Dict][source]

Make a DOI table row funders affiliation list.

Parameters:

funder_list – the funders list.

Returns:

the funders affiliation list.

academic_observatory_workflows.model.make_doi_journals(in_unpaywall: bool, journal: Journal) List[Dict][source]

Make the journal affiliation list for a DOI table row.

Parameters:
  • in_unpaywall – whether the work is in Unpaywall or not. At the moment the journal IDs come from Unpaywall, and if the work is not in Unpaywall then the journal id and name will be None.

  • journal – the paper’s journal.

Returns:

the journal affiliation list.

academic_observatory_workflows.model.to_affiliations_list(dict_: Dict)[source]

Convert affiliation dict into a list.

Parameters:

dict_ – affiliation dict.

Returns:

affiliation list.

academic_observatory_workflows.model.make_doi_publishers(publisher: Publisher) List[Dict][source]

Make the publisher affiliations for a DOI table row.

Parameters:

publisher – the paper’s publisher.

Returns:

the publisher affiliations list.

academic_observatory_workflows.model.make_doi_institutions(author_list: AuthorList) List[Dict][source]

Make the institution affiliations for a DOI table row.

Parameters:

author_list – the paper’s author list.

Returns:

the institution affiliation list.

academic_observatory_workflows.model.make_doi_countries(author_list: AuthorList)[source]

Make the countries affiliations for a DOI table row.

Parameters:

author_list – the paper’s author list.

Returns:

the countries affiliation list.
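
One plausible way to build such a list is to group the authors' institutions by country in a dict, then convert the dict's values to a list. The sketch below does this under stated assumptions: the stand-in classes, the countries_affiliations name, and the "identifier", "name" and "members" keys are all hypothetical and not the documented row schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InstitutionSketch:
    name: str
    country: str
    country_code: str


@dataclass
class AuthorSketch:
    name: str
    institution: InstitutionSketch


def countries_affiliations(author_list: List[AuthorSketch]) -> List[Dict]:
    by_code: Dict[str, Dict] = {}
    for author in author_list:
        inst = author.institution
        entry = by_code.setdefault(
            inst.country_code,
            {"identifier": inst.country_code, "name": inst.country, "members": []},
        )
        # Record each institution once per country.
        if inst.name not in entry["members"]:
            entry["members"].append(inst.name)
    # Dicts preserve insertion order, so countries appear in first-seen order.
    return list(by_code.values())


uni_a = InstitutionSketch("University A", "Australia", "AUS")
uni_b = InstitutionSketch("University B", "Australia", "AUS")
uni_c = InstitutionSketch("University C", "New Zealand", "NZL")
authors = [
    AuthorSketch("Author 1", uni_a),
    AuthorSketch("Author 2", uni_b),
    AuthorSketch("Author 3", uni_a),
    AuthorSketch("Author 4", uni_c),
]
affiliations = countries_affiliations(authors)
```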

academic_observatory_workflows.model.make_doi_regions(author_list: AuthorList)[source]

Make the regions affiliations for a DOI table row.

Parameters:

author_list – the paper’s author list.

Returns:

the regions affiliation list.

academic_observatory_workflows.model.make_doi_subregions(author_list: AuthorList)[source]

Make the subregions affiliations for a DOI table row.

Parameters:

author_list – the paper’s author list.

Returns:

the subregions affiliation list.

academic_observatory_workflows.model.calc_percent(value: float, total: float) float[source]

Calculate a percentage and round to 2dp.

Parameters:
  • value – the value.

  • total – the total.

Returns:

the percentage.
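
The calculation is value divided by total, multiplied by 100 and rounded to two decimal places. A minimal sketch follows; calc_percent_sketch is a hypothetical name, and returning 0.0 for a zero total is an assumption, since the behaviour of calc_percent in that case is not documented here.

```python
def calc_percent_sketch(value: float, total: float) -> float:
    """Return value as a percentage of total, rounded to 2 decimal places."""
    # Assumption: a zero total yields 0.0 instead of raising ZeroDivisionError.
    if total == 0:
        return 0.0
    return round(value / total * 100, 2)


one_third = calc_percent_sketch(1, 3)  # 33.33
one_quarter = calc_percent_sketch(1, 4)  # 25.0
```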

academic_observatory_workflows.model.make_aggregate_table(agg: str, dataset: ObservatoryDataset) List[Dict][source]

Generate an aggregate table from an ObservatoryDataset instance.

Parameters:
  • agg – the aggregation type, e.g. country, institution.

  • dataset – the Observatory Dataset.

Returns:

table rows.