DOI Workflow

The primary purpose of the DOI workflow is to link together open datasets of higher education through the use of persistent identifiers, and to then produce useful summary statistics across a variety of common aggregations

Input datasets

A number of specialised research publication datasets are collected. These include Crossref Metadata, Crossref Funder Registry, Crossref Events, Microsoft Academic Graph, Unpaywall, the Research Organization Registry, Open Citations. The table below illustrates each of these datesets and the PIDs that are exposed within them.

PIDs in this case refer to persistent identifiers. Commonly used identifier schemas that normalised how we refer to unique entities across a range of independent systems.

Name	Table	PID
Crossref Metadata	crossref_metadata.crossref_metadataYYYYMMDD	DOI
Crossref Fundref	crossref_fundref.crossref_fundrefYYYYMMDD	FunderDOI
Crossref Events	crossref_events.crossref_events	DOI
RoR	ror.rorYYYYMMDD	ROR/Country
ORCID	orcid.orcid	ORCID/DOI
Open Citations	open_citations.open_citationsYYYYMMDD	DOI
Unpaywall	unpaywall.unpaywall	DOI
Country & Region Names & Codes	settings.country	Country
OpenAlex	openalex.worksYYYYMMDD	DOI
PubMed	pubmed.pubmed	DOI

Synthesis

After fetching the datasets, they are synthesised to produce aggregate time series statistics for each country and institution (entity type) in the dataset. The aggregate timeseries statistics include publication count, open access status, citation count and alt-metrics.

The synthesis occurs in three steps:

Creating a table of publications.
Grouping the publications by entity type and year of publication.
Computing aggregated summaries for each group. Each step of the process is explained below with examples.

The table of publications is created by joining records from the research publication datasets on Digital Object Identifiers (DOIs); unique digital identifiers given to the majority of publications. Figure 2 illustrates how each dataset contributes to the publications table during the joining process, using the example of a single publication. Unique publications are discovered with Crossref Metadata, from which the publication’s DOI, Journal, Publisher, Funder identifiers and citation counts are derived. The publication’s Open Access status is computed using Unpaywall. The authors of the paper and their institutional affiliations are derived with Microsoft Academic Graph. The Research Organisation Registry (ROR) is used to enrich the institutional affiliation records with institution details and iso3166 maps institutions to countries and regions.

Once the publications table has been created, the publications are grouped by entity type and publication year. For instance publications are grouped by institution and publication year. The last step involves creating aggregate timeseries statistics based on the yearly groups of publications.

Intermediate Datasets

Digging into the mechanics of the workflow, there are a number of intermediate tables that are created. Much of the time, these are not relevant as the information is better organised in the final output tables. However, for specific analysis they may be useful so are included here for completeness

The primary purpose of these tables is to pre-process some of the raw datasets, creating a new format that is easier to work with downstream. In the case of GRID, it adds additional information around county and region from the iso3166 dataset, and for unpaywall it computes some COKI defined logic for OA status types.

For each of the scripts, they can be found in the following folder

Name	Table	PID	SQL File Name
Crossref Events \| observatory_intermediate.crossref_eventsYYYYMMDD \| DOI \| crossref_events.sql.jinja2
Crossref Fundref \| `observatory_intermediate.crossref_fundrefYYYYMMDD`\| DOI \| crossref_fundref.sql.jinja2
Crossref Metadata\| `observatory_intermediate.crossref_fundrefYYYYMMDD`\| DOI \| crossref_metadata.sql.jinja2
RoR \| observatory_intermediate.rorYYYYMMDD \| ROR \| ror.sql.jinja2
Open Citations \| observatory_intermediate.open_citationsYYYYMMDD \| DOI \| open_citations.sql.jinja2
ORCID \| observatory_intermediate.orcidYYYYMMDD \| DOI \| orcid.sql.jinja2
Unpaywall \| observatory_intermediate.unpaywallYYYYMMDD \| DOI \| unpaywall.sql.jinja2
OpenAlex \| observatory_intermediate.openalexYYYYMMDD \| DOI \| openalex.sql.jinja2
PubMed \| observatory_intermediate.PubMedYYYYMMDD \| DOI \| pubmed.sql.jinja2

Output datasets

The final set of output dataset are as follows. Aside from the DOI table, which is an integrated set of all the input datasets linked by DOI, they each share a very similar, or identical, schema. The difference is around the entity being aggregated against.

Name	Table	PID	SQL File Name
DOI	observatory.doiYYYYMMDD	DOI	create_doi.sql.jinja2
Author	observatory.authorYYYYMMDD	ORCID	create_aggregate.sql.jinja2
Book	observatory.bookYYYYMMDD	ISBN	create_book.sql.jinja2
Country	observatory.countryYYYYMMDD	Alpha3	create_aggregate.sql.jinja2
Funder	observatory.funderYYYYMMDD	Name	create_aggregate.sql.jinja2
Group	observatory.groupYYYYMMDD	Name	create_aggregate.sql.jinja2
Institution	observatory.institutionYYYYMMDD	ROR	create_aggregate.sql.jinja2
Journal	observatory.journalYYYYMMDD	ISSN-l	create_aggregate.sql.jinja2
Publisher	observatory.publisherYYYYMMDD	Name	create_aggregate.sql.jinja2
Region	observatory.regionYYYYMMDD	Region	create_aggregate.sql.jinja2
Sub-region	observatory.subregionYYYYMMDD	SubRegion	create_aggregate.sql.jinja2

Output Schemas

The following Links provide details in to the schemas behind the doi, book and various other aggregration tables schemas:

Other Outputs

There are also two additional output from this workflow

COKI Dashboards

This dataset is a copy of the observatory output datasets, with two differences.

Firstly, the table naming removes the trailing date. It always replaces the current version with the new version each week. By keeping the names consistent, datastudio dashboards can reference this stable name, and continue to review updates
Secondly, it adds a range of database ‘views’. These views are created to enable comparisons in Data Studio. Providing a secondary table to reference, which keeps only the bare minimum information for each table being compared. This is the common ‘view’ query that is shared across these comparison views.

Data Exports

This dataset contains a range of tables that are exported specifically for Elasticsearch/Kibana. Because of the limitations within Kibana Dashboards, the output tables needs to be further broken down into individual sections. The follow table provides the list of queries that produce this output.

The way to understand each of the below queries, as that each of them (except the dois one) are run for every output table. For for example, where the table below says data_export.ao_*_access_typesYYYYMMDD this will result in a range of table, one of which will take the form data_export.ao_author_access_typesYYYYMMDD.

A final note, for the ‘Relations’ output query applies to a range of exports, due to them all sharing the same schema. Those types are:

Countries
Funders
Groupings
Institutions
Journals
Publishers

For each of the scripts, they can be found in the following folder

Name	Table	SQL File Name
Access Types	data_export.ao_*_access_typesYYYYMMDD	export_access_types.sql.jinja2
Disciplines	data_export.ao_*_disciplinesYYYYMMDD	export_disciplines.sql.jinja2
Events	data_export.ao_*_eventsYYYYMMDD	export_events.sql.jinja2
Metrics	data_export.ao_*_metricsYYYYMMDD	export_metrics.sql.jinja2
Output Types	data_export.ao_*_output_typesYYYYMMDD	export_output_types.sql.jinja2
Relations	data_export.ao_*_[relation]YYYYMMDD	export_relations.sql.jinja2
Unique Lists	data_export.ao_*_unique_listYYYYMMDD	export_unique_list.sql.jinja2
Dois	data_export.ao_doisYYYYMMDD	export_dois.sql.jinja2

Name	Table	PID	SQL File Name
Crossref Events \| observatory_intermediate.crossref_eventsYYYYMMDD \| DOI \| crossref_events.sql.jinja2
Crossref Fundref \| `observatory_intermediate.crossref_fundrefYYYYMMDD`\| DOI \| crossref_fundref.sql.jinja2
Crossref Metadata\| `observatory_intermediate.crossref_fundrefYYYYMMDD`\| DOI \| crossref_metadata.sql.jinja2
RoR \| observatory_intermediate.rorYYYYMMDD \| ROR \| ror.sql.jinja2
Open Citations \| observatory_intermediate.open_citationsYYYYMMDD \| DOI \| open_citations.sql.jinja2
ORCID \| observatory_intermediate.orcidYYYYMMDD \| DOI \| orcid.sql.jinja2
Unpaywall \| observatory_intermediate.unpaywallYYYYMMDD \| DOI \| unpaywall.sql.jinja2
OpenAlex \| observatory_intermediate.openalexYYYYMMDD \| DOI \| openalex.sql.jinja2
PubMed \| observatory_intermediate.PubMedYYYYMMDD \| DOI \| pubmed.sql.jinja2