Web of Science
“The Web of Science is the information and technology provider for the global scientific research community. We provide data, analytics and insights, as well as workflow tools and bespoke professional services to researchers and the entire research community that underpins research – universities and research institutions, national and local governments, private and public research funding organizations, publishers and research-intensive corporations, across the world.” – Web of Science website.
Web of Science, previously Web of Knowledge, provides bibliometric information, including funding acknowledgements, international publication identifiers, and abstracts. Source: WOS and data details.
Observatory Platform API
The telescope relies on the Observatory Platform API to create DAGs. A DAG will be created in Airflow for every web_of_science telescope returned by the API, i.e., one for each organisation.
The following fields need to be set in the extra field of the telescope:
airflow_connection: a list of Airflow connection ID names containing the login and password for accessing the Web of Science service.
institution_ids: a list of strings containing the institution IDs to search in Web of Science, for example “Curtin University”.
earliest_date: the earliest datetime to query.
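As an illustration, the extra field could be expressed as the Python dictionary below. The connection ID name `web_of_science_curtin` and the ISO 8601 date string are assumptions made for this sketch, not values confirmed by the platform, and `check_extra` is a hypothetical helper that only checks that the three fields described above are present and have the expected shape.

```python
from datetime import datetime

# Hypothetical example values: the connection ID name and the date format
# (ISO 8601) are assumptions, not values confirmed by the platform.
extra = {
    "airflow_connection": ["web_of_science_curtin"],
    "institution_ids": ["Curtin University"],
    "earliest_date": "2013-01-01",
}


def check_extra(extra):
    """Return a list of problems found in a telescope `extra` dictionary."""
    problems = []
    # Both list fields must be non-empty lists of strings.
    for key in ("airflow_connection", "institution_ids"):
        value = extra.get(key)
        is_str_list = isinstance(value, list) and all(isinstance(v, str) for v in value)
        if not is_str_list or not value:
            problems.append(f"{key} must be a non-empty list of strings")
    # The earliest date must parse as an ISO 8601 date or datetime (assumed format).
    try:
        datetime.fromisoformat(str(extra.get("earliest_date")))
    except ValueError:
        problems.append("earliest_date must be an ISO 8601 date or datetime")
    return problems
```

Calling `check_extra(extra)` on the dictionary above returns an empty list, while an empty dictionary yields one problem per missing field.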
Storage location
The telescope saves the dataset to a Google BigQuery table with the project_id specified in the standard Airflow variable for the project ID, the dataset as clarivate (unless overridden), and the table ID as web_of_science<date suffix>.
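The naming scheme above can be sketched as follows. The `YYYYMMDD` form of the date suffix is an assumption based on the usual BigQuery convention for date-sharded tables, and `my-project` is a hypothetical project ID; neither is stated in this document.

```python
from datetime import date


def table_id(snapshot_date, table_name="web_of_science"):
    """Build the sharded table ID, assuming a YYYYMMDD date suffix."""
    return f"{table_name}{snapshot_date:%Y%m%d}"


def full_table_id(project_id, snapshot_date, dataset_id="clarivate"):
    """Build a fully qualified `project.dataset.table` identifier."""
    return f"{project_id}.{dataset_id}.{table_id(snapshot_date)}"


# "my-project" is a hypothetical project ID used only for illustration.
print(full_table_id("my-project", date(2021, 5, 1)))
```

Passing a different `dataset_id` covers the case where the default clarivate dataset is overridden.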
Download throttling limits
The telescope downloads results in parallel. Web of Science has imposed throttling limits for API access. The following limits are observed:
New session creation: 5 per 5-min period.
API calls: 2 calls/s
Returned results: 100 max per call.
Cited references: 100 max per article.
Max records retrievable in period: licence dependent. It is unclear what Curtin’s limit is, if any.
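One way to respect the call rate and page size limits listed above is sketched below. This is not the telescope's own implementation: `CallThrottle` and `page_ranges` are hypothetical names, and the 1-based record numbering is an assumption about how result pages are requested.

```python
import threading
import time


class CallThrottle:
    """Spaces calls at least 1 / calls_per_second seconds apart (thread-safe)."""

    def __init__(self, calls_per_second=2.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = 1.0 / calls_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = self._clock()  # the first call may go immediately

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        with self._lock:  # parallel workers queue up here one at a time
            remaining = self._next_allowed - self._clock()
            if remaining > 0:
                self._sleep(remaining)
            self._next_allowed = max(self._clock(), self._next_allowed) + self._min_interval


def page_ranges(total_records, page_size=100):
    """Split a result set into (first_record, count) pages of at most page_size."""
    return [
        (first, min(page_size, total_records - first + 1))
        for first in range(1, total_records + 1, page_size)
    ]
```

Each worker would call `throttle.wait()` on a shared `CallThrottle` instance before issuing a request, so the combined rate stays at or below 2 calls/s even when downloads run in parallel. The session creation limit (5 per 5-minute period) is not handled by this sketch.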
Summary | |
---|---|
Harvest Type | API |
Harvest frequency | Default: @monthly |
Runs on remote worker | Default: False |
Catchup missed runs | Default: False |
Table Write Disposition | Append |
Dataset Update Frequency | Daily |
Credentials Required | Yes |
Uses Workflow Template | Snapshot |
Each shard includes all data | Yes |
Latest schema
name | type | mode | description |
---|---|---|---|
categories | RECORD | NULLABLE | Category descriptions |
categories.subjects | RECORD | REPEATED | Category subjects |
categories.subjects.text | STRING | NULLABLE | Category name |
categories.subjects.code | STRING | NULLABLE | Category code |
categories.subjects.ascatype | STRING | NULLABLE | Defines the two collections of subject categories used to classify journals in Web of Knowledge |
categories.subheadings | STRING | REPEATED | Category subheadings |
categories.headings | STRING | REPEATED | Category headings |
fund_ack | RECORD | NULLABLE | Funding acknowledgements |
fund_ack.grants | RECORD | REPEATED | Grant information |
fund_ack.grants.ids | STRING | REPEATED | Grant id |
fund_ack.grants.agency | STRING | NULLABLE | Grant agency |
fund_ack.text | STRING | REPEATED | Funding acknowledgement texts |
identifiers | RECORD | NULLABLE | Document identifiers |
identifiers.art_no | STRING | NULLABLE | |
identifiers.doi | STRING | NULLABLE | Digital object identifier |
identifiers.eissn | STRING | NULLABLE | Electronic ISSN |
identifiers.issn | STRING | NULLABLE | ISSN |
identifiers.meeting_abs | STRING | NULLABLE | |
identifiers.xref_doi | STRING | NULLABLE | |
identifiers.isbn | STRING | NULLABLE | ISBN |
identifiers.eisbn | STRING | NULLABLE | Electronic ISBN |
identifiers.parent_book_doi | STRING | NULLABLE | |
identifiers.uid | STRING | NULLABLE | Web of Science UID |
abstract | STRING | REPEATED | List of abstracts |
conferences | RECORD | REPEATED | Information on the conference proceedings |
conferences.name | STRING | NULLABLE | Conference name |
conferences.id | INTEGER | NULLABLE | Conference id |
ref_count | INTEGER | NULLABLE | Reference count |
names | RECORD | REPEATED | Names associated with publication |
names.full_name | STRING | NULLABLE | Full name |
names.daisng_id | INTEGER | NULLABLE | WoS identifier from their entity disambiguation algorithm |
names.orcid | STRING | NULLABLE | ORCID identifier |
names.last_name | STRING | NULLABLE | Last name |
names.wos_standard | STRING | NULLABLE | Surname followed by a comma and up to five initials |
names.role | STRING | NULLABLE | Role of name, e.g., author |
names.first_name | STRING | NULLABLE | First name |
names.r_id | STRING | NULLABLE | ResearcherID identifier |
names.seq_no | INTEGER | NULLABLE | Position the author appears in the publication |
languages | RECORD | REPEATED | Languages used in the publication |
languages.name | STRING | NULLABLE | Name of language |
languages.type | STRING | NULLABLE | Type of role language plays, e.g., primary |
title | STRING | NULLABLE | Title of publication |
orgs | RECORD | REPEATED | Organisations affiliated with the publication (possibly through authors) |
orgs.org_name | STRING | NULLABLE | Organisation name |
orgs.country | STRING | NULLABLE | Country where organisation resides |
orgs.state | STRING | NULLABLE | State where organisation resides |
orgs.names | RECORD | REPEATED | Names associated with this organisation, e.g., authors |
orgs.names.wos_standard | STRING | NULLABLE | Surname followed by a comma and up to five initials |
orgs.names.full_name | STRING | NULLABLE | Full name |
orgs.names.daisng_id | INTEGER | NULLABLE | WoS identifier from their entity disambiguation algorithm |
orgs.names.last_name | STRING | NULLABLE | Last name |
orgs.names.first_name | STRING | NULLABLE | First name |
orgs.suborgs | STRING | REPEATED | Any relevant suborganisations of this organisation |
orgs.city | STRING | NULLABLE | City where organisation resides |
keywords | STRING | REPEATED | List of keywords and keywords plus (where available) |
pub_info | RECORD | NULLABLE | Publication information summary |
pub_info.publisher | STRING | NULLABLE | Name of publisher |
pub_info.publisher_city | STRING | NULLABLE | City where publisher is |
pub_info.doc_type | STRING | NULLABLE | Type of publication |
pub_info.source | STRING | NULLABLE | Source publication, e.g., publishing journal |
pub_info.pub_type | STRING | NULLABLE | Type of publication |
pub_info.page_count | INTEGER | NULLABLE | Page count for this document |
pub_info.sort_date | DATE | NULLABLE | %E4Y-%m-%d |
snapshot_date | DATE | NULLABLE | %E4Y-%m-%d. The date that the workflow harvested the data. |
institution_ids | STRING | REPEATED | Institution IDs used to fetch the record. This indicates the list of institutions fetched under the same key in an OR query, e.g., OG=(Curtin University OR Other authorised institution) |
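The OR query mentioned in the institution_ids description can be sketched as below. `build_og_query` is a hypothetical helper written for this illustration, not the telescope's own function; it only reproduces the `OG=(... OR ...)` form shown in the schema and does no escaping or quoting of the institution names.

```python
def build_og_query(institution_ids):
    """Combine institution IDs into a single OG=(A OR B) query string."""
    if not institution_ids:
        raise ValueError("at least one institution ID is required")
    return "OG=(" + " OR ".join(institution_ids) + ")"


print(build_og_query(["Curtin University", "Other authorised institution"]))
```

Records fetched with such a query would then carry both institution IDs in their institution_ids field.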