academic_observatory_workflows.datacite_telescope.datacite_transform

Attributes

parser

Functions

yield_jsonl(file_path)

Yield each row of a JSON Lines file as a dictionary. If the file is gz compressed then it will be extracted.

merge_schema_maps(→ collections.OrderedDict)

Using the SchemaGenerator from the bigquery_schema_generator library, merge the schemas found when scanning through files into one large nested OrderedDict.

flatten_schema(→ dict)

A quick trick using the JSON encoder and load string function to convert from a nested OrderedDict object to a regular dictionary.

sort_schema(input_file)

list_jsonl_files(folder_path)

remove_empty_dicts(arr)

remove_nulls_from_list_field(obj, field)

clean_string_or_bool(value)

format_geography_point(point)

Formats a geography point string if latitude and longitude are present.

normalize_to_string_or_none(value)

Normalizes a value to a string or returns None if the value is None or an empty string.

filter_non_empty_dicts(arr)

Filters out empty dictionaries from a list.

transform_geo_locations(geoLocations)

Transforms and cleans geo-location data.

normalize_affiliations_and_identifiers(obj, field)

Normalizes affiliation and name identifier fields in a list.

normalize_identifier_fields(obj, field, subfield)

Normalizes identifier fields to strings within a specified field.

normalize_related_item(value)

Normalizes related items by converting to string or returning None.

transform_object(obj)

transform(→ Tuple[str, bool, collections.OrderedDict, ...)

get_chunks(→ List[Any])

Generator that splits a list into chunks of a fixed size.

generate_schema_for_dataset(input_folder, ...)

check_directory(path)

Check if the provided path is a valid directory.

Module Contents

academic_observatory_workflows.datacite_telescope.datacite_transform.yield_jsonl(file_path: str)

Yield each row of a JSON Lines file as a dictionary. If the file is gz compressed then it will be extracted.

Parameters:

file_path – the path to the JSON lines file.

Returns:

A generator that yields one dictionary per row.
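The behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual code: the name yield_jsonl_sketch is hypothetical, and detecting compression from a ".gz" suffix and skipping blank lines are both assumptions.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def yield_jsonl_sketch(file_path: str) -> Iterator[Dict[str, Any]]:
    """Yield each non-empty line of a JSON Lines file as a dictionary.

    Hypothetical stand-in: gzip detection by file suffix is an assumption.
    """
    # gzip.open and open both accept text mode with an explicit encoding.
    opener = gzip.open if file_path.endswith(".gz") else open
    with opener(file_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines (an assumption of this sketch)
                yield json.loads(line)
```

Because the function is a generator, rows are parsed lazily, so a large file never has to be held in memory at once.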

academic_observatory_workflows.datacite_telescope.datacite_transform.merge_schema_maps(to_add: collections.OrderedDict, old: collections.OrderedDict) → collections.OrderedDict

Using the SchemaGenerator from the bigquery_schema_generator library, merge the schemas found when scanning through files into one large nested OrderedDict.

Parameters:
  • to_add – The incoming schema to add to the existing “old” schema.

  • old – The existing old schema with previously populated values.

Returns:

The old schema with newly added fields.

academic_observatory_workflows.datacite_telescope.datacite_transform.flatten_schema(schema_map: collections.OrderedDict) → dict

A quick trick using the JSON encoder and load string function to convert from a nested OrderedDict object to a regular dictionary.

Parameters:

schema_map – The generated schema from SchemaGenerator.

Returns:

A BigQuery-style schema.
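The "JSON encoder and load string" trick mentioned above is a plain round trip through json.dumps and json.loads. The sketch below shows only that technique. The helper name to_plain_dict_sketch and the sample schema_map contents are hypothetical, and the real flatten_schema may do more than this to produce a BigQuery-style schema.

```python
import json
from collections import OrderedDict


def to_plain_dict_sketch(nested: OrderedDict) -> dict:
    """Round-trip through JSON so every nested OrderedDict becomes a dict."""
    return json.loads(json.dumps(nested))


# Hypothetical input; the structure produced by SchemaGenerator may differ.
schema_map = OrderedDict(
    [("doi", OrderedDict([("type", "STRING"), ("mode", "NULLABLE")]))]
)
plain = to_plain_dict_sketch(schema_map)
print(type(plain["doi"]).__name__)  # dict
```

The round trip only works for JSON-serialisable values, and it also turns tuples into lists and non-string keys into strings.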

academic_observatory_workflows.datacite_telescope.datacite_transform.sort_schema(input_file: pathlib.Path)

academic_observatory_workflows.datacite_telescope.datacite_transform.list_jsonl_files(folder_path)

academic_observatory_workflows.datacite_telescope.datacite_transform.remove_empty_dicts(arr)

academic_observatory_workflows.datacite_telescope.datacite_transform.remove_nulls_from_list_field(obj: dict, field: str)

academic_observatory_workflows.datacite_telescope.datacite_transform.clean_string_or_bool(value)

academic_observatory_workflows.datacite_telescope.datacite_transform.format_geography_point(point)

Formats a geography point string if latitude and longitude are present.

academic_observatory_workflows.datacite_telescope.datacite_transform.normalize_to_string_or_none(value)

Normalizes a value to a string or returns None if the value is None or an empty string.
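A minimal sketch of the rule stated above, written from the one-line description only. The name normalize_to_string_or_none_sketch is hypothetical, and leaving whitespace-only strings unchanged is an assumption.

```python
from typing import Any, Optional


def normalize_to_string_or_none_sketch(value: Any) -> Optional[str]:
    """Return None for None or an empty string, otherwise str(value)."""
    if value is None or value == "":
        return None
    return str(value)


print(normalize_to_string_or_none_sketch(42))  # 42 (as the string "42")
print(normalize_to_string_or_none_sketch(""))  # None
```

Note that falsy values such as 0 or False are not treated as empty here and come back as "0" and "False".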

academic_observatory_workflows.datacite_telescope.datacite_transform.filter_non_empty_dicts(arr)

Filters out empty dictionaries from a list.
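As an illustration of the description above, the following sketch drops empty dictionaries and keeps every other element. The name filter_non_empty_dicts_sketch is hypothetical, and how the real function treats non-dictionary items is not documented, so keeping them is an assumption.

```python
from typing import Any, List


def filter_non_empty_dicts_sketch(arr: List[Any]) -> List[Any]:
    """Return a new list without the empty dictionaries found in arr."""
    return [item for item in arr if item != {}]


print(filter_non_empty_dicts_sketch([{"a": 1}, {}, {"b": 2}]))
# [{'a': 1}, {'b': 2}]
```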

academic_observatory_workflows.datacite_telescope.datacite_transform.transform_geo_locations(geoLocations)

Transforms and cleans geo-location data.

academic_observatory_workflows.datacite_telescope.datacite_transform.normalize_affiliations_and_identifiers(obj, field)

Normalizes affiliation and name identifier fields in a list.

academic_observatory_workflows.datacite_telescope.datacite_transform.normalize_identifier_fields(obj: dict, field: str, subfield: str)

Normalizes identifier fields to strings within a specified field.

academic_observatory_workflows.datacite_telescope.datacite_transform.normalize_related_item(value)

Normalizes related items by converting to string or returning None.

academic_observatory_workflows.datacite_telescope.datacite_transform.transform_object(obj)

academic_observatory_workflows.datacite_telescope.datacite_transform.transform(input_path: str, output_path: str) → Tuple[str, bool, collections.OrderedDict, list]

academic_observatory_workflows.datacite_telescope.datacite_transform.get_chunks(*, input_list: List[Any], chunk_size: int = 8) → List[Any]

Generator that splits a list into chunks of a fixed size.

Parameters:
  • input_list – Input list.

  • chunk_size – Size of chunks.

Returns:

The next chunk from the input list.
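The chunking described above can be sketched with list slicing. The keyword-only parameters and the default chunk_size of 8 come from the signature. The name get_chunks_sketch is hypothetical, and returning a shorter final chunk when the list does not divide evenly is an assumption about the real function.

```python
from typing import Any, Iterator, List


def get_chunks_sketch(*, input_list: List[Any], chunk_size: int = 8) -> Iterator[List[Any]]:
    """Yield consecutive slices of input_list, each at most chunk_size long."""
    for start in range(0, len(input_list), chunk_size):
        yield input_list[start : start + chunk_size]


chunks = list(get_chunks_sketch(input_list=[1, 2, 3, 4, 5], chunk_size=2))
print(chunks)  # [[1, 2], [3, 4], [5]]
```

Because of the bare `*` in the signature, both arguments must be passed by keyword.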

academic_observatory_workflows.datacite_telescope.datacite_transform.generate_schema_for_dataset(input_folder: pathlib.Path, output_folder: pathlib.Path, max_workers: int)

academic_observatory_workflows.datacite_telescope.datacite_transform.check_directory(path)

Check if the provided path is a valid directory.
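One common way to use a check like this is as an argparse `type=` callback. The sketch below shows that pattern under stated assumptions: the names check_directory_sketch and parser_sketch are hypothetical, the error type and return value are guesses, and nothing on this page confirms that the module connects check_directory to its parser attribute in this way.

```python
import argparse
import os


def check_directory_sketch(path: str) -> str:
    """Return path unchanged if it is an existing directory, else raise."""
    if not os.path.isdir(path):
        # argparse turns ArgumentTypeError into a usage error message.
        raise argparse.ArgumentTypeError(f"{path!r} is not a valid directory")
    return path


# Hypothetical command-line wiring, for illustration only.
parser_sketch = argparse.ArgumentParser(description="hypothetical CLI")
parser_sketch.add_argument("input_folder", type=check_directory_sketch)
```

With this wiring, an invalid folder is rejected while arguments are parsed, before any transform work starts.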

academic_observatory_workflows.datacite_telescope.datacite_transform.parser