academic_observatory_workflows.datacite_telescope.datacite_transform

Attributes

parser

Functions

`yield_jsonl`(file_path)	Yield each row of a JSON lines file as a dictionary.
`merge_schema_maps`(→ collections.OrderedDict)	Using the SchemaGenerator from the bigquery_schema_generator library, merge the schemas found when scanning through files into one large nested OrderedDict.
`flatten_schema`(→ dict)	Convert a nested OrderedDict object to a regular dictionary using the JSON encoder and load string function.
`sort_schema`(input_file)
`list_jsonl_files`(folder_path)
`remove_empty_dicts`(arr)
`remove_nulls_from_list_field`(obj, field)
`clean_string_or_bool`(value)
`format_geography_point`(point)	Formats a geography point string if latitude and longitude are present.
`normalize_to_string_or_none`(value)	Normalizes a value to a string or returns None if the value is None or an empty string.
`filter_non_empty_dicts`(arr)	Filters out empty dictionaries from a list.
`transform_geo_locations`(geoLocations)	Transforms and cleans geo-location data.
`normalize_affiliations_and_identifiers`(obj, field)	Normalizes affiliation and name identifier fields in a list.
`normalize_identifier_fields`(obj, field, subfield)	Normalizes identifier fields to strings within a specified field.
`normalize_related_item`(value)	Normalizes related items by converting to string or returning None.
`transform_object`(obj)
`transform`(→ Tuple[str, bool, collections.OrderedDict, ...)
`get_chunks`(→ List[Any])	Generator that splits a list into chunks of a fixed size.
`generate_schema_for_dataset`(input_folder, ...)
`check_directory`(path)	Check if the provided path is a valid directory.

Module Contents

academic_observatory_workflows.datacite_telescope.datacite_transform.yield_jsonl(file_path: str)[source]

Yield each row of a JSON lines file as a dictionary. If the file is gz compressed, it is decompressed as it is read.

Parameters: file_path – the path to the JSON lines file.
Returns: generator.
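The behaviour described above can be sketched as follows. This is a minimal illustration, not the module's actual source: the name `yield_jsonl_sketch` is hypothetical, and detecting compression from a `.gz` suffix and skipping blank lines are both assumptions.

```python
import gzip
import json
from typing import Iterator


def yield_jsonl_sketch(file_path: str) -> Iterator[dict]:
    """Yield each non-empty line of a JSON lines file as a dictionary."""
    # Assumption: a ".gz" suffix means the file is gzip compressed.
    opener = gzip.open if file_path.endswith(".gz") else open
    # "rt" opens both plain and gzip files in text mode.
    with opener(file_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # Assumption: blank lines are skipped.
                yield json.loads(line)
```

Because the function is a generator, rows are parsed one at a time, so a large file does not need to fit in memory.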

academic_observatory_workflows.datacite_telescope.datacite_transform.merge_schema_maps(to_add: collections.OrderedDict, old: collections.OrderedDict) → collections.OrderedDict[source]

Using the SchemaGenerator from the bigquery_schema_generator library, merge the schemas found when scanning through files into one large nested OrderedDict.

Parameters:

to_add – The incoming schema to add to the existing “old” schema.
old – The existing old schema with previously populated values.

Returns:

The old schema with newly added fields.

academic_observatory_workflows.datacite_telescope.datacite_transform.flatten_schema(schema_map: collections.OrderedDict) → dict[source]

Converts a nested OrderedDict object to a regular dictionary with a quick trick: encode it with the JSON encoder, then load the resulting string.

Parameters: schema_map – The generated schema from SchemaGenerator.
Returns: A BigQuery style schema.
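The JSON round trip mentioned in the description can be illustrated in isolation. The helper name `to_plain_dict` is hypothetical, and this sketch shows only the OrderedDict-to-dict conversion, not any BigQuery-specific schema handling the real function may perform.

```python
import json
from collections import OrderedDict


def to_plain_dict(nested: OrderedDict) -> dict:
    """Convert a nested OrderedDict into plain dicts via a JSON round trip."""
    # json.dumps serialises OrderedDicts like ordinary mappings;
    # json.loads then rebuilds every level as a regular dict.
    return json.loads(json.dumps(nested))


schema_map = OrderedDict(
    [("title", OrderedDict([("type", "STRING"), ("mode", "NULLABLE")]))]
)
plain = to_plain_dict(schema_map)
```

The round trip works only when every value is JSON serialisable, which is a limitation of this trick rather than something the source states.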

academic_observatory_workflows.datacite_telescope.datacite_transform.sort_schema(input_file: pathlib.Path)[source]

academic_observatory_workflows.datacite_telescope.datacite_transform.list_jsonl_files(folder_path)[source]

academic_observatory_workflows.datacite_telescope.datacite_transform.remove_empty_dicts(arr)[source]

academic_observatory_workflows.datacite_telescope.datacite_transform.remove_nulls_from_list_field(obj: dict, field: str)[source]

academic_observatory_workflows.datacite_telescope.datacite_transform.clean_string_or_bool(value)[source]

academic_observatory_workflows.datacite_telescope.datacite_transform.format_geography_point(point)[source]: Formats a geography point string if latitude and longitude are present.

academic_observatory_workflows.datacite_telescope.datacite_transform.normalize_to_string_or_none(value)[source]: Normalizes a value to a string or returns None if the value is None or an empty string.
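A minimal sketch of the rule stated above, written from the one-line description only. The name `normalize_to_string_or_none_sketch` is hypothetical, and the use of `str()` for every other value is an assumption.

```python
from typing import Any, Optional


def normalize_to_string_or_none_sketch(value: Any) -> Optional[str]:
    """Return None for None or an empty string, otherwise str(value)."""
    if value is None or value == "":
        return None
    # Assumption: all remaining values are converted with str().
    return str(value)
```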

academic_observatory_workflows.datacite_telescope.datacite_transform.filter_non_empty_dicts(arr)[source]: Filters out empty dictionaries from a list.
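The filtering described above could look like the following sketch. The name `filter_non_empty_dicts_sketch` is hypothetical, and keeping every item that is not an empty dictionary (including non-dict items) is an assumption, since the source does not say how other values are treated.

```python
from typing import Any, List


def filter_non_empty_dicts_sketch(arr: List[Any]) -> List[Any]:
    """Return a new list without the empty dictionaries."""
    # Assumption: only items equal to {} are dropped; everything else is kept.
    return [item for item in arr if item != {}]
```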

academic_observatory_workflows.datacite_telescope.datacite_transform.transform_geo_locations(geoLocations)[source]: Transforms and cleans geo-location data.

academic_observatory_workflows.datacite_telescope.datacite_transform.normalize_affiliations_and_identifiers(obj, field)[source]: Normalizes affiliation and name identifier fields in a list.

academic_observatory_workflows.datacite_telescope.datacite_transform.normalize_identifier_fields(obj: dict, field: str, subfield: str)[source]: Normalizes identifier fields to strings within a specified field.

academic_observatory_workflows.datacite_telescope.datacite_transform.normalize_related_item(value)[source]: Normalizes related items by converting to string or returning None.

academic_observatory_workflows.datacite_telescope.datacite_transform.transform_object(obj)[source]

academic_observatory_workflows.datacite_telescope.datacite_transform.transform(input_path: str, output_path: str) → Tuple[str, bool, collections.OrderedDict, list][source]

academic_observatory_workflows.datacite_telescope.datacite_transform.get_chunks(*, input_list: List[Any], chunk_size: int = 8) → List[Any][source]

Generator that splits a list into chunks of a fixed size.

Parameters:

input_list – Input list.
chunk_size – Size of chunks.

Returns:

The next chunk from the input list.
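Fixed-size chunking with a generator is commonly written as below. This is a sketch that mirrors the documented keyword-only signature; the name `get_chunks_sketch` is hypothetical and the slicing approach is an assumption about the implementation.

```python
from typing import Any, Iterator, List


def get_chunks_sketch(*, input_list: List[Any], chunk_size: int = 8) -> Iterator[List[Any]]:
    """Yield successive slices of input_list with at most chunk_size items."""
    for start in range(0, len(input_list), chunk_size):
        # The final chunk is shorter when the length is not a multiple of chunk_size.
        yield input_list[start : start + chunk_size]


chunks = list(get_chunks_sketch(input_list=list(range(5)), chunk_size=2))
```

Both arguments must be passed by keyword because of the bare `*` in the signature.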

academic_observatory_workflows.datacite_telescope.datacite_transform.generate_schema_for_dataset(input_folder: pathlib.Path, output_folder: pathlib.Path, max_workers: int)[source]

academic_observatory_workflows.datacite_telescope.datacite_transform.check_directory(path)[source]: Check if the provided path is a valid directory.
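A directory check of this kind is often used as an `argparse` type callback. The sketch below assumes that usage: the name `check_directory_sketch`, the `argparse.ArgumentTypeError`, and the returned path are all assumptions rather than the documented behaviour.

```python
import argparse
import os


def check_directory_sketch(path: str) -> str:
    """Return path if it is an existing directory, otherwise raise an error."""
    if not os.path.isdir(path):
        # Assumption: argparse reports this message to the user when the
        # function is passed as type= to add_argument.
        raise argparse.ArgumentTypeError(f"{path} is not a valid directory")
    return path
```

With that assumption, it could be wired up as `parser.add_argument("input_folder", type=check_directory_sketch)`, where the argument name is illustrative.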

academic_observatory_workflows.datacite_telescope.datacite_transform.parser[source]