academic_observatory_workflows.datacite_telescope.datacite_transform
Module Contents
- academic_observatory_workflows.datacite_telescope.datacite_transform.yield_jsonl(file_path: str)[source]
Yields each row of a JSON Lines file as a dictionary. If the file is gzip-compressed, it is extracted first.
- Parameters:
file_path – the path to the JSON lines file.
- Returns:
a generator that yields one dictionary per row.
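To illustrate the behaviour described for yield_jsonl, here is a minimal sketch of a JSON Lines reader. It is not the module's implementation: the name yield_jsonl_sketch is hypothetical, and detecting gzip compression from the ".gz" suffix is an assumption.

```python
import gzip
import json
from typing import Any, Dict, Iterator


def yield_jsonl_sketch(file_path: str) -> Iterator[Dict[str, Any]]:
    """Yield each non-blank line of a JSON Lines file as a dictionary."""
    # Assumption: compression is detected from the ".gz" file suffix.
    opener = gzip.open if file_path.endswith(".gz") else open
    with opener(file_path, mode="rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```

Because the function is a generator, rows are parsed one at a time, so a large file does not need to fit in memory.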
- academic_observatory_workflows.datacite_telescope.datacite_transform.merge_schema_maps(to_add: collections.OrderedDict, old: collections.OrderedDict) collections.OrderedDict[source]
Using the SchemaGenerator from the bigquery_schema_generator library, merges the schemas found while scanning through files into one large nested OrderedDict.
- Parameters:
to_add – The incoming schema to add to the existing “old” schema.
old – The existing old schema with previously populated values.
- Returns:
The old schema with newly added fields.
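The general idea of folding an incoming nested mapping into an existing one can be sketched as follows. This is a simplified illustration only: merge_nested is a hypothetical helper, it does not call SchemaGenerator, and it keeps the existing value on conflict, whereas real schema merging also has to reconcile field types and modes.

```python
from collections import OrderedDict


def merge_nested(to_add: OrderedDict, old: OrderedDict) -> OrderedDict:
    """Recursively add keys from `to_add` that are missing from `old`."""
    for key, value in to_add.items():
        if key in old and isinstance(old[key], dict) and isinstance(value, dict):
            # Both sides are nested mappings: merge their children in place.
            merge_nested(value, old[key])
        elif key not in old:
            # New key: adopt the incoming value as-is.
            old[key] = value
        # Otherwise keep the existing value (an assumption of this sketch).
    return old
```

The sketch mutates and returns `old`, mirroring the documented return value ("the old schema with newly added fields").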
- academic_observatory_workflows.datacite_telescope.datacite_transform.flatten_schema(schema_map: collections.OrderedDict) dict[source]
Converts a nested OrderedDict object to a regular dictionary by round-tripping it through the JSON encoder and the load-string function.
- Parameters:
schema_map – The generated schema from SchemaGenerator.
- Return schema:
A BigQuery-style schema.
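The JSON round trip mentioned in the description can be sketched with the standard library alone. The helper name to_plain_dict is hypothetical, and the sketch shows only the OrderedDict-to-dict conversion, not any BigQuery-specific flattening the real function may perform.

```python
import json
from collections import OrderedDict


def to_plain_dict(nested: OrderedDict) -> dict:
    """Convert nested OrderedDicts to plain dicts via a JSON round trip."""
    # json.dumps serialises OrderedDict like any dict; json.loads then
    # rebuilds the structure using plain dict and list objects.
    return json.loads(json.dumps(nested))


nested = OrderedDict(a=OrderedDict(b=1), c=[OrderedDict(d=2)])
plain = to_plain_dict(nested)
```

The round trip only works for JSON-serialisable values, and it turns non-string keys into strings and tuples into lists.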
- academic_observatory_workflows.datacite_telescope.datacite_transform.sort_schema(input_file: pathlib.Path)[source]
- academic_observatory_workflows.datacite_telescope.datacite_transform.list_jsonl_files(folder_path)[source]
- academic_observatory_workflows.datacite_telescope.datacite_transform.remove_empty_dicts(arr)[source]
- academic_observatory_workflows.datacite_telescope.datacite_transform.remove_nulls_from_list_field(obj: dict, field: str)[source]
- academic_observatory_workflows.datacite_telescope.datacite_transform.clean_string_or_bool(value)[source]
- academic_observatory_workflows.datacite_telescope.datacite_transform.format_geography_point(point)[source]
Formats a geography point string if latitude and longitude are present.
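One plausible reading of this helper is sketched below. Everything specific here is an assumption rather than the module's documented behaviour: the function name point_to_wkt, the input being a dictionary with the DataCite-style keys pointLatitude and pointLongitude, and the output being a WKT string of the form POINT(longitude latitude).

```python
from typing import Any, Mapping, Optional


def point_to_wkt(point: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Return a WKT point string, or None if a coordinate is missing."""
    if not point:
        return None
    # Assumed key names; the real input structure may differ.
    lat = point.get("pointLatitude")
    lon = point.get("pointLongitude")
    if lat is None or lon is None:
        return None
    # WKT puts longitude (x) before latitude (y).
    return f"POINT({lon} {lat})"
```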
- academic_observatory_workflows.datacite_telescope.datacite_transform.normalize_to_string_or_none(value)[source]
Normalizes a value to a string or returns None if the value is None or an empty string.
- academic_observatory_workflows.datacite_telescope.datacite_transform.filter_non_empty_dicts(arr)[source]
Filters out empty dictionaries from a list.
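The two helpers described above can be approximated in a few lines. The names to_string_or_none and drop_empty_dicts are hypothetical, and edge cases (for example whether whitespace-only strings count as empty) are assumptions of this sketch rather than documented behaviour.

```python
from typing import Any, List, Optional


def to_string_or_none(value: Any) -> Optional[str]:
    """Return str(value), or None when value is None or an empty string."""
    if value is None or value == "":
        return None
    return str(value)


def drop_empty_dicts(arr: List[Any]) -> List[Any]:
    """Return a new list without empty dictionaries; other items are kept."""
    return [item for item in arr if item != {}]
```

Note that falsy but meaningful values such as 0 or False are converted to strings, not discarded.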
- academic_observatory_workflows.datacite_telescope.datacite_transform.transform_geo_locations(geoLocations)[source]
Transforms and cleans geo-location data.
- academic_observatory_workflows.datacite_telescope.datacite_transform.normalize_affiliations_and_identifiers(obj, field)[source]
Normalizes affiliation and name identifier fields in a list.
- academic_observatory_workflows.datacite_telescope.datacite_transform.normalize_identifier_fields(obj: dict, field: str, subfield: str)[source]
Normalizes identifier fields to strings within a specified field.
Normalizes related items by converting to string or returning None.
- academic_observatory_workflows.datacite_telescope.datacite_transform.transform(input_path: str, output_path: str) Tuple[str, bool, collections.OrderedDict, list][source]
- academic_observatory_workflows.datacite_telescope.datacite_transform.get_chunks(*, input_list: List[Any], chunk_size: int = 8) List[Any][source]
Generator that splits a list into chunks of a fixed size.
- Parameters:
input_list – Input list.
chunk_size – Size of chunks.
- Returns:
The next chunk from the input list.
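A fixed-size chunking generator with the same keyword-only signature can be sketched as follows. The name chunks_sketch is hypothetical and the body is an illustration, not the module's source.

```python
from typing import Any, Iterator, List


def chunks_sketch(*, input_list: List[Any], chunk_size: int = 8) -> Iterator[List[Any]]:
    """Yield consecutive slices of input_list, each at most chunk_size long."""
    for start in range(0, len(input_list), chunk_size):
        yield input_list[start:start + chunk_size]


example = list(chunks_sketch(input_list=list(range(5)), chunk_size=2))
```

The final chunk is shorter than chunk_size when the list length is not an exact multiple of it.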
- academic_observatory_workflows.datacite_telescope.datacite_transform.generate_schema_for_dataset(input_folder: pathlib.Path, output_folder: pathlib.Path, max_workers: int)[source]