academic_observatory_workflows.wikipedia
Module Contents
Functions
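fetch_wikipedia_descriptions(wikipedia_urls) | Get the Wikipedia descriptions for each entity (institution or country).
get_wikipedia_title(url) | Get a Wikipedia title from a Wikipedia URL.
fetch_wikipedia_descriptions_batch(urls) | Fetch the Wikipedia descriptions for a set of Wikipedia URLs.
remove_text_between_brackets(text) | Remove any text between (nested) brackets.
shorten_text_full_sentences(text, *, char_limit=300) | Shorten a text to as many complete sentences as possible, while the total number of characters stays below the char_limit.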
- academic_observatory_workflows.wikipedia.fetch_wikipedia_descriptions(wikipedia_urls: List[str]) -> List[Tuple[str, str]]
Get the Wikipedia descriptions for each entity (institution or country).
- Parameters:
wikipedia_urls – a list of Wikipedia URLs.
- Returns:
a list of tuples containing Wikipedia URL and Wikipedia description.
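This page does not say how this function relates to fetch_wikipedia_descriptions_batch. One plausible arrangement is to split the URL list into fixed-size groups and merge the per-group results. The sketch below illustrates only that pattern: `describe_in_batches`, `fake_fetch`, and the batch size of 20 are hypothetical names and values, not part of the module.

```python
from typing import Callable, List, Tuple

Result = List[Tuple[str, str]]


def describe_in_batches(
    urls: List[str],
    fetch_batch: Callable[[List[str]], Result],
    batch_size: int = 20,  # assumed value, not taken from the module
) -> Result:
    """Hypothetical helper: call fetch_batch on consecutive slices of urls."""
    results: Result = []
    for start in range(0, len(urls), batch_size):
        results.extend(fetch_batch(urls[start:start + batch_size]))
    return results


def fake_fetch(batch: List[str]) -> Result:
    """Stand-in for a real network call, so the sketch runs offline."""
    return [(url, f"description of {url}") for url in batch]


urls = [f"https://en.wikipedia.org/wiki/Page_{i}" for i in range(45)]
pairs = describe_in_batches(urls, fake_fetch)
print(len(pairs))
```

Passing the batch fetcher in as an argument keeps the chunking logic testable without network access.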
- academic_observatory_workflows.wikipedia.get_wikipedia_title(url: str) -> str
Get a Wikipedia title from a Wikipedia URL.
- Parameters:
url – a Wikipedia URL.
- Returns:
the title.
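As an illustration of what extracting a title from a URL can involve, here is a minimal sketch using only the standard library. It assumes the common `/wiki/<title>` URL layout and percent-decodes the result. `title_from_url` is a hypothetical name, and the module's own function may treat edge cases (underscores, fragments, other URL layouts) differently.

```python
from urllib.parse import unquote, urlparse


def title_from_url(url: str) -> str:
    """Hypothetical helper: return the page title encoded in a Wikipedia URL."""
    path = urlparse(url).path
    # Assumption: the title is whatever follows the "/wiki/" prefix.
    _, _, title = path.partition("/wiki/")
    # Decode percent-escapes such as %C3%A9 back into readable characters.
    return unquote(title)


print(title_from_url("https://en.wikipedia.org/wiki/Curtin_University"))
```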
- academic_observatory_workflows.wikipedia.fetch_wikipedia_descriptions_batch(urls: List) -> List[Tuple[str, str]]
Fetch the Wikipedia descriptions for a set of Wikipedia URLs.
- Parameters:
urls – a list of Wikipedia URLs.
- Returns:
a list of tuples (id, Wikipedia description).
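One way to request introductory text for several pages in a single call is the MediaWiki Action API with the TextExtracts parameters `prop=extracts`, `exintro`, and `explaintext`. The sketch below only builds such a query URL and sends nothing. Whether this module uses these exact parameters is an assumption, and `build_extracts_url` is a hypothetical helper.

```python
from typing import List
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def build_extracts_url(titles: List[str]) -> str:
    """Hypothetical helper: build a query URL for plain-text page intros."""
    params = {
        "action": "query",
        "format": "json",
        "prop": "extracts",
        "exintro": 1,        # only the section before the first heading
        "explaintext": 1,    # plain text instead of HTML
        "redirects": 1,      # follow page redirects
        "titles": "|".join(titles),  # multiple titles are pipe-separated
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


query_url = build_extracts_url(["Curtin_University", "Australia"])
print(query_url)
```

The JSON response can then be mapped back to the requested URLs to produce the tuples described above.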
- academic_observatory_workflows.wikipedia.remove_text_between_brackets(text: str) -> str
Remove any text between (nested) brackets. If there is a space after the opening bracket, this is removed as well. E.g. ‘Like this (foo, (bar)) example’ -> ‘Like this example’
- Parameters:
text – The text to modify
- Returns:
The modified text
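The behaviour described above can be sketched with a depth counter that skips characters while inside any bracket. This is an illustration, not the module's implementation: `strip_brackets` is a hypothetical name, and the sketch drops a single space directly before an outermost opening bracket, which is one way to reproduce the documented example.

```python
def strip_brackets(text: str) -> str:
    """Hypothetical helper: drop text inside (possibly nested) round brackets."""
    kept = []
    depth = 0  # how many brackets are currently open
    for char in text:
        if char == "(":
            # Assumption: remove one space before an outermost bracket so
            # that no double space is left behind.
            if depth == 0 and kept and kept[-1] == " ":
                kept.pop()
            depth += 1
        elif char == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(char)
    return "".join(kept)


print(strip_brackets("Like this (foo, (bar)) example"))
```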
- academic_observatory_workflows.wikipedia.shorten_text_full_sentences(text: str, *, char_limit: int = 300) -> str
Shorten a text to as many complete sentences as possible, while the total number of characters stays below the char_limit. Always return at least one sentence, even if this exceeds the char_limit.
- Parameters:
text – A string with the complete text
char_limit – The max number of characters
- Returns:
The shortened text.
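The rule described above (keep whole sentences up to the limit, but never fewer than one) can be sketched as follows. This is a simplified illustration under stated assumptions: `shorten_to_sentences` is a hypothetical name, sentences are split naively on whitespace after `.`, `!` or `?`, and a result exactly at the limit is accepted. The module's own sentence detection may differ.

```python
import re


def shorten_to_sentences(text: str, *, char_limit: int = 300) -> str:
    """Hypothetical helper: keep leading sentences within char_limit."""
    # Naive split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Always keep the first sentence, even if it exceeds the limit.
    result = sentences[0]
    for sentence in sentences[1:]:
        candidate = f"{result} {sentence}"
        if len(candidate) > char_limit:
            break
        result = candidate
    return result


sample = "First sentence. Second one. Third."
print(shorten_to_sentences(sample, char_limit=30))
```

With `char_limit=30`, the first two sentences fit (27 characters) and the third would push the total past the limit, so it is dropped.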