academic_observatory_workflows.wikipedia

Module Contents

Functions

fetch_wikipedia_descriptions(→ List[Tuple[str, str]])

Get the Wikipedia descriptions for each entity (institution or country).

get_wikipedia_title(→ str)

Get a Wikipedia title from a Wikipedia URL.

fetch_wikipedia_descriptions_batch(→ List[Tuple[str, str]])

Fetch the Wikipedia descriptions for a set of Wikipedia URLs.

remove_text_between_brackets(→ str)

Remove any text between (nested) brackets.

shorten_text_full_sentences(→ str)

Shorten a text to as many complete sentences as possible, while the total number of characters stays below the char_limit.

Attributes

WIKI_MAX_TITLES

academic_observatory_workflows.wikipedia.WIKI_MAX_TITLES = 20

academic_observatory_workflows.wikipedia.fetch_wikipedia_descriptions(wikipedia_urls: List[str]) → List[Tuple[str, str]]

Get the Wikipedia descriptions for each entity (institution or country).

Parameters:

wikipedia_urls – a list of Wikipedia URLs.

Returns:

a list of tuples, each containing a Wikipedia URL and its Wikipedia description.
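The attribute WIKI_MAX_TITLES = 20 suggests that URLs are looked up in groups of at most 20. This page does not say how the grouping is done, so the following is only a minimal sketch of one way to split a URL list into such groups. The helper name chunk_urls and the example URLs are hypothetical and not part of this module.

```python
from typing import Iterator, List

WIKI_MAX_TITLES = 20  # value documented for this module


def chunk_urls(urls: List[str], size: int = WIKI_MAX_TITLES) -> Iterator[List[str]]:
    """Hypothetical helper: yield consecutive slices of `urls`, each with at most `size` items."""
    for start in range(0, len(urls), size):
        yield urls[start : start + size]


# Made-up example URLs, used only to show the resulting batch sizes.
urls = [f"https://en.wikipedia.org/wiki/Page_{i}" for i in range(45)]
batch_sizes = [len(batch) for batch in chunk_urls(urls)]
print(batch_sizes)  # [20, 20, 5]
```

Under this assumption, each slice could then be handed to fetch_wikipedia_descriptions_batch and the results concatenated.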

academic_observatory_workflows.wikipedia.get_wikipedia_title(url: str) → str

Get a Wikipedia title from a Wikipedia URL.

Parameters:

url – a Wikipedia URL.

Returns:

the title.
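The exact parsing rules of get_wikipedia_title are not documented here. As a rough illustration, a title can be read from the URL path with the standard library. The function title_from_url below is a hypothetical stand-in. It assumes URLs of the form https://<lang>.wikipedia.org/wiki/<Title> and leaves underscores in place, which may differ from what the real function returns.

```python
from urllib.parse import unquote, urlparse


def title_from_url(url: str) -> str:
    """Hypothetical stand-in: return the percent-decoded text after '/wiki/' in the URL path."""
    path = urlparse(url).path
    # Assumption: the article title is everything in the path after '/wiki/'.
    _, _, title = path.partition("/wiki/")
    return unquote(title)


print(title_from_url("https://en.wikipedia.org/wiki/Curtin_University"))
# Curtin_University
```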

academic_observatory_workflows.wikipedia.fetch_wikipedia_descriptions_batch(urls: List) → List[Tuple[str, str]]

Fetch the Wikipedia descriptions for a set of Wikipedia URLs.

Parameters:

urls – a list of Wikipedia URLs.

Returns:

a list of tuples (id, Wikipedia description).

academic_observatory_workflows.wikipedia.remove_text_between_brackets(text: str) → str

Remove any text between (nested) brackets. A space adjacent to the brackets is removed as well, so that no double space is left behind. E.g. 'Like this (foo, (bar)) example' → 'Like this example'

Parameters:

text – The text to modify

Returns:

The modified text
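The documented behaviour can be reproduced with a depth counter that tracks how many brackets are currently open. The sketch below is an illustration and not necessarily the module's implementation. The name strip_bracketed is hypothetical, and dropping the space in front of the outermost opening bracket is an assumption chosen to match the example above.

```python
def strip_bracketed(text: str) -> str:
    """Hypothetical sketch: drop text inside round brackets, including nested ones."""
    kept = []
    depth = 0  # number of currently open brackets
    for char in text:
        if char == "(":
            depth += 1
            # Assumption: drop the space just before the outermost opening bracket.
            if depth == 1 and kept and kept[-1] == " ":
                kept.pop()
        elif char == ")" and depth > 0:
            depth -= 1
        elif depth == 0:
            kept.append(char)
    return "".join(kept).strip()


print(strip_bracketed("Like this (foo, (bar)) example"))
# Like this example
```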

academic_observatory_workflows.wikipedia.shorten_text_full_sentences(text: str, *, char_limit: int = 300) → str

Shorten a text to as many complete sentences as possible, while the total number of characters stays below the char_limit. Always return at least one sentence, even if this exceeds the char_limit.

Parameters:
  • text – A string with the complete text

  • char_limit – The max number of characters

Returns:

The shortened text.
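One way to read this contract is shown in the sketch below: always keep the first sentence, then append further sentences while the result still fits. The name shorten_to_sentences is hypothetical. The sentence-splitting rule (a '.', '!' or '?' followed by whitespace) and treating char_limit as an inclusive maximum are both assumptions, since this page does not specify them.

```python
import re


def shorten_to_sentences(text: str, *, char_limit: int = 300) -> str:
    """Hypothetical sketch: keep whole sentences while the result fits in char_limit."""
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = sentences[0]  # the first sentence is kept even if it is too long
    for sentence in sentences[1:]:
        candidate = f"{result} {sentence}"
        # Assumption: char_limit is treated as an inclusive maximum.
        if len(candidate) > char_limit:
            break
        result = candidate
    return result


text = "One. Two is longer. Three."
print(shorten_to_sentences(text, char_limit=19))  # One. Two is longer.
print(shorten_to_sentences(text, char_limit=2))   # One.
```

The second call shows the documented exception: a single sentence is returned even though it is longer than the limit.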