Utilities#
NLP tools#
Well-tested tools that can be used in production with confidence. The toolbox of feature-extraction functions is expected to grow to cover other domains as well.
clean(text, ignore_xml=False, ignore_latex=False, remove_brackets=False, convert_to_ascii=False)
#
Clean all XML, LaTeX, PDF-extraction, and Unicode artifacts from the text.
The cleaning is quite heavy-weight and can be destructive. However, when working with text, this is usually required to achieve sufficient cleanliness before further processing.
Optionally, the text can be turned into ASCII. Carefully consider whether this is absolutely needed for your use-case.
Examples:
>>> clean('<h2 color="red">Bj\\"{o}rn is \t \\textit{happy} 🙂 <3</h2>')
'Björn is happy 🙂 <3'
>>> clean(
... '<h2 color="red">Bj\\"{o}rn is \t \\textit{happy} 🙂 <3</h2>',
... convert_to_ascii=True
... )
'Bjorn is happy <3'
>>> clean(
... '<h2 color="red">Bj\\"{o}rn is \t \\textit{happy} 🙂 <3</h2>',
... ignore_xml=True
... )
'<h2 color="red">Björn is happy 🙂 <3</h2>'
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Text to be cleaned. | required |
ignore_xml | bool | Do not process/remove XML-tags. | False |
ignore_latex | bool | Do not process/remove LaTeX-tags. | False |
remove_brackets | bool | Remove brackets ([]). | False |
convert_to_ascii | bool | Strip (or convert) non-ASCII characters. | False |
Returns:
Type | Description |
---|---|
str | The cleaned input text with sensibly collapsed whitespace and optionally no markup. |
Source code in great_ai/utilities/clean.py
get_sentences(text, ignore_partial=False, true_case=False, remove_punctuation=False)
#
Return the list of sentences found in the input text.
Use syntok to segment the sentences. Further processing can be enabled with optional arguments.
Examples:
>>> get_sentences('This is a sentence. This is a half', ignore_partial=True)
['This is a sentence.']
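The optional flags can be combined; a rough sketch follows (import path assumed, the output in the comment is only indicative of the flags' effect):

from great_ai.utilities import get_sentences  # assumed import path

sentences = get_sentences(
    'This is a sentence. This is a half',
    true_case=True,           # crudely lowercase the first word of each sentence
    remove_punctuation=True,  # strip punctuation from the returned sentences
)
# indicatively: ['this is a sentence', 'this is a half']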
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | Text to be segmented into sentences. | required |
ignore_partial | bool | Filter out sentences that are not capitalised or don't end with punctuation. | False |
true_case | bool | Crude method: lowercase the first word of each sentence. | False |
remove_punctuation | bool | Remove all kinds of punctuation. | False |
Returns:
Type | Description |
---|---|
List[str] | The found sentences (with partial sentences optionally filtered out). |
Source code in great_ai/utilities/get_sentences.py
predict_language(text)
#
Predict the language code from text.
A thin wrapper over langcodes for convenient language tagging.
Examples:
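A minimal usage sketch (import path assumed; the detected code may vary with the underlying detector):

from great_ai.utilities import predict_language  # assumed import path

code = predict_language('The weather is lovely today.')
# typically a standardised tag such as 'en'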
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | Optional[str] | Text used for prediction. | required |
Returns:
Type | Description |
---|---|
str | The predicted language code (en, en-US), or und if the language cannot be determined. |
Source code in great_ai/utilities/language/predict_language.py
english_name_of_language(language_code)
#
Human-friendly English name of a language from its language_code.
A thin wrapper over langcodes for convenient language tagging.
Examples:
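A brief sketch (import path assumed; the exact display string comes from langcodes):

from great_ai.utilities import english_name_of_language  # assumed import path

name = english_name_of_language('en-GB')
# a human-readable name such as 'English (United Kingdom)'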
Parameters:
Name | Type | Description | Default |
---|---|---|---|
language_code | Optional[str] | Language code, for example, returned by great_ai.utilities.language.predict_language.predict_language. | required |
Returns:
Type | Description |
---|---|
str | English name of language. |
Source code in great_ai/utilities/language/english_name_of_language.py
is_english(language_code)
#
Decide whether the language_code refers to an English language variant.
A thin wrapper over langcodes for convenient language tagging.
Examples:
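A brief sketch (import path assumed):

from great_ai.utilities import is_english  # assumed import path

is_english('en-US')  # expected to be True for English variants
is_english('de')     # expected to be False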
Parameters:
Name | Type | Description | Default |
---|---|---|---|
language_code | Optional[str] | Language code, for example, returned by great_ai.utilities.language.predict_language.predict_language. | required |
Returns:
Type | Description |
---|---|
bool | Boolean indicating whether it's English. |
Source code in great_ai/utilities/language/is_english.py
evaluate_ranking(expected, actual_scores, target_recall, title='', disable_interpolation=False, axes=None, output_svg=None, reverse_order=False, plot=True)
#
Render the Precision-Recall curve of a ranking.
An improved version of scikit-learn's PR curve.
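A rough usage sketch (import path assumed; the labels and scores are hypothetical, and plot=False is used to only compute the returned values without rendering a chart):

from great_ai.utilities import evaluate_ranking  # assumed import path

expected = [1, 1, 1, 2, 2, 2]                    # expected rank of each element
actual_scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]   # scores produced by the model
precision_at_recall = evaluate_ranking(
    expected,
    actual_scores,
    target_recall=0.5,
    title='Toy ranking',
    plot=False,  # skip displaying the chart, just return the Dict[T, float] result
)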
Parameters:
Name | Type | Description | Default |
---|---|---|---|
expected | List[T] | Expected ordering of the elements (rank if it's an integer, alphabetical if a string). | required |
actual_scores | List[float] | Actual ranking scores (need not be on the same scale as expected). | required |
title | Optional[str] | Title of the plot. | '' |
disable_interpolation | bool | Do not interpolate. | False |
axes | Optional[Axes] | Matplotlib axes for plotting inside a subplot. | None |
output_svg | Optional[Path] | If specified, save the chart as an SVG to the given Path. | None |
reverse_order | bool | Reverse the ranking specified by actual_scores. | False |
plot | bool | Display a plot on the screen. | True |
Returns:
Type | Description |
---|---|
Dict[T, float] | Precision values at given recall. |
Source code in great_ai/utilities/evaluate_ranking/evaluate_ranking.py
Parallel processing#
Multiprocessing and multithreading-based parallelism with support for async functions. Its main purpose is to implement great_ai.GreatAI.process_batch; however, the parallel processing functions are also convenient for covering other types of mapping needs with a friendlier API than joblib or multiprocess.
simple_parallel_map(func, input_values, *, chunk_size=None, concurrency=None)
#
Execute a map operation on a list, mimicking the API of the built-in map().
A thin wrapper over parallel_map. For more options, consult the documentation of parallel_map.
Examples:
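A minimal sketch (import path assumed; dill-serialisable functions, including lambdas, should work because the call is delegated to parallel_map):

from great_ai.utilities import simple_parallel_map  # assumed import path

simple_parallel_map(lambda value: value ** 2, [1, 2, 3])
# expected to return [1, 4, 9]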
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func | Callable[[T], Union[V, Awaitable[V]]] | The function that should be applied to each element of input_values. | required |
input_values | Sequence[T] | An iterable of items that func is applied to. | required |
chunk_size | Optional[int] | Tune the number of items processed in each step. Larger numbers result in smaller communication overhead but less parallelism at the start and end. If None, a sensible default is used. | None |
concurrency | Optional[int] | Number of new processes to start. Shouldn't be too much more than the number of physical cores. | None |
Returns:
Type | Description |
---|---|
List[V] | An iterable of results obtained from applying func to each element of input_values. |
Raises:
Type | Description |
---|---|
WorkerException | If there was an error in the func function. |
Source code in great_ai/utilities/parallel_map/simple_parallel_map.py
parallel_map(func, input_values, *, chunk_size=None, ignore_exceptions=False, concurrency=None, unordered=False)
#
Execute a map operation on an iterable stream.
A custom parallel map operation supporting both synchronous and async map functions. The func function is serialised with dill. Exceptions encountered in the map function are sent to the host process where they are either raised (default) or ignored.
The new processes are forked if the OS allows it; otherwise, new Python processes are bootstrapped, which can incur some start-up cost. Each process processes a single chunk at once.
Examples:
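A minimal sketch (import path assumed): because the results are yielded lazily, wrap the call in list() to collect them.

from great_ai.utilities import parallel_map  # assumed import path

results = list(
    parallel_map(
        lambda value: value * 2,
        range(100),
        chunk_size=10,  # 10 items per chunk sent to a worker process
        concurrency=4,  # roughly the number of physical cores
    )
)
# expected: [0, 2, 4, ..., 198], in the original order since unordered=False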
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func | Callable[[T], Union[V, Awaitable[V]]] | The function that should be applied to each element of input_values. | required |
input_values | Union[Iterable[T], Sequence[T]] | An iterable of items that func is applied to. | required |
chunk_size | Optional[int] | Tune the number of items processed in each step. Larger numbers result in smaller communication overhead but less parallelism at the start and end. If None, a sensible default is used. | None |
ignore_exceptions | bool | Ignore chunks in which func raises an exception; the corresponding results are yielded as None. | False |
concurrency | Optional[int] | Number of new processes to start. Shouldn't be too much more than the number of physical cores. | None |
unordered | bool | Do not preserve the order of the elements, yield them as soon as they have been processed. This decreases the latency caused by difficult-to-process items. | False |
Yields:
Type | Description |
---|---|
Iterable[Optional[V]] | The next result obtained from applying func to an element of input_values. |
Raises:
Type | Description |
---|---|
WorkerException | If there was an error in the func function. |
Source code in great_ai/utilities/parallel_map/parallel_map.py
threaded_parallel_map(func, input_values, *, chunk_size=None, ignore_exceptions=False, concurrency=None, unordered=False)
#
Execute a map operation on an iterable stream.
Similar to parallel_map but uses threads instead of processes. Hence, it is not helpful in CPU-bound situations.
A custom parallel map operation supporting both synchronous and async map functions. Exceptions encountered in the map function are sent to the host thread where they are either raised (default) or ignored. Each thread processes a single chunk at once.
Examples:
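A sketch of a typical I/O-bound use (import path assumed; the urls and fetch function are illustrative):

from urllib.request import urlopen

from great_ai.utilities import threaded_parallel_map  # assumed import path

urls = ['https://example.com'] * 20  # illustrative inputs

def fetch(url: str) -> int:
    with urlopen(url) as response:  # I/O-bound work benefits from threads
        return len(response.read())

sizes = list(threaded_parallel_map(fetch, urls, concurrency=8))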
Parameters:
Name | Type | Description | Default |
---|---|---|---|
func | Callable[[T], Union[V, Awaitable[V]]] | The function that should be applied to each element of input_values. | required |
input_values | Union[Iterable[T], Sequence[T]] | An iterable of items that func is applied to. | required |
chunk_size | Optional[int] | Tune the number of items processed in each step. Larger numbers result in smaller communication overhead but less parallelism at the start and end. If None, a sensible default is used. | None |
ignore_exceptions | bool | Ignore chunks in which func raises an exception; the corresponding results are yielded as None. | False |
concurrency | Optional[int] | Number of new threads to start. | None |
unordered | bool | Do not preserve the order of the elements, yield them as soon as they have been processed. This decreases the latency caused by difficult-to-process items. | False |
Yields:
Type | Description |
---|---|
Iterable[Optional[V]] | The next result obtained from applying func to an element of input_values. |
Raises:
Type | Description |
---|---|
WorkerException | If there was an error in the func function. |
Source code in great_ai/utilities/parallel_map/threaded_parallel_map.py
Composable parallel processing#
Because both threaded_parallel_map and parallel_map have a streaming interface, it is easy to compose them and end up with, for example, a process for each CPU core with its own thread-pool or event-loop. Longer pipelines are also easy to imagine. The chunking methods help in these compositions.
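For instance, the composition described above might look like the following sketch (the import paths and the fetch function are illustrative assumptions):

import time

from great_ai.utilities import chunk, parallel_map, threaded_parallel_map, unchunk  # assumed import paths

def fetch(item: int) -> int:
    time.sleep(0.01)  # stand-in for an I/O-bound call
    return item * 2

def process_batch(batch):
    # each worker process runs its own thread-pool over its batch
    return list(threaded_parallel_map(fetch, batch, concurrency=8))

batches = chunk(range(1000), chunk_size=100)      # split the stream into chunks of 100
processed = parallel_map(process_batch, batches)  # one batch per process at a time
results = list(unchunk(processed))                # flatten back into a single stream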
chunk(values, chunk_size)
#
Turn an iterable of items into an iterable of lists (chunks) of items.
Each returned chunk has length chunk_size, except the last one, whose length is between 1 and chunk_size.
Useful for parallel processing.
Examples:
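A minimal sketch (import path assumed):

from great_ai.utilities import chunk  # assumed import path

list(chunk(range(7), chunk_size=3))
# expected to yield [[0, 1, 2], [3, 4, 5], [6]]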
Parameters:
Name | Type | Description | Default |
---|---|---|---|
values | Iterable[T] | The stream of items to pack into chunks. | required |
chunk_size | int | Desired length of each (but the last) chunk. | required |
Yields:
Type | Description |
---|---|
Iterable[List[T]] | The next chunk. |
Source code in great_ai/utilities/chunk.py
unchunk(chunks)
#
Turn a stream of chunks of items into a stream of items (flatten operation).
The inverse operation of chunk. Useful for parallel processing.
Similar to itertools.chain but ignores None chunks.
Examples:
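A minimal sketch (import path assumed):

from great_ai.utilities import unchunk  # assumed import path

list(unchunk([[1, 2], None, [3, 4]]))
# expected to yield [1, 2, 3, 4]; the None chunk is skipped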
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chunks | Iterable[Optional[Iterable[T]]] | Stream of chunks to unpack. | required |
Yields:
Type | Description |
---|---|
Iterable[T] | The next item in the flattened iterable. |
Source code in great_ai/utilities/unchunk.py
Operations#
ConfigFile
#
Bases: Mapping[str, str]
A small and safe INI-style configuration loader with dict and ENV support.
The values can be accessed using both dot- and index-notation. It is compatible with the dict interface.
File format example:
# comments are allowed everywhere
key = value # you can leave or omit whitespace around the equal-sign
my_hashtag = "#great_ai" # the r-value can be quoted with " or ' or `.
my_var = my_default_value # Default values can be given to env-vars,
# see next line. The default value must come first.
my_var = ENV:MY_ENV_VAR # If the value starts with the `ENV:` prefix,
# it is looked up from the environment variables.
Examples:
>>> ConfigFile('tests/utilities/data/simple.conf')
ConfigFile(path=tests/utilities/data/simple.conf) {'zeroth_key': 'test', 'first_key': 'András'}
>>> ConfigFile('tests/utilities/data/simple.conf').second_key
Traceback (most recent call last):
...
KeyError: 'Key `second_key` is not found in configuration file ...
>>> a = ConfigFile('tests/utilities/data/simple.conf')
>>> {**a}
{'zeroth_key': 'test', 'first_key': 'András'}
Source code in great_ai/utilities/config_file/config_file.py
path: Path
property
#
Original path from where the configuration was loaded.
__init__(path, *, ignore_missing=False)
#
Load and parse a configuration file.
Everything is eagerly loaded; thus, exceptions may be thrown here.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Union[Path, str] | Local path of the configuration file. | required |
ignore_missing | bool | Don't raise an exception on missing environment variables. | False |
Raises:
Type | Description |
---|---|
FileNotFoundError | If there is no file at the specified path. |
ParseError | If the provided file does not conform to the expected format. |
KeyError | If there is duplication in the keys. |
ValueError | If an environment variable is referenced but it is not set in the system and ignore_missing is False. |
Source code in great_ai/utilities/config_file/config_file.py
get_logger(name, level=logging.INFO, disable_colors=False)
#
Return a customised logger used throughout the GreatAI codebase.
Uses colors and only prints timestamps when not running inside a notebook.
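A minimal usage sketch (import path assumed):

import logging

from great_ai.utilities import get_logger  # assumed import path

logger = get_logger('my_component', level=logging.DEBUG)
logger.info('Pipeline started')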