Simple example: data engineering¶
Here, we solve a problem similar to the tutorial's, but with an explainable Naive Bayes classifier and more best practices. In short, we train a domain classifier on the Semantic Scholar dataset while taking full advantage of great-ai. Subsequently, we create a production-ready deployment.
The blue boxes show the steps of a typical AI-development lifecycle implemented in this notebook.
Since the true scope of great-ai is the phase between proof-of-concept code and production-ready service, it is predominantly used in the deployment notebook. Feel free to skip there, or continue reading if you'd like to see the full picture.
Extract¶
This can be achieved by downloading a public dataset (as in this case) or by having a Data Engineer set up and give us access to the organisation's data.
In this example, we download the semantic scholar dataset from a public S3 bucket.
MAX_CHUNK_COUNT = 4

import urllib.request
from random import shuffle

manifest = (
    urllib.request.urlopen(
        "https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/"
        "open-corpus/2022-02-01/manifest.txt"
    )
    .read()
    .decode()
)  # a list of available chunks separated by '\n' characters

lines = manifest.split()
shuffle(lines)
chunks = lines[:MAX_CHUNK_COUNT]

f"""Processing {len(chunks)} out of the {
    len(manifest.split())
} available chunks"""
'Processing 4 out of the 6002 available chunks'
Transform¶
- Filter out non-English abstracts using great_ai.utilities.predict_language.
- Project each record to keep only the necessary components (text and labels) and clean the textual content using great_ai.utilities.clean.
- Speed up processing using great_ai.utilities.simple_parallel_map (a short sketch of these utilities follows below).
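Before the full preprocessing code, here is a minimal sketch of the three utilities on toy inputs. The sample strings are invented for illustration; the function names and keyword arguments are the same ones used in the preprocessing cell below, while the exact outputs (and the assumption that simple_parallel_map preserves input order) should be verified against the library.
from great_ai.utilities import clean, is_english, predict_language, simple_parallel_map

# clean removes common textual artifacts; convert_to_ascii also folds accented characters
cleaned = clean("  Transfer   learning for low-resource  languages ", convert_to_ascii=True)

# predict_language guesses the language of a text; is_english checks that guess
print(is_english(predict_language(cleaned)))                  # expected: True
print(is_english(predict_language("Ez egy magyar mondat.")))  # expected: False

# simple_parallel_map applies a function to every element using multiple workers
print(simple_parallel_map(len, ["short", "a bit longer"], concurrency=2))  # expected: [5, 12]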
from typing import List, Tuple
import json
import gzip

from great_ai.utilities import (
    simple_parallel_map,
    clean,
    is_english,
    predict_language,
    unchunk,
)


def preprocess_chunk(chunk_key: str) -> List[Tuple[str, List[str]]]:
    response = urllib.request.urlopen(
        f"https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/"
        f"open-corpus/2022-02-01/{chunk_key}"
    )  # a gzipped JSON Lines file

    decompressed = gzip.decompress(response.read())
    decoded = decompressed.decode()
    chunk = [json.loads(line) for line in decoded.split("\n") if line]

    # Transform
    return [
        (
            clean(
                f'{c["title"]} {c["paperAbstract"]} '
                f'{c["journalName"]} {c["venue"]}',
                convert_to_ascii=True,
            ),  # The text is cleaned to remove common artifacts
            c["fieldsOfStudy"],
        )  # Create pairs of `(text, [...domains])`
        for c in chunk
        if (c["fieldsOfStudy"] and is_english(predict_language(c["paperAbstract"])))
    ]
preprocessed_data = unchunk(
    simple_parallel_map(preprocess_chunk, chunks, concurrency=4)
)
100%|██████████| 4/4 [04:22<00:00, 65.51s/it]
X, y = zip(*preprocessed_data) # X is the input, y is the expected output
Load¶
Upload the dataset (or a part of it) to a central repository using great_ai.add_ground_truth. This step automatically tags each data point with a split label according to the ratios we set. Additional tags can also be given.
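The actual upload happens in the cell after the backend note below. If you also want to attach custom tags (for example, the corpus snapshot date), the sketch here shows how that might look; the tags keyword argument is an assumption made purely for illustration, so verify it against the add_ground_truth signature before relying on it.
from great_ai import add_ground_truth

# The split ratios match the cell below; `tags` is a hypothetical keyword
# illustrating how extra labels could be attached next to the automatic split tags
add_ground_truth(
    X,
    y,
    train_split_ratio=0.8,
    test_split_ratio=0.2,
    tags=["semantic-scholar", "2022-02-01"],  # assumed keyword, verify before use
)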
Production-ready backend¶
The MongoDB driver is automatically configured if mongo.ini exists with the following scheme:
mongo_connection_string=mongodb://localhost:27017/
mongo_database=my_great_ai_db
You can install MongoDB locally or use a managed MongoDB service.
Otherwise, TinyDB is used, which is just a local JSON file.
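If you would rather create this configuration from the notebook itself, a small sketch is shown below. It assumes that great-ai looks for mongo.ini in the current working directory and that a MongoDB instance is actually reachable at the given connection string; otherwise the TinyDB fallback described above is used.
from pathlib import Path

# Write a mongo.ini following the scheme above; adjust the connection string
# and database name to match your own MongoDB instance
Path("mongo.ini").write_text(
    "mongo_connection_string=mongodb://localhost:27017/\n"
    "mongo_database=my_great_ai_db\n"
)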
from great_ai import add_ground_truth
add_ground_truth(X, y, train_split_ratio=0.8, test_split_ratio=0.2)
Environment variable ENVIRONMENT is not set, defaulting to development mode ‼️
Cannot find credentials files, defaulting to using ParallelTinyDbDriver
The selected tracing database (ParallelTinyDbDriver) is not recommended for production
Cannot find credentials files, defaulting to using LargeFileLocal
GreatAI (v0.1.6): configured ✅
🔩 tracing_database: ParallelTinyDbDriver
🔩 large_file_implementation: LargeFileLocal
🔩 is_production: False
🔩 should_log_exception_stack: True
🔩 prediction_cache_size: 512
🔩 dashboard_table_size: 50
You still need to check whether you follow all best practices before trusting your deployment.
> Find out more at https://se-ml.github.io/practices