Simple example: data engineering¶
Here, we solve a problem similar to the tutorial's, but with an explainable Naive Bayes classifier and more best practices. In short, we train a domain classifier on the Semantic Scholar dataset while taking full advantage of great-ai. Subsequently, we create a production-ready deployment.
The blue boxes show the steps of a typical AI-development lifecycle implemented in this notebook.
Since the true scope of great-ai is the phase between proof-of-concept code and production-ready service, it is predominantly used in the deployment notebook. Feel free to skip there, or continue reading if you'd like to see the full picture.
Extract¶
This can be achieved by downloading a public dataset (as in this case) or by having a Data Engineer set up and give us access to the organisation's data.
In this example, we download the semantic scholar dataset from a public S3 bucket.
MAX_CHUNK_COUNT = 4

import urllib.request
from random import shuffle

manifest = (
    urllib.request.urlopen(
        "https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/"
        "open-corpus/2022-02-01/manifest.txt"
    )
    .read()
    .decode()
)  # a list of available chunks separated by '\n' characters

lines = manifest.split()
shuffle(lines)
chunks = lines[:MAX_CHUNK_COUNT]

f"""Processing {len(chunks)} out of the {
    len(manifest.split())
} available chunks"""
'Processing 4 out of the 6002 available chunks'
Transform¶
- Filter out non-English abstracts using great_ai.utilities.predict_language.
- Project each record to keep only the necessary components (text and labels) and clean the textual content using great_ai.utilities.clean.
- Speed up processing using great_ai.utilities.simple_parallel_map (a short sketch of these utilities follows below).
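Before the full preprocessing code, here is a minimal sketch of the three utilities on toy inputs. The sample strings are invented for illustration; the function names and keyword arguments are the same ones used in the preprocessing cell below, while the exact outputs (and the assumption that simple_parallel_map preserves input order) should be verified against the library.
from great_ai.utilities import clean, is_english, predict_language, simple_parallel_map

# clean removes common textual artifacts; convert_to_ascii also folds accented characters
cleaned = clean("  Transfer   learning for low-resource  languages ", convert_to_ascii=True)

# predict_language guesses the language of a text; is_english checks that guess
print(is_english(predict_language(cleaned)))                  # expected: True
print(is_english(predict_language("Ez egy magyar mondat.")))  # expected: False

# simple_parallel_map applies a function to every element using multiple workers
print(simple_parallel_map(len, ["short", "a bit longer"], concurrency=2))  # expected: [5, 12]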
from typing import List, Tuple
import json
import gzip

from great_ai.utilities import (
    simple_parallel_map,
    clean,
    is_english,
    predict_language,
    unchunk,
)


def preprocess_chunk(chunk_key: str) -> List[Tuple[str, List[str]]]:
    response = urllib.request.urlopen(
        f"https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/"
        f"open-corpus/2022-02-01/{chunk_key}"
    )  # a gzipped JSON Lines file

    decompressed = gzip.decompress(response.read())
    decoded = decompressed.decode()
    chunk = [json.loads(line) for line in decoded.split("\n") if line]

    # Transform
    return [
        (
            clean(
                f'{c["title"]} {c["paperAbstract"]} '
                f'{c["journalName"]} {c["venue"]}',
                convert_to_ascii=True,
            ),  # The text is cleaned to remove common artifacts
            c["fieldsOfStudy"],
        )  # Create pairs of `(text, [...domains])`
        for c in chunk
        if (c["fieldsOfStudy"] and is_english(predict_language(c["paperAbstract"])))
    ]
preprocessed_data = unchunk(
    simple_parallel_map(preprocess_chunk, chunks, concurrency=4)
)
100%|██████████| 4/4 [04:22<00:00, 65.51s/it]
X, y = zip(*preprocessed_data) # X is the input, y is the expected output
Load¶
Upload the dataset (or a part of it) to a central repository using great_ai.add_ground_truth. This step automatically tags each data point with a split label according to the ratios we set. Additional tags can also be given.
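The actual upload happens in the cell after the backend note below. If you also want to attach custom tags (for example, the corpus snapshot date), the sketch here shows how that might look; the tags keyword argument is an assumption made purely for illustration, so verify it against the add_ground_truth signature before relying on it.
from great_ai import add_ground_truth

# The split ratios match the cell below; `tags` is a hypothetical keyword
# illustrating how extra labels could be attached next to the automatic split tags
add_ground_truth(
    X,
    y,
    train_split_ratio=0.8,
    test_split_ratio=0.2,
    tags=["semantic-scholar", "2022-02-01"],  # assumed keyword, verify before use
)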
Production-ready backend¶
The MongoDB driver is automatically configured if mongo.ini exists with the following scheme:
mongo_connection_string=mongodb://localhost:27017/
mongo_database=my_great_ai_db
You can install MongoDB locally or use a managed MongoDB service.
Otherwise, TinyDB is used, which is just a local JSON file.
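If you would rather create this configuration from the notebook itself, a small sketch is shown below. It assumes that great-ai looks for mongo.ini in the current working directory and that a MongoDB instance is actually reachable at the given connection string; otherwise the TinyDB fallback described above is used.
from pathlib import Path

# Write a mongo.ini following the scheme above; adjust the connection string
# and database name to match your own MongoDB instance
Path("mongo.ini").write_text(
    "mongo_connection_string=mongodb://localhost:27017/\n"
    "mongo_database=my_great_ai_db\n"
)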
from great_ai import add_ground_truth
add_ground_truth(X, y, train_split_ratio=0.8, test_split_ratio=0.2)
Environment variable ENVIRONMENT is not set, defaulting to development mode ‼️
Cannot find credentials files, defaulting to using ParallelTinyDbDriver
The selected tracing database (ParallelTinyDbDriver) is not recommended for production
Cannot find credentials files, defaulting to using LargeFileLocal
GreatAI (v0.1.6): configured ✅
🔩 tracing_database: ParallelTinyDbDriver
🔩 large_file_implementation: LargeFileLocal
🔩 is_production: False
🔩 should_log_exception_stack: True
🔩 prediction_cache_size: 512
🔩 dashboard_table_size: 50
You still need to check whether you follow all best practices before trusting your deployment.
> Find out more at https://se-ml.github.io/practices