Optimise and train a model
The blue boxes show the steps implemented in this notebook.
In the first part, we cleaned and transformed our training data. We can now access this data using great_ai.LargeFile. Locally, it gives us the cached version; otherwise, the latest version is downloaded from S3 or GridFS.
In this part, we hyperparameter-optimise and train a simple Naive Bayes classifier, which we then export for deployment using great_ai.save_model.
Load the data extracted in part 1
In [1]:
from great_ai import query_ground_truth
data = query_ground_truth("train")
X = [d.input for d in data for domain in d.feedback]
y = [domain for d in data for domain in d.feedback]
Environment variable ENVIRONMENT is not set, defaulting to development mode ‼️
Cannot find credentials files, defaulting to using ParallelTinyDbDriver
The selected tracing database (ParallelTinyDbDriver) is not recommended for production
Cannot find credentials files, defaulting to using LargeFileLocal
GreatAI (v0.1.6): configured ✅
🔩 tracing_database: ParallelTinyDbDriver
🔩 large_file_implementation: LargeFileLocal
🔩 is_production: False
🔩 should_log_exception_stack: True
🔩 prediction_cache_size: 512
🔩 dashboard_table_size: 50
You still need to check whether you follow all best practices before trusting your deployment.
> Find out more at https://se-ml.github.io/practices
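The two comprehensions above expand each ground-truth record into one (input, domain) pair per feedback label, so a document tagged with multiple domains appears once per label. A minimal sketch of that flattening, using hypothetical stand-in records (the real objects come from query_ground_truth):

```python
from types import SimpleNamespace

# Hypothetical stand-ins for ground-truth records: each carries an
# `input` text and a `feedback` list of domain labels.
records = [
    SimpleNamespace(input="quantum error correction", feedback=["Physics", "Computer Science"]),
    SimpleNamespace(input="protein folding", feedback=["Biology"]),
]

# Same flattening as in the cell above: one (input, label) pair per
# feedback entry, keeping X and y aligned index-by-index.
X = [r.input for r in records for domain in r.feedback]
y = [domain for r in records for domain in r.feedback]

print(X)  # ['quantum error correction', 'quantum error correction', 'protein folding']
print(y)  # ['Physics', 'Computer Science', 'Biology']
```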
In [2]:
from collections import Counter
import matplotlib.pyplot as plt
domains, counts = zip(*Counter(y).most_common())
# Configure matplotlib to have nice, high-resolution charts
%matplotlib inline
plt.rcParams["figure.figsize"] = (20, 5)
plt.rcParams["figure.facecolor"] = "white"
plt.rcParams["font.size"] = 12
plt.xticks(rotation=90)
plt.bar(domains, counts)
None
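The `zip(*Counter(y).most_common())` idiom unpacks frequency-sorted (label, count) pairs into two parallel tuples for plotting. A small self-contained sketch with toy labels standing in for y:

```python
from collections import Counter

# Toy labels standing in for y; most_common() sorts by descending
# frequency, and zip(*...) transposes the pairs into two tuples.
labels = ["Physics", "Biology", "Physics", "Chemistry", "Physics", "Biology"]
domains, counts = zip(*Counter(labels).most_common())

print(domains)  # ('Physics', 'Biology', 'Chemistry')
print(counts)   # (3, 2, 1)
```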
Optimise and train a Multinomial Naive Bayes classifier
In [3]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
def create_pipeline() -> Pipeline:
return Pipeline(
steps=[
("vectorizer", TfidfVectorizer(sublinear_tf=True)),
("classifier", MultinomialNB()),
]
)
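The `sublinear_tf=True` option makes TfidfVectorizer replace a raw term count tf with 1 + log(tf), so very frequent terms grow logarithmically rather than linearly. A minimal stdlib sketch of just that damping (sklearn additionally applies idf weighting and L2 normalisation, which this sketch leaves out):

```python
import math

def sublinear_tf(tf: int) -> float:
    # Replaces a raw count tf with 1 + ln(tf), damping the influence
    # of very frequent terms; an absent term (tf == 0) stays 0.
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# A term occurring 100 times is weighted ~5.6x a single occurrence,
# not 100x.
print(sublinear_tf(1))              # 1.0
print(round(sublinear_tf(100), 2))  # 5.61
```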
In [4]:
from sklearn.model_selection import GridSearchCV
import pandas as pd
optimisation_pipeline = GridSearchCV(
create_pipeline(),
{
"vectorizer__min_df": [5, 20, 100],
"vectorizer__max_df": [0.05, 0.1],
"classifier__alpha": [0.5, 1],
"classifier__fit_prior": [True, False],
},
scoring="f1_macro",
cv=3,
n_jobs=-1,
verbose=1,
)
optimisation_pipeline.fit(X, y)
results = pd.DataFrame(optimisation_pipeline.cv_results_)
results.sort_values("rank_test_score")
Fitting 3 folds for each of 24 candidates, totalling 72 fits
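The candidate count follows directly from the grid: GridSearchCV takes the Cartesian product of the parameter lists (3 × 2 × 2 × 2 = 24 combinations), and with cv=3 each candidate is fitted three times. A quick stdlib check of that arithmetic:

```python
from itertools import product

# The same grid as above; GridSearchCV tries every combination.
grid = {
    "vectorizer__min_df": [5, 20, 100],
    "vectorizer__max_df": [0.05, 0.1],
    "classifier__alpha": [0.5, 1],
    "classifier__fit_prior": [True, False],
}
candidates = list(product(*grid.values()))

print(len(candidates))      # 24 candidates
print(len(candidates) * 3)  # 72 total fits with cv=3
```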
Out[4]:
index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_classifier__alpha | param_classifier__fit_prior | param_vectorizer__max_df | param_vectorizer__min_df | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 14.549476 | 0.361685 | 8.476837 | 0.222398 | 0.5 | False | 0.05 | 20 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.518061 | 0.514842 | 0.511599 | 0.514834 | 0.002638 | 1 |
10 | 11.235289 | 0.130426 | 4.092868 | 0.082518 | 0.5 | False | 0.1 | 20 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.513897 | 0.515661 | 0.507867 | 0.512475 | 0.003337 | 2 |
19 | 7.383645 | 0.138110 | 4.130709 | 0.250048 | 1 | False | 0.05 | 20 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.496825 | 0.501045 | 0.496854 | 0.498241 | 0.001983 | 3 |
11 | 10.435154 | 0.305144 | 3.882101 | 0.128886 | 0.5 | False | 0.1 | 100 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.493247 | 0.497814 | 0.502245 | 0.497769 | 0.003674 | 4 |
8 | 13.643193 | 0.310696 | 4.173707 | 0.142980 | 0.5 | False | 0.05 | 100 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.489609 | 0.495207 | 0.498154 | 0.494323 | 0.003544 | 5 |
22 | 7.048340 | 0.050070 | 3.172948 | 0.152418 | 1 | False | 0.1 | 20 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.487456 | 0.493865 | 0.491157 | 0.490826 | 0.002627 | 6 |
23 | 7.564685 | 0.146092 | 2.374111 | 0.285026 | 1 | False | 0.1 | 100 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.485160 | 0.494039 | 0.490127 | 0.489776 | 0.003633 | 7 |
20 | 7.172353 | 0.212599 | 3.747219 | 0.130217 | 1 | False | 0.05 | 100 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.481303 | 0.490002 | 0.488269 | 0.486524 | 0.003759 | 8 |
6 | 14.276345 | 0.456576 | 8.318859 | 0.268701 | 0.5 | False | 0.05 | 5 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.482429 | 0.487735 | 0.484888 | 0.485017 | 0.002168 | 9 |
2 | 14.902358 | 0.737693 | 5.975091 | 0.171150 | 0.5 | True | 0.05 | 100 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.469598 | 0.474490 | 0.473637 | 0.472575 | 0.002134 | 10 |
9 | 12.677349 | 0.145143 | 4.374204 | 0.175674 | 0.5 | False | 0.1 | 5 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.468872 | 0.476451 | 0.470921 | 0.472082 | 0.003201 | 11 |
5 | 13.423686 | 0.482872 | 8.008324 | 0.442975 | 0.5 | True | 0.1 | 100 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.465726 | 0.474548 | 0.471879 | 0.470718 | 0.003694 | 12 |
1 | 13.819117 | 0.838347 | 6.161175 | 0.336590 | 0.5 | True | 0.05 | 20 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.463395 | 0.473982 | 0.471262 | 0.469546 | 0.004489 | 13 |
4 | 13.281476 | 0.588822 | 8.335852 | 0.254627 | 0.5 | True | 0.1 | 20 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.458734 | 0.468053 | 0.464418 | 0.463735 | 0.003835 | 14 |
14 | 7.282247 | 0.444940 | 3.567094 | 0.044519 | 1 | True | 0.05 | 100 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.438189 | 0.450160 | 0.446180 | 0.444843 | 0.004978 | 15 |
17 | 7.098797 | 0.196241 | 3.838628 | 0.091128 | 1 | True | 0.1 | 100 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.436488 | 0.444503 | 0.445900 | 0.442297 | 0.004147 | 16 |
18 | 7.492791 | 0.288889 | 3.843224 | 0.073438 | 1 | False | 0.05 | 5 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.428196 | 0.431945 | 0.430160 | 0.430100 | 0.001531 | 17 |
21 | 7.229826 | 0.099823 | 3.656332 | 0.073780 | 1 | False | 0.1 | 5 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.403130 | 0.410170 | 0.409801 | 0.407700 | 0.003235 | 18 |
13 | 7.158370 | 0.169818 | 3.765632 | 0.082924 | 1 | True | 0.05 | 20 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.399237 | 0.412872 | 0.407982 | 0.406697 | 0.005640 | 19 |
16 | 7.064643 | 0.119529 | 3.810983 | 0.125897 | 1 | True | 0.1 | 20 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.388060 | 0.399247 | 0.396325 | 0.394544 | 0.004737 | 20 |
0 | 13.749660 | 0.465174 | 6.407841 | 0.549166 | 0.5 | True | 0.05 | 5 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.384852 | 0.386487 | 0.386796 | 0.386045 | 0.000853 | 21 |
3 | 15.954147 | 0.318013 | 6.337361 | 0.261697 | 0.5 | True | 0.1 | 5 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.369785 | 0.375645 | 0.375858 | 0.373763 | 0.002814 | 22 |
12 | 7.120198 | 0.050452 | 3.833905 | 0.069540 | 1 | True | 0.05 | 5 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.277741 | 0.280564 | 0.285337 | 0.281214 | 0.003135 | 23 |
15 | 7.497707 | 0.183054 | 3.870714 | 0.062888 | 1 | True | 0.1 | 5 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.255578 | 0.263381 | 0.266184 | 0.261714 | 0.004487 | 24 |
In [5]:
from sklearn import set_config
set_config(display="diagram")
classifier = create_pipeline()
classifier.set_params(**optimisation_pipeline.best_params_)
classifier.fit(X, y)
Out[5]:
Pipeline(steps=[('vectorizer', TfidfVectorizer(max_df=0.05, min_df=20, sublinear_tf=True)),
                ('classifier', MultinomialNB(alpha=0.5, fit_prior=False))])
Export the model using GreatAI
In [6]:
from great_ai import save_model
save_model(classifier, key="small-domain-prediction", keep_last_n=5)
Fetching cached versions of small-domain-prediction
Copying file for small-domain-prediction-2
Compressing small-domain-prediction-2
Model small-domain-prediction uploaded with version 2
Out[6]:
'small-domain-prediction:2'
Last update:
July 16, 2022