Optimise and train a model
The blue boxes show the steps implemented in this notebook.
In the first part, we cleaned and transformed our training data. We can now access this data using great_ai.LargeFile. Locally, it gives us the cached version; otherwise, the latest version is downloaded from S3 or GridFS.
In this part, we hyperparameter-optimise and train a simple Naive Bayes classifier, which we then export for deployment using great_ai.save_model.
Load the data extracted in part 1
In [1]:
from great_ai import query_ground_truth
data = query_ground_truth("train")
X = [d.input for d in data for domain in d.feedback]
y = [domain for d in data for domain in d.feedback]
Environment variable ENVIRONMENT is not set, defaulting to development mode ‼️
Cannot find credentials files, defaulting to using ParallelTinyDbDriver
The selected tracing database (ParallelTinyDbDriver) is not recommended for production
Cannot find credentials files, defaulting to using LargeFileLocal
GreatAI (v0.1.6): configured ✅
🔩 tracing_database: ParallelTinyDbDriver
🔩 large_file_implementation: LargeFileLocal
🔩 is_production: False
🔩 should_log_exception_stack: True
🔩 prediction_cache_size: 512
🔩 dashboard_table_size: 50
You still need to check whether you follow all best practices before trusting your deployment.
> Find out more at https://se-ml.github.io/practices
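The two comprehensions above expand each ground-truth record into one (input, domain) pair per feedback label, so a document tagged with multiple domains appears once per label. A minimal sketch of that flattening, using hypothetical stand-in records (the real objects come from query_ground_truth):

```python
from types import SimpleNamespace

# Hypothetical stand-ins for ground-truth records: each carries an
# `input` text and a `feedback` list of domain labels.
records = [
    SimpleNamespace(input="quantum error correction", feedback=["Physics", "Computer Science"]),
    SimpleNamespace(input="protein folding", feedback=["Biology"]),
]

# Same flattening as in the cell above: one (input, label) pair per
# feedback entry, keeping X and y aligned index-by-index.
X = [r.input for r in records for domain in r.feedback]
y = [domain for r in records for domain in r.feedback]

print(X)  # ['quantum error correction', 'quantum error correction', 'protein folding']
print(y)  # ['Physics', 'Computer Science', 'Biology']
```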
In [2]:
from collections import Counter
import matplotlib.pyplot as plt
domains, counts = zip(*Counter(y).most_common())
# Configure matplotlib to have nice, high-resolution charts
%matplotlib inline
plt.rcParams["figure.figsize"] = (20, 5)
plt.rcParams["figure.facecolor"] = "white"
plt.rcParams["font.size"] = 12
plt.xticks(rotation=90)
plt.bar(domains, counts)
None
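The `zip(*Counter(y).most_common())` idiom unpacks frequency-sorted (label, count) pairs into two parallel tuples for plotting. A small self-contained sketch with toy labels standing in for y:

```python
from collections import Counter

# Toy labels standing in for y; most_common() sorts by descending
# frequency, and zip(*...) transposes the pairs into two tuples.
labels = ["Physics", "Biology", "Physics", "Chemistry", "Physics", "Biology"]
domains, counts = zip(*Counter(labels).most_common())

print(domains)  # ('Physics', 'Biology', 'Chemistry')
print(counts)   # (3, 2, 1)
```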
Optimise and train a Multinomial Naive Bayes classifier
In [3]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
def create_pipeline() -> Pipeline:
return Pipeline(
steps=[
("vectorizer", TfidfVectorizer(sublinear_tf=True)),
("classifier", MultinomialNB()),
]
)
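The `sublinear_tf=True` option makes TfidfVectorizer replace a raw term count tf with 1 + log(tf), so very frequent terms grow logarithmically rather than linearly. A minimal stdlib sketch of just that damping (sklearn additionally applies idf weighting and L2 normalisation, which this sketch leaves out):

```python
import math

def sublinear_tf(tf: int) -> float:
    # Replaces a raw count tf with 1 + ln(tf), damping the influence
    # of very frequent terms; an absent term (tf == 0) stays 0.
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# A term occurring 100 times is weighted ~5.6x a single occurrence,
# not 100x.
print(sublinear_tf(1))              # 1.0
print(round(sublinear_tf(100), 2))  # 5.61
```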
In [4]:
from sklearn.model_selection import GridSearchCV
import pandas as pd
optimisation_pipeline = GridSearchCV(
create_pipeline(),
{
"vectorizer__min_df": [5, 20, 100],
"vectorizer__max_df": [0.05, 0.1],
"classifier__alpha": [0.5, 1],
"classifier__fit_prior": [True, False],
},
scoring="f1_macro",
cv=3,
n_jobs=-1,
verbose=1,
)
optimisation_pipeline.fit(X, y)
results = pd.DataFrame(optimisation_pipeline.cv_results_)
results.sort_values("rank_test_score")
Fitting 3 folds for each of 24 candidates, totalling 72 fits
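The candidate count follows directly from the grid: GridSearchCV takes the Cartesian product of the parameter lists (3 × 2 × 2 × 2 = 24 combinations), and with cv=3 each candidate is fitted three times. A quick stdlib check of that arithmetic:

```python
from itertools import product

# The same grid as above; GridSearchCV tries every combination.
grid = {
    "vectorizer__min_df": [5, 20, 100],
    "vectorizer__max_df": [0.05, 0.1],
    "classifier__alpha": [0.5, 1],
    "classifier__fit_prior": [True, False],
}
candidates = list(product(*grid.values()))

print(len(candidates))      # 24 candidates
print(len(candidates) * 3)  # 72 total fits with cv=3
```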
Out[4]:
index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_classifier__alpha | param_classifier__fit_prior | param_vectorizer__max_df | param_vectorizer__min_df | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7 | 14.549476 | 0.361685 | 8.476837 | 0.222398 | 0.5 | False | 0.05 | 20 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.518061 | 0.514842 | 0.511599 | 0.514834 | 0.002638 | 1 |
10 | 11.235289 | 0.130426 | 4.092868 | 0.082518 | 0.5 | False | 0.1 | 20 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.513897 | 0.515661 | 0.507867 | 0.512475 | 0.003337 | 2 |
19 | 7.383645 | 0.138110 | 4.130709 | 0.250048 | 1 | False | 0.05 | 20 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.496825 | 0.501045 | 0.496854 | 0.498241 | 0.001983 | 3 |
11 | 10.435154 | 0.305144 | 3.882101 | 0.128886 | 0.5 | False | 0.1 | 100 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.493247 | 0.497814 | 0.502245 | 0.497769 | 0.003674 | 4 |
8 | 13.643193 | 0.310696 | 4.173707 | 0.142980 | 0.5 | False | 0.05 | 100 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.489609 | 0.495207 | 0.498154 | 0.494323 | 0.003544 | 5 |
22 | 7.048340 | 0.050070 | 3.172948 | 0.152418 | 1 | False | 0.1 | 20 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.487456 | 0.493865 | 0.491157 | 0.490826 | 0.002627 | 6 |
23 | 7.564685 | 0.146092 | 2.374111 | 0.285026 | 1 | False | 0.1 | 100 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.485160 | 0.494039 | 0.490127 | 0.489776 | 0.003633 | 7 |
20 | 7.172353 | 0.212599 | 3.747219 | 0.130217 | 1 | False | 0.05 | 100 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.481303 | 0.490002 | 0.488269 | 0.486524 | 0.003759 | 8 |
6 | 14.276345 | 0.456576 | 8.318859 | 0.268701 | 0.5 | False | 0.05 | 5 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.482429 | 0.487735 | 0.484888 | 0.485017 | 0.002168 | 9 |
2 | 14.902358 | 0.737693 | 5.975091 | 0.171150 | 0.5 | True | 0.05 | 100 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.469598 | 0.474490 | 0.473637 | 0.472575 | 0.002134 | 10 |
9 | 12.677349 | 0.145143 | 4.374204 | 0.175674 | 0.5 | False | 0.1 | 5 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.468872 | 0.476451 | 0.470921 | 0.472082 | 0.003201 | 11 |
5 | 13.423686 | 0.482872 | 8.008324 | 0.442975 | 0.5 | True | 0.1 | 100 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.465726 | 0.474548 | 0.471879 | 0.470718 | 0.003694 | 12 |
1 | 13.819117 | 0.838347 | 6.161175 | 0.336590 | 0.5 | True | 0.05 | 20 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.463395 | 0.473982 | 0.471262 | 0.469546 | 0.004489 | 13 |
4 | 13.281476 | 0.588822 | 8.335852 | 0.254627 | 0.5 | True | 0.1 | 20 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.458734 | 0.468053 | 0.464418 | 0.463735 | 0.003835 | 14 |
14 | 7.282247 | 0.444940 | 3.567094 | 0.044519 | 1 | True | 0.05 | 100 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.438189 | 0.450160 | 0.446180 | 0.444843 | 0.004978 | 15 |
17 | 7.098797 | 0.196241 | 3.838628 | 0.091128 | 1 | True | 0.1 | 100 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.436488 | 0.444503 | 0.445900 | 0.442297 | 0.004147 | 16 |
18 | 7.492791 | 0.288889 | 3.843224 | 0.073438 | 1 | False | 0.05 | 5 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.428196 | 0.431945 | 0.430160 | 0.430100 | 0.001531 | 17 |
21 | 7.229826 | 0.099823 | 3.656332 | 0.073780 | 1 | False | 0.1 | 5 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.403130 | 0.410170 | 0.409801 | 0.407700 | 0.003235 | 18 |
13 | 7.158370 | 0.169818 | 3.765632 | 0.082924 | 1 | True | 0.05 | 20 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.399237 | 0.412872 | 0.407982 | 0.406697 | 0.005640 | 19 |
16 | 7.064643 | 0.119529 | 3.810983 | 0.125897 | 1 | True | 0.1 | 20 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.388060 | 0.399247 | 0.396325 | 0.394544 | 0.004737 | 20 |
0 | 13.749660 | 0.465174 | 6.407841 | 0.549166 | 0.5 | True | 0.05 | 5 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.384852 | 0.386487 | 0.386796 | 0.386045 | 0.000853 | 21 |
3 | 15.954147 | 0.318013 | 6.337361 | 0.261697 | 0.5 | True | 0.1 | 5 | {'classifier__alpha': 0.5, 'classifier__fit_pr... | 0.369785 | 0.375645 | 0.375858 | 0.373763 | 0.002814 | 22 |
12 | 7.120198 | 0.050452 | 3.833905 | 0.069540 | 1 | True | 0.05 | 5 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.277741 | 0.280564 | 0.285337 | 0.281214 | 0.003135 | 23 |
15 | 7.497707 | 0.183054 | 3.870714 | 0.062888 | 1 | True | 0.1 | 5 | {'classifier__alpha': 1, 'classifier__fit_prio... | 0.255578 | 0.263381 | 0.266184 | 0.261714 | 0.004487 | 24 |
In [5]:
from sklearn import set_config
set_config(display="diagram")
classifier = create_pipeline()
classifier.set_params(**optimisation_pipeline.best_params_)
classifier.fit(X, y)
Out[5]:
Pipeline(steps=[('vectorizer', TfidfVectorizer(max_df=0.05, min_df=20, sublinear_tf=True)),
                ('classifier', MultinomialNB(alpha=0.5, fit_prior=False))])
Export the model using GreatAI
In [6]:
from great_ai import save_model
save_model(classifier, key="small-domain-prediction", keep_last_n=5)
Fetching cached versions of small-domain-prediction
Copying file for small-domain-prediction-2
Compressing small-domain-prediction-2
Model small-domain-prediction uploaded with version 2
Out[6]:
'small-domain-prediction:2'
Last update:
July 16, 2022