When we want to build a predictive model with a bunch of data, we want the best model possible, but we'd be incredibly lucky to get it from the chosen algorithm right out of the gate. We need to tune the model to get the best predictive capability from it - but how do we know what to tune?

A model's hyperparameters are the settings we choose before fitting it to any data. They are not learned from the data, but they determine how the model behaves.

Let's take the K-Nearest Neighbours classifier algorithm[1]. The documentation shows several arguments that we can pass when instantiating the model. Let's pick three:

  • Number of neighbours
  • Weights
  • The distance metric: Manhattan or Euclidean
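
As a minimal sketch, here is how those three hyperparameters map onto the arguments of scikit-learn's KNeighborsClassifier. The specific values chosen (5, "distance", "manhattan") are only illustrative assumptions, not recommendations.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hyperparameters are fixed when the model is instantiated,
# before it has seen any data. The values here are illustrative.
knn = KNeighborsClassifier(
    n_neighbors=5,       # number of neighbours that vote
    weights="distance",  # closer neighbours get a bigger vote
    metric="manhattan",  # how distance between points is measured
)

# get_params() returns the current hyperparameter settings as a dict
params = knn.get_params()
print(params["n_neighbors"], params["weights"], params["metric"])
```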

There would be no way to know ahead of time the optimum number of neighbours we should use, how we should weight the votes, or which distance metric we should use to identify the closest neighbours. If we wanted to establish the optimum settings, we'd have to go through all combinations of the three hyperparameters and evaluate the model with each setting - this would take us some time if we tried to do it manually.
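
To get a feel for how quickly the combinations add up, the sketch below enumerates them with the standard library. The candidate values are an assumption made for illustration: five neighbour counts, two weighting schemes and two metrics.

```python
from itertools import product

# Illustrative candidate values for each hyperparameter (an assumption)
candidate_values = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

# Build one dict per combination of hyperparameter values
names = list(candidate_values)
combinations = [
    dict(zip(names, values))
    for values in product(*candidate_values.values())
]

print(len(combinations))  # 5 * 2 * 2 = 20
print(combinations[0])
```

Each of those 20 settings would need its own model fitted and evaluated, which is exactly the kind of bookkeeping worth handing over to the computer.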

Luckily, our computers are ideally suited to the repetitive tasks we are not so quick at...

Sklearn provides us with a nice way to iterate through all the combinations of hyperparameters, and is even helpful enough to store the best combination based on cross-validation of each model against the data.

Setting up the Gridsearch

We can set up a grid search like the below - we are going to set up the three hyperparameters in a dictionary with lists of the possible values we want to try, and then let the gridsearcher loose, cross validating on the training set as we go.

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

knn_parameters = {
    'n_neighbors':[1,3,5,7,9],
    'weights':['uniform','distance'],
    'metric':['euclidean','manhattan']
}

knn_gridsearcher = GridSearchCV(KNeighborsClassifier(), 
                                knn_parameters, cv=5, 
                                verbose=1)
knn_gridsearcher.fit(X_train, y_train)

Once the gridsearch finishes, we are able to see the best set of hyperparameters, the best accuracy score, the mean accuracy score from each combination of hyperparameters, and also the model with the best hyperparameters already set, so we can then validate the model against the test set.

# Access the above attributes with the below
best_params = knn_gridsearcher.best_params_
best_score = knn_gridsearcher.best_score_
# grid_scores_ has been removed from scikit-learn; cv_results_ replaces it
all_scores = knn_gridsearcher.cv_results_['mean_test_score']
best_model = knn_gridsearcher.best_estimator_
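
Putting the pieces together, the following is a self-contained sketch of the whole workflow, ending with the validation against the test set mentioned above. The iris dataset, the 75/25 split and the random_state are assumptions made purely so the example runs on its own; swap in your own X_train and y_train.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (an assumption): the built-in iris dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

knn_parameters = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
    "metric": ["euclidean", "manhattan"],
}

# 5-fold cross-validation on the training set for every combination
knn_gridsearcher = GridSearchCV(KNeighborsClassifier(), knn_parameters, cv=5)
knn_gridsearcher.fit(X_train, y_train)

# Mean cross-validated accuracy for each combination tried
results = knn_gridsearcher.cv_results_
for params, mean_score in zip(results["params"], results["mean_test_score"]):
    print(f"{mean_score:.3f}  {params}")

# With the default refit=True, best_estimator_ has already been refitted
# on the whole training set, so it can be scored on the test set directly
best_model = knn_gridsearcher.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(knn_gridsearcher.best_params_, round(test_accuracy, 3))
```

Note that best_score_ is a cross-validation score computed on the training data only, so the test set accuracy is the fairer estimate of how the tuned model will behave on unseen data.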

Eventually we will build this into a pipeline, but for standalone modelling this will help us get to a better result more quickly, which gives us time for a nice cup of tea.

Photo by Sean Patrick Murphy / Unsplash


  1. See the scikit-learn documentation for KNeighborsClassifier.