Let's start with Titanic: Hyperparameter tuning using the GridSearchCV class: Our Kaggle Challenge

Up to last time, we built a deep neural network (DNN) with PyTorch, fed it data from the Titanic competition, and tried to improve our score by imitating EDA, feature engineering, and hyperparameter tuning. Personally, I think the results so far are passable.

My highest score of 0.78229 placed around 2372nd as of February 2, 2022, which is in the top 20%. Whether that is good or bad is debatable, but it is getting hard to push the score any higher. [Kawasaki]

There are quite a few people with a score of 1.0 in the Titanic competition, and it seems they are cheating in some way, such as submitting the known answers. Also, since the training data is relatively small at 891 rows, there seems to be a limit to the accuracy that deep learning can reach. [Isshiki]

This time I will try my hand at the machine learning library scikit-learn. I'm not satisfied with the scores so far, but rather than worrying too much about the score, I want to see what it is like to tune hyperparameters with scikit-learn.

In addition, this time we are using the data frame that was created two installments ago and modified last time (please refer to this notebook for the steps used to process the data frame).

The data frame (heat map) used this time

The code for this time is left in this notebook. If you are interested, please refer to it.

The goal of the Titanic competition is to correctly predict whether each passenger survived or died, and scikit-learn implements many classifiers for exactly this kind of task. Of course, there are also classifiers based on neural networks: the MLPClassifier class in the sklearn.neural_network module. The "MLP" in the name is an abbreviation of "Multi-Layer Perceptron".

Let's use this class as an example to see how scikit-learn's classifiers are used. The basic usage is very simple.

That is all it takes to train and predict. In fact, instantiation, training, and inference fit in three lines (plus the import of the MLPClassifier class):

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(max_iter=500, hidden_layer_sizes=(8,))
clf = clf.fit(X_train, y_train)
pred = clf.predict(X_val)

Code that creates an MLPClassifier instance, trains it, and predicts survival

Until now, I defined the model class myself, wrote a for loop for training and validation, and then fed in the test data. This is so easy that you wonder where all that effort went.

The reason max_iter and hidden_layer_sizes are specified when creating the MLPClassifier instance above is that with the defaults, training did not finish. (Before that, of course, there is work such as reading the CSV file and setting up the dataset; see this notebook for those steps.)
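For reference, a minimal sketch of what that setup might look like is shown below. The file name, the exact columns, and the split ratio are assumptions on my part; the actual preprocessing steps are in the linked notebook.

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; the real preprocessing is described in the notebook.
df = pd.read_csv('train_preprocessed.csv')

X = df.drop('Survived', axis=1)  # feature columns
y = df['Survived']               # labels (0 = died, 1 = survived)

# Hold out part of the training data for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)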

What about the number of nodes in the hidden layers, the choice of optimization algorithm, the batch size, and so on? Such parameters (hyperparameters) can also be specified when the instance is created. For the MLPClassifier class, the following can be specified (a partial list).
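For illustration only (this is not the full list, and the values below are arbitrary examples rather than recommendations), some of the hyperparameters that the MLPClassifier constructor accepts look like this:

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # number of nodes in each hidden layer
    activation='relu',            # activation function
    solver='adam',                # optimization algorithm
    batch_size=32,                # mini-batch size
    learning_rate_init=0.001,     # initial learning rate
    max_iter=500,                 # maximum number of iterations (epochs)
    random_state=0)               # seed for reproducibility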

By the way, it seems that the term "iteration" is used synonymously with "epoch" in scikit-learn. In deep learning generally, an iteration refers to one batch-sized update of the weights and biases, while an epoch refers to one pass over all the data, so they have different meanings; in other words, one epoch contains multiple iterations. I noticed this was mentioned when I looked at the help documentation.

Thank you. That's not something you have to pay much attention to here, though.
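For what it's worth, here is a rough back-of-the-envelope illustration of the difference, assuming MLPClassifier's default batch size of min(200, n_samples); the numbers are only for illustration.

import math

n_samples = 891                   # size of the Titanic training data
batch_size = min(200, n_samples)  # MLPClassifier's default 'auto' batch size
epochs = 500                      # what max_iter counts for the stochastic solvers

iterations_per_epoch = math.ceil(n_samples / batch_size)  # weight updates per epoch
total_weight_updates = epochs * iterations_per_epoch      # "iterations" in the usual DL sense
print(iterations_per_epoch, total_weight_updates)         # 5 2500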

In hyperparameter tuning, you specify (some of) the candidate values these hyperparameters can take and then call an instance method of the class that performs the actual tuning.

The training-and-inference pattern we saw above calls the fit method and then the predict method, and the pattern is similar for the other classifiers. So we first define a function like the following.

def fit_and_pred(clf, X_train, X_val, y_train, y_val):
    clf = clf.fit(X_train, y_train)
    pred = clf.predict(X_val)
    cnt = len(pred)
    result = sum(pred == y_val)
    clf_name = clf.__class__.__name__
    print(f'{clf_name}: {result} / {len(y_val)} = {result / len(y_val):.4}')

A function that performs training and inference (validation)

Next, prepare instances of classifier classes (here, for two-class classification) as follows.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

clfs = [
    DecisionTreeClassifier(random_state=0),       # Decision tree
    RandomForestClassifier(random_state=0),       # Random forest
    KNeighborsClassifier(),                       # k-nearest neighbors
    SVC(random_state=0),                          # Support vector machine (classification)
    MLPClassifier(max_iter=2000, random_state=0)  # Multi-layer perceptron
]

A list containing the classifier instances

Training and inference with these is then just a simple for loop, as below.

for clf in clfs:
    fit_and_pred(clf, X_train, X_val, y_train, y_val)

Training and inference loop

So here's what happens when I run the above code without specifically tuning the hyperparameters.

Execution result

It seems we got a reasonable result. Next, let's use the trained models to predict survival from the test data and submit. Here we made predictions with all five model instances created above; if three or more models predicted 1 (survived), the final prediction was 1, otherwise 0.

preds = [clf.predict(df_test) for clf in clfs]
result = sum(preds)
result[result <= 2] = 0
result[result > 2] = 1
print(f'{sum(result)} / {len(result)} = {sum(result) / len(result)}')

submission = pd.read_csv('../input/titanic/test.csv')
submission['Survived'] = result
drop_columns = ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
                'Ticket', 'Fare', 'Cabin', 'Embarked']
submission = submission.drop(drop_columns, axis=1)
submission.to_csv('submission_sklearn.csv', index=False)

Predicting survival from the test data with the five models

Incidentally, this score was 0.75837, which is lower than last time. Part of me is relieved, but I don't think I can do much more as is.

Hyperparameter tuning using the GridSearchCV class

Now let's tune the hyperparameters of the models above using the GridSearchCV class provided by scikit-learn. The basic steps are as follows.

1. Create an instance of the classifier.
2. Write the candidate hyperparameter values in a dictionary.
3. Create a GridSearchCV instance from them and call its fit method.
4. Check the best score and the hyperparameter values that produced it.

For example, taking the MLPClassifier class we saw earlier, the first two steps (creating an instance and writing out the candidate hyperparameters) would look like this:

clf4 = MLPClassifier()
params4 = {
    'max_iter': [2000],
    'hidden_layer_sizes': [(16,), (32,), (64,), (128,), (256,)],
    'random_state': [0]
}

Instance Generation and Hyperparameter Value Candidates (Part 1)

Note that each hyperparameter name is used as a dictionary key, and a list of its candidate values is registered as the corresponding value.

In addition, since max_iter and random_state each have only one candidate value above, the code could also have been written as follows (in fact, this worked too).

clf4 = MLPClassifier(max_iter=2000, random_state=0)
params4 = {
    'hidden_layer_sizes': [(16,), (32,), (64,), (128,), (256,)],
}

Instance Generation and Hyperparameter Value Candidates (Part 2)

The following code creates an instance of the GridSearchCV class from these, performs the tuning, and displays the best score and the hyperparameter values that produced it.

from sklearn.model_selection import GridSearchCV

gscv = GridSearchCV(clf4, params4, scoring='accuracy', cv=5)
result = gscv.fit(X, y)
print(result.best_score_)
print(result.best_params_)

Tuning the hyperparameters

The "CV" in GridSearchCV stands for "cross-validation" (probably). In other words, just calling the fit method automatically performs cross-validation. The cv argument specifies how many folds the original data is split into, and the scoring argument specifies the evaluation metric; here we specify 'accuracy' (number of correct predictions / total number of samples). For the values that can be specified, see the scikit-learn documentation "The scoring parameter: defining model evaluation rules".

In addition, because the GridSearchCV class performs cross-validation automatically, here we passed all of the data read from the CSV file to the fit method as is (X and y are simply the data read from the CSV file, with unneeded columns dropped, separated into features and labels; the train_test_split function is not used).
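In other words, X and y here are roughly of the following form ('Survived' is the Titanic target column, and df is assumed to be the preprocessed data frame from the notebook). The sketch also shows, with cross_val_score, the kind of 5-fold accuracy evaluation that GridSearchCV runs internally for each candidate; this is only for illustration.

X = df.drop('Survived', axis=1)  # all feature columns
y = df['Survived']               # the target; no train/validation split this time

# What cv=5 / scoring='accuracy' amounts to for a single candidate:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf4, X, y, cv=5, scoring='accuracy')
print(scores)         # accuracy on each of the five validation folds
print(scores.mean())  # the kind of average score GridSearchCV compares candidates by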

Since the procedure we just saw can be applied to each of the five classifier classes from earlier, we first wrote out the instances and hyperparameter candidates as follows.

clf0 = DecisionTreeClassifier()
params0 = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(3, 10)),
    'min_samples_split': list(range(2, 5)),
    'min_samples_leaf': list(range(1, 4)),
    'random_state': [0]
}

clf1 = RandomForestClassifier()
params1 = {
    'n_estimators': [10, 50, 100, 300, 500],
    'max_depth': [5, 10, 50, None],
    'max_features': ['sqrt', 'log2'],
    'random_state': [0]
}

clf2 = KNeighborsClassifier()
params2 = {
    'n_neighbors': list(range(3, 15)),
    'weights': ['uniform', 'distance'],
    'metric': ['minkowski', 'euclidean', 'manhattan'],
    'p': [1, 2]
}

clf3 = SVC()
params3 = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'C': [1, 5, 10, 30],
    'gamma': ['auto', 'scale'],
    'random_state': [0]
}

clf4 = MLPClassifier(max_iter=2000, random_state=0)
params4 = {
    #'max_iter': [2000],
    'hidden_layer_sizes': [(16,), (32,), (64,), (128,), (256,)],
    #'random_state': [0]
}

Classifier instantiation and setting of hyperparameter candidates

The hyperparameter candidates for each classifier were picked somewhat arbitrarily by the author while looking at the documentation. Choosing them well probably requires a deeper understanding of each classifier, and since the author is still studying, I can't claim with confidence that these choices are appropriate... Just think of this as a sample of hyperparameter tuning with scikit-learn.

Here is the code that uses these to tune the hyperparameters.

clfs = [clf0, clf1, clf2, clf3, clf4]
search_params = [params0, params1, params2, params3, params4]
results = []

for clf, params in zip(clfs, search_params):
    gscv = GridSearchCV(clf, params, scoring='accuracy', cv=5)
    result = gscv.fit(X, y)
    clf_name = clf.__class__.__name__
    print(f'result of {clf_name}')
    print(f'score: {result.best_score_}')
    print(f'best params: {result.best_params_}')
    print('----')
    results.append(result)

Code to tune hyperparameters

The fit method updates the GridSearchCV instance itself with the tuning results and returns it. Instances of the GridSearchCV class have attributes such as best_score_ and best_params_, so here we display the best score and the hyperparameter values that produced it (note the trailing underscore "_"). In addition, the best_estimator_ attribute holds the model that achieved the best score, so the return values are appended to results in order to use them later for predicting survival from the test data.
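If you want to see more than just the single best result, the cv_results_ attribute holds the scores of every candidate combination. A small sketch (assuming pandas is already imported as pd):

# Per-candidate cross-validation results for the last search.
cv_df = pd.DataFrame(result.cv_results_)
print(cv_df[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())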

The execution result is shown below.

Execution result

Hmmm. With tuning, some scores went up, while others stayed the same or even went down.

This may be due to the choice of which hyperparameters to tune, differences in how the fit method happened to train, and so on.

It's a little puzzling, but let's use the best-scoring models to predict survival from the test data.

preds = [item.best_estimator_.predict(df_test) for item in results]
result = sum(preds)
result[result <= 2] = 0
result[result > 2] = 1
print(f'{sum(result)} / {len(result)} = {sum(result) / len(result)}')

submission = pd.read_csv('../input/titanic/test.csv')
submission['Survived'] = result
drop_columns = ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
                'Ticket', 'Fare', 'Cabin', 'Embarked']
submission = submission.drop(drop_columns, axis=1)
submission.to_csv('submission_gridsearch.csv', index=False)

Predicting survival from the test data with the best-scoring models

The best_estimator_ attribute of each fit return value (the classifier instance that achieved the best score) is used to predict survival, and the results are saved to a CSV file for submission. The score after tuning was 0.76555 (versus 0.75837 before tuning). For now, I'm happy that tuning raised the score.

I'm glad the score went up. But I wonder why some scores went down when the hyperparameters were tuned... I suspect that, depending on the combination of candidate values, it can actually be easier to end up worse than with the defaults.

Although it takes time, it may be better to increase the number of hyperparameter candidates. When I tune manually, I change the candidates in various ways and repeat the tuning until I find a combination that looks promising, and then tune again with finer values around it. But that takes time... Among the automatic hyperparameter optimization frameworks I know of, Optuna lets you specify ranges of candidates rather than the fixed, grid-style (table-format) candidates used here, so it can search finer values more efficiently.

Right. It will take time, but I think we could find better values that way. Optuna? I'll look into it next time.
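Looking into Optuna is left for next time, but as a preview, a minimal sketch of what range-based search might look like is shown below. This is not code from this article; the search space (a single hidden layer whose width is searched between 16 and 256) is made up purely for illustration.

import optuna
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def objective(trial):
    # Search a range of layer widths instead of a fixed grid of candidates.
    n_units = trial.suggest_int('n_units', 16, 256)
    clf = MLPClassifier(hidden_layer_sizes=(n_units,),
                        max_iter=2000, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()

study = optuna.create_study(direction='maximize')  # maximize accuracy
study.optimize(objective, n_trials=30)
print(study.best_value, study.best_params)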

This time, we used scikit-learn to take a quick look at how to use its classifiers and how to tune hyperparameters. Writing the code yourself with PyTorch helps you understand exactly what you are doing and has the pleasure of writing code, but scikit-learn lets you accomplish complex things with simple code, which is a big attraction.

So, our Kaggle life started with the Titanic competition. At first we fed data into a DNN without much thought and got a not-so-good score, and since then we have been imitating EDA, feature engineering, and hyperparameter tuning. What I realized is that "preprocessing" matters. "Garbage in, garbage out" seems to be an enduring truth in the world of programming.

While participating in the Titanic competition, I feel we picked up a few clues about what to do to get good results, through dialogue with the data and hyperparameter tuning. Now it's time to step away from the Titanic competition and take our time with another one, gradually building up experience and improving our score.
