Hyperparameter tuning is one of the key concepts in machine learning. Grid search, random search, and gradient-based optimization are a few techniques you can use to perform hyperparameter tuning automatically [1].
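For context, here is a minimal grid-search sketch, assuming scikit-learn and a synthetic dataset (neither is part of WSO2 ML; this only illustrates what automated tuning looks like):

```python
# Minimal grid search: score every hyperparameter combination with
# cross-validation and keep the best one (scikit-learn assumed).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=42)

param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],   # inverse regularization strength
    "penalty": ["l1", "l2"],
}
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```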
In this article, I am going to explain how you can tune hyperparameters manually by running a few tests. I am going to use WSO2 Machine Learner 1.0 for this purpose (refer to [2] to understand what WSO2 ML 1.0 is capable of). The dataset I used for this analysis is the well-known Pima Indians Diabetes dataset [3], and the algorithm picked was logistic regression with the mini-batch gradient descent optimizer. This algorithm has a few hyperparameters, namely:
- Iterations - Number of times the optimizer runs before completing the optimization process
- Learning rate - Step size of the optimization algorithm
- Regularization type - Type of regularization. WSO2 Machine Learner supports L1 and L2 regularization.
- Regularization parameter - Controls the model complexity and hence helps to prevent overfitting
- SGD data fraction - Fraction of the training dataset used in a single iteration of the optimization algorithm
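WSO2 ML exposes these settings through its model-building wizard rather than code. Purely as an analogy, here is how most of them map onto scikit-learn's SGDClassifier (an assumption for illustration; WSO2 ML runs its own Spark-based implementation):

```python
# Rough scikit-learn analogue of the hyperparameters listed above
# (illustrative only, not what WSO2 ML executes internally).
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(
    loss="log_loss",          # logistic regression objective ("log" on older versions)
    max_iter=10000,           # ~ Iterations
    learning_rate="constant",
    eta0=0.1,                 # ~ Learning rate (step size)
    penalty="l2",             # ~ Regularization type (L1 or L2)
    alpha=0.0001,             # ~ Regularization parameter
)
# SGDClassifier has no direct "SGD data fraction" knob; in WSO2 ML it
# controls what fraction of the training set each iteration samples.
```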
From the above set of hyperparameters, what I wanted to find was the optimal learning rate and number of iterations, keeping the other hyperparameters at constant values.
Goals
- Finding the learning rate and number of iterations that maximize AUC (area under the ROC curve [4])
- Finding the relationship between the learning rate and AUC
- Finding the relationship between the number of iterations and AUC
Approach
First, the Pima Indians Diabetes dataset was uploaded to WSO2 ML 1.0. Then, I wanted to settle on a fair number of iterations so that I could find the optimal learning rate. For that, the learning rate was kept at a fixed value (0.1) while the number of iterations was varied, and the AUC was recorded against each iteration count.
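A minimal sketch of this sweep, assuming scikit-learn as a stand-in for WSO2 ML and assuming the dataset is available locally as a headerless CSV (the path and column layout are assumptions):

```python
# Sweep the number of iterations at a fixed learning rate and record AUC.
# Note: max_iter counts epochs in SGDClassifier, which only approximates
# WSO2 ML's optimizer iterations.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

data = np.loadtxt("pima-indians-diabetes.csv", delimiter=",")  # assumed path
X, y = data[:, :-1], data[:, -1]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7)

for iterations in [100, 1000, 5000, 10000, 20000, 30000, 50000]:
    model = SGDClassifier(loss="log_loss", learning_rate="constant",
                          eta0=0.1, max_iter=iterations, tol=None)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.decision_function(X_test))
    print(iterations, round(auc, 3))
```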
LR = 0.1

| Iterations | 100 | 1000 | 5000 | 10000 | 20000 | 30000 | 50000 |
|---|---|---|---|---|---|---|---|
| AUC | 0.475 | 0.464 | 0.507 | 0.526 | 0.546 | 0.562 | 0.592 |
According to these results, it is quite evident that the AUC increases with the number of iterations. Hence, I picked 10000 as a fair number of iterations for finding the optimal learning rate (of course, I could have picked any number above 5000, where the AUC started to climb over 0.5). Increasing the number of iterations excessively would lead to an overfitted model.
Since I had picked a 'fair' number of iterations, the next step was to find the optimal learning rate. For that, the number of iterations was kept at a fixed value (10000) while the learning rate was varied, and the AUC was recorded against each learning rate.
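The same sketch can be reused with the roles swapped, now holding the iteration count fixed and sweeping the step size (this continues from the previous sketch and reuses its assumed train/test split):

```python
# Sweep the learning rate at a fixed iteration count and record AUC.
# Assumes X_train, X_test, y_train, y_test from the previous sketch.
for lr in [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]:
    model = SGDClassifier(loss="log_loss", learning_rate="constant",
                          eta0=lr, max_iter=10000, tol=None)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.decision_function(X_test))
    print(lr, round(auc, 3))
```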
Iterations = 10000

| LR | 0.0001 | 0.0005 | 0.001 | 0.005 | 0.01 | 0.1 |
|---|---|---|---|---|---|---|
| AUC | 0.529 | 0.558 | 0.562 | 0.59 | 0.599 | 0.526 |
According to the above observations, the AUC reaches its maximum (among the tested values) at a learning rate of 0.01; to be precise, the true peak lies somewhere between 0.005 and 0.01. Hence, we could conclude that the AUC is maximized when the learning rate approaches 0.01, i.e. 0.01 is the optimal learning rate for this particular dataset and algorithm.
Now, we could change the learning rate to 0.01 and re-run the first test mentioned in the article.
LR = 0.01

| Iterations | 100 | 1000 | 5000 | 10000 | 20000 | 30000 | 50000 | 100000 | 150000 |
|---|---|---|---|---|---|---|---|---|---|
| AUC | 0.512 | 0.522 | 0.595 | 0.599 | 0.601 | 0.604 | 0.607 | 0.612 | 0.616 |
The results above show that the AUC increases only slightly as the number of iterations grows. So, how do you find the optimal number of iterations? Well, it depends on how much computing power you have and what level of AUC you expect. The AUC will probably not improve drastically, even if you keep increasing the number of iterations.
How can you increase the AUC then? You could of course use another binary classification algorithm (such as Support Vector Machine), or you could do some feature engineering on the dataset to reduce the noise in the training data.
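As an illustration of the first suggestion, here is a hypothetical SVM baseline on the same assumed train/test split from the earlier sketches (scikit-learn assumed; WSO2 ML 1.0 also offers SVM through its own wizard):

```python
# Hypothetical SVM baseline; assumes X_train, X_test, y_train, y_test
# and roc_auc_score from the earlier sketches.
from sklearn.svm import SVC

svm = SVC(kernel="rbf")
svm.fit(X_train, y_train)
auc = roc_auc_score(y_test, svm.decision_function(X_test))
print("SVM AUC:", round(auc, 3))
```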
Summary
This article walked through the process of manually tuning hyperparameters for a selected dataset and algorithm. The same approach could be used with other datasets and algorithms too.
References: