Saturday, July 25, 2015

How to tune hyperparameters?

Hyperparameter tuning is one of the key concepts in machine learning. Grid search, random search, and gradient-based optimization are a few techniques you could use to perform hyperparameter tuning automatically [1].
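
As a point of reference for the automated route, grid search is nothing more than an exhaustive loop over candidate hyperparameter values. Here is a minimal, generic sketch in Python; the scoring function passed in (e.g. a hypothetical my_scoring_function) is whatever training-and-evaluation call you have available, it is not part of WSO2 ML.

```python
from itertools import product

def grid_search(train_and_evaluate, learning_rates, iteration_counts):
    """Exhaustively try every (learning rate, iterations) pair and keep the
    best AUC. `train_and_evaluate(lr, iterations)` is supplied by the caller
    and should return the AUC of a model trained with those values."""
    best = None
    for lr, iters in product(learning_rates, iteration_counts):
        auc = train_and_evaluate(lr, iters)
        if best is None or auc > best[0]:
            best = (auc, lr, iters)
    return best  # (best AUC, learning rate, iterations)

# Example candidate grids (illustrative values only):
# grid_search(my_scoring_function, [0.0001, 0.001, 0.01, 0.1], [1000, 10000, 50000])
```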

In this article, I am going to explain how you could do hyperparameter tuning manually by running a few tests. I am going to use WSO2 Machine Learner 1.0 for this purpose (refer to [2] to understand what WSO2 ML 1.0 is capable of doing). The dataset I have used for this analysis is the well-known Pima Indians Diabetes dataset [3], and the algorithm picked was logistic regression with mini-batch stochastic gradient descent. This algorithm has a few hyperparameters, namely (a PySpark sketch of how these map onto Spark MLlib follows the list):

  • Iterations - Number of times the optimizer runs before completing the optimization process
  • Learning rate - Step size of the optimization algorithm
  • Regularization type - Type of regularization. WSO2 Machine Learner supports L1 and L2 regularization.
  • Regularization parameter - Controls the model complexity and hence helps to control model overfitting.
  • SGD data fraction - Fraction of the training dataset used in a single iteration of the optimization algorithm
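
Since WSO2 ML delegates model training to Apache Spark's MLlib (see the second post below), these hyperparameters map fairly directly onto the parameters of MLlib's LogisticRegressionWithSGD. The PySpark snippet below is my own rough sketch of that mapping, not the actual WSO2 ML code path; the file name, column layout, and regularization parameter value are illustrative assumptions.

```python
# Rough PySpark illustration of how the hyperparameters above map onto
# Spark MLlib's LogisticRegressionWithSGD (the library WSO2 ML uses underneath).
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import LogisticRegressionWithSGD

sc = SparkContext(appName="pima-lr-sgd")

def parse(line):
    values = [float(x) for x in line.split(",")]
    # Assumes the last column of the Pima CSV is the diabetes label (0 or 1).
    return LabeledPoint(values[-1], values[:-1])

data = sc.textFile("pima-indians-diabetes.csv").map(parse)

model = LogisticRegressionWithSGD.train(
    data,
    iterations=10000,       # Iterations
    step=0.1,               # Learning rate
    regType="l2",           # Regularization type (L1 or L2)
    regParam=0.01,          # Regularization parameter (illustrative value)
    miniBatchFraction=1.0)  # SGD data fraction
```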

From the above set of hyperparameters, what I wanted to find was the optimal learning rate and the optimal number of iterations, keeping the other hyperparameters at constant values.

Goals
  • Finding the optimal learning rate and number of iterations that improve the AUC (area under the ROC curve [4])
  • Finding the relationship between the learning rate and AUC
  • Finding the relationship between the number of iterations and AUC

Approach

First, the Pima Indians Diabetes dataset was uploaded to WSO2 ML 1.0. Then, I wanted to find a fair value for the number of iterations so that I could search for the optimal learning rate. For that, the learning rate was kept at a fixed value (0.1) while the number of iterations was varied, and the AUC was recorded against each iteration count.
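
To make the sweep concrete, here is how the same experiment could be scripted directly against MLlib, continuing the PySpark sketch above. The 70/30 split and the fixed regularization settings are my own assumptions; WSO2 ML manages its train/test split for you.

```python
# Sweep the iteration count at a fixed learning rate (0.1) and record the AUC.
# `data` is the parsed Pima RDD from the earlier sketch.
from pyspark.mllib.evaluation import BinaryClassificationMetrics

train, test = data.randomSplit([0.7, 0.3], seed=42)

def auc_for(iterations, learning_rate):
    model = LogisticRegressionWithSGD.train(
        train, iterations=iterations, step=learning_rate,
        regType="l2", regParam=0.01, miniBatchFraction=1.0)
    model.clearThreshold()  # return raw scores instead of 0/1 predictions
    scores_and_labels = test.map(lambda p: (model.predict(p.features), p.label))
    return BinaryClassificationMetrics(scores_and_labels).areaUnderROC

for iterations in [100, 1000, 5000, 10000, 20000, 30000, 50000]:
    print(iterations, auc_for(iterations, learning_rate=0.1))
```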


LR = 0.1

Iterations | 100   | 1000  | 5000  | 10000 | 20000 | 30000 | 50000
AUC        | 0.475 | 0.464 | 0.507 | 0.526 | 0.546 | 0.562 | 0.592

[Iterations vs. AUC graph]

According to the plotted graph, it is quite evident that the AUC increases with the number of iterations. Hence, I picked 10000 as a fair number of iterations for finding the optimal learning rate (of course, I could have picked any number above 5000, which is where the AUC started to climb over 0.5). Note that increasing the number of iterations excessively could lead to an overfitted model.
Having picked a ‘fair’ number of iterations, the next step is to find the optimal learning rate. For that, the number of iterations was kept at a fixed value (10000) while the learning rate was varied, and the AUC was recorded against each learning rate.
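
Scripted against MLlib, this second sweep is just another loop, reusing the auc_for helper from the earlier sketch:

```python
# Sweep the learning rate at a fixed iteration count (10000),
# reusing the auc_for helper defined in the previous sketch.
for lr in [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]:
    print(lr, auc_for(iterations=10000, learning_rate=lr))
```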

Iterations = 10000

LR  | 0.0001 | 0.0005 | 0.001 | 0.005 | 0.01  | 0.1
AUC | 0.529  | 0.558  | 0.562 | 0.590 | 0.599 | 0.526

[Learning Rate / AUC graph]

According to the above observations, we can see that the AUC has a maximum at a learning rate of 0.01 (to be precise, somewhere between 0.005 and 0.01). Hence, we could conclude that the AUC is maximized when the learning rate approaches 0.01, i.e. 0.01 is the optimal learning rate for this particular dataset and algorithm.

Now, we could change the learning rate to 0.01 and re-run the first test mentioned in the article.

LR = 0.01

Iterations | 100   | 1000  | 5000  | 10000 | 20000 | 30000 | 50000 | 100000 | 150000
AUC        | 0.512 | 0.522 | 0.595 | 0.599 | 0.601 | 0.604 | 0.607 | 0.612  | 0.616

[Iterations vs. AUC graph]


The above graph shows that the AUC increases only ever so slightly as we increase the number of iterations. So, how do we find the optimal number of iterations? Well, it depends on how much computing power you have and what level of AUC you expect. The AUC will probably not improve drastically, even if you keep increasing the number of iterations. One simple stopping rule is sketched below.
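
One pragmatic stopping rule (my own suggestion, not something WSO2 ML applies for you) is to keep increasing the iteration budget and stop once an extra round of training no longer buys a meaningful AUC gain. A sketch, reusing the hypothetical auc_for helper from the earlier snippets:

```python
# Double the iteration budget until the AUC gain drops below a small threshold.
# The starting point, cap, and threshold are arbitrary illustrative values.
def pick_iterations(learning_rate, start=5000, max_iterations=200000, min_gain=0.002):
    iterations, best_auc = start, auc_for(start, learning_rate)
    while iterations * 2 <= max_iterations:
        candidate_auc = auc_for(iterations * 2, learning_rate)
        if candidate_auc - best_auc < min_gain:
            break  # doubling the work no longer pays off
        iterations, best_auc = iterations * 2, candidate_auc
    return iterations, best_auc

print(pick_iterations(learning_rate=0.01))
```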

How can I increase the AUC then? You could of course try another binary classification algorithm (such as a Support Vector Machine, sketched below), or you could do some feature engineering on the dataset to reduce the noise in the training data.
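
For instance, MLlib (which WSO2 ML builds on) also ships a linear SVM trained with SGD; a quick, untuned comparison on the same split could look like the sketch below. The hyperparameter values here are illustrative only.

```python
# Train MLlib's linear SVM on the same split and compute its AUC.
from pyspark.mllib.classification import SVMWithSGD

svm = SVMWithSGD.train(train, iterations=10000, step=0.01,
                       regType="l2", regParam=0.01, miniBatchFraction=1.0)
svm.clearThreshold()  # raw margins instead of 0/1 labels, so AUC is meaningful
svm_scores = test.map(lambda p: (svm.predict(p.features), p.label))
print(BinaryClassificationMetrics(svm_scores).areaUnderROC)
```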

Summary
This article tried to explain the process of manually tuning hyperparameters for a selected dataset and algorithm. The same approach can be used with different datasets and algorithms too.
References:



Monday, July 6, 2015

Sneak Peek into WSO2 Machine Learner 1.0


This article is about one of the newest products of WSO2, WSO2 Machine Learner (WSO2 ML). We have released the very first general availability release of WSO2 ML. For people who are wondering when I moved from the Stratos team to the ML team: it happened in January this year (2015), at my own request (yes, WSO2 was kind enough to accommodate my request :-)). We are a 7-member team now (effectively 3 in R&D), led by Dr. Srinath Perera, VP Research. We also get assistance from a member of the UX team and a member of the documentation team.

What is Machine Learning?
 

“Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning explores the construction and study of algorithms that can learn from and make predictions on data.”

A simpler definition comes from Professor Andrew Ng of Stanford University:

“Machine learning is the science of getting computers to act without being explicitly programmed. In the past decade, machine learning has given us self-driving cars, practical speech recognition, effective web search, and a vastly improved understanding of the human genome. Machine learning is so pervasive today that you probably use it dozens of times a day without knowing it. Many researchers also think it is the best way to make progress towards human-level AI.” (source: https://www.coursera.org/course/ml)

In simple terms, with machine learning we are trying to make the computer learn patterns from a vast amount of historical data and then use the learnt patterns to make predictions.

What is WSO2 Machine Learner?

WSO2 Machine Learner is a product which helps you manage and explore your data, build machine learning models by analyzing the data with machine learning algorithms, compare and manage the generated models, and make predictions using the built models. The following image depicts the high-level architecture of WSO2 ML.



WSO2 ML exposes all its operations via a REST API. We use the well-known Apache Spark to perform various operations on datasets in a scalable and efficient manner. Currently, we support a number of machine learning algorithms, covering regression and classification from the supervised learning techniques and clustering from the unsupervised learning techniques. We use Apache Spark's MLlib to provide support for all currently implemented algorithms.

In this post, my main focus is to go through the feature list of the WSO2 ML 1.0.0 release, so that you can see whether it can be used to improve the way you do machine learning.

Manage Your Datasets

We help you manage your data through our dataset versioning support. In a typical use case, you would have X amount of data now and collect another Y amount of data within a month. With WSO2 ML you could create a dataset with version 1.0.0 which points to the X data, and a month later create version 1.1.0 which points to the (X+Y) data. Then, you could pick these different dataset versions, run a machine learning analysis on top of them, and generate models.


WSO2 ML accepts CSV and TSV data formats, and the dataset files can reside in the file system or in HDFS. In addition to these storages, we support pulling data from a data table generated by WSO2 Data Analytics Server [doc].

Explore Your Data

Once you have uploaded datasets into WSO2 ML, you can explore a few key details about your dataset, such as the feature set, scatter plots to understand the relationship between two selected features, a histogram of each feature, parallel sets to explore categorical features, trellis charts, and cluster diagrams [doc].





Manage Your ML Projects

WSO2 ML has a concept called 'Project', which is basically a logical grouping of the set of machine learning analyses you would perform on a selected dataset. Note that when I say a dataset, it includes the multiple dataset versions belonging to that dataset. WSO2 ML allows you to manage your machine learning projects based on datasets and also based on users.




Build and Manage Analyses

WSO2 ML has a concept called 'Analysis', which holds a pre-processed feature set, a selected machine learning algorithm, and its calibrated set of hyperparameters. Each analysis belongs to a project, and a project can have multiple analyses. Once you create an analysis, you cannot edit it, but you can view it and also delete it. Analysis creation is done using the wizard provided by WSO2 ML.





Run Analyses and Manage Models

Once you have followed the wizard and generated an analysis, the final step is to pick a dataset version from the available versions of the project's dataset and run the analysis. The outcome of this process is a machine learning model. The same analysis can be run on different dataset versions to generate multiple models.




Once a model is generated, you can perform various operations on it, such as viewing the model summary, downloading the model object as a file, publishing the model into the WSO2 registry, and predicting with it.






Compare Models

Your ultimate goal is to build an accurate model which can later be used for prediction. To help you out here, i.e. to let you easily compare all the models created using different analyses, we have a model comparison view.



For classification problems, we sort the models by their accuracy values, and for numerical prediction problems we sort by the mean squared error.

ML REST API

All the underlying WSO2 ML operations are exposed through the REST API, and in fact our UI client is built on top of the ML REST API [doc]. If you wish, you could write a client in any language on top of our REST API. It currently supports basic auth and session-based authentication.
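
As a tiny illustration, a Python client using basic auth might look like the sketch below. The resource path, port, and credentials are placeholders/assumptions on my part; check the REST API documentation for the actual endpoints.

```python
# Minimal illustrative client: the endpoint path, port, and credentials are
# placeholders, not guaranteed to match the actual WSO2 ML REST API.
import requests

ML_SERVER = "https://localhost:9443"              # assumed default WSO2 server address
resp = requests.get(ML_SERVER + "/api/datasets",  # hypothetical resource path
                    auth=("admin", "admin"),      # basic auth credentials
                    verify=False)                 # dev setup with a self-signed cert
print(resp.status_code, resp.json())
```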

ML UI

Our Jaggery-based UI is built using the latest UX designs, as you have probably noticed from the screenshots seen thus far in this post.

ML-WSO2 ESB Integration

We have written an ML-ESB mediator which can be used to make predictions on data extracted from an incoming request, against an ML model generated using WSO2 ML [doc].

ML-WSO2 CEP Integration

In addition to the ESB mediator, we have written an ML-CEP extension, which can be used to make real-time predictions against a generated model [doc].

External Spark Cluster Support

WSO2 ML ships with an embedded Spark runtime by default, so that you can simply unzip the pack and start playing with it. However, it can also be configured to connect to an external Spark cluster [doc].

The Future

* Deep Learning algorithm support using H2O - this is currently underway as a GSoC project.
* Data pre-processing using DataWrangler - current GSoC project
* Recommendation algorithm support - current GSoC project
* ... and a whole lot of other new features and improvements.


This is basically a summary of what WSO2 ML 1.0 is all about. Please follow our GitHub repository for more information. You are most welcome to try it out and report any issues in our Jira.