Tuesday, 12 August 2014

Supervised and Unsupervised Learning

Machine learning can be broken into two broad regimes: supervised learning and unsupervised learning. We’ll introduce these concepts here, and discuss them in more detail below.
In Supervised Learning, we have a dataset consisting of both features and labels. The task is to construct an estimator which is able to predict the label of an object given the set of features. A relatively simple example is predicting the species of an iris given a set of measurements of its flower. Some more complicated examples are:
  • given a multicolor image of an object through a telescope, determine whether that object is a star, a quasar, or a galaxy.
  • given a photograph of a person, identify the person in the photo.
  • given a list of movies a person has watched and their personal rating of the movie, recommend a list of movies they would like (A famous example is the Netflix Prize).
What these tasks have in common is that there are one or more unknown quantities associated with the object which need to be determined from other observed quantities. Supervised learning is further broken down into two categories: classification and regression. In classification, the label is discrete, while in regression, the label is continuous. For example, in astronomy, the task of determining whether an object is a star, a galaxy, or a quasar is a classification problem: the label comes from three distinct categories. On the other hand, we might wish to estimate the age of an object based on such observations: this would be a regression problem, because the label (age) is a continuous quantity.
Unsupervised Learning addresses a different sort of problem. Here the data has no labels, and we are interested in finding similarities between the objects in question. In a sense, you can think of unsupervised learning as a means of discovering labels from the data itself. Unsupervised learning comprises tasks such as dimensionality reduction, clustering, and density estimation. For example, in the iris data discussed above, we can use unsupervised methods to determine combinations of the measurements which best display the structure of the data. As we’ll see below, such a projection of the data can be used to visualize the four-dimensional dataset in two dimensions. Some more involved unsupervised learning problems are:
  • given detailed observations of distant galaxies, determine which features or combinations of features are most important in distinguishing between galaxies.
  • given a mixture of two sound sources (for example, a person talking over some music), separate the two (this is called the blind source separation problem).
  • given a video, isolate a moving object and categorize it in relation to other moving objects which have been seen.
scikit-learn strives to have a uniform interface across all methods, and we’ll see examples of this below. Given a scikit-learn estimator object named model, the following methods are available:
  • Available in all Estimators
    • model.fit() : fit training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X,y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
  • Available in supervised estimators
    • model.predict() : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the learned label for each object in the array.
    • model.predict_proba() : For classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
  • Available in unsupervised estimators
    • model.transform() : given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
    • model.fit_transform() : some estimators implement this method, which performs a fit and a transform on the same input data.
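To make this concrete, here is a minimal sketch of the uniform interface on the iris data, using two estimators that appear later in this tutorial (LinearSVC and PCA) and scikit-learn’s bundled load_iris helper:
>>> from sklearn.datasets import load_iris
>>> from sklearn.svm import LinearSVC
>>> from sklearn.decomposition import PCA
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = LinearSVC().fit(X, y)          # supervised: fit(X, y)
>>> y_pred = clf.predict(X[:5])          # predict labels for "new" data
>>> pca = PCA(n_components=2).fit(X)     # unsupervised: fit(X)
>>> X_2d = pca.transform(X)              # transform into the new basis
>>> X_2d.shape
(150, 2)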

 Supervised Learning: model.fit(X, y)

Flow diagram for supervised learning with scikit-learn
Overview of supervised learning with scikit-learn
As mentioned above, a supervised learning algorithm makes the distinction between the raw observed data X with shape (n_samples, n_features) and some label given to the model during training. In scikit-learn this label array is usually denoted y and generally has the shape (n_samples,). After training, the fitted model will try to predict the most likely labels y_new for a new set of samples X_new.
Depending on the nature of the target y, supervised learning can be given different names:
  • If y has values in a fixed set of categorical outcomes (represented by integers), the task of predicting y is called classification.
  • If y has floating point values (e.g. representing a price, a temperature, a size...), the task of predicting y is called regression (see the quick check after this list).
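A quick way to tell the two apart is to look at the target array itself. As a small sketch using two datasets bundled with scikit-learn (iris for classification, diabetes for regression):
>>> from sklearn.datasets import load_iris, load_diabetes
>>> load_iris().target[:3]               # integer class ids: a classification target
array([0, 0, 0])
>>> load_diabetes().target.dtype.kind    # 'f' for floating point: a regression target
'f'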

 Classification

Classification is the task of predicting the value of a categorical variable given some input variables (a.k.a. the features or “predictors”). This section includes a first exploration of classification with scikit-learn. We’ll explore a detailed example of classification with astronomical data in Classification: Learning Labels of Astronomical Sources.

 A first classifier example with scikit-learn

Note
 
The information in this section is available in an interactive notebook 02_iris_classification.ipynb, which can be viewed with the IPython notebook. An online static view can be seen here.
In the iris dataset example, suppose we are given the task of predicting the class of an individual flower from the measurements of its petals and sepals. This is a classification task, hence we have:
>>> X, y = iris.data, iris.target
Once the data has this format it is trivial to train a classifier, for instance a support vector machine with a linear kernel:
>>> from sklearn.svm import LinearSVC
>>> clf = LinearSVC()
Note
 
Whenever you import a scikit-learn class or function for the first time, you are advised to read its docstring by using the ? magic suffix of IPython, for instance: LinearSVC?.
clf is a statistical model whose parameters control the learning algorithm (those parameters are sometimes called the hyperparameters). The hyperparameters can be supplied by the user in the constructor of the model. We will explain later how to choose a good combination using either simple empirical rules or data-driven selection:
>>> clf
LinearSVC(C=1.0, dual=True, fit_intercept=True, intercept_scaling=1,
     loss='l2', multi_class=False, penalty='l2', tol=0.0001)
When the model is constructed, the actual model parameters are not yet set. They will be tuned automatically from the data by calling the fit method:
>>> clf = clf.fit(X, y)

>>> clf.coef_                         
array([[ 0.18...,  0.45..., -0.80..., -0.45...],
       [ 0.05..., -0.89...,  0.40..., -0.93...],
       [-0.85..., -0.98...,  1.38...,  1.86...]])

>>> clf.intercept_                    
array([ 0.10...,  1.67..., -1.70...])
Once the model is trained, it can be used to predict the most likely outcome on unseen data. For instance, let us define a simple sample that looks like the first sample of the iris dataset:
>>> X_new = [[ 5.0,  3.6,  1.3,  0.25]]

>>> clf.predict(X_new)
array([0], dtype=int32)
The outcome is 0, which is the id of the first iris class, namely ‘setosa’.
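To map the integer id back to the species name, one can index the dataset’s target_names array (a small convenience sketch, reusing the iris object loaded above):
>>> species = iris.target_names[clf.predict(X_new)]   # an array containing 'setosa'
>>> species[0]
'setosa'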
The following figure places the location of the fit and predict calls on the previous flow diagram. The vec object in that diagram is a vectorizer used for feature extraction; it is not needed for the iris data, which already comes as vectors of features.
Some scikit-learn classifiers can also predict probabilities for each outcome. This is the case for logistic regression models:
>>> from sklearn.linear_model import LogisticRegression
>>> clf2 = LogisticRegression().fit(X, y)
>>> clf2
LogisticRegression(C=1.0, dual=False, fit_intercept=True, intercept_scaling=1,
          penalty='l2', tol=0.0001)
>>> clf2.predict_proba(X_new)
array([[  9.07512928e-01,   9.24770379e-02,   1.00343962e-05]])
This means that the model estimates that the sample in X_new has:
  • about a 91% probability of belonging to the ‘setosa’ class
  • about a 9% probability of belonging to the ‘versicolor’ class
  • a negligible (less than 0.01%) probability of belonging to the ‘virginica’ class
Of course, the predict method that outputs the label id of the most likely outcome is also available:
>>> clf2.predict(X_new)
array([0], dtype=int32)

 Notable implementations of classifiers

  • sklearn.linear_model.LogisticRegression : Regularized Logistic Regression based on liblinear
  • sklearn.svm.LinearSVC : Support Vector Machines without kernels, based on liblinear
  • sklearn.svm.SVC : Support Vector Machines with kernels, based on libsvm
  • sklearn.linear_model.SGDClassifier : Regularized linear models (SVM or logistic regression) trained with a Stochastic Gradient Descent algorithm written in Cython
  • sklearn.neighbors.NeighborsClassifier : k-Nearest Neighbors classifier based on the ball tree data structure for low-dimensional data and brute force search for high-dimensional data
  • sklearn.naive_bayes.GaussianNB : Gaussian Naive Bayes model. This is an unsophisticated model which can be trained very quickly. It is often used to obtain baseline results before moving to a more sophisticated classifier (see the sketch below).
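All of these share the fit/predict interface described above, so any of them can be swapped in for the LinearSVC used earlier. For instance, a quick Gaussian Naive Bayes baseline (a minimal sketch, reusing X, y and X_new from above):
>>> from sklearn.naive_bayes import GaussianNB
>>> gnb = GaussianNB().fit(X, y)    # train a fast baseline classifier
>>> label = gnb.predict(X_new)      # same interface as LinearSVC; again predicts class 0, i.e. 'setosa'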

 Sample applications of classifiers

The following list gives examples of applications of classifiers for some common engineering tasks, along with the outcomes they predict:
  • E-mail classification: spam, normal, priority mail
  • Language identification in text documents: en, es, de, fr, ja, zh, ar, ru...
  • News article categorization: business, technology, sports...
  • Sentiment analysis in customer feedback: negative, neutral, positive
  • Face verification in pictures: same / different person
  • Speaker verification in voice recordings: same / different person
  • Classification of astronomical sources: object type or class

 Regression

Regression is the task of predicting the value of a continuously varying variable (e.g. a price, a temperature, a conversion rate...) given some input variables (a.k.a. the features, “predictors” or “regressors”). We’ll explore a detailed example of regression with astronomical data in Regression: Photometric Redshifts of Galaxies.
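As a minimal sketch of the regression interface, here is an ordinary least squares fit on a tiny made-up one-dimensional dataset (the numbers are purely illustrative):
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])   # one feature per sample
>>> y_toy = np.array([0.1, 0.9, 2.1, 2.9])           # a continuous target
>>> reg = LinearRegression().fit(X_toy, y_toy)       # same fit interface as classifiers
>>> y_out = reg.predict([[4.0]])                     # a continuous prediction, approximately 3.9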
Some notable implementations of regression models in scikit-learn include:
  • sklearn.linear_model.Ridge : L2-regularized least squares linear model
  • sklearn.linear_model.ElasticNet : L1+L2-regularized least squares linear model trained using Coordinate Descent
  • sklearn.linear_model.LassoLARS : L1-regularized least squares linear model trained with Least Angle Regression
  • sklearn.linear_model.SGDRegressor : L1+L2-regularized least squares linear model trained using Stochastic Gradient Descent
  • sklearn.linear_model.ARDRegression : Bayesian Automated Relevance Determination regression
  • sklearn.svm.SVR : Non-linear regression using Support Vector Machines (wrapper for libsvm)
  • sklearn.ensemble.RandomForestRegressor : An ensemble method which constructs multiple decision trees from subsets of the data

 Unsupervised Learning: model.fit(X)

Flow diagram for unsupervised learning with scikit-learn
Unsupervised Learning overview
An unsupervised learning algorithm only uses a single set of observations X with shape (n_samples, n_features) and does not use any kind of labels.
An unsupervised learning model will try to fit its parameters so as to best summarize regularities found in the data.
The following introduces the main variants of unsupervised learning algorithms, namely dimensionality reduction and clustering.

 Dimensionality Reduction and visualization

Dimensionality reduction is the task of deriving a set of new artificial features that is smaller than the original feature set while retaining most of the variance of the original data.

 Normalization and visualization with PCA

Note
 
The information in this section is available in an interactive notebook 03_iris_dimensionality.ipynb, which can be viewed with the IPython notebook. An online static view can be seen here.
The most common technique for dimensionality reduction is Principal Component Analysis (PCA).
PCA computes linear combinations of the original features using a truncated Singular Value Decomposition of the matrix X, so as to project the data onto a basis of the top singular vectors.
If the number of retained components is 2 or 3, PCA can be used to visualize the dataset:
>>> from sklearn.decomposition import PCA
>>> pca = PCA(n_components=2, whiten=True).fit(X)
Once fitted, the pca model exposes the singular vectors in the components_ attribute:
>>> pca.components_                                      
array([[ 0.17..., -0.04...,  0.41...,  0.17...],
       [-1.33..., -1.48...,  0.35...,  0.15...]])

>>> pca.explained_variance_ratio_                        
array([ 0.92...,  0.05...])

>>> pca.explained_variance_ratio_.sum()                  
0.97...
Let us project the iris dataset along those first 2 dimensions:
>>> X_pca = pca.transform(X)
Because we passed whiten=True, the dataset has been “normalized”: the projected data is now centered with unit variance along both components:
>>> import numpy as np
>>> np.round(X_pca.mean(axis=0), decimals=5)
array([-0.,  0.])

>>> np.round(X_pca.std(axis=0), decimals=5)
array([ 1.,  1.])
Furthermore, the projected components no longer carry any linear correlation:
>>> import numpy as np
>>> np.round(np.corrcoef(X_pca.T), decimals=5)
array([[ 1., -0.],
       [-0.,  1.]])
We can visualize the dataset using pylab, for instance by defining the following utility function:
>>> import pylab as pl
>>> from itertools import cycle
>>> def plot_2D(data, target, target_names):
...     colors = cycle('rgbcmykw')
...     target_ids = range(len(target_names))
...     pl.figure()
...     for i, c, label in zip(target_ids, colors, target_names):
...         pl.scatter(data[target == i, 0], data[target == i, 1],
...                    c=c, label=label)
...     pl.legend()
...     pl.show()
...
Calling plot_2D(X_pca, iris.target, iris.target_names) will display the following:
2D PCA projection of the iris dataset
Note that this projection was determined without any information about the labels (represented by the colors): this is the sense in which the learning is unsupervised. Nevertheless, we see that the projection gives us insight into the distribution of the different flowers in parameter space: notably, Iris setosa is much more distinct than the other two species.
Note
 
The default implementation of PCA computes the SVD of the full data matrix, which does not scale well when both n_samples and n_features are big (more than a few thousand).
If you are interested in a number of components that is much smaller than both n_samples and n_features, consider using sklearn.decomposition.RandomizedPCA instead.
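A minimal sketch of the randomized variant (RandomizedPCA was the class name at the time of writing; newer releases expose the same idea as PCA(svd_solver='randomized')):
>>> from sklearn.decomposition import RandomizedPCA
>>> rpca = RandomizedPCA(n_components=2, random_state=42).fit(X)   # approximate truncated SVD
>>> X_rpca = rpca.transform(X)                                     # same transform interface as PCA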

 Other applications of dimensionality reduction

Dimensionality reduction is not just useful for visualizing high-dimensional datasets. It can also be used as a preprocessing step (often called data normalization) to speed up supervised machine learning methods that are not computationally efficient with a large n_features (for instance, SVM classifiers with Gaussian kernels), or that do not work well with linearly correlated features.
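For instance, a PCA projection can be chained with a Gaussian-kernel SVM using sklearn.pipeline.Pipeline; a minimal sketch reusing X, y and X_new from the iris example (the number of retained components is arbitrary here):
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.decomposition import PCA
>>> from sklearn.svm import SVC
>>> pca_svm = Pipeline([('pca', PCA(n_components=2)),    # decorrelate and reduce the features
...                     ('svm', SVC(kernel='rbf'))])     # then fit a Gaussian-kernel SVM
>>> pca_svm = pca_svm.fit(X, y)
>>> label = pca_svm.predict(X_new)                       # again predicts the 'setosa' class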
Note
 
scikit-learn also features an implementation of Independent Component Analysis (ICA) and several manifold learning methods (see Exercise 3: Dimensionality Reduction of Spectra).

 Clustering

Clustering is the task of gathering samples into groups of similar samples according to some predefined similarity or dissimilarity measure (such as the Euclidean distance).
For example, let us reuse the output of the 2D PCA of the iris dataset and try to find 3 groups of samples using the simplest clustering algorithm (KMeans):
>>> from sklearn.cluster import KMeans
>>> from numpy.random import RandomState
>>> rng = RandomState(42)

>>> kmeans = KMeans(n_clusters=3, random_state=rng).fit(X_pca)

>>> np.round(kmeans.cluster_centers_, decimals=2)
array([[ 1.02, -0.71],
       [ 0.33,  0.89],
       [-1.29, -0.44]])

>>> kmeans.labels_[:10]
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

>>> kmeans.labels_[-10:]
array([0, 0, 1, 0, 0, 0, 1, 0, 0, 1])
We can plot the assigned cluster labels instead of the target names with:
plot_2D(X_pca, kmeans.labels_, ["c0", "c1", "c2"])
KMeans cluster assignments on the 2D PCA projection of the iris data
Exercise
Repeat the clustering algorithm from above, but fit the clusters to the full dataset X rather than the projection X_pca. Do the labels computed this way better match the true labels?
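One possible way to compare the two clusterings with the true labels is the adjusted Rand index from sklearn.metrics (our choice of metric, not prescribed by the exercise); a sketch reusing iris, X, rng, kmeans and X_pca from above:
>>> from sklearn.metrics import adjusted_rand_score
>>> kmeans_full = KMeans(n_clusters=3, random_state=rng).fit(X)     # cluster the raw 4D measurements
>>> score_full = adjusted_rand_score(iris.target, kmeans_full.labels_)
>>> score_pca = adjusted_rand_score(iris.target, kmeans.labels_)    # clustering on the 2D projection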

 Notable implementations of clustering models

The following are some well-known clustering algorithms. Like most unsupervised learning models in scikit-learn, they expect the data to be clustered to have the shape (n_samples, n_features):
  • sklearn.cluster.KMeans : The simplest, yet effective, clustering algorithm. It needs to be provided with the number of clusters in advance and assumes that the input data is normalized (for instance by using a PCA model as a preprocessor).
  • sklearn.cluster.MeanShift : Can find better-looking clusters than KMeans but is not scalable to a high number of samples.
  • sklearn.cluster.DBSCAN : Can detect irregularly shaped clusters based on density, i.e. sparse regions in the input space are likely to become inter-cluster boundaries. Can also detect outliers (samples that are not part of a cluster).
  • sklearn.manifold.LocallyLinearEmbedding : Locally Linear Embedding is a nonlinear, neighbors-based manifold learning technique. The scikit-learn implementation makes several variants of the basic algorithm available.
  • sklearn.manifold.Isomap : Isomap is another neighbors-based manifold learning method that can find nonlinear projections of data.
Other clustering algorithms do not work with a data array of shape (n_samples, n_features) but directly with a precomputed affinity matrix of shape (n_samples, n_samples); a short sketch of this interface follows the list:
  • sklearn.cluster.AffinityPropagation : Clustering algorithm based on message passing between data points.
  • sklearn.cluster.SpectralClustering : KMeans applied to a projection of the normalized graph Laplacian: finds normalized graph cuts if the affinity matrix is interpreted as the adjacency matrix of a graph.
  • sklearn.cluster.Ward : Hierarchical clustering based on the Ward algorithm, a variance-minimizing approach. At each step, it minimizes the sum of squared differences within all clusters (the inertia criterion).
DBSCAN can work with either an array of samples or an affinity matrix.
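As a sketch of the affinity-matrix interface, one can build an RBF affinity matrix from the projected iris data X_pca and feed it to SpectralClustering (this assumes a scikit-learn version where SpectralClustering accepts affinity='precomputed'; the kernel bandwidth is arbitrary):
>>> from sklearn.metrics.pairwise import rbf_kernel
>>> from sklearn.cluster import SpectralClustering
>>> affinity = rbf_kernel(X_pca, gamma=1.0)        # (n_samples, n_samples) affinity matrix
>>> spectral = SpectralClustering(n_clusters=3, affinity='precomputed', random_state=42)
>>> labels = spectral.fit_predict(affinity)        # cluster assignments for each sample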

 Applications of clustering

Here are some common applications of clustering algorithms:
  • Building customer profiles for market analysis
  • Grouping related web news (e.g. Google News) and websearch results
  • Grouping related stock quotes for investment portfolio management
  • Can be used as a preprocessing step for recommender systems
  • Can be used to build a code book of prototype samples for unsupervised feature extraction, which can then feed supervised learning algorithms (a sketch of this follows below)
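As an illustration of the last point above, the centroids learned by KMeans can act as such a code book: model.transform then returns each sample’s distance to every prototype, which can serve as new features for a supervised learner (a minimal sketch on the iris features X from above, with an arbitrary number of prototypes):
>>> from sklearn.cluster import KMeans
>>> codebook = KMeans(n_clusters=10, random_state=42).fit(X)   # learn 10 prototype samples
>>> X_codes = codebook.transform(X)                            # distance of each sample to each prototype
>>> X_codes.shape
(150, 10)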
