Saturday, 9 August 2014

Machine Learning and AI

In this section we will begin to explore the basic principles of machine learning. Machine learning is about building programs with tunable parameters (typically an array of floating-point values) that are adjusted automatically so as to improve their behavior by adapting to previously seen data.
Machine learning can be considered a subfield of artificial intelligence, since such algorithms can be seen as building blocks that make computers learn to behave more intelligently by somehow generalizing, rather than just storing and retrieving data items the way a database system would.
A very simple example of a machine learning task can be seen in the following figure: it shows a collection of two-dimensional data, colored according to two different class labels. A classification algorithm is used to draw a dividing boundary between the two clusters of points:
[Figure: Example Linear Decision Boundary (plot_sgd_separating_hyperplane_1.png)]
As with all figures in this tutorial, the image above is hyperlinked to the Python source code used to generate it.
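To give a feel for what such a classifier looks like in code, here is a minimal sketch that fits a linear model of the same family used to produce the figure above. The data is synthetic (generated with make_blobs), not the figure's actual data:
>>> from sklearn.datasets import make_blobs
>>> from sklearn.linear_model import SGDClassifier

>>> X, y = make_blobs(n_samples=50, centers=2, random_state=0)  # two synthetic clusters
>>> clf = SGDClassifier(random_state=0).fit(X, y)               # learn a separating line

>>> clf.coef_.shape   # one weight per feature: the normal vector of the boundary
(1, 2)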

 Features and feature extraction

Most machine learning algorithms implemented in scikit-learn expect a numpy array as input X. The expected shape of X is (n_samples, n_features).
n_samples: The number of samples. Each sample is an item to process (e.g. classify); it can be a document, a picture, a sound, a video, a row in a database or CSV file, or anything you can describe with a fixed set of quantitative traits.
n_features: The number of features, i.e. distinct traits that can be used to describe each item in a quantitative manner.
The number of features must be fixed in advance. It can however be very high dimensional (e.g. millions of features), with most of them being zero for a given sample. In this case scipy.sparse matrices can be used instead of numpy arrays so as to make the data fit in memory.
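As a small illustration with made-up data, a mostly-zero numpy array can be converted into a scipy.sparse matrix that stores only the non-zero entries:
>>> import numpy as np
>>> from scipy import sparse

>>> X_dense = np.random.random((4, 5))  # 4 samples with 5 features each
>>> X_dense[X_dense < 0.8] = 0          # zero out most entries to make the data sparse
>>> X_sparse = sparse.csr_matrix(X_dense)
>>> X_sparse.shape
(4, 5)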

 A simple example: the iris dataset

Note

The information in this section is available in an interactive notebook 01_datasets.ipynb, which can be viewed using the IPython Notebook. An online static view can be seen here.
The machine learning community often uses the simple Iris dataset, where each row in the dataset (or CSV file) is a set of measurements of an individual iris flower. Each sample in this dataset is described by 4 features and can belong to one of 3 target classes:
Features in the Iris dataset:
  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
Target classes to predict:
  1. Iris Setosa
  2. Iris Versicolour
  3. Iris Virginica
scikit-learn embeds a copy of the iris CSV file along with a helper function to load it into numpy arrays:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris()
The features of each sample flower are stored in the data attribute of the dataset:
>>> n_samples, n_features = iris.data.shape

>>> n_samples
150

>>> n_features
4

>>> iris.data[0]
array([ 5.1,  3.5,  1.4,  0.2])
The information about the class of each sample is stored in the target attribute of the dataset:
>>> len(iris.target) == n_samples
True

>>> iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
The names of the classes are stored in the last attribute, namely target_names:
>>> list(iris.target_names)
['setosa', 'versicolor', 'virginica']
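Since the integers in target are indices into target_names, numpy fancy indexing recovers the class name of each sample directly:
>>> list(iris.target_names[iris.target[:5]])
['setosa', 'setosa', 'setosa', 'setosa', 'setosa']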

 Handling categorical features

Sometimes people describe samples with categorical descriptors that have no obvious numerical representation. For instance, assume that each flower is further described by a color name among a fixed list of color names:
color in ['purple', 'blue', 'red']
A simple way to turn this categorical feature into numerical features suitable for machine learning is to create a new feature for each distinct color name, valued at 1.0 if the category matches and 0.0 if not (a scheme often called one-hot or one-of-K encoding).
The enriched iris feature set would in this case be:
  1. sepal length in cm
  2. sepal width in cm
  3. petal length in cm
  4. petal width in cm
  5. color#purple (1.0 or 0.0)
  6. color#blue (1.0 or 0.0)
  7. color#red (1.0 or 0.0)
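This encoding does not have to be done by hand. As a minimal sketch with made-up color observations, scikit-learn's DictVectorizer performs exactly this expansion for features supplied as Python dicts:
>>> from sklearn.feature_extraction import DictVectorizer

>>> measurements = [{'color': 'purple'}, {'color': 'blue'}, {'color': 'red'}]
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()  # one column per distinct color
array([[ 0.,  1.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.]])
>>> vec.get_feature_names()
['color=blue', 'color=purple', 'color=red']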

 Extracting features from unstructured data

The previous example deals with features that are readily available in a structured dataset with rows and columns of numerical or categorical values.
However, much of the data produced in the real world is not readily available in a structured representation such as SQL, CSV, XML, JSON or RDF.
Here is an overview of strategies to turn unstructured data items into arrays of numerical features.
Text documents:
Count the frequency of each word or pair of consecutive words in each document. This approach is called the Bag of Words representation.
Note: we include other file formats such as HTML and PDF in this category; an ad-hoc preprocessing step is required to extract the plain text, in UTF-8 encoding for instance.
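As a minimal sketch of the Bag of Words idea, with two made-up documents, scikit-learn's CountVectorizer builds the vocabulary and the count matrix in one step:
>>> from sklearn.feature_extraction.text import CountVectorizer

>>> docs = ['the cat sat', 'the cat sat on the mat']
>>> vec = CountVectorizer()
>>> vec.fit_transform(docs).toarray()  # one row per document, one column per word
array([[1, 0, 0, 1, 1],
       [1, 1, 1, 1, 2]])
>>> vec.get_feature_names()
['cat', 'mat', 'on', 'sat', 'the']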
Images:
  • Rescale the picture to a fixed size and take all the raw pixel values (with or without luminosity normalization), as sketched after this list
  • Take some transformation of the signal (gradients in each pixel, wavelet transforms...)
  • Compute the Euclidean, Manhattan or cosine similarities of the sample to a set of reference prototype images arranged in a code book. The code book may have been previously extracted from the same dataset using an unsupervised learning algorithm on the raw pixel signal.
    Each feature value is then the distance to one element of the code book.
  • Perform local feature extraction: split the picture into small regions and perform feature extraction locally in each area.
    Then combine all the features of the individual areas into a single array.
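A minimal sketch of the first of these strategies, using a random array as a stand-in for a rescaled grayscale picture:
>>> import numpy as np

>>> image = np.random.random((32, 32))  # stand-in for a rescaled 32x32 grayscale picture
>>> features = image.ravel()            # flatten the raw pixel values into one feature vector
>>> features.shape
(1024,)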
Sounds:
The same strategy as for images, applied within a 1D space instead of 2D.
Practical implementations of such feature extraction strategies will be presented in the last sections of this tutorial.
