% 2. Supervised learning: predicting an output variable from high-dimensional observations
\subsection{The problem solved in supervised learning}
Supervised learning consists of learning the link between two datasets: the observed data \texttt{X} and an external variable \texttt{y} that we are trying to predict, usually called the target or labels. Most often, \texttt{y} is a 1D array of length \texttt{n\_samples}.

All supervised estimators in scikit-learn implement a \texttt{fit(X, y)} method to fit the model and a \texttt{predict(X)} method that, given unlabeled observations \texttt{X}, returns the predicted labels \texttt{y}.
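To make this interface concrete, here is a minimal sketch of the \texttt{fit}/\texttt{predict} pattern; the toy data and the choice of \texttt{KNeighborsClassifier} are illustrative, not part of the original example:
\begin{framed}
\begin{verbatim}
>>> import numpy as np
>>> from sklearn.neighbors import KNeighborsClassifier
>>> X = np.array([[0.0], [1.0], [2.0], [3.0]])  # 4 observations, 1 feature
>>> y = np.array([0, 0, 1, 1])                  # known labels
>>> clf = KNeighborsClassifier(n_neighbors=1)
>>> clf.fit(X, y)                # learn the link between X and y
KNeighborsClassifier(n_neighbors=1)
>>> clf.predict([[0.9], [2.1]])  # labels for unlabeled observations
array([0, 1])
\end{verbatim}
\end{framed}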
%================================================================================ %
\subsection{Vocabulary: classification and regression}
If the prediction task is to classify the observations into a finite set of labels, in other words to ``name'' the objects observed, the task is said to be a classification task. Conversely, if the goal is to predict a continuous target variable, it is said to be a regression task.

In scikit-learn, for classification tasks, \texttt{y} is a vector of integers.
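This distinction can be checked directly on the bundled datasets; the use of \texttt{load\_diabetes} as the regression example here is an illustrative choice:
\begin{framed}
\begin{verbatim}
>>> from sklearn import datasets
>>> datasets.load_iris().target.dtype.kind      # classification: integer labels
'i'
>>> datasets.load_diabetes().target.dtype.kind  # regression: continuous values
'f'
\end{verbatim}
\end{framed}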
%================================================================================ %
\subsection{Nearest neighbor and the curse of dimensionality}
\subsubsection{Classifying irises}
The iris dataset is a classification task consisting of identifying three different types of irises (Setosa, Versicolour, and Virginica) from the length and width of their petals and sepals:
\begin{framed}
\begin{verbatim}
>>> import numpy as np
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> iris_X = iris.data    # 150 observations with 4 features each
>>> iris_y = iris.target  # the species of each observation
>>> np.unique(iris_y)
array([0, 1, 2])
\end{verbatim}
\end{framed}
%============================================================================= %
\subsubsection{k-Nearest neighbors classifier}
The simplest possible classifier is the nearest neighbor: given a new observation \texttt{x\_test}, find in the training set (i.e., the data used to train the estimator) the observation with the closest feature vector, and predict its label, as sketched below.
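Continuing from the iris snippet above, here is a sketch of this idea; the random train/test split and the choice of \texttt{n\_neighbors=1} are illustrative, not prescribed by the text:
\begin{framed}
\begin{verbatim}
>>> # Shuffle the iris data and hold out the last 10 observations for testing
>>> np.random.seed(0)
>>> indices = np.random.permutation(len(iris_X))
>>> iris_X_train = iris_X[indices[:-10]]
>>> iris_y_train = iris_y[indices[:-10]]
>>> iris_X_test = iris_X[indices[-10:]]
>>> iris_y_test = iris_y[indices[-10:]]
>>> from sklearn.neighbors import KNeighborsClassifier
>>> knn = KNeighborsClassifier(n_neighbors=1)  # predict the label of the
...                                            # single closest training point
>>> knn = knn.fit(iris_X_train, iris_y_train)
>>> knn.predict(iris_X_test)  # compare with the held-out labels iris_y_test
\end{verbatim}
\end{framed}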
%============================================================================== %
\end{document}