From ac5546535a4c4bf4f5d2195ae15f9ebd5718322a Mon Sep 17 00:00:00 2001
From: Nicu Tofan
Date: Wed, 10 Jun 2015 03:00:27 +0300
Subject: [PATCH] A page about using models to generate predictions

---
 .gitignore         |   3 ++
 doc/index.txt      |   1 +
 doc/predicting.txt | 110 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 114 insertions(+)
 create mode 100644 doc/predicting.txt

diff --git a/.gitignore b/.gitignore
index fbe741d310..397efe9a2d 100644
--- a/.gitignore
+++ b/.gitignore
@@ -37,3 +37,6 @@ pylearn2/utils/_video.so
 pylearn2/utils/_window_flip.c
 pylearn2/utils/_window_flip.so
 pylearn2/utils/build/
+
+# HTML documentation generated by default in html/
+/html/
diff --git a/doc/index.txt b/doc/index.txt
index 010a3c7aef..c5dd9120e9 100644
--- a/doc/index.txt
+++ b/doc/index.txt
@@ -220,6 +220,7 @@ Developer
    api_change
    cluster
    features
+   predicting
    internal/index
    internal/metadocumentation
    internal/data_specs
diff --git a/doc/predicting.txt b/doc/predicting.txt
new file mode 100644
index 0000000000..5a9df83c62
--- /dev/null
+++ b/doc/predicting.txt
@@ -0,0 +1,110 @@
.. _predicting:

==========================================
Predicting values using your trained model
==========================================

This page presents a simple way to generate predictions
using a trained model.

Prerequisites
=============

The tutorial assumes that the reader has a trained, pickled
model at hand:

.. code-block:: python

    from pylearn2.utils import serial
    model = serial.load('model.pkl', retry=False)

``serial.load()`` is a convenient wrapper that brings together
loading from numpy ``.npy`` files, Matlab ``.mat`` files and pickled
``.pkl`` files, among others. It can also wait for the resource
to become available by making a number of attempts (the ``retry``
parameter above).

The data used to generate predictions is delivered as a dataset
with the same characteristics and type as the dataset used in training.
For the sake of simplicity we use a
:class:`~pylearn2.datasets.csv_dataset.CSVDataset` here:

.. code-block:: python

    from pylearn2.datasets.csv_dataset import CSVDataset
    dataset = CSVDataset(path='data_to_predict.csv',
                         task='classification',
                         expect_headers=True)

The code expects a file called ``data_to_predict.csv`` in the current
directory, with headers on the first row. Internally, the dataset
uses ``numpy.loadtxt()`` to process the file. If a preprocessor was
used during training, it may also be applied at this point:

.. code-block:: python

    from pylearn2.datasets.csv_dataset import CSVDataset
    dataset = CSVDataset(path='data_to_predict.csv',
                         task='classification',
                         expect_headers=True,
                         preprocessor=serial.load("preprocessor.pkl"))

Setting the stage
=================

We need to get the description of the data expected by the
model as input (see :ref:`data_specs` for an overview):

.. code-block:: python

    data_space = model.get_input_space()
    data_source = model.get_input_source()
    data_specs = (data_space, data_source)

We also need a symbolic variable to represent the input and
a Theano function that will compute forward propagation.
The Theano documentation can provide insight into what is
going on here:

.. code-block:: python

    import theano
    X = data_space.make_theano_batch('X')
    predict = theano.function([X], model.fprop(X))

Each dataset is expected to create its own iterators according to
user preferences:

.. code-block:: python

    iter = dataset.iterator(mode='sequential',
                            batch_size=1,
                            data_specs=data_specs)

The size of the batches can be adjusted based on the specifics of the
dataset being used, but ``1`` is a safe bet. If the dataset is of a
reasonable size, the code above may be replaced by:

.. code-block:: python

    iter = dataset.iterator(mode='sequential',
                            batch_size=dataset.get_num_examples(),
                            data_specs=data_specs)

Predictions
===========

With all the pieces in place we can now compute the actual predictions:

.. code-block:: python

    predictions = []
    for item in iter:
        predictions.append(predict(item))

    print(predictions)

The ``predict()`` Theano function is applied to every batch returned
by the iterator, producing a numpy array of predictions for each batch.
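For a classification model, the output of ``fprop`` is typically a batch of
class probabilities rather than class labels. A minimal, self-contained
sketch of the usual post-processing step; the probability arrays below are
made-up stand-ins for the per-batch outputs collected in the loop above:

.. code-block:: python

    import numpy as np

    # Stand-ins for the per-batch outputs collected above: each entry
    # is a (batch_size, n_classes) array of class probabilities.
    predictions = [np.array([[0.1, 0.7, 0.2]]),
                   np.array([[0.8, 0.1, 0.1]])]

    # Stack the batches into a single (n_examples, n_classes) array
    # and take the most probable class for each example.
    probabilities = np.vstack(predictions)
    labels = probabilities.argmax(axis=1)  # array([1, 0])

For regression tasks this step is unnecessary, as the values returned
by the function can be used directly.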