This repository was archived by the owner on Jul 7, 2023. It is now read-only.

Commit 4ffae90

Merge pull request #318 from rsepassi/push

v1.2.3

2 parents 1b6905f + 76706ef


57 files changed: +2217 -928 lines

README.md

Lines changed: 2 additions & 2 deletions
@@ -8,6 +8,7 @@ Issues](https://img.shields.io/github/issues/tensorflow/tensor2tensor.svg)](http
 welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CONTRIBUTING.md)
 [![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/tensor2tensor/Lobby)
 [![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)
+[![Travis](https://img.shields.io/travis/tensorflow/tensor2tensor.svg)](https://travis-ci.org/tensorflow/tensor2tensor)

 [T2T](https://github.com/tensorflow/tensor2tensor) is a modular and extensible
 library and binaries for supervised learning with TensorFlow and with support
@@ -123,8 +124,7 @@ t2t-decoder \
   --model=$MODEL \
   --hparams_set=$HPARAMS \
   --output_dir=$TRAIN_DIR \
-  --decode_beam_size=$BEAM_SIZE \
-  --decode_alpha=$ALPHA \
+  --decode_hparams="beam_size=$BEAM_SIZE,alpha=$ALPHA" \
   --decode_from_file=$DECODE_FILE

 cat $DECODE_FILE.$MODEL.$HPARAMS.beam$BEAM_SIZE.alpha$ALPHA.decodes

docs/example_life.md

Lines changed: 179 additions & 16 deletions
@@ -9,26 +9,189 @@ welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](CO
 [![Gitter](https://img.shields.io/gitter/room/nwjs/nw.js.svg)](https://gitter.im/tensor2tensor/Lobby)
 [![License](https://img.shields.io/badge/License-Apache%202.0-brightgreen.svg)](https://opensource.org/licenses/Apache-2.0)

-This document show how a training example passes through the T2T pipeline,
-and how all its parts are connected to work together.
+This doc explains how a training example flows through T2T, from data generation
+to training, evaluation, and decoding. It points out the various hooks available
+in the `Problem` and `T2TModel` classes and gives an overview of the T2T code
+(key functions, files, hyperparameters, etc.).

-## The Life of an Example
+Some key files and their functions:

-A training example passes the following stages in T2T:
-* raw input (text from command line or file)
-* encoded input after [Problem.feature_encoder](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/problem.py#L173) function `encode` is usually a sparse tensor, e.g., a vector of `tf.int32`s
-* batched input after [data input pipeline](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/data_reader.py#L242) where the inputs, after [Problem.preprocess_examples](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/problem.py#L188) are grouped by their length and made into batches.
-* dense input after being processed by a [Modality](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/modality.py#L30) function `bottom`.
-* dense output after [T2T.model_fn_body](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/t2t_model.py#L542)
-* back to sparse output through [Modality](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/modality.py#L30) function `top`.
-* if decoding, back through [Problem.feature_encoder](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/problem.py#L173) function `decode` to display on the screen.
+* [`trainer_utils.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/trainer_utils.py):
+  Constructs and runs all the main components of the system (the `Problem`,
+  the `HParams`, the `Estimator`, the `Experiment`, the `input_fn`s and the
+  `model_fn`).
+* [`common_hparams.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/layers/common_hparams.py):
+  `basic_params1` serves as the base for all model hyperparameters. Registered
+  model hparams functions always start with this default set of
+  hyperparameters.
+* [`problem.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem.py):
+  Every dataset in T2T subclasses `Problem`.
+* [`t2t_model.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/t2t_model.py):
+  Every model in T2T subclasses `T2TModel`.

-We go into these phases step by step below.
+## Data Generation

-## Feature Encoders
+The `t2t-datagen` binary is the entry point for data generation. It simply looks
+up the `Problem` specified by `--problem` and calls
+`Problem.generate_data(data_dir, tmp_dir)`.

-TODO: describe [Problem.feature_encoder](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/data_generators/problem.py#L173) which is a dict of encoders that have `encode` and `decode` functions.
+All `Problem`s are expected to generate 2 sharded `TFRecords` files - 1 for
+training and 1 for evaluation - with `tensorflow.Example` protocol buffers. The
+expected names of the files are given by `Problem.{training, dev}_filepaths`.
+Typically, the features in the `Example` will be `"inputs"` and `"targets"`;
+however, some tasks have a different on-disk representation that is converted to
+`"inputs"` and `"targets"` online in the input pipeline (e.g. image features are
+typically stored with features `"image/encoded"` and `"image/format"`, and the
+decoding happens in the input pipeline).

-## Modalities
+For tasks that require a vocabulary, this is also the point at which the
+vocabulary is generated and all examples are encoded.

-TODO: describe [Modality](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/modality.py#L30) which has `bottom` and `top` but also sharded versions and one for targets.
+There are several utility functions in
+[`generator_utils`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/generator_utils.py)
+that are commonly used by `Problem`s to generate data. Several are highlighted
+below, with an illustrative sketch after the list:
+
+* `generate_dataset_and_shuffle`: given 2 generators, 1 for training and 1 for
+  eval, yielding dictionaries of `<feature name, list<int or float or
+  string>>`, will produce sharded and shuffled `TFRecords` files with
+  `tensorflow.Example` protos.
+* `maybe_download`: downloads a file at a URL to the given directory and
+  filename (see `maybe_download_from_drive` if the URL points to Google
+  Drive).
+* `get_or_generate_vocab_inner`: given a target vocabulary size and a
+  generator that yields lines or tokens from the dataset, will build a
+  `SubwordTextEncoder` along with a backing vocabulary file that can be used
+  to map input strings to lists of ids.
+  [`SubwordTextEncoder`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/text_encoder.py)
+  uses word pieces, and its encoding is fully invertible.
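A minimal sketch of how these helpers compose (not part of this commit; the
problem name, corpus URL, file names, vocab size, and exact helper signatures
are illustrative):

```python
# Hypothetical sketch: a Problem subclass wiring together the
# generator_utils helpers listed above. MyTextProblem, the URL, the
# file names, and the vocab size are made up for illustration.
from tensor2tensor.data_generators import generator_utils
from tensor2tensor.data_generators import problem
from tensor2tensor.utils import registry


@registry.register_problem
class MyTextProblem(problem.Problem):
  """Toy text-to-text problem for illustration."""

  def generator(self, tmp_dir, train):
    # Fetch the corpus once (placeholder URL).
    path = generator_utils.maybe_download(
        tmp_dir, "corpus.txt", "http://example.com/corpus.txt")
    # Build (or reload) a subword vocabulary from the corpus.
    vocab = generator_utils.get_or_generate_vocab_inner(
        tmp_dir, "vocab.my_text.%d" % 2**15, 2**15,
        (line.strip() for line in open(path)))
    # Yield one dict of <feature name, list<int>> per example.
    with open(path) as f:
      for line in f:
        source, target = line.strip().split("\t")
        yield {"inputs": vocab.encode(source),
               "targets": vocab.encode(target)}

  def generate_data(self, data_dir, tmp_dir, task_id=-1):
    generator_utils.generate_dataset_and_shuffle(
        self.generator(tmp_dir, train=True),
        self.training_filepaths(data_dir, 10, shuffled=False),
        self.generator(tmp_dir, train=False),
        self.dev_filepaths(data_dir, 1, shuffled=False))
```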
+## Data Input Pipeline
+
+Once the data is produced on disk, training, evaluation, and inference (if
+decoding from the dataset) consume it by way of the T2T input pipeline. This
+section gives an overview of that pipeline, with specific attention to the
+various hooks in the `Problem` class and in the model's `HParams` object
+(typically registered in the model's file and specified by the `--hparams_set`
+flag).
+
+The entire input pipeline is implemented with the new `tf.data.Dataset` API
+(previously `tf.contrib.data.Dataset`).
+
+The key function in the codebase for the input pipeline is
+[`data_reader.input_pipeline`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/data_reader.py).
+The full input function is built in
+[`input_fn_builder.build_input_fn`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/input_fn_builder.py)
+(which calls `data_reader.input_pipeline`).
+
+### Reading and decoding data
+
+`Problem.dataset_filename` specifies the prefix of the files on disk (they will
+be suffixed with `-train` or `-dev`, as well as their sharding).
+
+The features read from the files, and their decoding, are specified by
+`Problem.example_reading_spec`, which returns 2 items:
+
+1. Dict mapping on-disk feature names to on-disk types (`VarLenFeature` or
+   `FixedLenFeature`).
+2. Dict mapping output feature names to decoders. This return value is optional
+   and is only needed for tasks whose features may require additional decoding
+   (e.g. images). You can find the available decoders in
+   `tf.contrib.slim.tfexample_decoder`.
+
+At this point in the input pipeline, the example is a `dict<feature name,
+Tensor>`.
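For a plain text problem, an `example_reading_spec` override might look like the
following sketch (this mirrors the common text default; the decoder dict is
`None` because text needs no extra decoding):

```python
# Sketch of Problem.example_reading_spec for a text problem.
import tensorflow as tf


def example_reading_spec(self):
  data_fields = {
      "inputs": tf.VarLenFeature(tf.int64),
      "targets": tf.VarLenFeature(tf.int64),
  }
  data_items_to_decoders = None  # no extra decoding needed for text
  return data_fields, data_items_to_decoders
```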
+### Preprocessing
+
+The read `Example` now runs through `Problem.preprocess_example`, which by
+default runs
+[`problem.preprocess_example_common`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/data_generators/problem.py),
+which may truncate the inputs/targets or prepend to targets, governed by some
+hyperparameters.
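A `Problem` can hook in extra per-example processing here. A sketch, reusing the
hypothetical `MyTextProblem` from above (the truncation mirrors what
`preprocess_example_common` does with its hyperparameters):

```python
# Sketch: extra per-example preprocessing before the common defaults run.
def preprocess_example(self, example, mode, hparams):
  # Truncate overly long inputs (mirrors preprocess_example_common).
  if hparams.max_input_seq_length > 0:
    example["inputs"] = example["inputs"][:hparams.max_input_seq_length]
  return super(MyTextProblem, self).preprocess_example(example, mode, hparams)
```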
+### Batching
+
+Examples are bucketed by sequence length and then batched out of those buckets.
+This significantly improves performance over a naive batching scheme for
+variable-length sequences because each example in a batch must be padded to
+match the longest example in the batch.
+
+Several hyperparameters affect how examples are batched together (a sketch of
+the batch-size arithmetic follows the list):
+
+* `hp.batch_size`: the approximate total number of tokens in the batch
+  (i.e. for a sequence problem, long sequences will have a smaller actual
+  batch size and short sequences a larger one, so that batches generally
+  contain an equal number of tokens).
+* `hp.max_length`: sequences longer than this will be dropped during training
+  (and also during eval if `hp.eval_drop_long_sequences` is `True`). If not
+  set, the maximum example length is `hp.batch_size`.
+* `hp.batch_size_multiplier`: multiplier for the maximum length.
+* `hp.min_length_bucket`: example length for the smallest bucket (i.e. the
+  smallest bucket holds examples up to this length).
+* `hp.length_bucket_step`: controls how spaced out the length buckets are.
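The numbers below are illustrative only, but they show why token-based batching
keeps per-step work roughly constant across buckets:

```python
# Illustrative arithmetic (not T2T code): with token-based batching, the
# number of examples per batch for a bucket is roughly
# hp.batch_size // bucket_max_length.
batch_size_tokens = 4096
for bucket_max_length in (8, 32, 128, 256):
  print(bucket_max_length, batch_size_tokens // bucket_max_length)
# 8 -> 512 examples/batch, 32 -> 128, 128 -> 32, 256 -> 16
```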
+## Building the Model
+
+At this point, the input features typically contain `"inputs"` and `"targets"`,
+each of which is a batched 4-D Tensor (e.g. of shape `[batch_size,
+sequence_length, 1, 1]` for text input or `[batch_size, height, width, 3]` for
+image input).
+
+A `T2TModel` is composed of transforms of the input features by `Modality`s,
+then the body of the model, then a transform of the model output to predictions
+by a `Modality`, and then a loss (during training).
+
+The `Modality` types for the various input features and for the target are
+specified in `Problem.hparams`. A `Modality` is a feature adapter that enables
+models to be agnostic to input/output spaces. You can see the various
+`Modality`s in
+[`modalities.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/layers/modalities.py).
+
+The sketch structure of a T2T model is as follows:
+
+```python
+features = {...}  # output from the input pipeline
+input_modality = ...  # specified in Problem.hparams
+target_modality = ...  # specified in Problem.hparams
+
+transformed_features = {}
+transformed_features["inputs"] = input_modality.bottom(
+    features["inputs"])
+transformed_features["targets"] = target_modality.targets_bottom(
+    features["targets"])  # for autoregressive models
+
+body_outputs = model.model_fn_body(transformed_features)
+
+predictions = target_modality.top(body_outputs, features["targets"])
+loss = target_modality.loss(predictions, features["targets"])
+```
+
+Most `T2TModel`s only override `model_fn_body`; a sketch of such a subclass
+follows.
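A minimal subclass, as a sketch (the registration decorator and the
`model_fn_body` signature follow the T2T pattern; the model name and the single
layer are made up):

```python
# Hypothetical sketch of a T2TModel subclass overriding only model_fn_body.
import tensorflow as tf

from tensor2tensor.utils import registry
from tensor2tensor.utils import t2t_model


@registry.register_model
class MySimpleModel(t2t_model.T2TModel):

  def model_fn_body(self, features):
    # features["inputs"] has already been embedded by the input
    # modality's `bottom`; the target modality's `top` will later turn
    # this body output into logits.
    return tf.layers.dense(
        features["inputs"], self._hparams.hidden_size, name="body")
```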
+## Training, Eval, Inference modes
+
+Both the input function and the model function take a mode in the form of a
+`tf.estimator.ModeKeys`, which allows them to behave differently in different
+modes.
+
+In training, the model function constructs an optimizer and minimizes the loss.
+
+In evaluation, the model function constructs the evaluation metrics specified
+by `Problem.eval_metrics`.
+
+In inference, the model function outputs predictions.
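The typical branching looks roughly like this sketch (an `Estimator`-style
`model_fn`; `build_model` is a hypothetical stand-in for the model construction
sketched above):

```python
# Sketch of mode-dependent behavior in an Estimator model_fn.
import tensorflow as tf


def model_fn(features, labels, mode, params):
  loss, predictions = build_model(features, params)  # hypothetical helper
  if mode == tf.estimator.ModeKeys.TRAIN:
    train_op = tf.train.AdamOptimizer().minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
  if mode == tf.estimator.ModeKeys.EVAL:
    # eval_metric_ops would hold the metrics from Problem.eval_metrics.
    return tf.estimator.EstimatorSpec(mode, loss=loss)
  return tf.estimator.EstimatorSpec(mode, predictions=predictions)
```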
+## `Estimator` and `Experiment`
+
+With the input function and model function constructed, the actual training
+loop and related services (checkpointing, summaries, continuous evaluation,
+etc.) are all handled by `Estimator` and `Experiment` objects, constructed in
+[`trainer_utils.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/trainer_utils.py).
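The wiring reduces to something like this sketch (assuming the `model_fn` from
above plus `train_input_fn`/`eval_input_fn` built by the input pipeline; step
counts are illustrative):

```python
# Sketch of the Estimator/Experiment wiring done by trainer_utils.
import tensorflow as tf
from tensorflow.contrib.learn import Experiment

estimator = tf.estimator.Estimator(
    model_fn=model_fn, model_dir="/tmp/t2t_train")
experiment = Experiment(
    estimator=estimator,
    train_input_fn=train_input_fn,
    eval_input_fn=eval_input_fn,
    train_steps=250000,  # illustrative
    eval_steps=100)      # illustrative
experiment.train_and_evaluate()
```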
+## Decoding
+
+* [`decoding.py`](https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor/utils/decoding.py)
+
+TODO(rsepassi): Explain decoding (interactive, from file, and from dataset) and
+`Problem.feature_encoders`.

docs/index.md

Lines changed: 2 additions & 7 deletions
@@ -24,11 +24,6 @@ documentation, from basic tutorials to full code documentation.

 ## Deep Dive

-* [Life of an Example](example_life.md): how all parts of T2T are connected and work together
+* [Life of an Example](example_life.md): how all parts of T2T are connected and
+  work together
 * [Distributed Training](distributed_training.md)
-
-## Code documentation
-
-See our
-[README](https://github.com/tensorflow/tensor2tensor/blob/master/README.md)
-for now, code docs coming.
