# [fit] Deep Learning Practical Guidance
_**Getting Started with Image & Text**_
<br>
<br>
<br>


**Amit Kapoor** [@amitkaps](http://amitkaps.com)
**Bargava Subramanian** [@bargava](http://bargava.com)
**Anand Chitipothu** [@anandology](http://anandology.com)

---

# Bootcamp Approach

- **Domain**: Image & Text
- **Applied**: Proven & Practical
- **Intuition**: Visualisation & Analogies
- **Code**: Learning by Doing
- **Math**: Attend HackerMath!

---

# Learning Paradigm



---

# Learning Types & Applications

- **Supervised**: Regression, Classification, ...
- Unsupervised: Dimensionality Reduction, Clustering, ...
- Self (Semi)-supervised: Auto-encoders, Generative Adversarial Networks, ...
- Reinforcement Learning: Games, Self-Driving Cars, Robotics, ...

---

# Focus: Supervised Learning
- **Classification**: Image, Text, Speech, Translation
- Sequence generation: Given a picture, predict a caption describing it.
- Syntax tree prediction: Given a sentence, predict its decomposition into a syntax tree.
- Object detection: Given a picture, draw a bounding box around certain objects inside the picture.
- Image segmentation: Given a picture, draw a pixel-level mask on a specific object.

---

# Learning Approach

<br>



---

# Data Representation: Tensors

- Numpy arrays (aka Tensors)
- Generalised form of a matrix (2D array)
- Attributes
  - Axes or Rank: `ndim`
  - Dimensions: `shape` e.g. (5, 3)
  - Data Type: `dtype` e.g. `float32`, `uint8`, `float64`

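A quick check of these attributes on a small numpy array (a minimal sketch; the values are made up):

```python
import numpy as np

# A 2D tensor (matrix) with 5 samples and 3 features
x = np.array([[5.0, 78.0, 2.0],
              [6.0, 79.0, 3.0],
              [7.0, 80.0, 4.0],
              [8.0, 81.0, 5.0],
              [9.0, 82.0, 6.0]], dtype="float32")

print(x.ndim)   # 2        -> rank (number of axes)
print(x.shape)  # (5, 3)   -> (samples, features)
print(x.dtype)  # float32  -> data type
```
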
---

# Tensor Types

- **Scalar**: 0D Tensor
- **Vector**: 1D Tensor
- **Matrix**: 2D Tensor
- **Higher-order**: 3D, 4D or 5D Tensor

---

# Input $$X$$

| Tensor | Example | Shape |
|:-------|:---------|:----------------------|
| 2D | Tabular | (samples, features) |
| 3D | Sequence | (samples, steps, features) |
| 4D | Images | (samples, height, width, channels) |
| 5D | Videos | (samples, frames, height, width, channels) |

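For instance, a batch of 64 grayscale 28x28 images is a 4D tensor. A sketch with illustrative sizes:

```python
import numpy as np

# Illustrative batch shapes matching the table above
tabular  = np.zeros((100, 20))               # 2D: (samples, features)
sequence = np.zeros((100, 50, 300))          # 3D: (samples, steps, features)
images   = np.zeros((64, 28, 28, 1))         # 4D: (samples, height, width, channels)
videos   = np.zeros((4, 240, 144, 256, 3))   # 5D: (samples, frames, height, width, channels)

print(images.ndim, images.shape)             # 4 (64, 28, 28, 1)
```
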
---

# Learning Unit

$$ y = \mathrm{ReLU}(w \cdot x + b) $$
**weights** are $$ w_1 \ldots w_n $$, **bias** is $$ b $$ & **activation** is ReLU: $$ f(z) = \max(z, 0) $$



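The same unit written out in numpy (a minimal sketch; the input, weights and bias are made-up values):

```python
import numpy as np

def relu(z):
    # ReLU activation: f(z) = max(z, 0), applied element-wise
    return np.maximum(z, 0)

x = np.array([0.5, -1.2, 3.0])   # input features (illustrative)
w = np.array([0.1, 0.4, -0.2])   # weights w_1 ... w_n (illustrative)
b = 0.05                         # bias

y = relu(np.dot(w, x) + b)       # output of one learning unit
print(y)
```
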
---

# Model Architecture

Basic Model: **Sequential** - A linear stack of layers.

Core Layers
- Dense: a fully connected layer of learning units (a stack of Dense layers is also called a Multi-Layer Perceptron)
- Flatten: collapses each sample into a 1D feature vector (e.g. before a Dense layer)

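A minimal Sequential sketch using these layers (assuming tf.keras; input shape and layer sizes are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

# A linear stack of layers: Flatten -> Dense -> Dense
model = keras.Sequential([
    layers.Flatten(input_shape=(28, 28, 1)),   # image sample -> 1D feature vector
    layers.Dense(128, activation="relu"),      # fully connected hidden layer
    layers.Dense(10, activation="softmax"),    # output layer for 10 classes
])
model.summary()
```
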
---

# Output $$y$$ & Loss

| $$y$$ | Last-Layer Activation | Loss Function |
|:-------|:---------|:----------------------|
| Binary Class | sigmoid | Binary Crossentropy |
| Multi Class | softmax | Categorical Crossentropy |
| Multi Class, Multi Label | sigmoid | Binary Crossentropy |
| Regression | None | Mean Squared Error |
| Regression (0-1) | sigmoid | MSE or Binary Crossentropy |

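For example, a multi-class classifier ends in a softmax layer and is compiled with categorical crossentropy (a sketch assuming tf.keras and one-hot encoded labels; sizes are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10  # illustrative

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(100,)),
    layers.Dense(num_classes, activation="softmax"),   # multi-class -> softmax
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",         # pairs with softmax + one-hot labels
              metrics=["accuracy"])
```
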
---

# Optimizers

- **SGD**: Excellent, but requires tuning the learning rate, decay and momentum parameters
- **RMSProp**: Good for RNNs
- **Adam**: Adaptive moment estimation; generally a good starting point.

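Configured explicitly in tf.keras (a sketch; the hyperparameter values are illustrative, not tuned recommendations):

```python
from tensorflow.keras import optimizers

# Adam: a good default starting point
adam = optimizers.Adam(learning_rate=0.001)

# SGD: usually needs tuning of the learning rate, momentum (and a decay schedule)
sgd = optimizers.SGD(learning_rate=0.01, momentum=0.9)

# RMSprop: often a good fit for recurrent networks
rmsprop = optimizers.RMSprop(learning_rate=0.001)

# model.compile(optimizer=adam, loss=..., metrics=[...])
```
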
---

# Guidance for DL
> *General guidance on building and training neural networks. Treat these as heuristics (derived from experimentation) and as good starting points for your own explorations.*

---

# Pre-Processing
- **Normalize** / **Whiten** your data (not for text!)
- **Scale** your data appropriately (watch out for outliers)
- Handle **Missing Values**: set them to 0 (and make sure 0 also appears as a missing value in training, so the network learns to ignore it)
- Create a **Training & Validation Split**
- Use a **Stratified** split for multi-class data
- **Shuffle** non-sequence data. Be careful with sequence data!

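A sketch of these steps with numpy and scikit-learn (the data, shapes and split ratio are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 20)               # illustrative tabular features
y = np.random.randint(0, 3, size=1000)     # illustrative multi-class labels

# Stratified, shuffled training/validation split
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, shuffle=True, random_state=42)

# Normalize: zero mean, unit variance per feature, using training statistics only
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train = (X_train - mean) / std
X_val = (X_val - mean) / std                # reuse the training mean/std on validation data
```
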
---

# General Architecture
- Use the **Adam** optimizer (to start with)
- Use **ReLU** for non-linear activation (learns faster than most alternatives)
- Add a **Bias** term to each layer
- Use **Xavier** or **Variance-Scaling** initialisation (better than plain random initialisation)
- Refer to the earlier output-layer activation & loss-function guidance for your task

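In tf.keras this looks like the sketch below (`glorot_uniform` is Xavier initialisation; `he_normal` is a variance-scaling scheme that pairs well with ReLU; the layer width is illustrative):

```python
from tensorflow.keras import layers

# Dense layer with ReLU, a bias term, and explicit initialisation
dense = layers.Dense(
    64,
    activation="relu",
    use_bias=True,                      # bias is on by default
    kernel_initializer="he_normal",     # variance-scaling init, suited to ReLU
    bias_initializer="zeros",
)
```
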
---

# Dense / MLP Architecture
- The number of units typically decreases in deeper layers
- Units are typically powers of two, $$2^n$$
- Don't use more than 4-5 layers in dense networks

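Putting the last two slides together, a small MLP sketch (assuming tf.keras; the input shape and widths are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Units shrink layer by layer: 256 -> 128 -> 64, all powers of two
model = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```
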
---

# CNN Architecture (for Images)
- Increase the number of **convolution filters** as you go deeper, e.g. from 32 to 64 or 128 (max)
- Use **Pooling** to subsample: makes the representation more robust to translation, scaling and rotation
- Use **pre-trained models** as *feature extractors* for similar tasks
- Progressively **unfreeze and train the last n layers** if the model is not learning
- **Image Augmentation** is key for small datasets and for faster learning

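A small CNN sketch following this pattern (assuming tf.keras; filters grow 32 -> 64 -> 128 with pooling in between; the input shape is illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),                   # subsample
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
```
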
---

# RNN / CNN Architecture (for NLP)

- The **Embedding** layer is critical. **Words** are better than **Characters**
- Learn the embedding with the task, or use a pre-trained embedding as a starting point
- Prefer LSTM / BiLSTM over a Simple RNN. Remember, RNNs are really slow to train
- Experiment with 1D CNNs, using a larger kernel size (7 or 9) than for images
- An MLP over bi-grams can work for many simple tasks

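A text-classification sketch along these lines (assuming tf.keras; the vocabulary size and embedding dimensions are illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size = 20000   # illustrative

model = keras.Sequential([
    layers.Embedding(vocab_size, 128),          # learn 128-d word embeddings with the task
    layers.Bidirectional(layers.LSTM(64)),      # BiLSTM over the token sequence
    layers.Dense(1, activation="sigmoid"),      # binary classification head
])

# 1D-CNN alternative: swap the BiLSTM for a wider-kernel Conv1D, e.g.
# layers.Conv1D(128, kernel_size=7, activation="relu"), layers.GlobalMaxPooling1D()
```
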
---

# Learning Process
- **Validation Process**
  - Large Data: Hold-Out Validation
  - Smaller Data: K-Fold (Stratified) Validation
- **For Underfitting**
  - Add more layers: **go deeper**
  - Make the layers bigger: **go wider**
  - Train for more epochs

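For smaller datasets, stratified k-fold validation with scikit-learn looks roughly like this (a sketch; the data and the small model are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(500, 20)               # illustrative features
y = np.random.randint(0, 3, size=500)     # illustrative labels for 3 classes

def build_model():
    # Fresh model per fold so folds don't share learned weights
    model = keras.Sequential([
        layers.Dense(64, activation="relu", input_shape=(20,)),
        layers.Dense(3, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

scores = []
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kfold.split(X, y):
    model = build_model()
    model.fit(X[train_idx], y[train_idx], epochs=5, verbose=0)
    _, acc = model.evaluate(X[val_idx], y[val_idx], verbose=0)
    scores.append(acc)

print("mean validation accuracy:", np.mean(scores))
```
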
---

# Learning Process
- **For Overfitting**
  - Get **more training data** (e.g. actual new data or image augmentation)
  - Reduce **Model Capacity**
  - Add **weight regularisation** (e.g. L1, L2)
  - Add **Dropout** or use **Batch Normalization**

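A sketch of weight regularisation and dropout in tf.keras (the dropout rate and L2 factor are illustrative, not tuned values):

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(100,),
                 kernel_regularizer=regularizers.l2(0.001)),   # L2 weight regularisation
    layers.Dropout(0.5),                                       # randomly drop 50% of activations
    layers.Dense(64, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
```
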