We used a 5-layer Transformer encoder trained with cross-entropy and sub-class ArcFace losses, and ensembled 6 models trained with different combinations of the two losses.
- 129 keypoints are used: 21 for each hand, 40 for the lips, 11 for the pose, 16 for each eye, and 4 for the nose.
- Only the x and y coordinates are used.
- Pipeline: augmentation, frame sampling, normalization, feature engineering, and zero imputation for NaN values.
- Frame sampling: when a sequence exceeds the maximum length, frames are sampled at regular intervals down to the maximum length. This beat center-cropping on local CV and minimized the performance degradation from reducing the maximum length.
```python
# sample frames at a fixed stride when the sequence is longer than max_len
L = len(xyz)
if L > max_len:
    step = (L - 1) // (max_len - 1)
    indices = [i * step for i in range(max_len)]
    xyz = xyz[indices]
    L = len(xyz)
```
- Normalization: x and y are standardized independently.
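A minimal sketch of this step, assuming NumPy arrays of shape (T, V, 2); the write-up does not say whether the statistics are per-sample or dataset-wide, so per-sample is assumed here:

```python
import numpy as np

def standardize_xy(xyz):
    # standardize the x and y channels independently (per-sample statistics assumed)
    for c in range(2):
        mean = np.nanmean(xyz[..., c])
        std = np.nanstd(xyz[..., c])
        xyz[..., c] = (xyz[..., c] - mean) / (std + 1e-6)
    return xyz
```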
Feature Engineering
- Motion: current xy - future xy
- Hand joint distance
- Time-reverse difference: xy - xy.flip(dims=[0])
- +0.004 on CV (a minimal sketch of these features follows)
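A minimal sketch of these features in PyTorch, assuming xyz has shape (T, V, 2) and that HAND_IDX is a placeholder slice for one hand's 21 keypoints:

```python
import torch

HAND_IDX = slice(0, 21)  # placeholder: indices of one hand's 21 keypoints

def make_features(xyz):
    # motion: current xy - future xy (zero-padded at the last frame)
    motion = torch.cat([xyz[:-1] - xyz[1:], torch.zeros_like(xyz[:1])], dim=0)
    # time-reverse difference
    rev_diff = xyz - xyz.flip(dims=[0])
    # pairwise distances between hand joints, shape (T, 21, 21)
    hand = xyz[:, HAND_IDX]
    hand_dist = torch.cdist(hand, hand)
    return motion, rev_diff, hand_dist
```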
- Flip pose:
- x_new = f(-(x - 0.5) + 0.5), where f is the index-change function that swaps left and right keypoint indices.
- +0.01 on Public LB
- xy values are in [0, 1]: shift to the origin by subtracting 0.5 before the flip, then return to the original coordinates by adding 0.5.
- There was no performance improvement without first shifting to the origin (a sketch is shown below).
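A minimal sketch of the flip, where flip_index is an assumed permutation implementing the index-change function f:

```python
def flip_pose(xyz, flip_index):
    # f: swap left/right keypoint indices via an assumed permutation
    xyz = xyz[:, flip_index]
    # shift to the origin, mirror x, shift back to [0, 1]
    xyz[..., 0] = -(xyz[..., 0] - 0.5) + 0.5
    return xyz
```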
- Rotate pose:
```python
import numpy as np

def rotate(xyz, theta):
    # rotate the x/y coordinates around the center (0.5, 0.5) by theta degrees
    radian = np.radians(theta)
    mat = np.array(
        [[np.cos(radian), -np.sin(radian)],
         [np.sin(radian),  np.cos(radian)]]
    )
    xyz[:, :, :2] = xyz[:, :, :2] - 0.5          # shift to the origin (in place)
    xyz_reshape = xyz.reshape(-1, 2)
    xyz_rotate = np.dot(xyz_reshape, mat).reshape(xyz.shape)
    return xyz_rotate[:, :, :2] + 0.5            # shift back to [0, 1]
```
- The rotation angle theta is sampled between -13 and +13 degrees.
- Reference
- Interpolation (up- and down-sampling)
```python
import torch.nn.functional as F

# xyz is assumed to be reshaped to a 4-D tensor (N, C, V * C, T) beforehand,
# since bilinear interpolation requires 4-D input; the temporal axis is
# resized to `resize`
xyz = F.interpolate(
    xyz, size=(V * C, resize), mode="bilinear", align_corners=False
).squeeze()
```
- Up-sampling or down-sampling by up to 25% of the original length
- 5-layer Transformer encoder with weighted cross-entropy
- Class weights are based on per-class performance.
- Accuracy on the LB and CV consistently improved up to 4-5 stacked layers.
- LB => 1 layer (5 seeds): 0.714, 2 layers (5 seeds): 0.738, 3 layers (5 seeds): 0.748, 4 layers (5 seeds): 0.751
- We added dropout after the self-attention and FC layers, based on the official PyTorch code (see the sketch below).
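A minimal sketch of the backbone using PyTorch's stock encoder with the hyperparameters listed below; the dropout rate here is an assumption, and the stock layer already applies dropout after the self-attention and feed-forward sublayers:

```python
import torch.nn as nn

# embed dim 256, 4 heads, 5 layers per the hyperparameter list below
layer = nn.TransformerEncoderLayer(
    d_model=256, nhead=4, dropout=0.1, batch_first=True  # dropout rate assumed
)
encoder = nn.TransformerEncoder(layer, num_layers=5)
```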
- 5-layer Transformer encoder with weighted cross-entropy and sub-class ArcFace loss
- loss = cross entropy + 0.2 * subclass(K=3) ArcFace
- loss = 0.2 * cross entropy + subclass(K=3) ArcFace
- ArcFace
- +0.01 on Public LB
- Using the ArcFace loss alone resulted in worse performance.
- ArcFace combined with cross-entropy converges much faster and to a better result than cross-entropy alone.
- Subclass K=3, margin=0.2, scale=32 (a minimal sketch follows)
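A minimal sketch of the sub-class (sub-center) ArcFace head, assuming the usual formulation that keeps the best of the K sub-center cosines per class; only the hyperparameters (K=3, margin=0.2, scale=32) come from the write-up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubclassArcFace(nn.Module):
    def __init__(self, dim, num_classes, k=3, margin=0.2, scale=32.0):
        super().__init__()
        self.k, self.margin, self.scale = k, margin, scale
        # K sub-center vectors per class
        self.weight = nn.Parameter(torch.empty(num_classes * k, dim))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, emb, labels):
        # cosine similarity to every sub-center, then keep the best of K per class
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        cos = cos.view(emb.size(0), -1, self.k).max(dim=-1).values
        # add the angular margin only on the target class
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        onehot = F.one_hot(labels, num_classes=cos.size(1)).float()
        logits = self.scale * torch.cos(theta + self.margin * onehot)
        return F.cross_entropy(logits, labels)
```

Per the loss definitions above, this term would be combined with the cross-entropy head, e.g. loss = ce + 0.2 * arcface.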
- Scheduled Dropout
- +0.002 on CV
- The dropout rate on the final [CLS] token is doubled after half of the training epochs (see the snippet below).
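A minimal sketch of the schedule; model.cls_dropout is a hypothetical nn.Dropout applied to the [CLS] token:

```python
# inside the training loop; model.cls_dropout is an assumed nn.Dropout module
if epoch == num_epochs // 2:
    model.cls_dropout.p *= 2  # double the [CLS] dropout for the second half of training
```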
- Label Smoothing
- parameter: 0.2
- +0.01 on CV
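Both settings plug directly into PyTorch's loss; class_weights below is an assumed per-class weight tensor:

```python
import torch.nn as nn

# class_weights: assumed per-class weight tensor derived from per-class performance
criterion = nn.CrossEntropyLoss(weight=class_weights, label_smoothing=0.2)
```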
- Hyperparameters
- Epochs: 140
- Max length: 64
- Batch size: 64
- Embed dim: 256
- Num heads: 4
- Num layers: 5
- CosineAnnealingWarmRestarts w/ lr 1e-3 and AdamW
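A sketch of the optimizer setup; the restart period T_0 is not given in the write-up and is a placeholder:

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10  # restart period: placeholder, not specified in the write-up
)
```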
- 2 different seeds of the Transformer with weighted cross-entropy
- Single model LB: 0.75
- 2 different seeds of weighted cross-entropy + 0.2 * subclass(K=3) ArcFace
- Inference: weighted ensemble of the cross-entropy and ArcFace heads.
- Single model LB: 0.76
- 2 different seeds of 0.2 * cross-entropy + subclass(K=3) ArcFace
- Inference: weighted ensemble of the cross-entropy and ArcFace heads.
- Single model LB: 0.75
- 6-model ensemble Public LB: 0.77+
- All models are fp16.
- Total size: 20 MB
- Latency: 60 ms/sample
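A sketch of the weighted soft-voting at inference, assuming each model returns logits; the per-model weights are assumptions:

```python
import torch

@torch.no_grad()
def ensemble_predict(models, weights, x):
    # weighted average of softmax outputs over the 6 fp16 models
    probs = sum(w * m(x).softmax(dim=-1) for w, m in zip(weights, models))
    return probs.argmax(dim=-1)
```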
Things that did not work or were not fully validated
- TTA
- +0.000x on CV
- Submission scoring error; it might have been a memory issue.
- Angle between the bones of each arm
- 0.000x on 1 fold; we couldn't fully validate it due to time constraints.
- GCN embedding layer instead of Linear
- Stacking Spatial Attention & Temporal Conv. blocks
- Distance between pose keypoints
- Removing outliers and retraining
- We used the angle between learned ArcFace subclass vectors.
- About 5% of the training samples (~4,000) were removed.
- Knowledge distillation with a bigger Transformer
- Stochastic Weight Averaging
- Using the [CLS] tokens from every layer
- Averaging all tokens instead of the [CLS] token
- Stacking with an MLP as a meta-learner