[简体中文](../../../zh-CN/model_zoo/localization/yowo.md) | English

# YOWO

## Content

- [Introduction](#Introduction)
- [Data](#Data)
- [Train](#Train)
- [Test](#Test)
- [Inference](#Inference)
- [Reference](#Reference)


## Introduction

YOWO is a single-stage network with two branches. One branch extracts spatial features of the key frame (i.e., the current frame) via 2D-CNN, while the other branch acquires spatio-temporal features of the clip consisting of previous frames via 3D-CNN. To accurately aggregate these features, YOWO uses a channel fusion and attention mechanism that maximizes the inter-channel dependencies. Finally, the fused features are subjected to frame-level detection.


<div align="center">
<img src="../../../images/yowo.jpg">
</div>
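
To make the description above more concrete, here is a minimal, simplified sketch of the channel fusion and attention step written with PaddlePaddle. It is for illustration only and is not the repository's exact implementation; the tensor shapes and the Gram-matrix form of the channel attention are assumptions based on the paragraph above.

```python
import paddle
import paddle.nn.functional as F

def channel_fusion_attention(feat_2d, feat_3d):
    """Simplified sketch: fuse 2D key-frame features with 3D clip features.

    feat_2d: (B, C_2d, H, W) spatial features of the key frame
    feat_3d: (B, C_3d, H, W) spatio-temporal clip features (temporal dim squeezed)
    """
    # 1. Fuse the two branches along the channel dimension
    fused = paddle.concat([feat_2d, feat_3d], axis=1)       # (B, C, H, W)
    b, c, h, w = fused.shape

    # 2. Channel attention: Gram matrix over flattened spatial positions
    flat = fused.reshape([b, c, h * w])                     # (B, C, N)
    gram = paddle.bmm(flat, flat.transpose([0, 2, 1]))      # (B, C, C) inter-channel dependencies
    attn = F.softmax(gram, axis=-1)

    # 3. Re-weight the channels and add the result back to the fused features
    out = paddle.bmm(attn, flat).reshape([b, c, h, w])
    return fused + out

# Dummy example; the channel counts are illustrative only
f2d = paddle.randn([1, 425, 7, 7])
f3d = paddle.randn([1, 2048, 7, 7])
print(channel_fusion_attention(f2d, f3d).shape)  # [1, 2473, 7, 7]
```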


## Data

For UCF101-24 data download and preparation, please refer to [UCF101-24 data preparation](../../dataset/ucf24.md).


## Train

### UCF101-24 dataset training

#### Download and add pre-trained models

1. Download the pre-trained models [resnext-101-kinetics](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/resnext101_kinetics.pdparams) and [darknet](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/darknet.pdparam) as backbone initialization parameters, or download them with the wget commands below:

    ```bash
    wget -nc https://videotag.bj.bcebos.com/PaddleVideo-release2.3/darknet.pdparam
    wget -nc https://videotag.bj.bcebos.com/PaddleVideo-release2.3/resnext101_kinetics.pdparams
    ```

2. Open `PaddleVideo/configs/localization/yowo.yaml` and fill in the paths of the downloaded weights after `pretrained_2d:` and `pretrained_3d:` respectively:

    ```yaml
    MODEL:
        framework: "YOWOLocalizer"
        backbone:
            name: "YOWO"
            num_class: 24
            pretrained_2d: fill in the path of the 2D pre-trained model here
            pretrained_3d: fill in the path of the 3D pre-trained model here
    ```

#### Start training

- Training on the UCF101-24 dataset uses a single card. The start command is as follows:

    ```bash
    python3 main.py -c configs/localization/yowo.yaml --validate --seed=1
    ```

- Turn on AMP mixed-precision training to speed up the training process. The start command is as follows:

    ```bash
    python3 main.py --amp -c configs/localization/yowo.yaml --validate --seed=1
    ```

- In addition, you can customize and modify the parameter configuration to train/test on different datasets. It is recommended to name configuration files in the form `model_dataset name_file format_data format_sampling method.yaml`. Please refer to [config](../../tutorials/config.md) for parameter usage.


## Test

- The YOWO model is validated synchronously during training. You can find the keyword `best` in the training log to obtain the model accuracy. An example log entry is as follows:

    ```
    Already save the best model (fsocre)0.8779
    ```

- The metric used in the YOWO model's test mode is **Frame-mAP (@ IoU 0.5)**, which differs from the **fscore** used for validation during training, so the `fscore` recorded in the training log does not represent the final test score. After training is completed, use test mode to evaluate the best model and obtain the final metric. The command is as follows:

    ```bash
    python3 main.py -c configs/localization/yowo.yaml --test --seed=1 -w 'output/YOWO/YOWO_epoch_00005.pdparams'
    ```


    When the test configuration uses the following parameters, the test metrics on the UCF101-24 validation set are as follows:

    | Model | 3D-CNN backbone | 2D-CNN backbone | Dataset | Input | Frame-mAP <br>(@ IoU 0.5) | checkpoints |
    | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: | :-----------: |
    | YOWO | 3D-ResNeXt-101 | Darknet-19 | UCF101-24 | 16 frames, d=1 | 80.94 | [YOWO.pdparams](https://videotag.bj.bcebos.com/PaddleVideo-release2.3/YOWO_epoch_00005.pdparams) |


## Inference

### Export inference model

```bash
python3 tools/export_model.py -c configs/localization/yowo.yaml -p 'output/YOWO/YOWO_epoch_00005.pdparams'
```

The above command will generate the model structure file `YOWO.pdmodel` and the model weight file `YOWO.pdiparams` required for prediction.

- For the meaning of each parameter, please refer to [Model Inference Methods](../../usage.md#2-infer)

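As a supplement to `tools/predict.py` (used in the next subsection), the exported files can also be loaded directly with the Paddle Inference API. The snippet below is only a minimal sketch: the dummy input, its layout `(N, C, T, H, W) = (1, 3, 16, 224, 224)`, and the absence of real preprocessing are assumptions for illustration and may not match the exported model exactly.

```python
import numpy as np
from paddle.inference import Config, create_predictor

# Build a predictor from the exported model structure and weights
config = Config("./inference/YOWO.pdmodel", "./inference/YOWO.pdiparams")
config.disable_gpu()  # or config.enable_use_gpu(8000, 0) if a GPU is available
predictor = create_predictor(config)

# Feed one dummy clip; a real pipeline would decode and normalize 16 frames here.
clip = np.random.rand(1, 3, 16, 224, 224).astype("float32")  # assumed layout, for illustration

input_name = predictor.get_input_names()[0]
predictor.get_input_handle(input_name).copy_from_cpu(clip)

predictor.run()

output_name = predictor.get_output_names()[0]
output = predictor.get_output_handle(output_name).copy_to_cpu()
print(output.shape)  # raw detection tensor, still to be decoded into boxes and scores
```
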
### Use prediction engine inference

- Download the test video [HorseRiding.avi](https://videotag.bj.bcebos.com/Data/HorseRiding.avi) for a quick experience, or fetch it with the wget command below. The downloaded video should be placed in the `data/ucf24` directory:

```bash
wget -nc https://videotag.bj.bcebos.com/Data/HorseRiding.avi
```

- Run the following command for inference:

```bash
python3 tools/predict.py -c configs/localization/yowo.yaml -i 'data/ucf24/HorseRiding.avi' --model_file ./inference/YOWO.pdmodel --params_file ./inference/YOWO.pdiparams
```

- When inference is over, the prediction results will be saved as images in the `inference/YOWO_infer` directory. The image sequence can be converted to a GIF by running the following command to complete the final visualization (a small Python alternative is sketched after the command):

```bash
python3 data/ucf24/visualization.py --frames_dir ./inference/YOWO_infer/HorseRiding --duration 0.04
```
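
If you prefer to assemble the GIF in Python instead of using `visualization.py`, a few lines with `imageio` achieve a similar result. This is only a sketch under assumptions: the frame directory and the `.jpg` extension may need adjusting to match the files actually written to `inference/YOWO_infer`.

```python
import glob
import imageio

# Collect the predicted frames in order (directory and extension are assumptions)
frames = sorted(glob.glob("./inference/YOWO_infer/HorseRiding/*.jpg"))
images = [imageio.imread(f) for f in frames]

# 0.04 s per frame matches the --duration value used above (25 fps)
imageio.mimsave("./inference/YOWO_infer/HorseRiding.gif", images, duration=0.04)
```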

The resulting visualization is as follows:

<div align="center">
  <img src="../../../images/horse_riding.gif" alt="Horse Riding">
</div>

It can be seen that, using the YOWO model trained on UCF101-24 to predict `data/ucf24/HorseRiding.avi`, the predicted category of each output frame is HorseRiding, with a confidence of about 0.80.

## Reference

- [You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization](https://arxiv.org/pdf/1911.06644.pdf), Köpüklü O, Wei X, Rigoll G.