202 changes: 63 additions & 139 deletions README.md
@@ -1,139 +1,63 @@
<center>
<img src="README_files/overview.png" alt="overview" style="float:middle;">
</center>

# Deep Classification

## updates
- 27/9/2017: provide [subset of dataset](https://drive.google.com/drive/folders/0B3fKFm-j0RqeWGdXZUNRUkpybU0?usp=sharing), separated into train/test set
- 27/9/2017: in this homework, we only evaluate the performance of object classification. You can use the other labels for multi-task learning, etc.
- 4/10/2017: ~~Due: Oct. 5, 11:59pm.~~ => Due: Oct. 12, 11:59pm.
## Brief
* ***+2 extra credit of the whole semester***
* Due: <b>Oct. 5</b>, 11:59pm.
* Required files: results/index.md, and code/
* [Project reference](http://aliensunmin.github.io/project/handcam/)


## Overview


Recently, advances in wearable devices have led to significant interest in recognizing human behaviors in daily life (i.e., in uninstrumented environments). Among many devices, egocentric camera systems have drawn significant attention: because the camera is aligned with the wearer's field of view, it naturally captures what a person sees. These systems have shown great potential in recognizing daily activities (e.g., making meals, watching TV, etc.), estimating hand poses, generating how-to videos, etc.

Despite the many advantages of egocentric camera systems, there exist two main issues that are much less discussed. First, hand localization is not solved, especially for passive camera systems. Even for active camera systems like Kinect, hand localization is challenging when two hands are interacting or a hand is interacting with an object. Second, the limited field of view of an egocentric camera implies that hands will inevitably move outside the images sometimes.

HandCam (Fig. 1) is a novel wearable camera that captures the activities of hands for recognizing human behaviors. HandCam has two main advantages over egocentric systems: (1) it avoids the need to detect hands and manipulation regions; (2) it observes the activities of hands almost all the time.

## Requirement

- Python
- [TensorFlow](https://github.com/tensorflow/tensorflow)

## Data

### Introduction

This is a [dataset](https://drive.google.com/drive/folders/0BwCy2boZhfdBdXdFWnEtNWJYRzQ) recorded by a hand camera system.

The camera system consists of three wide-angle cameras, two mounted on the left and right wrists to
capture hands (referred to as HandCam) and one mounted on the head (referred to as HeadCam).

The dataset consists of 20 sets of video sequences (i.e., each set includes synchronized videos from two
HandCams and one HeadCam) captured in three scenes: a small office, a mid-size lab, and a large home.

We want to classify several kinds of hand states: free vs. active (i.e., whether the hand is holding an object),
object categories, and hand gestures. Each synchronized video also has two sequences that need to be labeled:
the left-hand states and the right-hand states.

For each classification task (i.e., free vs. active, object categories, or hand gesture), there are forty
sequences of data. We split the dataset into two parts, half for training and half for testing. The object instances are completely separated between training and testing.

### Zip files

`frames.zip` contains all the frames sampled from the original videos at 6 fps.

`labels.zip` contains the labels for all frames.

FA : free vs. active (only 0/1)

obj: object categories (24 classes, including free)

ges: hand gesture (13 gestures, including free)


### Details of obj. and ges.

```
Obj = { 'free':0,
'computer':1,
'cellphone':2,
'coin':3,
'ruler':4,
'thermos-bottle':5,
'whiteboard-pen':6,
'whiteboard-eraser':7,
'pen':8,
'cup':9,
'remote-control-TV':10,
'remote-control-AC':11,
'switch':12,
'windows':13,
'fridge':14,
'cupboard':15,
'water-tap':16,
'toy':17,
'kettle':18,
'bottle':19,
'cookie':20,
'book':21,
'magnet':22,
'lamp-switch':23}

Ges= { 'free':0,
'press':1,
'large-diameter':2,
'lateral-tripod':3,
'parallel-extension':4,
'thumb-2-finger':5,
'thumb-4-finger':6,
'thumb-index-finger':7,
'precision-disk':8,
'lateral-pinch':9,
'tripod':10,
'medium-wrap':11,
'light-tool':12}
```
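
As an illustration of how these tables can be used, the sketch below inverts the `Ges` mapping so that integer labels can be turned back into names. The helper `decode` and the sample label list are hypothetical and exist only for demonstration; the same approach would apply to `Obj`.

```python
# Gesture table copied from the listing above.
Ges = {'free': 0, 'press': 1, 'large-diameter': 2, 'lateral-tripod': 3,
       'parallel-extension': 4, 'thumb-2-finger': 5, 'thumb-4-finger': 6,
       'thumb-index-finger': 7, 'precision-disk': 8, 'lateral-pinch': 9,
       'tripod': 10, 'medium-wrap': 11, 'light-tool': 12}

# Invert name -> id into id -> name.
GES_NAMES = {index: name for name, index in Ges.items()}


def decode(labels, names=GES_NAMES):
    """Map a sequence of integer labels to their names (hypothetical helper)."""
    return [names[int(label)] for label in labels]


# Made-up per-frame labels, not taken from the dataset.
sample = [0, 0, 11, 11, 1]
print(decode(sample))
```

Casting each label with `int(...)` lets the helper accept labels that were stored as floats, which is a common side effect of building label arrays by appending.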

## Writeup

You are required to implement a **deep-learning-based method** to recognize hand states (free vs. active hands, hand gestures, object categories). Moreover, you might need to take further advantage of both HandCam and HeadCam. You will compete with your classmates on performance, so try to use as many techniques as possible to improve. **Your score will be based on the performance ranking.**

For this project, and all other projects, you must write a project report in the results folder using [Markdown](https://help.github.com/articles/markdown-basics). We provide you with a placeholder [index.md](./results/index.md) document which you can edit. In the report you will describe your algorithm and any decisions you made to write your algorithm a particular way. Then, you will describe how to run your code and whether your code depends on other packages. You also need to show and discuss the results of your algorithm. Discuss any extra credit you did, and clearly show what contribution it had on the results (e.g. performance with and without each extra credit component).

You should also include the precision-recall curve of your final classifier and any interesting variants of your algorithm.
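
One possible way to obtain the points of a precision-recall curve is sketched below in plain Python for a single class treated one-vs-rest. The function name and the toy scores are assumptions made for illustration; they are not part of the assignment code or its results.

```python
def precision_recall(scores, is_positive):
    """Return (precision, recall) lists, one entry per score cutoff, for one class."""
    # Rank frames from most to least confident for this class.
    ranked = sorted(zip(scores, is_positive), key=lambda pair: -pair[0])
    total_positive = sum(1 for _, positive in ranked if positive)
    precision, recall, true_positive = [], [], 0
    for rank, (_, positive) in enumerate(ranked, start=1):
        true_positive += 1 if positive else 0
        precision.append(true_positive / rank)
        recall.append(true_positive / total_positive if total_positive else 0.0)
    return precision, recall


# Made-up scores for a single class; not results from this dataset.
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_truth = [True, False, True, False]
precision, recall = precision_recall(toy_scores, toy_truth)
```

Plotting `recall` on the x-axis against `precision` on the y-axis gives the curve for that class; repeating this per class gives one curve per object category.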

## Rubric
<ul>
<li> 40 pts: According to performance ranking in class </li>
<li> 60 pts: Outperform the AlexNet baseline </li>
<li> -5*n pts: Lose 5 points for every time (after the first) you do not follow the instructions for the hand in format </li>
</ul>

## Getting started & hand-in
* Publicly fork version (+2 extra points)
- [Fork the homework](https://education.github.com/guide/forks) to obtain a copy of the homework in your github account
- [Clone the homework](http://gitref.org/creating/#clone) to your local space and work on the code locally
- Commit and push your local code to your github repo
- Once you are done, submit your homework by [creating a pull request](https://help.github.com/articles/creating-a-pull-request)

* [Privately duplicated version](https://help.github.com/articles/duplicating-a-repository)
- Make a bare clone
- mirror-push to new repo
- [make new repo private](https://help.github.com/articles/making-a-private-repository-public)
- [add aliensunmin as collaborator](https://help.github.com/articles/adding-collaborators-to-a-personal-repository)
- [Clone the homework](http://gitref.org/creating/#clone) to your local space and work on the code locally
- Commit and push your local code to your github repo
- I will clone your repo after the due date

## Credits
Assignment designed by Cheng-Sheng Chan. Contents in this handout are from <a href="https://drive.google.com/file/d/0BwCy2boZhfdBM0ZDTV9lZW1rZzg/view">Chan et al.</a>.
# CEDL HW1 Deep Classification
## 黃柏瑜 105061535
## Installation & usage
1. Install PyTorch with conda
```
conda install pytorch torchvision cuda80 -c soumith
```
2. Clone this repo
3. Download the data into the `data` directory. The structure should look like this:
```
./data/frames
./data/labels
```

4. Training
```
python main.py --batch-size 128 --arch resnet18 --workers 4 --pretrained --model_name resnet18
python main.py --batch-size 128 --arch resnet50 --workers 4 --pretrained --model_name resnet50
```
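
`main.py` is not part of this diff. As a hedged sketch, a parser that accepts the flags used in the commands above could look like the following. Only the flag names come from those commands; the defaults, help strings, and the function name `build_parser` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the training flags (defaults here are assumed)."""
    parser = argparse.ArgumentParser(description='HandCam object classification (sketch)')
    parser.add_argument('--batch-size', type=int, default=128, help='mini-batch size')
    parser.add_argument('--arch', default='resnet18', help='model architecture name')
    parser.add_argument('--workers', type=int, default=4, help='number of data loading workers')
    parser.add_argument('--pretrained', action='store_true', help='start from ImageNet weights')
    # The purpose of --model_name is assumed; the README does not describe it.
    parser.add_argument('--model_name', default='resnet18', help='label for this run')
    return parser


args = build_parser().parse_args(
    ['--batch-size', '128', '--arch', 'resnet50', '--workers', '4',
     '--pretrained', '--model_name', 'resnet50'])
print(args.batch_size, args.arch, args.workers, args.pretrained, args.model_name)
```

argparse turns `--batch-size` into the attribute `args.batch_size`, which matches the `args.batch_size` and `args.workers` attributes read by `HandCam_Dataloader` in `handcam_data.py`.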

## Implementation
1. Dataloader
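First, write a HandCam_Dataset on top of the Dataset class defined by PyTorch. It fetches each image together with its label.
Then preprocess the images:
normalize them, scale them down to 224 (`transforms.Scale(224)` resizes the shorter side to 224), and randomly flip them for data augmentation.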
```
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
self.transform = transforms.Compose([
    transforms.Scale(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])
```
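
To make the normalization step concrete: `transforms.ToTensor()` scales 8-bit pixel values to the range [0, 1], and `transforms.Normalize` then computes `(value - mean) / std` per channel. The sketch below reproduces that arithmetic in plain Python for a single pixel. The helper name and the pixel value are made up for illustration.

```python
# ImageNet channel statistics, as used in the transform above.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Apply (value / 255 - mean) / std to one 8-bit RGB pixel."""
    return [(value / 255.0 - m) / s for value, m, s in zip(rgb, MEAN, STD)]


# A made-up mid-gray pixel, for illustration only.
print(normalize_pixel([128, 128, 128]))
```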

2. Model
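Use the resnet18 and resnet50 models provided by PyTorch, initialized with parameters pretrained on ImageNet so that training gets off to a faster start.
The last layer of the pretrained models is Fully-Connected (512 -> 1000) for resnet18 and Fully-Connected (2048 -> 1000) for resnet50, because ImageNet has 1000 classes. The current dataset has only 24 classes, so these layers are changed to Fully-Connected (512 -> 24) and Fully-Connected (2048 -> 24).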

3. Training Procedure
Learning rate: 0.01 (decayed every 5 epochs)
Batch size: 128
Epochs: 30
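
The schedule above can be sketched as a step decay. The README gives the initial rate (0.01) and the interval (5 epochs) but not the decay factor, so the factor of 0.1 below is an assumption, as is the function name.

```python
def step_decay_lr(epoch, base_lr=0.01, step=5, gamma=0.1):
    """Learning rate for a zero-based epoch; gamma=0.1 is an assumed decay factor."""
    return base_lr * gamma ** (epoch // step)


# Print the rate at a few epochs of the 30-epoch run.
for epoch in (0, 4, 5, 10, 29):
    print(epoch, step_decay_lr(epoch))
```

With these assumed values the rate stays at 0.01 for epochs 0 to 4 and is multiplied by 0.1 at every fifth epoch after that.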

## Results
### Best Result

Resnet18 : 70.28%

Resnet50 : 72.3%


### Training Curve

#### Resnet18
<img src="Resnet18_loss" alt="overview" style="float:middle;">
<img src="Resnet18_acc" alt="overview" style="float:middle;">


#### Resnet50
<img src="Resnet50_loss" alt="overview" style="float:middle;">
<img src="Resnet50_acc" alt="overview" style="float:middle;">
Binary file added Resnet18_acc
Binary file not shown.
Binary file added Resnet18_loss
Binary file not shown.
Binary file added Resnet50_acc
Binary file not shown.
Binary file added Resnet50_loss
Binary file not shown.
103 changes: 103 additions & 0 deletions handcam_data.py
@@ -0,0 +1,103 @@
import numpy as np
import pickle
import os
import collections
import random
import matplotlib.pyplot as plt
import pdb
import glob

import torch
from torch.utils.data.dataset import Dataset
from torchvision import datasets, transforms, utils
from torch.utils.data import TensorDataset, DataLoader
from PIL import Image

class HandCam_Dataset(Dataset):
    def __init__(self, split):
        # split is 'train' or 'test'; it selects the frame directory and the label files.
        self.image_path = 'data/frames/'+split+'/'
        self.label_path = 'data/labels/'

        self.scene_list = ['house', 'lab', 'office']
        self.video_list = ['1', '2', '3', '4']
        self.view_list = ['Lhand', 'Rhand']

        # Collect frame paths in scene/video/view order, sorted by frame number.
        self.image_list = []
        for scene in self.scene_list:
            for video in self.video_list:
                # Video '4' is only used for the lab scene.
                if video == '4' and scene != 'lab':
                    break
                for view in self.view_list:
                    print('scene {} video {} view {}'.format(scene, video, view))
                    frame_list = glob.glob(self.image_path +scene+'/'+video+'/'+view+'/*')
                    frame_list.sort(key=lambda x:int(x.split('Image')[1].split('.')[0]))
                    self.image_list.extend(frame_list)
        self.labels = np.array([])

        # Load the object labels in the same order as the frames above.
        for scene in self.scene_list:
            for video in self.video_list:
                for view in self.view_list:
                    if video == '4' and scene != 'lab':
                        break
                    label_index = video
                    if split == 'test':
                        # Test label files continue the numbering after the training videos.
                        if scene == 'lab':
                            label_index = str(int(label_index) + 4)
                        else:
                            label_index = str(int(label_index) + 3)
                    view = 'left' if view == 'Lhand' else 'right'

                    label_file = self.label_path + scene + '/obj_' + view + label_index + '.npy'
                    print(label_file)
                    self.labels = np.append(self.labels, np.load(label_file))

        normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                         std=[0.229, 0.224, 0.225])
        if split == 'train':
            # Training: resize, random horizontal flip for augmentation, then normalize.
            self.transform = transforms.Compose([
                transforms.Scale(224),
                #transforms.RandomCrop(224),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                normalize,
            ])

        elif split == 'test':
            # Testing: same preprocessing without the random flip.
            self.transform = transforms.Compose([
                transforms.Scale(224),
                #transforms.CenterCrop(224),
                transforms.ToTensor(),
                normalize,
            ])

    def __getitem__(self, index):
        # Load one frame as RGB, apply the transform, and wrap its label in a LongTensor.
        image = self.transform(Image.open(self.image_list[index]).convert('RGB'))
        label = torch.LongTensor([int(self.labels[index])])
        return image, label

    def __len__(self):
        return len(self.image_list)


def HandCam_Dataloader(args):
    kwargs = {'num_workers': args.workers, 'pin_memory': True}
    train_dataset = HandCam_Dataset('train')
    train_loader = DataLoader(train_dataset, batch_size=args.batch_size,
                              shuffle=True, **kwargs)
    val_dataset = HandCam_Dataset('test')
    val_loader = DataLoader(val_dataset, batch_size=args.batch_size,
                            shuffle=False, **kwargs)

    return train_loader, val_loader


if __name__ == '__main__':
    train_data = HandCam_Dataset('train')
    test_data = HandCam_Dataset('test')



