Movie-Poster-Classification for IFML

Usage

Run ./postercollector.py byGenre <N> to collect the ./movielist
- <N>: number of top pages per genre to crawl
Run ./postercollector.py download to download all posters to ./posters/
Run ./datasetgenerator.py to split the dataset into train/val/test
- also reports missing image files so they can be downloaded manually
Run ./features/extract_features.py <dev> to generate ./sets/features_all.h5
- <dev>: CUDA device number; -1 selects the CPU

Downloading the sets

The crawled, downloaded and split data can be found here.

posterset_*k.zip contains data and labels
- after datasetgenerator.py
posterset_*k_full.zip additionally contains the features
- after extract_features.py

The zip can be extracted in the root directory of the repo; all files will then be in their expected locations. All files are encrypted with a password, because we live in deep fear of the copyright law.

Using the sets

The directory "sets" contains information about the split (train/val/test) of the dataset.
It also contains extracted features, as well as information about the existing labels.

./sets/gen_d.p

Contains a dict that translates labels to numbers. It works in both directions.
Labels with too few samples have been removed from the dataset.

e.g.:

gen_d["Action"] == 0
gen_d["Some Genre"] == 99
gen_d[0] == "Action"

len(gen_d) / 2 == {number of classes}
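
A minimal sketch of reading such a two-way mapping, assuming gen_d.p is a standard Python pickle (the .p extension suggests this, but it is not stated here). The helper names load_gen_d and num_classes and the toy dictionary are illustrative only and are not part of the repo.

```python
import os
import pickle
import tempfile


def load_gen_d(path):
    """Load the two-way label dictionary (assumed to be a pickle file)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def num_classes(gen_d):
    """Every class is stored twice: name -> index and index -> name."""
    return len(gen_d) // 2


# Toy stand-in for ./sets/gen_d.p so the sketch runs without the real file.
toy = {"Action": 0, "Drama": 1, 0: "Action", 1: "Drama"}
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "gen_d.p")
    with open(toy_path, "wb") as handle:
        pickle.dump(toy, handle)
    gen_d = load_gen_d(toy_path)

print(gen_d["Action"], gen_d[0], num_classes(gen_d))
```

With the real data, load_gen_d("./sets/gen_d.p") would replace the temporary file.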

./sets/**.csv

Contain the train/val/test sets. Each row is a sample; the first entry is the imdb-id (the poster name).
The following entries are the labels of the sample.

e.g.:

tt12346,Action,Romance,Drama
tt57890,Adventure
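
Rows of this shape can be parsed with the standard csv module. The following is a sketch only: read_split_csv is a hypothetical helper, and the stripping of whitespace is a precaution rather than something the files are known to need.

```python
import csv
import io


def read_split_csv(handle):
    """Return a list of (imdb_id, [labels]) tuples from an open CSV file."""
    samples = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        imdb_id = row[0].strip()
        labels = [genre.strip() for genre in row[1:] if genre.strip()]
        samples.append((imdb_id, labels))
    return samples


# The two example rows from above, read from memory instead of a file.
example = io.StringIO("tt12346,Action,Romance,Drama\ntt57890,Adventure\n")
samples = read_split_csv(example)
print(samples)
```

For a real file, pass a handle opened with open(path, newline="") for one of the CSV files in ./sets/.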

./sets/set_splits.p

Contains a dict with an entry for ["train"]/["val"]/["test"] and ["drop"]ed (not used) movies.
Each entry is a dict itself, with ["ids"] containing a list of imdb-ids in this set, and ["labels"] being a list of the corresponding genres as strings.
(WARNING: could contain leading spaces).

The same information is in the *.csv files, but extract_features.py needs this file.
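
Because of the possible leading spaces, it may be convenient to normalise the labels after loading. The sketch below assumes that ["labels"][i] is a list of genre strings belonging to ["ids"][i]; that layout is an assumption, and clean_split is a hypothetical helper. A toy dict stands in for the real file.

```python
def clean_split(split):
    """Return a copy of one split with whitespace stripped from every genre.

    Assumes split["labels"][i] is a list of genre strings for split["ids"][i].
    """
    return {
        "ids": list(split["ids"]),
        "labels": [[genre.strip() for genre in genres] for genres in split["labels"]],
    }


# Toy stand-in for the dict stored in ./sets/set_splits.p
set_splits = {
    "train": {"ids": ["tt12346"], "labels": [[" Action", " Drama"]]},
    "val": {"ids": ["tt57890"], "labels": [[" Adventure"]]},
    "test": {"ids": [], "labels": []},
    "drop": {"ids": [], "labels": []},
}

cleaned = {name: clean_split(split) for name, split in set_splits.items()}
print(cleaned["train"]["labels"])
```

The original dict is left untouched, so the cleaned copy can be compared against it.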

./sets/features_all.h5

Generated by ./features/extract_features.py. Contains features from different neural networks.
All of the following datasets are in the same order, so index [0] in every dataset belongs to the same sample:

["lables"]: multi-hot-vectors, 1 for each label the sample has, 0 if not. For indices, see gen_d.
["ids"]: the imdb-id (=poster-id) of each sample
["alex_fc6"], ["alex_fc7"]: features from AlexNet (pytorch)
["vgg19bn_fc6"], ["vgg19bn_fc7"]: features from VGG19 with batchnorm (pytorch)
["res50_avg"]: features from ResNet50's average-pooling layer (pytorch)
["dense161_last"]: features from DenseNet (pytorch)
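
To illustrate the multi-hot label format, here is a small sketch that converts between genre names and multi-hot vectors using a two-way dict shaped like gen_d. The function names and the toy dict are hypothetical; plain lists are used so that the example runs without the real data.

```python
def encode_multi_hot(genres, gen_d):
    """Build a 0/1 vector with a 1 at the index of every given genre."""
    vector = [0] * (len(gen_d) // 2)  # gen_d stores every class twice
    for genre in genres:
        vector[gen_d[genre]] = 1
    return vector


def decode_multi_hot(vector, gen_d):
    """Return the genre names whose positions are set in the vector."""
    return [gen_d[index] for index, flag in enumerate(vector) if flag]


# Toy two-way mapping in the style of ./sets/gen_d.p
toy_gen_d = {"Action": 0, "Drama": 1, "Romance": 2,
             0: "Action", 1: "Drama", 2: "Romance"}

vector = encode_multi_hot(["Action", "Romance"], toy_gen_d)
names = decode_multi_hot(vector, toy_gen_d)
print(vector, names)
```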

There are also datasets to index the set splits:

["train_idx"]: a list of all indices (w.r.t. the datasets above) that belong to the train set
["val_idx"]: a list of all indices (w.r.t. the datasets above) that belong to the val set
["test_idx"]: a list of all indices (w.r.t. the datasets above) that belong to the test set
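
The index datasets can be used to pull one split out of the shared arrays. The sketch below is written against any mapping from dataset names to indexable sequences, so it runs on a plain dict here. An open h5py file should behave similarly, but that is untested against the real file, and select_split is a hypothetical helper. The key "lables" is spelled as listed above.

```python
def select_split(store, split, feature_key):
    """Return (ids, features, labels) for one split ("train", "val" or "test").

    `store` maps dataset names to indexable sequences, e.g. a plain dict
    as in the demo below.
    """
    indices = [int(i) for i in store[split + "_idx"]]
    ids = [store["ids"][i] for i in indices]
    features = [store[feature_key][i] for i in indices]
    labels = [store["lables"][i] for i in indices]  # key spelled as in the file
    return ids, features, labels


# Toy stand-in with three samples and two classes.
store = {
    "ids": ["tt001", "tt002", "tt003"],
    "lables": [[1, 0], [0, 1], [1, 1]],
    "alex_fc7": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "train_idx": [0, 2],
    "val_idx": [1],
    "test_idx": [],
}

ids, features, labels = select_split(store, "train", "alex_fc7")
print(ids, labels)
```

With h5py installed, the same function could presumably be called on h5py.File("./sets/features_all.h5", "r"); ids read that way may come back as bytes rather than str.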

./crawling/movielist

Contains a list of dicts. Each dict represents a movie and has the following entries:

["imdb-id"]: the imdb-id (=poster-id) as string.
["title"]: the movie title as string.
["genres"]: all genres of the movie as list of strings. WARNING: all genres have a leading blank!
["poster"]: the URL of the movie-poster as string.

Duplicates are possible. This file is no longer needed after datasetgenerator.py has been executed successfully.
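
When working with the raw list directly, the leading blanks and duplicates can be handled in one pass. This is only a sketch of one possible cleanup: clean_movielist is a hypothetical helper, the sample entries are made up, and datasetgenerator.py may treat duplicates differently.

```python
def clean_movielist(movies):
    """Strip whitespace from genres and keep the first entry per imdb-id."""
    seen = set()
    cleaned = []
    for movie in movies:
        if movie["imdb-id"] in seen:
            continue  # later duplicates are dropped
        seen.add(movie["imdb-id"])
        entry = dict(movie)  # shallow copy so the input is not modified
        entry["genres"] = [genre.strip() for genre in movie["genres"]]
        cleaned.append(entry)
    return cleaned


# Made-up entries in the documented layout.
movies = [
    {"imdb-id": "tt12346", "title": "Example A",
     "genres": [" Action", " Drama"], "poster": "https://example.com/a.jpg"},
    {"imdb-id": "tt12346", "title": "Example A",
     "genres": [" Action", " Drama"], "poster": "https://example.com/a.jpg"},
    {"imdb-id": "tt57890", "title": "Example B",
     "genres": [" Adventure"], "poster": "https://example.com/b.jpg"},
]

unique_movies = clean_movielist(movies)
print([m["imdb-id"] for m in unique_movies])
```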

Coding

When adding a new feature in a new branch, you can use ln -s ../core/core.py core.py (you may need to change .. to the project's root directory) to access features from the core utility, such as:

class PosterSet(torch.utils.data.Dataset):  # wrapper class for the dataset; preprocesses images
