- use `./postercollector.py byGenre <N>` to collect the `./movielist`
  - `<N>`: number of top-pages per genre to crawl
- use `./postercollector.py download` to download all posters to `./posters/`
- use `./datasetgenerator.py` to split the dataset into train/val/test
  - also reminds you of missing image files for manual download
- use `./features/extract_features.py <dev>` to generate `./sets/features_all.h5`
  - `<dev>`: CUDA device number, `-1` is CPU
The crawled, downloaded and split data can be found here.
- `posterset_*k.zip` contains data and labels
  - after `datasetgenerator.py`
- `posterset_*k_full.zip` additionally contains the features
  - after `extract_features.py`
The zip can be extracted in the root directory of the repo; all files will end up in their intended locations. All files are encrypted with a password, because we live in deep fear of copyright law.
The directory "sets" contains information about the split (train/val/test) of the dataset.
It also contains extracted features, as well as information about the existing labels.
Contains a dict that translates labels to numbers and works in both directions.
Labels with too few samples have been removed from the dataset.
e.g.:
- `gen_d["Action"] == 0`
- `gen_d["Some Genre"] == 99`
- `gen_d[0] == "Action"`
- `len(gen_d) / 2 == {number of classes}`
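A minimal sketch of how such a two-way dict can be built and queried. The helper `build_gen_d` and the genre list are hypothetical and only illustrate the layout described above; the repo's scripts may construct the dict differently.

```python
def build_gen_d(genres):
    """Build a dict that maps label -> index and index -> label."""
    gen_d = {}
    for idx, name in enumerate(genres):
        gen_d[name] = idx  # forward lookup, e.g. "Action" -> 0
        gen_d[idx] = name  # reverse lookup, e.g. 0 -> "Action"
    return gen_d


# Hypothetical genre list, for illustration only.
gen_d = build_gen_d(["Action", "Adventure", "Drama"])

# Every class is stored twice (once per direction), hence the division by 2.
num_classes = len(gen_d) // 2
print(gen_d["Drama"], gen_d[2], num_classes)
```

Because string keys and integer keys never collide, both directions can live in one dict without ambiguity.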
Contain the train/val/test sets. Each row is a sample; the first entry is the imdb-id (the poster name).
The following entries are the labels of the sample.
e.g.:
- `tt12346,Action,Romance,Drama`
- `tt57890,Adventure`
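Since rows have a variable number of columns, they are easier to read with the `csv` module than with a fixed-width table reader. The following is a sketch under that assumption; the function name `read_split` is hypothetical, and an in-memory string stands in for the real file.

```python
import csv
import io


def read_split(fileobj):
    """Return a list of (imdb_id, [labels]) tuples from a split csv."""
    samples = []
    for row in csv.reader(fileobj):
        if not row:
            continue  # skip empty lines
        imdb_id = row[0]
        # Strip whitespace defensively, in case labels carry leading blanks.
        labels = [label.strip() for label in row[1:]]
        samples.append((imdb_id, labels))
    return samples


# In-memory stand-in for one of the csv files.
example = io.StringIO("tt12346,Action,Romance,Drama\ntt57890,Adventure\n")
samples = read_split(example)
print(samples)
```

For a real file, pass an object opened with `open(path, newline="")` instead of the `StringIO` stand-in.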
Contains a dict with an entry for ["train"]/["val"]/["test"] and ["drop"]ed (not used) movies.
Each entry is a dict itself, with ["ids"] containing a list of imdb-ids in this set,
and ["labels"] being a list of the corresponding genres as strings.
(WARNING: could contain leading spaces).
The same information is in the *.csv files, but extract_features.py needs this one.
Generated by ./features/extract_features.py
Contains features from different neural networks.
All following datasets are in the same order, so index [0] from all sets belongs to the same sample:
- `["lables"]`: multi-hot vectors, 1 for each label the sample has, 0 otherwise. For indices, see gen_d.
- `["ids"]`: the imdb-id (=poster-id) of each sample
- `["alex_fc6"]`, `["alex_fc7"]`: features from AlexNet (pytorch)
- `["vgg19bn_fc6"]`, `["vgg19bn_fc7"]`: features from VGG19 with batchnorm (pytorch)
- `["res50_avg"]`: features from ResNet50's average-pooling layer (pytorch)
- `["dense161_last"]`: features from DenseNet (pytorch)
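To make the multi-hot encoding concrete, here is a small sketch that converts between genre lists and multi-hot vectors using a two-way dict shaped like gen_d. The helper names and the three-genre dict are hypothetical; the real vectors are produced by `extract_features.py`.

```python
def to_multi_hot(labels, gen_d):
    """Encode a list of genre strings as a multi-hot vector."""
    num_classes = len(gen_d) // 2  # gen_d stores each class in both directions
    vec = [0] * num_classes
    for label in labels:
        vec[gen_d[label.strip()]] = 1  # strip possible leading blanks
    return vec


def from_multi_hot(vec, gen_d):
    """Decode a multi-hot vector back into a list of genre strings."""
    return [gen_d[i] for i, flag in enumerate(vec) if flag]


# Hypothetical two-way dict, for illustration only.
gen_d = {"Action": 0, "Adventure": 1, "Drama": 2,
         0: "Action", 1: "Adventure", 2: "Drama"}

vec = to_multi_hot(["Action", " Drama"], gen_d)
print(vec, from_multi_hot(vec, gen_d))
```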
There are also datasets to index the set splits:
- `["train_idx"]`: a list of all indices (w.r.t. the datasets above) that belong to the train set
- `["val_idx"]`: a list of all indices (w.r.t. the datasets above) that belong to the val set
- `["test_idx"]`: a list of all indices (w.r.t. the datasets above) that belong to the test set
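Because all datasets share one ordering, a single index list selects matching rows from every dataset. The sketch below uses plain Python lists as stand-ins for the stored datasets; the values and the helper `select` are made up for illustration.

```python
def select(dataset, indices):
    """Pick the rows of `dataset` that belong to one split."""
    return [dataset[i] for i in indices]


# Stand-ins for datasets that share the same sample order.
ids = ["tt0000001", "tt0000002", "tt0000003", "tt0000004"]
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

# Stand-ins for the split index lists.
train_idx, val_idx, test_idx = [0, 2], [1], [3]

train_ids = select(ids, train_idx)
train_features = select(features, train_idx)
print(train_ids, train_features)
```

Applying the same index list to `ids` and to a feature dataset keeps each feature row paired with its imdb-id.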
Contains a list of dicts. Each dict represents a movie and has the following entries:
- `["imdb-id"]`: the imdb-id (=poster-id) as string.
- `["title"]`: the movie title as string.
- `["genres"]`: all genres of the movie as list of strings. WARNING: all genres have a leading blank!
- `["poster"]`: the URL of the movie-poster as string.
Duplicates are possible. This file is no longer needed after datasetgenerator.py has been executed successfully.
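If you work with this list directly, you will likely want to drop duplicates and strip the leading blanks from the genres first. A minimal sketch, assuming the entry layout described above; the function `clean_movielist` and the sample entries are hypothetical, and `datasetgenerator.py` may handle this differently.

```python
def clean_movielist(movies):
    """Drop duplicate imdb-ids and strip whitespace from genre names."""
    seen = set()
    cleaned = []
    for movie in movies:
        if movie["imdb-id"] in seen:
            continue  # keep only the first occurrence of each movie
        seen.add(movie["imdb-id"])
        entry = dict(movie)  # shallow copy, leaves the input untouched
        entry["genres"] = [genre.strip() for genre in movie["genres"]]
        cleaned.append(entry)
    return cleaned


# Made-up entries, for illustration only.
movies = [
    {"imdb-id": "tt12346", "title": "Example A",
     "genres": [" Action", " Drama"], "poster": "https://example.com/a.jpg"},
    {"imdb-id": "tt12346", "title": "Example A",
     "genres": [" Action", " Drama"], "poster": "https://example.com/a.jpg"},
    {"imdb-id": "tt57890", "title": "Example B",
     "genres": [" Adventure"], "poster": "https://example.com/b.jpg"},
]

cleaned = clean_movielist(movies)
print([(m["imdb-id"], m["genres"]) for m in cleaned])
```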
When adding a new feature in a new branch, you can use `ln -s ../core/core.py core.py` (you may need to change `..` to the project's root directory) to access features from the core utility, like:
- `class PosterSet(torch.utils.data.Dataset)`: wrapper class for the dataset, preprocesses images