Movie-Poster-Classification for IFML


  • use ./ byGenre <N> to collect the ./movielist
    • <N>: number of top-pages per genre to crawl
  • use ./ download to download all posters to ./posters/
  • use ./ to split the dataset into train/val/test
    • also reminds you of missing image files for manual download
  • use ./features/ <dev> to generate ./sets/features_all.h5
    • <dev>: CUDA device number, -1 is CPU

Downloading the sets

The crawled, downloaded and split data can be found here.

  • posterset_* contains data and labels
    • after
  • posterset_* additionally contains the features
    • after

The zip can be extracted in the root directory of the repo; all files will then be at their expected positions. All files are encrypted with a password, because we live in deep fear of copyright law.

Using the sets

The directory "sets" contains information about the split (train/val/test) of the dataset.
It also contains extracted features, as well as information about the existing labels.


Contains a dict that translates labels to numbers. It works in both directions.
Labels with too few samples have been removed from the dataset.


  • gen_d["Action"] == 0
  • gen_d["Some Genre"] == 99
  • gen_d[0] == "Action"

len(gen_d) / 2 == {number of classes}
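
A minimal sketch of how such a two-way dict could be built and queried. The helper name `build_label_dict` and the genre list are made up for illustration; the real dict ships with the sets.

```python
def build_label_dict(genres):
    """Map each genre name to an index and each index back to its name."""
    gen_d = {}
    for idx, name in enumerate(genres):
        gen_d[name] = idx  # label -> number
        gen_d[idx] = name  # number -> label
    return gen_d


# Hypothetical genre list, only for demonstration.
gen_d = build_label_dict(["Action", "Adventure", "Drama"])

# Every class is stored twice (once per direction), hence the division by 2.
num_classes = len(gen_d) // 2
print(gen_d["Action"], gen_d[0], num_classes)
```

Integer division (`//`) keeps the class count an int, which is convenient when it is used as a layer size.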


Contain the train/val/test sets. Each row is a sample; the first entry is the imdb-id (the poster name).
The following entries are the labels of the sample.


  • tt12346,Action,Romance,Drama
  • tt57890,Adventure
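
Rows in this layout can be read with the standard `csv` module. The sketch below assumes there is no header row and strips whitespace around labels as a precaution; the function name `read_split` is hypothetical, and an in-memory string stands in for a real file.

```python
import csv
import io


def read_split(handle):
    """Return a list of (imdb_id, [labels]) tuples from a split file."""
    samples = []
    for row in csv.reader(handle):
        if not row:
            continue  # skip empty lines
        imdb_id = row[0].strip()
        labels = [label.strip() for label in row[1:] if label.strip()]
        samples.append((imdb_id, labels))
    return samples


# In-memory stand-in for one of the *.csv files.
demo = io.StringIO("tt12346,Action,Romance,Drama\ntt57890,Adventure\n")
samples = read_split(demo)
print(samples)
```

For a file on disk, pass `open(path, newline="")` instead of the `StringIO` object.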


Contains a dict with an entry for ["train"]/["val"]/["test"] and ["drop"]ed (not used) movies.
Each entry is a dict itself, with ["ids"] containing a list of imdb-ids in this set, and ["labels"] being a list of the corresponding genres as strings.
(WARNING: could contain leading spaces).

The same information is in the *.csv-files, but needs this one.
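
Because of the possible leading spaces, it can help to normalize the labels once after loading. The sketch below assumes that `["labels"]` holds one list of genre strings per id; the sample content and the helper name `clean_split` are hypothetical, and loading the dict from disk is not shown.

```python
def clean_split(split):
    """Return a copy of one split entry with whitespace stripped from labels."""
    return {
        "ids": list(split["ids"]),
        "labels": [[g.strip() for g in genres] for genres in split["labels"]],
    }


# Hypothetical content mirroring the described layout.
sets = {
    "train": {"ids": ["tt12346"], "labels": [[" Action", " Romance", " Drama"]]},
    "val": {"ids": ["tt57890"], "labels": [[" Adventure"]]},
    "test": {"ids": [], "labels": []},
    "drop": {"ids": [], "labels": []},
}

cleaned = {name: clean_split(entry) for name, entry in sets.items()}
print(cleaned["train"]["labels"])
```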


Generated by ./features/ Contains features from different neural networks.
All following datasets are in the same order, so index [0] from all sets belongs to the same sample:

  • ["lables"]: multi-hot-vectors, 1 for each label the sample has, 0 if not. For indices, see gen_d.
  • ["ids"]: the imdb-id (=poster-id) of each sample
  • ["alex_fc6"], ["alex_fc7"]: features from AlexNet (pytorch)
  • ["vgg19bn_fc6"], ["vgg19bn_fc7"]: features from VGG19 with batchnorm (pytorch)
  • ["res50_avg"]: features from ResNet50s layer (pytorch)
  • ["dense161_last"]: features from DenseNet (pytorch)

There are also datasets to index the set splits:

  • ["train_idx"]: a list of all indices (w.r.t. the datasets above) that belong to the train set
  • ["val_idx"]: a list of all indices (w.r.t. the datasets above) that belong to the val set
  • ["test_idx"]: a list of all indices (w.r.t. the datasets above) that belong to the test set
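
The following sketch shows how the index datasets and the multi-hot label vectors fit together. Plain Python lists stand in for the datasets in the feature file (reading the file itself is not shown), and all values as well as the helper names `decode_multi_hot` and `select` are made up for illustration.

```python
def decode_multi_hot(vector, gen_d):
    """Translate a multi-hot vector into the list of genre names it encodes."""
    return [gen_d[i] for i, flag in enumerate(vector) if flag]


def select(rows, indices):
    """Pick the rows that belong to one split."""
    return [rows[i] for i in indices]


# Hypothetical stand-ins for the datasets described above.
gen_d = {"Action": 0, "Adventure": 1, "Drama": 2,
         0: "Action", 1: "Adventure", 2: "Drama"}
ids = ["tt12346", "tt57890", "tt00001"]
labels = [[1, 0, 1], [0, 1, 0], [0, 0, 1]]
train_idx = [0, 2]

# The same indices address every dataset, so ids and labels stay aligned.
train_ids = select(ids, train_idx)
train_genres = [decode_multi_hot(v, gen_d) for v in select(labels, train_idx)]
print(train_ids, train_genres)
```

The same `select` call would apply to any of the feature datasets, since they all share the sample order.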


Contains a list of dicts. Each dict represents a movie and has the following entries:

  • ["imdb-id"]: the imdb-id (=poster-id) as string.
  • ["title"]: the movie title as string.
  • ["genres"]: all genres of the movie as list of strings. WARNING: all genres have a leading blank!
  • ["poster"]: the URL of the movie-poster as string.

Duplicates are possible. This file is not needed anymore, after has been successfully executed.
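
If the list is processed by hand, duplicates and the leading blanks in the genres have to be handled first. A minimal sketch, assuming a duplicate is an entry whose imdb-id has already been seen; the sample movies, URLs and the helper name `clean_movielist` are hypothetical.

```python
def clean_movielist(movies):
    """Drop repeated imdb-ids (first occurrence wins) and strip genre blanks."""
    seen = set()
    cleaned = []
    for movie in movies:
        if movie["imdb-id"] in seen:
            continue  # duplicate entry, skip it
        seen.add(movie["imdb-id"])
        entry = dict(movie)  # shallow copy, the input list stays untouched
        entry["genres"] = [g.strip() for g in movie["genres"]]
        cleaned.append(entry)
    return cleaned


# Hypothetical entries in the described layout.
movies = [
    {"imdb-id": "tt12346", "title": "Example A",
     "genres": [" Action", " Drama"], "poster": "https://example.com/a.jpg"},
    {"imdb-id": "tt12346", "title": "Example A",
     "genres": [" Action", " Drama"], "poster": "https://example.com/a.jpg"},
    {"imdb-id": "tt57890", "title": "Example B",
     "genres": [" Adventure"], "poster": "https://example.com/b.jpg"},
]

unique = clean_movielist(movies)
print([m["imdb-id"] for m in unique])
```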


When adding a new feature in a new branch, you can use ln -s ../core/ (you may need to change .. to the project's root directory) to access features from the core utility, like:

class PosterSet( #wrapper class for dataset, preprocessing images


A deep learning project for the course IFML.





