generate_questions produces a dataset that is more balanced than the original one

There must be another discrepancy between `generate_questions.py` and the original script that was used to generate CLEVR. I have noticed that in CLEVR the answer distribution for counting questions is very skewed. For example, for one of the question families I have the following answer counts:

{'1': 2658, '0': 2555, '2': 1911, '5': 52, '3': 579, '6': 17, '4': 136, '7': 2, '9': 1}

Here the 6th most popular answer is "5" with a count of 52, and the next one, "6", has a count of only 17. This could not have happened if the current version of `generate_questions.py` had been used, since it has a heuristic that forces every answer to occur at most 5 times as often as the 6th most popular answer:

https://github.com/facebookresearch/clevr-dataset-gen/blob/master/question_generation/generate_questions.py#L322
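
To make the arithmetic concrete, here is a minimal sketch that checks the counts above against the cap as described in this issue. It is not the code from `generate_questions.py`, which applies its heuristic while questions are being generated and may differ in detail; the helper name `answers_over_cap` is made up for illustration. Ranking the counts from most to least frequent puts "5" (52) in 6th place and "6" (17) in 7th, so the sketch checks both reference values.

```python
# Answer counts reported above for one counting question family in CLEVR.
answer_counts = {'1': 2658, '0': 2555, '2': 1911, '5': 52, '3': 579,
                 '6': 17, '4': 136, '7': 2, '9': 1}


def answers_over_cap(counts, rank, factor=5):
    """Hypothetical helper: list answers whose count exceeds
    `factor` times the count of the `rank`-th most popular answer."""
    # Sort (answer, count) pairs from most to least frequent.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    reference_answer, reference_count = ranked[rank - 1]
    over = [a for a, c in ranked if c > factor * reference_count]
    return reference_answer, reference_count, over


for rank in (6, 7):
    ref_answer, ref_count, over = answers_over_cap(answer_counts, rank)
    print(f"rank {rank}: reference answer {ref_answer!r} (count {ref_count}), "
          f"cap {5 * ref_count}, answers over cap: {over}")
```

With either reference value, the answers "0", "1", "2" and "3" are well above the cap, which is the discrepancy this issue is about.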

I am opening this issue mainly for the record, because it's unclear how it could be addressed, but people who are using the code should be made aware of it.

generate_questions produces a dataset that is more balanced than the original one #19
