# v0.6.0
## Simplification of Preprocessing
We wanted to make preprocessing for all our tasks (e.g. QA, DPR, NER, classification) more understandable for FARM users, so that it is easier to adjust to specific use cases or extend the functionality to new tasks.
To achieve this, we followed two design principles:
- Avoid deeply nested calls
- Keep all high-level descriptions in a single place
### Question Answering Preprocessing
We put a special focus on making QA preprocessing more sequential, dividing the code into meaningful snippets. #649
The code snippets are (see the related methods); a sketch of how they chain together follows the list:
- convert the input into the FARM-specific QA format
- tokenize the questions and texts
- split the texts into passages that fit the sequence length constraint of the Language Model
- [optionally] convert labels (disabled during inference)
- convert questions, texts, labels, and additional information to PyTorch tensors
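To make the flow concrete, here is a minimal, self-contained sketch of how these steps chain together. Everything in it (function name, whitespace tokenization, placeholder labels) is illustrative only and does not reproduce FARM's actual implementation:

```python
from typing import Dict, List

import torch


def qa_preprocessing_sketch(dicts: List[Dict], max_seq_len: int = 384, inference: bool = False):
    # 1. Convert the input into a FARM-style QA format
    samples = [{"question": d["questions"][0], "text": d["text"]} for d in dicts]

    # 2. Tokenize questions and texts (whitespace split stands in for a real tokenizer)
    for s in samples:
        s["question_tokens"] = s["question"].split()
        s["text_tokens"] = s["text"].split()

    # 3. Split texts into passages that fit the sequence length constraint
    for s in samples:
        room = max_seq_len - len(s["question_tokens"]) - 3  # reserve [CLS]/[SEP] slots
        s["passages"] = [s["text_tokens"][i:i + room]
                         for i in range(0, len(s["text_tokens"]), room)]

    # 4. [optionally] Convert labels (disabled during inference)
    if not inference:
        for s in samples:
            s["labels"] = [0] * len(s["passages"])  # placeholder labels

    # 5. Convert everything to PyTorch tensors (here: just the passage lengths)
    lengths = torch.tensor([len(p) for s in samples for p in s["passages"]])
    return samples, lengths
```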
## Breaking changes
- We switched to FastTokenizers (based on Hugging Face's tokenizers project, written in Rust) as the default Tokenizer. The `use_fast` parameter in the `Tokenizer.load()` method now defaults to `True`. Support for the slow, Python-based Tokenizers will be implemented for all tasks in the next release.
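  As an illustration (the model name is just an example), loading a tokenizer now returns a fast tokenizer without any extra arguments:

  ```python
  from farm.modeling.tokenization import Tokenizer

  # use_fast=True is now the default, so this returns a Rust-based FastTokenizer
  tokenizer = Tokenizer.load(pretrained_model_name_or_path="bert-base-uncased")
  ```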
- The `Processor.dataset_from_dicts` method now by default returns an additional value, `problematic_sample_ids`, that keeps track of which input samples caused problems during preprocessing:

  ```python
  dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts)
  ```
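  A small usage sketch, continuing the snippet above and assuming the returned ids index into the input `dicts`:

  ```python
  # Log any samples that could not be converted, so they can be fixed or dropped
  for idx in sorted(problematic_sample_ids):
      print(f"Could not preprocess sample {idx}: {dicts[idx]}")
  ```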
## Update to transformers version 4.1.1 and torch version 1.7.0
The new transformers release comes with many features that we do not want to miss out on, including model versioning. #665
Model versions can now be specified like this:
```python
from farm.infer import Inferencer

model = Inferencer.load(
    model_name_or_path="deepset/roberta-base-squad2",
    revision="v2.0",
    task_type="question_answering",
)
```
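The pinned revision is then used like any other model. A short, illustrative inference call in FARM's QA dict format (the question and text below are made up):

```python
QA_input = [{
    "questions": ["Why is model versioning useful?"],
    "text": "Model versioning lets users pin a specific model revision "
            "and reproduce results even when the model is updated upstream.",
}]
result = model.inference_from_dicts(dicts=QA_input)
print(result)
```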
## DPR enhancements
## Misc
- Cleaner logging and error handling #639
- Benchmark automation via CML #646
- Disable DPR tests on Windows, since they do not work with PyTorch 1.6.1 #637
- Option to disable MLflow logger #650
- Fix for EarlyStopping and custom head #617
- Add a parameter for the probability of masking a token in the LM task #630
Big thanks to all contributors!
@ftesser @pashok3d @Timoeller @tanaysoni @brandenchan @bogdankostic @kolk @tholor