v0.6.0

@Timoeller released this 30 Dec 11:45

Simplification of Preprocessing

We wanted to make the preprocessing for all our tasks (e.g. QA, DPR, NER, classification) easier to understand for FARM users, so that it is simpler to adjust it to specific use cases or extend it to new tasks.

To achieve this we followed two design choices:

  1. Avoid deeply nested calls
  2. Keep all high-level descriptions in a single place

Question Answering Preprocessing

We especially focused on making QA preprocessing more sequential and divided the code into meaningful snippets #649

The code snippets are (see the related method; a usage sketch follows this list):

  • convert the input into the FARM-specific QA format
  • tokenize the questions and texts
  • split texts into passages that fit the sequence length constraint of the Language Model
  • [optionally] convert labels (disabled during inference)
  • convert questions, texts, labels and additional information to PyTorch tensors

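As a rough sketch of how these steps are triggered from user code: the processor arguments, the example data and the input dict layout below are illustrative, not a fixed contract.

    from farm.data_handler.processor import SquadProcessor
    from farm.modeling.tokenization import Tokenizer

    # FastTokenizers are now the default (see breaking changes below)
    tokenizer = Tokenizer.load(
        pretrained_model_name_or_path="deepset/roberta-base-squad2",
        use_fast=True,
    )

    # The processor encapsulates the preprocessing steps listed above
    processor = SquadProcessor(
        tokenizer=tokenizer,
        max_seq_len=384,  # texts are split into passages to fit this limit
        doc_stride=128,   # token overlap between consecutive passages
        data_dir=None,    # no files needed when preprocessing in-memory dicts
    )

    # SQuAD-style input: one dict per text, each with its questions
    dicts = [
        {
            "context": "FARM is an open-source transfer learning framework by deepset.",
            "qas": [{"question": "Who develops FARM?", "id": "0", "answers": []}],
        }
    ]

    # Returns PyTorch tensors plus the ids of samples that failed preprocessing
    dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts)
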
Breaking changes

  1. Switching to FastTokenizers (based on the Hugging Face tokenizers library written in Rust) as the default Tokenizer: the use_fast parameter in the Tokenizer.load() method now defaults to True. Support for the slow, Python-based Tokenizers will be implemented for all tasks in the next release.
  2. The Processor.dataset_from_dicts method by default returns an additional value problematic_sample_ids that keeps track of which input samples caused problems during preprocessing:

    dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts)

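For existing code that unpacked two return values, a minimal migration sketch could look as follows; the warning at the end is just one way to surface skipped samples:

    import logging

    logger = logging.getLogger(__name__)

    # Before v0.6.0:
    # dataset, tensor_names = processor.dataset_from_dicts(dicts=dicts)

    # From v0.6.0 on, unpack a third return value:
    dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts)

    if problematic_sample_ids:
        logger.warning(f"Could not preprocess samples: {problematic_sample_ids}")
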
Update to transformers version 4.1.1 and torch version 1.7.0

Transformers comes with many new features that we do not want to miss out on, including model versioning. #665
Model versions can now be specified like this:

    model = Inferencer.load(
        model_name_or_path="deepset/roberta-base-squad2",
        revision="v2.0",
        task_type="question_answering",
    )
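
Following the Hugging Face model hub convention, the revision argument can be a tag (as above), a branch name, or a commit hash.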

DPR enhancements

Misc

  • Cleaner logging and error handling #639
  • Benchmark automation via CML #646
  • Disable DPR tests on Windows, since they do not work with PyTorch 1.6.1 #637
  • Option to disable MLflow logger #650
  • Fix for EarlyStopping with custom prediction heads #617
  • Add a parameter for the token masking probability in the LM task #630

Big thanks to all contributors!
@ftesser @pashok3d @Timoeller @tanaysoni @brandenchan @bogdankostic @kolk @tholor