# v0.6.0
## Simplification of Preprocessing
We wanted to make preprocessing for all our tasks (e.g. QA, DPR, NER, classification) more understandable for FARM users, so that it is easier to adjust to specific use cases or extend the functionality to new tasks.
To achieve this, we followed two design principles:
- Avoid deeply nested calls
- Keep all high-level descriptions in a single place
### Question Answering Preprocessing
We put a special focus on making QA preprocessing more sequential, dividing the code into meaningful snippets. #649
The code snippets are (see the related methods); a sketch of how they chain together follows the list:
- convert the input into the FARM-specific QA format
- tokenize the questions and texts
- split the texts into passages that fit the sequence length constraint of the Language Model
- [optionally] convert labels (disabled during inference)
- convert questions, texts, labels, and additional information to PyTorch tensors
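To make the flow concrete, here is a minimal, self-contained sketch of how these steps chain together. Everything in it (function name, whitespace tokenization, placeholder labels) is illustrative only and does not reproduce FARM's actual implementation:

```python
from typing import Dict, List

import torch


def qa_preprocessing_sketch(dicts: List[Dict], max_seq_len: int = 384, inference: bool = False):
    # 1. Convert the input into a FARM-style QA format
    samples = [{"question": d["questions"][0], "text": d["text"]} for d in dicts]

    # 2. Tokenize questions and texts (whitespace split stands in for a real tokenizer)
    for s in samples:
        s["question_tokens"] = s["question"].split()
        s["text_tokens"] = s["text"].split()

    # 3. Split texts into passages that fit the sequence length constraint
    for s in samples:
        room = max_seq_len - len(s["question_tokens"]) - 3  # reserve [CLS]/[SEP] slots
        s["passages"] = [s["text_tokens"][i:i + room]
                         for i in range(0, len(s["text_tokens"]), room)]

    # 4. [optionally] Convert labels (disabled during inference)
    if not inference:
        for s in samples:
            s["labels"] = [0] * len(s["passages"])  # placeholder labels

    # 5. Convert everything to PyTorch tensors (here: just the passage lengths)
    lengths = torch.tensor([len(p) for s in samples for p in s["passages"]])
    return samples, lengths
```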
## Breaking changes
- We switched to FastTokenizers (based on Hugging Face's tokenizers project, written in Rust) as the default Tokenizer. The `use_fast` parameter in the `Tokenizer.load()` method now defaults to `True`. Support for the slow, Python-based Tokenizers will be implemented for all tasks in the next release.
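  As an illustration (the model name is just an example), loading a tokenizer now returns a fast tokenizer without any extra arguments:

  ```python
  from farm.modeling.tokenization import Tokenizer

  # use_fast=True is now the default, so this returns a Rust-based FastTokenizer
  tokenizer = Tokenizer.load(pretrained_model_name_or_path="bert-base-uncased")
  ```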
- The `Processor.dataset_from_dicts` method now by default returns an additional value, `problematic_sample_ids`, that keeps track of which input samples caused problems during preprocessing:

  ```python
  dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts)
  ```
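  A small usage sketch, continuing the snippet above and assuming the returned ids index into the input `dicts`:

  ```python
  # Log any samples that could not be converted, so they can be fixed or dropped
  for idx in sorted(problematic_sample_ids):
      print(f"Could not preprocess sample {idx}: {dicts[idx]}")
  ```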
## Update to transformers version 4.1.1 and torch version 1.7.0
The new transformers release comes with many features that we do not want to miss out on, including model versioning. #665
Model versions can now be specified like this:
```python
from farm.infer import Inferencer

model = Inferencer.load(
    model_name_or_path="deepset/roberta-base-squad2",
    revision="v2.0",
    task_type="question_answering",
)
```
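The pinned revision is then used like any other model. A short, illustrative inference call in FARM's QA dict format (the question and text below are made up):

```python
QA_input = [{
    "questions": ["Why is model versioning useful?"],
    "text": "Model versioning lets users pin a specific model revision "
            "and reproduce results even when the model is updated upstream.",
}]
result = model.inference_from_dicts(dicts=QA_input)
print(result)
```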
## DPR enhancements
## Misc
- Cleaner logging and error handling #639
- Benchmark automation via CML #646
- Disable DPR tests on Windows, since they do not work with PyTorch 1.6.1 #637
- Option to disable MLflow logger #650
- Fix for EarlyStopping and custom head #617
- Add a parameter for the probability of masking a token in the LM task #630
Big thanks to all contributors!
@ftesser @pashok3d @Timoeller @tanaysoni @brandenchan @bogdankostic @kolk @tholor