
Train T5 on MATH #5054

Open
dirkgr opened this issue Mar 15, 2021 · 2 comments

Comments

dirkgr commented Mar 15, 2021

We should try to train T5 on the MATH dataset (https://arxiv.org/abs/2103.03874). Performance is expected to be poor: GPT2 gets under 7%. This dataset was specifically constructed to thwart transformer models, so this is expected. This model will serve as a baseline for later attempts.

More details on the MATH dataset are here: https://github.com/hendrycks/math/

There is also a pre-training dataset available. We should try some experiments to see if we can boost performance with it.

dirkgr added the Contributions welcome, Models, and medium labels on Mar 15, 2021
@KaiserWhoLearns

In the MATH dataset paper, the authors mention that the T5 tokenizer drops many LaTeX symbols, so they could not get T5 working (page 6).

An example of T5 tokenizer output on a sentence from MATH:

original: 'We let $\mathbf{a}$ denote $\overrightarrow{A},$ etc.'
T5 tokenizer: 'We let $<unk> mathbf<unk> a<unk> $ denote $<unk> overrightarrow<unk> A<unk>,$ etc.</s>'

The tokenizer also fails to tokenize \boxed correctly, which is used for accuracy evaluation.
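
For reference, the behavior can be reproduced directly with the Hugging Face tokenizer (a minimal sketch, assuming the `transformers` package and the `t5-base` checkpoint):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
text = r"We let $\mathbf{a}$ denote $\overrightarrow{A},$ etc."

ids = tokenizer(text).input_ids
print(tokenizer.convert_ids_to_tokens(ids))  # backslashes and braces come out as <unk>
print(tokenizer.decode(ids))                 # the round trip replaces the LaTeX markup with <unk>
```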

Do you have any suggestions?


dirkgr commented Sep 19, 2021

Aw, that sucks. So this is not straightforward. On the other hand, that makes it interesting.

I can think of three ways of doing it.

  1. Pre-process the text to replace the raw LaTeX with something that the T5 tokenizer can handle. Maybe it's good enough to add spaces where they make no difference to LaTeX but do make a difference to T5? (A rough sketch follows this list.)
  2. Expand the vocabulary. New word-piece embeddings will start out random, so it is probably a good idea to pre-train on other content that is rich in LaTeX, for example arXiv sources. (See the second sketch below.)
  3. Use a different tokenizer and only the T5 architecture, not the pre-trained weights. Pre-training then becomes even more important, and doing it is quite computationally expensive.
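
A rough, hypothetical sketch in the spirit of option 1: since `\`, `{`, and `}` appear to map to `<unk>` regardless of spacing, this version substitutes ASCII surrogate words for them and keeps an approximate inverse so `\boxed{...}` answers can be recovered for evaluation. The surrogate strings and the character list are my own placeholders, not anything from the paper or the codebase.

```python
import re

# Characters that the stock T5 tokenizer turns into <unk>, mapped to
# arbitrary ASCII surrogate words. Purely illustrative.
SURROGATES = [("\\", " bslash "), ("{", " lbrace "), ("}", " rbrace ")]


def to_t5_friendly(text: str) -> str:
    """Replace characters the T5 tokenizer cannot represent with surrogates."""
    for char, surrogate in SURROGATES:
        text = text.replace(char, surrogate)
    return " ".join(text.split())  # collapse the extra whitespace


def from_t5_friendly(text: str) -> str:
    """Approximate inverse, e.g. to recover \\boxed{...} answers for scoring."""
    for char, surrogate in SURROGATES:
        text = text.replace(surrogate.strip(), char)
    text = re.sub(r"\\ +", r"\\", text)        # re-attach backslashes to their commands
    return re.sub(r" *([{}]) *", r"\1", text)  # and braces to their arguments
```

The round trip is not exact (stray spaces, and it would misfire if one of the surrogate words appears verbatim in a problem), but it may be enough for a first experiment.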

Whatever way you try, doing nothing will still be a baseline.
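
For option 2, a minimal sketch using the Hugging Face `transformers` API (which tokens are worth adding is only a guess on my part):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Add the missing characters, plus a few frequent LaTeX commands.
new_tokens = ["\\", "{", "}", "\\boxed", "\\frac", "\\mathbf"]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# The new embeddings start out random, so continued pre-training on
# LaTeX-heavy text (e.g. arXiv sources) before fine-tuning on MATH is
# probably necessary for them to be useful.
```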
