
Train T5 on MATH #5054

Open
dirkgr opened this issue Mar 15, 2021 · 2 comments

Comments

dirkgr commented Mar 15, 2021

We should try to train T5 on the MATH dataset (https://arxiv.org/abs/2103.03874). Performance is expected to be poor: GPT2 gets under 7%. This dataset was specifically constructed to thwart transformer models, so this is expected. This model will serve as a baseline for later attempts.

More details on the MATH dataset are here: https://github.com/hendrycks/math/

There is also a pre-training dataset available. We should try some experiments to see if we can boost performance with it.

dirkgr added the Contributions welcome, Models, and medium labels on Mar 15, 2021
@KaiserWhoLearns

In the MATH dataset paper, the authors mention that the T5 tokenizer drops many LaTeX symbols, so they could not get T5 working (page 6).

An example of T5 tokenizer output on a sentence from MATH:

original: 'We let $\mathbf{a}$ denote $\overrightarrow{A},$ etc.'
T5 tokenizer: 'We let $<unk> mathbf<unk> a<unk> $ denote $<unk> overrightarrow<unk> A<unk>,$ etc.</s>'

The tokenizer also fails to tokenize \boxed correctly, which is used for accuracy evaluation.
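
For reference, the behavior can be reproduced directly with the Hugging Face tokenizer (a minimal sketch, assuming the `transformers` package and the `t5-base` checkpoint):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
text = r"We let $\mathbf{a}$ denote $\overrightarrow{A},$ etc."

ids = tokenizer(text).input_ids
print(tokenizer.convert_ids_to_tokens(ids))  # backslashes and braces come out as <unk>
print(tokenizer.decode(ids))                 # the round trip replaces the LaTeX markup with <unk>
```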

Do you have any suggestions?


dirkgr commented Sep 19, 2021

Aw, that sucks. So this is not straightforward. On the other hand, that makes it interesting.

I can think of three ways of doing it.

  1. Pre-process the text to replace the raw LaTeX with something that the T5 tokenizer can handle. Maybe it's good enough to add spaces where they make no difference to LaTeX but do make a difference to T5? (A rough sketch follows this list.)
  2. Expand the vocabulary. New word-piece embeddings will start out random, so it is probably a good idea to pre-train on other content that is rich in LaTeX, for example arXiv sources. (See the second sketch below.)
  3. Use a different tokenizer and only the T5 architecture, not the pre-trained weights. Pre-training then becomes even more important, and doing it is quite computationally expensive.
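
A rough, hypothetical sketch in the spirit of option 1: since `\`, `{`, and `}` appear to map to `<unk>` regardless of spacing, this version substitutes ASCII surrogate words for them and keeps an approximate inverse so `\boxed{...}` answers can be recovered for evaluation. The surrogate strings and the character list are my own placeholders, not anything from the paper or the codebase.

```python
import re

# Characters that the stock T5 tokenizer turns into <unk>, mapped to
# arbitrary ASCII surrogate words. Purely illustrative.
SURROGATES = [("\\", " bslash "), ("{", " lbrace "), ("}", " rbrace ")]


def to_t5_friendly(text: str) -> str:
    """Replace characters the T5 tokenizer cannot represent with surrogates."""
    for char, surrogate in SURROGATES:
        text = text.replace(char, surrogate)
    return " ".join(text.split())  # collapse the extra whitespace


def from_t5_friendly(text: str) -> str:
    """Approximate inverse, e.g. to recover \\boxed{...} answers for scoring."""
    for char, surrogate in SURROGATES:
        text = text.replace(surrogate.strip(), char)
    text = re.sub(r"\\ +", r"\\", text)        # re-attach backslashes to their commands
    return re.sub(r" *([{}]) *", r"\1", text)  # and braces to their arguments
```

The round trip is not exact (stray spaces, and it would misfire if one of the surrogate words appears verbatim in a problem), but it may be enough for a first experiment.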

Whatever way you try, doing nothing will still be a baseline.
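
For option 2, a minimal sketch using the Hugging Face `transformers` API (which tokens are worth adding is only a guess on my part):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Add the missing characters, plus a few frequent LaTeX commands.
new_tokens = ["\\", "{", "}", "\\boxed", "\\frac", "\\mathbf"]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# The new embeddings start out random, so continued pre-training on
# LaTeX-heavy text (e.g. arXiv sources) before fine-tuning on MATH is
# probably necessary for them to be useful.
```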
