I am currently using a dataset I prepared myself: 48 videos, each with different people. Of these, 40 are used for training, 3 for validation, and 5 for testing. However, in my current results the mouth on the generated face opens only slightly and barely moves.