Related paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (https://arxiv.org/abs/2010.11929).
The CIFAR-10 data can be downloaded from the official website and placed in the image directory, or fetched automatically by setting download to True when the code loads the dataset.
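Below is a minimal sketch of the automatic-download path, assuming torchvision is used for data loading; the root path, normalization values, and batch size are illustrative assumptions, not values fixed by this repository.

```python
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Illustrative transform: convert to tensors and normalize with common CIFAR-10 statistics
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])

# download=True fetches CIFAR-10 automatically if it is not already in the root directory
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False)
```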
This project implements the Vision Transformer (ViT) model using the CIFAR-10 dataset. The Vision Transformer is a state-of-the-art architecture for image classification tasks, leveraging the power of self-attention mechanisms.
- Utilizes the latest advancements in deep learning for image classification.
- Integrates the powerful capabilities of transformers into computer vision tasks.
- Provides a robust and efficient solution for handling image data.
- Patch Embedding: Extracts image patches and converts them into token embeddings.
- Attention Mechanism: Captures global dependencies and relationships between tokens.
- MLP Layers: Employs multi-layer perceptrons for non-linear transformations.
- Transformer Blocks: Comprises attention layers followed by feed-forward neural networks.
- Vision Transformer (ViT) Model: Combines these components into a cohesive architecture.
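The sketch below shows how these components fit together in a PyTorch implementation. It is a minimal illustration, not the exact model definition used in this repository; the patch size, embedding dimension, number of heads, and depth are assumed defaults chosen for CIFAR-10's 32x32 images.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project them to token embeddings."""
    def __init__(self, img_size=32, patch_size=4, in_channels=3, embed_dim=192):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution extracts and embeds each patch in a single step
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):
        x = self.proj(x)                      # (B, embed_dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, embed_dim)

class TransformerBlock(nn.Module):
    """Pre-norm self-attention followed by an MLP, each with a residual connection."""
    def __init__(self, embed_dim=192, num_heads=3, mlp_ratio=4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, int(embed_dim * mlp_ratio)),
            nn.GELU(),
            nn.Linear(int(embed_dim * mlp_ratio), embed_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # global token-to-token attention
        x = x + self.mlp(self.norm2(x))                    # non-linear per-token transformation
        return x

class ViT(nn.Module):
    """Patch embedding + [CLS] token and positional embeddings + transformer blocks + classifier."""
    def __init__(self, num_classes=10, embed_dim=192, depth=6):
        super().__init__()
        self.patch_embed = PatchEmbedding(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.patch_embed.num_patches + 1, embed_dim))
        self.blocks = nn.Sequential(*[TransformerBlock(embed_dim) for _ in range(depth)])
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.blocks(x)
        return self.head(self.norm(x)[:, 0])  # classify from the [CLS] token

model = ViT()
print(model(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```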
Contributions are welcome! Feel free to fork the repository and submit pull requests for improvements or bug fixes.
The final test-set accuracy after training was 74%. Better hyperparameter choices and a better model definition would likely make ViT perform better on the CIFAR-10 dataset, but I stopped at this accuracy level because of resource and time constraints. If you reach a higher accuracy, I would welcome your advice. Thank you.