Sacbe committed · Commit e389554 · verified · 1 Parent(s): 8a53a8d

Update README.md

Files changed (1): README.md (+14 -5)
 
---
license: apache-2.0
metrics:
- accuracy
- f1
- precision
- recall
library_name: transformers
pipeline_tag: image-classification
---
11
 
12
+ # VisionTransformer
13
 
14
**Attention-based neural networks such as the Vision Transformer** (ViT) have recently attained state-of-the-art results on many computer vision benchmarks. Scale is a primary ingredient in attaining excellent results; therefore, understanding a model's scaling properties is key to designing future generations effectively. While the laws for scaling Transformer language models have been studied, it is unknown how Vision Transformers scale. To address this, we scale ViT models and data, both up and down, and characterize the relationships between error rate, data, and compute. Along the way, we refine the architecture and training of ViT, reducing memory consumption and increasing the accuracy of the resulting models. As a result, we successfully train a ViT model with two billion parameters, which attains a new state of the art on ImageNet of 90.45% top-1 accuracy. The model also performs well for few-shot transfer, for example, reaching 84.86% top-1 accuracy on ImageNet with only 10 examples per class.

[1] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”. arXiv, June 3, 2021. Accessed: November 12, 2023. [Online]. Available: http://arxiv.org/abs/2010.11929
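
Since this card is tagged for the `transformers` image-classification pipeline, a minimal usage sketch might look like the following. The repo id and image path here are placeholders, not confirmed names from this repository.

```python
# Hedged sketch: loading an image-classification checkpoint with the
# transformers pipeline. "Sacbe/VisionTransformer" is a placeholder id.
from transformers import pipeline

classifier = pipeline("image-classification", model="Sacbe/VisionTransformer")
predictions = classifier("path/to/image.png")  # accepts a path, URL, or PIL.Image
print(predictions)  # e.g. [{'label': '...', 'score': 0.97}, ...]
```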

# Sharpness Aware Minimization (SAM)

SAM simultaneously minimizes loss value and loss sharpness. In particular, it seeks parameters that lie in neighborhoods having uniformly low loss. SAM improves model generalization and yields state-of-the-art performance for several datasets. Additionally, it provides robustness to label noise on par with that provided by state-of-the-art procedures that specifically target learning with noisy labels.
[2] P. Foret, A. Kleiner, and H. Mobahi, “Sharpness-Aware Minimization for Efficiently Improving Generalization”, 2021. [Online]. Available: https://arxiv.org/abs/2010.01412
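
The two-step procedure from [2] (ascend to the worst-case neighbor of the current weights, then descend using the gradient taken there) can be sketched in a few lines of PyTorch. This is an illustrative sketch, not the authors' implementation; `rho` is the neighborhood-radius hyperparameter.

```python
import torch

def sam_step(model, loss_fn, inputs, targets, base_optimizer, rho=0.05):
    """One Sharpness-Aware Minimization step (illustrative sketch)."""
    # 1) Gradient of the loss at the current weights w.
    loss_fn(model(inputs), targets).backward()

    # 2) Ascent direction epsilon = rho * g / ||g||_2, applied in place.
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]), p=2)
    perturbations = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (grad_norm + 1e-12)
            p.add_(e)                      # move to w + epsilon
            perturbations.append((p, e))

    # 3) Gradient at the perturbed point w + epsilon.
    base_optimizer.zero_grad()
    loss_fn(model(inputs), targets).backward()

    # 4) Restore w, then step with the sharpness-aware gradient.
    with torch.no_grad():
        for p, e in perturbations:
            p.sub_(e)
    base_optimizer.step()
    base_optimizer.zero_grad()
```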

# The negative log likelihood loss

It is useful for training a classification problem with $C$ classes.

If provided, the optional argument `weight` should be a 1D Tensor assigning weight to each of the classes. This is particularly useful when you have an unbalanced training set.

The input given through a forward call is expected to contain log-probabilities of each class.

Obtaining log-probabilities in a neural network is easily achieved by adding a `LogSoftmax` layer as the last layer of your network. You may use `CrossEntropyLoss` instead, if you prefer not to add an extra layer.
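
A quick way to see this equivalence is to check that `NLLLoss` on `LogSoftmax` outputs matches `CrossEntropyLoss` on raw logits. This is a small self-contained sketch with arbitrary example values, not code from this repository:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(3, 5)               # batch of 3, C = 5 classes
target = torch.tensor([1, 0, 4])

# NLLLoss expects log-probabilities, i.e. LogSoftmax output ...
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), target)
# ... while CrossEntropyLoss takes the raw logits directly.
ce = nn.CrossEntropyLoss()(logits, target)

print(torch.allclose(nll, ce))           # True
```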

The target that this loss expects should be a class index in the range $[0, C-1]$ where $C$ is the number of classes; if `ignore_index` is specified, this loss also accepts this class index (this index may not necessarily be in the class range).

The unreduced (i.e. with `reduction` set to `'none'`) loss can be described as:

$$
\ell(x, y) = L = \{l_1, \dots, l_N\}^\top, \quad l_n = -w_{y_n} x_{n, y_n}, \quad w_c = \text{weight}[c] \cdot \mathbb{1}\{c \neq \text{ignore\_index}\}
$$

where $x$ is the input, $y$ is the target, $w$ is the weight, and $N$ is the batch size.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64ff2131f7f3fa2d7fe256fc/CO6vFEjt3FkxB8JgZTbEd.png)
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64ff2131f7f3fa2d7fe256fc/CO6vFEjt3FkxB8JgZTbEd.png)