Token, Patch Embedding, and Positional Encoding in CV (Multimodal, ViT, Transformer)

北上ing · Published 2024-04-26 09:20:17

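The term "token" means different things in different contexts. The explanations here follow Google's original ViT paper.

In NLP, a token usually refers to a word (or a subword unit, depending on the tokenizer). In CV, the notion covers several related terms: token, class token, patch token, and so on.

To sort out how these relate, it helps to walk through how an input image is turned into data the model can compute on: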

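Step 1. Patch Embedding

"Patch Embedding" first splits the input image into small tiles called "image patches"; N tiles give N patches.

Each patch is then flattened and mapped by a fully connected (linear) layer to a one-dimensional vector. These vectors are called "patch embeddings", and the mapping yields N vectors of length D.

The shape therefore changes from a batch of input images to N patch embeddings per image as (B, C, H, W) => (B, N, D), where N = (H / P) * (W / P) for patch size P, and D denotes embed_dim, whose default value of 768 corresponds to ViT-Base.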
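The shape change can be checked with a minimal PyTorch sketch. It is illustrative rather than the reference implementation: the 224x224 input, the patch size P = 16, and the helper name `patchify` are assumptions made for this example.

```python
import torch
import torch.nn as nn

def patchify(images: torch.Tensor, patch_size: int) -> torch.Tensor:
    """Split (B, C, H, W) images into flattened patches of shape (B, N, C*P*P)."""
    b, c, h, w = images.shape
    p = patch_size
    assert h % p == 0 and w % p == 0, "H and W must be divisible by the patch size"
    # (B, C, H/P, P, W/P, P): cut each spatial axis into blocks of size P
    x = images.reshape(b, c, h // p, p, w // p, p)
    # (B, H/P, W/P, C, P, P): bring the two grid axes to the front
    x = x.permute(0, 2, 4, 1, 3, 5)
    # (B, N, C*P*P) with N = (H/P) * (W/P)
    return x.reshape(b, (h // p) * (w // p), c * p * p)

images = torch.randn(2, 3, 224, 224)        # B=2, C=3, H=W=224 (assumed sizes)
patches = patchify(images, patch_size=16)   # N = 14 * 14 = 196 flattened patches
projection = nn.Linear(3 * 16 * 16, 768)    # flattened patch -> D = 768
patch_embeddings = projection(patches)      # (B, N, D)
print(patches.shape, patch_embeddings.shape)
```

With P = 16 and C = 3 a flattened patch happens to have 3 * 16 * 16 = 768 values, the same number as D; that is a coincidence of these sizes, not a requirement.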

Tips:

1. References from the original paper:

Image patches are treated the same way as tokens (words) in an NLP application.

We refer to the output of this projection as the patch embeddings.

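2. Some papers, such as DeiT, also refer to the patch embeddings as "patch tokens".

3. The fully connected layer can be implemented either as a linear layer applied to the flattened patches or as a convolution whose kernel size and stride both equal the patch size.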
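Tip 3 can be checked numerically. The sketch below is a small experiment under assumed sizes (P = 4, C = 3, D = 8), not code from any particular ViT implementation: it copies the convolution weights into a linear layer and compares the two outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
P, C, D = 4, 3, 8                     # small assumed sizes for a quick check
images = torch.randn(2, C, 8, 8)      # (B, C, H, W) with H = W = 8, so N = 4

# Variant 1: convolution whose kernel size and stride both equal the patch size
conv = nn.Conv2d(C, D, kernel_size=P, stride=P)
out_conv = conv(images).flatten(2).transpose(1, 2)   # (B, D, H/P, W/P) -> (B, N, D)

# Variant 2: flatten each patch, then apply a linear layer with the same weights
linear = nn.Linear(C * P * P, D)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(D, C * P * P))
    linear.bias.copy_(conv.bias)
patches = nn.functional.unfold(images, kernel_size=P, stride=P)  # (B, C*P*P, N)
out_linear = linear(patches.transpose(1, 2))                     # (B, N, D)

print(torch.allclose(out_conv, out_linear, atol=1e-5))
```

The two variants compute the same projection; the convolution form simply avoids materialising the flattened patches as a separate tensor.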

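Step 2. Adding the class token

The class token is a vector that is concatenated with the patch embeddings and fed into the Transformer blocks together with them; its final output is used to predict the class.
Specifically, if the patch embeddings have shape (B, N, D), the class token has shape (B, 1, D) and sits at index 0 of the sequence, so the combined input has shape (B, N + 1, D), as shown in the figure below.

Reference from the original paper: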

In order to perform classification, we use the standard approach of adding an extra learnable “classification token” to the sequence.
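How the class token ends up driving the prediction can be sketched as follows. This is an illustration under assumptions, not the paper's code: the sizes are deliberately small, `nn.TransformerEncoder` is only a stand-in for the ViT blocks, a single linear layer stands in for the classification head, and position embeddings are left out to keep the focus on the class token.

```python
import torch
import torch.nn as nn

B, N, D, num_classes = 2, 16, 64, 10              # assumed small sizes

patch_embeddings = torch.randn(B, N, D)           # stand-in for the output of Step 1
cls_token = nn.Parameter(torch.zeros(1, 1, D))    # one learnable vector shared by all images

# Prepend the class token so that it sits at index 0 of the sequence
tokens = torch.cat((cls_token.expand(B, -1, -1), patch_embeddings), dim=1)  # (B, N + 1, D)

# Stand-in encoder (assumption): any module mapping (B, N + 1, D) -> (B, N + 1, D) fits here
layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, dim_feedforward=128, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(D, num_classes)                  # hypothetical classification head

encoded = encoder(tokens)                         # (B, N + 1, D)
logits = head(encoded[:, 0])                      # only the class-token output is classified
print(tokens.shape, logits.shape)
```

Because self-attention lets the class token attend to every patch, its output at index 0 can act as a summary of the whole image, which is why only that position is passed to the head.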

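Step 3. Adding the positional encoding

A learnable "positional encoding" made up of N + 1 vectors of length D is created, one for the class token and one for each patch, and added element-wise to the token sequence. It preserves the positional relationships among the patches, much as the words in a sentence have an order. See the figure above.

The source code below relates to the class token and the positional encoding and can be read alongside the steps above: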
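Why position information has to be added explicitly can be seen in a small experiment. The sketch below is an illustration under assumptions (a single `nn.MultiheadAttention` layer, random tensors, a fixed shuffle): without position information, shuffling the tokens merely shuffles the attention output, so the layer cannot tell one ordering from another.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, N, D = 1, 5, 16                                # assumed small sizes
attn = nn.MultiheadAttention(embed_dim=D, num_heads=4, batch_first=True).eval()

x = torch.randn(B, N, D)                          # token sequence without position info
perm = torch.tensor([3, 0, 4, 1, 2])              # a fixed shuffle of the N tokens

with torch.no_grad():
    out, _ = attn(x, x, x)
    xp = x[:, perm]
    out_shuffled, _ = attn(xp, xp, xp)
    # Without position info, shuffling the input only shuffles the output
    same_without_pos = torch.allclose(out[:, perm], out_shuffled, atol=1e-5)

    pos = torch.randn(1, N, D)                    # stand-in for a learned position embedding
    out_pos, _ = attn(x + pos, x + pos, x + pos)
    xs = x[:, perm] + pos                         # shuffled tokens, positions stay in place
    out_pos_shuffled, _ = attn(xs, xs, xs)
    same_with_pos = torch.allclose(out_pos[:, perm], out_pos_shuffled, atol=1e-5)

print(same_without_pos, same_with_pos)
```

The first check should hold and the second should not: once position vectors are added, the same patches in a different arrangement produce a genuinely different result.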

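# Requires: import torch; from torch import nn; from einops import repeat

# In __init__: both are learnable parameters
self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
self.cls_token = nn.Parameter(torch.randn(1, 1, dim))

# In forward: x holds the patch embeddings with shape (b, n, dim)
cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)  # one copy of the class token per image
x = torch.cat((cls_tokens, x), dim=1)  # (b, n + 1, dim), class token at index 0
x += self.pos_embedding[:, :(n + 1)]  # broadcast over the batch dimension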
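The fragment above can be wrapped into a self-contained module for experimenting. This is a sketch under assumptions: the class name `TokenPrep` is hypothetical, and `expand` replaces the einops `repeat` call so that only PyTorch is needed.

```python
import torch
import torch.nn as nn

class TokenPrep(nn.Module):
    """Hypothetical helper: prepend a class token, then add position embeddings."""

    def __init__(self, num_patches: int, dim: int):
        super().__init__()
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, _ = x.shape                               # x: (b, n, dim) patch embeddings
        cls_tokens = self.cls_token.expand(b, -1, -1)   # yields the same values as the einops repeat
        x = torch.cat((cls_tokens, x), dim=1)           # (b, n + 1, dim)
        return x + self.pos_embedding[:, : n + 1]       # broadcast over the batch

prep = TokenPrep(num_patches=196, dim=768)
patch_embeddings = torch.randn(2, 196, 768)             # assumed output of Step 1
tokens = prep(patch_embeddings)
print(tokens.shape)
```

The result has one more position than the input, N + 1 = 197, and is what would be passed on to the Transformer blocks.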