[Paper Notes] ViT: Refactoring the Patch Embedding and Attention Modules


0 Preface

Related links:

  1. ViT paper: https://arxiv.org/abs/2010.11929
  2. ViT video walkthrough (Bilibili): https://www.bilibili.com/video/BV15P4y137jb/?spm_id_from=333.999.0.0&vd_source=fff489d443210a81a8f273d768e44c30
  3. ViT source code (official, JAX): https://github.com/google-research/vision_transformer
  4. ViT source code (PyTorch, unofficial, widely starred and generally reliable): https://github.com/lucidrains/vit-pytorch

Key points to master:

  1. How to turn a 2-D image into a 1-D token sequence: the PatchEmbedding operation, plus a learnable class embedding and a Position Embedding
  2. How to write Multi-Head Attention, which contains two Linear layers for the dimension changes

Historical significance of ViT: it showed that a pure Transformer architecture can work in computer vision and kicked off the wave of research on vision Transformers.

1 Overall Code

Note: the code in this post is a refactor of the PyTorch version of ViT; if anything is wrong, feedback is welcome.
Reason: lucidrains' implementation relies on fairly high-level abstractions, such as the rearrange function from the einops package. The code is elegant, but it makes the shape changes hard to follow;
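For reference, here is what the einops route in lucidrains' repo does to the shapes, next to the Conv2d + flatten route used in this post. A minimal sketch (assuming einops is installed; the sizes 256 and 32 are the defaults used below):

import torch
from einops import rearrange

img = torch.randn(2, 3, 256, 256)

# einops route (as in lucidrains/vit-pytorch): the raw pixels of each 32x32 patch become one token
tokens_einops = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=32, p2=32)
print(tokens_einops.shape)   # torch.Size([2, 64, 3072])

# route used in this post: a strided Conv2d extracts the same 8x8 grid of patches,
# with the per-patch projection folded into the conv weights
conv = torch.nn.Conv2d(3, 32 * 32 * 3, kernel_size=32, stride=32)
tokens_conv = conv(img).flatten(2).transpose(1, 2)
print(tokens_conv.shape)     # torch.Size([2, 64, 3072])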

import torch
import torch.nn as nn

class PatchAndPosEmbedding(nn.Module):
    def __init__(self, img_size=256, patch_size=32, in_channels=3, embed_dim=1024, drop_out=0.):
        super(PatchAndPosEmbedding, self).__init__()

        num_patches = int((img_size/patch_size)**2)
        patch_size_dim = patch_size*patch_size*in_channels

        # patch_embedding: a strided conv with kernel_size == stride == patch_size splits the
        # image into non-overlapping patches and projects each patch to patch_size_dim channels
        self.patch_embedding = nn.Conv2d(in_channels=in_channels, out_channels=patch_size_dim, kernel_size=patch_size, stride=patch_size)
        self.linear = nn.Linear(patch_size_dim, embed_dim)

        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))   # an extra cls_token that aggregates information for classification
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches+1, embed_dim)) # learnable position information added to the patch embeddings

        self.dropout = nn.Dropout(drop_out)

    def forward(self, img):
        x = self.patch_embedding(img) # [B,C,H,W] -> [B, patch_size_dim, N, N], where N = img_size/patch_size and N*N = num_patches
        x = x.flatten(2)
        x = x.transpose(2, 1)  # [B, N*N, patch_size_dim]
        x = self.linear(x)     # [B, N*N, embed_dim]  # patch_size_dim -> embed_dim = 3072 -> 1024 to reduce the computation during encoding

        cls_token = self.cls_token.expand(x.shape[0], -1, -1)  # cls_token: [1, 1, embed_dim] -> [B, 1, embed_dim]
        x = torch.cat([cls_token, x], dim=1) # [B, N*N, embed_dim] -> [B, N*N+1, embed_dim]
        x += self.pos_embedding  # [B, N*N+1, embed_dim]  Why add instead of concat? A trade-off to limit the computation

        out = self.dropout(x)

        return out

class Attention(nn.Module):
    def __init__(self, dim, heads=16, head_dim=64, dropout=0.):
        super(Attention, self).__init__()
        inner_dim = heads * head_dim  # an FC layer maps the input dim to inner_dim, the dimension used for the attention representation
        self.heads = heads
        self.scale = head_dim ** -0.5

        project_out = not (heads == 1 and head_dim == dim)

        # build q, k, v; compare with the rearrange-based version in the original ViT project
        # Option 1: define to_q, to_k, to_v separately
        self.to_q = nn.Linear(dim, inner_dim, bias = False)
        self.to_k = nn.Linear(dim, inner_dim, bias = False)
        self.to_v = nn.Linear(dim, inner_dim, bias = False)

        # Option 2: define a single to_qkv and chunk it apart in forward
        # self.to_qkv = nn.Linear(dim, inner_dim*3, bias = False)

        self.atten = nn.Softmax(dim=-1)
        self.dropout = nn.Dropout(dropout)

        self.to_out = nn.Sequential(nn.Linear(inner_dim, dim), nn.Dropout(dropout)) if project_out else nn.Identity()  # project the concatenated heads (inner_dim) back to dim

    def forward(self, x):
        # Option 1 from above:
        q = self.to_q(x)  # [3,65,1024]
        k = self.to_k(x)  # [3,65,1024]
        v = self.to_v(x)  # [3,65,1024]

        # Option 2 from above:
        # toqkv = self.to_qkv(x)  # [3, 65, 3072]
        # q, k, v = toqkv.chunk(3, dim=-1)  # q, k, v.shape    [3,65,1024]

        q = q.reshape(q.shape[0], q.shape[1], self.heads, -1).transpose(1, 2)  # [3,65,1024]  -> [3,16,65,64]
        k = k.reshape(k.shape[0], k.shape[1], self.heads, -1).transpose(1, 2)  # [3,65,1024]  -> [3,16,65,64]
        v = v.reshape(v.shape[0], v.shape[1], self.heads, -1).transpose(1, 2)  # [3,65,1024]  -> [3,16,65,64]


        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        atten = self.atten(dots)
        atten = self.dropout(atten)

        out = torch.matmul(atten, v)    # [3, 16, 65, 64]
        out = out.transpose(1, 2)   # [3, 65, 16, 64]
        out = out.reshape(out.shape[0], out.shape[1], -1)   # [3, 65, 16*64]

        return self.to_out(out)

class MLP(nn.Module):  # a 2-layer FC block with GELU activations
    def __init__(self, dim, hidden_dim, dropout=0.):
        super(MLP, self).__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_dim, dim),
            nn.GELU(),
            nn.Dropout(dropout)
        )

    def forward(self, x):
        return self.net(x)

class PreNorm(nn.Module):  # in the encoder, LayerNorm is applied before multi-head attention or the MLP (pre-norm)
    def __init__(self, dim, fn):
        super(PreNorm, self).__init__()
        self.norm = nn.LayerNorm(dim)
        self.fn = fn
    def forward(self, x, **kwargs):
        return self.fn(self.norm(x), **kwargs)

class Transformer(nn.Module):  # the full encoder stack
    def __init__(self, dim, depth, heads, head_dim, mlp_hidden_dim, dropout=0.):
        super(Transformer, self).__init__()
        self.layers = nn.ModuleList([])
        for _ in range(depth):
            self.layers.append(
                nn.ModuleList([
                    PreNorm(dim, Attention(dim, heads, head_dim, dropout=dropout)),
                    PreNorm(dim, MLP(dim, mlp_hidden_dim, dropout=dropout))
                ]))

    def forward(self, x):
        for attn, ff in self.layers:
            x = attn(x) + x
            x = ff(x) + x
        return x

class VIT(nn.Module):
    def __init__(self, num_classes=10, img_size=256, patch_size=32, in_channels=3,
                 embed_dim=1024, depth=6, heads=16, head_dim=64, mlp_hidden_dim=2048, pool='cls', dropout=0.1):
        super(VIT, self).__init__()

        assert pool in {'cls', 'mean'}, 'pool type must be either cls (cls token) or mean (mean pooling)'
        self.pool = pool

        self.patchembedding = PatchAndPosEmbedding(img_size, patch_size, in_channels, embed_dim, dropout)

        self.transformer = Transformer(embed_dim, depth, heads, head_dim, mlp_hidden_dim, dropout)

        self.to_latent = nn.Identity()

        self.mlp_head = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, num_classes)
        )
    def forward(self, x):
        x = self.patchembedding(x)
        x = self.transformer(x)

        x = x.mean(dim=1) if self.pool == 'mean' else x[:, 0]

        out = self.mlp_head(x)

        return out


net = VIT()
x = torch.randn(3, 3, 256, 256)
out = net(x)
print(out, out.shape)

2 PatchAndPosEmbedding

Notes:
1. The 256x256x3 image is split into 32x32x3 patches, mainly via nn.Conv2d; the key is kernel_size == patch_size and stride == patch_size. Reading the code a few times makes this operation clear (the sketch after this list shows that such a conv is exactly "slice into patches + linear projection");
2. Because slicing and re-ordering the image discards position information, and the Transformer's computation is independent of spatial position, position information has to be encoded back into the network; a learnable vector is used for this, i.e., the PosEmbedding;
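A small numeric check (an illustrative sketch, not from the original post) that a Conv2d with kernel_size == stride == patch_size really is "slice into patches, then apply one Linear"; the unfold-based path below reuses the conv's own weights:

import torch
import torch.nn as nn

P, C, D_patch = 32, 3, 32 * 32 * 3
img = torch.randn(1, C, 256, 256)

conv = nn.Conv2d(C, D_patch, kernel_size=P, stride=P, bias=False)

# the same operation written as unfold (patch slicing) + a matmul with the conv weights
patches = nn.functional.unfold(img, kernel_size=P, stride=P)          # [1, C*P*P, 64]
manual = patches.transpose(1, 2) @ conv.weight.view(D_patch, -1).t()  # [1, 64, D_patch]

conv_out = conv(img).flatten(2).transpose(1, 2)                       # [1, 64, D_patch]
print(torch.allclose(manual, conv_out, atol=1e-4))                    # True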

Questions:
1. Why add an extra patch 0, i.e., the cls_token in the code, at embedding time?
Reason: suppose the output is a set of patch vectors (e.g., 9 vectors for a 3x3 patch grid); randomly picking one of them for classification works poorly, and using all of them costs too much computation. So a learnable vector, the learnable embedding, is added to aggregate the information.
2. Why is the Position Embedding added directly instead of concatenated?
Reason: add is in fact a special case of concat, and concat would inflate the dimension and blow up the computation. The add here is a concession to computational cost, but the Appendix of the paper shows that this kind of positional encoding still localizes patches quite well (the sketch after this Q&A illustrates the dimension difference).
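An illustrative sketch (hypothetical shapes, not from the original post) of the dimension difference between adding and concatenating the positional embedding:

import torch

tokens = torch.randn(2, 65, 1024)    # batch of 2, 64 patch tokens + 1 cls token
pos = torch.randn(1, 65, 1024)       # learnable positional embedding, broadcast over the batch

added = tokens + pos                                                  # [2, 65, 1024]: dimension unchanged
concatenated = torch.cat([tokens, pos.expand(2, -1, -1)], dim=-1)     # [2, 65, 2048]: dimension doubles

# with concat, every dim->dim Linear in the encoder would grow from 1024*1024 to 2048*2048 weights (~4x)
print(added.shape, concatenated.shape)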

class PatchAndPosEmbedding(nn.Module):
    def __init__(self, img_size=256, patch_size=32, in_channels=3, embed_dim=1024, drop_out=0.):
        super(PatchAndPosEmbedding, self).__init__()

        num_patches = int((img_size/patch_size)**2)
        patch_size_dim = patch_size*patch_size*in_channels

        # patch_embedding: a strided conv with kernel_size == stride == patch_size splits the
        # image into non-overlapping patches and projects each patch to patch_size_dim channels
        self.patch_embedding = nn.Conv2d(in_channels=in_channels, out_channels=patch_size_dim, kernel_size=patch_size, stride=patch_size)
        self.linear = nn.Linear(patch_size_dim, embed_dim)

        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))   # an extra cls_token that aggregates information for classification
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches+1, embed_dim)) # learnable position information added to the patch embeddings

        self.dropout = nn.Dropout(drop_out)

    def forward(self, img):
        x = self.patch_embedding(img) # [B,C,H,W] -> [B, patch_size_dim, N, N], where N = img_size/patch_size and N*N = num_patches
        x = x.flatten(2)
        x = x.transpose(2, 1)  # [B, N*N, patch_size_dim]
        x = self.linear(x)     # [B, N*N, embed_dim]  # patch_size_dim -> embed_dim = 3072 -> 1024 to reduce the computation during encoding

        cls_token = self.cls_token.expand(x.shape[0], -1, -1)  # cls_token: [1, 1, embed_dim] -> [B, 1, embed_dim]
        x = torch.cat([cls_token, x], dim=1) # [B, N*N, embed_dim] -> [B, N*N+1, embed_dim]
        x += self.pos_embedding  # [B, N*N+1, embed_dim]  Why add instead of concat? A trade-off to limit the computation

        out = self.dropout(x)

        return out
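A quick usage check of the module above (a sketch using the default arguments and the torch import from section 1; 65 = 64 patch tokens + 1 cls token):

embed = PatchAndPosEmbedding()                 # img_size=256, patch_size=32, embed_dim=1024
tokens = embed(torch.randn(2, 3, 256, 256))
print(tokens.shape)                            # torch.Size([2, 65, 1024])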

3 Attention

Implementing the attention mechanism requires three elements, Q (Query), K (Key), and V (Value); in effect, attention scores are computed between all pairs of patches. The formula is

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where d_k is the per-head dimension (head_dim below) and 1/sqrt(d_k) is the scale factor in the code.
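To make the formula concrete before the multi-head class below, here is a minimal single-head sketch (an illustrative example, not part of the original post); torch.nn.functional.scaled_dot_product_attention is used only as a cross-check and requires PyTorch 2.0+:

import torch
import torch.nn.functional as F

def single_head_attention(q, k, v):
    # softmax(Q K^T / sqrt(d_k)) V, exactly the formula above
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-1, -2)) * scale    # [B, N, N] patch-to-patch scores
    return torch.matmul(scores.softmax(dim=-1), v)           # [B, N, d_k]

q = torch.randn(3, 65, 64)   # [batch, tokens, head_dim]
k = torch.randn(3, 65, 64)
v = torch.randn(3, 65, 64)
out = single_head_attention(q, k, v)
print(out.shape)                                                                 # torch.Size([3, 65, 64])
print(torch.allclose(out, F.scaled_dot_product_attention(q, k, v), atol=1e-5))   # True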


class Attention(nn.Module):
    def __init__(self, dim, heads=16, head_dim=64, dropout=0.):
        super(Attention, self).__init__()
        inner_dim = heads * head_dim  # an FC layer maps the input dim to inner_dim, the dimension used for the attention representation
        self.heads = heads
        self.scale = head_dim ** -0.5

        project_out = not (heads == 1 and head_dim == dim)

        # build q, k, v; compare with the rearrange-based version in the original ViT project
        # Option 1: define to_q, to_k, to_v separately
        self.to_q = nn.Linear(dim, inner_dim, bias = False)
        self.to_k = nn.Linear(dim, inner_dim, bias = False)
        self.to_v = nn.Linear(dim, inner_dim, bias = False)

        # Option 2: define a single to_qkv and chunk it apart in forward
        # self.to_qkv = nn.Linear(dim, inner_dim*3, bias = False)

        self.atten = nn.Softmax(dim=-1)
        self.dropout = nn.Dropout(dropout)

        self.to_out = nn.Sequential(nn.Linear(inner_dim, dim), nn.Dropout(dropout)) if project_out else nn.Identity()  # project the concatenated heads (inner_dim) back to dim

    def forward(self, x):
        # Option 1 from above:
        q = self.to_q(x)  # [3,65,1024]
        k = self.to_k(x)  # [3,65,1024]
        v = self.to_v(x)  # [3,65,1024]

        # Option 2 from above:
        # toqkv = self.to_qkv(x)  # [3, 65, 3072]
        # q, k, v = toqkv.chunk(3, dim=-1)  # q, k, v.shape    [3,65,1024]

        q = q.reshape(q.shape[0], q.shape[1], self.heads, -1).transpose(1, 2)  # [3,65,1024]  -> [3,16,65,64]
        k = k.reshape(k.shape[0], k.shape[1], self.heads, -1).transpose(1, 2)  # [3,65,1024]  -> [3,16,65,64]
        v = v.reshape(v.shape[0], v.shape[1], self.heads, -1).transpose(1, 2)  # [3,65,1024]  -> [3,16,65,64]


        dots = torch.matmul(q, k.transpose(-1, -2)) * self.scale

        atten = self.atten(dots)
        atten = self.dropout(atten)

        out = torch.matmul(atten, v)    # [3, 16, 65, 64]
        out = out.transpose(1, 2)   # [3, 65, 16, 64]
        out = out.reshape(out.shape[0], out.shape[1], -1)   # [3, 65, 16*64]

        return self.to_out(out)
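And a quick usage check of the class above (a sketch; dim=1024, heads=16, head_dim=64 are the values assumed in the shape comments of forward):

attn = Attention(dim=1024, heads=16, head_dim=64)
x = torch.randn(3, 65, 1024)    # 3 images, 64 patch tokens + 1 cls token, embed_dim=1024
print(attn(x).shape)            # torch.Size([3, 65, 1024])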
