【论文笔记】BiFormer: Vision Transformer with Bi-Level Routing Attention

1年前作者：justld分类：Toy博客阅读(26)违法举报

这篇具有很好参考价值的文章主要介绍了【论文笔记】BiFormer: Vision Transformer with Bi-Level Routing Attention。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

论文地址：BiFormer: Vision Transformer with Bi-Level Routing Attention

代码地址：https://github.com/rayleizhu/BiFormer

vision transformer中Attention是极其重要的模块，但是它有着非常大的缺点：计算量太大。

BiFormer提出了Bi-Level Routing Attention，在Attention计算时，只关注最重要的token，由此来降低计算量。

一、Bi-Level Routing Attention

下图是多个不同的Attention模块关注的区域，（a）是原始的attention，其他的都是稀疏的Attention结构。Bi-Level Routing Attention如下图（f）所示。

biformer: vision transformer with bi-level routing attention,图像处理,注意力机制,深度学习,论文阅读,transformer,深度学习

与其他的稀疏Attention结构有所不同，Bi-Level Routing Attention首先将特征图分为不同的区域（区域大小是SxS），每个区域经过线性映射，得到QKV，然后QK在每个SxS的窗口内取平均作为该区域的token（可参考代码），得到（r表示region，即SxS的窗口）,通过的矩阵运算得到，如下式：

biformer: vision transformer with bi-level routing attention,图像处理,注意力机制,深度学习,论文阅读,transformer,深度学习

得到了邻接矩阵后，取其相关性最高的k个token索引，这样就知道每个窗口与哪k个窗口相关性更高了。

biformer: vision transformer with bi-level routing attention,图像处理,注意力机制,深度学习,论文阅读,transformer,深度学习

得到了后，用gather运算得到和：

biformer: vision transformer with bi-level routing attention,图像处理,注意力机制,深度学习,论文阅读,transformer,深度学习

最后计算Attention：

biformer: vision transformer with bi-level routing attention,图像处理,注意力机制,深度学习,论文阅读,transformer,深度学习

Bi-Level Routing Attention的计算过程如下图所示，其中的k就是计算相关性索引时设置的参数。

biformer: vision transformer with bi-level routing attention,图像处理,注意力机制,深度学习,论文阅读,transformer,深度学习

二、代码

Bi-Level Routing Attention的代码如下：文章来源地址https://www.toymoban.com/news/detail-822946.html

"""
Core of BiFormer, Bi-Level Routing Attention.

To be refactored.

author: ZHU Lei
github: https://github.com/rayleizhu
email: ray.leizhu@outlook.com

This source code is licensed under the license found in the
LICENSE file in the root directory of this source tree.
"""
from typing import Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange
from torch import Tensor


class TopkRouting(nn.Module):
    """
    differentiable topk routing with scaling
    Args:
        qk_dim: int, feature dimension of query and key
        topk: int, the 'topk'
        qk_scale: int or None, temperature (multiply) of softmax activation
        with_param: bool, wether inorporate learnable params in routing unit
        diff_routing: bool, wether make routing differentiable
        soft_routing: bool, wether make output value multiplied by routing weights
    """
    def __init__(self, qk_dim, topk=4, qk_scale=None, param_routing=False, diff_routing=False):
        super().__init__()
        self.topk = topk
        self.qk_dim = qk_dim
        self.scale = qk_scale or qk_dim ** -0.5
        self.diff_routing = diff_routing
        # TODO: norm layer before/after linear?
        self.emb = nn.Linear(qk_dim, qk_dim) if param_routing else nn.Identity()
        # routing activation
        self.routing_act = nn.Softmax(dim=-1)
    
    def forward(self, query:Tensor, key:Tensor)->Tuple[Tensor]:
        """
        Args:
            q, k: (n, p^2, c) tensor
        Return:
            r_weight, topk_index: (n, p^2, topk) tensor
        """
        if not self.diff_routing:
            query, key = query.detach(), key.detach()
        query_hat, key_hat = self.emb(query), self.emb(key) # per-window pooling -> (n, p^2, c) 
        attn_logit = (query_hat*self.scale) @ key_hat.transpose(-2, -1) # (n, p^2, p^2)
        topk_attn_logit, topk_index = torch.topk(attn_logit, k=self.topk, dim=-1) # (n, p^2, k), (n, p^2, k)
        r_weight = self.routing_act(topk_attn_logit) # (n, p^2, k)
        
        return r_weight, topk_index
        

class KVGather(nn.Module):
    def __init__(self, mul_weight='none'):
        super().__init__()
        assert mul_weight in ['none', 'soft', 'hard']
        self.mul_weight = mul_weight

    def forward(self, r_idx:Tensor, r_weight:Tensor, kv:Tensor):
        """
        r_idx: (n, p^2, topk) tensor
        r_weight: (n, p^2, topk) tensor
        kv: (n, p^2, w^2, c_kq+c_v)

        Return:
            (n, p^2, topk, w^2, c_kq+c_v) tensor
        """
        # select kv according to routing index
        n, p2, w2, c_kv = kv.size()
        topk = r_idx.size(-1)
        # print(r_idx.size(), r_weight.size())
        # FIXME: gather consumes much memory (topk times redundancy), write cuda kernel? 
        topk_kv = torch.gather(kv.view(n, 1, p2, w2, c_kv).expand(-1, p2, -1, -1, -1), # (n, p^2, p^2, w^2, c_kv) without mem cpy
                                dim=2,
                                index=r_idx.view(n, p2, topk, 1, 1).expand(-1, -1, -1, w2, c_kv) # (n, p^2, k, w^2, c_kv)
                               )

        if self.mul_weight == 'soft':
            topk_kv = r_weight.view(n, p2, topk, 1, 1) * topk_kv # (n, p^2, k, w^2, c_kv)
        elif self.mul_weight == 'hard':
            raise NotImplementedError('differentiable hard routing TBA')
        # else: #'none'
        #     topk_kv = topk_kv # do nothing

        return topk_kv

class QKVLinear(nn.Module):
    def __init__(self, dim, qk_dim, bias=True):
        super().__init__()
        self.dim = dim
        self.qk_dim = qk_dim
        self.qkv = nn.Linear(dim, qk_dim + qk_dim + dim, bias=bias)
    
    def forward(self, x):
        q, kv = self.qkv(x).split([self.qk_dim, self.qk_dim+self.dim], dim=-1)
        return q, kv
        # q, k, v = self.qkv(x).split([self.qk_dim, self.qk_dim, self.dim], dim=-1)
        # return q, k, v

class BiLevelRoutingAttention(nn.Module):
    """
    n_win: number of windows in one side (so the actual number of windows is n_win*n_win)
    kv_per_win: for kv_downsample_mode='ada_xxxpool' only, number of key/values per window. Similar to n_win, the actual number is kv_per_win*kv_per_win.
    topk: topk for window filtering
    param_attention: 'qkvo'-linear for q,k,v and o, 'none': param free attention
    param_routing: extra linear for routing
    diff_routing: wether to set routing differentiable
    soft_routing: wether to multiply soft routing weights 
    """
    def __init__(self, dim, num_heads=8, n_win=7, qk_dim=None, qk_scale=None,
                 kv_per_win=4, kv_downsample_ratio=4, kv_downsample_kernel=None, kv_downsample_mode='identity',
                 topk=4, param_attention="qkvo", param_routing=False, diff_routing=False, soft_routing=False, side_dwconv=3,
                 auto_pad=False):
        super().__init__()
        # local attention setting
        self.dim = dim
        self.n_win = n_win  # Wh, Ww
        self.num_heads = num_heads
        self.qk_dim = qk_dim or dim
        assert self.qk_dim % num_heads == 0 and self.dim % num_heads==0, 'qk_dim and dim must be divisible by num_heads!'
        self.scale = qk_scale or self.qk_dim ** -0.5


        ################side_dwconv (i.e. LCE in ShuntedTransformer)###########
        self.lepe = nn.Conv2d(dim, dim, kernel_size=side_dwconv, stride=1, padding=side_dwconv//2, groups=dim) if side_dwconv > 0 else \
                    lambda x: torch.zeros_like(x)
        
        ################ global routing setting #################
        self.topk = topk
        self.param_routing = param_routing
        self.diff_routing = diff_routing
        self.soft_routing = soft_routing
        # router
        assert not (self.param_routing and not self.diff_routing) # cannot be with_param=True and diff_routing=False
        self.router = TopkRouting(qk_dim=self.qk_dim,
                                  qk_scale=self.scale,
                                  topk=self.topk,
                                  diff_routing=self.diff_routing,
                                  param_routing=self.param_routing)
        if self.soft_routing: # soft routing, always diffrentiable (if no detach)
            mul_weight = 'soft'
        elif self.diff_routing: # hard differentiable routing
            mul_weight = 'hard'
        else:  # hard non-differentiable routing
            mul_weight = 'none'
        self.kv_gather = KVGather(mul_weight=mul_weight)

        # qkv mapping (shared by both global routing and local attention)
        self.param_attention = param_attention
        if self.param_attention == 'qkvo':
            self.qkv = QKVLinear(self.dim, self.qk_dim)
            self.wo = nn.Linear(dim, dim)
        elif self.param_attention == 'qkv':
            self.qkv = QKVLinear(self.dim, self.qk_dim)
            self.wo = nn.Identity()
        else:
            raise ValueError(f'param_attention mode {self.param_attention} is not surpported!')
        
        self.kv_downsample_mode = kv_downsample_mode
        self.kv_per_win = kv_per_win
        self.kv_downsample_ratio = kv_downsample_ratio
        self.kv_downsample_kenel = kv_downsample_kernel
        if self.kv_downsample_mode == 'ada_avgpool':
            assert self.kv_per_win is not None
            self.kv_down = nn.AdaptiveAvgPool2d(self.kv_per_win)
        elif self.kv_downsample_mode == 'ada_maxpool':
            assert self.kv_per_win is not None
            self.kv_down = nn.AdaptiveMaxPool2d(self.kv_per_win)
        elif self.kv_downsample_mode == 'maxpool':
            assert self.kv_downsample_ratio is not None
            self.kv_down = nn.MaxPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
        elif self.kv_downsample_mode == 'avgpool':
            assert self.kv_downsample_ratio is not None
            self.kv_down = nn.AvgPool2d(self.kv_downsample_ratio) if self.kv_downsample_ratio > 1 else nn.Identity()
        elif self.kv_downsample_mode == 'identity': # no kv downsampling
            self.kv_down = nn.Identity()
        elif self.kv_downsample_mode == 'fracpool':
            # assert self.kv_downsample_ratio is not None
            # assert self.kv_downsample_kenel is not None
            # TODO: fracpool
            # 1. kernel size should be input size dependent
            # 2. there is a random factor, need to avoid independent sampling for k and v 
            raise NotImplementedError('fracpool policy is not implemented yet!')
        elif kv_downsample_mode == 'conv':
            # TODO: need to consider the case where k != v so that need two downsample modules
            raise NotImplementedError('conv policy is not implemented yet!')
        else:
            raise ValueError(f'kv_down_sample_mode {self.kv_downsaple_mode} is not surpported!')

        # softmax for local attention
        self.attn_act = nn.Softmax(dim=-1)

        self.auto_pad=auto_pad

    def forward(self, x, ret_attn_mask=False):
        """
        x: NHWC tensor

        Return:
            NHWC tensor
        """
         # NOTE: use padding for semantic segmentation
        ###################################################
        if self.auto_pad:
            N, H_in, W_in, C = x.size()

            pad_l = pad_t = 0
            pad_r = (self.n_win - W_in % self.n_win) % self.n_win
            pad_b = (self.n_win - H_in % self.n_win) % self.n_win
            x = F.pad(x, (0, 0, # dim=-1
                          pad_l, pad_r, # dim=-2
                          pad_t, pad_b)) # dim=-3
            _, H, W, _ = x.size() # padded size
        else:
            N, H, W, C = x.size()
            assert H%self.n_win == 0 and W%self.n_win == 0 #
        ###################################################


        # patchify, (n, p^2, w, w, c), keep 2d window as we need 2d pooling to reduce kv size
        x = rearrange(x, "n (j h) (i w) c -> n (j i) h w c", j=self.n_win, i=self.n_win)

        #################qkv projection###################
        # q: (n, p^2, w, w, c_qk)
        # kv: (n, p^2, w, w, c_qk+c_v)
        # NOTE: separte kv if there were memory leak issue caused by gather
        q, kv = self.qkv(x) 

        # pixel-wise qkv
        # q_pix: (n, p^2, w^2, c_qk)
        # kv_pix: (n, p^2, h_kv*w_kv, c_qk+c_v)
        q_pix = rearrange(q, 'n p2 h w c -> n p2 (h w) c')
        kv_pix = self.kv_down(rearrange(kv, 'n p2 h w c -> (n p2) c h w'))
        kv_pix = rearrange(kv_pix, '(n j i) c h w -> n (j i) (h w) c', j=self.n_win, i=self.n_win)

        q_win, k_win = q.mean([2, 3]), kv[..., 0:self.qk_dim].mean([2, 3]) # window-wise qk, (n, p^2, c_qk), (n, p^2, c_qk)

        ##################side_dwconv(lepe)##################
        # NOTE: call contiguous to avoid gradient warning when using ddp
        lepe = self.lepe(rearrange(kv[..., self.qk_dim:], 'n (j i) h w c -> n c (j h) (i w)', j=self.n_win, i=self.n_win).contiguous())
        lepe = rearrange(lepe, 'n c (j h) (i w) -> n (j h) (i w) c', j=self.n_win, i=self.n_win)

        ############ gather q dependent k/v #################

        r_weight, r_idx = self.router(q_win, k_win) # both are (n, p^2, topk) tensors

        kv_pix_sel = self.kv_gather(r_idx=r_idx, r_weight=r_weight, kv=kv_pix) #(n, p^2, topk, h_kv*w_kv, c_qk+c_v)
        k_pix_sel, v_pix_sel = kv_pix_sel.split([self.qk_dim, self.dim], dim=-1)
        # kv_pix_sel: (n, p^2, topk, h_kv*w_kv, c_qk)
        # v_pix_sel: (n, p^2, topk, h_kv*w_kv, c_v)
        
        ######### do attention as normal ####################
        k_pix_sel = rearrange(k_pix_sel, 'n p2 k w2 (m c) -> (n p2) m c (k w2)', m=self.num_heads) # flatten to BMLC, (n*p^2, m, topk*h_kv*w_kv, c_kq//m) transpose here?
        v_pix_sel = rearrange(v_pix_sel, 'n p2 k w2 (m c) -> (n p2) m (k w2) c', m=self.num_heads) # flatten to BMLC, (n*p^2, m, topk*h_kv*w_kv, c_v//m)
        q_pix = rearrange(q_pix, 'n p2 w2 (m c) -> (n p2) m w2 c', m=self.num_heads) # to BMLC tensor (n*p^2, m, w^2, c_qk//m)

        # param-free multihead attention
        attn_weight = (q_pix * self.scale) @ k_pix_sel # (n*p^2, m, w^2, c) @ (n*p^2, m, c, topk*h_kv*w_kv) -> (n*p^2, m, w^2, topk*h_kv*w_kv)
        attn_weight = self.attn_act(attn_weight)
        out = attn_weight @ v_pix_sel # (n*p^2, m, w^2, topk*h_kv*w_kv) @ (n*p^2, m, topk*h_kv*w_kv, c) -> (n*p^2, m, w^2, c)
        out = rearrange(out, '(n j i) m (h w) c -> n (j h) (i w) (m c)', j=self.n_win, i=self.n_win,
                        h=H//self.n_win, w=W//self.n_win)

        out = out + lepe
        # output linear
        out = self.wo(out)

        # NOTE: use padding for semantic segmentation
        # crop padded region
        if self.auto_pad and (pad_r > 0 or pad_b > 0):
            out = out[:, :H_in, :W_in, :].contiguous()

        if ret_attn_mask:
            return out, r_weight, r_idx, attn_weight
        else:
            return out

到了这里，关于【论文笔记】BiFormer: Vision Transformer with Bi-Level Routing Attention的文章就介绍完了。如果您还想了解更多内容，请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章，希望大家以后多多支持TOY模板网！

本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若转载，请注明出处：如若内容造成侵权/违法违规/事实不符，请点击违法举报进行投诉反馈，一经查实，立即删除！

分享到：

领支付宝红包赞助服务器费用

论文笔记：ViTGAN: Training GANs with Vision Transformers
2021 论文研究的问题是：ViT是否可以在不使用卷积或池化的情况下完成图像生成任务即不用CNN，而使用ViT来完成图像生成任务将ViT架构集成到GAN中，发现现有的GAN正则化方法与self-attention机制的交互很差，导致训练过程中严重的不稳定 ——引入了新的正则化技术来训练带有
2024年02月07日
浏览(8)
【论文合集】Awesome Low Level Vision
Low-level任务：常见的包括 Super-Resolution，denoise， deblur， dehze， low-light enhancement， deartifacts等。简单来说，是把特定降质下的图片还原成好看的图像，现在基本上用end-to-end的模型来学习这类 ill-posed问题的求解过程，客观指标主要是PSNR，SSIM，大家指标都刷的很高。目前面临以
2024年02月08日
浏览(9)
弱监督实例分割 Box-supervised Instance Segmentation with Level Set Evolution 论文笔记
写在前面这是一篇基于 Box 的弱监督实例分割文章，之前也分享过几篇（主页有，欢迎关注一下呗~），采用旧纸堆里面翻出来的能量函数来做弱监督。论文地址：Box-supervised Instance Segmentation with Level Set Evolution 代码地址：https://github.com/LiWentomng/boxlevelset 收录于：ECCV 202
2023年04月18日
浏览(12)
Neural Geometric Level of Detail: Real-time Rendering with Implicit 3D Shapes 论文笔记&环境配置
发布于 CVPR 2021 论文介绍了一种具有神经SDF的复杂几何实时渲染方法。论文提出了一种神经SDF表示，可以有效地捕获多个LOD，并以最先进的质量重建3D几何图形。论文中的架构可以以比传统方法具有更高视觉保真度的压缩格式表示 3D 形状，并且即使在单个学习示例中也能跨不
2024年01月24日
浏览(11)
【论文笔记】DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets
原文链接：https://arxiv.org/abs/2301.06051 本文提出DSVT，一种通用的、部署友好的、基于transformer的3D主干，可用于多种基于点云处理的3D感知任务中。传统的稀疏点云特征提取方法，如PointNet系列和稀疏卷积，要么需要高计算力进行采样与分组，要么因为子流形卷积导致表达能力受
2024年02月05日
浏览(7)
VL系列 Exchanging-based Multimodal Fusion with Transformer 论文阅读笔记
写在前面又是一个周末教师节，祝老师们节日快乐呀。依惯例，论文读起来~ 这是一篇多模态融合的文章，也算是这些年新出的一种方式了，具体还不知道啥情况，代码已开源，一试便知。论文地址：Exchanging-based Multimodal Fusion with Transformer 代码地址：https://github.
2024年02月05日
浏览(12)
PVT v2: Improved Baselines with Pyramid Vision Transformer
论文地址：https://arxiv.org/pdf/2106.13797.pdf 代码地址： https://github.com/whai362/PVT 最近关于视觉Transformer的研究正在汇聚于主干网络，该主干网络设计用于下游视觉任务，如图像分类、目标检测、实例和语义分割。例如，Vision Transformer（ViT）首先证明了纯Transformer可以实现图像分类最
2024年02月08日
浏览(13)
PolarFormer：Multi-camera 3D Object Detection with Polar Transformer——论文笔记
参考代码：PolarFormer 介绍：在仓库RoboBEV中总结了现有的一些bev感知算法在不同输入情况下的鲁棒性，在这些感知算法中PolarFormer拥有较为不错的泛化性能。这个算法的思想是将之前由直角坐标系栅格化构建bev网格，转换到由极坐标构建栅格化bev网格，这样的bev特征构建方法其
2024年02月11日
浏览(13)
论文阅读 Vision Transformer - VIT
通过将图像切成patch线形层编码成token特征编码的方法，用transformer的encoder来做图像分类解决问题： transformer输入限制: 由于自注意力+backbone，算法复杂度为o(n²)，token长度一般要512才足够运算解决：a) 将图片转为token输入 b) 将特征图转为token输入 c)√ 切patch转为token输入 tra
2024年02月01日
浏览(11)
李沐论文精读系列二：Vision Transformer、MAE、Swin-Transformer
传送门：李沐论文精读系列一： ResNet、Transformer、GAN、BERT 李沐论文精读系列三：MoCo、对比学习综述（MoCov1/v2/v3、SimCLR v1/v2、DINO等）李沐论文精读系列四：CLIP和改进工作串讲（LSeg、GroupViT、VLiD、 GLIPv1、 GLIPv2、CLIPasso）论文名称： An Image Is Worth 16x16 Words: Transformers For Imag
2024年01月17日
浏览(9)