This post walks through LSS (Lift, Splat, Shoot), a classic vision-based BEV algorithm published at ECCV 2020. LSS performs map-view semantic segmentation by explicitly estimating a discrete depth distribution for each image feature; the key idea is how 2D image features are lifted into BEV features.
Project page: https://nv-tlabs.github.io/lift-splat-shoot/
0. Project structure
The project layout is very compact: the source files are data.py, explore.py, models.py, tools.py and train.py. The two files worth focusing on are explore.py and models.py.
.
├── imgs
│ ├── check.gif
│ └── eval.gif
├── LICENSE
├── main.py
├── model525000.pt
├── nuscenes -> /root/bev_baseline/nuscenes
├── README.md
└── src
├── data.py
├── explore.py
├── __init__.py
├── models.py
├── tools.py
└── train.py
The codebase is also tiny: roughly 1,000 lines in total (963 lines of Python), as the cloc summary below shows.
-------------------------------------------------------------------------------
Language files blank comment code
-------------------------------------------------------------------------------
Python 7 309 339 963
Markdown 1 19 0 55
HTML 1 0 0 1
-------------------------------------------------------------------------------
SUM: 9 328 339 1019
-------------------------------------------------------------------------------
1. main.py
First, the entry point. It uses the fire library to map command-line arguments to functions:
from fire import Fire

import src

if __name__ == '__main__':
    Fire({
        'lidar_check': src.explore.lidar_check,
        'cumsum_check': src.explore.cumsum_check,
        'train': src.train.train,
        'eval_model_iou': src.explore.eval_model_iou,
        'viz_model_preds': src.explore.viz_model_preds,
    })
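Each key in the dict becomes a subcommand, so training and visualization are launched the same way. For example (flag names assumed to mirror the function signatures in src/train.py and src/explore.py):
python3 main.py train mini --dataroot=./nuscenes --gpuid=0
python3 main.py viz_model_preds mini --modelf=model525000.pt --dataroot=./nuscenes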
2. explore.py
To evaluate a model, run the command below, which dispatches to the eval_model_iou function. The arguments passed are the data version (mini, i.e. nuscenes-mini), the model checkpoint (model525000.pt), and the dataset root:
python3 main.py eval_model_iou mini --modelf=model525000.pt --dataroot=./nuscenes
The function also exposes the following keyword arguments:
- H=900, W=1600: size of the original camera images
- resize_lim=(0.193, 0.225): random resize scale range
- final_dim=(128, 352): final image size after preprocessing
- bot_pct_lim=(0.0, 0.22): range of the fraction cropped off the bottom of the image
- rot_lim=(-5.4, 5.4): random rotation range (in degrees) applied during training
- rand_flip=True: whether to randomly flip images
- xbound=[-50.0, 50.0, 0.5]: extent and cell size of the BEV grid along x, in meters, i.e. (50 - (-50)) / 0.5 = 200 cells
- ybound=[-50.0, 50.0, 0.5]: extent and cell size along y, in meters (200 cells)
- zbound=[-10.0, 10.0, 20.0]: extent and cell size along z, in meters (a single 20 m tall cell)
- dbound=[4.0, 45.0, 1.0]: extent and step along the depth axis, in meters (41 depth bins)
import torch
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
from PIL import Image
import matplotlib.patches as mpatches
import os

from .data import compile_data
from .tools import (ego_to_cam, get_only_in_img_mask, denormalize_img,
                    SimpleLoss, get_val_info, add_ego, gen_dx_bx,
                    get_nusc_maps, plot_nusc_map)
from .models import compile_model


def eval_model_iou(version,
                   modelf,
                   dataroot='/data/nuscenes',
                   gpuid=-1,

                   H=900, W=1600,
                   resize_lim=(0.193, 0.225),
                   final_dim=(128, 352),
                   bot_pct_lim=(0.0, 0.22),
                   rot_lim=(-5.4, 5.4),
                   rand_flip=True,

                   xbound=[-50.0, 50.0, 0.5],
                   ybound=[-50.0, 50.0, 0.5],
                   zbound=[-10.0, 10.0, 20.0],
                   dbound=[4.0, 45.0, 1.0],

                   bsz=4,
                   nworkers=10,
                   ):
    grid_conf = {
        'xbound': xbound,
        'ybound': ybound,
        'zbound': zbound,
        'dbound': dbound,
    }
    data_aug_conf = {
        'resize_lim': resize_lim,
        'final_dim': final_dim,
        'rot_lim': rot_lim,
        'H': H, 'W': W,
        'rand_flip': rand_flip,
        'bot_pct_lim': bot_pct_lim,
        'cams': ['CAM_FRONT_LEFT', 'CAM_FRONT', 'CAM_FRONT_RIGHT',
                 'CAM_BACK_LEFT', 'CAM_BACK', 'CAM_BACK_RIGHT'],
        'Ncams': 5,
    }
    trainloader, valloader = compile_data(version, dataroot, data_aug_conf=data_aug_conf,
                                          grid_conf=grid_conf, bsz=bsz, nworkers=nworkers,
                                          parser_name='segmentationdata')

    device = torch.device('cpu') if gpuid < 0 else torch.device(f'cuda:{gpuid}')

    model = compile_model(grid_conf, data_aug_conf, outC=1)
    print('loading', modelf)
    # load on GPU:
    # model.load_state_dict(torch.load(modelf))
    # load on CPU:
    model.load_state_dict(torch.load(modelf, map_location=torch.device('cpu')))
    model.to(device)

    # loss_fn = SimpleLoss(1.0).cuda(gpuid)
    loss_fn = SimpleLoss(1.0)

    model.eval()
    val_info = get_val_info(model, valloader, loss_fn, device)
    print(val_info)
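get_val_info lives in tools.py and accumulates loss and IoU over the validation loader. A minimal sketch of the IoU accumulation it performs (the helper name below is illustrative; the actual helpers in tools.py are get_val_info and get_batch_iou), assuming the model outputs raw logits, so thresholding at 0 corresponds to a sigmoid probability of 0.5:

import torch

def batch_iou_counts(preds, binimgs):
    # preds: raw logits, B x 1 x 200 x 200; binimgs: binary ground-truth masks
    with torch.no_grad():
        pred = preds > 0      # threshold logits at 0 (i.e. p = 0.5)
        tgt = binimgs.bool()
        intersect = (pred & tgt).sum().float().item()
        union = (pred | tgt).sum().float().item()
    return intersect, union

# accumulate intersect/union over the loader, then report iou = total_intersect / total_union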
3. models.py
3.1 LSS model initialization
The grid and data-augmentation configs are passed into the network; here outC is 1, i.e. a single predicted class.
Initialization partitions the BEV grid, sets the image downsampling factor (16) and the image feature dimension (64), generates the frustum, and constructs the CamEncode and BevEncode sub-networks.
def compile_model(grid_conf, data_aug_conf, outC):
    return LiftSplatShoot(grid_conf, data_aug_conf, outC)


class LiftSplatShoot(nn.Module):
    def __init__(self, grid_conf, data_aug_conf, outC):
        super(LiftSplatShoot, self).__init__()
        self.grid_conf = grid_conf          # grid configuration
        self.data_aug_conf = data_aug_conf  # data augmentation configuration

        # partition the BEV grid
        dx, bx, nx = gen_dx_bx(self.grid_conf['xbound'],
                               self.grid_conf['ybound'],
                               self.grid_conf['zbound'],
                               )
        self.dx = nn.Parameter(dx, requires_grad=False)  # cell sizes [0.5, 0.5, 20]
        self.bx = nn.Parameter(bx, requires_grad=False)  # first cell centers [-49.75, -49.75, 0]
        self.nx = nn.Parameter(nx, requires_grad=False)  # cell counts [200, 200, 1]

        self.downsample = 16  # image downsampling factor
        self.camC = 64        # image feature dimension
        self.frustum = self.create_frustum()  # frustum: D x fH x fW x 3 (41 x 8 x 22 x 3)
        self.D, _, _, _ = self.frustum.shape  # D: 41
        self.camencode = CamEncode(self.D, self.camC, self.downsample)
        self.bevencode = BevEncode(inC=self.camC, outC=outC)

        # toggle using QuickCumsum vs. autograd
        self.use_quickcumsum = True
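For reference, gen_dx_bx (defined in tools.py) derives the cell size dx, the center of the first cell bx, and the cell count nx from each [min, max, step] triple; a sketch consistent with the values annotated above:

import torch

def gen_dx_bx(xbound, ybound, zbound):
    # each bound is [min, max, step]
    dx = torch.Tensor([row[2] for row in [xbound, ybound, zbound]])                # cell sizes
    bx = torch.Tensor([row[0] + row[2] / 2.0 for row in [xbound, ybound, zbound]])  # first cell centers
    nx = torch.LongTensor([(row[1] - row[0]) / row[2]
                           for row in [xbound, ybound, zbound]])                   # cell counts
    return dx, bx, nx

# gen_dx_bx([-50, 50, 0.5], [-50, 50, 0.5], [-10, 10, 20])
# -> dx = [0.5, 0.5, 20], bx = [-49.75, -49.75, 0], nx = [200, 200, 1]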
3.1.1 create_frustum: generating the frustum point cloud
create_frustum produces a $D \times fH \times fW \times 3$ tensor holding the frustum point-cloud coordinates, i.e. the usual pixel-plus-depth coordinates $(u, v, d)$ (stored in the order xs, ys, ds).
- $D$ takes the values [4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30., 31., 32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43., 44.], i.e. 41 discrete depth values;
- along $H$, the values are [0.0000, 18.1429, 36.2857, 54.4286, 72.5714, 90.7143, 108.8571, 127.0000]: 8 evenly spaced rows over the image height (16x downsampling);
- along $W$, the values are [0.0000, 16.7143, 33.4286, 50.1429, 66.8571, 83.5714, 100.2857, 117.0000, 133.7143, 150.4286, 167.1429, 183.8571, 200.5714, 217.2857, 234.0000, 250.7143, 267.4286, 284.1429, 300.8571, 317.5714, 334.2857, 351.0000]: 22 evenly spaced columns over the image width (16x downsampling).
def create_frustum(self):
    # final image size after preprocessing: ogfH = 128, ogfW = 352
    ogfH, ogfW = self.data_aug_conf['final_dim']
    # size after 16x downsampling: fH = 8, fW = 22
    fH, fW = ogfH // self.downsample, ogfW // self.downsample
    # self.grid_conf['dbound'] = [4, 45, 1]
    # grid along the depth axis, ds: D x fH x fW (41 x 8 x 22)
    # ds: tensor([ 4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14., 15., 16., 17.,
    #             18., 19., 20., 21., 22., 23., 24., 25., 26., 27., 28., 29., 30., 31.,
    #             32., 33., 34., 35., 36., 37., 38., 39., 40., 41., 42., 43., 44.])
    ds = torch.arange(*self.grid_conf['dbound'], dtype=torch.float).view(-1, 1, 1).expand(-1, fH, fW)
    # D: 41, the number of depth bins
    D, _, _ = ds.shape
    """
    1. torch.linspace(0, ogfW - 1, fW, dtype=torch.float)
       tensor([  0.0000,  16.7143,  33.4286,  50.1429,  66.8571,  83.5714, 100.2857,
               117.0000, 133.7143, 150.4286, 167.1429, 183.8571, 200.5714, 217.2857,
               234.0000, 250.7143, 267.4286, 284.1429, 300.8571, 317.5714, 334.2857,
               351.0000])
    2. torch.linspace(0, ogfH - 1, fH, dtype=torch.float)
       tensor([  0.0000,  18.1429,  36.2857,  54.4286,  72.5714,  90.7143, 108.8571,
               127.0000])
    """
    # 22 evenly spaced columns in [0, 351], xs: D x fH x fW (41 x 8 x 22)
    xs = torch.linspace(0, ogfW - 1, fW, dtype=torch.float).view(1, 1, fW).expand(D, fH, fW)
    # 8 evenly spaced rows in [0, 127], ys: D x fH x fW (41 x 8 x 22)
    ys = torch.linspace(0, ogfH - 1, fH, dtype=torch.float).view(1, fH, 1).expand(D, fH, fW)
    # stack into grid coordinates, frustum: D x fH x fW x 3
    frustum = torch.stack((xs, ys, ds), -1)
    return nn.Parameter(frustum, requires_grad=False)
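The three axes listed above can be reproduced in isolation; a quick check (values assume final_dim=(128, 352), dbound=[4.0, 45.0, 1.0] and downsample=16):

import torch

ds = torch.arange(4.0, 45.0, 1.0)   # 41 depth bins: 4.0, 5.0, ..., 44.0
ys = torch.linspace(0, 127, 8)      # 8 rows across the 128-pixel height
xs = torch.linspace(0, 351, 22)     # 22 columns across the 352-pixel width
print(ds.numel(), ys.numel(), xs.numel())  # 41 8 22 -> frustum is 41 x 8 x 22 x 3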
3.1.2 CamEncode initialization
The image feature extractor is efficientnet-b0, whose last two stages produce feature maps of shape $(bs, 112, H/16, W/16)$ and $(bs, 320, H/32, W/32)$. The deeper map is upsampled and concatenated with the shallower one (320 + 112 = 432 channels), and the fused result is convolved down to $(bs, 512, H/16, W/16)$.
import torch
from torch import nn
from efficientnet_pytorch import EfficientNet
from torchvision.models.resnet import resnet18

from .tools import gen_dx_bx, cumsum_trick, QuickCumsum


class Up(nn.Module):
    def __init__(self, in_channels, out_channels, scale_factor=2):
        super().__init__()
        self.up = nn.Upsample(scale_factor=scale_factor, mode='bilinear',
                              align_corners=True)  # upsample B x C x H x W -> B x C x 2H x 2W
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),  # inplace=True operates in place to save memory
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )

    def forward(self, x1, x2):
        x1 = self.up(x1)
        x1 = torch.cat([x2, x1], dim=1)
        return self.conv(x1)


class CamEncode(nn.Module):
    def __init__(self, D, C, downsample):
        super(CamEncode, self).__init__()
        self.D = D
        self.C = C
        # efficientnet backbone for feature extraction
        self.trunk = EfficientNet.from_pretrained("efficientnet-b0")
        # upsampling module; input channels 320 + 112, output channels 512
        self.up1 = Up(320+112, 512)
        # the 105 output channels split into two parts: the first D = 41 channels
        # are logits over the 41 discrete depths; the remaining C = 64 channels
        # are the semantic features at each feature-map location
        self.depthnet = nn.Conv2d(512, self.D + self.C, kernel_size=1, padding=0)
3.1.3 BevEncode initialization
The BEV feature network is built from resnet18: conv1 is re-created to accept inC = 64 input channels, the first three residual stages serve as the backbone, and the decoder fuses the layer1 and layer3 feature maps.
class BevEncode(nn.Module):
    def __init__(self, inC, outC):
        super(BevEncode, self).__init__()
        # layer1 output: (bs, 64, 100, 100)
        # layer2 output: (bs, 128, 50, 50)
        # layer3 output: (bs, 256, 25, 25)
        # use the first 3 stages of resnet18 as the backbone
        trunk = resnet18(pretrained=False, zero_init_residual=True)
        self.conv1 = nn.Conv2d(inC, 64, kernel_size=7, stride=2, padding=3,
                               bias=False)
        self.bn1 = trunk.bn1
        self.relu = trunk.relu
        self.layer1 = trunk.layer1
        self.layer2 = trunk.layer2
        self.layer3 = trunk.layer3

        self.up1 = Up(64+256, 256, scale_factor=4)
        self.up2 = nn.Sequential(
            nn.Upsample(scale_factor=2, mode='bilinear',
                        align_corners=True),
            nn.Conv2d(256, 128, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, outC, kernel_size=1, padding=0),
        )
3.2 LSS forward pass
The LSS forward pass takes the following inputs:
- imgs: the surround-view camera images, imgs = (bs, N, 3, H, W), where N is the number of cameras;
- rots: camera-to-ego rotation matrices, rots = (bs, N, 3, 3);
- trans: camera-to-ego translation vectors, trans = (bs, N, 3);
- intrins: camera intrinsics, intrins = (bs, N, 3, 3);
- post_rots: rotations introduced by the image augmentation, post_rots = (bs, N, 3, 3);
- post_trans: translations introduced by the image augmentation, post_trans = (bs, N, 3).
def forward(self, x, rots, trans, intrins, post_rots, post_trans):
    # x:          [4, 6, 3, 128, 352]
    # rots:       [4, 6, 3, 3]
    # trans:      [4, 6, 3]
    # intrins:    [4, 6, 3, 3]
    # post_rots:  [4, 6, 3, 3]
    # post_trans: [4, 6, 3]
    # lift the images into BEV, x: B x C x 200 x 200 (B x 64 x 200 x 200)
    x = self.get_voxels(x, rots, trans, intrins, post_rots, post_trans)
    # encode the BEV features with the resnet18-based head, x: B x 1 x 200 x 200
    x = self.bevencode(x)
    print("pred: x[0,0,1,1:10]", x[0, 0, 1, 1:10])  # debug print of a few predicted values
    return x
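get_voxels itself is not quoted in this walkthrough; it simply chains the three steps covered in the next subsections. A minimal sketch consistent with the call above and the methods below:

def get_voxels(self, x, rots, trans, intrins, post_rots, post_trans):
    # 1) ego-frame coordinates of every frustum point: B x N x D x fH x fW x 3
    geom = self.get_geometry(rots, trans, intrins, post_rots, post_trans)
    # 2) depth-weighted per-camera features: B x N x D x fH x fW x C
    x = self.get_cam_feats(x)
    # 3) splat the point features into the BEV grid: B x C x 200 x 200
    x = self.voxel_pooling(geom, x)
    return x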
3.2.1 get_geometry (coordinate transformation)
get_geometry transforms the frustum point cloud from image coordinates into the ego (vehicle) frame. After undoing the augmentation, this is a rigid-body transform determined by the camera intrinsics and the camera-to-ego extrinsics; an equation-form summary follows the code.
def get_geometry(self, rots, trans, intrins, post_rots, post_trans):
    """
    Determine the (x,y,z) locations (in the ego frame) of the points in the point cloud.
    Returns B x N x D x H/downsample x W/downsample x 3
    """
    B, N, _ = trans.shape  # B: batch size, N: number of cameras

    # undo post-transformation (the augmentation applied to the pixels)
    # B x N x D x H x W x 3
    points = self.frustum - post_trans.view(B, N, 1, 1, 1, 3)
    points = torch.inverse(post_rots).view(B, N, 1, 1, 1, 3, 3).matmul(points.unsqueeze(-1))

    # cam_to_ego
    # image coords -> normalized camera coords -> camera coords -> ego coords
    # turn the pixel coordinates (u, v, d) into homogeneous form (du, dv, d);
    # because the projection is linear, this un-normalization can be done in
    # image coordinates first, with the inverted intrinsics applied afterwards
    points = torch.cat((points[:, :, :, :, :, :2] * points[:, :, :, :, :, 2:3],
                        points[:, :, :, :, :, 2:3]
                        ), 5)
    # d[u,v,1]^T = intrins * rots^(-1) * ([x,y,z]^T - trans)
    combine = rots.matmul(torch.inverse(intrins))
    points = combine.view(B, N, 1, 1, 1, 3, 3).matmul(points).squeeze(-1)
    # map d[u,v,1]^T into the ego-frame [x,y,z]^T
    points += trans.view(B, N, 1, 1, 1, 3)

    # (bs, N, D, H, W, 3): for every image feature location of every camera in
    # every batch element, the ego-frame coordinate at each candidate depth
    # B x N x D x H x W x 3 (4 x 6 x 41 x 8 x 22 x 3)
    return points
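In equation form, with $K$ the intrinsics and $(R, t)$ the camera-to-ego extrinsics, the code inverts the pinhole projection for every candidate depth $d$:

$$
d \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
= K\,R^{-1}\!\left(\begin{bmatrix} x \\ y \\ z \end{bmatrix} - t\right)
\quad\Longrightarrow\quad
\begin{bmatrix} x \\ y \\ z \end{bmatrix}
= R\,K^{-1} \begin{bmatrix} du \\ dv \\ d \end{bmatrix} + t
$$

The right-hand side is exactly what combine = rots.matmul(torch.inverse(intrins)) followed by the addition of trans computes.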
3.2.2 get_cam_feats (image feature extraction)
get_cam_feats runs the image encoder and returns a $B \times N \times D \times fH \times fW \times C$ tensor.
def get_cam_feats(self, x):
    """
    Return B x N x D x H/downsample x W/downsample x C
    """
    B, N, C, imH, imW = x.shape  # B: 4, N: 6, C: 3, imH: 128, imW: 352
    x = x.view(B*N, C, imH, imW)  # fold B and N together, x: 24 x 3 x 128 x 352
    x = self.camencode(x)  # encode the images, x: B*N x C x D x fH x fW (24 x 64 x 41 x 8 x 22)
    # unfold the first dimension, x: B x N x C x D x fH x fW (4 x 6 x 64 x 41 x 8 x 22)
    x = x.view(B, N, self.camC, self.D, imH//self.downsample, imW//self.downsample)
    x = x.permute(0, 1, 3, 4, 5, 2)  # x: B x N x D x fH x fW x C (4 x 6 x 41 x 8 x 22 x 64)
    return x
CamEncode's forward pass and its helpers are:
def forward(self, x):
    # depth: B*N x D x fH x fW (24 x 41 x 8 x 22)
    # x:     B*N x C x D x fH x fW (24 x 64 x 41 x 8 x 22)
    depth, x = self.get_depth_feat(x)
    return x

def get_depth_feat(self, x):
    # extract features with efficientnet, x: 24 x 512 x 8 x 22
    x = self.get_eff_depth(x)
    # Depth
    # 1x1 conv to D + C channels, x: 24 x 105 x 8 x 22
    x = self.depthnet(x)
    # softmax over the first D channels, depth: 24 x 41 x 8 x 22
    depth = self.get_depth_dist(x[:, :self.D])
    # outer product of the depth distribution and the semantic features
    # builds the image feature point cloud, new_x: 24 x 64 x 41 x 8 x 22
    # depth.unsqueeze(1):                          (24, 1, 41, 8, 22)
    # x[:, self.D:(self.D + self.C)].unsqueeze(2): (24, 64, 1, 8, 22)
    new_x = depth.unsqueeze(1) * x[:, self.D:(self.D + self.C)].unsqueeze(2)
    return depth, new_x

def get_depth_dist(self, x, eps=1e-20):
    # softmax over the depth dimension: per-pixel probability of each depth
    return x.softmax(dim=1)
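The outer product in get_depth_feat is the heart of the "lift" step: each of the C semantic channels is scaled by the probability of each of the D depths, spreading every image feature along its ray. A standalone shape check with dummy tensors (assuming B*N = 24, D = 41, C = 64):

import torch

depth = torch.rand(24, 41, 8, 22).softmax(dim=1)  # per-pixel depth distribution
feats = torch.rand(24, 64, 8, 22)                 # per-pixel semantic features

# broadcasted outer product over the depth and channel axes
new_x = depth.unsqueeze(1) * feats.unsqueeze(2)
print(new_x.shape)  # torch.Size([24, 64, 41, 8, 22])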
def get_eff_depth(self, x):  # extract features with efficientnet
    # adapted from https://github.com/lukemelas/EfficientNet-PyTorch/blob/master/efficientnet_pytorch/model.py#L231
    endpoints = dict()

    # Stem
    x = self.trunk._swish(self.trunk._bn0(self.trunk._conv_stem(x)))
    prev_x = x

    # Blocks
    for idx, block in enumerate(self.trunk._blocks):
        drop_connect_rate = self.trunk._global_params.drop_connect_rate
        if drop_connect_rate:
            drop_connect_rate *= float(idx) / len(self.trunk._blocks)  # scale drop connect rate
        x = block(x, drop_connect_rate=drop_connect_rate)
        if prev_x.size(2) > x.size(2):
            endpoints['reduction_{}'.format(len(endpoints)+1)] = prev_x
        prev_x = x

    # Head
    endpoints['reduction_{}'.format(len(endpoints)+1)] = x
    # upsample the reduction_5 feature and fuse it with the reduction_4 feature
    x = self.up1(endpoints['reduction_5'], endpoints['reduction_4'])
    return x
3.2.3 voxel_pooling (building the BEV feature)
voxel_pooling splats the image feature point cloud into the BEV grid. It relies on the QuickCumsum trick, for which the authors give a pseudocode explanation at https://github.com/nv-tlabs/lift-splat-shoot/issues/14 (a sketch of the equivalent cumsum_trick follows the code). The result is a $B \times 64 \times 200 \times 200$ feature map.
def voxel_pooling(self, geom_feats, x):
    # geom_feats: (B x N x D x fH x fW x 3), ego-frame coordinates of the points
    # x:          (B x N x D x fH x fW x C), image point-cloud features
    # B: 4, N: 6, D: 41, H: 8, W: 22, C: 64
    B, N, D, H, W, C = x.shape
    # total number of points in the feature point cloud: B*N*D*H*W
    Nprime = B*N*D*H*W  # Nprime: 173184

    # flatten x: one feature vector per point, (Nprime, C)
    x = x.reshape(Nprime, C)

    # flatten indices
    # shift [-50, 50] / [-10, 10] to [0, 100] / [0, 20]:
    # ego-frame coordinates -> integer voxel coordinates (divide by cell size and round)
    geom_feats = ((geom_feats - (self.bx - self.dx/2.)) / self.dx).long()
    # flatten the voxel coordinates as well, geom_feats: (B*N*D*H*W, 3)
    geom_feats = geom_feats.view(Nprime, 3)
    # which batch element each point belongs to: (Nprime, 1)
    batch_ix = torch.cat([torch.full([Nprime//B, 1], ix,
                                     device=x.device, dtype=torch.long) for ix in range(B)])
    # geom_feats: B*N*D*H*W x 4 (173184 x 4); geom_feats[:, 3] is the batch index
    geom_feats = torch.cat((geom_feats, batch_ix), 1)

    # filter out points that are outside box
    # keep voxels with x: 0..199, y: 0..199, z: 0
    kept = (geom_feats[:, 0] >= 0) & (geom_feats[:, 0] < self.nx[0])\
        & (geom_feats[:, 1] >= 0) & (geom_feats[:, 1] < self.nx[1])\
        & (geom_feats[:, 2] >= 0) & (geom_feats[:, 2] < self.nx[2])
    x = x[kept]  # x: 168648 x 64
    geom_feats = geom_feats[kept]

    # a unique scalar rank per (voxel, batch) pair, so that equal ranks
    # identify points falling into the same voxel of the same sample
    ranks = geom_feats[:, 0] * (self.nx[1] * self.nx[2] * B)\
        + geom_feats[:, 1] * (self.nx[2] * B)\
        + geom_feats[:, 2] * B\
        + geom_feats[:, 3]
    # sort by rank so that points sharing a voxel become contiguous
    sorts = ranks.argsort()
    x, geom_feats, ranks = x[sorts], geom_feats[sorts], ranks[sorts]

    # cumsum trick: sum-pool all points that share a voxel
    if not self.use_quickcumsum:
        x, geom_feats = cumsum_trick(x, geom_feats, ranks)
    else:
        # one entry left per occupied voxel, x: 29072 x 64, geom_feats: 29072 x 4
        x, geom_feats = QuickCumsum.apply(x, geom_feats, ranks)

    # griddify (B x C x Z x X x Y)
    # final: B x 64 x 1 x 200 x 200
    final = torch.zeros((B, C, self.nx[2], self.nx[0], self.nx[1]), device=x.device)
    # scatter the pooled features into the voxel grid
    final[geom_feats[:, 3], :, geom_feats[:, 2], geom_feats[:, 0], geom_feats[:, 1]] = x

    # collapse the Z dimension
    final = torch.cat(final.unbind(dim=2), 1)
    # final: B x 64 x 200 x 200
    return final
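cumsum_trick (imported from tools.py) is the autograd-friendly fallback for QuickCumsum: after sorting by rank, a cumulative sum followed by a difference at rank boundaries sum-pools all points that share a voxel. A sketch consistent with the tools.py implementation:

import torch

def cumsum_trick(x, geom_feats, ranks):
    # x is sorted so that points with equal rank (same voxel) are contiguous
    x = x.cumsum(0)
    # keep only the last point of each run of equal ranks
    kept = torch.ones(x.shape[0], device=x.device, dtype=torch.bool)
    kept[:-1] = (ranks[1:] != ranks[:-1])
    x, geom_feats = x[kept], geom_feats[kept]
    # difference of consecutive kept cumsums = per-voxel sum of features
    x = torch.cat((x[:1], x[1:] - x[:-1]))
    return x, geom_feats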
3.2.4 BevEncode forward pass
The BEV encoder finally produces a $4 \times 1 \times 200 \times 200$ output.
def forward(self, x):
    x = self.conv1(x)
    x = self.bn1(x)
    x = self.relu(x)

    x1 = self.layer1(x)  # x1: 4 x 64 x 100 x 100
    x = self.layer2(x1)  # x:  4 x 128 x 50 x 50
    x = self.layer3(x)   # x:  4 x 256 x 25 x 25

    # upsample x by 4 and concat with x1, x: 4 x 256 x 100 x 100
    x = self.up1(x, x1)
    # 2x upsample -> 3x3 conv -> 1x1 conv, x: 4 x 1 x 200 x 200
    x = self.up2(x)

    return x
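To tie the shapes together, a hypothetical end-to-end smoke test with random inputs (identity intrinsics and extrinsics are not physically meaningful, and pretrained EfficientNet weights are downloaded on first use, but this exercises every shape in the pipeline):

import torch
from src.models import compile_model

grid_conf = {'xbound': [-50.0, 50.0, 0.5], 'ybound': [-50.0, 50.0, 0.5],
             'zbound': [-10.0, 10.0, 20.0], 'dbound': [4.0, 45.0, 1.0]}
data_aug_conf = {'final_dim': (128, 352)}  # only final_dim is read by the model itself
model = compile_model(grid_conf, data_aug_conf, outC=1).eval()

B, N = 4, 6
imgs = torch.rand(B, N, 3, 128, 352)
eye = torch.eye(3).repeat(B, N, 1, 1)
zeros = torch.zeros(B, N, 3)

with torch.no_grad():
    out = model(imgs, eye, zeros, eye, eye, zeros)
print(out.shape)  # torch.Size([4, 1, 200, 200])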
Further reading: https://blog.csdn.net/weixin_45993900/article/details/128887387 and https://mp.weixin.qq.com/s/pb7hEuGYaCsDNo6l4oKDSg.