Lift-Splat-Shoot cleverly uses an attention-like weighting over depth bins to learn depth end to end, but without any explicit depth supervision.
Current BEV perception methods fall roughly into two categories: Transformer-based architectures that perform the view transform with implicit depth information, and methods that project features into BEV via explicit depth estimation. The latter is the subject of this article, LSS (Lift, Splat, Shoot).
1 Abstract
The goal of perception for autonomous vehicles is to extract semantic representations from multiple sensors and fuse these representations into a single bird's-eye-view coordinate frame for consumption by motion planning.
We propose an architecture that supports an arbitrary number of camera inputs.
The method lifts each image individually into a frustum of features, then splats all frustums onto a rasterized BEV grid.
We provide evidence that LSS learns how to represent images and fuse predictions from all cameras into a single cohesive representation of the scene while being robust to calibration error.
In Section 3, we explain how our model “lifts” images into 3D by generating a frustum-shaped point cloud of contextual features and “splats” all frustums onto a reference plane, as is convenient for the downstream task of motion planning.
We present empirical evidence in Sec 5 that our model learns an effective mechanism for fusing information from a distribution of possible inputs.
2 Related Work
2.1 Monocular Object Detection
1. Use a mature 2D detector to regress a bounding box, and a second network to regress the 2D box to a 3D box.
SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again.
Fast single shot detection and pose estimation.
MonoGRNet: A geometric reasoning network for monocular 3D object localization.
Disentangling monocular 3D object detection.
《Disentangling Monocular 3D Object Detection》 trains a standard 2D detector to also predict depth, and seeks to decouple the depth and size of the 3D bounding box, so that the effect of incorrect depth can be disentangled from the result.
The detector thus factors out the fundamental cloud of ambiguity that shrouds monocular depth prediction.
《3D Bounding Box Estimation Using Deep Learning and Geometry》
1. First regress the orientation (θ, φ, α).
2. Then regress the box dimensions, whose variance is usually smaller than that of the translation.
3. Finally recover the translation [dx, dy, dz] by geometric post-processing.
2. An approach with recent empirical success is to use a depth prediction network followed by a BEV detection network.
3. Another line of work generates a set of class-specific 3D object proposals and, exploiting the mapping between 3D and 2D bounding boxes, describes the 3D boxes with features from 2D space.
Monocular 3D object detection for autonomous driving.
2.2 Inference in the Bird’s-Eye-View Frame
In concurrent work, Pyramid Occupancy Networks [28] proposes a transformer architecture that converts image representations into bird's-eye-view representations.
3 Method
- images: X_k
- extrinsic matrices: E_k
- intrinsic matrices: I_k
- rasterized representation: y ∈ R^{C×X×Y}
- discrete depths: D = {d0 + ∆, ..., d0 + |D|∆}
The extrinsic and intrinsic matrices together define the mapping from reference coordinates (x, y, z) to local pixel coordinates {(h, w, d) ∈ R^3 | d ∈ D}.
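As a rough illustration (not the authors' code; the variable names and sizes below are my own assumptions), the discrete depths and the per-camera set of (h, w, d) pixel-depth coordinates can be materialized as a fixed frustum grid:

```python
import torch

# Assumed sizes for this sketch: feature map H x W, |D| depth bins
H, W = 8, 22
d0, delta, num_d = 3.0, 1.0, 41

# Discrete depths D = {d0 + ∆, ..., d0 + |D|∆}; with these assumed values
# the bins are 4.0, 5.0, ..., 44.0 meters
depths = d0 + delta * torch.arange(1, num_d + 1)                  # (|D|,)

# Fixed frustum of (h, w, d) triples shared by every camera: (|D|, H, W, 3)
hs = torch.arange(H, dtype=torch.float32)
ws = torch.arange(W, dtype=torch.float32)
d_grid, h_grid, w_grid = torch.meshgrid(depths, hs, ws, indexing="ij")
frustum = torch.stack((h_grid, w_grid, d_grid), dim=-1)
```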
The BEV representation produced by the steps in Sections 3.1 and 3.2 is then used to train an end-to-end cost map.
3.1 Lift: Latent Depth Distribution
1. Depth inference runs on each image in isolation.
2. To generate representations at all possible depths, the network predicts a distribution α over depth for every pixel.
3. At pixel p, the network predicts a context c ∈ R^C.
4. To create a frustum feature of shape (H, W, D, C), the context is scaled by the depth probability of each bin: c_d = α_d ∗ c.
The most critical part of this paper is the Lift step, so let us briefly review it:
The whole Lift process consists of three parts. The figure in the CaDDN paper makes this process easy to understand; I sketched a rough version of it.
1. Feature extraction & depth estimation
After the multi-view camera images enter the backbone, a depth estimation network predicts a depth feature in parallel. Note that the spatial size of the depth feature matches that of the image feature, because the two are later combined by an outer product.
It is worth noting that depth can be predicted in two forms, an absolute value or a relative one; the relative form is the depth distribution used here.
2. Outer product
This step is the heart of LSS: the depth distribution (H, W, D) and the image feature (H, W, C) are combined into a frustum feature (H, W, D, C), as sketched below.
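A minimal sketch of this outer product (my own illustration with assumed tensor shapes, not the official implementation):

```python
import torch

B, C, D, H, W = 2, 64, 41, 8, 22                 # assumed sizes

depth_logits = torch.randn(B, D, H, W)           # output of the depth head
context = torch.randn(B, C, H, W)                # per-pixel context feature c

# α: categorical distribution over the D depth bins for every pixel
alpha = depth_logits.softmax(dim=1)              # (B, D, H, W)

# Outer product c_d = α_d ∗ c, realized by broadcasting over C and D
frustum_feat = alpha.unsqueeze(1) * context.unsqueeze(2)   # (B, C, D, H, W)
```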
3. Grid Sampling
The purpose of this step is to transform the frustum feature constructed above into the BEV view using the camera extrinsics and intrinsics. Concretely, a BEV range is fixed and divided into grid cells; all features that project into a given cell are gathered into that cell, after which the “Splat” operation is performed. This step sounds unremarkable, but the concrete implementation contains many tricks worth learning; if you are interested, see the code walk-through linked above.
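To make the geometry concrete, here is a rough sketch, under assumed calibration variables (`intrinsic`, `cam_to_ego`) and my own function name, of how each frustum point (h, w, d) could be unprojected into the ego frame and assigned to a BEV grid cell; it is not the repository's actual code:

```python
import torch

def frustum_to_bev_indices(frustum, intrinsic, cam_to_ego,
                           bev_min=-50.0, cell=0.5, n_cells=200):
    """frustum: (..., 3) tensor of (h, w, d) = (pixel row, pixel column, depth)."""
    h, w, d = frustum.unbind(-1)
    # Unproject: homogeneous pixel (u, v, 1) scaled by depth, then K^{-1}
    pix = torch.stack((w * d, h * d, d), dim=-1)             # (u·d, v·d, d)
    pts_cam = pix @ intrinsic.inverse().T                    # camera frame
    # Camera frame -> ego frame using the extrinsic (rotation R, translation t)
    R, t = cam_to_ego[:3, :3], cam_to_ego[:3, 3]
    pts_ego = pts_cam @ R.T + t
    # Quantize ego x, y into BEV cells of size `cell` meters
    ix = ((pts_ego[..., 0] - bev_min) / cell).long().clamp(0, n_cells - 1)
    iy = ((pts_ego[..., 1] - bev_min) / cell).long().clamp(0, n_cells - 1)
    return ix, iy          # pillar index of every frustum point

# Dummy usage with identity calibration (real values come from the dataset)
K = torch.eye(3)
cam_to_ego = torch.eye(4)
frustum = torch.rand(41, 8, 22, 3) * torch.tensor([128.0, 352.0, 45.0])
ix, iy = frustum_to_bev_indices(frustum, K, cam_to_ego)
```

Summing the context features of all points that share the same (ix, iy) pair is exactly the pillar sum pooling of Section 3.2.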
3.2 Splat: Pillar Pooling
We follow the PointPillars [18] architecture to convert the large point cloud output by the “lift” step.
We assign every point to its nearest pillar and perform sum pooling to create a C × H × W tensor that can be processed by a standard CNN for bird’s-eye-view inference.
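A minimal sketch of that sum pooling using scatter-add (names and sizes are assumptions for illustration):

```python
import torch

C, n_cells, n_points = 64, 200, 10000            # assumed sizes
feats = torch.randn(n_points, C)                 # context feature of each frustum point
ix = torch.randint(0, n_cells, (n_points,))      # BEV x index of each point
iy = torch.randint(0, n_cells, (n_points,))      # BEV y index of each point

# Sum-pool every point into its pillar, then reshape to a C × H × W tensor
bev = torch.zeros(n_cells * n_cells, C)
bev.index_add_(0, ix * n_cells + iy, feats)
bev = bev.view(n_cells, n_cells, C).permute(2, 0, 1)       # (C, H, W)
```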
3.3 Shoot: Motion Planning
Planning using the cost map can be achieved by “shooting” different trajectories, scoring their cost, and then acting according to the lowest-cost trajectory.
We build on the Neural Motion Planner (NMP), but instead of the hard-margin loss proposed in NMP, we frame planning as classification over a set of K template trajectories.
We frame “planning” as predicting a distribution over K template trajectories for the ego vehicle, conditioned on sensor observations: p(τ | o).
To leverage the cost-volume nature of the planning problem, we enforce the distribution over the K template trajectories to take the form

p(τ_i | o) = exp(−Σ_{(x,y)∈τ_i} c_o(x, y)) / Σ_j exp(−Σ_{(x,y)∈τ_j} c_o(x, y))

where c_o(x, y) is defined by indexing into the cost map predicted from observations o at location (x, y); the model can therefore be trained end-to-end from data by optimizing the log probability of expert trajectories.
We visualize the 1K trajectory templates that we “shoot” onto our cost map during training and testing.
During training, the cost of each template trajectory is computed and interpreted as a 1K-dimensional Boltzmann distribution over the templates.
During testing, we choose the argmax of this distribution and act according to the chosen template.
For labels, given a ground-truth trajectory, we compute its nearest neighbor in L2 distance among the template trajectories T, then train with the cross-entropy loss. This definition of p(τ_i|o) enables us to learn an interpretable spatial cost function without defining a hard-margin loss as in NMP [41].
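A rough sketch of this training formulation with assumed shapes (not the released code): each template is scored by summing cost-map values along its cells, the negated costs act as logits of a Boltzmann distribution, and cross entropy is taken against the nearest-neighbor template index.

```python
import torch
import torch.nn.functional as F

K, L = 1000, 20                                          # templates, points per template
cost_map = torch.randn(200, 200, requires_grad=True)     # c_o(x, y) from the BEV head
templates_xy = torch.randint(0, 200, (K, L, 2))          # grid cells visited by each τ_i

# Cost of each template: sum of cost-map values along its trajectory
costs = cost_map[templates_xy[..., 0], templates_xy[..., 1]].sum(dim=1)   # (K,)

# p(τ_i | o) ∝ exp(-cost_i): Boltzmann distribution over the K templates
logits = -costs

# Label: index of the template nearest (in L2) to the expert trajectory (assumed here)
label = torch.tensor(123)
loss = F.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))
loss.backward()                                          # trains the cost map end-to-end
```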
4 Implementation
For our bird's-eye-view network:
- We pass through the first 3 meta-layers of ResNet-18 to get 3 bird's-eye-view representations at different resolutions x1, x2, x3. We then upsample x3 by a scale factor of 4, concatenate it with x1, apply a ResNet block, and finally upsample by 2 to return to the resolution of the original input bird's-eye-view pseudo image.
- The input images are resized and cropped to H × W = 128 × 352, and the extrinsics and intrinsics are adjusted accordingly.
- The bird's-eye-view grid has resolution X × Y = 200 × 200, with cells of size 0.5 meters × 0.5 meters.
- Depth is restricted to between 4.0 meters and 45.0 meters, discretized with 1.0 meter spacing.
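Collected as a small configuration sketch (the dictionary keys are mine; the values are the ones stated above):

```python
import torch

cfg = {
    "input_size": (128, 352),         # resized / cropped image H × W
    "bev_cells": (200, 200),          # X × Y grid
    "bev_resolution": 0.5,            # meters per cell
    "depth_bins": (4.0, 45.0, 1.0),   # min, max, step in meters
}

# Discrete depth bins used by the lift step: 4.0, 5.0, ..., 44.0 m
d_min, d_max, d_step = cfg["depth_bins"]
depths = torch.arange(d_min, d_max, d_step)
print(len(depths))                    # 41 bins -> D = 41
```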
Frustum Pooling Cumulative Sum Trick
We choose sum pooling across pillars as opposed to max pooling, because our “cumulative sum trick” saves us from excessive memory usage due to padding.
The “cumulative sum trick” is:
- the observation that sum pooling can be performed by sorting all points according to bin id,
- performing a cumulative sum over all features,
- then subtracting the cumulative sum values at the boundaries of the bin sections.
Instead of relying on autograd to backprop through all three steps, the analytic gradient for the module as a whole can be derived, speeding up training by 2x.
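A minimal sketch of the cumulative sum trick itself (assumed names; the released code additionally wraps this in a custom autograd Function so that the analytic gradient replaces backprop through the three steps):

```python
import torch

# Features of all frustum points and the id of the pillar each falls into
feats = torch.randn(10000, 64)
bin_ids = torch.randint(0, 200 * 200, (10000,))

# 1. Sort all points by bin id so points of the same pillar are contiguous
order = bin_ids.argsort()
feats, bin_ids = feats[order], bin_ids[order]

# 2. Cumulative sum over all features
csum = feats.cumsum(dim=0)

# 3. Keep the last point of each bin and subtract at the bin boundaries,
#    which leaves the per-pillar sums
keep = torch.ones_like(bin_ids, dtype=torch.bool)
keep[:-1] = bin_ids[1:] != bin_ids[:-1]           # last occurrence of each id
pooled = csum[keep]
pooled = torch.cat((pooled[:1], pooled[1:] - pooled[:-1]))   # per-pillar sums
```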
We call the layer “Frustum Pooling” because it handles converting the frustums produced by n images into a fixed-dimensional C × H × W tensor independent of the number of cameras n.
Summary and Outlook
LSS has stood the test of time since it was proposed, and a large number of researchers have built on it, producing all sorts of fancy SOTA models. In summary, the following points are worth noting:
Advantages:
1. LSS provides an elegant way to fuse features into the BEV view. On top of it, dynamic object detection, static road-structure understanding, even traffic-light detection and lead-vehicle turn-signal detection can all be lifted into BEV features and output there, which greatly improves the integration of an autonomous-driving perception stack.
2. Although LSS was originally proposed to fuse multi-view camera features in the service of vision-only models, in practice the scheme is fully compatible with fusing features from other sensors. If you want to fuse ultrasonic radar features, it is worth a try.
Disadvantages:
1. It depends heavily on the accuracy of the depth information, and the depth features must be provided explicitly. This is, of course, the weak point of most vision-only methods. If the depth network is optimized only through gradients back-propagated through this pipeline, and the depth network is fairly complex, the long back-propagation chain tends to make the optimization direction of depth fuzzy and good results are hard to obtain. A good remedy is to pre-train reasonably good depth weights first, so that the LSS pipeline starts from a decent depth output.
2. The outer product is time-consuming. For machine learning in general this amount of computation is negligible, but for a model deployed in the car, when the image feature size is large and the desired depth range and resolution are high, the outer product adds a great deal of computation. This is very unfriendly to lightweight deployment, and on this point the Transformer-based methods actually fare somewhat better.
reference
Philion, J., Fidler, S. Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. ECCV 2020.