Monocular Image Depth Estimation: Monodepth2

Depth Estimation Method

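Monodepth2 estimates depth from monocular images through unsupervised learning. Following the structure-from-motion (SfM) principle, it trains two convolutional networks jointly: a depth network and a pose network. The training input is a sequence of consecutive frames from a video. The depth network takes the target view and outputs the corresponding depth map. The pose network takes the target view together with the previous frame and estimates the change in camera pose between them. The outputs of the two networks are combined to build a reprojected image, and the reprojection error enters the loss function, which is backpropagated to update the parameters of both networks.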
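
To make the reprojection step concrete, here is a minimal single-pixel sketch. It assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)` and a 4x4 target-to-source transform; the function name `reproject_pixel` and all numeric values are hypothetical and chosen only for illustration. The real model performs this for every pixel at once and then samples the source image at the resulting coordinates.

```python
def reproject_pixel(u, v, depth, K, T):
    """Map pixel (u, v) of the target view into the source view.

    K is (fx, fy, cx, cy); T is a 4x4 target-to-source transform given as
    nested lists. Both are illustrative inputs, not values from the article.
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in the target camera frame,
    # using the depth predicted by the depth network.
    X = (u - cx) / fx * depth
    Y = (v - cy) / fy * depth
    Z = depth
    # Move the point into the source camera frame with the relative pose
    # predicted by the pose network.
    p = (X, Y, Z, 1.0)
    Xs, Ys, Zs = (sum(T[r][c] * p[c] for c in range(4)) for r in range(3))
    # Project the point onto the source image plane.
    return fx * Xs / Zs + cx, fy * Ys / Zs + cy


K = (500.0, 500.0, 320.0, 96.0)
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
shift_x = [row[:] for row in identity]
shift_x[0][3] = 0.5  # camera-frame translation of 0.5 units along x

same = reproject_pixel(100.0, 50.0, 10.0, K, identity)
moved = reproject_pixel(100.0, 50.0, 10.0, K, shift_x)
print(same, moved)
```

With the identity transform the pixel maps to itself. With the x translation it shifts horizontally by `fx * 0.5 / depth`, so nearer points move further, which is the signal the depth network learns from.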

Network Architecture

Depth Network

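The depth estimation network is based on the U-Net architecture, which was designed for precise image segmentation. Its contracting path and expanding path are symmetric. In the depth network these correspond to downsampling and upsampling. Downsampling shrinks the image into progressively smaller feature maps that capture the context of the scene. Upsampling enlarges them again and combines them with the features from the matching downsampling level to restore detail, which improves the accuracy of the output. The overall flow consists of an encoding stage and a decoding stage, as shown in the figure below, where the left side is the encoder and the right side is the decoder. During decoding, encoder features of the same size are fused with the decoder features at each level of the upsampling path.
[Figure: depth network flow, encoder on the left and decoder on the right]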
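The encoder extracts features. The depth network uses the residual network ResNet18 as its encoder. Residual networks add skip connections that carry the input of a block forward to its output, which mitigates the vanishing-gradient problem in deep networks. Deeper residual networks generally perform better and can lower both training and test error, but more layers also mean slower training, so ResNet18 is faster than ResNet50.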
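The encoder takes a color RGB image captured by a monocular camera, and its structure is shown in the figure below. The image first passes through a convolution layer and a batch normalization (BN) layer. BN normalizes its input, which helps prevent vanishing or exploding gradients and speeds up training. Next come a ReLU activation and a max-pooling layer, which compresses the extracted features and reduces the computational cost. The features then enter Layer1, which consists of two residual blocks. With the torchvision ResNet18 used in the code below, these blocks use ReLU activations (in the Monodepth2 repository, ELU is used in the decoder's convolution blocks instead). Layer2, Layer3 and Layer4 have the same structure as Layer1. Between these layers, downsampling is done by a stride-2 convolution rather than a pooling step, so the feature map is halved in each spatial dimension at Layer2, Layer3 and Layer4.
[Figure: encoder network structure]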

from __future__ import absolute_import, division, print_function

import numpy as np

import torch
import torch.nn as nn
import torchvision.models as models
import torch.utils.model_zoo as model_zoo


class ResNetMultiImageInput(models.ResNet):
    """Constructs a resnet model with varying number of input images.
    Adapted from https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py
    """
    def __init__(self, block, layers, num_classes=1000, num_input_images=1):
        super(ResNetMultiImageInput, self).__init__(block, layers)
        self.inplanes = 64
        self.conv1 = nn.Conv2d(
            num_input_images * 3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.BatchNorm2d):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)


def resnet_multiimage_input(num_layers, pretrained=False, num_input_images=1):
    """Constructs a ResNet model.
    Args:
        num_layers (int): Number of resnet layers. Must be 18 or 50
        pretrained (bool): If True, returns a model pre-trained on ImageNet
        num_input_images (int): Number of frames stacked as input
    """
    assert num_layers in [18, 50], "Can only run with 18 or 50 layer resnet"
    blocks = {18: [2, 2, 2, 2], 50: [3, 4, 6, 3]}[num_layers]
    block_type = {18: models.resnet.BasicBlock, 50: models.resnet.Bottleneck}[num_layers]
    model = ResNetMultiImageInput(block_type, blocks, num_input_images=num_input_images)

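    if pretrained:
        # Note: `models.resnet.model_urls` is only available in older torchvision releases.
        loaded = model_zoo.load_url(models.resnet.model_urls['resnet{}'.format(num_layers)])
        # Repeat the pretrained RGB kernel once per input image and rescale it,
        # so the first-layer response keeps the same magnitude as for one image.
        loaded['conv1.weight'] = torch.cat(
            [loaded['conv1.weight']] * num_input_images, 1) / num_input_images
        model.load_state_dict(loaded)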
    return model


class ResnetEncoder(nn.Module):
    """Pytorch module for a resnet encoder
    """
    def __init__(self, num_layers, pretrained, num_input_images=1):
        super(ResnetEncoder, self).__init__()

        self.num_ch_enc = np.array([64, 64, 128, 256, 512])

        resnets = {18: models.resnet18,
                   34: models.resnet34,
                   50: models.resnet50,
                   101: models.resnet101,
                   152: models.resnet152}

        if num_layers not in resnets:
            raise ValueError("{} is not a valid number of resnet layers".format(num_layers))

        if num_input_images > 1:
            self.encoder = resnet_multiimage_input(num_layers, pretrained, num_input_images)
        else:
            self.encoder = resnets[num_layers](pretrained)

        if num_layers > 34:
            self.num_ch_enc[1:] *= 4

    def forward(self, input_image):
        self.features = []
        # Normalize the input (assumed to lie in [0, 1]) with one mean/std for all channels
        x = (input_image - 0.45) / 0.225
        x = self.encoder.conv1(x)
        x = self.encoder.bn1(x)
        self.features.append(self.encoder.relu(x))
        self.features.append(self.encoder.layer1(self.encoder.maxpool(self.features[-1])))
        self.features.append(self.encoder.layer2(self.features[-1]))
        self.features.append(self.encoder.layer3(self.features[-1]))
        self.features.append(self.encoder.layer4(self.features[-1]))

        return self.features
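
The encoder above returns five feature maps. As a rough check of their sizes, the sketch below applies the standard convolution output-size formula to the layer parameters of ResNet18 (7x7 stride-2 stem, 3x3 stride-2 max pooling, and a stride-2 3x3 convolution at the start of Layer2 to Layer4). The helper names and the 192x640 input resolution are assumptions made for this example.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def encoder_feature_shapes(height, width, channels=(64, 64, 128, 256, 512)):
    """Return (channels, height, width) for the five ResNet18 encoder features."""
    shapes = []
    # conv1: 7x7 kernel, stride 2, padding 3 (followed by BN and ReLU)
    h, w = conv_out(height, 7, 2, 3), conv_out(width, 7, 2, 3)
    shapes.append((channels[0], h, w))
    # maxpool: 3x3 kernel, stride 2, padding 1; layer1 keeps this resolution
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    shapes.append((channels[1], h, w))
    # layer2..layer4 each begin with a 3x3 stride-2 convolution (padding 1)
    for c in channels[2:]:
        h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
        shapes.append((c, h, w))
    return shapes


shapes = encoder_feature_shapes(192, 640)
for shape in shapes:
    print(shape)
```

The spatial size is halved five times in total, so the deepest feature map is 1/32 of the input resolution in each dimension.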

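The decoder integrates and interprets the image features produced by the encoder. The depth network uses a decoder built from upsampling layers and convolution layers, with the structure shown in the figure below. In the code below the decoder has five upconv stages of the same form (indexed 4 down to 0). Each stage applies a convolution block to the output of the previous stage, upsamples the result, concatenates it with the encoder feature map of the same resolution (where one exists), and applies a second convolution block. The output of the last stage has the same size as the input image. The decoder uses reflection padding instead of zero padding: when the input has to be extended at the border, the padded values are taken from nearby pixels, which reduces blurring at the image boundary and keeps the feature maps sharper.
[Figure: decoder network structure]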
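
The difference between zero padding and reflection padding can be seen on a single row of pixels. This is a minimal pure-Python sketch; the helper names are hypothetical, and the reflection rule used here (mirror around the border pixel without repeating it) is the one applied by PyTorch's `nn.ReflectionPad2d`.

```python
def zero_pad_1d(row, pad):
    """Extend a row with zeros on both sides."""
    return [0] * pad + row + [0] * pad


def reflect_pad_1d(row, pad):
    """Extend a row by mirroring it around its first and last element."""
    assert 0 < pad < len(row)
    left = row[pad:0:-1]           # mirrored neighbours of the first pixel
    right = row[-2:-pad - 2:-1]    # mirrored neighbours of the last pixel
    return left + row + right


row = [1, 2, 3, 4]
print(zero_pad_1d(row, 1))     # border values drop to 0
print(reflect_pad_1d(row, 1))  # border values stay close to their neighbours
print(reflect_pad_1d(row, 2))
```

With zero padding a convolution at the border sees an artificial jump to 0, while reflection padding keeps the padded values in the range of the nearby pixels.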

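A sigmoid is applied at the output of each upconv stage that produces a disparity map. The output δ of the final sigmoid is converted to depth D with the formula below, where a and b are chosen so that D is constrained to lie between 0.1 and 100 units.
$D = \cfrac{1}{a\delta + b}$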
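
As a worked example of this conversion, the sketch below derives a and b from the depth limits: δ = 0 must give the maximum depth, so b = 1/100, and δ = 1 must give the minimum depth, so a + b = 1/0.1. The function name and argument names are chosen for this example.

```python
def disp_to_depth(delta, min_depth=0.1, max_depth=100.0):
    """Convert a sigmoid output delta in [0, 1] to depth D = 1 / (a * delta + b)."""
    min_disp = 1.0 / max_depth  # disparity of the farthest allowed point
    max_disp = 1.0 / min_depth  # disparity of the nearest allowed point
    a = max_disp - min_disp
    b = min_disp
    return 1.0 / (a * delta + b)


for delta in (0.0, 0.5, 1.0):
    print(delta, disp_to_depth(delta))
```

For the limits 0.1 and 100 this gives a = 9.99 and b = 0.01, so the sigmoid output is effectively a scaled inverse depth.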

from __future__ import absolute_import, division, print_function

import numpy as np
import torch
import torch.nn as nn

from collections import OrderedDict
# ConvBlock, Conv3x3 and upsample are defined in the Monodepth2 repository's layers.py
from layers import *


class DepthDecoder(nn.Module):
    def __init__(self, num_ch_enc, scales=range(4), num_output_channels=1, use_skips=True):
        super(DepthDecoder, self).__init__()

        self.num_output_channels = num_output_channels
        self.use_skips = use_skips
        self.upsample_mode = 'nearest'
        self.scales = scales

        self.num_ch_enc = num_ch_enc
        self.num_ch_dec = np.array([16, 32, 64, 128, 256])

        # decoder
        self.convs = OrderedDict()
        for i in range(4, -1, -1):
            # upconv_0
            num_ch_in = self.num_ch_enc[-1] if i == 4 else self.num_ch_dec[i + 1]
            num_ch_out = self.num_ch_dec[i]
            self.convs[("upconv", i, 0)] = ConvBlock(num_ch_in, num_ch_out)

            # upconv_1
            num_ch_in = self.num_ch_dec[i]
            if self.use_skips and i > 0:
                num_ch_in += self.num_ch_enc[i - 1]
            num_ch_out = self.num_ch_dec[i]
            self.convs[("upconv", i, 1)] = ConvBlock(num_ch_in, num_ch_out)

        for s in self.scales:
            self.convs[("dispconv", s)] = Conv3x3(self.num_ch_dec[s], self.num_output_channels)

        self.decoder = nn.ModuleList(list(self.convs.values()))
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_features):
        self.outputs = {}

        # decoder
        x = input_features[-1]
        for i in range(4, -1, -1):
            x = self.convs[("upconv", i, 0)](x)
            x = [upsample(x)]
            if self.use_skips and i > 0:
                x += [input_features[i - 1]]
            x = torch.cat(x, 1)
            x = self.convs[("upconv", i, 1)](x)
            if i in self.scales:
                self.outputs[("disp", i)] = self.sigmoid(self.convs[("dispconv", i)](x))

        return self.outputs
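
To see how the skip connections change the channel counts, the sketch below repeats the channel bookkeeping of the `DepthDecoder` constructor above without building any layers. The function name `decoder_channels` is hypothetical, and the encoder channels are those of the ResNet18 encoder.

```python
def decoder_channels(num_ch_enc, num_ch_dec=(16, 32, 64, 128, 256), use_skips=True):
    """Return {stage: ((in, out) of upconv_0, (in, out) of upconv_1)}."""
    plan = {}
    for i in range(4, -1, -1):
        # upconv_0 takes the deepest encoder feature or the previous stage's output
        in0 = num_ch_enc[-1] if i == 4 else num_ch_dec[i + 1]
        # upconv_1 takes the upsampled result plus the matching encoder feature
        in1 = num_ch_dec[i]
        if use_skips and i > 0:
            in1 += num_ch_enc[i - 1]
        plan[i] = ((in0, num_ch_dec[i]), (in1, num_ch_dec[i]))
    return plan


plan = decoder_channels([64, 64, 128, 256, 512])
for stage in sorted(plan, reverse=True):
    print(stage, plan[stage])
```

Stage 0 has no encoder feature at full resolution to concatenate, which is why its second convolution block sees only the 16 decoder channels.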

Pose Network

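Camera motion cannot be recovered from a single color image, so two consecutive frames of the monocular video are needed to estimate how the camera's orientation and position change relative to the scene. The pose network takes the previous frame and the current frame as a pair of color images, so it receives a six-channel input. Its overall flow is similar to that of the depth network, as shown in the figure below, with an encoding stage on the left and a decoding stage on the right. The network outputs an axis-angle rotation and a translation.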
[Figure: pose network flow, encoder on the left and decoder on the right]
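The encoder again uses ResNet18, and the pose encoder is initialized with pretrained weights. A pretrained model has been trained on a larger dataset and can be reused for similar problems. Initializing the parameters from it shortens training and may reduce the risk of underfitting and overfitting. The first convolution kernel of the pretrained model is expanded along its input-channel dimension so that the network accepts six channels, and the expanded weights are divided by 2 so that the output of the convolution stays in the same numerical range as when a single image is fed to the network. The encoder outputs image features.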
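
The effect of dividing the expanded weights by 2 can be checked with a simplified example: a single pixel and a 1x1 kernel, so the convolution reduces to a dot product. The helper names and the weight and pixel values are made up for illustration.

```python
def conv_response(weights, pixels):
    """Response of a 1x1 convolution at one pixel (a plain dot product)."""
    return sum(w * p for w, p in zip(weights, pixels))


def expand_weights(weights, num_input_images):
    """Repeat the RGB weights once per input image and rescale them."""
    return [w / num_input_images for w in weights] * num_input_images


rgb_weights = [0.2, -0.5, 0.3]  # illustrative pretrained weights for R, G, B
pixel = [0.8, 0.4, 0.6]         # illustrative RGB values of one pixel

single = conv_response(rgb_weights, pixel)
# Two stacked frames that happen to be identical form a six-channel input.
stacked = conv_response(expand_weights(rgb_weights, 2), pixel + pixel)
print(single, stacked)
```

When the two frames are identical, the six-channel response equals the three-channel one, so the pretrained layers that follow see activations in the range they were trained on.
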
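The decoder integrates the features extracted by the encoder. A 1×1 "squeeze" convolution first reduces the number of feature channels to 256. The squeezed features are concatenated along the channel dimension and passed through three further convolutions. The result is averaged over the spatial dimensions and scaled by 0.01, which yields an axis-angle vector and a translation vector. Together they describe the rotational and translational motion of the camera.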

from __future__ import absolute_import, division, print_function  # pose decoder

import torch
import torch.nn as nn
from collections import OrderedDict


class PoseDecoder(nn.Module):
    def __init__(self, num_ch_enc, num_input_features, num_frames_to_predict_for=None, stride=1):
        super(PoseDecoder, self).__init__()

        self.num_ch_enc = num_ch_enc
        self.num_input_features = num_input_features

        if num_frames_to_predict_for is None:
            num_frames_to_predict_for = num_input_features - 1
        self.num_frames_to_predict_for = num_frames_to_predict_for

        self.convs = OrderedDict()
        self.convs[("squeeze")] = nn.Conv2d(self.num_ch_enc[-1], 256, 1)
        self.convs[("pose", 0)] = nn.Conv2d(num_input_features * 256, 256, 3, stride, 1)
        self.convs[("pose", 1)] = nn.Conv2d(256, 256, 3, stride, 1)
        self.convs[("pose", 2)] = nn.Conv2d(256, 6 * num_frames_to_predict_for, 1)

        self.relu = nn.ReLU()

        self.net = nn.ModuleList(list(self.convs.values()))

    def forward(self, input_features):
        last_features = [f[-1] for f in input_features]

        cat_features = [self.relu(self.convs["squeeze"](f)) for f in last_features]
        cat_features = torch.cat(cat_features, 1)

        out = cat_features
        for i in range(3):
            out = self.convs[("pose", i)](out)
            if i != 2:
                out = self.relu(out)

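        # Average over the spatial dimensions (width, then height)
        out = out.mean(3).mean(2)

        # Scale the predictions down and reshape to (batch, frames, 1, 6)
        out = 0.01 * out.view(-1, self.num_frames_to_predict_for, 1, 6)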

        axisangle = out[..., :3]
        translation = out[..., 3:]

        return axisangle, translation
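
The axis-angle vector and the translation returned above still have to be turned into a rigid transform before they can be used for reprojection. The sketch below does this with the generic Rodrigues formula in pure Python. It is an illustration of the conversion under that assumption, not the repository's own helper, and the function names are hypothetical.

```python
import math


def rotation_from_axisangle(vec):
    """3x3 rotation matrix for an axis-angle vector (direction = axis, norm = angle)."""
    theta = math.sqrt(sum(v * v for v in vec))
    if theta < 1e-12:
        # No measurable rotation: return the identity.
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    x, y, z = (v / theta for v in vec)
    c, s = math.cos(theta), math.sin(theta)
    t = 1.0 - c
    return [
        [c + x * x * t, x * y * t - z * s, x * z * t + y * s],
        [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
        [z * x * t - y * s, z * y * t + x * s, c + z * z * t],
    ]


def pose_matrix(axisangle, translation):
    """4x4 rigid transform built from a rotation and a translation."""
    R = rotation_from_axisangle(axisangle)
    rows = [R[i] + [translation[i]] for i in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]


# A quarter turn about the z axis combined with a small forward motion.
T = pose_matrix([0.0, 0.0, math.pi / 2], [0.0, 0.0, 0.1])
for row in T:
    print([round(value, 3) for value in row])
```

A quarter turn about z maps the x axis onto the y axis, which is visible in the first column of the printed matrix.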

Building the Loss Function

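A loss function measures how good a model is by quantifying the difference between its predictions and a reference. During training, the data is fed through the network in a forward pass to obtain predictions, the loss is computed from them, and its gradient is backpropagated to adjust the model parameters so that the loss decreases and the predictions move closer to the reference. The loss in this model combines the reprojection error with a pixel-wise smoothness term. One of the most prominent contributions of Monodepth2 is to take the per-pixel minimum of the reprojection error over the source frames instead of the average. With averaging, a pixel that is occluded in one source frame still contributes a large error from that frame, which inflates the loss and blurs depth edges. The minimum reprojection loss matches each pixel only against a view in which it is visible, which gives sharper results and cleaner occlusion boundaries.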
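
The difference between averaging and taking the minimum can be shown with a few made-up numbers. In the sketch below each list holds the reprojection error of four pixels against one source frame; the third pixel is assumed to be occluded in the previous frame. The function name and all values are illustrative only.

```python
def combine_errors(errors_per_source, mode):
    """Combine per-pixel reprojection errors from several source frames."""
    combined = []
    for pixel_errors in zip(*errors_per_source):
        if mode == "min":
            combined.append(min(pixel_errors))
        else:
            combined.append(sum(pixel_errors) / len(pixel_errors))
    return combined


err_prev = [0.02, 0.03, 0.90, 0.02]  # third pixel occluded in the previous frame
err_next = [0.03, 0.02, 0.04, 0.03]  # same pixel visible in the next frame

averaged = combine_errors([err_prev, err_next], "mean")
minimum = combine_errors([err_prev, err_next], "min")
print(averaged)
print(minimum)
```

Averaging leaves a large error on the occluded pixel even though the depth may be correct, while the minimum keeps only the error from the frame where the pixel is visible.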

Note: I am new to deep learning, and monocular depth estimation is the first problem I have worked on in deep learning and computer vision, so my understanding and writing may contain mistakes. Corrections are welcome.
