Multi-View Video MAE; Inserting Anybody into Anywhere; High-Resolution Editable Video Toon Shading; Explicit Motion Modeling for Consistent and Controllable Video Generation


This article was first published on the WeChat official account: 机器感知 (Machine Perception).


Object-Driven One-Shot Fine-tuning of Text-to-Image Diffusion with Prototypical Embedding


As large-scale text-to-image generation models have made remarkable progress, many fine-tuning methods have been proposed. However, these models often struggle with novel objects, especially in one-shot scenarios. Our proposed method addresses the challenges of generalizability and fidelity in an object-driven way, using only a single input image and the object-specific regions of interest. To improve generalizability and mitigate overfitting, in our paradigm a prototypical embedding is initialized based on the object's appearance and its class before fine-tuning the diffusion model. During fine-tuning, we propose a class-characterizing regularization to preserve prior knowledge of object classes. To further improve fidelity, we introduce an object-specific loss, which can also be used to implant multiple objects. Overall, our proposed object-driven method for implanting new objects integrates seamlessly with existing concepts while achieving high fidelity and generalization.
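
To make the prototypical initialization and class-characterizing regularization concrete, here is a minimal PyTorch sketch (not the authors' code); the embedding dimension, the 50/50 mixing of appearance and class, and the cosine form of the regularizer are assumptions for illustration.

```python
# A minimal sketch (not the authors' code) of initializing a learnable token
# embedding from a prototype and regularizing it toward the class embedding.
# Dimensions and the regularizer form are placeholders, assumed for illustration.
import torch
import torch.nn.functional as F

dim = 768                                  # assumed text-embedding dimension
class_emb = torch.randn(dim)               # embedding of the class word, e.g. "dog"
object_feats = torch.randn(5, dim)         # visual features of the object region(s)

# Prototypical initialization: start the new token near the object's appearance
# and its class, instead of a random vector, to reduce one-shot overfitting.
proto = 0.5 * object_feats.mean(dim=0) + 0.5 * class_emb
new_token = torch.nn.Parameter(proto.clone())

# Class-characterizing regularization: keep the learned token close to the
# class embedding so class-level prior knowledge is preserved during fine-tuning.
def class_characterizing_reg(token, class_emb, weight=0.1):
    return weight * (1.0 - F.cosine_similarity(token, class_emb, dim=0))

loss_reg = class_characterizing_reg(new_token, class_emb)
loss_reg.backward()                        # gradients flow only into new_token
print(float(loss_reg))
```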

MV2MAE: Multi-View Video Masked Autoencoders


Videos captured from multiple viewpoints can help in perceiving the 3D structure of the world and benefit computer vision tasks such as action recognition, tracking, etc. In this paper, we present a method for self-supervised learning from synchronized multi-view videos. We use a cross-view reconstruction task to inject geometry information into the model. Our approach is based on the masked autoencoder (MAE) framework. In addition to the same-view decoder, we introduce a separate cross-view decoder which leverages a cross-attention mechanism to reconstruct a target viewpoint video from a source viewpoint video, helping the learned representations become robust to viewpoint changes. For videos, static regions can be reconstructed trivially, which hinders learning meaningful representations. To tackle this, we introduce a motion-weighted reconstruction loss which improves temporal modeling.
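
The motion-weighted reconstruction loss can be illustrated with a short PyTorch sketch; the tensor layout and the temporal-difference weighting below are assumptions, not the paper's exact formulation.

```python
# A minimal sketch (assumptions, not the paper's code) of a motion-weighted
# reconstruction loss: patches with larger frame-to-frame change get larger
# weight, so static regions do not dominate the MAE objective.
import torch

def motion_weighted_mse(pred, target, eps=1e-6):
    """pred, target: (B, T, N, D) per-frame patch pixels."""
    # Motion proxy: mean absolute temporal difference of the target patches.
    motion = (target[:, 1:] - target[:, :-1]).abs().mean(dim=-1)   # (B, T-1, N)
    motion = torch.cat([motion[:, :1], motion], dim=1)             # pad first frame
    weights = motion / (motion.mean(dim=(1, 2), keepdim=True) + eps)
    per_patch = ((pred - target) ** 2).mean(dim=-1)                # (B, T, N)
    return (weights * per_patch).mean()

loss = motion_weighted_mse(torch.randn(2, 8, 196, 768), torch.randn(2, 8, 196, 768))
print(float(loss))
```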

StableIdentity: Inserting Anybody into Anywhere at First Sight


Recent advances in large pretrained text-to-image models have shown unprecedented capabilities for high-quality human-centric generation; however, customizing face identity remains an intractable problem. Existing methods cannot ensure stable identity preservation and flexible editability, even with several images per subject during training. In this work, we propose StableIdentity, which allows identity-consistent recontextualization with just one face image. More specifically, we employ a face encoder with an identity prior to encode the input face, and then land the face representation in a space with an editable prior, which is constructed from celeb names. By incorporating the identity prior and editability prior, the learned identity can be injected anywhere with various contexts. In addition, we design a masked two-phase diffusion loss to boost the pixel-level perception of the input face and maintain the diversity of generation. Extensive experiments demonstrate that our method outperforms previous customization methods. In addition, the learned identity can be flexibly combined with off-the-shelf modules such as ControlNet.
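
One possible reading of the masked two-phase diffusion loss is sketched below in PyTorch; the phase-switch step, the mask weighting, and the tensor shapes are assumptions rather than the released implementation.

```python
# A minimal sketch (assumed form, not StableIdentity's released code) of a
# masked two-phase diffusion loss: early in training use the plain
# noise-prediction loss; later add a face-mask-weighted term to sharpen
# pixel-level identity while keeping generation diverse.
import torch

def masked_two_phase_loss(noise_pred, noise, face_mask, step,
                          switch_step=10_000, mask_weight=1.0):
    """noise_pred, noise: (B, C, H, W) latents; face_mask: (B, 1, H, W) in [0, 1]."""
    base = ((noise_pred - noise) ** 2).mean()
    if step < switch_step:                      # phase 1: diversity-preserving
        return base
    masked = (face_mask * (noise_pred - noise) ** 2).sum() / (face_mask.sum() + 1e-6)
    return base + mask_weight * masked          # phase 2: boost face-region fidelity

loss = masked_two_phase_loss(torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64),
                             torch.rand(2, 1, 64, 64), step=20_000)
print(float(loss))
```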

Motion-I2V: Consistent and Controllable Image-to-Video Generation with Explicit Motion Modeling


We introduce Motion-I2V, a novel framework for consistent and controllable image-to-video generation (I2V). In contrast to previous methods that directly learn the complicated image-to-video mapping, Motion-I2V factorizes I2V into two stages with explicit motion modeling. In the first stage, we propose a diffusion-based motion field predictor, which focuses on deducing the trajectories of the reference image's pixels. In the second stage, we propose motion-augmented temporal attention to enhance the limited 1-D temporal attention in video latent diffusion models. This module can effectively propagate the reference image's features to the synthesized frames under the guidance of the trajectories predicted in the first stage. Compared with existing methods, Motion-I2V generates more consistent videos even in the presence of large motion and viewpoint variation. By training a sparse trajectory ControlNet for the first stage, Motion-I2V allows users to precisely control motion trajectories and motion regions with sparse trajectory and region annotations. This offers more controllability over the I2V process than relying solely on textual instructions. Additionally, Motion-I2V's second stage naturally supports zero-shot video-to-video translation.
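
The first-stage motion field can guide feature propagation by warping reference features along the predicted trajectories; the bilinear-warping sketch below is an assumption about the mechanism, not the authors' implementation of motion-augmented temporal attention.

```python
# A minimal sketch (an assumption about the mechanism, not the authors' code)
# of propagating reference-image features to a target frame with a predicted
# motion field via bilinear warping; the real model feeds such motion-guided
# features into its motion-augmented temporal attention.
import torch
import torch.nn.functional as F

def warp_features(ref_feat, flow):
    """ref_feat: (B, C, H, W); flow: (B, 2, H, W) in pixels, reference -> target frame."""
    B, _, H, W = ref_feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0) + flow       # (B, 2, H, W)
    # Normalize to [-1, 1] for grid_sample (x first, then y).
    grid_x = 2.0 * grid[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)                           # (B, H, W, 2)
    return F.grid_sample(ref_feat, grid, align_corners=True)

warped = warp_features(torch.randn(1, 64, 32, 32), torch.zeros(1, 2, 32, 32))
print(warped.shape)   # with zero flow, target features equal the reference features
```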

Spatial-Aware Latent Initialization for Controllable Image Generation


Recently, text-to-image diffusion models have demonstrated an impressive ability to generate high-quality images conditioned on textual input. However, these models struggle to accurately adhere to textual instructions regarding spatial layout. While previous research has primarily focused on aligning cross-attention maps with layout conditions, it overlooks the impact of the initialization noise on layout guidance. To achieve better layout control, we propose leveraging a spatial-aware initialization noise during the denoising process. Specifically, we find that a reference image inverted with a finite number of inversion steps contains valuable spatial awareness regarding the object's position, resulting in similar layouts in the generated images. Based on this observation, we develop an open-vocabulary framework to customize a spatial-aware initialization noise for each layout condition. Without modifying any modules other than the initialization noise, our approach can be seamlessly integrated as a plug-and-play module into other training-free layout guidance frameworks.
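
The core intuition, that a partially inverted reference latent keeps coarse layout information, can be mimicked with a toy forward-diffusion step; the noise schedule and stopping fraction below are assumptions, and a real pipeline would use its scheduler's own inversion rather than this stand-in.

```python
# A minimal, self-contained sketch (assumptions only) of the idea: instead of
# pure Gaussian noise, start denoising from a latent obtained by partially
# noising (a stand-in for finite-step inversion of) a reference layout image,
# so the object's position leaks into the initialization.
import torch

def spatial_aware_init(ref_latent, t_frac=0.8, num_train_steps=1000):
    """ref_latent: (B, C, H, W) latent of a reference image with the desired layout."""
    # Toy linear-beta DDPM schedule; real systems use the scheduler's own alphas.
    betas = torch.linspace(1e-4, 2e-2, num_train_steps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    t = int(t_frac * (num_train_steps - 1))          # stop short of pure noise
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(ref_latent)
    # Partially noised latent: still carries coarse spatial structure.
    return a_bar.sqrt() * ref_latent + (1.0 - a_bar).sqrt() * noise

init = spatial_aware_init(torch.randn(1, 4, 64, 64))
print(init.shape)
```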

Diffutoon: High-Resolution Editable Toon Shading via Diffusion Models


Toon shading is a type of non-photorealistic rendering task for animation. Its primary purpose is to render objects with a flat and stylized appearance. As diffusion models have ascended to the forefront of image synthesis, this paper delves into an innovative form of toon shading based on diffusion models, aiming to directly render photorealistic videos in anime style. In video stylization, existing methods encounter persistent challenges, notably in maintaining consistency and achieving high visual quality. In this paper, we model the toon shading problem as four subproblems: stylization, consistency enhancement, structure guidance, and colorization. To address the challenges in video stylization, we propose an effective toon shading approach called Diffutoon. Diffutoon is capable of rendering remarkably detailed, high-resolution, and extended-duration videos in anime style. It can also edit the content according to prompts via an additional branch.
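
The four-subproblem decomposition can be viewed as a staged pipeline; the sketch below uses placeholder stages (identity functions and a simple temporal average) purely to show the structure, not Diffutoon's actual diffusion and ControlNet modules.

```python
# A minimal structural sketch (placeholders only, not Diffutoon itself) of the
# decomposition: stylization, structure guidance and colorization act per frame,
# while consistency enhancement operates across frames. Real stages would be
# diffusion / ControlNet branches.
import torch

def stylize(frame, prompt):     return frame   # placeholder: anime-style diffusion pass
def structure_guidance(frame):  return frame   # placeholder: e.g. lineart/depth condition
def colorize(frame):            return frame   # placeholder: color control branch

def enhance_consistency(frames):
    # Placeholder cross-frame smoothing toward the temporal mean.
    return 0.5 * frames + 0.5 * frames.mean(dim=0, keepdim=True)

def toon_shade(video, prompt="anime style"):
    """video: (T, C, H, W) photorealistic frames -> stylized frames."""
    frames = torch.stack([colorize(stylize(structure_guidance(f), prompt)) for f in video])
    return enhance_consistency(frames)

out = toon_shade(torch.rand(4, 3, 64, 64))
print(out.shape)
```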

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model


We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding.
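
The Partial LoRA (PLoRA) idea, applying the low-rank update only to image tokens, can be sketched as a masked LoRA linear layer; the rank, scaling, and masking convention below are my assumptions, not the released code.

```python
# A minimal sketch (a reading of the idea, not the released PLoRA code) of a
# Partial LoRA linear layer: the low-rank update is applied only to positions
# flagged as image tokens, leaving the frozen weights untouched for text tokens.
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    def __init__(self, in_dim, out_dim, rank=8, alpha=16.0):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim, bias=False)
        self.base.weight.requires_grad_(False)          # frozen pretrained weight
        self.lora_a = nn.Linear(in_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.lora_b.weight)              # start as a zero update
        self.scale = alpha / rank

    def forward(self, x, image_mask):
        """x: (B, L, in_dim); image_mask: (B, L) bool, True where the token is visual."""
        out = self.base(x)
        delta = self.lora_b(self.lora_a(x)) * self.scale
        return out + delta * image_mask.unsqueeze(-1)   # LoRA path only for image tokens

layer = PartialLoRALinear(64, 64)
x = torch.randn(2, 10, 64)
mask = torch.zeros(2, 10, dtype=torch.bool)
mask[:, :4] = True                                      # first 4 tokens are image tokens
print(layer(x, mask).shape)
```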
