Image Editing, 3D Textured Mesh, Image Composition, SplattingAvatar

Author: JiauZhang. Category: Toy Blog. Posted 8 months ago.

This article was first published on the WeChat public account 机器感知 (Machine Perception).

An Item is Worth a Prompt: Versatile Image Editing with Disentangled Control

Building on the success of text-to-image diffusion models (DPMs), image editing is an important application to enable human interaction with AI-generated content. Among various editing methods, editing within the prompt space gains more attention due to its capacity and simplicity of controlling semantics. However, since diffusion models are commonly pretrained on descriptive text captions, direct editing of words in text prompts usually leads to completely different generated images, violating the requirements for image editing. On the other hand, existing editing methods usually consider introducing spatial masks to preserve the identity of unedited regions, which are usually ignored by DPMs and therefore lead to inharmonic editing results. Targeting these two challenges, in this work, we propose to disentangle the comprehensive image-prompt interaction into several item-prompt interactions, with each item linked to a special learned prompt. The resulting framework, named D-Edit, is based on pretrained diffusion models with cross-attention layers disentangled and adopts a two-step optimization to build item-prompt associations. Versatile image editing can then be applied to specific items by manipulating the corresponding prompts. We demonstrate state-of-the-art results in four types of editing operations including image-based, text-based, mask-based editing, and item removal, covering most types of editing applications, all within a single unified framework. Notably, D-Edit is the first framework that can (1) achieve item editing through mask editing and (2) combine image and text-based editing. We demonstrate the quality and versatility of the editing results for a diverse collection of images through both qualitative and quantitative evaluations.

RFWave: Multi-band Rectified Flow for Audio Waveform Reconstruction

Recent advancements in generative modeling have led to significant progress in audio waveform reconstruction from diverse representations. Although diffusion models have been used for reconstructing audio waveforms, they tend to exhibit latency issues because they operate at the level of individual sample points and require a relatively large number of sampling steps. In this study, we introduce RFWave, a novel multi-band Rectified Flow approach that reconstructs high-fidelity audio waveforms from Mel-spectrograms. RFWave is distinctive for generating complex spectrograms and operating at the frame level, processing all subbands concurrently to enhance efficiency. Thanks to Rectified Flow, which aims for a flat transport trajectory, RFWave requires only 10 sampling steps. Empirical evaluations demonstrate that RFWave achieves exceptional reconstruction quality and superior computational efficiency, capable of generating audio at a speed 90 times faster than real-time.
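
The few-step sampling described above can be illustrated with a minimal sketch. This is not RFWave's implementation: it only shows fixed-step Euler integration of an ODE dx/dt = v(x, t) from noise (t = 0) to data (t = 1), which is how Rectified Flow models are commonly sampled. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the toy velocity field stands in for the learned network, assuming a perfectly straight path toward a known target.

```python
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    x0: Vector,
    num_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Hypothetical stand-in for a trained model: the velocity of a straight
    path x_t = (1 - t) * x0 + t * target, written as a function of (x_t, t)."""

    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

    return velocity


noise = [0.3, -1.2, 0.8]    # starting point (plays the role of Gaussian noise)
target = [1.0, 0.0, -1.0]   # end point (plays the role of a data sample)
sample = euler_sample(make_toy_velocity(target), noise, num_steps=10)
```

Because the toy trajectory is exactly straight, Euler integration reaches the target regardless of the step count; a learned velocity field is only approximately straight, which is why a small but non-trivial number of steps is still needed in practice.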

InstructGIE: Towards Generalizable Image Editing

Recent advances in image editing have been driven by the development of denoising diffusion models, marking a significant leap forward in this field. Despite these advances, the generalization capabilities of recent image editing approaches remain constrained. In response to this challenge, our study introduces a novel image editing framework with enhanced generalization robustness by boosting in-context learning capability and unifying language instruction. This framework incorporates a module specifically optimized for image editing tasks, leveraging the VMamba Block and an editing-shift matching strategy to augment in-context learning. Furthermore, we unveil a selective area-matching technique specifically engineered to address and rectify corrupted details in generated images, such as human facial features, to further improve the quality. Another key innovation of our approach is the integration of a language unification technique, which aligns language embeddings with editing semantics to elevate the quality of image editing. Moreover, we compile the first dataset for image editing with visual prompts and editing instructions that could be used to enhance in-context capability. Trained on this dataset, our methodology not only achieves superior synthesis quality for trained tasks, but also demonstrates robust generalization capability across unseen vision tasks through tailored prompts.

CRM: Single Image to 3D Textured Mesh with Convolutional Reconstruction Model

Feed-forward 3D generative models like the Large Reconstruction Model (LRM) have demonstrated exceptional generation speed. However, the transformer-based methods do not leverage the geometric priors of the triplane component in their architecture, often leading to sub-optimal quality given the limited size of 3D data and slow training. In this work, we present the Convolutional Reconstruction Model (CRM), a high-fidelity feed-forward single image-to-3D generative model. Recognizing the limitations posed by sparse 3D data, we highlight the necessity of integrating geometric priors into network design. CRM builds on the key observation that the visualization of triplane exhibits spatial correspondence of six orthographic images. First, it generates six orthographic view images from a single input image, then feeds these images into a convolutional U-Net, leveraging its strong pixel-level alignment capabilities and significant bandwidth to create a high-resolution triplane. CRM further employs Flexicubes as geometric representation, facilitating direct end-to-end optimization on textured meshes. Overall, our model delivers a high-fidelity textured mesh from an image in just 10 seconds, without any test-time optimization.

XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution

Diffusion-based methods, endowed with a formidable generative prior, have received increasing attention in Image Super-Resolution (ISR) recently. However, as low-resolution (LR) images often undergo severe degradation, it is challenging for ISR models to perceive the semantic and degradation information, resulting in restored images with incorrect content or unrealistic artifacts. To address these issues, we propose a Cross-modal Priors for Super-Resolution (XPSR) framework. Within XPSR, to acquire precise and comprehensive semantic conditions for the diffusion model, cutting-edge Multimodal Large Language Models (MLLMs) are utilized. To facilitate better fusion of cross-modal priors, a Semantic-Fusion Attention is proposed. To distill semantic-preserved information instead of undesired degradations, a Degradation-Free Constraint is attached between LR and its high-resolution (HR) counterpart. Quantitative and qualitative results show that XPSR is capable of generating high-fidelity and high-realism images across synthetic and real-world datasets. Code will be released at https://github.com/qyp2000/XPSR.

PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering

Image composition involves seamlessly integrating given objects into a specific visual context. The current training-free methods rely on composing attention weights from several samplers to guide the generator. However, since these weights are derived from disparate contexts, their combination leads to coherence confusion in synthesis and loss of appearance information. These issues worsen with their excessive focus on background generation, even when unnecessary in this task. This not only slows down inference but also compromises foreground generation quality. Moreover, these methods introduce unwanted artifacts in the transition area. In this paper, we formulate image composition as a subject-based local editing task, solely focusing on foreground generation. At each step, the edited foreground is combined with the noisy background to maintain scene consistency. To address the remaining issues, we propose PrimeComposer, a faster training-free diffuser that composites the images by well-designed attention steering across different noise levels. This steering is predominantly achieved by our Correlation Diffuser, utilizing its self-attention layers at each step. Within these layers, the synthesized subject interacts with both the referenced object and background, capturing intricate details and coherent relationships. This prior information is encoded into the attention weights, which are then integrated into the self-attention layers of the generator to guide the synthesis process. Besides, we introduce a Region-constrained Cross-Attention to confine the impact of specific subject-related words to desired regions, addressing the unwanted artifacts shown in the prior method thereby further improving the coherence in the transition area. Our method exhibits the fastest inference efficiency and extensive experiments demonstrate our superiority both qualitatively and quantitatively.
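
The per-step combination of an edited foreground with a noisy background can be sketched as a masked blend. This is only an illustration of the general idea, not PrimeComposer's code: the function names `add_noise` and `combine` are hypothetical, the noising follows the standard DDPM forward process (an assumption, since the abstract does not specify it), and the random arrays stand in for latents.

```python
import numpy as np


def add_noise(clean: np.ndarray, alpha_bar: float, rng: np.random.Generator) -> np.ndarray:
    """Standard DDPM forward noising at a noise level given by alpha_bar (assumed)."""
    noise = rng.standard_normal(clean.shape)
    return np.sqrt(alpha_bar) * clean + np.sqrt(1.0 - alpha_bar) * noise


def combine(foreground: np.ndarray, noisy_background: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep the edited foreground inside the mask and the noisy background outside it."""
    return mask * foreground + (1.0 - mask) * noisy_background


rng = np.random.default_rng(0)
background = rng.standard_normal((4, 8, 8))   # stand-in for the background latent
foreground = rng.standard_normal((4, 8, 8))   # stand-in for the edited foreground latent
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0                       # region where the subject is placed

noisy_bg = add_noise(background, alpha_bar=0.5, rng=rng)
combined = combine(foreground, noisy_bg, mask)
```

In a sampler, this blend would be repeated at every denoising step with the background noised to the current step's level, so only the masked region is actually generated.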

Stealing Stable Diffusion Prior for Robust Monocular Depth Estimation

Monocular depth estimation is a crucial task in computer vision. While existing methods have shown impressive results under standard conditions, they often face challenges in reliably performing in scenarios such as low-light or rainy conditions due to the absence of diverse training data. This paper introduces a novel approach named Stealing Stable Diffusion (SSD) prior for robust monocular depth estimation. The approach addresses this limitation by utilizing stable diffusion to generate synthetic images that mimic challenging conditions. Additionally, a self-training mechanism is introduced to enhance the model's depth estimation capability in such challenging environments. To enhance the utilization of the stable diffusion prior further, the DINOv2 encoder is integrated into the depth model architecture, enabling the model to leverage rich semantic priors and improve its scene understanding. Furthermore, a teacher loss is introduced to guide the student models in acquiring meaningful knowledge independently, thus reducing their dependency on the teacher models. The effectiveness of the approach is evaluated on nuScenes and Oxford RobotCar, two challenging public datasets, with the results showing the efficacy of the method. Source code and weights are available at: https://github.com/hitcslj/SSD.

Improving Diffusion-Based Generative Models via Approximated Optimal Transport

We introduce the Approximated Optimal Transport (AOT) technique, a novel training scheme for diffusion-based generative models. Our approach aims to approximate and integrate optimal transport into the training process, significantly enhancing the ability of diffusion models to estimate the denoiser outputs accurately. This improvement leads to ODE trajectories of diffusion models with lower curvature and reduced truncation errors during sampling. We achieve superior image quality and reduced sampling steps by employing AOT in training. Specifically, we achieve FID scores of 1.88 with just 27 NFEs and 1.73 with 29 NFEs in unconditional and conditional generations, respectively. Furthermore, when applying AOT to train the discriminator for guidance, we establish new state-of-the-art FID scores of 1.68 and 1.58 for unconditional and conditional generations, respectively, each with 29 NFEs. This outcome demonstrates the effectiveness of AOT in enhancing the performance of diffusion models.

SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting

We present SplattingAvatar, a hybrid 3D representation of photorealistic human avatars with Gaussian Splatting embedded on a triangle mesh, which renders over 300 FPS on a modern GPU and 30 FPS on a mobile device. We disentangle the motion and appearance of a virtual human with explicit mesh geometry and implicit appearance modeling with Gaussian Splatting. The Gaussians are defined by barycentric coordinates and displacement on a triangle mesh as Phong surfaces. We extend lifted optimization to simultaneously optimize the parameters of the Gaussians while walking on the triangle mesh. SplattingAvatar is a hybrid representation of virtual humans where the mesh represents low-frequency motion and surface deformation, while the Gaussians take over the high-frequency geometry and detailed appearance. Unlike existing deformation methods that rely on an MLP-based linear blend skinning (LBS) field for motion, we control the rotation and translation of the Gaussians directly by mesh, which empowers its compatibility with various animation techniques, e.g., skeletal animation, blend shapes, and mesh editing. Trainable from monocular videos for both full-body and head avatars, SplattingAvatar shows state-of-the-art rendering quality across multiple datasets.
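
The mesh embedding described above (barycentric coordinates plus a displacement) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: it assumes the displacement is applied along the normal interpolated from the three vertex normals, which is one common interpretation of "Phong surface", and the helper names are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]
Triple = Tuple[Vec3, Vec3, Vec3]


def interpolate(bary: Vec3, a: Vec3, b: Vec3, c: Vec3) -> Vec3:
    """Barycentric interpolation of three per-vertex vectors."""
    u, v, w = bary
    return tuple(u * a[i] + v * b[i] + w * c[i] for i in range(3))


def normalize(n: Vec3) -> Vec3:
    length = math.sqrt(sum(x * x for x in n))
    return tuple(x / length for x in n)


def embedded_position(bary: Vec3, verts: Triple, normals: Triple, displacement: float) -> Vec3:
    """Point on the triangle, offset along the interpolated (Phong-style) normal."""
    base = interpolate(bary, *verts)
    normal = normalize(interpolate(bary, *normals))
    return tuple(base[i] + displacement * normal[i] for i in range(3))


# A triangle in the z = 0 plane with upward-facing vertex normals.
verts = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0))
normals = ((0.0, 0.0, 1.0), (0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
bary = (0.25, 0.25, 0.5)

rest = embedded_position(bary, verts, normals, displacement=0.1)

# Animating the mesh (here: lifting it by 1 along z) moves the embedded point with it,
# without touching the Gaussian's own parameters (bary, displacement).
lifted = tuple((x, y, z + 1.0) for x, y, z in verts)
posed = embedded_position(bary, lifted, normals, displacement=0.1)
```

Since the embedded point is a function of the mesh vertices alone, any technique that deforms the mesh (skeletal animation, blend shapes, manual editing) drives the Gaussians for free, which is the compatibility property the abstract highlights.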

Face2Diffusion for Fast and Editable Face Personalization

Face personalization aims to insert specific faces, taken from images, into pretrained text-to-image diffusion models. However, it is still challenging for previous methods to preserve both the identity similarity and editability due to overfitting to training samples. In this paper, we propose Face2Diffusion (F2D) for high-editability face personalization. The core idea behind F2D is that removing identity-irrelevant information from the training pipeline prevents the overfitting problem and improves editability of encoded faces. F2D consists of the following three novel components: 1) Multi-scale identity encoder provides well-disentangled identity features while keeping the benefits of multi-scale information, which improves the diversity of camera poses. 2) Expression guidance disentangles face expressions from identities and improves the controllability of face expressions. 3) Class-guided denoising regularization encourages models to learn how faces should be denoised, which boosts the text-alignment of backgrounds. Extensive experiments on the FaceForensics++ dataset and diverse prompts demonstrate our method greatly improves the trade-off between the identity- and text-fidelity compared to previous state-of-the-art methods.

CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion

Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges in terms of computational efficiency and the refinement of image details. To tackle the issue, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only results in competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, all while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while only utilizing 1/10 of the inference time by SDXL.

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Diffusion models have demonstrated remarkable performance in the domain of text-to-image generation. However, most widely used models still employ CLIP as their text encoder, which constrains their ability to comprehend dense prompts, encompassing multiple objects, detailed attributes, complex relationships, long-text alignment, etc. In this paper, we introduce an Efficient Large Language Model Adapter, termed ELLA, which equips text-to-image diffusion models with powerful Large Language Models (LLM) to enhance text alignment without training of either U-Net or LLM. To seamlessly bridge two pre-trained models, we investigate a range of semantic alignment connector designs and propose a novel module, the Timestep-Aware Semantic Connector (TSC), which dynamically extracts timestep-dependent conditions from LLM. Our approach adapts semantic features at different stages of the denoising process, assisting diffusion models in interpreting lengthy and intricate prompts over sampling timesteps. Additionally, ELLA can be readily incorporated with community models and tools to improve their prompt-following capabilities. To assess text-to-image models in dense prompt following, we introduce Dense Prompt Graph Benchmark (DPG-Bench), a challenging benchmark consisting of 1K dense prompts. Extensive experiments demonstrate the superiority of ELLA in dense prompt following compared to state-of-the-art methods, particularly in multiple object compositions involving diverse attributes and relationships.

Improving Diffusion Models for Virtual Try-on

This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment, given a pair of images depicting the person and the garment, respectively. Previous works adapt existing exemplar-based inpainting diffusion models for virtual try-on to improve the naturalness of the generated visuals compared to other methods (e.g., GAN-based), but they fail to preserve the identity of the garments. To overcome this limitation, we propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. Our method, coined IDM-VTON, uses two different modules to encode the semantics of garment image; given the base UNet of the diffusion model, 1) the high-level semantics extracted from a visual encoder are fused to the cross-attention layer, and then 2) the low-level features extracted from parallel UNet are fused to the self-attention layer. In addition, we provide detailed textual prompts for both garment and person images to enhance the authenticity of the generated visuals. Finally, we present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity. Our experimental results show that our method outperforms previous approaches (both diffusion-based and GAN-based) in preserving garment details and generating authentic virtual try-on images, both qualitatively and quantitatively. Furthermore, the proposed customization method demonstrates its effectiveness in a real-world scenario.

GSEdit: Efficient Text-Guided Editing of 3D Objects via Gaussian Splatting

We present GSEdit, a pipeline for text-guided 3D object editing based on Gaussian Splatting models. Our method enables the editing of the style and appearance of 3D objects without altering their main details, all in a matter of minutes on consumer hardware. We tackle the problem by leveraging Gaussian splatting to represent 3D scenes, and we optimize the model while progressively varying the image supervision by means of a pretrained image-based diffusion model. The input object may be given as a 3D triangular mesh, or directly provided as Gaussians from a generative model such as DreamGaussian. GSEdit ensures consistency across different viewpoints, maintaining the integrity of the original object's information. Compared to previously proposed methods relying on NeRF-like MLP models, GSEdit stands out for its efficiency, making 3D editing tasks much faster. Our editing process is refined via the application of the SDS loss, ensuring that our edits are both precise and accurate. Our comprehensive evaluation demonstrates that GSEdit effectively alters object shape and appearance following the given textual instructions while preserving their coherence and detail.

Denoising Autoregressive Representation Learning

In this paper, we explore a new generative approach for learning visual representations. Our method, DARL, employs a decoder-only Transformer to predict image patches autoregressively. We find that training with Mean Squared Error (MSE) alone leads to strong representations. To enhance the image generation ability, we replace the MSE loss with the diffusion objective by using a denoising patch decoder. We show that the learned representation can be improved by using tailored noise schedules and longer training in larger models. Notably, the optimal schedule differs significantly from the typical ones used in standard image diffusion models. Overall, despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol. This marks an important step towards a unified model capable of both visual perception and generation, effectively combining the strengths of autoregressive and denoising diffusion models.

Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation

Vanilla text-to-image diffusion models struggle with generating accurate human images, commonly resulting in imperfect anatomies such as unnatural postures or disproportionate limbs. Existing methods address this issue mostly by fine-tuning the model with extra images or adding additional controls (human-centric priors such as pose or depth maps) during the image generation phase. This paper explores the integration of these human-centric priors directly into the model fine-tuning stage, essentially eliminating the need for extra conditions at the inference stage. We realize this idea by proposing a human-centric alignment loss to strengthen human-related information from the textual prompts within the cross-attention maps. To ensure semantic detail richness and human structural accuracy during fine-tuning, we introduce scale-aware and step-wise constraints within the diffusion process, according to an in-depth analysis of the cross-attention layer. Extensive experiments show that our method largely improves over state-of-the-art text-to-image models to synthesize high-quality human images based on user-written prompts. Project page: https://hcplayercvpr2024.github.io.

VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models

Text-to-image diffusion models (T2I) have demonstrated unprecedented capabilities in creating realistic and aesthetic images. In contrast, text-to-video diffusion models (T2V) still lag far behind in frame quality and text alignment, owing to insufficient quality and quantity of training videos. In this paper, we introduce VideoElevator, a training-free and plug-and-play method, which elevates the performance of T2V using superior capabilities of T2I. Different from conventional T2V sampling (i.e., temporal and spatial modeling), VideoElevator explicitly decomposes each sampling step into temporal motion refining and spatial quality elevating. Specifically, temporal motion refining uses encapsulated T2V to enhance temporal consistency, followed by inverting to the noise distribution required by T2I. Then, spatial quality elevating harnesses inflated T2I to directly predict a less noisy latent, adding more photo-realistic details. We have conducted experiments on extensive prompts under the combination of various T2V and T2I. The results show that VideoElevator not only improves the performance of T2V baselines with foundational T2I, but also facilitates stylistic video synthesis with personalized T2I. Our code is available at https://github.com/YBYBZhang/VideoElevator.

GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM

Key-value (KV) caching has become the de facto approach for accelerating generation in large language model (LLM) inference. However, the growing cache demand with increasing sequence length has turned LLM inference into a memory-bound problem, significantly constraining the system throughput. Existing methods rely on dropping unimportant tokens or quantizing all entries uniformly. Such methods, however, often incur high approximation errors when representing the compressed matrices. The autoregressive decoding process further compounds the error of each step, resulting in critical deviation in model generation and deterioration of performance. To tackle this challenge, we propose GEAR, an efficient KV cache compression framework that achieves near-lossless high-ratio compression. GEAR first quantizes the majority of entries, which have similar magnitudes, to ultra-low precision. It then employs a low-rank matrix to approximate the quantization error, and a sparse matrix to remedy individual errors from outlier entries. By adeptly integrating these three techniques, GEAR is able to fully exploit their synergistic potential. Our experiments demonstrate that compared to alternatives, GEAR achieves near-lossless 4-bit KV cache compression with up to 2.38x throughput improvement, while reducing peak memory size by up to 2.29x. Our code is publicly available at https://github.com/HaoKang-Timmy/GEAR.
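
The three-part decomposition in this abstract (quantized backbone, low-rank error correction, sparse outliers) can be sketched with NumPy. This is a toy illustration under stated assumptions, not the GEAR implementation: the function names, the per-matrix uniform quantizer, the choice to pull outliers out before quantizing, and the use of a truncated SVD for the low-rank term are all assumptions made for the example.

```python
import numpy as np


def uniform_quantize(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Quantize to 2**bits evenly spaced levels over [min, max], then dequantize."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((x - lo) / scale)
    return codes * scale + lo


def gear_like_compress(x: np.ndarray, bits: int = 4, rank: int = 2, outlier_frac: float = 0.02):
    """Return (quantized, low_rank, sparse) whose sum approximates x."""
    # 1. Sparse part: keep the largest-magnitude entries exactly.
    k = max(1, int(outlier_frac * x.size))
    top_idx = np.argpartition(np.abs(x).ravel(), -k)[-k:]
    mask = np.zeros(x.shape, dtype=bool)
    mask.flat[top_idx] = True
    sparse = np.where(mask, x, 0.0)

    # 2. Quantized part: low-precision version of the remaining entries.
    dense = np.where(mask, 0.0, x)
    quantized = np.where(mask, 0.0, uniform_quantize(dense, bits))

    # 3. Low-rank part: truncated SVD of the quantization error.
    residual = dense - quantized
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank, :]
    return quantized, low_rank, sparse


rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32))
x[rng.integers(0, 64, size=10), rng.integers(0, 32, size=10)] *= 50.0  # inject outliers

quantized, low_rank, sparse = gear_like_compress(x)
err_without_low_rank = np.linalg.norm(x - (quantized + sparse))
err_with_low_rank = np.linalg.norm(x - (quantized + low_rank + sparse))
```

Removing the outliers first keeps the quantization range tight for the bulk of entries, and the truncated SVD can only reduce the Frobenius norm of the remaining error, so `err_with_low_rank` never exceeds `err_without_low_rank` in this sketch.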

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

In this report, we present the latest model of the Gemini family, Gemini 1.5 Pro, a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. Gemini 1.5 Pro achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Studying the limits of Gemini 1.5 Pro's long-context ability, we find continued improvement in next-token prediction and near-perfect retrieval (>99%) up to at least 10M tokens, a generational leap over existing models such as Claude 2.1 (200k) and GPT-4 Turbo (128k). Finally, we highlight surprising new capabilities of large language models at the frontier; when given a grammar manual for Kalamang, a language with fewer than 200 speakers worldwide, the model learns to translate English to Kalamang at a similar level to a person who learned from the same content.
