Paper subscriptions for your field
Follow 晓理紫 on WeChat (VX) for daily paper updates. If you find this useful, please forward it to colleagues who need it. Thank you for your support.
Follow 晓理紫 on WeChat and leave your email address to receive the daily paper digest for free.
Categories:
- Large Language Models (LLM)
- Vision-Language Models (VLM)
- Diffusion Models
- Visual Navigation
- Embodied AI & Robotics
- Reinforcement Learning
- Open Vocabulary, Detection & Segmentation
Today's papers from 晓理紫
== Embodied AI & Robotics ==
Title: Augmented Reality User Interface for Command, Control, and Supervision of Large Multi-Agent Teams
Authors: Frank Regal, Chris Suarez, Fabian Parra
Abstract: Multi-agent human-robot teaming allows for the potential to gather information about various environments more efficiently by exploiting and combining the strengths of humans and robots. In industries like defense, search and rescue, and first response, heterogeneous human-robot teams show promise to accelerate data collection and improve team safety by removing humans from unknown and potentially hazardous situations. This work builds upon AugRE, an Augmented Reality (AR) based scalable human-robot teaming framework. It enables users to localize and communicate with 50+ autonomous agents. Through our efforts, users are able to command, control, and supervise agents in large teams, both line-of-sight and non-line-of-sight, without the need to modify the environment beforehand and without requiring typical hardware (i.e., joysticks, keyboards, laptops, tablets, etc.) in the field. The demonstrated work shows early indications that combining these AR-HMD-based user interaction modalities for command, control, and supervision will help improve human-robot team collaboration, robustness, and trust.
[Downlink:]http://arxiv.org/abs/2401.05665v1
[Project:]https://sites.google.com/view/xr-robotics-iros2023/home?authuser=0
Title: Unified Learning from Demonstrations, Corrections, and Preferences during Physical Human-Robot Interaction
Authors: Shaunak A. Mehta, Dylan P. Losey
Abstract: Humans can leverage physical interaction to teach robot arms. This physical interaction takes multiple forms depending on the task, the user, and what the robot has learned so far. State-of-the-art approaches focus on learning from a single modality, or combine multiple interaction types by assuming that the robot has prior information about the human's intended task. By contrast, in this paper we introduce an algorithmic formalism that unites learning from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human's inputs to nearby alternatives. We first derive a loss function that trains an ensemble of reward models to match the human's demonstrations, corrections, and preferences. The type and order of feedback are up to the human teacher: we enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert our learned reward into a desired robot trajectory. Through simulations and a user study, we demonstrate that our proposed approach learns manipulation tasks from physical human interaction more accurately than existing baselines, particularly when the robot is faced with new or unexpected objectives. Videos of our user study are available at: https://youtu.be/FSUJsTYvEKU
[Downlink:]http://arxiv.org/abs/2207.03395v2
[Project:]https://youtu.be/FSUJsTYvEKU
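The core idea in this abstract is learning a reward model from scratch by ranking each human input above nearby alternatives. A minimal sketch of that idea, assuming a linear reward and a softmax (Bradley-Terry style) ranking loss; the feature vectors, sampling scheme, and learning rate below are illustrative stand-ins, not the authors' implementation:

```python
import math
import random

def reward(w, phi):
    # Linear reward model r(s) = w . phi(s); the paper trains an ensemble of
    # richer models, this single linear model is a simplification.
    return sum(wi * fi for wi, fi in zip(w, phi))

def ranking_step(w, preferred, alternatives, lr=0.1):
    """One gradient step on a softmax ranking loss that pushes the human's
    input above its alternatives under the learned reward."""
    cand = [preferred] + alternatives
    scores = [reward(w, c) for c in cand]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # stabilized softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    # gradient of -log P(preferred) is E_softmax[phi] - phi_preferred
    grad = [sum(p * c[i] for p, c in zip(probs, cand)) - preferred[i]
            for i in range(len(w))]
    return [wi - lr * g for wi, g in zip(w, grad)]

random.seed(0)
w = [0.0, 0.0]
# Toy task: human inputs (demonstrations, corrections, and preferences alike)
# always have a large first feature; alternatives are random states.
human_inputs = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]
for _ in range(500):
    phi = random.choice(human_inputs)
    alternatives = [[random.random(), random.random()] for _ in range(4)]
    w = ranking_step(w, phi, alternatives)
# The learned reward now weights the first feature positively, recovering the
# human's intent without any task-specific prior.
```

The same update applies regardless of whether the preferred input came from a demonstration, a correction, or an explicit preference, which is the unification the paper describes.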
Title: Transferability of HRI Research: Potential and Challenges
Authors: Wafa Johal
Abstract: With the advancement of robotics and artificial intelligence, applications for robotics are flourishing. Human-robot interaction (HRI) is an important area of robotics, as it allows robots to work closer to humans (with them or for them). One crucial factor for the success of HRI research is transferability, which refers to the ability of research outputs to be adopted by industry and provide benefits to society. In this paper, we explore the potential and challenges of transferability in HRI research. First, we examine the current state of HRI research and identify the various types of contributions that could lead to successful outcomes. Second, we discuss the potential benefits of each type of contribution and identify factors that could facilitate industry adoption of HRI research. However, we also recognize several challenges associated with transferability, such as the diversity of well-defined job/skill sets required of HRI practitioners, the lack of industry-led research, and the lack of standardization in HRI research methods. We discuss these challenges and propose potential solutions to bridge the gap between industry expectations and academic research in HRI.
[Downlink:]http://arxiv.org/abs/2401.05802v1
Title: Theory of Mind Abilities of Large Language Models in Human-Robot Interaction: An Illusion?
Authors: Mudit Verma, Siddhant Bhambri, Subbarao Kambhampati
Abstract: Large Language Models have shown exceptional generative abilities in various natural language and generation tasks. However, possible anthropomorphization and leniency towards failure cases have propelled discussions on the emergent abilities of Large Language Models, especially on Theory of Mind (ToM) abilities. While several false-belief tests exist to verify the ability to infer and maintain mental models of another entity, we study a special application of ToM abilities that has higher stakes and possibly irreversible consequences: Human-Robot Interaction. In this work, we explore the task of Perceived Behavior Recognition, where a robot employs a Large Language Model (LLM) to assess the robot's generated behavior in a manner similar to a human observer. We focus on four behavior types, namely explicable, legible, predictable, and obfuscatory behavior, which have been extensively used to synthesize interpretable robot behaviors. The LLM's goal, therefore, is to serve as a human proxy for the agent and to answer how a certain agent behavior would be perceived by the human in the loop, for example, "Given a robot's behavior X, would the human observer find it explicable?". We conduct a human subject study to verify that users are able to correctly answer such a question in curated situations (robot setting and plan) across five domains. A first analysis of the belief test yields extremely positive results, inflating one's expectations of LLMs possessing ToM abilities. We then propose and perform a suite of perturbation tests that break this illusion: the Inconsistent Belief, Uninformative Context, and Conviction tests. We conclude that the high scores of LLMs on vanilla prompts showcase their potential use in HRI settings; however, possessing ToM demands invariance to trivial or irrelevant perturbations in the context, which LLMs lack.
[Downlink:]http://arxiv.org/abs/2401.05302v1
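The paper's methodology hinges on checking whether a model's judgment survives irrelevant perturbations of the prompt. A toy harness illustrating that check; the "model" below is a deliberately brittle stub keyed on surface cues (it is not any real LLM API), so the test shows how vanilla-prompt scores can overstate ToM:

```python
def brittle_model(prompt):
    # Stand-in for an LLM: judges behavior "explicable" iff the word appears
    # at the very end of the prompt -- a surface heuristic, not a mental model.
    return "yes" if prompt.rstrip().endswith("explicable?") else "no"

def perturbation_test(model, base_prompt, perturbations):
    """Return the vanilla answer and the perturbations that flip it.
    A model with a genuine observer model should produce no flips."""
    baseline = model(base_prompt)
    flips = [p for p in perturbations
             if model(base_prompt + " " + p) != baseline]
    return baseline, flips

base = "Given the robot's behavior X, would the observer find it explicable?"
irrelevant = ["(The lab walls are painted grey.)",
              "(Unrelated trivia follows.) The sky is blue."]
baseline, flips = perturbation_test(brittle_model, base, irrelevant)
# baseline looks correct, but every irrelevant padding flips the answer:
# the vanilla score was an illusion, mirroring the paper's conclusion.
```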
Title: Evaluating Gesture Recognition in Virtual Reality
Authors: Sandeep Reddy Sabbella, Sara Kaszuba, Francesco Leotta
Abstract: Human-Robot Interaction (HRI) has become increasingly important as robots are integrated into various aspects of daily life. One key aspect of HRI is gesture recognition, which allows robots to interpret and respond to human gestures in real time. Gesture recognition plays an important role in non-verbal communication in HRI. To this aim, there is ongoing research on how such non-verbal communication can strengthen verbal communication and improve the system's overall efficiency, thereby enhancing the user experience with the robot. However, several challenges need to be addressed in gesture recognition systems, including data generation, transferability, scalability, generalizability, standardization, and the lack of benchmarking of gestural systems. In this preliminary paper, we address the challenges of data generation using virtual reality simulations, as well as standardization issues, by presenting gestures mapped to a set of commands that could serve as a standard for ground robots.
[Downlink:]http://arxiv.org/abs/2401.04545v1
Title: Testing Human-Robot Interaction in Virtual Reality: Experience from a Study on Speech Act Classification
Authors: Sara Kaszuba, Sandeep Reddy Sabbella, Francesco Leotta
Abstract: In recent years, an increasing number of Human-Robot Interaction (HRI) approaches have been implemented and evaluated in Virtual Reality (VR), as it speeds up design iterations and makes it safer for the end user to evaluate and master the HRI primitives. However, identifying the most suitable VR experience is not straightforward. In this work, we evaluate how immersive and non-immersive VR are perceived by users in a speech act understanding task within a smart agriculture scenario. In particular, we collect opinions and suggestions from the 81 participants involved in both experiments to highlight the strengths and weaknesses of these different experiences.
[Downlink:]http://arxiv.org/abs/2401.04534v1
== Reinforcement Learning ==
Title: Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint
Authors: Zhipeng Chen, Kun Zhou, Wayne Xin Zhao
Abstract: Reinforcement learning (RL) has been widely used in training large language models (LLMs) to prevent unexpected outputs, e.g., reducing harmfulness and errors. However, existing RL methods mostly adopt instance-level rewards, which cannot provide fine-grained supervision for complex reasoning tasks and cannot focus on the few key tokens that lead to incorrectness. To address this, we propose a new RL method named RLMEC, which incorporates a generative model as the reward model. The reward model is trained on an erroneous-solution rewriting task under a minimum editing constraint and can produce token-level rewards for RL training. Based on the generative reward model, we design a token-level RL objective for training and an imitation-based regularization for stabilizing the RL process. Both objectives focus on learning the key tokens of an erroneous solution, reducing the effect of other, unimportant tokens. Experimental results on mathematical tasks and question-answering tasks demonstrate the effectiveness of our approach. Our code and data are available at https://github.com/RUCAIBox/RLMEC
[Downlink:]http://arxiv.org/abs/2401.06081v1
[GitHub:]https://github.com/RUCAIBox/RLMEC
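The token-level reward in this abstract comes from comparing a generated solution against a minimally edited rewrite: tokens the rewriter keeps are "good", tokens it edits away are not. A self-contained sketch of that alignment using plain edit-distance backtracing (RLMEC derives these rewards from a trained generative reward model; the 0/1 reward values here are an illustrative simplification):

```python
def token_rewards(generated, corrected):
    """Assign each generated token +1 if the minimum-edit alignment with the
    corrected solution keeps it, else 0 -- a sparse, token-level signal."""
    n, m = len(generated), len(corrected)
    # Standard Levenshtein DP table over tokens.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if generated[i - 1] == corrected[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # delete generated token
                           dp[i][j - 1] + 1,      # insert corrected token
                           dp[i - 1][j - 1] + cost)  # match / substitute
    # Backtrace: only exact matches on the optimal path earn reward.
    rewards = [0.0] * n
    i, j = n, m
    while i > 0 and j > 0:
        if generated[i - 1] == corrected[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            rewards[i - 1] = 1.0
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return rewards

# Only the wrong final token is penalized; correct reasoning tokens keep reward.
r = token_rewards("the answer is 5".split(), "the answer is 7".split())
```

This is the property the abstract emphasizes: supervision concentrates on the few tokens responsible for the error instead of spreading one scalar over the whole solution.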
Title: Open-Source Reinforcement Learning Environments Implemented in MuJoCo with Franka Manipulator
Authors: Zichun Xu, Yuntao Li, Xiaohang Yang
Abstract: This paper presents three open-source reinforcement learning environments developed on the MuJoCo physics engine with the Franka Emika Panda arm from MuJoCo Menagerie. Three representative tasks, push, slide, and pick-and-place, are implemented through the Gymnasium Robotics API, which inherits from the core of Gymnasium. Both sparse binary and dense rewards are supported, and the observation space contains the keys of desired and achieved goals to follow the Multi-Goal Reinforcement Learning framework. Three different off-policy algorithms are used to validate the simulation attributes and ensure the fidelity of all tasks, and benchmark results are also given. Each environment and task is defined in a clean way, and the main parameters for modifying the environment are preserved to reflect the main differences. The repository, including all environments, is available at https://github.com/zichunxx/panda_mujoco_gym
[Downlink:]http://arxiv.org/abs/2312.13788v2
[GitHub:]https://github.com/zichunxx/panda_mujoco_gym
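The Multi-Goal convention mentioned in the abstract means observations are dictionaries with `achieved_goal` and `desired_goal` keys, and the reward is a pure function of their distance. A dependency-free sketch of that contract; the threshold and observation values are made-up illustrations, not taken from the repository:

```python
import math

# Illustrative success threshold (meters); the actual value in the
# environments may differ.
DISTANCE_THRESHOLD = 0.05

def compute_reward(achieved_goal, desired_goal, reward_type="sparse"):
    """Sparse binary reward (0 on success, -1 otherwise) vs. dense
    negative-distance reward, as described in the abstract."""
    d = math.dist(achieved_goal, desired_goal)
    if reward_type == "sparse":
        return 0.0 if d <= DISTANCE_THRESHOLD else -1.0
    return -d

# A goal-conditioned observation carries the keys used by Multi-Goal RL.
obs = {
    "observation": [0.1, 0.2, 0.0],       # e.g., proprioceptive state (made up)
    "achieved_goal": [0.40, 0.10, 0.02],  # where the object currently is
    "desired_goal": [0.42, 0.10, 0.02],   # where it should end up
}
r = compute_reward(obs["achieved_goal"], obs["desired_goal"])  # within threshold
```

Keeping the reward a function of `(achieved_goal, desired_goal)` alone is what makes goal-relabeling methods such as hindsight experience replay possible on these tasks.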
Title: Edge Generation Scheduling for DAG Tasks Using Deep Reinforcement Learning
Authors: Binqi Sun, Mirco Theile, Ziyuan Qin
Abstract: Directed acyclic graph (DAG) tasks are currently adopted in the real-time domain to model complex applications from the automotive, avionics, and industrial domains that implement their functionalities through chains of intercommunicating tasks. This paper studies the problem of scheduling real-time DAG tasks by presenting a novel schedulability test based on the concept of trivial schedulability. Using this schedulability test, we propose a new DAG scheduling framework (edge generation scheduling, EGS) that attempts to minimize the DAG width by iteratively generating edges while guaranteeing the deadline constraint. We study how to efficiently solve the edge generation problem by developing a deep reinforcement learning algorithm combined with a graph representation neural network to learn an efficient edge generation policy for EGS. We evaluate the effectiveness of the proposed algorithm by comparing it with state-of-the-art DAG scheduling heuristics and an optimal mixed-integer linear programming baseline. Experimental results show that the proposed algorithm outperforms the state of the art by requiring fewer processors to schedule the same DAG tasks. The code is available at https://github.com/binqi-sun/egs
[Downlink:]http://arxiv.org/abs/2308.14647v2
[GitHub:]https://github.com/binqi-sun/egs
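The abstract's key quantity is DAG width, the maximum number of tasks with no precedence between any pair, and EGS shrinks it by adding edges. A small sketch of that intuition, assuming width is the largest antichain and using brute force over subsets (fine for tiny illustrative DAGs; the paper learns the edge-generation policy and also enforces the deadline constraint, which is omitted here):

```python
from itertools import combinations

def reachability(n, edges):
    """Transitive closure via DFS from every node of a DAG on nodes 0..n-1."""
    adj = {u: [] for u in range(n)}
    for u, v in edges:
        adj[u].append(v)
    reach = {}
    for s in range(n):
        seen, stack = set(), [s]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        reach[s] = seen
    return reach

def dag_width(n, edges):
    """Largest antichain: the most nodes with no precedence between any pair,
    i.e. the peak parallelism the schedule must accommodate."""
    reach = reachability(n, edges)
    for k in range(n, 0, -1):
        for nodes in combinations(range(n), k):
            if all(b not in reach[a] and a not in reach[b]
                   for a, b in combinations(nodes, 2)):
                return k
    return 0

# Diamond DAG 0 -> {1, 2} -> 3: nodes 1 and 2 may run in parallel (width 2).
diamond = [(0, 1), (0, 2), (1, 3), (2, 3)]
# The EGS move: generating the extra edge (1, 2) serializes the inner nodes,
# reducing the width -- and the processors needed -- to 1.
serialized = diamond + [(1, 2)]
```

In the paper, each candidate edge is only accepted if the deadline still holds; the learned policy chooses which edges to generate to drive the width down fastest.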
Title: HomeRobot: Open-Vocabulary Mobile Manipulation
Authors: Sriram Yenamandra, Arun Ramachandran, Karmesh Yadav
Abstract: HomeRobot (noun): An affordable, compliant robot that navigates homes and manipulates a wide range of objects in order to complete everyday tasks. Open-Vocabulary Mobile Manipulation (OVMM) is the problem of picking any object in any unseen environment and placing it in a commanded location. This is a foundational challenge for robots to be useful assistants in human environments, because it involves tackling sub-problems from across robotics: perception, language understanding, navigation, and manipulation are all essential to OVMM. In addition, integration of the solutions to these sub-problems poses its own substantial challenges. To drive research in this area, we introduce the HomeRobot OVMM benchmark, where an agent navigates household environments to grasp novel objects and place them on target receptacles. HomeRobot has two components: a simulation component, which uses a large and diverse curated object set in new, high-quality multi-room home environments; and a real-world component, providing a software stack for the low-cost Hello Robot Stretch to encourage replication of real-world experiments across labs. We implement both reinforcement learning and heuristic (model-based) baselines and show evidence of sim-to-real transfer. Our baselines achieve a 20% success rate in the real world; our experiments identify ways future research can improve performance. See videos on our website: https://ovmm.github.io/
[Downlink:]http://arxiv.org/abs/2306.11565v2
[Project:]https://ovmm.github.io/
Title: Yes, this is what I was looking for! Towards Multi-modal Medical Consultation Concern Summary Generation
Authors: Abhisek Tiwari, Shreyangshu Bera, Sriparna Saha
Abstract: Over the past few years, the use of the Internet for healthcare-related tasks has grown by leaps and bounds, posing a challenge in effectively managing and processing information to ensure its efficient utilization. During moments of emotional turmoil and psychological challenges, we frequently turn to the Internet as our initial source of support, choosing it over discussing our feelings with others due to the associated social stigma. In this paper, we propose the new task of multi-modal medical concern summary (MMCS) generation, which provides a short and precise summary of the patient's major concerns brought up during the consultation. Nonverbal cues, such as patients' gestures and facial expressions, aid in accurately identifying patients' concerns. Doctors also consider patients' personal information, such as age and gender, in order to describe the medical condition appropriately. Motivated by the potential efficacy of patients' personal context and visual gestures, we propose a transformer-based multi-task, multi-modal intent-recognition and medical concern summary generation (IR-MMCSG) system, along with a multitasking framework for intent recognition and medical concern summary generation for doctor-patient consultations. We construct the first multi-modal medical concern summary generation (MM-MediConSummation) corpus, which includes patient-doctor consultations annotated with medical concern summaries, intents, patient personal information, doctor's recommendations, and keywords. Our experiments and analysis demonstrate (a) the significant role of patients' expressions/gestures and their personal information in intent identification and medical concern summary generation, and (b) the strong correlation between intent recognition and patients' medical concern summary generation. The dataset and source code are available at https://github.com/NLP-RL/MMCSG
[Downlink:]http://arxiv.org/abs/2401.05134v1
[GitHub:]https://github.com/NLP-RL/MMCSG
Title: Human as AI Mentor: Enhanced Human-in-the-loop Reinforcement Learning for Safe and Efficient Autonomous Driving
Authors: Zilin Huang, Zihao Sheng, Chengyuan Ma
Abstract: Despite significant progress in autonomous vehicles (AVs), the development of driving policies that ensure both the safety of AVs and traffic flow efficiency has not yet been fully explored. In this paper, we propose an enhanced human-in-the-loop reinforcement learning method, termed the Human as AI mentor-based deep reinforcement learning (HAIM-DRL) framework, which facilitates safe and efficient autonomous driving in mixed-traffic platoons. Drawing inspiration from the human learning process, we first introduce an innovative learning paradigm that effectively injects human intelligence into AI, termed Human as AI mentor (HAIM). In this paradigm, the human expert serves as a mentor to the AI agent. While allowing the agent to sufficiently explore uncertain environments, the human expert can take control in dangerous situations and demonstrate correct actions to avoid potential accidents. The agent can also be guided to minimize traffic flow disturbance, thereby optimizing traffic flow efficiency. In detail, HAIM-DRL leverages data collected from free exploration and partial human demonstrations as its two training sources. Remarkably, we circumvent the intricate process of manually designing reward functions; instead, we directly derive proxy state-action values from partial human demonstrations to guide the agent's policy learning. Additionally, we employ a minimal intervention technique to reduce the human mentor's cognitive load. Comparative results show that HAIM-DRL outperforms traditional methods in driving safety, sampling efficiency, mitigation of traffic flow disturbance, and generalizability to unseen traffic scenarios. The code and demo videos for this paper can be accessed at: https://zilin-huang.github.io/HAIM-DRL-website/
[Downlink:]http://arxiv.org/abs/2401.03160v2
[Project:]https://zilin-huang.github.io/HAIM-DRL-website/
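The HAIM loop in this abstract alternates free exploration with human takeover in dangerous states, storing the takeover steps as partial demonstrations. A toy sketch of that control flow; the 1-D "driving" dynamics, danger test, and both policies are invented stand-ins, not the paper's models:

```python
def rollout(env_step, agent_policy, human_policy, is_dangerous, state, steps):
    """Agent acts freely; the mentor takes control in dangerous states.
    Takeover steps are logged as (state, human_action) demonstrations, the
    second training source HAIM-DRL learns proxy state-action values from."""
    demonstrations = []
    for _ in range(steps):
        if is_dangerous(state):
            action = human_policy(state)           # mentor takes control
            demonstrations.append((state, action))
        else:
            action = agent_policy(state)           # free exploration
        state = env_step(state, action)
    return state, demonstrations

# Toy task: state is the gap (in car lengths) to a leading vehicle.
def env_step(s, a): return s + a
def agent_policy(s): return -1      # naive agent always closes the gap
def human_policy(s): return +2      # mentor opens the gap back up
def is_dangerous(s): return s < 3   # tailgating threshold

final, demos = rollout(env_step, agent_policy, human_policy, is_dangerous,
                       state=5, steps=10)
# Every demonstration was collected exactly at the dangerous gap of 2,
# so the demos concentrate on the states where guidance matters most.
```

The "minimal intervention" idea in the abstract corresponds to making `is_dangerous` as selective as possible, so the mentor's cognitive load stays low while safety is preserved.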
== Open-Vocabulary Detection ==
Title: CoSSegGaussians: Compact and Swift Scene Segmenting 3D Gaussians
Authors: Bin Dou, Tianyu Zhang, Yongjia Ma
Abstract: We propose Compact and Swift Segmenting 3D Gaussians (CoSSegGaussians), a method for compact, 3D-consistent scene segmentation at fast rendering speed with only RGB images as input. Previous NeRF-based 3D segmentation methods have relied on implicit or voxel neural scene representations and ray-marching volume rendering, which are time-consuming. Recent 3D Gaussian Splatting significantly improves rendering speed; however, existing Gaussian-based segmentation methods (e.g., Gaussian Grouping) fail to provide compact segmentation masks, especially in zero-shot segmentation. This is mainly caused by the lack of robustness and compactness when learnable parameters are straightforwardly assigned to each Gaussian in the presence of inconsistent 2D machine-generated labels. Our method aims to achieve compact and reliable zero-shot scene segmentation swiftly by mapping fused spatial and semantically meaningful features to each Gaussian point with a shallow decoding network. Specifically, our method first optimizes the Gaussian points' position, covariance, and color attributes under the supervision of RGB images. After Gaussian locating, we distill multi-scale DINO features extracted from images to each Gaussian through unprojection, and incorporate them with spatial features from a fast point-feature processing network, i.e., RandLA-Net. A shallow decoding MLP is then applied to the multi-scale fused features to obtain a compact segmentation. Experimental results show that our model performs high-quality zero-shot scene segmentation: it outperforms other segmentation methods on both semantic and panoptic segmentation tasks, while consuming only about 10% of the segmentation time of NeRF-based segmentation. Code and more results will be available at https://David-Dou.github.io/CoSSegGaussians
[Downlink:]http://arxiv.org/abs/2401.05925v1
[Project:]https://David-Dou.github.io/CoSSegGaussians
Title: IODeep: an IOD for the introduction of deep learning in the DICOM standard
Authors: Salvatore Contino, Luca Cruciata, Orazio Gambino
Abstract: Background and Objective: In recent years, Artificial Intelligence (AI), and in particular Deep Neural Networks (DNNs), became a relevant research topic in biomedical image segmentation due to the availability of more and more data sets along with the establishment of well-known competitions. Despite the popularity of DNN-based segmentation on the research side, these techniques are almost unused in daily clinical practice, even though they could effectively support the physician during the diagnostic process. Apart from the issues related to the explainability of the predictions of a neural model, such systems are not integrated in the diagnostic workflow, and a standardization of their use is needed to achieve this goal. Methods: This paper presents IODeep, a new DICOM Information Object Definition (IOD) aimed at storing both the weights and the architecture of a DNN already trained on a particular image dataset, labeled with respect to the acquisition modality, the anatomical region, and the disease under investigation. Results: The IOD architecture is presented along with a DNN selection algorithm from the PACS server based on the labels outlined above, and a simple PACS viewer purposely designed to demonstrate the effectiveness of the DICOM integration; no modifications are required on the PACS server side. A service-based architecture in support of the entire workflow has also been implemented. Conclusion: IODeep ensures full integration of a trained AI model in a DICOM infrastructure, and it also enables a scenario where a trained model can be either fine-tuned with hospital data or trained in a federated learning scheme shared by different hospitals. In this way, AI models can be tailored to the real data produced by a Radiology ward, thus improving the physician's decision-making process. Source code is freely available at https://github.com/CHILab1/IODeep.git
[Downlink:]http://arxiv.org/abs/2311.16163v3
[GitHub:]https://github.com/CHILab1/IODeep.git
Title: LKCA: Large Kernel Convolutional Attention
Authors: Chenghao Li, Boheng Zeng, Yi Lu
Abstract: We revisit the relationship between attention mechanisms and large-kernel ConvNets in vision transformers and propose a new spatial attention named Large Kernel Convolutional Attention (LKCA). It simplifies the attention operation by replacing it with a single large-kernel convolution. LKCA combines the advantages of convolutional neural networks and vision transformers, possessing a large receptive field, locality, and parameter sharing. We explain the superiority of LKCA from both the convolution and attention perspectives, providing equivalent code implementations for each view. Experiments confirm that LKCA implemented from the convolutional and attention perspectives exhibits equivalent performance. We extensively experimented with the LKCA variant of ViT in both classification and segmentation tasks, and the experiments demonstrate that LKCA achieves competitive performance in visual tasks. Our code will be made publicly available at https://github.com/CatworldLee/LKCA
[Downlink:]http://arxiv.org/abs/2401.05738v1
[GitHub:]https://github.com/CatworldLee/LKCA
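The move described in this abstract is replacing the attention map with one large-kernel convolution, so every token mixes information from a wide, fixed neighborhood with shared weights. A 1-D, pure-Python sketch of that substitution (the paper operates on 2-D feature maps with learned kernels; the uniform kernel here is only to make the mixing visible):

```python
def large_kernel_conv1d(x, kernel):
    """Single-channel convolution with 'same' zero padding; the kernel length
    is the receptive field, the LKCA analogue of an attention window."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(x))]

def lkca_block(x, kernel):
    # Attention-style residual connection: y = x + LargeKernelConv(x).
    mixed = large_kernel_conv1d(x, kernel)
    return [xi + mi for xi, mi in zip(x, mixed)]

tokens = [0.0, 0.0, 1.0, 0.0, 0.0]   # a single activated "token"
averaging = [1 / 5] * 5              # uniform 5-tap kernel: each position
out = lkca_block(tokens, averaging)  # aggregates its 5-token neighborhood
# The activation spreads to every position the kernel reaches, mimicking how
# an attention map would distribute that token's influence.
```

Unlike attention, the mixing weights are position-independent and input-independent, which is exactly the parameter-sharing and locality trade-off the abstract highlights.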
Title: Recurrent Generic Contour-based Instance Segmentation with Progressive Learning
Authors: Hao Feng, Keyi Zhou, Wengang Zhou
Abstract: Contour-based instance segmentation has been actively studied, thanks to its flexibility and elegance in processing visual objects within complex backgrounds. In this work, we propose a novel deep network architecture, PolySnake, for generic contour-based instance segmentation. Motivated by the classic Snake algorithm, the proposed PolySnake achieves superior and robust segmentation performance with an iterative and progressive contour refinement strategy. Technically, PolySnake introduces a recurrent update operator to estimate the object contour iteratively. It maintains a single estimate of the contour that is progressively deformed toward the object boundary. At each iteration, PolySnake builds a semantically rich representation of the current contour and feeds it to the recurrent operator for further contour adjustment. Through the iterative refinements, the contour progressively converges to a stable status that tightly encloses the object instance. Beyond general instance segmentation, extensive experiments are conducted to validate the effectiveness and generalizability of PolySnake in two additional task scenarios: scene text detection and lane detection. The results demonstrate that the proposed PolySnake outperforms existing advanced methods on several prevalent benchmarks across the three tasks. The code and pre-trained models are available at https://github.com/fh2019ustc/PolySnake
[Downlink:]http://arxiv.org/abs/2301.08898v2
[GitHub:]https://github.com/fh2019ustc/PolySnake
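The iterative refinement described in the abstract can be sketched as a loop that repeatedly deforms a single contour estimate. This is a minimal illustration, not the paper's implementation: `predict_offsets` is a hypothetical stand-in for PolySnake's learned recurrent update operator, and the circle-pulling toy target is purely illustrative.

```python
import numpy as np

def refine_contour(init_contour, predict_offsets, num_iters=8):
    """Iteratively deform a contour toward the object boundary.

    init_contour: (N, 2) array of contour vertex coordinates.
    predict_offsets: callable standing in for the recurrent update
        operator; maps the current contour to per-vertex offsets.
    """
    contour = init_contour.astype(float)
    for _ in range(num_iters):
        # A single contour estimate is maintained; each iteration
        # predicts a residual displacement for every vertex.
        contour = contour + predict_offsets(contour)
    return contour

# Toy stand-in "operator": pull each vertex halfway toward the unit
# circle, mimicking convergence onto an object boundary.
def toy_offsets(contour):
    norms = np.linalg.norm(contour, axis=1, keepdims=True)
    target = contour / np.clip(norms, 1e-8, None)  # project onto unit circle
    return 0.5 * (target - contour)

theta = np.linspace(0, 2 * np.pi, 32, endpoint=False)
init = 3.0 * np.stack([np.cos(theta), np.sin(theta)], axis=1)
final = refine_contour(init, toy_offsets)  # vertices near radius 1
```

With a learned operator, the same loop structure lets the contour converge to a stable state after a fixed number of iterations.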
Title: LinK3D: Linear Keypoints Representation for 3D LiDAR Point Cloud
Authors: Yunge Cui, Yinlong Zhang, Jiahua Dong
Abstract: Feature extraction and matching are basic components of many robotic vision tasks, such as 2D or 3D object detection, recognition, and registration. While 2D feature extraction and matching have already achieved great success, current 3D methods may fail to support the wide application of 3D LiDAR sensors in robotic vision tasks because of their poor descriptiveness and inefficiency. To address this limitation, we propose a novel 3D feature representation: Linear Keypoints representation for 3D LiDAR point clouds, called LinK3D. The novelty of LinK3D lies in fully considering the characteristics of LiDAR point clouds (such as their sparsity and complexity) and representing each keypoint by its robust neighbor keypoints, which provide strong constraints in the keypoint's description. LinK3D has been evaluated on three public datasets, and the experimental results show that our method achieves great matching performance. More importantly, LinK3D shows excellent real-time performance, running faster than the 10 Hz frame rate of a typical rotating LiDAR sensor. LinK3D takes an average of only 30 milliseconds to extract features from a point cloud collected by a 64-beam LiDAR, and merely about 20 milliseconds to match two LiDAR scans on a computer with an Intel Core i7 processor. Moreover, our method can be extended to the LiDAR odometry task and shows good scalability. We release the implementation of our method at https://github.com/YungeCui/LinK3D.
[Downlink:]http://arxiv.org/abs/2206.05927v3
[GitHub:]https://github.com/YungeCui/LinK3D
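The neighbor-keypoint idea can be illustrated with a minimal sketch: here a keypoint is described by the sorted distances to its nearest neighbor keypoints, which makes the description rotation-invariant. The real LinK3D descriptor is richer (it also encodes directional structure between keypoints); `neighbor_descriptor` and `match_scans` below are hypothetical names, not the paper's API.

```python
import numpy as np

def neighbor_descriptor(keypoints, k=4):
    """Describe each keypoint by the sorted distances to its k nearest
    neighbor keypoints (simplified stand-in for LinK3D's neighbor-based
    descriptor): neighbors constrain the description of the keypoint."""
    d = np.linalg.norm(keypoints[:, None, :] - keypoints[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, 1:k + 1]  # drop the zero self-distance

def match_scans(desc_a, desc_b):
    """Match each keypoint in scan A to the keypoint in scan B whose
    descriptor is closest."""
    cost = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    return cost.argmin(axis=1)

# Toy example: scan B is scan A rotated about the z-axis. Pairwise
# distances are unchanged by rotation, so the descriptors still match
# and the identity correspondence is recovered.
pts_a = np.array([[0., 0., 0.], [1., 0., 0.], [0., 2., 0.],
                  [0., 0., 3.], [2., 2., 0.]])
c, s = np.cos(0.5), np.sin(0.5)
R = np.array([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])
matches = match_scans(neighbor_descriptor(pts_a),
                      neighbor_descriptor(pts_a @ R.T))
```

A brute-force descriptor comparison like this is quadratic in the number of keypoints; the paper's reported ~20 ms matching time implies a far more efficient matching scheme.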
Title: DC-Net: Divide-and-Conquer for Salient Object Detection
Authors: Jiayi Zhu, Xuebin Qin, Abdulmotaleb Elsaddik
Abstract: In this paper, we introduce Divide-and-Conquer into the salient object detection (SOD) task, enabling the model to learn prior knowledge useful for predicting the saliency map. We design a novel network, the Divide-and-Conquer Network (DC-Net), which uses two encoders to solve different subtasks conducive to predicting the final saliency map: predicting edge maps of width 4 and location maps of salient objects. The feature maps with different semantic information are then aggregated into the decoder to predict the final saliency map. The decoder of DC-Net consists of our newly designed two-level Residual nested-ASPP (ResASPP^2) modules, which capture a large number of different-scale features with a small number of convolution operations, maintain high resolution throughout, and obtain a large, compact effective receptive field (ERF). Exploiting Divide-and-Conquer's suitability for parallel computing, we use parallel acceleration to speed up DC-Net, allowing it to achieve competitive performance on six LR-SOD and five HR-SOD datasets at high efficiency (60 FPS and 55 FPS). Code and results are available at https://github.com/PiggyJerry/DC-Net.
[Downlink:]http://arxiv.org/abs/2305.14955v3
[GitHub:]https://github.com/PiggyJerry/DC-Net
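The two-encoder divide-and-conquer pipeline can be sketched functionally. The gradient-based `edge_encoder` and box-filter `location_encoder` below are crude hypothetical stand-ins for DC-Net's learned encoders, meant only to show how two subtask feature maps are aggregated by a decoder into one saliency map.

```python
import numpy as np

def edge_encoder(img):
    """Stand-in for the edge branch: a gradient-magnitude feature
    (the real network learns to predict width-4 edge maps)."""
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy)

def location_encoder(img):
    """Stand-in for the location branch: a 3x3 box-filtered intensity
    map hinting at where salient objects sit."""
    pad = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = pad[i:i + 3, j:j + 3].mean()
    return out

def decoder(edge_feat, loc_feat):
    """Aggregate the two subtask feature maps into a saliency map,
    normalized to [0, 1]."""
    fused = 0.5 * edge_feat + 0.5 * loc_feat
    return (fused - fused.min()) / (np.ptp(fused) + 1e-8)

img = np.zeros((8, 8))
img[2:6, 2:6] = 1.0  # toy image with one square "object"
saliency = decoder(edge_encoder(img), location_encoder(img))
```

Because the two encoders are independent, they can run concurrently, which is the property the paper's parallel acceleration exploits.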