[Paper Reading Notes] TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models


I. Paper Information

1 Paper Title

TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models

2 Venue

arXiv 2023

3 Authors

Fudan University

4 Keywords

Benchmark, Continual Learning, LLMs

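II. Paper Structure

III. Introduction

1 Motivation

  • Aligned LLMs are very capable, but their continual learning ability has received little attention;
  • Existing CL benchmarks are too easy for top-tier LLMs, and there is potential exposure of the model during instruction tuning. (What does "exposure" mean here? Is it a safety concern?)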

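2 Background

Intro-P1:

  • LLMs (general ability) + fine-tuning (specialized ability) + alignment (safety) have come to dominate NLP. Yet the capabilities demanded of models keep growing, especially in domain-specific knowledge, multilingual proficiency, complex task-solving, and tool usage.
  • Retraining LLMs from scratch is too expensive to be practical, so incrementally training existing models through continual learning becomes very important. This raises a key question: To what degree do Aligned LLMs exhibit catastrophic forgetting when subjected to incremental training?

Intro-P2:

Current CL benchmarks are not suitable for evaluating SOTA LLMs, for the following reasons:

  • They consist largely of common, simple NLU datasets. These are too easy for LLMs, and many have already been fed to LLMs as training data, so reusing them for evaluation is inappropriate.
  • Existing benchmarks focus only on model performance on the sequential tasks and lack evaluation of generalization to new tasks, human instruction following, and safety preservation.

Intro-P3:
The paper proposes TRACE, a CL benchmark suited to aligned LLMs: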

  • 8 distinct datasets spanning challenging tasks
    • domain-specific tasks
    • multilingual capabilities
    • code generation
    • mathematical reasoning
  • equal distribution
  • 3 metrics
    • general ability delta
    • instruction following delta
    • safety delta

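Intro-P4:
Five LLMs are evaluated on TRACE:

  • Almost all LLMs show a clear decline in general ability;
  • The multilingual ability of LLMs improves;
  • Compared with LoRA, full-parameter fine-tuning fits the target tasks more readily, but its general ability declines noticeably;
  • The instruction-following ability of LLMs also declines;

Intro-P5:

  • Using certain reasoning methods effectively preserves the model's abilities;
  • The paper proposes Reasoning-augmented Continual Learning (RCL);
  • RCL not only boosts performance on target tasks but also significantly upholds the inherent strengths of LLMs;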

3 Related Work

3.1 CL

The classic three-way taxonomy; see my earlier posts.

3.2 CL Benchmark in NLP

Standard CL Benchmark;
15 classification tasks;

3.3 CoT

CoT;
Zero-shot CoT;
Fine-tuned CoT;

IV. Proposed Method

1 Overall Structure


TRACE consists of two main components:

  • A selection of eight datasets constituting a tailored set of tasks for continual learning, covering challenges in domain-specific tasks, multilingual capabilities, code generation, and mathematical reasoning.
  • A post-training evaluation of LLM capabilities. In addition to traditional continual learning metrics, we introduce General Ability Delta, Instruction Following Delta, and Safety Delta to evaluate shifts in LLM’s inherent abilities.

2 Dataset Construction


3 Evaluation Metrics

  • General Ability Delta: $\Delta R_t^G = \frac{1}{M}\sum_{i=1}^{M}\left(R_{t,i}^G - R_{0,i}^G\right)$, where $R_{t,i}^G$ is the score on the $i$-th general-ability evaluation dataset of the model after it has been trained through the $t$-th task, and $R_{0,i}^G$ is the score of the original model, before any sequential training, on that same dataset.
  • Instruction Following Delta: $\Delta R_t^I = \frac{1}{N}\sum_{i=1}^{N}\left(R_{t,i}^I - R_{0,i}^I\right)$
  • Safety Delta: $\Delta R_t^S = \frac{1}{L}\sum_{i=1}^{L}\left(R_{t,i}^S - R_{0,i}^S\right)$

The three metrics are computed in the same way; they differ only in the datasets used for evaluation.
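Since all three deltas share one formula, a single helper can compute any of them. The following is a minimal sketch, not code from the paper: the function name `ability_delta`, the dict-based input, and the scores themselves are made up for illustration.

```python
from statistics import mean


def ability_delta(scores_after, scores_before):
    """Mean per-dataset change R_{t,i} - R_{0,i} over the evaluation datasets.

    scores_after:  dataset name -> score of the model after training through task t
    scores_before: dataset name -> score of the original model
    """
    if scores_after.keys() != scores_before.keys():
        raise ValueError("both score dicts must cover the same datasets")
    return mean(scores_after[name] - scores_before[name] for name in scores_before)


# Made-up scores for illustration only (not results from the paper).
base = {"MMLU": 46.0, "GSM8K": 26.0, "PIQA": 76.0}
after_t = {"MMLU": 43.0, "GSM8K": 20.0, "PIQA": 76.0}

delta = ability_delta(after_t, base)
print(delta)  # -3.0
```

A negative value means the model lost ability on average relative to its starting point. Passing instruction-following or safety scores instead of general-ability scores yields the other two deltas.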

4 Experimental Setup

4.1 Baselines

  • Sequential Full-Parameter Fine-Tuning (SeqFT): This method involves training all model parameters in sequence.
  • LoRA-based Sequential Fine-Tuning (LoraSeqFT): Only the low-rank LoRA matrices are fine-tuned, leaving the LLM backbone fixed. This method is chosen based on prior findings of reduced forgetting with "Efficient Tuning".
  • Replay-based Sequential Fine-Tuning (Replay): Replay, a common continual learning strategy, is employed for its simplicity and effectiveness. We incorporate alignment data from LIMA into the replay memory, replaying 10% of historical data.
  • In-Context Learning (ICL): Task demonstrations are supplied as part of the language prompt, acting as a form of prompt engineering. A 6-shot setting is used for our experiments.
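To make the Replay baseline concrete, here is a rough sketch of how a training mix for the current task could be assembled. It rests on assumptions the notes do not confirm: that the 10% is sampled uniformly from each earlier task, and that the alignment examples are simply appended. The function name `build_replay_mix` and the toy data are hypothetical.

```python
import random


def build_replay_mix(current_task, past_tasks, alignment_data, ratio=0.1, seed=0):
    """Combine current-task data with a sample of past data and alignment data.

    Assumption: `ratio` is applied per past task with uniform sampling.
    """
    rng = random.Random(seed)
    replayed = []
    for past in past_tasks:
        k = int(len(past) * ratio)  # number of examples replayed from this task
        replayed.extend(rng.sample(past, k))
    mixed = list(current_task) + replayed + list(alignment_data)
    rng.shuffle(mixed)  # interleave old and new examples
    return mixed


# Toy data: string placeholders stand in for real training examples.
task_a = [f"a{i}" for i in range(100)]
task_b = [f"b{i}" for i in range(50)]
task_c = [f"c{i}" for i in range(20)]
lima_like = ["align0", "align1", "align2"]

mix = build_replay_mix(task_c, [task_a, task_b], lima_like)
print(len(mix))  # 20 current + 10 + 5 replayed + 3 alignment = 38
```

The fixed seed keeps the replay sample reproducible across runs, which matters when comparing baselines.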

To evaluate the resilience of safety alignment models from diverse training backgrounds and strategies, we select five aligned models from three organizations:

  • Meta:
    • LLaMA-2-7B-Chat
    • LLaMA-2-13B-Chat
  • Baichuan:
    • Baichuan 2-7B-Chat
  • Large Model Systems Organization:
    • Vicuna-13B-V1.5
    • Vicuna-7B-V1.5

Experimental Results

Main results table

Performance on Sequential Tasks

  • In-Context Learning (ICL) Performance: ICL methods generally perform lower than SeqFT and Replay methods. This suggests that the TRACE benchmark is indeed challenging, and LLMs can’t readily identify solutions just through simple demonstrations.
  • Replay Performance: Among all the baselines, Replay achieved the highest OP score. With its BWT score being positive, it indicates that Replay effectively retains its performance on sequential tasks without significant forgetting. This makes Replay a straightforward and efficient strategy in a continual learning context.
  • Full Parameter Training vs. LoRA: Full parameter training demonstrates better task-specific adaptability compared to LoRA, with a smaller BWT score. For instance, LLaMA-2-7B-Chat's SeqFT OP(BWT) is 48.7 (8.3%), while LoraSeqFT stands at 12.7 (45.7%). This suggests that when the focus is primarily on sequential tasks, full parameter fine-tuning should be prioritized over parameter-efficient methods like LoRA.
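The notes quote OP and BWT without defining them. The sketch below uses the definitions common in the continual learning literature, assumed here rather than taken from these notes: OP is the average score over all tasks after the final training stage, and BWT is the average change on each earlier task between the moment it was learned and the end of training. The score matrix is made up.

```python
def overall_performance(R):
    """Average score over all tasks after the last training stage.

    R[t][i] is the score on task i after training through task t (0-indexed).
    """
    final = R[-1]
    return sum(final) / len(final)


def backward_transfer(R):
    """Average of R[T-1][i] - R[i][i] over all tasks except the last one."""
    T = len(R)
    if T < 2:
        raise ValueError("BWT needs at least two tasks")
    return sum(R[-1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


# Made-up score matrix for three sequential tasks (illustration only).
R = [
    [60.0, 0.0, 0.0],    # after task 0
    [50.0, 70.0, 0.0],   # after task 1
    [40.0, 60.0, 80.0],  # after task 2
]

op = overall_performance(R)
bwt = backward_transfer(R)
print(op, bwt)  # 60.0 -15.0
```

Under this sign convention a negative BWT indicates forgetting and a positive one indicates that later training helped earlier tasks.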

Changes in General Ability

From the Model Perspective:

  • Nearly all models display a negative General Ability Delta, indicating a general decline in overall capabilities after continual learning.
  • Larger models, in comparison to their smaller counterparts, show a more pronounced forgetting in factual knowledge and reasoning tasks.

From the Task Perspective:

  • Despite the presence of CoT prompts, there is a noticeable decline in math and reasoning abilities across all models, suggesting that these abilities are highly sensitive to new task learning.
  • Excluding the llama2-7b model, most models exhibit a significant drop in performance on MMLU, suggesting a gradual loss of factual knowledge through continual learning.
  • TydiQA task sees a general boost post-training, possibly due to the inclusion of Chinese and German datasets in our sequential tasks. Even more intriguing is the observed enhancement (and some declines) in other languages on TydiQA, suggesting potential cross-linguistic transfer characteristics.
  • Performance shifts on PIQA for most models are subtle, indicating the relative robustness of commonsense knowledge during continual learning.

From the Methodological Perspective:

  • The Replay method proves beneficial in preserving reasoning and factuality skills. Especially for larger models, the mitigation of forgetting through Replay is more pronounced. For instance, for LLaMA-2-7B-Chat, Replay offers a 6.5 EM score boost compared to methods without Replay, while for LLaMA-2-13B-Chat, the increase is 17.1 EM score.

Experiment figure 1 (Figure 2 in the paper)

  • Figure 2(a) illustrates the win rate % for instruction following between sequentially trained LLMs and their original versions. Here, the win rate can be approximated as an indicator for the Instruction Following Delta. It's evident that all three training methods exhibit a marked decline in instruction-following capabilities compared to their initial versions, with the decline being most pronounced in the LoRA method. Therefore, be cautious when exploring approaches like LoRA for continual learning in LLMs. Summary: after LoRA fine-tuning, the model is quite likely to stop following instructions.
  • Figure 2(b) shows the win rate % for safety between the new LLMs and their starting versions. Here, the win rate can be used as a measure for the Safety Delta. Compared to the original models, most answers were rated as "Tie". This suggests that the safety of the model's answers is largely unaffected by continual learning on general tasks. Summary: in most cases, safety is largely unaffected by continual learning training.
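The win rates in both panels come down to counting pairwise judgments. As a minimal sketch, assuming each comparison between the trained model and its original version is labeled "win", "tie", or "lose" (the label names and the sample judgments are hypothetical, and how the judge produces them is outside this sketch):

```python
from collections import Counter


def judgment_rates(judgments):
    """Percentage of win / tie / lose labels among pairwise comparisons."""
    total = len(judgments)
    if total == 0:
        raise ValueError("no judgments given")
    counts = Counter(judgments)
    return {label: 100.0 * counts[label] / total for label in ("win", "tie", "lose")}


# Made-up judgments of a trained model against its original version.
judgments = ["tie"] * 6 + ["win"] * 1 + ["lose"] * 3

rates = judgment_rates(judgments)
print(rates)  # {'win': 10.0, 'tie': 60.0, 'lose': 30.0}
```

A distribution dominated by "tie" corresponds to the safety result described above, while a large "lose" share corresponds to the instruction-following decline.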

Factors Influencing Forgetting in LLMs

Data volume and training steps


  • Performance improves as data volume grows, indicating at least 5000 samples from the TRACE-selected datasets are needed for full fitting.
  • Performance improves with up to 5 training epochs, confirming our baseline epoch setting balances target task optimization and retaining existing capabilities.

How exactly does the reasoning capability of LLMs transform during the continual learning process?

  • a surge in the model’s reasoning prowess post-training on the ScienceQA task, while it declined for other tasks.
  • even though the two tasks from NumGLUE are mathematically inclined, their answers don’t provide a clear reasoning path. ScienceQA does offer such a pathway in its answers. This observation suggests the potential advantage of incorporating reasoning paths during training to preserve and perhaps even enhance the model’s reasoning capability.
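The contrast between the NumGLUE and ScienceQA answers can be illustrated by how a training target is built with and without a reasoning path. This is only a sketch of the idea: the prompt template, the field names, and the sample question are invented for illustration and are not the paper's actual format.

```python
def format_example(question, answer, reasoning=None):
    """Build a prompt/target pair, optionally placing a reasoning path before the answer."""
    if reasoning:
        target = f"Reasoning: {reasoning}\nAnswer: {answer}"
    else:
        target = f"Answer: {answer}"
    return {"prompt": f"Question: {question}\n", "target": target}


# Answer-only target, as in a dataset without reasoning paths.
plain = format_example("What is 12 * 3?", "36")

# Target that spells out the reasoning path before the answer.
augmented = format_example("What is 12 * 3?", "36", reasoning="12 * 3 = 10 * 3 + 2 * 3 = 30 + 6")

print(plain["target"])
print(augmented["target"])
```

Training on the second form exposes the model to an explicit derivation, which is the property the observation above credits for preserving reasoning ability.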

Reasoning-Augmented Continual Learning

motivation:

Instead of treating LLMs as traditional models and inundating them with large volumes of data to fit a task’s distribution, might we leverage their inherent abilities for rapid task transfer?

method


results


Discussion

Can traditional continual learning methods be effectively applied to LLMs?


  • High Training Cost: LLMs require significant data for both pre-training and alignment, leading to a high training cost. Using simple replay to maintain past capabilities can be very expensive. Therefore, selecting key data from past training to keep LLMs’ diverse predictive abilities is essential.
  • Large Number of Parameters: The huge parameter size of LLMs demands advanced hardware for training. Many regularization techniques need to store gradients from past tasks, which is a big challenge for both CPU and GPU memory.
  • One-for-All Deployment of LLMs: LLMs are designed for a wide range of tasks, meaning tailoring parameters for specific tasks might limit their ability to generalize to new tasks. Additionally, methods that adjust the network dynamically can complicate deployment, as it becomes tricky to handle multiple task queries at once.
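A back-of-the-envelope calculation shows why the parameter count is a problem for methods that keep per-parameter state from past tasks. The sketch assumes one extra 32-bit value per parameter per stored copy and nominal sizes of 7 and 13 billion parameters; real methods and precisions vary, and the helper name `extra_memory_gb` is hypothetical.

```python
def extra_memory_gb(num_params, bytes_per_value=4, copies=1):
    """Memory in GB (10^9 bytes) for `copies` full-size per-parameter tensors."""
    return num_params * bytes_per_value * copies / 1e9


# Nominal parameter counts, assumed for illustration.
mem_7b = extra_memory_gb(7_000_000_000)
mem_13b = extra_memory_gb(13_000_000_000)

print(mem_7b, mem_13b)  # 28.0 52.0
```

This cost comes on top of the weights, gradients, and optimizer state already needed for training, and it grows linearly with the number of stored copies.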

How should LLMs approach continual learning?

  • Direct end-to-end training of large language models (LLMs) might cause them to excessively focus on specific patterns of the target task, potentially hindering their performance in more general scenarios.
  • LLMs are already trained on diverse datasets and possess the ability to handle multiple tasks, even with limited examples. Building upon the Superficial Alignment Hypothesis proposed by LIMA, the focus should be on aligning LLMs’ existing capabilities with new tasks rather than starting from scratch.
  • Therefore, strategies like the RCL approach, which leverage LLMs’ inherent abilities for quick transfer to novel tasks, can be effective in mitigating catastrophic forgetting.
