
NLP / LLMs: Translation and Interpretation of the "Zeno Chatbot Report", a CMU Associate Professor's Detailed Evaluation of Seven ChatGPT-like Large Models (GPT-2, LLaMa, Alpaca, Vicuna, MPT-Chat, Cohere Command, and ChatGPT)

Contents

Translation and Interpretation of the "Zeno Chatbot Report": A CMU Associate Professor's Detailed Evaluation of Seven ChatGPT-like Large Models

Overview

Setup

Model Settings

Evaluation Metrics

Further Analysis

Results

How well do models perform overall?

Accuracy by Gold-standard Response Length

How important is the context window?

How important is the prompt?

Discovered Errors (and possible mitigations)

Hallucinations

Failure to Probe

Repeated Content

Correct

Final Words


Translation and Interpretation of the "Zeno Chatbot Report": A CMU Associate Professor's Detailed Evaluation of Seven ChatGPT-like Large Models

Authors

Alex Cabrera and Graham Neubig (Carnegie Mellon University)

Date

May 18, 2023

Link

zeno-build/tasks/chatbot/report at main · zeno-ml/zeno-build · GitHub

Overview

Large language models (LLMs) are taking the world by storm, and one big application for them is chat, with applications in question answering, customer service, and many others. However, chatbots are notoriously hard to evaluate, and there still isn’t a clear sense about which of the recent models are best to use in what situations.


In this report, we demonstrate some first results on evaluating and comparing recent chatbots, with the goal of making it easier for people to understand the current lay-of-the-land with respect to all of the open-source and API-based models coming out recently. In particular, we create a new open-source toolkit for evaluating LLMs, Zeno Build. This combines (1) a unified interface to use open-source LLMs through Hugging Face or online APIs, (2) an online interface for browsing and analyzing results using Zeno, and (3) state-of-the-art evaluation metrics for text using Critique.


Browse the results here

Highlights:

  1. We evaluated 7 language models: GPT-2, LLaMa, Alpaca, Vicuna, MPT-Chat, Cohere Command, and ChatGPT (gpt-3.5-turbo)
  2. The models were evaluated on their ability to create human-like responses on a customer service dataset
  3. ChatGPT came out on top, but the open-source chat model Vicuna was also very competitive
  4. We find that it is important to use a chat-tuned model with a long context window
  5. Prompt engineering particularly improves performance for turns early in the conversation, but less so in later turns where more context is available
  6. Even for a strong model like ChatGPT, it is easy to find obvious issues in hallucinations, failure to probe for more information, and repeated content

Read on for more detail, try out Zeno Build if you want to play around yourself, and we very much welcome additional contributions! To get in touch, open an issue on the issues page, jump in the Zeno discord, or get in contact via email.


Setup

Model Settings

Models covered: GPT-2, LLaMa, Alpaca, Vicuna, MPT-Chat, Cohere Command, ChatGPT

We use the DSTC11 customer service dataset, which includes agent-customer customer service interactions. We test 7 models:

  1. GPT-2: A classic language model from 2019. We added this as a baseline to see how much the recent progress in language modeling has made a difference in building better chat models.
  2. LLaMa: A language model originally trained by Meta AI that uses a straight-up language modeling objective. We use the 7B model for this and all following open-source models.
  3. Alpaca: A model based on LLaMa that additionally uses instruction tuning.
  4. Vicuna: A model based on LLaMa that is further explicitly tuned for chatbot-based applications.
  5. MPT-Chat: A model trained from scratch in a way similar to Vicuna, which has a more commercially permissive license.
  6. Cohere Command: An API-based model by Cohere that is tuned for following commands.
  7. ChatGPT (gpt-3.5-turbo): The standard-bearer of API-based chat models by OpenAI.

For all models by default we use a temperature of 0.3, context window of 4 previous chat turns, and a standard prompt saying “You are a chatbot tasked with making small-talk with people.” (with other ablations below).
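To make these defaults concrete, here is a minimal sketch (not the report's actual code) of how one chatbot turn could be generated with these settings, assuming the 2023-era OpenAI Python client; the `build_messages`/`respond` helpers and the `history` format are illustrative.

```python
# Minimal sketch of one chatbot turn with the default settings above,
# assuming the 2023-era OpenAI Python client (openai.ChatCompletion).
# The prompt text, temperature, and context window come from the report;
# build_messages/respond are illustrative helpers, not the report's code.
import openai

SYSTEM_PROMPT = "You are a chatbot tasked with making small-talk with people."
CONTEXT_WINDOW = 4  # number of previous chat turns passed to the model
TEMPERATURE = 0.3

def build_messages(history, system_prompt=SYSTEM_PROMPT):
    # history: list of {"role": "user" | "assistant", "content": str} turns;
    # only the last CONTEXT_WINDOW turns are kept.
    return [{"role": "system", "content": system_prompt}] + history[-CONTEXT_WINDOW:]

def respond(history, system_prompt=SYSTEM_PROMPT):
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=build_messages(history, system_prompt),
        temperature=TEMPERATURE,
    )
    return completion["choices"][0]["message"]["content"]
```

The open-source models (LLaMa, Alpaca, Vicuna, MPT-Chat) would be called through Hugging Face instead, but the same settings apply.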


Evaluation Metrics

We evaluated the models based on how similar their outputs are to human customer service responses. This was done using metrics provided by the Critique toolkit:

  1. chrf: Measures the overlap of character strings
  2. BERTScore: Measures overlap of embeddings between the two utterances
  3. UniEval Coherence: Predicts how coherent the outputs are with the previous chat turn

We also measured length ratio, which simply measures the length of the output divided by the length of the gold-standard human response, indicating how verbose the chatbot is.
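As a rough illustration of how such scores can be computed, the sketch below uses the open-source chrf and BERTScore implementations from Hugging Face `evaluate` as stand-ins for the Critique API (UniEval coherence is omitted because it needs its own model); the function and variable names are assumptions, not the report's code.

```python
# Approximate re-implementation of the report's automatic scores using
# open-source metrics from Hugging Face `evaluate` instead of the Critique
# API; UniEval coherence is omitted here. Names are illustrative.
import evaluate

chrf_metric = evaluate.load("chrf")
bertscore_metric = evaluate.load("bertscore")

def score_outputs(predictions, references):
    chrf = chrf_metric.compute(
        predictions=predictions, references=[[r] for r in references]
    )["score"]
    bert_f1 = bertscore_metric.compute(
        predictions=predictions, references=references, lang="en"
    )["f1"]
    # Length ratio: total output length divided by total gold-response length,
    # a rough corpus-level proxy for how verbose the chatbot is.
    length_ratio = sum(len(p) for p in predictions) / sum(len(r) for r in references)
    return {
        "chrf": chrf,
        "bertscore_f1": sum(bert_f1) / len(bert_f1),
        "length_ratio": length_ratio,
    }
```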


Further Analysis

To dig deeper into the results, we used the Zeno analysis interface, specifically using its report generator to subdivide the examples based on the position in the conversation (start, early, middle, and late) and the length of the gold-standard human response (short, medium, and long), and its exploration interface to look through examples with bad automatic scores, and to better understand where each of the models is failing.

We also did ablation studies on the Vicuna model, trying different context windows and prompts in the analysis.


Results

How well do models perform overall?

According to all of these metrics, gpt-3.5-turbo was the clear winner, and Vicuna was the winner among the open-source models. GPT-2 and LLaMa were not very good, demonstrating the importance of training directly on chat.

These rankings also approximately match those of the lmsys chat arena, which uses human A/B testing to compare models, but Zeno Build’s results were obtained without any human ratings.


With regards to verbosity, gpt-3.5-turbo is far more verbose than the others, and it seems that models tuned for chat tend to be verbose in general.


lmsys chat arena rankings: https://chat.lmsys.org/

The arena uses an Elo rating system to compute the relative performance of the models.

Accuracy by Gold-standard Response Length

Next, we used the Zeno report UI to dig deeper. First, we measure accuracy separately by short (≤35 characters), medium (36-70 characters), and long (≥71 characters) human responses.
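The bucketing itself is simple; a sketch using the character thresholds from the report (the function name is illustrative):

```python
# Assign each example to a length bucket based on its gold-standard
# (human) response, using the character thresholds from the report.
def length_bucket(gold_response: str) -> str:
    n = len(gold_response)
    if n <= 35:
        return "short"
    if n <= 70:
        return "medium"
    return "long"
```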

gpt-3.5-turbo and Vicuna maintain accuracy even on longer chat turns while others drop off.


How important is the context window?

We experimented using Vicuna with context windows ranging from 1-4 previous utterances. As we increase the context window, the performance goes up, indicating that larger context windows are important.
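A sketch of what this ablation looks like, reusing the illustrative `respond` and `score_outputs` helpers from above (a real run would call Vicuna rather than the OpenAI API, and `chat_histories`/`gold_responses` are placeholder variables):

```python
# Context-window ablation: truncate each dialogue history to its last k
# turns before generating, then score. chat_histories and gold_responses
# are placeholders for the DSTC11-derived evaluation data.
for k in (1, 2, 3, 4):
    predictions = [respond(history[-k:]) for history in chat_histories]
    print(f"context window = {k}:", score_outputs(predictions, gold_responses))
```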


Longer context is particularly important in the middle and later parts of the conversation, where responses are less templated and more dependent on what was said previously.


More context is particularly important when trying to generate outputs where the gold standard is shorter (possibly because there is more ambiguity).


How important is the prompt?

We tried 5 different prompts - 4 generic ones and one specifically tailored to the task of customer service chat in the insurance domain:

  1. Standard: “You are a chatbot tasked with making small-talk with people.”
  2. Friendly: “You are a kind and friendly chatbot tasked with making small-talk with people in a way that makes them feel pleasant.”
  3. Polite: “You are an exceedingly polite chatbot that speaks very formally and tries to not make any missteps in your responses.”
  4. Cynical: “You are a cynical chatbot that has a very dark view of the world and in general likes to point out any possible problems.”
  5. Insurance: “You are an agent at the Rivertown Insurance helpdesk that mainly helps with resolving insurance claims.”
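The ablation itself just swaps the system prompt and regenerates; a minimal sketch (the prompt strings are the ones listed above, while the dictionary keys, loop, and helper names are illustrative):

```python
# Prompt ablation: regenerate responses with each system prompt and compare
# scores, reusing the illustrative respond()/score_outputs() helpers from
# the earlier sketches. chat_histories/gold_responses are placeholders.
PROMPTS = {
    "standard": "You are a chatbot tasked with making small-talk with people.",
    "friendly": "You are a kind and friendly chatbot tasked with making small-talk "
                "with people in a way that makes them feel pleasant.",
    "polite": "You are an exceedingly polite chatbot that speaks very formally and "
              "tries to not make any missteps in your responses.",
    "cynical": "You are a cynical chatbot that has a very dark view of the world and "
               "in general likes to point out any possible problems.",
    "insurance": "You are an agent at the Rivertown Insurance helpdesk that mainly "
                 "helps with resolving insurance claims.",
}

for name, system_prompt in PROMPTS.items():
    predictions = [respond(h, system_prompt=system_prompt) for h in chat_histories]
    print(name, score_outputs(predictions, gold_responses))
```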


Overall, the prompt didn’t make a very large measurable difference, but the “cynical” chatbot was a little bit worse, and the tailored “insurance” chatbot was a little bit better overall.


The differences were especially stark on the first turn of the conversation, indicating that the prompt is most important when there is little other context to work with.


Discovered Errors (and possible mitigations)

Finally, we used Zeno’s exploration UI to try to find possible errors made by gpt-3.5-turbo, the best performing model. Specifically, we looked at all examples that had low chrf (<0.1) and looked through them manually to find trends.


Hallucinations

Sometimes the model generates factually incorrect statements, particularly based on providing false customer information or information about the company policies. This would need to be solved by adding more information about the customer into the prompt, or looking up company policies and referring to them when answering specific questions.
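One hypothetical way to apply that mitigation is to inject the known customer fields and a retrieved policy snippet into the system prompt before generating; `lookup_policy`, the field names, and the prompt wording below are assumptions for illustration, not part of the report.

```python
# Hypothetical mitigation sketch: ground the bot by putting known customer
# data and retrieved policy text into the system prompt, and instruct it not
# to state anything beyond that.
def lookup_policy(question: str) -> str:
    # Stand-in for a real retrieval step over the company's policy documents.
    return "Claims must be filed within 30 days of the incident."  # placeholder text

def grounded_system_prompt(customer: dict, user_question: str) -> str:
    policy_snippet = lookup_policy(user_question)
    return (
        "You are an agent at the Rivertown Insurance helpdesk.\n"
        f"Customer on file: name={customer['name']}, policy_id={customer['policy_id']}.\n"
        f"Relevant policy text: {policy_snippet}\n"
        "Only state customer or policy details that appear above; "
        "if something is not covered above, say you will check."
    )
```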


Failure to Probe

Sometimes the model fails to probe for more information when it is actually necessary, such as not continuing to listen for a number when the number given so far is not yet complete. This could possibly be mitigated by modifying the prompt to remind the model of the required shape of certain pieces of information (e.g. a phone number must be 10 digits), as in the sketch below.
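A sketch of that mitigation under the phone-number assumption: a lightweight check on the digits collected so far, plus an extra instruction appended to the system prompt (both are illustrative, not from the report).

```python
import re

# Reminder appended to the system prompt so the model keeps probing until
# the required shape is met (illustrative wording).
PROBE_REMINDER = (
    "A phone number must be 10 digits. If the caller has given fewer than 10 "
    "digits so far, ask them to continue; do not move on until it is complete."
)

def needs_more_digits(collected_so_far: str) -> bool:
    # True while the digits gathered from the conversation do not yet form
    # a complete 10-digit phone number.
    digits = re.sub(r"\D", "", collected_so_far)
    return len(digits) < 10
```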


Repeated Content

Sometimes the same content is repeated multiple times, such as the bot saying “thank you” twice in a row in one of the examples.


Correct

Sometimes the response is reasonable, but just different than the human response.


Final Words

We hope this report was helpful! If you want to try other models, other datasets, other prompts, or other hyperparameter settings, jump over to the chatbot example on the zeno-build repository to try it out. We’ll be happy to discuss more and answer any questions via email, Discord, or GitHub issues.
