NLP之LLMs:《Zeno Chatbot Report》的翻译与解读—CMU副教授详测七款类ChatGPT大模型(GPT-2、LLaMa、Alpaca、Vicuna、MPT-Chat、Cohere Command和ChatGPT)
目录
《Zeno Chatbot Report》的翻译与解读—CMU副教授详细测评七款类ChatGPT大模型
Overview概览
Setup设置
Model Settings模型设置
Evaluation Metrics评估指标
Further Analysis进一步分析
Results结果
How well do models perform overall?模型整体表现如何?
Accuracy by Gold-standard Response Length根据标准人类回复长度的准确性
How important is the context window?上下文窗口有多重要?
How important is the prompt?提示的重要性有多大?
Discovered Errors (and possible mitigations)发现的错误(及可能的缓解措施)
Hallucinations幻觉
Failure to Probe无法探询
Repeated Content重复内容
Correct正确的回答
Final Words最后
《Zeno Chatbot Report》的翻译与解读—CMU副教授详细测评七款类ChatGPT大模型
作者:Alex Cabrera 和 Graham Neubig(CMU副教授)
时间:2023 年 5 月 18 日
地址:zeno-build/tasks/chatbot/report at main · zeno-ml/zeno-build · GitHub
Overview概览
Large language models (LLMs) are taking the world by storm, and one big application for them is chat, with applications in question answering, customer service, and many others. However, chatbots are notoriously hard to evaluate, and there still isn’t a clear sense about which of the recent models are best to use in what situations.
大型语言模型(LLMs)正风靡全球,其中一个主要应用是聊天,包括问答、客户服务等多个领域。然而,聊天机器人的评估一直以来都很困难,目前对于最近的模型在不同情境下的最佳选择还没有清晰的认识。
In this report, we demonstrate some first results on evaluating and comparing recent chatbots, with the goal of making it easier for people to understand the current lay-of-the-land with respect to all of the open-source and API-based models coming out recently. In particular, we create a new open-source toolkit for evaluating LLMs, Zeno Build. This combines (1) a unified interface to use open-source LLMs through Hugging Face or online APIs, (2) an online interface for browsing and analyzing results using Zeno, and (3) state-of-the-art evaluation metrics for text using Critique.
在这份报告中,我们展示了对最近聊天机器人的评估和比较的初步结果,旨在帮助人们更容易地了解最近发布的开源和API模型的现状。具体而言,我们创建了一个新的开源工具包用于评估LLMs,名为Zeno Build。它结合了以下三个方面: (1)通过Hugging Face或在线API使用开源LLMs的统一接口, (2)使用Zeno进行浏览和分析结果的在线界面, (3)使用Critique进行文本的最先进评估指标。
Browse the results here. Highlights:
Read on for more detail, try out Zeno Build if you want to play around yourself, and we very much welcome additional contributions! To get in touch, open an issue on the issues page, jump in the Zeno discord, or get in contact via email.
在这里浏览结果。亮点:
如果你想了解更多细节,请继续阅读;如果你想自己尝试,请使用Zeno Build,我们非常欢迎额外的贡献!如果你想联系我们,请在问题页面提出问题、加入Zeno的Discord群,或通过电子邮件联系我们。
Setup设置
Model Settings模型设置
GPT-2、LLaMa、Alpaca、Vicuna、MPT-Chat、Cohere Command、ChatGPT
We use the DSTC11 customer service dataset, which includes customer service interactions between agents and customers. We test 7 models:
For all models by default we use a temperature of 0.3, a context window of 4 previous chat turns, and a standard prompt saying “You are a chatbot tasked with making small-talk with people.” (with other ablations below).
我们使用了DSTC11客户服务数据集,其中包括客服人员与客户之间的客户服务对话。我们测试了7个模型:
对于所有模型,默认情况下,我们使用温度值0.3、包含4个先前对话轮次的上下文窗口,以及标准提示:“你是一个与人进行闲聊的聊天机器人。”(下面还有其他消融设置)。
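As a concrete illustration of these defaults, here is a minimal sketch (hypothetical code, not taken from zeno-build) of how a system prompt plus a 4-turn context window could be assembled into a chat request:

```python
# Hypothetical sketch: assembling a chat input from a system prompt plus a
# limited context window of previous turns. Names and structure here are
# illustrative assumptions, not the actual zeno-build implementation.

SYSTEM_PROMPT = "You are a chatbot tasked with making small-talk with people."
CONTEXT_WINDOW = 4  # number of previous chat turns to keep

def build_messages(history, context_window=CONTEXT_WINDOW):
    """Keep only the last `context_window` turns, prefixed by the system prompt."""
    recent = history[-context_window:]
    return [{"role": "system", "content": SYSTEM_PROMPT}] + recent

history = [
    {"role": "user", "content": "Hi, I need help with my policy."},
    {"role": "assistant", "content": "Sure, what's your question?"},
    {"role": "user", "content": "How do I file a claim?"},
    {"role": "assistant", "content": "You can file online or by phone."},
    {"role": "user", "content": "What's the phone number?"},
]

# Only the 4 most recent turns survive, plus the system message.
messages = build_messages(history)
```

Changing `CONTEXT_WINDOW` from 4 down to 1 reproduces the ablation discussed later in the report.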
Evaluation Metrics评估指标
We evaluated the models based on how similar their outputs are to human customer service responses. This was done using metrics provided by the Critique toolkit:
We also measured length ratio, which simply measures the length of the output divided by the length of the gold-standard human response, indicating how verbose the chatbot is.
我们根据模型输出与人类客户服务回复的相似程度进行评估。我们使用Critique工具包提供的指标进行评估:
我们还测量了长度比率,即输出长度除以标准人类回复的长度,用来表示聊天机器人的冗长程度。
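The report’s scores come from the Critique API; as a rough, self-contained illustration of what these metrics capture, here is a simplified local approximation of the length ratio and of a chrF-style character n-gram F-score (real chrF differs in details such as smoothing and word n-grams):

```python
# Simplified, illustrative approximations of two metrics discussed in the
# report. This is NOT the Critique implementation; it only conveys the idea.
from collections import Counter

def length_ratio(hypothesis: str, reference: str) -> float:
    """Output length divided by gold-standard length (verbosity indicator)."""
    return len(hypothesis) / max(len(reference), 1)

def char_ngrams(text: str, n: int) -> Counter:
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """chrF-style score: average F-beta over character n-gram overlaps."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta**2) * prec * rec / (beta**2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0
```

An identical hypothesis and reference score 1.0; completely disjoint strings score 0.0, matching the intuition that higher means closer to the human response.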
Further Analysis进一步分析
To dig deeper into the results, we used the Zeno analysis interface, specifically using its report generator to subdivide the examples based on the position in the conversation (start, early, middle, and late) and the length of the gold-standard human response (short, medium, and long), and its exploration interface to look through examples with bad automatic scores, and to better understand where each of the models is failing. We also did ablation studies on the Vicuna model, trying different context windows and prompts in the analysis.
为了深入研究结果,我们使用Zeno分析界面,具体使用其报告生成器根据对话位置(开始、早期、中间和后期)和标准的人类回复长度(短、中等和长)对示例进行细分,使用其探索界面查看自动评分较低的示例,并更好地了解每个模型的失败之处。 我们还对Vicuna模型进行了消融研究,尝试了不同的上下文窗口和提示方式。
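The slicing described above can be sketched as two simple bucketing functions (the length thresholds follow the ones given later in the report; the position thresholds are illustrative assumptions):

```python
# Sketch of the example-slicing used in the analysis. Length thresholds
# (35 / 70 characters) match those stated later in the report; the
# position cutoffs are assumed for illustration.

def length_bucket(gold: str) -> str:
    """Bucket an example by the length of its gold-standard human response."""
    n = len(gold)
    if n <= 35:
        return "short"
    if n <= 70:
        return "medium"
    return "long"

def position_bucket(turn_index: int, num_turns: int) -> str:
    """Bucket an example by where its turn falls in the conversation."""
    if turn_index == 0:
        return "start"
    frac = turn_index / max(num_turns - 1, 1)
    if frac < 1 / 3:
        return "early"
    if frac < 2 / 3:
        return "middle"
    return "late"
```

Grouping scores by these buckets is enough to reproduce charts like "accuracy by gold-standard response length" below.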
Results结果
How well do models perform overall?模型整体表现如何?
According to all of these metrics, gpt-3.5-turbo was the clear winner. Vicuna was the open-source winner. GPT-2 and LLaMa were not very good, demonstrating the importance of training directly on chat. These rankings also approximately match those of the lmsys chat arena, which uses human A/B testing to compare models, but Zeno Build’s results were obtained without any human ratings.
根据所有这些指标,gpt-3.5-turbo是明显的优胜者。Vicuna是开源模型中的优胜者。GPT-2和LLaMa的表现不太好,这说明直接在聊天上进行训练的重要性。 这些排名与lmsys chat arena的排名大致相符,lmsys chat arena使用人类A/B测试来比较模型,但Zeno Build的结果是在没有任何人类评级的情况下获得的。
With regards to verbosity, gpt-3.5-turbo is far more verbose than the others, and it seems that models tuned for chat tend to be verbose in general.
至于冗长程度,gpt-3.5-turbo比其他模型冗长得多,而且似乎针对聊天进行调优的模型总体上更冗长。
lmsys chat arena的排名:https://chat.lmsys.org/
使用 Elo 评级系统来计算模型的相对性能
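The Elo update behind arena-style rankings can be sketched in a few lines (the K-factor of 32 is a common convention, not necessarily the one lmsys uses):

```python
# Minimal sketch of the Elo rating update used by arena-style rankings:
# after an A/B comparison, the winner takes rating points from the loser
# in proportion to how surprising the outcome was.

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1.0 if model A wins, 0.0 if it loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two equally rated models; A wins the comparison, so it gains 16 points
# (half the K-factor, since the expected score was 0.5).
new_a, new_b = elo_update(1000.0, 1000.0, 1.0)
```

Beating a much higher-rated model moves ratings more than beating an equal one, which is why a stable ranking emerges after many pairwise human votes.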
Accuracy by Gold-standard Response Length根据标准人类回复长度的准确性
Next, we used the Zeno report UI to dig deeper. First, we measure accuracy separately by short (≤35 characters), medium (36-70 characters), and long (≥71 characters) human responses. gpt-3.5-turbo and Vicuna maintain accuracy even on longer chat turns while others drop off.
接下来,我们使用Zeno报告界面进行更深入的分析。首先,我们分别衡量短(≤35个字符)、中等(36-70个字符)和长(≥71个字符)人类回复的准确性。 gpt-3.5-turbo和Vicuna在更长的对话中仍然保持准确性,而其他模型则下降。
How important is the context window?上下文窗口有多重要?
We experimented using Vicuna with context windows ranging from 1-4 previous utterances. As we increase the context window, the performance goes up, indicating that larger context windows are important.
我们使用Vicuna尝试了1-4个先前话语的上下文窗口。随着上下文窗口的增加,性能提高,表明较大的上下文窗口很重要。
Longer context is particularly important in the middle and later parts of the conversation, where responses are less templated and more dependent on what was said previously.
在对话的中间和后期,更长的上下文尤其重要,因为回复不那么模板化,更依赖于先前的对话内容。
More context is particularly important when trying to generate outputs where the gold standard is shorter (possibly because there is more ambiguity).
当尝试生成金标准较短的输出时,更多的上下文尤为重要(可能是因为存在更多的歧义)。
How important is the prompt?提示的重要性有多大?
We tried 5 different prompts - 4 generic ones and one specifically tailored to the task of customer service chat in the insurance domain:
我们尝试了5个不同的提示方式:4个通用提示和一个针对保险领域客户服务聊天任务的特定提示:
Overall, the prompt didn’t make a very large measurable difference, but the “cynical” chatbot was a little bit worse, and the tailored “insurance” chatbot was a little bit better overall.
总体而言,提示对结果影响不大,但“愤世嫉俗”聊天机器人稍差一些,而专门定制的“保险”聊天机器人整体上稍好一些。
The differences were especially stark on the first turn of the conversation, indicating that the prompt is most important when there is little other context to work with.
在对话的第一个轮次上,差异尤为明显,这表明提示在缺乏其他上下文时最为重要。
Discovered Errors (and possible mitigations)发现的错误(及可能的缓解措施)
Finally, we used Zeno’s exploration UI to try to find possible errors by gpt-3.5-turbo, the best performing model. Specifically, we looked at all examples that had low chrf (<0.1) and looked through them manually to find trends.
最后,我们使用Zeno的探索界面来尝试发现gpt-3.5-turbo(表现最好的模型)可能存在的错误。具体而言,我们查看了所有chrf得分较低(<0.1)的示例,并通过手动检查来找出其中的趋势。
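This error-mining step amounts to a simple filter over the scored examples (an illustrative sketch; the field names here are hypothetical, not Zeno's actual schema):

```python
# Sketch of the error-mining step: keep only examples whose chrf score
# falls below a threshold, then inspect the survivors by hand.
# The example dicts and their keys are hypothetical.

def low_score_examples(examples, threshold=0.1, key="chrf"):
    return [ex for ex in examples if ex[key] < threshold]

examples = [
    {"id": 1, "chrf": 0.05, "output": "..."},
    {"id": 2, "chrf": 0.42, "output": "..."},
    {"id": 3, "chrf": 0.08, "output": "..."},
]
bad = low_score_examples(examples)  # keeps ids 1 and 3
```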
Hallucinations幻觉
Sometimes the model generates factually incorrect statements, particularly based on providing false customer information or information about the company policies. This would need to be solved by adding more information about the customer into the prompt, or looking up company policies and referring to them when answering specific questions.
有时模型会生成事实上不正确的陈述,特别是提供虚假的客户信息或公司政策信息。这可能需要通过在提示中添加更多关于客户的信息,或在回答特定问题时查找并参考公司政策来解决。
Failure to Probe无法探询
Sometimes the model fails to probe for more information when it’s actually necessary, such as continuing listening for a number when the number is not yet complete. This could possibly be mitigated by modifying the prompt to remind the model of the required shape for certain pieces of information (e.g. a phone number must be 10 digits).
有时模型在实际需要时未能继续探询更多信息,比如在号码尚未完整输入时仍应继续监听号码。这可能可以通过修改提示、提醒模型某些信息的要求形式(例如,电话号码必须是10位数字)来缓解。
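The suggested mitigation can be sketched as a shape check that tells the bot to keep probing until the collected information is complete (the 10-digit rule is the example from the text, assuming US-style phone numbers; the function names are hypothetical):

```python
# Sketch of the mitigation: validate the shape of collected information
# and keep probing until it is complete. The 10-digit phone-number rule
# is the example given in the text; US-style numbers are assumed.
import re

def is_complete_phone_number(text: str) -> bool:
    """True once exactly 10 digits have been collected, ignoring separators."""
    digits = re.sub(r"\D", "", text)
    return len(digits) == 10

def next_action(collected: str) -> str:
    """Decide whether the bot should confirm or keep listening for more digits."""
    return "confirm" if is_complete_phone_number(collected) else "keep_listening"
```

A check like this could live outside the model, or the shape requirement could simply be stated in the prompt as the report suggests.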
Repeated Content重复内容
Sometimes the same content is repeated multiple times, such as the bot saying “thank you” twice here.
有时相同的内容会重复多次,比如这里机器人说了两次“谢谢”。
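A simple heuristic check for this failure mode, offered as an illustrative sketch rather than anything used in the report, flags a response that contains the same normalized sentence twice:

```python
# Illustrative heuristic for the "repeated content" failure mode: flag a
# response when the same sentence (after normalization) appears twice.
import re

def has_repeated_sentence(response: str) -> bool:
    sentences = [s.strip().lower()
                 for s in re.split(r"[.!?]+", response) if s.strip()]
    return len(sentences) != len(set(sentences))
```

In practice, decoding-side remedies such as a repetition penalty or `no_repeat_ngram_size` address the same symptom at generation time.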
Correct正确的回答
Sometimes the response is reasonable, but just different than the human response.
有时回答是合理的,只是与人类回答不同。
Final Words最后
We hope this report was helpful! If you want to try other models, other datasets, other prompts, or other hyperparameter settings, jump over to the chatbot example on the zeno-build repository to try it out. We’ll be happy to discuss more and answer any questions via email, discord, or Github issues.
希望这份报告对您有所帮助!如果您想尝试其他模型、数据集、提示或超参数设置,请转到zeno-build存储库中的聊天机器人示例来尝试。我们很乐意通过电子邮件、Discord或GitHub问题进一步讨论并回答任何问题。