ChatGPT fine-tuning: official documentation


Fine-tuning

Learn how to customize a model for your application.

Introduction

This guide is intended for users of the new OpenAI fine-tuning API. If you are a legacy fine-tuning user, please refer to our legacy fine-tuning guide.

Fine-tuning lets you get more out of the models available through the API by providing:

  1. Higher quality results than prompting
  2. Ability to train on more examples than can fit in a prompt
  3. Token savings due to shorter prompts
  4. Lower latency requests

GPT models have been pre-trained on a vast amount of text. To use the models effectively, we include instructions and sometimes several examples in a prompt. Using demonstrations to show how to perform a task is often called "few-shot learning."

Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide as many examples in the prompt. This saves costs and enables lower-latency requests.

At a high level, fine-tuning involves the following steps:

  1. Prepare and upload training data
  2. Train a new fine-tuned model
  3. Use your fine-tuned model

Visit our pricing page to learn more about how fine-tuned model training and usage are billed.

What models can be fine-tuned?

We are working on enabling fine-tuning for GPT-4 and expect this feature to be available later this year.

Fine-tuning is currently available for the following models:

  • gpt-3.5-turbo-0613 (recommended)
  • babbage-002
  • davinci-002

We expect gpt-3.5-turbo to be the right model for most users in terms of results and ease of use, unless you are migrating a legacy fine-tuned model.

When to use fine-tuning

Fine-tuning GPT models can make them better for specific applications, but it requires a careful investment of time and effort. We recommend first attempting to get good results with prompt engineering, prompt chaining (breaking complex tasks into multiple prompts), and function calling, with the key reasons being:

  • There are many tasks at which our models may initially appear to perform poorly, but results can often be dramatically improved with better prompting, potentially eliminating the need for fine-tuning
  • Iterating over prompts and other tactics has a much faster feedback loop than iterating with fine-tuning, which requires creating datasets and running training jobs
  • In cases where fine-tuning is still necessary, initial prompt engineering work is not wasted - we typically see best results when using a good prompt in the fine-tuning data (or combining prompt chaining / tool use with fine-tuning)

Our GPT best practices guide provides a background on some of the most effective strategies and tactics for getting better performance without fine-tuning. You may find it helpful to iterate quickly on prompts in our playground.

Common use cases

Some common use cases where fine-tuning can improve results:

  • Setting the style, tone, format, or other qualitative aspects
  • Improving reliability at producing a desired output
  • Correcting failures to follow complex prompts
  • Handling many edge cases in specific ways
  • Performing a new skill or task that’s hard to articulate in a prompt

One high-level way to think about these cases is when it’s easier to "show, not tell". In the sections to come, we will explore how to set up data for fine-tuning and various examples where fine-tuning improves the performance over the baseline model.

Another scenario where fine-tuning is effective is in reducing costs and / or latency, by replacing GPT-4 or by utilizing shorter prompts, without sacrificing quality. If you can achieve good results with GPT-4, you can often reach similar quality with a fine-tuned gpt-3.5-turbo model by fine-tuning on the GPT-4 completions, possibly with a shortened instruction prompt.
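
As a rough sketch of that distillation workflow, using the old-style openai Python SDK that appears throughout this guide (gpt4_to_training_example is a hypothetical helper, not part of the API), you might collect GPT-4 completions and record them as chat-format training examples:

import openai

def gpt4_to_training_example(system_prompt, user_message):
    # Ask GPT-4 for the high-quality completion we want the cheaper model to imitate.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    )
    answer = response.choices[0].message["content"]
    # Record the exchange in the chat fine-tuning format; the system prompt here
    # could also be a shortened version of the one used with GPT-4.
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
        {"role": "assistant", "content": answer},
    ]}

Appending one such example per line to a .jsonl file yields a training set in the format described in the next section.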

Preparing your dataset

Once you have determined that fine-tuning is the right solution (i.e. you’ve optimized your prompt as far as it can take you and identified problems that the model still has), you’ll need to prepare data for training the model. You should create a diverse set of demonstration conversations that are similar to the conversations you will ask the model to respond to at inference time in production.

Each example in the dataset should be a conversation in the same format as our Chat completions API, specifically a list of messages where each message has a role, content, and optional name. At least some of the training examples should directly target cases where the prompted model is not behaving as desired, and the provided assistant messages in the data should be the ideal responses you want the model to provide.

Example format

In this example, our goal is to create a chatbot that occasionally gives sarcastic responses. Here are three training examples (conversations) we could create for such a dataset:

{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "What's the capital of France?"}, {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "Who wrote 'Romeo and Juliet'?"}, {"role": "assistant", "content": "Oh, just some guy named William Shakespeare. Ever heard of him?"}]}
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, {"role": "user", "content": "How far is the Moon from Earth?"}, {"role": "assistant", "content": "Around 384,400 kilometers. Give or take a few, like that really matters."}]}

We do not currently support function calling examples but are working to enable this.

The conversational chat format is required to fine-tune gpt-3.5-turbo. For babbage-002 and davinci-002, you can follow the prompt completion pair format used for legacy fine-tuning as shown below.

{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}
{"prompt": "<prompt text>", "completion": "<ideal generated text>"}

Crafting prompts

We generally recommend taking the set of instructions and prompts that you found worked best for the model prior to fine-tuning, and including them in every training example. This should let you reach the best and most general results, especially if you have relatively few (e.g. under a hundred) training examples.

If you would like to shorten the instructions or prompts that are repeated in every example to save costs, keep in mind that the model will likely behave as if those instructions were included, and it may be hard to get the model to ignore those "baked-in" instructions at inference time.

It may take more training examples to arrive at good results, as the model has to learn entirely through demonstration and without guided instructions.

Example count recommendations

To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples with gpt-3.5-turbo but the right number varies greatly based on the exact use case.

We recommend starting with 50 well-crafted demonstrations and seeing if the model shows signs of improvement after fine-tuning. In some cases that may be sufficient, but even if the model is not yet production quality, clear improvements are a good sign that providing more data will continue to improve the model. No improvement suggests that you may need to rethink how to set up the task for the model or restructure the data before scaling beyond a limited example set.

Train and test splits

After collecting the initial dataset, we recommend splitting it into a training and test portion. When submitting a fine-tuning job with both training and test files, we will provide statistics on both during the course of training. These statistics will be your initial signal of how much the model is improving. Additionally, constructing a test set early on will be useful in making sure you are able to evaluate the model after training, by generating samples on the test set.
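
A minimal sketch of such a split, assuming the dataset lives in a single JSONL file:

import json
import random

with open("mydata.jsonl", "r", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

# Hold out roughly 20% of the examples as a test set.
random.seed(42)
random.shuffle(examples)
split = int(0.8 * len(examples))

for path, subset in [("train.jsonl", examples[:split]), ("test.jsonl", examples[split:])]:
    with open(path, "w", encoding="utf-8") as f:
        for example in subset:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")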

Token limits

Each training example is limited to 4,096 tokens. Examples longer than this will be truncated to the first 4,096 tokens when training. To be sure that your entire training example fits in context, consider checking that the total token count in the message contents is under 4,000. The maximum number of total tokens trained per job is 50 million tokens (tokens_in_dataset * n_epochs).

You can compute token counts using our counting tokens notebook from the OpenAI cookbook.
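
A rough sketch with the tiktoken library (the per-message overhead constants below follow the cookbook's approximation for gpt-3.5-turbo and may differ for other model versions):

import json
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def num_tokens_from_messages(messages, tokens_per_message=3, tokens_per_name=1):
    # Approximate chat-format token count: each message carries a small
    # framing overhead in addition to its content tokens.
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    return num_tokens + 3  # every reply is primed with a short assistant prefix

with open("mydata.jsonl", "r", encoding="utf-8") as f:
    counts = [num_tokens_from_messages(json.loads(line)["messages"]) for line in f]

print(f"max tokens per example: {max(counts)}, total: {sum(counts)}")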

Estimate costs

In order to estimate the cost of a fine-tuning job, please refer to the pricing page for details on the cost per 1k tokens. To estimate the costs for a specific fine-tuning job, use the following formula:

base cost per 1k tokens * (number of tokens in the input file / 1,000) * number of epochs trained

For a training file with 100,000 tokens trained over 3 epochs, the expected cost would be ~$2.40 (assuming the gpt-3.5-turbo training rate of $0.008 per 1k tokens; check the pricing page for current rates).
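
Expressed as a quick sanity check in Python (the per-1k training rate is an assumption taken from the pricing page and may change):

def estimate_fine_tuning_cost(num_tokens, n_epochs, cost_per_1k_tokens=0.008):
    # cost_per_1k_tokens: training rate from the pricing page (USD per 1k tokens).
    return cost_per_1k_tokens * (num_tokens / 1000) * n_epochs

print(estimate_fine_tuning_cost(100_000, 3))  # -> 2.4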

Check data formatting

Once you have compiled a dataset and before you create a fine-tuning job, it is important to check the data formatting. To do this, we created a simple Python script which you can use to find potential errors, review token counts, and estimate the cost of a fine-tuning job.

Data formatting script
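
The linked script is in the OpenAI cookbook; as a stand-in, a minimal sketch of the kind of checks it performs might be:

import json

VALID_ROLES = {"system", "user", "assistant"}  # function-call examples are not yet supported

errors = []
with open("mydata.jsonl", "r", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        try:
            example = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f"line {i}: not valid JSON")
            continue
        messages = example.get("messages")
        if not isinstance(messages, list) or not messages:
            errors.append(f"line {i}: missing 'messages' list")
            continue
        if any(m.get("role") not in VALID_ROLES for m in messages):
            errors.append(f"line {i}: unrecognized role")
        if not any(m.get("role") == "assistant" for m in messages):
            errors.append(f"line {i}: no assistant message to learn from")

print("\n".join(errors) or "No errors found")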

Once you have validated the data, the file needs to be uploaded in order to be used with a fine-tuning job:

import openai
import os
openai.api_key = os.getenv("OPENAI_API_KEY")

# Upload the training file; the "id" in the response (e.g. "file-abc123")
# is what you pass as training_file when creating the fine-tuning job.
openai.File.create(
  file=open("mydata.jsonl", "rb"),
  purpose='fine-tune'
)

Create a fine-tuned model

After ensuring you have the right amount and structure for your dataset, and have uploaded the file, the next step is to create a fine-tuning job.

Start your fine-tuning job using the OpenAI SDK:

openai.FineTuningJob.create(training_file="file-abc123", model="gpt-3.5-turbo")

model is the name of the model you're starting from (gpt-3.5-turbo, babbage-002, or davinci-002). You can customize your fine-tuned model's name using the suffix parameter.
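
For example, a sketch that starts a job with a custom suffix and an optional validation file (the file IDs here are placeholders returned by the upload step):

job = openai.FineTuningJob.create(
    training_file="file-abc123",    # ID of the uploaded training file
    validation_file="file-def456",  # optional: enables test metrics during training
    model="gpt-3.5-turbo",
    suffix="custom_suffix",         # appears in the fine-tuned model name
)
print(job.id)  # track the job with this ID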

After you've started a fine-tuning job, it may take some time to complete. Your job may be queued behind other jobs in our system, and training a model can take minutes or hours depending on the model and dataset size. After the model training is completed, the user who created the fine-tuning job will receive an email confirmation.

In addition to creating a fine-tuning job, you can also list existing jobs, retrieve the status of a job, or cancel a job.

# List 10 fine-tuning jobs
openai.FineTuningJob.list(limit=10)

# Retrieve the state of a fine-tune
openai.FineTuningJob.retrieve("ft-abc123")

# Cancel a job
openai.FineTuningJob.cancel("ft-abc123")

# List up to 10 events from a fine-tuning job
openai.FineTuningJob.list_events(id="ft-abc123", limit=10)

# Delete a fine-tuned model (must be an owner of the org the model was created in)
openai.Model.delete("ft-abc123")

Use a fine-tuned model

When a job has succeeded, you will see the fine_tuned_model field populated with the name of the model when you retrieve the job details. You may now specify this model as a parameter in the Chat Completions API (for gpt-3.5-turbo) or the legacy Completions API (for babbage-002 and davinci-002), and make requests to it using the Playground.

After your job is completed, the model should be available right away for inference use. In some cases, it may take several minutes for your model to become ready to handle requests. If requests to your model time out or the model name cannot be found, it is likely because your model is still being loaded. If this happens, try again in a few minutes.
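
A minimal sketch that polls the job until it finishes and reads out the model name:

import time

job_id = "ft-abc123"
while True:
    job = openai.FineTuningJob.retrieve(job_id)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)  # jobs can sit in a queue; poll sparingly

print(job.status, job.fine_tuned_model)  # populated once the job succeeds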

completion = openai.ChatCompletion.create(
  model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
  messages=[
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ]
)

print(completion.choices[0].message)

You can start making requests by passing the model name as shown above and in our GPT guide.

Analyzing your fine-tuned model

We provide the following training metrics computed over the course of training: training loss, training token accuracy, test loss, and test token accuracy. These statistics are meant to provide a sanity check that training went smoothly (loss should decrease, token accuracy should increase).

However, we think that evaluating samples from the fine-tuned model provides the most relevant sense of model quality. We recommend generating samples from both the base model and the fine-tuned model on a test set, and comparing the samples side by side. The test set should ideally include the full distribution of inputs that you might send to the model for inference. If manual evaluation is too time-consuming, consider using our Evals library, which can automate evaluations, for example by using GPT-4 as a grader.
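
A sketch of such a side-by-side comparison over a held-out test file (the fine-tuned model name is a placeholder in the format shown earlier):

import json
import openai

def sample(model, messages):
    response = openai.ChatCompletion.create(model=model, messages=messages, temperature=0)
    return response.choices[0].message["content"]

with open("test.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        messages = json.loads(line)["messages"]
        prompt = [m for m in messages if m["role"] != "assistant"]  # strip the reference answer
        print("PROMPT:    ", prompt[-1]["content"])
        print("BASE:      ", sample("gpt-3.5-turbo", prompt))
        print("FINE-TUNED:", sample("ft:gpt-3.5-turbo:my-org:custom_suffix:id", prompt))
        print("REFERENCE: ", messages[-1]["content"], "\n")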

Iterating on data quality

If the results from a fine-tuning job are not as good as you expected, consider the following ways to adjust the training dataset:

  • Collect examples to target remaining issues
    • If the model still isn’t good at certain aspects, add training examples that directly show the model how to do these aspects correctly
  • Scrutinize existing examples for issues
    • If your model has grammar, logic, or style issues, check if your data has any of the same issues. For instance, if the model now says "I will schedule this meeting for you" (when it shouldn’t), see if existing examples teach the model to say it can do new things that it can’t do
  • Consider the balance and diversity of data
    • If 60% of the assistant responses in the data say "I cannot answer this", but at inference time only 5% of responses should say that, you will likely get an overabundance of refusals
  • Make sure your training examples contain all of the information needed for the response
    • If we want the model to compliment a user based on their personal traits and a training example includes assistant compliments for traits not found in the preceding conversation, the model may learn to hallucinate information
  • Look at the agreement / consistency in the training examples
    • If multiple people created the training data, it’s likely that model performance will be limited by the level of agreement / consistency between people. For instance, in a text extraction task, if people only agreed on 70% of extracted snippets, the model would likely not be able to do better than this
  • Make sure all of your training examples are in the same format expected at inference

Iterating on data quantity

Once you’re satisfied with the quality and distribution of the examples, you can consider scaling up the number of training examples. This tends to help the model learn the task better, especially around possible "edge cases". We expect a similar amount of improvement every time you double the number of training examples. You can loosely estimate the expected quality gain from increasing the training data size by:

  • Fine-tuning on your current dataset
  • Fine-tuning on half of your current dataset
  • Observing the quality gap between the two

If you have to make a trade-off, a smaller amount of high-quality data is generally more effective than a larger amount of low-quality data.

Iterating on hyperparameters

We allow you to specify the number of epochs to fine-tune a model for. We recommend initially training without specifying the number of epochs, allowing us to pick a default for you based on dataset size, then adjusting if you observe the following (a sketch of setting this parameter appears after the list):

  • If the model does not follow the training data as much as expected, increase the number by 1 or 2 epochs
    • This is more common for tasks for which there is a single ideal completion (or a small set of ideal completions which are similar). Some examples include classification, entity extraction, or structured parsing. These are often tasks for which you can compute a final accuracy metric against a reference answer.
  • If the model becomes less diverse than expected, decrease the number by 1 or 2 epochs
    • This is more common for tasks with a wide range of possible good completions
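
The epoch count is passed via the hyperparameters object when creating the job; a minimal sketch:

openai.FineTuningJob.create(
    training_file="file-abc123",
    model="gpt-3.5-turbo",
    hyperparameters={"n_epochs": 2},  # override the auto-selected default
)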

Fine-tuning examples

Now that we have explored the basics of the fine-tuning API, let's walk through the fine-tuning lifecycle for a few different use cases.

Style and tone

Structured output

Migration of legacy models

For users migrating from /v1/fine-tunes to the updated /v1/fine_tuning/jobs API and newer models, the main difference you can expect is the updated API. The legacy prompt completion pair data format has been retained for the updated babbage-002 and davinci-002 models to ensure a smooth transition. The new models will support fine-tuning with 4k token context and have a knowledge cutoff of September 2021.

For most tasks, you should expect to get better performance from gpt-3.5-turbo than from the GPT base models.

FAQ

When should I use fine-tuning vs embeddings with retrieval?

Embeddings with retrieval is best suited for cases when you need to have a large database of documents with relevant context and information.

By default OpenAI’s models are trained to be helpful generalist assistants. Fine-tuning can be used to make a model which is narrowly focused, and exhibits specific ingrained behavior patterns. Retrieval strategies can be used to make new information available to a model by providing it with relevant context before generating its response. Retrieval strategies are not an alternative to fine-tuning and can in fact be complementary to it.

When can I fine-tune GPT-4 or GPT-3.5-Turbo-16k?

We plan to release support for fine-tuning both of these models later this year.

How do I know if my fine-tuned model is actually better than the base model?

We recommend generating samples from both the base model and the fine-tuned model on a test set of chat conversations, and comparing the samples side by side. For more comprehensive evaluations, consider using the OpenAI evals framework to create an eval specific to your use case.

Can I continue fine-tuning a model that has already been fine-tuned?

No, we do not currently support continuing the fine-tuning process once a job has finished. We plan to support this in the near future.

How can I estimate the cost of fine-tuning a model?

Please refer to the estimate cost section above.

Does the new fine-tuning endpoint still work with Weights & Biases for tracking metrics?

No, we do not currently support this integration but are working to enable it in the near future.

How many fine-tuning jobs can I have running at once?

Please refer to our rate limit guide for the most up-to-date information on the limits.

