A Transformer-Based Language Model: GPT-2



  The Transformer is a model architecture proposed by Google in 2017. It replaces the RNNs and CNNs traditionally used to model sequence data with a self-attention mechanism, has demonstrated strong representational power on tasks such as machine translation and language understanding, and has become one of the mainstream frameworks in natural language processing. Its main characteristics are:

  1. It is built entirely on attention; there are no recurrent or convolutional networks.
  2. Computation is parallel rather than sequential as in an RNN, which makes it much more efficient.
  3. It captures long-range dependencies and does not suffer from the vanishing-gradient problem of RNNs.

  The basic Transformer structure is a stack of N identical layers. Each layer contains two sub-layers:

  1. Multi-head self-attention (Multi-Head Attention): the input vectors are projected into several subspaces, attention is computed within each subspace, and the results are concatenated, so that every position in the sequence attends over the others (a single-head sketch follows this list).
  2. Feed-forward network (Feed Forward Neural Network): typically two linear transformations with a ReLU activation in between, applied independently at each position as a non-linear mapping.
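
As a minimal sketch (my addition, not part of the original article), a single attention head computes softmax(QKᵀ/√d)·V; the multi-head layer implemented later in this post simply runs several such heads in parallel on projected subspaces and concatenates the results. All names and shapes below are illustrative:

import torch
import torch.nn.functional as F

def single_head_attention(q, k, v):
    # q, k, v: (batch, seq_len, head_dim); a plain, non-causal head for illustration
    scores = q @ k.transpose(-2, -1) / (k.size(-1) ** 0.5)   # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)                       # one attention distribution per query position
    return weights @ v                                         # (batch, seq_len, head_dim)

q = k = v = torch.randn(2, 5, 16)
print(single_head_attention(q, k, v).shape)   # torch.Size([2, 5, 16])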

Model structure

  • b : batch size
  • t : sequence length
  • c : embedding dims
  • N : vocabulary size
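
Using this notation, a batch of token ids flows through the model as follows (a shape-only sketch of my own that matches the implementation below):

# idx      : (b, t)       integer token ids
# tok_emb  : (b, t, c)    token embedding lookup (vocabulary of size N -> c dims)
# pos_emb  : (1, t, c)    learned positional embedding, broadcast over the batch
# x        : (b, t, c)    shape is unchanged through every Transformer block
# logits   : (b, t, N)    final linear head, one next-token distribution per position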

Model implementation: PyTorch

  • A simple implementation, modeled on GPT-2
import os
import math
import time
import pandas as pd
from dataclasses import dataclass

import torch
import torch.nn as nn
from torch.nn import functional as F
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
from torch.utils.tensorboard import SummaryWriter
# model hyperparameters
@dataclass
class ModelConfig:
    vocab_size: int = None
    n_embed: int = None
    n_hidden: int = None
    max_seq_length: int = None
    n_head: int = None
    n_layer: int = None
Activation function: GELU

GELU (Gaussian Error Linear Units) is an activation function popularized by BERT and widely used across the GPT family of models. It is defined as:
$GELU(x) = x P(X \le x) = x \Phi(x) = \frac{x}{2}\left[1 + \mathrm{erf}\left(x / \sqrt{2}\right)\right]$

  • $\Phi(x)$ : the cumulative distribution function of the standard normal distribution

It can be approximated as:

$GELU(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}\,(x + 0.044715x^3)\right]\right) \approx x\,\sigma(1.702x)$

class NewGELU(nn.Module):
    
    def forward(self, x):
        return 0.5 * x * (1.0 + torch.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * torch.pow(x, 3.0))))
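As a quick sanity check (my addition, not from the original article), the tanh approximation implemented above can be compared against the exact erf-based definition and the coarser sigmoid approximation:

x = torch.linspace(-4.0, 4.0, steps=200)
exact = 0.5 * x * (1.0 + torch.erf(x / math.sqrt(2.0)))   # exact GELU via the erf definition
tanh_approx = NewGELU()(x)                                 # the approximation implemented above
sigmoid_approx = x * torch.sigmoid(1.702 * x)              # the cheaper sigmoid approximation
print((exact - tanh_approx).abs().max())      # small, well under 1e-2
print((exact - sigmoid_approx).abs().max())   # noticeably larger, on the order of 1e-2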
Causal self-attention
class CausalSelfAttention(nn.Module):
    """
    A multi-head masked self-attention layer with a projection at the end.
    It is possible to use torch.nn.MultiheadAttention here
    but I am including an explicit implementation here
    to show that there is nothing too scary here.
    """
    def __init__(self, config):
        super().__init__()
        assert config.n_embed % config.n_head == 0
        # key, query, value projections for all heads, but in a batch
        self.c_attn = nn.Linear(config.n_embed, 3 * config.n_embed)
        # output projection
        self.c_proj = nn.Linear(config.n_embed, config.n_embed)
        # causal mask to ensure that attention is only applied to the left in the input sequence
        self.register_buffer("bias", torch.tril(torch.ones(config.max_seq_length, config.max_seq_length))
                                     .view(1, 1, config.max_seq_length, config.max_seq_length))
        self.n_head = config.n_head
        self.n_embed = config.n_embed

    def forward(self, x):
        B, T, C = x.size() # batch size, sequence length, embedding dimensionality (n_embed)

        # calculate query, key, values for all heads in batch and move head forward to be the batch dim
        q, k, v = self.c_attn(x).split(self.n_embed, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) # (B, nh, T, hs)

        # causal self-attention; Self-attend: (B, nh, T, hs) x (B, nh, hs, T) -> (B, nh, T, T)
        att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
        att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
        att = F.softmax(att, dim=-1)
        y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
        y = y.transpose(1, 2).contiguous().view(B, T, C) # re-assemble all head outputs side by side

        # output projection
        y = self.c_proj(y)
        return y
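A small sanity check of the causal mask (my addition, with illustrative hyperparameters): perturbing a future position must not change the attention output at any earlier position.

cfg = ModelConfig(vocab_size=100, n_embed=32, max_seq_length=8, n_head=4, n_layer=1)
attn = CausalSelfAttention(cfg)
x = torch.randn(1, 8, cfg.n_embed)
y1 = attn(x)
x2 = x.clone()
x2[:, -1, :] += 1.0                                     # perturb only the last position
y2 = attn(x2)
print(torch.allclose(y1[:, :-1, :], y2[:, :-1, :]))     # True: earlier positions are unaffected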
Transformer Block
class Block(nn.Module):
    """ an unassuming Transformer block """

    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embed)
        self.attn = CausalSelfAttention(config)
        self.ln_2 = nn.LayerNorm(config.n_embed)
        self.mlp = nn.ModuleDict(dict(
            c_fc    = nn.Linear(config.n_embed, 4 * config.n_embed),
            c_proj  = nn.Linear(4 * config.n_embed, config.n_embed),
            act     = NewGELU(),
        ))
        m = self.mlp
        self.mlpf = lambda x: m.c_proj(m.act(m.c_fc(x))) # MLP forward

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))
        x = x + self.mlpf(self.ln_2(x))
        return x
GPT Model
class Transformer(nn.Module):
    """ Transformer Language Model, exactly as seen in GPT-2 """

    def __init__(self, config):
        super().__init__()
        self.max_seq_length = config.max_seq_length

        self.transformer = nn.ModuleDict(dict(
            wte = nn.Embedding(config.vocab_size, config.n_embed),
            wpe = nn.Embedding(config.max_seq_length, config.n_embed),
            h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),
            ln_f = nn.LayerNorm(config.n_embed),
        ))
        self.lm_head = nn.Linear(config.n_embed, config.vocab_size, bias=False)

        # report number of parameters (note we don't count the decoder parameters in lm_head)
        n_params = sum(p.numel() for p in self.transformer.parameters())
        print("number of parameters: %.2fM" % (n_params/1e6,))

    def get_max_seq_length(self):
        return self.max_seq_length

    def forward(self, idx, targets=None):
        device = idx.device
        b, t = idx.size()
        assert t <= self.max_seq_length, f"Cannot forward sequence of length {t}, max_seq_length is only {self.max_seq_length}"
        pos = torch.arange(0, t, dtype=torch.long, device=device).unsqueeze(0) # shape (1, t)

        # forward the GPT model itself
        tok_emb = self.transformer.wte(idx) # token embeddings of shape (b, t, n_embed)
        pos_emb = self.transformer.wpe(pos) # position embeddings of shape (1, t, n_embed)
        x = tok_emb + pos_emb
        for block in self.transformer.h:
            x = block(x)
        x = self.transformer.ln_f(x)
        logits = self.lm_head(x)

        # if we are given some desired targets also calculate the loss
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-1)

        return logits, loss

Test data

The data come from a 10k+ Chinese food-delivery (waimai) review dataset:

data = pd.read_csv('./dataset/waimai_10k.csv')
data.dropna(subset='review',inplace=True)
data['review_length'] = data.review.apply(lambda x:len(x))
data.sample(5)
label review review_length
551 1 没想到味道这么棒,以后我就多订,请多多关照! 22
1054 1 真没传说中的那么好吃,也可能是因为我点的是猪肘卷饼的缘故吧。反正菜太少了,肉太多了,尤其是好... 75
6688 0 我希望下次再送的时候,把饭拿稳了,我打开的时候都散了 26
9453 0 太!慢!了! 6
5433 0 13:20多才送到,呵呵嗒,强烈要求自取!! 22

Corpus statistics:

data = data[data.review_length <= 50]  # keep only reviews of length 50 or less
words = data.review.tolist()
chars = sorted(list(set(''.join(words))))    
max_word_length = max(len(w) for w in words)

print(f"number of examples: {len(words)}")
print(f"max word length: {max_word_length}")
print(f"size of vocabulary: {len(chars)}")
number of examples: 10796
max word length: 50
size of vocabulary: 2272
Splitting into training/test data
test_set_size = min(1000, int(len(words) * 0.1)) 
rp = torch.randperm(len(words)).tolist()
train_words = [words[i] for i in rp[:-test_set_size]]
test_words = [words[i] for i in rp[-test_set_size:]]
print(f"split up the dataset into {len(train_words)} training examples and {len(test_words)} test examples")
split up the dataset into 9796 training examples and 1000 test examples
Building the character-level dataset [tensor]
  • <BLANK> : 0
  • token seq : [1, 2, 3, 4, 5, 6]
  • x : [0, 1, 2, 3, 4, 5, 6]
  • y : [1, 2, 3, 4, 5, 6, 0] (a concrete worked example follows after the class below)
class CharDataset(Dataset):

    def __init__(self, words, chars, max_word_length):
        self.words = words
        self.chars = chars
        self.max_word_length = max_word_length
        # char-->index-->char
        self.char2i = {ch:i+1 for i,ch in enumerate(chars)}
        self.i2char = {i:s for s,i in self.char2i.items()}    

    def __len__(self):
        return len(self.words)

    def contains(self, word):
        return word in self.words

    def get_vocab_size(self):
        return len(self.chars) + 1      

    def get_output_length(self):
        return self.max_word_length + 1

    def encode(self, word):
        # char sequence ---> index sequence
        ix = torch.tensor([self.char2i[w] for w in word], dtype=torch.long)
        return ix

    def decode(self, ix):
        # index sequence ---> char sequence
        word = ''.join(self.i2char[i] for i in ix)
        return word

    def __getitem__(self, idx):
        word = self.words[idx]
        ix = self.encode(word)
        x = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        y = torch.zeros(self.max_word_length + 1, dtype=torch.long)
        x[1:1+len(ix)] = ix
        y[:len(ix)] = ix
        y[len(ix)+1:] = -1 # index -1 will mask the loss
        return x, y
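A small worked example (illustrative values, not from the original article) of the padding and shifting done in __getitem__: suppose a review encodes to the indices [5, 9] and the maximum length is 4.

ix = torch.tensor([5, 9], dtype=torch.long)   # hypothetical encoded review
max_len = 4
x = torch.zeros(max_len + 1, dtype=torch.long)
y = torch.zeros(max_len + 1, dtype=torch.long)
x[1:1+len(ix)] = ix
y[:len(ix)] = ix
y[len(ix)+1:] = -1
print(x.tolist())   # [0, 5, 9, 0, 0]   -> input starts with the <BLANK> token 0, trailing zeros are padding
print(y.tolist())   # [5, 9, 0, -1, -1] -> next-token targets; 0 marks the end, -1 is masked out of the loss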
Data loader [DataLoader]
class InfiniteDataLoader:
    
    def __init__(self, dataset, **kwargs):
        train_sampler = torch.utils.data.RandomSampler(dataset, replacement=True, num_samples=int(1e10))
        self.train_loader = DataLoader(dataset, sampler=train_sampler, **kwargs)
        self.data_iter = iter(self.train_loader)

    def next(self):
        try:
            batch = next(self.data_iter)
        except StopIteration: # this will technically only happen after 1e10 samples... (i.e. basically never)
            self.data_iter = iter(self.train_loader)
            batch = next(self.data_iter)
        return batch

Training the model

# model evaluation
@torch.inference_mode()
def evaluate(model, dataset, batch_size=10, max_batches=None):
    model.eval()
    loader = DataLoader(dataset, shuffle=True, batch_size=batch_size, num_workers=0)
    losses = []
    for i, batch in enumerate(loader):
        batch = [t.to('cuda') for t in batch]
        X, Y = batch
        logits, loss = model(X, Y)
        losses.append(loss.item())
        if max_batches is not None and i >= max_batches:
            break
    mean_loss = torch.tensor(losses).mean().item()
    model.train() # reset model back to training mode
    return mean_loss

Environment initialization:

torch.manual_seed(seed=12345)
torch.cuda.manual_seed_all(seed=12345)

work_dir = "./GPT2_log"
os.makedirs(work_dir, exist_ok=True)
writer = SummaryWriter(log_dir=work_dir)

Model initialization:

config = ModelConfig(vocab_size=len(chars)+1,
                     n_embed=128,
                     n_hidden=64,
                     max_seq_length=max_word_length+1,
                     n_head=4,
                     n_layer=4)

model = Transformer(config)

model.to('cuda')
number of parameters: 1.09M





Transformer(
  (transformer): ModuleDict(
    (wte): Embedding(2273, 128)
    (wpe): Embedding(51, 128)
    (h): ModuleList(
      (0-3): 4 x Block(
        (ln_1): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (attn): CausalSelfAttention(
          (c_attn): Linear(in_features=128, out_features=384, bias=True)
          (c_proj): Linear(in_features=128, out_features=128, bias=True)
        )
        (ln_2): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
        (mlp): ModuleDict(
          (c_fc): Linear(in_features=128, out_features=512, bias=True)
          (c_proj): Linear(in_features=512, out_features=128, bias=True)
          (act): NewGELU()
        )
      )
    )
    (ln_f): LayerNorm((128,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=128, out_features=2273, bias=False)
)
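
As a rough cross-check of the 1.09M figure (my arithmetic, based on the printed module sizes above): the token embedding wte contributes 2273 × 128 = 290,944 parameters and the positional embedding wpe 51 × 128 = 6,528; each of the 4 blocks adds 198,272 (two LayerNorms at 256 each, c_attn 128×384 + 384 = 49,536, the attention c_proj 128×128 + 128 = 16,512, c_fc 128×512 + 512 = 66,048, and the MLP c_proj 512×128 + 128 = 65,664); with the final LayerNorm's 256 this sums to 1,090,816 ≈ 1.09M. The lm_head weight is deliberately excluded, as noted in the code.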

Initializing the datasets:

train_dataset = CharDataset(train_words, chars, max_word_length)
test_dataset = CharDataset(test_words, chars, max_word_length)

train_dataset[0][0].shape, train_dataset[0][1].shape
(torch.Size([51]), torch.Size([51]))

Training:

# init optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.01, betas=(0.9, 0.99), eps=1e-8)
# init dataloader
batch_loader = InfiniteDataLoader(train_dataset, batch_size=256, pin_memory=True, num_workers=4)

# training loop
best_loss = None
step = 0
train_losses, test_losses = [],[]
while True:

    t0 = time.time()

    # get the next batch, ship to device, and unpack it to input and target
    batch = batch_loader.next()
    batch = [t.to('cuda') for t in batch]
    X, Y = batch
    # feed into the model
    logits, loss = model(X, Y)

    # calculate the gradient, update the weights
    model.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    # wait for all CUDA work on the GPU to finish then calculate iteration time taken
    torch.cuda.synchronize()
    t1 = time.time()

    # logging
    if step % 1000 == 0:
        print(f"step {step} | loss {loss.item():.4f} | step time {(t1-t0)*1000:.2f}ms")

    # evaluate the model
    if step > 0 and step % 100 == 0:
        train_loss = evaluate(model, train_dataset, batch_size=100, max_batches=10)
        test_loss  = evaluate(model, test_dataset,  batch_size=100, max_batches=10)
        train_losses.append(train_loss)
        test_losses.append(test_loss)
        # save the model to disk if it has improved
        if best_loss is None or test_loss < best_loss:
            out_path = os.path.join(work_dir, "model.pt")
            print(f"test loss {test_loss} is the best so far, saving model to {out_path}")
            torch.save(model.state_dict(), out_path)
            best_loss = test_loss

    step += 1
    # termination conditions
    if step > 10100:
        break
step 0 | loss 7.8996 | step time 424.90ms
test loss 4.594789028167725 is the best so far, saving model to ./GPT2_log/model.pt
test loss 3.9983901977539062 is the best so far, saving model to ./GPT2_log/model.pt
test loss 3.762165069580078 is the best so far, saving model to ./GPT2_log/model.pt
test loss 3.6443073749542236 is the best so far, saving model to ./GPT2_log/model.pt
test loss 3.5818755626678467 is the best so far, saving model to ./GPT2_log/model.pt
test loss 3.565037250518799 is the best so far, saving model to ./GPT2_log/model.pt
step 1000 | loss 2.4028 | step time 30.07ms
step 2000 | loss 1.1732 | step time 30.60ms
step 3000 | loss 0.7114 | step time 29.94ms
step 4000 | loss 0.5963 | step time 29.27ms
step 5000 | loss 0.5811 | step time 30.56ms
step 6000 | loss 0.5321 | step time 30.99ms
step 7000 | loss 0.5324 | step time 29.52ms
step 8000 | loss 0.5611 | step time 30.47ms
step 9000 | loss 0.5524 | step time 30.72ms
step 10000 | loss 0.5481 | step time 31.09ms

Test: a food-delivery review generator

# load the best saved model
model.load_state_dict(torch.load('./GPT2_log/model.pt'))
<All keys matched successfully>
@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, do_sample=False, top_k=None):
    for _ in range(max_new_tokens):
        # forward the model to get the logits for the index in the sequence
        logits, _ = model(idx)
        # pluck the logits at the final step and scale by desired temperature
        logits = logits[:,-1,:] / temperature
        # optionally crop the logits to only the top k options
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float('Inf')
        # apply softmax to convert logits to (normalized) probabilities
        probs = F.softmax(logits, dim=-1)
        # either sample from the distribution or take the most likely element
        if do_sample:
            idx_next = torch.multinomial(probs, num_samples=1)
        else:
            _, idx_next = torch.topk(probs, k=1, dim=-1)
         
        # append sampled index to the running sequence and continue
        idx = torch.cat((idx, idx_next), dim=-1)
    return idx
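For reference (an illustrative call of my own, not in the original notebook), sampling can be made more conservative than the defaults used below by lowering the temperature and restricting to the top-k candidates:

X_init = torch.zeros((5, 1), dtype=torch.long, device='cuda')
steps = train_dataset.get_output_length() - 1
X_samp = generate(model, X_init, steps, temperature=0.8, do_sample=True, top_k=10)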
def print_samples(num=13):
    # initial tokens, all index 0 (the <START> token)
    X_init = torch.zeros((num, 1), dtype=torch.long).to('cuda')
    steps = train_dataset.get_output_length() - 1 # -1 because we already start with <START> token (index 0)
    X_samp = generate(model, X_init, steps, top_k=None, do_sample=True).to('cuda')
    new_samples = []
    for i in range(X_samp.size(0)):
        # get the i'th row of sampled integers, as python list
        row = X_samp[i, 1:].tolist() # note: we need to crop out the first <START> token
        # token 0 is the <END> token, so we crop the output sequence at that point
        crop_index = row.index(0) if 0 in row else len(row)
        row = row[:crop_index]
        word_samp = train_dataset.decode(row)
        new_samples.append(word_samp)
    return new_samples
print_samples(num=10)
['送餐很快,师傅辛苦了,味道非常好,服务挺好!',
 '还不错,有点辣',
 '花。不好吃,送餐单人是不长,肘子皮白粥包装好。很细心酱服务态度,很值!!!',
 '一如既往的神子还不错,真心不了',
 '师傅洒了快',
 '估汁买的太大了,我实在哪里的一道是80分,袖蹄放只有股卷饼腻。没夏怪卷饼,怎么好吃完成。',
 '一个半小时!现在外卖来了!太慢了已送到50多啊!好差辣椒腐柳,也不是虑就不怎么怀疑址,。',
 '忘餐厅到这次不放挺热的',
 '骑士态度很好,门给送餐员但饮料。这种纯)目少',
 '好吃好吃,味道也没有!']
