Installing and Getting Started with whisper, an Automatic Speech Recognition Model


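Introduction to whisper

whisper is a speech recognition model recently released by OpenAI. OpenAI trained Whisper on 680,000 hours of multilingual (98 languages) and multitask supervised data collected from the web. whisper can perform multilingual speech recognition, speech translation, and language identification.

Installing whisper (Windows)

1. In a CMD window, create a virtual environment named whisper: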

conda create -n whisper python==3.8
conda activate whisper

Note: whisper requires Python 3.8 or later; with an older version it raises an error.
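To confirm that the interpreter inside the activated environment meets this requirement, a small check like the one below can be run. It is only a sketch: the helper name `meets_minimum` is made up for illustration, and the 3.8 floor is simply the one stated in the note above.

```python
import sys

MIN_VERSION = (3, 8)  # minimum Python version stated in the note above


def meets_minimum(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) part of the version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


status = "OK" if meets_minimum() else "too old for whisper"
print("Python", sys.version.split()[0], status)
```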
2. In the virtual environment, install whisper:

pip install -U openai-whisper

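3. In the virtual environment, install ffmpeg.
I first ran pip install ffmpeg directly, but import ffmpeg then failed in Python. I did not track down the cause; a likely explanation is that the PyPI package named ffmpeg is a different project from ffmpeg-python, which is the one that provides the ffmpeg module.
Uninstalling both and installing ffmpeg-python again solved it: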

pip uninstall ffmpeg
pip uninstall ffmpeg-python
pip install ffmpeg-python
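whisper's README also asks for the ffmpeg command-line tool itself to be installed, separately from any Python package. The sketch below checks both things at once. The function name `check_ffmpeg` and the dictionary keys are invented for this example; `shutil.which` and `importlib.util.find_spec` are standard-library calls.

```python
import importlib.util
import shutil


def check_ffmpeg():
    """Report whether the ffmpeg executable and the `ffmpeg` Python module are visible."""
    return {
        # True if an `ffmpeg` executable can be found on PATH
        "ffmpeg_executable": shutil.which("ffmpeg") is not None,
        # True if `import ffmpeg` would find a module (provided by ffmpeg-python)
        "ffmpeg_python_module": importlib.util.find_spec("ffmpeg") is not None,
    }


for name, found in check_ffmpeg().items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If the executable is reported missing, installing the ffmpeg binary and adding it to PATH is probably needed before whisper can read audio files.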

A first look at whisper

1. I tested it by following the instructions in the official README; audio.mp3 is an audio clip I saved myself.

import whisper

model = whisper.load_model("base")
result = model.transcribe("audio.mp3")
print(result["text"])

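Result:
[Screenshot of the transcription output]
The speech was recognized accurately, with no problems (though the output has no punctuation?).
2. Next, I tested the other code snippet from the instructions: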

import whisper

model = whisper.load_model("base")

# load audio and pad/trim it to fit 30 seconds
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)

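This raised an error:
[Screenshot of the error message]
The message indicated that the model needed to run on a GPU, so I installed the GPU build of torch in the virtual environment. My CUDA version is 11.7 and my Python version is 3.8, so I picked the matching torch and torchvision builds. Once torch.cuda.is_available() returned True, I ran the code again and it completed without errors.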
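The screenshot is not available, so the exact error is unknown. If it was the half-precision error that commonly shows up when decoding on CPU, one alternative to installing a GPU build is to turn off fp16 in the decoding options. This is an assumption about the cause, not the fix used above. `fp16` is a field of `whisper.DecodingOptions`; the helper names `use_fp16` and `decode_first_30s` are invented for this sketch.

```python
def use_fp16(device_type: str) -> bool:
    """Use half precision only on a CUDA device; fall back to fp32 elsewhere."""
    return device_type == "cuda"


def decode_first_30s(path: str, model_name: str = "base") -> str:
    """Decode the first 30 s of `path`, choosing fp16 from the model's device."""
    import whisper  # imported lazily so use_fp16 works without whisper installed

    model = whisper.load_model(model_name)
    audio = whisper.pad_or_trim(whisper.load_audio(path))
    mel = whisper.log_mel_spectrogram(audio, n_mels=model.dims.n_mels).to(model.device)
    options = whisper.DecodingOptions(fp16=use_fp16(model.device.type))
    return whisper.decode(model, mel, options).text
```

Calling `decode_first_30s("audio.mp3")` should then work on either device, although on CPU it will be slower than on a GPU.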

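The output contained only a small part of the whole recording. Why? To find out, I downloaded the whisper source code and looked at what the called functions do.
In this snippet, whisper.pad_or_trim keeps only the first 30 s of audio samples, whisper.log_mel_spectrogram then builds a log-Mel spectrogram from those 30 s, detect_language automatically identifies the language spoken in them, and decode finally decodes the result. So this code recognizes only the first 30 s of the audio rather than the whole recording.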
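The trimming and padding step can be illustrated without whisper at all. The following is a simplified pure-Python sketch, not the library's implementation: the constants mirror the names in whisper's audio module (16 kHz sample rate, 30 s chunks), while `pad_or_trim_sketch` is a made-up stand-in that works on plain lists.

```python
SAMPLE_RATE = 16000                      # samples per second expected by whisper
CHUNK_LENGTH = 30                        # seconds per window
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH   # 480000 samples in one 30 s window


def pad_or_trim_sketch(samples, length=N_SAMPLES):
    """Keep the first `length` samples, or right-pad with zeros up to `length`."""
    samples = list(samples)
    if len(samples) >= length:
        return samples[:length]
    return samples + [0.0] * (length - len(samples))


long_clip = [0.1] * (45 * SAMPLE_RATE)   # 45 s of audio: the last 15 s are dropped
short_clip = [0.1] * (10 * SAMPLE_RATE)  # 10 s of audio: 20 s of zeros are appended
print(len(pad_or_trim_sketch(long_clip)), len(pad_or_trim_sketch(short_clip)))
```

Both clips come out exactly 480000 samples long, which is why anything after the 30 s mark never reaches the decoder in the snippet above.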
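(1) To recognize a whole recording, the input audio has to be split into 30 s segments and recognized with a sliding window; all of this is already wrapped inside transcribe.
(2) For real-time transcription, whisper itself does not support streaming, i.e. it has to split the audio into 30 s segments, but something similar can be built by incrementally transcribing the audio every second; see https://github.com/openai/whisper/discussions/20
(3) If the audio is shorter than 30 s, pad_or_trim pads the samples up to 30 s.
(4) Using a segment length other than 30 s is probably not supported (?). Simply changing parameters such as N_MELS or CHUNK_LENGTH in whisper does not achieve this, because in the saved pretrained weights the channel count of 80 is fixed.
The official explanation is that if the interval is too short, context is lacking and sentences are broken more often, so many of them lose their meaning; if it is too long, the model becomes more complex. https://github.com/openai/whisper/discussions/1118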
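To make the 30 s window concrete, here is a rough sketch of how a long recording maps onto fixed windows and how many spectrogram frames each window yields. It is a simplification under stated assumptions: as far as I can tell, the real transcribe advances its window according to predicted timestamps rather than by a fixed 30 s stride. The constants follow whisper's audio module; `window_bounds` is a hypothetical helper.

```python
SAMPLE_RATE = 16000
HOP_LENGTH = 160                         # samples between spectrogram frames
CHUNK_LENGTH = 30
N_SAMPLES = SAMPLE_RATE * CHUNK_LENGTH   # 480000 samples per window
N_FRAMES = N_SAMPLES // HOP_LENGTH       # 3000 spectrogram frames per window


def window_bounds(total_samples, window=N_SAMPLES):
    """Return (start, end) sample indices of consecutive fixed-size windows."""
    return [
        (start, min(start + window, total_samples))
        for start in range(0, total_samples, window)
    ]


# A 70 s recording is covered by two full windows and one 10 s remainder,
# which would be zero-padded to 30 s before decoding.
for start, end in window_bounds(70 * SAMPLE_RATE):
    print(f"{start / SAMPLE_RATE:.0f}s -> {end / SAMPLE_RATE:.0f}s")
```

With 80 mel channels and 3000 frames, every window reaches the model as an 80 x 3000 input, which fits the observation in (4) that the pretrained weights expect a fixed input shape.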