Deploying a Text-to-Speech (TTS) Model

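This example uses a model from OuteAI to demonstrate text-to-speech.

Introduction to OuteTTS

As demand for speech synthesis grows, text-to-speech (TTS) technology has advanced quickly, but it still faces real obstacles. Conventional TTS models often rely on a complex multi-module architecture, with components such as deep neural networks, speech synthesizers, and text analyzers acting as adapters, to produce natural-sounding human speech. That complexity consumes substantial resources and places heavy demands on hardware, which puts these models out of reach for many devices. Personalized voice generation is especially demanding: conventional approaches typically need large datasets and high-end hardware. OuteAI released the OuteTTS model to address these problems.

OuteTTS-0.2-500M is a lightweight TTS model that relies on pure language modeling and needs no external adapters. It is built on Qwen-2.5-0.5B and inherits that architecture's strengths in language understanding and generation. Because text processing and speech generation are integrated into a single flow, the model synthesizes natural speech simply and efficiently. It also supports zero-shot voice cloning: a reference clip of only a few seconds is enough to imitate a new voice. This lowers the barrier to using TTS and makes personalized, real-time speech generation more practical for developers.

It is compatible with llama.cpp and the GGUF format, and suits applications such as audiobooks, customer-service bots, and voice navigation.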

Deployment Guide

This example requires that you first download the llama.cpp project from GitHub and build it.
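
Before continuing, it can help to confirm that the build produced the tools this guide calls later. The following is a minimal sketch, not part of llama.cpp: the helper name `missing_tools` is hypothetical, and it assumes a Linux build whose binaries land in build/bin, as the commands in this guide suggest.

```python
from pathlib import Path

# Tool names taken from the commands used later in this guide.
REQUIRED_TOOLS = ("llama-quantize", "llama-tts", "llama-server")


def missing_tools(bin_dir="build/bin", tools=REQUIRED_TOOLS):
    """Return the tools that are not present as regular files in bin_dir."""
    root = Path(bin_dir)
    return [name for name in tools if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing binaries:", ", ".join(missing))
    else:
        print("all required binaries found")
```

Run it from the root of the llama.cpp checkout. An empty result means the three binaries exist; it does not prove that they run correctly.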

Model Conversion

Download the OuteTTS-0.2-500M model:

$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/OuteAI/OuteTTS-0.2-500M
$ cd OuteTTS-0.2-500M && git lfs install && git lfs pull
$ popd

Convert the model to .gguf format:

(venv) python convert_hf_to_gguf.py models/OuteTTS-0.2-500M \
    --outfile models/outetts-0.2-0.5B-f16.gguf --outtype f16

The converted model is written to models/outetts-0.2-0.5B-f16.gguf.

To improve runtime efficiency, you can quantize the model to Q8_0 with the following command:

$ build/bin/llama-quantize models/outetts-0.2-0.5B-f16.gguf \
    models/outetts-0.2-0.5B-q8_0.gguf q8_0

The quantized model is written to models/outetts-0.2-0.5B-q8_0.gguf.
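
To give a feel for what Q8_0 trades away, the sketch below quantizes one block of weights and restores it. It assumes the usual ggml layout for Q8_0 (blocks of 32 weights, each stored as one 16-bit scale plus 32 signed 8-bit integers). It is an illustration in plain Python, not the code llama-quantize runs, and it keeps the scale as a Python float rather than rounding it to 16 bits.

```python
BLOCK_SIZE = 32  # assumed Q8_0 block size: 32 weights share one scale


def quantize_block(values):
    """Map one block of floats to (scale, list of ints in [-127, 127])."""
    amax = max(abs(v) for v in values)
    if amax == 0:
        return 0.0, [0] * len(values)
    scale = amax / 127.0
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [scale * q for q in quants]


def bits_per_weight():
    """Storage cost: 2 bytes of scale plus one byte per weight, per block."""
    return (2 + BLOCK_SIZE) * 8 / BLOCK_SIZE


if __name__ == "__main__":
    block = [i / 10 - 1.6 for i in range(BLOCK_SIZE)]
    scale, quants = quantize_block(block)
    restored = dequantize_block(scale, quants)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print("bits per weight:", bits_per_weight())
    print("worst absolute error in this block:", worst)
```

Under these assumptions each weight costs 8.5 bits instead of the 16 bits of f16, so the quantized file should be a little over half the size of the f16 file, at the price of a rounding error of at most half a scale step per weight.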

Next, download the voice decoder model:

$ pushd models
$ git clone --branch main --single-branch --depth 1 https://huggingface.co/novateur/WavTokenizer-large-speech-75token
$ cd WavTokenizer-large-speech-75token && git lfs install && git lfs pull
$ popd

This model file is a PyTorch checkpoint (.ckpt), so first convert it to Hugging Face format:

(venv) python examples/tts/convert_pt_to_hf.py \
    models/WavTokenizer-large-speech-75token/wavtokenizer_large_speech_320_24k.ckpt
...
Model has been successfully converted and saved to models/WavTokenizer-large-speech-75token/model.safetensors
Metadata has been saved to models/WavTokenizer-large-speech-75token/index.json
Config has been saved to models/WavTokenizer-large-speech-75tokenconfig.json

Next, convert the Hugging Face format files to GGUF:

(venv) python convert_hf_to_gguf.py models/WavTokenizer-large-speech-75token \
    --outfile models/wavtokenizer-large-75-f16.gguf --outtype f16
...
INFO:hf-to-gguf:Model successfully exported to models/wavtokenizer-large-75-f16.gguf
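
A quick way to sanity-check the two converted files is to read the fixed-size header at the start of each one. The sketch below assumes GGUF version 2 or later, where the header is the 4-byte magic `GGUF`, a 32-bit version, and two 64-bit counts, all little-endian. The function name `read_gguf_header` is hypothetical and is not part of llama.cpp.

```python
import struct


def read_gguf_header(path):
    """Read magic, version, tensor count and metadata count from a GGUF file."""
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:4] != b"GGUF":
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata count.
    version, n_tensors, n_kv = struct.unpack("<IQQ", head[4:24])
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}


if __name__ == "__main__":
    for name in ("models/outetts-0.2-0.5B-q8_0.gguf",
                 "models/wavtokenizer-large-75-f16.gguf"):
        print(name, read_gguf_header(name))
```

A `ValueError` here usually means the conversion did not finish or the path points at a different file. A valid header does not guarantee that the tensors themselves are intact.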

Running the Example

With the LLM model and the voice decoder model produced by the steps above, run the following example:

$ build/bin/llama-tts -t 8 -m ./models/outetts-0.2-0.5B-q8_0.gguf \
    -mv ./models/wavtokenizer-large-75-f16.gguf \
    -p "it took me several hours to get llama.cpp working as a server, it took me 2 minutes to get ollama working."
...
main: audio written to file 'output.wav'

The generated file output.wav contains the prompt as speech. On Linux you can play it with aplay, as shown here, or with any other audio player:

$ aplay output.wav

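Running the Example with llama-server

The example can also be run with llama-server. This requires two server instances: one serves the LLM model and the other serves the voice decoder model.

Start the server for the LLM model: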

$ ./build/bin/llama-server -m ./models/outetts-0.2-0.5B-q8_0.gguf --port 8020

Start the server for the voice decoder model:

$ ./build/bin/llama-server -m ./models/wavtokenizer-large-75-f16.gguf --port 8021 --embeddings --pooling none
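
Both servers need a moment to load their models before they accept requests. The sketch below polls each one until it answers, assuming llama-server exposes a `/health` endpoint that returns HTTP 200 once the model is loaded. The helper names `is_ready` and `wait_for_servers` are hypothetical, and the ports are the ones used in the commands above.

```python
import time
import urllib.error
import urllib.request


def is_ready(base_url, timeout=2.0):
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status such as 503.
        return False


def wait_for_servers(urls, probe=is_ready, attempts=30, delay=1.0):
    """Poll all URLs; return the ones still not ready after the last attempt."""
    pending = list(urls)
    for _ in range(attempts):
        pending = [u for u in pending if not probe(u)]
        if not pending:
            break
        time.sleep(delay)
    return pending


if __name__ == "__main__":
    not_ready = wait_for_servers(["http://localhost:8020", "http://localhost:8021"])
    print("not ready:", not_ready if not_ready else "none")
```

The `probe` parameter exists only so the polling loop can be exercised without a running server. In normal use the default is enough.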

Next, use tts-outetts.py to generate the audio file.

First, create a Python virtual environment and install the required dependencies:

$ python3 -m venv venv
$ source venv/bin/activate
(venv) pip install requests numpy

Then run the following Python script to generate the audio file:

(venv) python ./examples/tts/tts-outetts.py http://localhost:8020 http://localhost:8021 "it took me several hours to get llama.cpp working as a server, it took me 2 minutes to get ollama working."
spectrogram generated: n_codes: 90, n_embd: 1282
converting to audio ...
audio generated: 28800 samples
audio written to file "output.wav"

Play the generated audio file with aplay or another media player:

$ aplay output.wav
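
The script log above reports 90 codes and 28800 samples, which is 320 samples per code. If the decoder runs at 24 kHz and 75 codes per second, as the checkpoint name wavtokenizer_large_speech_320_24k and the repository name suggest (an assumption, not something the log states), that is 90 / 75 = 1.2 seconds of audio. The sketch below reads the actual values from the file with Python's standard `wave` module. It assumes output.wav is uncompressed PCM, and the function name `wav_summary` is hypothetical.

```python
import wave


def wav_summary(path):
    """Return channel count, sample rate, frame count and duration of a WAV file."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate": rate,
            "frames": frames,
            "seconds": frames / rate,
        }


if __name__ == "__main__":
    print(wav_summary("output.wav"))
```

If the reported duration is far from what the prompt length suggests, check the generation log before suspecting the player.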
