Optimize documents (#994)

* [feature]add dataset class

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [dev]combine agent and tts infer

* [feature]:update inference

* [feature]:update uv.lock

* [Merge]:merge upstream/main

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [fix]:remove unused files

* [fix]:remove unused files

* [fix]:remove unused files

* [fix]:fix infer bugs

* [docs]:update introduction and optimize frontend appearance

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Whale and Dolphin 2025-06-03 21:05:14 +08:00 committed by GitHub
parent 4bf24d8c33
commit 75d7ecb5b5
18 changed files with 727 additions and 182 deletions

BIN docs/assets/openaudio.jpg (new file, 40 KiB)
BIN docs/assets/openaudio.png (new file, 956 B)

docs/en/index.md

@ -1,4 +1,14 @@
# Introduction
# OpenAudio (formerly Fish-Speech)
<div align="center">
<div align="center">
<img src="../assets/openaudio.jpg" alt="OpenAudio" style="display: block; margin: 0 auto; width: 35%;"/>
</div>
<strong>Advanced Text-to-Speech Model Series</strong>
<div>
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
@ -12,39 +22,114 @@
</a>
</div>
!!! warning
We assume no responsibility for any illegal use of the codebase. Please refer to the local laws regarding DMCA (Digital Millennium Copyright Act) and other relevant laws in your area. <br/>
This codebase is released under the Apache 2.0 license and all models are released under the CC-BY-NC-SA-4.0 license.
## Requirements
- GPU Memory: 12GB (Inference)
- System: Linux, Windows
## Setup
First, we need to create a conda environment to install the packages.
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
sudo apt-get install portaudio19-dev # For pyaudio
pip install -e . # This will download all remaining packages.
apt install libsox-dev ffmpeg # If needed.
```
!!! warning
The `compile` option is not supported on Windows and macOS; if you want to run with compile, you need to install triton yourself.
## Acknowledgements
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [Transformers](https://github.com/huggingface/transformers)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
<strong>Try it now:</strong> <a href="https://fish.audio">Fish Audio Playground</a> | <strong>Learn more:</strong> <a href="https://openaudio.com">OpenAudio Website</a>
</div>
---
!!! warning "Legal Notice"
We assume no responsibility for any illegal use of the codebase. Please refer to the local laws regarding DMCA (Digital Millennium Copyright Act) and other relevant laws in your area.
**License:** This codebase is released under the Apache 2.0 license and all models are released under the CC-BY-NC-SA-4.0 license.
## **Introduction**
We are excited to announce that we have rebranded to **OpenAudio** - introducing a brand new series of advanced Text-to-Speech models that builds upon the foundation of Fish-Speech with significant improvements and new capabilities.
**Openaudio-S1-mini**: [Video](To Be Uploaded); [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini);
**Fish-Speech v1.5**: [Video](https://www.bilibili.com/video/BV1EKiDYBE4o/); [Hugging Face](https://huggingface.co/fishaudio/fish-speech-1.5);
## **Highlights**
### **Emotion Control**
OpenAudio S1 **supports a variety of emotion, tone, and special markers** to enhance speech synthesis:
- **Basic emotions**:
```
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
```
- **Advanced emotions**:
```
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
```
- **Tone markers**:
```
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
```
- **Special audio effects**:
```
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
```
You can also write laughter out directly, e.g. "Ha,ha,ha"; many more cases are waiting to be explored.
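For illustration, an input string mixing these markers might look like the following (the sentence is made up; the markers are taken from the lists above):
```
(excited) We finally released the new model! (laughing) Ha,ha,ha...
(whispering) Nobody expected it to sound this natural. (sighing) Now back to work.
```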
### **Excellent TTS quality**
We use the Seed TTS Eval metrics to evaluate model performance. OpenAudio S1 achieves **0.008 WER** and **0.004 CER** on English text, significantly better than previous models (English, automatic evaluation based on OpenAI gpt-4o-transcribe; speaker distance measured with Revai/pyannote-wespeaker-voxceleb-resnet34-LM). The sketch below shows how these metrics are computed.
| Model | Word Error Rate (WER) | Character Error Rate (CER) | Speaker Distance |
|-------|----------------------|---------------------------|------------------|
| **S1** | **0.008** | **0.004** | **0.332** |
| **S1-mini** | **0.011** | **0.005** | **0.380** |
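For reference, WER and CER compare a transcription of the generated audio against the input text. Below is a minimal sketch using the `jiwer` library; this is an assumption for illustration only, not the official Seed TTS Eval harness (which first transcribes the audio with gpt-4o-transcribe):
```python
# Minimal WER/CER sketch; assumes `pip install jiwer`.
import jiwer

reference = "portable text to speech models are easy to deploy"
hypothesis = "portable text to speech models are easy to deploy today"

print(f"WER: {jiwer.wer(reference, hypothesis):.3f}")  # word error rate
print(f"CER: {jiwer.cer(reference, hypothesis):.3f}")  # character error rate
```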
### **Two Types of Models**
| Model | Size | Availability | Features |
|-------|------|--------------|----------|
| **S1** | 4B parameters | Available on [fish.audio](https://fish.audio) | Full-featured flagship model |
| **S1-mini** | 0.5B parameters | Available on Hugging Face ([hf space](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini)) | Distilled version with core capabilities |
Both S1 and S1-mini incorporate online Reinforcement Learning from Human Feedback (RLHF).
## **Features**
1. **Zero-shot & Few-shot TTS:** Input a 10 to 30-second vocal sample to generate high-quality TTS output. **For detailed guidelines, see [Voice Cloning Best Practices](https://docs.fish.audio/text-to-speech/voice-clone-best-practices).**
2. **Multilingual & Cross-lingual Support:** Simply copy and paste multilingual text into the input box—no need to worry about the language. Currently supports English, Japanese, Korean, Chinese, French, German, Arabic, and Spanish.
3. **No Phoneme Dependency:** The model has strong generalization capabilities and does not rely on phonemes for TTS. It can handle text in any language script.
4. **Highly Accurate:** Achieves a low CER (Character Error Rate) of around 0.4% and WER (Word Error Rate) of around 0.8% for Seed-TTS Eval.
5. **Fast:** With fish-tech acceleration, the real-time factor is approximately 1:5 on an Nvidia RTX 4060 laptop and 1:15 on an Nvidia RTX 4090.
6. **WebUI Inference:** Features an easy-to-use, Gradio-based web UI compatible with Chrome, Firefox, Edge, and other browsers (see the launch example after this list).
7. **GUI Inference:** Offers a PyQt6 graphical interface that works seamlessly with the API server. Supports Linux, Windows, and macOS. [See GUI](https://github.com/AnyaCoder/fish-speech-gui).
8. **Deploy-Friendly:** Easily set up an inference server with native support for Linux and Windows (macOS coming soon), minimizing speed loss.
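As a concrete illustration of item 6, the Gradio WebUI can be launched with the `tools.run_webui` entry point referenced in the inference guide; the `GRADIO_*` environment variables are the optional Gradio settings mentioned there:
```bash
# Launch the Gradio WebUI; GRADIO_SERVER_PORT (and GRADIO_SHARE,
# GRADIO_SERVER_NAME) are optional Gradio environment variables.
GRADIO_SERVER_PORT=7860 python -m tools.run_webui
```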
## **Disclaimer**
We do not hold any responsibility for any illegal usage of the codebase. Please refer to your local laws about DMCA and other related laws.
## **Media & Demos**
#### 🚧 Coming Soon
Video demonstrations and tutorials are currently in development.
## **Documentation**
### Quick Start
- [Build Environment](en/install.md) - Set up your development environment
- [Inference Guide](en/inference.md) - Run the model and generate speech
## **Community & Support**
- **Discord:** Join our [Discord community](https://discord.gg/Es5qTB9BcN)
- **Website:** Visit [OpenAudio.com](https://openaudio.com) for latest updates
- **Try Online:** [Fish Audio Playground](https://fish.audio)

docs/en/inference.md

@ -34,9 +34,7 @@ python fish_speech/models/text2semantic/inference.py \
--text "The text you want to convert" \
--prompt-text "Your reference text" \
--prompt-tokens "fake.npy" \
--checkpoint-path "checkpoints/openaudio-s1-mini" \
--num-samples 2 \
--compile # if you want a faster speed
--compile
```
This command will create a `codes_N` file in the working directory, where N is an integer starting from 0.
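If you want to sanity-check a generated file, the codes are stored as a plain NumPy array; a small illustrative inspection:
```python
# Illustrative check of a generated semantic-token file.
import numpy as np

codes = np.load("codes_0.npy")
print(codes.shape, codes.dtype)  # token ids consumed by the decoder in step 3
```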
@ -50,15 +48,12 @@ This command will create a `codes_N` file in the working directory, where N is a
### 3. Generate vocals from semantic tokens:
#### VQGAN Decoder
!!! warning "Future Warning"
We have kept the interface accessible from the original path (tools/vqgan/inference.py), but this interface may be removed in subsequent releases, so please change your code as soon as possible.
```bash
python fish_speech/models/dac/inference.py \
-i "codes_0.npy" \
--checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth"
```
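Putting steps 2 and 3 together, a typical session looks like this (an illustrative sketch using the flags shown above; paths assume the default checkpoint location):
```bash
# Step 2: text -> semantic tokens (writes codes_0.npy).
python fish_speech/models/text2semantic/inference.py \
    --text "The text you want to convert" \
    --prompt-text "Your reference text" \
    --prompt-tokens "fake.npy" \
    --checkpoint-path "checkpoints/openaudio-s1-mini"

# Step 3: semantic tokens -> audio.
python fish_speech/models/dac/inference.py \
    -i "codes_0.npy" \
    --checkpoint-path "checkpoints/openaudio-s1-mini/codec.pth"
```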
## HTTP API Inference

docs/en/install.md (new file, +31)

@ -0,0 +1,31 @@
## Requirements
- GPU Memory: 12GB (Inference)
- System: Linux, WSL
## Setup
First, you need to install pyaudio and sox, which are used for audio processing.
```bash
apt install portaudio19-dev libsox-dev ffmpeg
```
### Conda
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
pip install -e .
```
### UV
```bash
uv sync --python 3.12
```
!!! warning
The `compile` option is not supported on Windows and macOS; if you want to run with compile, you need to install triton yourself.
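Either way, a quick import test can confirm the environment is usable (a minimal sanity check, assuming PyTorch was pulled in as a project dependency):
```bash
# Verify that torch imports and that a CUDA device is visible.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```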

docs/ja/index.md

@ -1,4 +1,14 @@
# 紹介
# OpenAudio (旧 Fish-Speech)
<div align="center">
<div align="center">
<img src="../assets/openaudio.jpg" alt="OpenAudio" style="display: block; margin: 0 auto; width: 35%;"/>
</div>
<strong>先進的なText-to-Speechモデルシリーズ</strong>
<div>
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
@ -12,39 +22,113 @@
</a>
</div>
!!! warning
このコードベースの違法な使用について、当方は一切の責任を負いません。お住まいの地域のDMCAデジタルミレニアム著作権法およびその他の関連法規をご参照ください。<br/>
このコードベースはApache 2.0ライセンスの下でリリースされ、すべてのモデルはCC-BY-NC-SA-4.0ライセンスの下でリリースされています。
## システム要件
- GPU メモリ12GB推論
- システムLinux、Windows
## セットアップ
まず、パッケージをインストールするためのconda環境を作成する必要があります。
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
sudo apt-get install portaudio19-dev # pyaudio用
pip install -e . # これにより残りのパッケージがすべてダウンロードされます。
apt install libsox-dev ffmpeg # 必要に応じて。
```
!!! warning
`compile`オプションはWindowsとmacOSでサポートされていません。compileで実行したい場合は、tritonを自分でインストールする必要があります。
## 謝辞
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [Transformers](https://github.com/huggingface/transformers)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
<strong>今すぐ試す:</strong> <a href="https://fish.audio">Fish Audio Playground</a> | <strong>詳細情報:</strong> <a href="https://openaudio.com">OpenAudio ウェブサイト</a>
</div>
---
!!! warning "法的通知"
このコードベースの違法な使用について、当方は一切の責任を負いません。お住まいの地域のDMCAデジタルミレニアム著作権法およびその他の関連法規をご参照ください。
**ライセンス:** このコードベースはApache 2.0ライセンスの下でリリースされ、すべてのモデルはCC-BY-NC-SA-4.0ライセンスの下でリリースされています。
## **紹介**
私たちは **OpenAudio** への改名を発表できることを嬉しく思います。Fish-Speechを基盤とし、大幅な改善と新機能を加えた、新しい先進的なText-to-Speechモデルシリーズを紹介します。
**Openaudio-S1-mini**: [動画](アップロード予定); [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini);
**Fish-Speech v1.5**: [動画](https://www.bilibili.com/video/BV1EKiDYBE4o/); [Hugging Face](https://huggingface.co/fishaudio/fish-speech-1.5);
## **ハイライト**
### **感情制御**
OpenAudio S1は**多様な感情、トーン、特殊マーカーをサポート**して音声合成を強化します:
- **基本感情**
```
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
```
- **高度な感情**
```
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
```
- **トーンマーカー**
```
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
```
- **特殊音響効果**
```
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
```
Ha,ha,haを使用してコントロールすることもでき、他にも多くの使用法があなた自身の探索を待っています。
### **優秀なTTS品質**
Seed TTS評価指標を使用してモデルのパフォーマンスを評価した結果、OpenAudio S1は英語テキストで**0.008 WER**と**0.004 CER**を達成し、以前のモデルより大幅に改善されました。(英語、自動評価、OpenAI gpt-4o-転写に基づく、話者距離はRevai/pyannote-wespeaker-voxceleb-resnet34-LMを使用)
| モデル | 単語誤り率 (WER) | 文字誤り率 (CER) | 話者距離 |
|-------|----------------------|---------------------------|------------------|
| **S1** | **0.008** | **0.004** | **0.332** |
| **S1-mini** | **0.011** | **0.005** | **0.380** |
### **2つのモデルタイプ**
| モデル | サイズ | 利用可能性 | 特徴 |
|-------|------|--------------|----------|
| **S1** | 40億パラメータ | [fish.audio](https://fish.audio) で利用可能 | 全機能搭載のフラッグシップモデル |
| **S1-mini** | 5億パラメータ | huggingface [hf space](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini) で利用可能 | コア機能を備えた蒸留版 |
S1とS1-miniの両方にオンライン人間フィードバック強化学習RLHFが組み込まれています。
## **機能**
1. **ゼロショット・フューショットTTS** 10〜30秒の音声サンプルを入力するだけで高品質なTTS出力を生成します。**詳細なガイドラインについては、[音声クローニングのベストプラクティス](https://docs.fish.audio/text-to-speech/voice-clone-best-practices)をご覧ください。**
2. **多言語・言語横断サポート:** 多言語テキストを入力ボックスにコピー&ペーストするだけで、言語を気にする必要はありません。現在、英語、日本語、韓国語、中国語、フランス語、ドイツ語、アラビア語、スペイン語をサポートしています。
3. **音素依存なし:** このモデルは強力な汎化能力を持ち、TTSにおいて音素に依存しません。あらゆる言語スクリプトのテキストを処理できます。
4. **高精度:** Seed-TTS Evalで低い文字誤り率CER約0.4%と単語誤り率WER約0.8%を達成します。
5. **高速:** fish-tech加速により、Nvidia RTX 4060ラップトップでリアルタイム係数約1:5、Nvidia RTX 4090で約1:15を実現します。
6. **WebUI推論** Chrome、Firefox、Edge、その他のブラウザと互換性のあるGradioベースの使いやすいWebUIを備えています。
7. **GUI推論** APIサーバーとシームレスに連携するPyQt6グラフィカルインターフェースを提供します。Linux、Windows、macOSをサポートします。[GUIを見る](https://github.com/AnyaCoder/fish-speech-gui)。
8. **デプロイフレンドリー:** Linux、Windows、macOSのネイティブサポートで推論サーバーを簡単にセットアップし、速度低下を最小化します。
## **免責事項**
コードベースの違法な使用について、当方は一切の責任を負いません。お住まいの地域のDMCAやその他の関連法律をご参照ください。
## **メディア・デモ**
#### 🚧 近日公開
動画デモとチュートリアルは現在開発中です。
## **ドキュメント**
### クイックスタート
- [環境構築](install.md) - 開発環境をセットアップ
- [推論ガイド](inference.md) - モデルを実行して音声を生成
## **コミュニティ・サポート**
- **Discord** [Discordコミュニティ](https://discord.gg/Es5qTB9BcN)に参加
- **ウェブサイト:** 最新アップデートは[OpenAudio.com](https://openaudio.com)をご覧ください
- **オンライン試用:** [Fish Audio Playground](https://fish.audio)

docs/ja/inference.md

@ -34,9 +34,7 @@ python fish_speech/models/text2semantic/inference.py \
--text "変換したいテキスト" \
--prompt-text "参照テキスト" \
--prompt-tokens "fake.npy" \
--checkpoint-path "checkpoints/openaudio-s1-mini" \
--num-samples 2 \
--compile # より高速化を求める場合
--compile
```
このコマンドは、作業ディレクトリに `codes_N` ファイルを作成しますNは0から始まる整数)。
@ -50,15 +48,12 @@ python fish_speech/models/text2semantic/inference.py \
### 3. セマンティックトークンから音声を生成:
#### VQGANデコーダー
!!! warning "将来の警告"
元のパスtools/vqgan/inference.pyからアクセス可能なインターフェースを維持していますが、このインターフェースは後続のリリースで削除される可能性があるため、できるだけ早くコードを変更してください。
```bash
python fish_speech/models/dac/inference.py \
-i "codes_0.npy" \
--checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth"
-i "codes_0.npy"
```
## HTTP API推論
@ -103,5 +98,3 @@ python -m tools.run_webui
!!! note
`GRADIO_SHARE`、`GRADIO_SERVER_PORT`、`GRADIO_SERVER_NAME` などのGradio環境変数を使用してWebUIを設定できます。
お楽しみください!

docs/ja/install.md (new file, +30)

@ -0,0 +1,30 @@
## システム要件
- GPU メモリ12GB推論
- システムLinux、WSL
## セットアップ
まず、音声処理に使用される pyaudio と sox をインストールする必要があります。
```bash
apt install portaudio19-dev libsox-dev ffmpeg
```
### Conda
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
pip install -e .
```
### UV
```bash
uv sync --python 3.12
```
!!! warning
`compile` オプションは Windows と macOS でサポートされていません。compile で実行したい場合は、triton を自分でインストールする必要があります。

docs/ko/index.md

@ -1,4 +1,14 @@
# 소개
# OpenAudio (구 Fish-Speech)
<div align="center">
<div align="center">
<img src="../assets/openaudio.jpg" alt="OpenAudio" style="display: block; margin: 0 auto; width: 35%;"/>
</div>
<strong>고급 텍스트-음성 변환 모델 시리즈</strong>
<div>
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
@ -12,39 +22,113 @@
</a>
</div>
!!! warning
코드베이스의 불법적인 사용에 대해서는 일체 책임을 지지 않습니다. 귀하의 지역의 DMCA(디지털 밀레니엄 저작권법) 및 기타 관련 법률을 참고하시기 바랍니다. <br/>
이 코드베이스는 Apache 2.0 라이선스 하에 배포되며, 모든 모델은 CC-BY-NC-SA-4.0 라이선스 하에 배포됩니다.
## 시스템 요구사항
- GPU 메모리: 12GB (추론)
- 시스템: Linux, Windows
## 설치
먼저 패키지를 설치하기 위한 conda 환경을 만들어야 합니다.
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
sudo apt-get install portaudio19-dev # pyaudio용
pip install -e . # 나머지 모든 패키지를 다운로드합니다.
apt install libsox-dev ffmpeg # 필요한 경우.
```
!!! warning
`compile` 옵션은 Windows와 macOS에서 지원되지 않습니다. compile로 실행하려면 triton을 직접 설치해야 합니다.
## 감사의 말
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [Transformers](https://github.com/huggingface/transformers)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
<strong>지금 체험:</strong> <a href="https://fish.audio">Fish Audio Playground</a> | <strong>자세히 알아보기:</strong> <a href="https://openaudio.com">OpenAudio 웹사이트</a>
</div>
---
!!! warning "법적 고지"
코드베이스의 불법적인 사용에 대해서는 일체 책임을 지지 않습니다. 귀하의 지역의 DMCA(디지털 밀레니엄 저작권법) 및 기타 관련 법률을 참고하시기 바랍니다.
**라이선스:** 이 코드베이스는 Apache 2.0 라이선스 하에 배포되며, 모든 모델은 CC-BY-NC-SA-4.0 라이선스 하에 배포됩니다.
## **소개**
저희는 **OpenAudio**로의 브랜드 변경을 발표하게 되어 기쁩니다. Fish-Speech를 기반으로 하여 상당한 개선과 새로운 기능을 추가한 새로운 고급 텍스트-음성 변환 모델 시리즈를 소개합니다.
**Openaudio-S1-mini**: [동영상](업로드 예정); [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini);
**Fish-Speech v1.5**: [동영상](https://www.bilibili.com/video/BV1EKiDYBE4o/); [Hugging Face](https://huggingface.co/fishaudio/fish-speech-1.5);
## **주요 특징**
### **감정 제어**
OpenAudio S1은 **다양한 감정, 톤, 특수 마커를 지원**하여 음성 합성을 향상시킵니다:
- **기본 감정**:
```
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
```
- **고급 감정**:
```
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
```
- **톤 마커**:
```
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
```
- **특수 음향 효과**:
```
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
```
Ha,ha,ha를 사용하여 제어할 수도 있으며, 여러분 스스로 탐구할 수 있는 다른 많은 사용법이 있습니다.
### **뛰어난 TTS 품질**
Seed TTS 평가 지표를 사용하여 모델 성능을 평가한 결과, OpenAudio S1은 영어 텍스트에서 **0.008 WER**과 **0.004 CER**을 달성하여 이전 모델보다 현저히 향상되었습니다. (영어, 자동 평가, OpenAI gpt-4o-전사 기반, 화자 거리는 Revai/pyannote-wespeaker-voxceleb-resnet34-LM 사용)
| 모델 | 단어 오류율 (WER) | 문자 오류율 (CER) | 화자 거리 |
|-------|----------------------|---------------------------|------------------|
| **S1** | **0.008** | **0.004** | **0.332** |
| **S1-mini** | **0.011** | **0.005** | **0.380** |
### **두 가지 모델 유형**
| 모델 | 크기 | 가용성 | 특징 |
|-------|------|--------------|----------|
| **S1** | 40억 매개변수 | [fish.audio](https://fish.audio)에서 이용 가능 | 모든 기능을 갖춘 플래그십 모델 |
| **S1-mini** | 5억 매개변수 | huggingface [hf space](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini)에서 이용 가능 | 핵심 기능을 갖춘 경량화 버전 |
S1과 S1-mini 모두 온라인 인간 피드백 강화 학습(RLHF)이 통합되어 있습니다.
## **기능**
1. **제로샷 및 퓨샷 TTS:** 10~30초의 음성 샘플을 입력하여 고품질 TTS 출력을 생성합니다. **자세한 가이드라인은 [음성 복제 모범 사례](https://docs.fish.audio/text-to-speech/voice-clone-best-practices)를 참조하세요.**
2. **다국어 및 교차 언어 지원:** 다국어 텍스트를 입력 상자에 복사하여 붙여넣기만 하면 됩니다. 언어에 대해 걱정할 필요가 없습니다. 현재 영어, 일본어, 한국어, 중국어, 프랑스어, 독일어, 아랍어, 스페인어를 지원합니다.
3. **음소 의존성 없음:** 이 모델은 강력한 일반화 능력을 가지고 있으며 TTS에 음소에 의존하지 않습니다. 어떤 언어 스크립트의 텍스트도 처리할 수 있습니다.
4. **높은 정확도:** Seed-TTS Eval에서 약 0.4%의 낮은 문자 오류율(CER)과 약 0.8%의 단어 오류율(WER)을 달성합니다.
5. **빠른 속도:** fish-tech 가속을 통해 Nvidia RTX 4060 노트북에서 실시간 계수 약 1:5, Nvidia RTX 4090에서 약 1:15를 달성합니다.
6. **WebUI 추론:** Chrome, Firefox, Edge 및 기타 브라우저와 호환되는 사용하기 쉬운 Gradio 기반 웹 UI를 제공합니다.
7. **GUI 추론:** API 서버와 원활하게 작동하는 PyQt6 그래픽 인터페이스를 제공합니다. Linux, Windows, macOS를 지원합니다. [GUI 보기](https://github.com/AnyaCoder/fish-speech-gui).
8. **배포 친화적:** Linux, Windows, MacOS의 네이티브 지원으로 추론 서버를 쉽게 설정하여 속도 손실을 최소화합니다.
## **면책 조항**
코드베이스의 불법적인 사용에 대해서는 일체 책임을 지지 않습니다. 귀하 지역의 DMCA 및 기타 관련 법률을 참고하시기 바랍니다.
## **미디어 및 데모**
#### 🚧 곧 출시 예정
동영상 데모와 튜토리얼이 현재 개발 중입니다.
## **문서**
### 빠른 시작
- [환경 구축](install.md) - 개발 환경 설정
- [추론 가이드](inference.md) - 모델 실행 및 음성 생성
## **커뮤니티 및 지원**
- **Discord:** [Discord 커뮤니티](https://discord.gg/Es5qTB9BcN)에 참여하세요
- **웹사이트:** 최신 업데이트는 [OpenAudio.com](https://openaudio.com)을 방문하세요
- **온라인 체험:** [Fish Audio Playground](https://fish.audio)

docs/ko/inference.md

@ -34,9 +34,7 @@ python fish_speech/models/text2semantic/inference.py \
--text "변환하고 싶은 텍스트" \
--prompt-text "참조 텍스트" \
--prompt-tokens "fake.npy" \
--checkpoint-path "checkpoints/openaudio-s1-mini" \
--num-samples 2 \
--compile # 더 빠른 속도를 원한다면
--compile
```
이 명령은 작업 디렉토리에 `codes_N` 파일을 생성합니다. 여기서 N은 0부터 시작하는 정수입니다.
@ -50,15 +48,12 @@ python fish_speech/models/text2semantic/inference.py \
### 3. 의미 토큰에서 음성 생성:
#### VQGAN 디코더
!!! warning "향후 경고"
원래 경로(tools/vqgan/inference.py)에서 액세스 가능한 인터페이스를 유지하고 있지만, 이 인터페이스는 향후 릴리스에서 제거될 수 있으므로 가능한 한 빨리 코드를 변경해 주세요.
```bash
python fish_speech/models/dac/inference.py \
-i "codes_0.npy" \
--checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth"
-i "codes_0.npy"
```
## HTTP API 추론
@ -103,5 +98,3 @@ python -m tools.run_webui
!!! note
`GRADIO_SHARE`, `GRADIO_SERVER_PORT`, `GRADIO_SERVER_NAME`과 같은 Gradio 환경 변수를 사용하여 WebUI를 구성할 수 있습니다.
즐기세요!

docs/ko/install.md (new file, +30)

@ -0,0 +1,30 @@
## 시스템 요구사항
- GPU 메모리: 12GB (추론)
- 시스템: Linux, WSL
## 설정
먼저 오디오 처리에 사용되는 pyaudio와 sox를 설치해야 합니다.
```bash
apt install portaudio19-dev libsox-dev ffmpeg
```
### Conda
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
pip install -e .
```
### UV
```bash
uv sync --python 3.12
```
!!! warning
`compile` 옵션은 Windows와 macOS에서 지원되지 않습니다. compile로 실행하려면 triton을 직접 설치해야 합니다.

docs/pt/index.md

@ -1,4 +1,14 @@
# Introdução
# OpenAudio (anteriormente Fish-Speech)
<div align="center">
<div align="center">
<img src="../assets/openaudio.jpg" alt="OpenAudio" style="display: block; margin: 0 auto; width: 35%;"/>
</div>
<strong>Série Avançada de Modelos Text-to-Speech</strong>
<div>
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
@ -12,39 +22,113 @@
</a>
</div>
!!! warning
Não assumimos nenhuma responsabilidade pelo uso ilegal da base de código. Consulte as leis locais sobre DMCA (Digital Millennium Copyright Act) e outras leis relevantes em sua área. <br/>
Esta base de código é lançada sob a licença Apache 2.0 e todos os modelos são lançados sob a licença CC-BY-NC-SA-4.0.
## Requisitos
- Memória GPU: 12GB (Inferência)
- Sistema: Linux, Windows
## Configuração
Primeiro, precisamos criar um ambiente conda para instalar os pacotes.
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
sudo apt-get install portaudio19-dev # Para pyaudio
pip install -e . # Isso baixará todos os pacotes restantes.
apt install libsox-dev ffmpeg # Se necessário.
```
!!! warning
A opção `compile` não é suportada no Windows e macOS; se você quiser executar com compile, precisa instalar o triton por conta própria.
## Agradecimentos
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [Transformers](https://github.com/huggingface/transformers)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
<strong>Experimente agora:</strong> <a href="https://fish.audio">Fish Audio Playground</a> | <strong>Saiba mais:</strong> <a href="https://openaudio.com">Site OpenAudio</a>
</div>
---
!!! warning "Aviso Legal"
Não assumimos nenhuma responsabilidade pelo uso ilegal da base de código. Consulte as leis locais sobre DMCA (Digital Millennium Copyright Act) e outras leis relevantes em sua área.
**Licença:** Esta base de código é lançada sob a licença Apache 2.0 e todos os modelos são lançados sob a licença CC-BY-NC-SA-4.0.
## **Introdução**
Estamos empolgados em anunciar que mudamos nossa marca para **OpenAudio** - introduzindo uma nova série de modelos avançados de Text-to-Speech que se baseia na fundação do Fish-Speech com melhorias significativas e novas capacidades.
**Openaudio-S1-mini**: [Vídeo](A ser carregado); [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini);
**Fish-Speech v1.5**: [Vídeo](https://www.bilibili.com/video/BV1EKiDYBE4o/); [Hugging Face](https://huggingface.co/fishaudio/fish-speech-1.5);
## **Destaques**
### **Controle Emocional**
O OpenAudio S1 **suporta uma variedade de marcadores emocionais, de tom e especiais** para aprimorar a síntese de fala:
- **Emoções básicas**:
```
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
```
- **Emoções avançadas**:
```
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
```
- **Marcadores de tom**:
```
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
```
- **Efeitos sonoros especiais**:
```
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
```
Você também pode usar Ha,ha,ha para controlar; há muitos outros casos esperando para serem explorados por você mesmo.
### **Qualidade TTS Excelente**
Utilizamos as métricas Seed TTS Eval para avaliar o desempenho do modelo, e os resultados mostram que o OpenAudio S1 alcança **0.008 WER** e **0.004 CER** em texto inglês, que é significativamente melhor que modelos anteriores. (Inglês, avaliação automática, baseada na transcrição OpenAI gpt-4o, distância do falante usando Revai/pyannote-wespeaker-voxceleb-resnet34-LM)
| Modelo | Taxa de Erro de Palavras (WER) | Taxa de Erro de Caracteres (CER) | Distância do Falante |
|-------|----------------------|---------------------------|------------------|
| **S1** | **0.008** | **0.004** | **0.332** |
| **S1-mini** | **0.011** | **0.005** | **0.380** |
### **Dois Tipos de Modelos**
| Modelo | Tamanho | Disponibilidade | Características |
|-------|------|--------------|----------|
| **S1** | 4B parâmetros | Disponível em [fish.audio](https://fish.audio) | Modelo principal com todas as funcionalidades |
| **S1-mini** | 0.5B parâmetros | Disponível no huggingface [hf space](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini) | Versão destilada com capacidades principais |
Tanto o S1 quanto o S1-mini incorporam Aprendizado por Reforço Online com Feedback Humano (RLHF).
## **Características**
1. **TTS Zero-shot e Few-shot:** Insira uma amostra vocal de 10 a 30 segundos para gerar saída TTS de alta qualidade. **Para diretrizes detalhadas, veja [Melhores Práticas de Clonagem de Voz](https://docs.fish.audio/text-to-speech/voice-clone-best-practices).**
2. **Suporte Multilíngue e Cross-lingual:** Simplesmente copie e cole texto multilíngue na caixa de entrada—não precisa se preocupar com o idioma. Atualmente suporta inglês, japonês, coreano, chinês, francês, alemão, árabe e espanhol.
3. **Sem Dependência de Fonemas:** O modelo tem fortes capacidades de generalização e não depende de fonemas para TTS. Pode lidar com texto em qualquer script de idioma.
4. **Altamente Preciso:** Alcança uma baixa Taxa de Erro de Caracteres (CER) de cerca de 0,4% e Taxa de Erro de Palavras (WER) de cerca de 0,8% para Seed-TTS Eval.
5. **Rápido:** Com aceleração fish-tech, o fator de tempo real é aproximadamente 1:5 em um laptop Nvidia RTX 4060 e 1:15 em um Nvidia RTX 4090.
6. **Inferência WebUI:** Apresenta uma interface web fácil de usar baseada em Gradio, compatível com Chrome, Firefox, Edge e outros navegadores.
7. **Inferência GUI:** Oferece uma interface gráfica PyQt6 que funciona perfeitamente com o servidor API. Suporta Linux, Windows e macOS. [Ver GUI](https://github.com/AnyaCoder/fish-speech-gui).
8. **Amigável para Deploy:** Configure facilmente um servidor de inferência com suporte nativo para Linux, Windows e MacOS, minimizando a perda de velocidade.
## **Isenção de Responsabilidade**
Não assumimos nenhuma responsabilidade pelo uso ilegal da base de código. Consulte suas leis locais sobre DMCA e outras leis relacionadas.
## **Mídia e Demos**
#### 🚧 Em Breve
Demonstrações em vídeo e tutoriais estão atualmente em desenvolvimento.
## **Documentação**
### Início Rápido
- [Configurar Ambiente](install.md) - Configure seu ambiente de desenvolvimento
- [Guia de Inferência](inference.md) - Execute o modelo e gere fala
## **Comunidade e Suporte**
- **Discord:** Junte-se à nossa [comunidade Discord](https://discord.gg/Es5qTB9BcN)
- **Site:** Visite [OpenAudio.com](https://openaudio.com) para as últimas atualizações
- **Experimente Online:** [Fish Audio Playground](https://fish.audio)

docs/pt/inference.md

@ -34,9 +34,7 @@ python fish_speech/models/text2semantic/inference.py \
--text "O texto que você quer converter" \
--prompt-text "Seu texto de referência" \
--prompt-tokens "fake.npy" \
--checkpoint-path "checkpoints/openaudio-s1-mini" \
--num-samples 2 \
--compile # se você quiser uma velocidade mais rápida
--compile
```
Este comando criará um arquivo `codes_N` no diretório de trabalho, onde N é um inteiro começando de 0.
@ -50,15 +48,12 @@ Este comando criará um arquivo `codes_N` no diretório de trabalho, onde N é u
### 3. Gerar vocais a partir de tokens semânticos:
#### Decodificador VQGAN
!!! warning "Aviso Futuro"
Mantivemos a interface acessível do caminho original (tools/vqgan/inference.py), mas esta interface pode ser removida em versões subsequentes, então por favor altere seu código o mais breve possível.
```bash
python fish_speech/models/dac/inference.py \
-i "codes_0.npy" \
--checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth"
-i "codes_0.npy"
```
## Inferência com API HTTP
@ -103,5 +98,3 @@ python -m tools.run_webui
!!! note
Você pode usar variáveis de ambiente do Gradio, como `GRADIO_SHARE`, `GRADIO_SERVER_PORT`, `GRADIO_SERVER_NAME` para configurar o WebUI.
Divirta-se!

docs/pt/install.md (new file, +30)

@ -0,0 +1,30 @@
## Requisitos
- Memória GPU: 12GB (Inferência)
- Sistema: Linux, WSL
## Configuração
Primeiro você precisa instalar pyaudio e sox, que são usados para processamento de áudio.
```bash
apt install portaudio19-dev libsox-dev ffmpeg
```
### Conda
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
pip install -e .
```
### UV
```bash
uv sync --python 3.12
```
!!! warning
A opção `compile` não é suportada no Windows e macOS, se você quiser executar com compile, precisa instalar o triton por conta própria.

docs/zh/index.md

@ -1,4 +1,14 @@
# 简介
# OpenAudio (原 Fish-Speech)
<div align="center">
<div align="center">
<img src="../assets/openaudio.jpg" alt="OpenAudio" style="display: block; margin: 0 auto; width: 35%;"/>
</div>
<strong>先进的文字转语音模型系列</strong>
<div>
<a target="_blank" href="https://discord.gg/Es5qTB9BcN">
@ -12,39 +22,113 @@
</a>
</div>
!!! warning
我们不对代码库的任何非法使用承担责任。请参考您所在地区有关 DMCA数字千年版权法和其他相关法律的规定。<br/>
此代码库在 Apache 2.0 许可证下发布,所有模型在 CC-BY-NC-SA-4.0 许可证下发布。
## 系统要求
- GPU 内存12GB推理
- 系统Linux、Windows
## 安装
首先,我们需要创建一个 conda 环境来安装包。
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
sudo apt-get install portaudio19-dev # 用于 pyaudio
pip install -e . # 这将下载所有其余的包。
apt install libsox-dev ffmpeg # 如果需要的话。
```
!!! warning
`compile` 选项在 Windows 和 macOS 上不受支持,如果您想使用 compile 运行,需要自己安装 triton。
## 致谢
- [VITS2 (daniilrobnikov)](https://github.com/daniilrobnikov/vits2)
- [Bert-VITS2](https://github.com/fishaudio/Bert-VITS2)
- [GPT VITS](https://github.com/innnky/gpt-vits)
- [MQTTS](https://github.com/b04901014/MQTTS)
- [GPT Fast](https://github.com/pytorch-labs/gpt-fast)
- [Transformers](https://github.com/huggingface/transformers)
- [GPT-SoVITS](https://github.com/RVC-Boss/GPT-SoVITS)
<strong>立即试用:</strong> <a href="https://fish.audio">Fish Audio Playground</a> | <strong>了解更多:</strong> <a href="https://openaudio.com">OpenAudio 网站</a>
</div>
---
!!! warning "法律声明"
我们不对代码库的任何非法使用承担责任。请参考您所在地区有关 DMCA数字千年版权法和其他相关法律的规定。
**许可证:** 此代码库在 Apache 2.0 许可证下发布,所有模型在 CC-BY-NC-SA-4.0 许可证下发布。
## **介绍**
我们很高兴地宣布,我们已经更名为 **OpenAudio** - 推出全新的先进文字转语音模型系列,在 Fish-Speech 的基础上进行了重大改进并增加了新功能。
**Openaudio-S1-mini**: [视频](即将上传); [Hugging Face](https://huggingface.co/fishaudio/openaudio-s1-mini);
**Fish-Speech v1.5**: [视频](https://www.bilibili.com/video/BV1EKiDYBE4o/); [Hugging Face](https://huggingface.co/fishaudio/fish-speech-1.5);
## **亮点**
### **情感控制**
OpenAudio S1 **支持多种情感、语调和特殊标记**来增强语音合成效果:
- **基础情感**
```
(angry) (sad) (excited) (surprised) (satisfied) (delighted)
(scared) (worried) (upset) (nervous) (frustrated) (depressed)
(empathetic) (embarrassed) (disgusted) (moved) (proud) (relaxed)
(grateful) (confident) (interested) (curious) (confused) (joyful)
```
- **高级情感**
```
(disdainful) (unhappy) (anxious) (hysterical) (indifferent)
(impatient) (guilty) (scornful) (panicked) (furious) (reluctant)
(keen) (disapproving) (negative) (denying) (astonished) (serious)
(sarcastic) (conciliative) (comforting) (sincere) (sneering)
(hesitating) (yielding) (painful) (awkward) (amused)
```
- **语调标记**
```
(in a hurry tone) (shouting) (screaming) (whispering) (soft tone)
```
- **特殊音效**
```
(laughing) (chuckling) (sobbing) (crying loudly) (sighing) (panting)
(groaning) (crowd laughing) (background laughter) (audience laughing)
```
您还可以使用 Ha,ha,ha 来控制,还有许多其他用法等待您自己探索。
### **卓越的 TTS 质量**
我们使用 Seed TTS 评估指标来评估模型性能,结果显示 OpenAudio S1 在英文文本上达到了 **0.008 WER** 和 **0.004 CER**,明显优于以前的模型。(英语,自动评估,基于 OpenAI gpt-4o-转录,说话人距离使用 Revai/pyannote-wespeaker-voxceleb-resnet34-LM
| 模型 | 词错误率 (WER) | 字符错误率 (CER) | 说话人距离 |
|-------|----------------------|---------------------------|------------------|
| **S1** | **0.008** | **0.004** | **0.332** |
| **S1-mini** | **0.011** | **0.005** | **0.380** |
### **两种模型类型**
| 模型 | 规模 | 可用性 | 特性 |
|-------|------|--------------|----------|
| **S1** | 40亿参数 | 在 [fish.audio](https://fish.audio) 上可用 | 功能齐全的旗舰模型 |
| **S1-mini** | 5亿参数 | 在 huggingface [hf space](https://huggingface.co/spaces/fishaudio/openaudio-s1-mini) 上可用 | 具有核心功能的蒸馏版本 |
S1 和 S1-mini 都集成了在线人类反馈强化学习 (RLHF)。
## **功能特性**
1. **零样本和少样本 TTS** 输入 10 到 30 秒的语音样本即可生成高质量的 TTS 输出。**详细指南请参见 [语音克隆最佳实践](https://docs.fish.audio/text-to-speech/voice-clone-best-practices)。**
2. **多语言和跨语言支持:** 只需复制粘贴多语言文本到输入框即可——无需担心语言问题。目前支持英语、日语、韩语、中文、法语、德语、阿拉伯语和西班牙语。
3. **无音素依赖:** 该模型具有强大的泛化能力,不依赖音素进行 TTS。它可以处理任何语言文字的文本。
4. **高度准确:** 在 Seed-TTS Eval 中实现低字符错误率 (CER) 约 0.4% 和词错误率 (WER) 约 0.8%。
5. **快速:** 通过 fish-tech 加速,在 Nvidia RTX 4060 笔记本电脑上实时因子约为 1:5在 Nvidia RTX 4090 上约为 1:15。
6. **WebUI 推理:** 具有易于使用的基于 Gradio 的网络界面,兼容 Chrome、Firefox、Edge 和其他浏览器。
7. **GUI 推理:** 提供与 API 服务器无缝配合的 PyQt6 图形界面。支持 Linux、Windows 和 macOS。[查看 GUI](https://github.com/AnyaCoder/fish-speech-gui)。
8. **部署友好:** 轻松设置推理服务器,原生支持 Linux、Windows 和 macOS最小化速度损失。
## **免责声明**
我们不对代码库的任何非法使用承担责任。请参考您当地关于 DMCA 和其他相关法律的规定。
## **媒体和演示**
#### 🚧 即将推出
视频演示和教程正在开发中。
## **文档**
### 快速开始
- [构建环境](install.md) - 设置您的开发环境
- [推理指南](inference.md) - 运行模型并生成语音
## **社区和支持**
- **Discord** 加入我们的 [Discord 社区](https://discord.gg/Es5qTB9BcN)
- **网站:** 访问 [OpenAudio.com](https://openaudio.com) 获取最新更新
- **在线试用:** [Fish Audio Playground](https://fish.audio)

docs/zh/inference.md

@ -1,6 +1,6 @@
# 推理
由于声码器模型已更改,您需要比以前更多的显存建议使用12GB显存以便流畅推理。
由于声码器模型已更改,您需要比以前更多的 VRAM建议使用 12GB 进行流畅推理。
我们支持命令行、HTTP API 和 WebUI 进行推理,您可以选择任何您喜欢的方法。
@ -17,7 +17,7 @@ huggingface-cli download fishaudio/openaudio-s1-mini --local-dir checkpoints/ope
!!! note
如果您计划让模型随机选择音色,可以跳过此步骤。
### 1. 从参考音频获取VQ tokens
### 1. 从参考音频获取 VQ 令牌
```bash
python fish_speech/models/dac/inference.py \
@ -27,38 +27,33 @@ python fish_speech/models/dac/inference.py \
您应该会得到一个 `fake.npy` 和一个 `fake.wav`
### 2. 从文本生成语义tokens
### 2. 从文本生成语义令牌
```bash
python fish_speech/models/text2semantic/inference.py \
--text "您想要转换的文本" \
--prompt-text "您的参考文本" \
--prompt-tokens "fake.npy" \
--checkpoint-path "checkpoints/openaudio-s1-mini" \
--num-samples 2 \
--compile # 如果您想要更快的速度
--compile
```
此命令将在工作目录中创建一个 `codes_N` 文件其中N是从0开始的整数。
此命令将在工作目录中创建一个 `codes_N` 文件,其中 N 是从 0 开始的整数。
!!! note
您可能想要使用 `--compile` 来融合CUDA内核以获得更快的推理速度约30 tokens/秒 -> 约500 tokens/秒)。
相应地,如果您不打算使用加速,可以删除 `--compile` 参数的注释
您可能希望使用 `--compile` 来融合 CUDA 内核以实现更快的推理(~30 令牌/秒 -> ~500 令牌/秒)。
相应地,如果您不计划使用加速,可以注释掉 `--compile` 参数
!!! info
对于不支持bf16的GPU您可能需要使用 `--half` 参数。
对于不支持 bf16 的 GPU您可能需要使用 `--half` 参数。
### 3. 从语义tokens生成人声
#### VQGAN 解码器
### 3. 从语义令牌生成声音:
!!! warning "未来警告"
我们保留了从原始路径tools/vqgan/inference.py访问的接口但此接口可能在后续版本中被移除请尽快更改您的代码。
我们保留了从原始路径tools/vqgan/inference.py访问接口的能力但此接口可能在后续版本中被删除因此请尽快更改您的代码。
```bash
python fish_speech/models/dac/inference.py \
-i "codes_0.npy" \
--checkpoint-path "checkpoints/openaudiio-s1-mini/codec.pth"
```
## HTTP API 推理

docs/zh/install.md (new file, +30)

@ -0,0 +1,30 @@
## 系统要求
- GPU 内存12GB推理
- 系统Linux、WSL
## 安装
首先需要安装 pyaudio 和 sox用于音频处理。
```bash
apt install portaudio19-dev libsox-dev ffmpeg
```
### Conda
```bash
conda create -n fish-speech python=3.12
conda activate fish-speech
pip install -e .
```
### UV
```bash
uv sync --python 3.12
```
!!! warning
`compile` 选项在 Windows 和 macOS 上不受支持,如果您想使用 compile 运行,需要自己安装 triton。

mkdocs.yml

@ -1,4 +1,4 @@
site_name: Fish Speech
site_name: OpenAudio
site_description: Targeting SOTA TTS solutions.
site_url: https://speech.fish.audio
@ -12,7 +12,7 @@ copyright: Copyright &copy; 2023-2025 by Fish Audio
theme:
name: material
favicon: assets/figs/logo-circle.png
favicon: assets/openaudio.png
language: en
features:
- content.action.edit
@ -25,8 +25,7 @@ theme:
- search.highlight
- search.share
- content.code.copy
icon:
logo: fontawesome/solid/fish
logo: assets/openaudio.png
palette:
# Palette toggle for automatic mode
@ -56,7 +55,8 @@ theme:
code: Roboto Mono
nav:
- Installation: en/index.md
- Introduction: en/index.md
- Installation: en/install.md
- Inference: en/inference.md
# Plugins
@ -80,25 +80,29 @@ plugins:
name: 简体中文
build: true
nav:
- 安装: zh/index.md
- 介绍: zh/index.md
- 安装: zh/install.md
- 推理: zh/inference.md
- locale: ja
name: 日本語
build: true
nav:
- インストール: ja/index.md
- はじめに: ja/index.md
- インストール: ja/install.md
- 推論: ja/inference.md
- locale: pt
name: Português (Brasil)
build: true
nav:
- Instalação: pt/index.md
- Introdução: pt/index.md
- Instalação: pt/install.md
- Inferência: pt/inference.md
- locale: ko
name: 한국어
build: true
nav:
- 설치: ko/index.md
- 소개: ko/index.md
- 설치: ko/install.md
- 추론: ko/inference.md
markdown_extensions: