Add doc for MatchaTTS (#689)

k2-fsa · Jan 2, 2025 · ee6ea5e · ee6ea5e
1 parent 1869117
commit ee6ea5e
Show file tree

Hide file tree

Showing 8 changed files with 385 additions and 14 deletions.
diff --git a/docs/source/_static/matcha-icefall-baker-zh/matcha-baker-0.wav b/docs/source/_static/matcha-icefall-baker-zh/matcha-baker-0.wav
diff --git a/docs/source/_static/matcha-icefall-baker-zh/matcha-baker-1.wav b/docs/source/_static/matcha-icefall-baker-zh/matcha-baker-1.wav
diff --git a/docs/source/_static/matcha-icefall-baker-zh/matcha-baker-2.wav b/docs/source/_static/matcha-icefall-baker-zh/matcha-baker-2.wav
diff --git a/docs/source/_static/matcha-icefall-en_US-ljspeech/matcha-ljspeech-0.wav b/docs/source/_static/matcha-icefall-en_US-ljspeech/matcha-ljspeech-0.wav
diff --git a/docs/source/_static/matcha-icefall-en_US-ljspeech/matcha-ljspeech-1.wav b/docs/source/_static/matcha-icefall-en_US-ljspeech/matcha-ljspeech-1.wav
diff --git a/docs/source/onnx/tts/pretrained_models/index.rst b/docs/source/onnx/tts/pretrained_models/index.rst
@@ -14,4 +14,5 @@ This page list pre-trained models for text-to-speech.
 .. toctree::
    :maxdepth: 5
 
+   ./matcha
    ./vits
diff --git a/docs/source/onnx/tts/pretrained_models/matcha.rst b/docs/source/onnx/tts/pretrained_models/matcha.rst
@@ -0,0 +1,370 @@
+Matcha
+======
+
+
+This page lists pre-trained models using `Matcha-TTS <https://arxiv.org/abs/2309.03199>`_.
+
+.. caution::
+
+   Models are from `icefall <https://github.com/k2-fsa/icefall>`_.
+
+   We don't support models from  `<https://github.com/shivammehta25/Matcha-TTS>`_.
+
+matcha-icefall-en_US-ljspeech (American English, 1 female speaker)
+------------------------------------------------------------------
+
+This model is trained using
+
+  `<https://github.com/k2-fsa/icefall/tree/master/egs/ljspeech/TTS#matcha>`_
+
+The dataset used to train the model is from
+
+  `<https://keithito.com/LJ-Speech-Dataset//>`_.
+
+In the following, we describe how to download it and use it with `sherpa-onnx`_.
+
+Download the model
+~~~~~~~~~~~~~~~~~~
+
+Please use the following commands to download it.
+
+.. code-block:: bash
+
+  cd /path/to/sherpa-onnx
+
+  wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/matcha-icefall-en_US-ljspeech.tar.bz2
+  tar xvf matcha-icefall-en_US-ljspeech.tar.bz2
+  rm matcha-icefall-en_US-ljspeech.tar.bz2
+
+  wget https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/hifigan_v2.onnx
+
+.. caution::
+
+   Remember to also download the vocoder model. We use `hifigan_v2 <https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/hifigan_v2.onnx>`_ in the example.
+   You can also select `hifigan_v1 <https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/hifigan_v1.onnx>`_ or
+   `hifigan_v3 <https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/hifigan_v3.onnx>`_.
+
+Please check that the file sizes of the pre-trained models are correct. See
+the file sizes of ``*.onnx`` files below.
+
+.. code-block:: bash
+
+  ls -lh matcha-icefall-en_US-ljspeech/
+  total 144856
+  -rw-r--r--    1 fangjun  staff   251B Jan  2 11:05 README.md
+  drwxr-xr-x  122 fangjun  staff   3.8K Nov 28  2023 espeak-ng-data
+  -rw-r--r--@   1 fangjun  staff    71M Jan  2 04:04 model-steps-3.onnx
+  -rw-r--r--    1 fangjun  staff   954B Jan  2 11:05 tokens.txt
+
+  ls -lh hifigan_v2.onnx
+  -rw-r--r--  1 fangjun  staff   3.6M Dec 30 17:10 hifigan_v2.onnx
+
+Generate speech with executables compiled from C++
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: bash
+
+  cd /path/to/sherpa-onnx
+
+  ./build/bin/sherpa-onnx-offline-tts \
+    --matcha-acoustic-model=./matcha-icefall-en_US-ljspeech/model-steps-3.onnx \
+    --matcha-vocoder=./hifigan_v2.onnx \
+    --matcha-tokens=./matcha-icefall-en_US-ljspeech/tokens.txt \
+    --matcha-data-dir=./matcha-icefall-en_US-ljspeech/espeak-ng-data \
+    --num-threads=2 \
+    --output-filename=./matcha-ljspeech-0.wav \
+    --debug=1 \
+    "Today as always, men fall into two groups: slaves and free men. Whoever does not have two-thirds of his day for himself, is a slave, whatever he may be: a statesman, a businessman, an official, or a scholar."
+
+After running, it will generate a file ``matcha-ljspeech-0.wav`` in the
+current directory.
+
+.. code-block:: bash
+
+  soxi ./matcha-ljspeech-0.wav
+
+  Input File     : './matcha-ljspeech-0.wav'
+  Channels       : 1
+  Sample Rate    : 22050
+  Precision      : 16-bit
+  Duration       : 00:00:15.06 = 332032 samples ~ 1129.36 CDDA sectors
+  File Size      : 664k
+  Bit Rate       : 353k
+  Sample Encoding: 16-bit Signed Integer PCM
+
+.. raw:: html
+
+  <table>
+    <tr>
+      <th>Wave filename</th>
+      <th>Content</th>
+      <th>Text</th>
+    </tr>
+    <tr>
+      <td>matcha-ljspeech-0.wav</td>
+      <td>
+       <audio title="Generated ./matcha-ljspeech-0.wav" controls="controls">
+             <source src="/sherpa/_static/matcha-icefall-en_US-ljspeech/matcha-ljspeech-0.wav" type="audio/wav">
+             Your browser does not support the <code>audio</code> element.
+       </audio>
+      </td>
+      <td>
+    "Today as always, men fall into two groups: slaves and free men. Whoever does not have two-thirds of his day for himself, is a slave, whatever he may be: a statesman, a businessman, an official, or a scholar."
+      </td>
+    </tr>
+  </table>
+
+Generate speech with Python script
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: bash
+
+  cd /path/to/sherpa-onnx
+
+  python3 ./python-api-examples/offline-tts.py \
+    --matcha-acoustic-model=./matcha-icefall-en_US-ljspeech/model-steps-3.onnx \
+    --matcha-vocoder=./hifigan_v2.onnx \
+    --matcha-tokens=./matcha-icefall-en_US-ljspeech/tokens.txt \
+    --matcha-data-dir=./matcha-icefall-en_US-ljspeech/espeak-ng-data \
+    --num-threads=2 \
+    --output-filename=./matcha-ljspeech-1.wav \
+    --debug=1 \
+    "Friends fell out often because life was changing so fast. The easiest thing in the world was to lose touch with someone."
+
+.. code-block::
+
+  soxi ./matcha-ljspeech-1.wav
+
+  Input File     : './matcha-ljspeech-1.wav'
+  Channels       : 1
+  Sample Rate    : 22050
+  Precision      : 16-bit
+  Duration       : 00:00:07.92 = 174592 samples ~ 593.85 CDDA sectors
+  File Size      : 349k
+  Bit Rate       : 353k
+  Sample Encoding: 16-bit Signed Integer PCM
+
+.. raw:: html
+
+  <table>
+    <tr>
+      <th>Wave filename</th>
+      <th>Content</th>
+      <th>Text</th>
+    </tr>
+    <tr>
+      <td>matcha-ljspeech-1.wav</td>
+      <td>
+       <audio title="Generated ./matcha-ljspeech-1.wav" controls="controls">
+             <source src="/sherpa/_static/matcha-icefall-en_US-ljspeech/matcha-ljspeech-1.wav" type="audio/wav">
+             Your browser does not support the <code>audio</code> element.
+       </audio>
+      </td>
+      <td>
+    "Friends fell out often because life was changing so fast. The easiest thing in the world was to lose touch with someone."
+      </td>
+    </tr>
+  </table>
+
+matcha-icefall-zh-baker (Chinese, 1 female speaker)
+---------------------------------------------------
+
+This model is trained using
+
+  `<https://github.com/k2-fsa/icefall/tree/master/egs/baker_zh/TTS#matcha>`_
+
+The dataset used to train the model is from
+
+  `<https://en.data-baker.com/datasets/freeDatasets/>`_.
+
+.. caution::
+
+   The dataset is for ``non-commercial`` use only.
+
+In the following, we describe how to download it and use it with `sherpa-onnx`_.
+
+Download the model
+~~~~~~~~~~~~~~~~~~
+
+Please use the following commands to download it.
+
+.. code-block:: bash
+
+  cd /path/to/sherpa-onnx
+
+  wget https://github.com/k2-fsa/sherpa-onnx/releases/download/tts-models/matcha-icefall-zh-baker.tar.bz2
+  tar xvf matcha-icefall-zh-baker.tar.bz2
+  rm matcha-icefall-zh-baker.tar.bz2
+
+  wget https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/hifigan_v2.onnx
+
+.. caution::
+
+   Remember to also download the vocoder model. We use `hifigan_v2 <https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/hifigan_v2.onnx>`_ in the example.
+   You can also select `hifigan_v1 <https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/hifigan_v1.onnx>`_ or
+   `hifigan_v3 <https://github.com/k2-fsa/sherpa-onnx/releases/download/vocoder-models/hifigan_v3.onnx>`_.
+
+Please check that the file sizes of the pre-trained models are correct. See
+the file sizes of ``*.onnx`` files below.
+
+.. code-block:: bash
+
+  ls -lh matcha-icefall-zh-baker/
+  total 167344
+  -rw-r--r--  1 fangjun  staff   370B Dec 31 14:51 README.md
+  -rw-r--r--  1 fangjun  staff    58K Dec 31 14:51 date.fst
+  drwxr-xr-x  9 fangjun  staff   288B Apr 19  2024 dict
+  -rw-r--r--  1 fangjun  staff   1.3M Dec 31 14:51 lexicon.txt
+  -rw-r--r--  1 fangjun  staff    72M Dec 31 14:51 model-steps-3.onnx
+  -rw-r--r--  1 fangjun  staff    63K Dec 31 14:51 number.fst
+  -rw-r--r--  1 fangjun  staff    87K Dec 31 14:51 phone.fst
+  -rw-r--r--  1 fangjun  staff    19K Dec 31 14:51 tokens.txt
+
+  ls -lh hifigan_v2.onnx
+  -rw-r--r--  1 fangjun  staff   3.6M Dec 30 17:10 hifigan_v2.onnx
+
+Generate speech with executables compiled from C++
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: bash
+
+  cd /path/to/sherpa-onnx
+
+  ./build/bin/sherpa-onnx-offline-tts \
+    --matcha-acoustic-model=./matcha-icefall-zh-baker/model-steps-3.onnx \
+    --matcha-vocoder=./hifigan_v2.onnx \
+    --matcha-lexicon=./matcha-icefall-zh-baker/lexicon.txt \
+    --matcha-tokens=./matcha-icefall-zh-baker/tokens.txt \
+    --matcha-dict-dir=./matcha-icefall-zh-baker/dict \
+    --num-threads=2 \
+    --output-filename=./matcha-baker-0.wav \
+    --debug=1 \
+    "当夜幕降临，星光点点，伴随着微风拂面，我在静谧中感受着时光的流转，思念如涟漪荡漾，梦境如画卷展开，我与自然融为一体，沉静在这片宁静的美丽之中，感受着生命的奇迹与温柔."
+
+  ./build/bin/sherpa-onnx-offline-tts \
+     --matcha-acoustic-model=./matcha-icefall-zh-baker/model-steps-3.onnx \
+     --matcha-vocoder=./hifigan_v2.onnx \
+     --matcha-lexicon=./matcha-icefall-zh-baker/lexicon.txt \
+     --matcha-tokens=./matcha-icefall-zh-baker/tokens.txt \
+     --tts-rule-fsts=./matcha-icefall-zh-baker/phone.fst,./matcha-icefall-zh-baker/date.fst,./matcha-icefall-zh-baker/number.fst \
+     --matcha-dict-dir=./matcha-icefall-zh-baker/dict \
+     --output-filename=./matcha-baker-1.wav \
+     "某某银行的副行长和一些行政领导表示，他们去过长江和长白山; 经济不断增长。2024年12月31号，拨打110或者18920240511。123456块钱。"
+
+After running, it will generate two files, ``matcha-baker-0.wav`` and
+``matcha-baker-1.wav``, in the current directory.
+
+.. code-block:: bash
+
+  soxi matcha-baker-*.wav
+
+  Input File     : 'matcha-baker-0.wav'
+  Channels       : 1
+  Sample Rate    : 22050
+  Precision      : 16-bit
+  Duration       : 00:00:22.65 = 499456 samples ~ 1698.83 CDDA sectors
+  File Size      : 999k
+  Bit Rate       : 353k
+  Sample Encoding: 16-bit Signed Integer PCM
+
+
+  Input File     : 'matcha-baker-1.wav'
+  Channels       : 1
+  Sample Rate    : 22050
+  Precision      : 16-bit
+  Duration       : 00:00:22.65 = 499456 samples ~ 1698.83 CDDA sectors
+  File Size      : 999k
+  Bit Rate       : 353k
+  Sample Encoding: 16-bit Signed Integer PCM
+
+  Total Duration of 2 files: 00:00:45.30
+
+.. raw:: html
+
+  <table>
+    <tr>
+      <th>Wave filename</th>
+      <th>Content</th>
+      <th>Text</th>
+    </tr>
+    <tr>
+      <td>matcha-baker-0.wav</td>
+      <td>
+       <audio title="Generated ./matcha-baker-0.wav" controls="controls">
+             <source src="/sherpa/_static/matcha-icefall-baker-zh/matcha-baker-0.wav" type="audio/wav">
+             Your browser does not support the <code>audio</code> element.
+       </audio>
+      </td>
+      <td>
+    "当夜幕降临，星光点点，伴随着微风拂面，我在静谧中感受着时光的流转，思念如涟漪荡漾，梦境如画卷展开，我与自然融为一体，沉静在这片宁静的美丽之中，感受着生命的奇迹与温柔."
+      </td>
+    </tr>
+
+    <tr>
+      <td>matcha-baker-1.wav</td>
+      <td>
+       <audio title="Generated ./matcha-baker-1.wav" controls="controls">
+             <source src="/sherpa/_static/matcha-icefall-baker-zh/matcha-baker-1.wav" type="audio/wav">
+             Your browser does not support the <code>audio</code> element.
+       </audio>
+      </td>
+      <td>
+     "某某银行的副行长和一些行政领导表示，他们去过长江和长白山; 经济不断增长。2024年12月31号，拨打110或者18920240511。123456块钱。"
+      </td>
+    </tr>
+  </table>
+
+Generate speech with Python script
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: bash
+
+  cd /path/to/sherpa-onnx
+
+  python3 ./python-api-examples/offline-tts.py \
+   --matcha-acoustic-model=./matcha-icefall-zh-baker/model-steps-3.onnx \
+   --matcha-vocoder=./hifigan_v2.onnx \
+   --matcha-lexicon=./matcha-icefall-zh-baker/lexicon.txt \
+   --matcha-tokens=./matcha-icefall-zh-baker/tokens.txt \
+   --tts-rule-fsts=./matcha-icefall-zh-baker/phone.fst,./matcha-icefall-zh-baker/date.fst,./matcha-icefall-zh-baker/number.fst \
+   --matcha-dict-dir=./matcha-icefall-zh-baker/dict \
+   --output-filename=./matcha-baker-2.wav \
+   --debug=1 \
+   "三百六十行，行行出状元。你行的！明天就是 2025年1月1号啦！银行卡被卡住了，你帮个忙，行不行？"
+
+After running, it will generate a file ``matcha-baker-zh-2.wav`` in the current directory.
+
+.. code-block:: bash
+
+  soxi matcha-baker-2.wav
+
+  Input File     : 'matcha-baker-2.wav'
+  Channels       : 1
+  Sample Rate    : 22050
+  Precision      : 16-bit
+  Duration       : 00:00:12.71 = 280320 samples ~ 953.469 CDDA sectors
+  File Size      : 561k
+  Bit Rate       : 353k
+  Sample Encoding: 16-bit Signed Integer PCM
+
+.. raw:: html
+
+  <table>
+    <tr>
+      <th>Wave filename</th>
+      <th>Content</th>
+      <th>Text</th>
+    </tr>
+    <tr>
+      <td>matcha-baker-2.wav</td>
+      <td>
+       <audio title="Generated ./matcha-baker-2.wav" controls="controls">
+             <source src="/sherpa/_static/matcha-icefall-baker-zh/matcha-baker-2.wav" type="audio/wav">
+             Your browser does not support the <code>audio</code> element.
+       </audio>
+      </td>
+      <td>
+        "三百六十行，行行出状元。你行的！明天就是 2025年1月1号啦！银行卡被卡住了，你帮个忙，行不行？"
+      </td>
+    </tr>
+  </table>