47 commits
47e3097
shorter augmentations in yaml
Feb 8, 2024
5ab888a
layout to 80 char
Feb 8, 2024
a3bf472
listed label replication
Feb 8, 2024
c86d687
listed label replication
Feb 8, 2024
761bf93
listed label replication
Feb 8, 2024
09cfde3
Refact CTC
Feb 8, 2024
e60396f
Refact transducer
Feb 8, 2024
d6a5524
Refact seq2seq
Feb 8, 2024
9daba50
call replicate label instead of duplication
Feb 8, 2024
6bf2361
refactor aishell
Feb 8, 2024
7ec92c5
refactor aishell
Feb 8, 2024
ebae569
CommonLanuage
Feb 8, 2024
088a0eb
fix error + CV CTC
Feb 8, 2024
bfb9bc2
Giga OOF
Feb 8, 2024
21353d5
Giga OOF
Feb 8, 2024
9971121
Giga OOF
Feb 8, 2024
f879302
Giga OOF
Feb 8, 2024
95c5ea4
Giga OOF
Feb 8, 2024
1b24844
Giga OOF
Feb 8, 2024
a5a97aa
Giga OOF
Feb 8, 2024
55904dd
Giga OOF
Feb 8, 2024
7f366bb
Giga OOF
Feb 8, 2024
963bda4
Finishing OOF
Feb 8, 2024
922024a
final touch LULZ
Feb 8, 2024
819f8c8
fix tests
Feb 8, 2024
8ade568
Tests???
Feb 8, 2024
9e73c10
fix augment in some recipes
mravanelli Feb 10, 2024
b2b8f56
merge
Feb 20, 2024
f0e9f6d
Merge branch 'develop' of https://github.com/TParcollet/speechbrain-r…
Feb 20, 2024
afd37a1
Merge branch 'develop' of https://github.com/speechbrain/speechbrain …
Feb 22, 2024
331ff7d
Merge branch 'develop' of https://github.com/speechbrain/speechbrain …
Feb 26, 2024
7fb5529
new CV recipe for transducer streaming
Feb 26, 2024
bc22091
augment warmup
Feb 26, 2024
58537b5
fix tests
Feb 26, 2024
f9be58a
fix tests
Feb 26, 2024
a7d68fa
fix tests
Feb 26, 2024
2536f01
relpos instead of regular
Feb 26, 2024
ab45af9
Merge remote-tracking branch 'speechbrain/develop' into cv_streaming
Adel-Moumen Mar 22, 2024
61c1789
pre commit
Adel-Moumen Mar 22, 2024
8bd6304
fix precommit
Adel-Moumen Mar 26, 2024
72b6bbc
update results
Adel-Moumen Mar 26, 2024
3880da2
update csv
Adel-Moumen Mar 26, 2024
74508ff
fix issue & recipe test passed
Adel-Moumen Mar 26, 2024
bf5bb57
fix unused variables
Adel-Moumen Mar 26, 2024
9ea8769
remove output_wer_folder
Adel-Moumen Mar 26, 2024
a78dc08
fix kenlm & import
Adel-Moumen Mar 26, 2024
8a85da1
Update README.md
Adel-Moumen Mar 26, 2024
2 changes: 1 addition & 1 deletion docs/docs-requirements.txt
@@ -1,7 +1,7 @@
better-apidoc>=0.3.1
ctc-segmentation>=1.7.0
fairseq
-kenlm
+https://github.com/kpu/kenlm/archive/master.zip
numba>=0.54.1
pyctcdecode
recommonmark>=0.7.1
73 changes: 59 additions & 14 deletions recipes/CommonVoice/ASR/transducer/README.md
@@ -2,37 +2,82 @@
This folder contains scripts necessary to run an ASR experiment with the CommonVoice 14.0 dataset: [CommonVoice Homepage](https://commonvoice.mozilla.org/) and pytorch 2.0

# Extra-Dependencies
-This recipe support two implementation of Transducer loss, see `use_torchaudio` arg in Yaml file:
-1- Transducer loss from torchaudio (if torchaudio version >= 0.10.0) (Default)
-2- Speechbrain Implementation using Numba lib. (this allow you to have a direct access in python to the Transducer loss implementation)
+This recipe supports two implementations of the transducer loss, see `use_torchaudio` arg in the yaml file:
+1. Transducer loss from torchaudio (this requires torchaudio version >= 0.10.0).
+2. Speechbrain implementation using Numba. To use it, please set `use_torchaudio=False` in the yaml file. This version is implemented within SpeechBrain and allows you to directly access the python code of the transducer loss (and directly modify it if needed).

The Numba implementation is currently enabled by default as the `use_torchaudio` option is incompatible with `bfloat16` training.

Note: Before running this recipe, make sure numba is installed. Otherwise, run:
```
pip install numba
```
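
The note above asks you to confirm that numba is installed before running the recipe. A minimal sketch of that check in Python, using only the standard library; the helper name `is_installed` is hypothetical and not part of the recipe:

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("numba"):
    print("numba found")
else:
    # Same command as the one given in the note above.
    print("numba is missing; install it with: pip install numba")
```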

-# How to run
-python train.py hparams/{hparam_file}.py
+# How to run it
+```shell
+python train.py hparams/conformer_transducer.yaml
+```

-# Data preparation
-It is important to note that CommonVoice initially offers mp3 audio files at 42Hz. Hence, audio files are downsampled on the fly within the dataio function of the training script.
+## Precision Notes
+If your GPU effectively supports fp16 (half-precision) computations, it is recommended to execute the training script with the `--precision=fp16` (or `--precision=bf16`) option.
+Enabling half precision can significantly reduce the peak VRAM requirements. For example, in the case of the Conformer Transducer recipe trained with Librispeech, the peak VRAM decreases from 39GB to 12GB when using fp16.
+According to our tests, the performance is not affected.
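
As a rough illustration of the precision note, the sketch below builds the training command with a `--precision` option and works out the relative VRAM saving from the quoted 39GB and 12GB figures. The helper names are hypothetical, and the `fp32` entry is an assumption (the text only names `fp16` and `bf16`):

```python
import sys

# "fp16" and "bf16" come from the text; "fp32" is an assumed full-precision value.
SUPPORTED_PRECISIONS = ("fp32", "fp16", "bf16")


def build_train_command(precision: str = "fp16") -> list:
    """Return the argument list for launching the recipe with a given precision."""
    if precision not in SUPPORTED_PRECISIONS:
        raise ValueError(f"unsupported precision: {precision!r}")
    return [
        sys.executable,
        "train.py",
        "hparams/conformer_transducer.yaml",
        f"--precision={precision}",
    ]


def relative_reduction(before_gb: float, after_gb: float) -> float:
    """Fraction of peak VRAM saved when going from before_gb to after_gb."""
    return (before_gb - after_gb) / before_gb


print(" ".join(build_train_command("fp16")))
print(f"Peak VRAM reduction: {relative_reduction(39, 12):.0%}")
```

The returned list can be passed to `subprocess.run(..., check=True)` from the recipe folder. The 39GB to 12GB figures are the Librispeech numbers quoted above, so the saving on CommonVoice may differ.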

# Languages
Here is a list of the different languages that we tested within the CommonVoice dataset
with our transducers:
- French
- Italian
- German
- English

+# Results (non-streaming)

-# Results
+Results are obtained with beam search and no LM (no-streaming i.e. full context).

-| Language | Release | hyperparams file | LM | Val. CER | Val. WER | Test CER | Test WER | Model link | GPUs |
-| ------------- |:-------------:|:---------------------------:| -----:| -----:| -----:| -----:| -----:| :-----------:| :-----------:|
-| French | 2023-08-15 | train_fr.yaml | No | 5.75 | 14.53 | 7.61 | 17.58 | [model](https://huggingface.co/speechbrain/asr-transducer-commonvoice-14-fr) | [model](https://www.dropbox.com/sh/nv2pnpo5n3besn3/AADZ7l41oLt11ZuOE4MqoJhCa?dl=0) | 1xV100 16GB |
-| Italian | 2023-08-15 | train_it.yaml | No | 4.66 | 14.08 | 5.11 | 14.88 | [model](https://huggingface.co/speechbrain/asr-transducer-commonvoice-14-it) | [model](https://www.dropbox.com/sh/ksm08x0wwiomrgs/AABnjPePWGPxqIqW7bJHp1jea?dl=0) | 1xV100 16GB |
-| German | 2023-08-15 | train_de.yaml | No | 4.32 | 13.09 | 5.43 | 15.25 | [model](https://huggingface.co/speechbrain/asr-transducer-commonvoice-14-de) | [model](https://www.dropbox.com/sh/jfge6ixbtoje64t/AADeAgL5un0A8uEjPSM84ex8a?dl=0) | 1xV100 16GB |
+| Language | Release | LM | Val. CER | Val. WER | Test CER | Test WER | Model link | GPUs |
+| ------------- |:-------------:| -----:| -----:| -----:| -----:| -----:| :-----------:| :-----------:|
+| French | 2024-03-22 | No | 3.51 | 10.30 | 4.64 | 12.47 | [model](https://www.dropbox.com/scl/fo/kue72ik3vc55xu6u8zjr7/h?rlkey=ie98ktqf9gbunn4x9i3pskedq&dl=0) | [model]() | 4xV100 32GB |
+| Italian | 2024-03-22 | No | 2.47 | 8.49 | 2.69 | 8.92 | [model](https://www.dropbox.com/scl/fo/uyqfo3kwcpkaq26au2foj/h?rlkey=gxlj7xn6bnhjfb5jds8p80fe6&dl=0) | [model]() | 4xV100 32GB |

The output folders with checkpoints and logs can be found [here](https://www.dropbox.com/sh/852eq7pbt6d65ai/AACv4wAzk1pWbDo4fjVKLICYa?dl=0).

## Streaming model

### WER vs chunk size & left context

The following matrix presents the Word Error Rate (WER%) achieved on CommonVoice
`test` with various chunk sizes (in ms).

The relative difference is not trivial to interpret, because we are not testing
against a continuous stream of speech, but rather against utterances of various
lengths. This tends to bias results in favor of larger chunk sizes.

The chunk size might not accurately represent expected latency due to slight
padding differences in streaming contexts.

The left chunk size is not representative of the receptive field of the model.
Because the model caches the streaming context at different layers, the model
may end up forming indirect dependencies to audio many seconds ago.

| | full | cs=32 (1280ms) | 16 (640ms) | 8 (320ms) |
|:-----:|:----:|:-----:|:-----:|:-----:|
| it full | 8.92 | - | - | - |
| it lc=32 | - | 10.04 | 10.82 | 12.01 |
| fr full | 12.47 | - | - | - |
| fr lc=32 | - | 13.92 | 14.88 | 16.22 |
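
To make the matrix easier to read, the sketch below recomputes the chunk durations and the relative WER increase of each streaming setting over full context. The WER values are copied from the table. The 40 ms per chunk-size unit is an assumption inferred from the column headers (32 -> 1280ms), and the function names are hypothetical. As noted above, these relative differences are biased by utterance length and should be read with care.

```python
# WER (%) on CommonVoice test, copied from the table above.
FULL_CONTEXT = {"it": 8.92, "fr": 12.47}
STREAMING = {
    "it": {32: 10.04, 16: 10.82, 8: 12.01},
    "fr": {32: 13.92, 16: 14.88, 8: 16.22},
}

# Assumption: one chunk-size unit spans 40 ms, inferred from 32 -> 1280 ms.
MS_PER_UNIT = 40


def chunk_ms(chunk_size: int) -> int:
    """Convert a chunk size (cs) into milliseconds of audio."""
    return chunk_size * MS_PER_UNIT


def relative_increase(streaming_wer: float, full_wer: float) -> float:
    """Relative WER increase of a streaming setting over full context."""
    return (streaming_wer - full_wer) / full_wer


for lang, rows in STREAMING.items():
    for cs, wer in rows.items():
        rel = relative_increase(wer, FULL_CONTEXT[lang])
        print(f"{lang} cs={cs} ({chunk_ms(cs)} ms): WER {wer:.2f}, {rel:+.1%} vs full")
```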

### Inference

Once your model is trained, you need a few manual steps in order to use it with the high-level streaming interfaces (`speechbrain.inference.ASR.StreamingASR`):

1. Create a new directory where you want to store the model.
2. Copy `results/conformer_transducer/<seed>/lm.ckpt` (optional; currently, for streaming rescoring LMs might be unsupported) and `tokenizer.ckpt` to that directory.
3. Copy `results/conformer_transducer/<seed>/save/CKPT+????/model.ckpt` and `normalizer.ckpt` to that directory.
4. Copy your hyperparameters file to that directory. Uncomment the streaming specific keys and remove any training-specific keys. Alternatively, grab the inference hyperparameters YAML for this model from HuggingFace and adapt it to any changes you may have done.
5. You can now instantiate a `StreamingASR` with your model using `StreamingASR.from_hparams("/path/to/model/")`.

The contents of that directory may be uploaded as a HuggingFace model, in which case the model source path can just be specified as `youruser/yourmodel`.
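
The file-copying part of steps 1 to 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the recipe: the function name `assemble_model_dir` is hypothetical, and picking the lexicographically last `CKPT+*` folder is an assumption (you may want a different checkpoint). The hyperparameters file still has to be edited by hand as described in step 4.

```python
import os
import shutil


def assemble_model_dir(run_dir: str, hparams_file: str, out_dir: str) -> list:
    """Copy the files named in steps 1-4 into out_dir; return the copied names."""
    os.makedirs(out_dir, exist_ok=True)  # step 1

    save_dir = os.path.join(run_dir, "save")
    ckpts = sorted(d for d in os.listdir(save_dir) if d.startswith("CKPT+"))
    if not ckpts:
        raise FileNotFoundError(f"no CKPT+* folder under {save_dir}")
    # Assumption: the lexicographically last checkpoint is the one to export.
    ckpt_dir = os.path.join(save_dir, ckpts[-1])

    sources = [
        os.path.join(run_dir, "tokenizer.ckpt"),    # step 2
        os.path.join(ckpt_dir, "model.ckpt"),       # step 3
        os.path.join(ckpt_dir, "normalizer.ckpt"),  # step 3
        hparams_file,                               # step 4 (edit it afterwards)
    ]
    lm_path = os.path.join(run_dir, "lm.ckpt")      # step 2, optional
    if os.path.exists(lm_path):
        sources.append(lm_path)

    copied = []
    for src in sources:
        shutil.copy(src, out_dir)
        copied.append(os.path.basename(src))
    return copied
```

Here `run_dir` stands for `results/conformer_transducer/<seed>`. Once the hyperparameters file in `out_dir` has been adapted, the directory can be passed to `StreamingASR.from_hparams` as in step 5.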

# **About SpeechBrain**
- Website: https://speechbrain.github.io/
- Code: https://github.com/speechbrain/speechbrain/