We experimented with two ways of integrating an LM:
- character level: CTC beam search with LM
- word level: CTC WFST search with LM
CTC beam search with LM
Reference: github.com/wenet-e2e/wenet/blob/main/runtime/gpu/README.md
The LM is trained with the kenlm toolkit.
- Prepare dict
● units.txt: T is the model unit in E2E training
● corpus: word-segmented according to the lexicon
- Train LM: use kenlm to generate the .arpa model (see the sketch after this list)
- Decoding with runtime: set --lm_path in convert_start_server.sh. Note that the language model path must be the path inside docker, and that the LM is character level, i.e. there is a space between every two characters of the corpus.
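A minimal sketch of the corpus preparation and LM training, assuming kenlm's lmplz and build_binary are on PATH, a UTF-8 locale, and a raw corpus with no existing spaces (file names are illustrative):

```bash
# Char-level LM: insert a space between every two characters so that
# LM tokens match the E2E model units (GNU sed splits per character
# under a UTF-8 locale).
sed -e 's/\(.\)/\1 /g' -e 's/ $//' corpus.txt > corpus_char.txt

# Train a 3-gram ARPA model; optionally compile it to kenlm's binary
# format for faster loading at serving time.
lmplz -o 3 --text corpus_char.txt --arpa lm.arpa
build_binary lm.arpa lm.bin
```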
Deployment is GPU inference with NVIDIA's Triton and TensorRT, which is very convenient.
CTC WFST search with LM
Reference: wenet/examples/aishell/s0/run.sh
The LM is trained with the SRILM toolkit.
- Prepare dict
● units.txt: T is the model unit in E2E training
● lexicon.txt: L is the lexicon
● corpus: word-segmented according to the lexicon
- Build decoding TLG.fst (tools/fst/compile_lexicon_token_fst.sh)
- Decoding with runtime (tools/decode.sh)
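For reference, illustrative entries for the two dict files: units.txt maps each modeling unit to an integer id, and each lexicon.txt line maps a word to its unit sequence (the entries below are examples, not the real files):

```
# units.txt
<blank> 0
<unk> 1
今 2
天 3
气 4
...

# lexicon.txt
今天 今 天
天气 天 气
```

The corresponding stages from wenet/examples/aishell/s0/run.sh: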
```bash
local/aishell_train_lms.sh
# 7.3 Build decoding TLG
tools/fst/compile_lexicon_token_fst.sh \
  data/local/dict data/local/tmp data/local/lang
tools/fst/make_tlg.sh data/local/lm data/local/lang data/lang_test || exit 1;
# 7.4 Decoding with runtime
chunk_size=-1
./tools/decode.sh --nj 1 \
  --beam 15.0 --lattice_beam 7.5 --max_active 7000 \
  --blank_skip_thresh 0.98 --ctc_weight 0.5 --rescoring_weight 1.0 \
  --chunk_size $chunk_size \
  --fst_path data/lang_test/TLG.fst \
  --dict_path data/lang_test/words.txt \
  data/test/wav.scp data/test/text $dir/final.zip \
  data/lang_test/units.txt data/lm_with_runtime
# Please see $dir/lm_with_runtime for wer
```
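local/aishell_train_lms.sh wraps SRILM's ngram-count; a rough sketch of the core call, assuming ngram-count is on PATH and the corpus is already word-segmented (file names illustrative, not the script's real paths):

```bash
# Word list = first column of the lexicon.
awk '{print $1}' lexicon.txt | sort -u > wordlist.txt

# 3-gram LM with interpolated Kneser-Ney discounting,
# restricted to in-lexicon words.
ngram-count -order 3 -text corpus_seg.txt \
  -vocab wordlist.txt -limit-vocab \
  -kndiscount -interpolate -lm lm.arpa
```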
This approach requires an intermediate lexicon. We tried segmenting the corpus with the jieba tokenizer and using jieba's existing dictionary as the lexicon; judging by the decoding results, this improves over the baseline, but the gain is smaller than with the first approach.
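A minimal sketch of that segmentation step via jieba's command-line interface (paths illustrative):

```bash
# Segment the raw corpus into words, one space between tokens.
python -m jieba -d ' ' corpus.txt > corpus_seg.txt

# jieba's dictionary has lines of the form "word freq pos_tag";
# its word column can seed the word list / lexicon for TLG building.
awk '{print $1}' /path/to/jieba/dict.txt > wordlist.txt
```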