Speech series: the workflow for adding a language model to WeNet - 天翼云 (CTyun) Developer Community

I tried two ways of integrating a language model (LM):

Character level: CTC beam search with LM
Word level: CTC WFST search with LM

CTC beam search with LM

Reference: github.com/wenet-e2e/wenet/blob/main/runtime/gpu/README.md
The LM is estimated with the kenlm toolkit.
Tunable parameters: alpha, beta
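
As a rough illustration of what the two parameters control: CTC beam search decoders with an n-gram LM commonly rank hypotheses by adding alpha times the LM log-probability and beta times the token count to the acoustic log-probability. The sketch below assumes that common shallow-fusion formula. The function names and all numbers are made up for illustration and are not taken from WeNet's decoder.

```python
def fused_score(ctc_logp, lm_logp, num_tokens, alpha, beta):
    """Combine acoustic and LM evidence for one hypothesis (log domain)."""
    return ctc_logp + alpha * lm_logp + beta * num_tokens


# Hypothetical hypotheses: (CTC log-prob, LM log-prob, number of tokens)
hyps = {
    "A": (-4.0, -12.0, 5),
    "B": (-4.5, -6.0, 4),
}


def best(alpha, beta):
    """Return the name of the hypothesis with the highest fused score."""
    return max(hyps, key=lambda h: fused_score(*hyps[h], alpha, beta))
```

With alpha = 0 and beta = 0 the acoustic score alone picks "A". Raising alpha to 0.5 lets the LM overturn that choice in favour of "B", and a large beta (3.0 here) rewards the longer hypothesis "A" again. This is why the two values are usually tuned together on a development set.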

Prepare dict
● units.txt: T, the model units used in E2E training
● corpus: segmented according to the lexicon
Train lm: use kenlm to generate the .arpa model
Decoding with runtime (set --lm_path in convert_start_server.sh)

Add language model: set --lm_path in convert_start_server.sh. Note that the path of your language model is the path inside the Docker container. There's a space between two characters.
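
A minimal sketch of preparing character-level training text, assuming the remark about spaces means that the LM corpus should have one space between consecutive characters. The function name is hypothetical, and the sketch treats every non-whitespace character as one unit, so it would also split Latin-script words into letters.

```python
def to_char_level(line):
    """Return the line with exactly one space between non-space characters."""
    return " ".join(ch for ch in line if not ch.isspace())


# Example: existing whitespace is dropped, then characters are re-joined.
spaced = to_char_level("今天天气 很好")
```

A file of such lines can then be passed to kenlm, for example `lmplz -o 3 < corpus.char.txt > lm.arpa`, where the n-gram order 3 and the file names are only placeholders.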

Deployment uses GPU inference with NVIDIA Triton and TensorRT, which is very convenient.

CTC WFST search with LM

Reference: wenet/examples/aishell/s0/run.sh.
The LM is estimated with the srilm toolkit.
Tunable parameters: acoustic_scale, length_penalty

Prepare dict
● units.txt: T, the model units used in E2E training
● lexicon.txt: L, the lexicon
● corpus: segmented according to the lexicon
Build decoding TLG.fst (tools/fst/compile_lexicon_token_fst.sh)
Decoding with runtime (tools/decode.sh)
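
Segmenting the corpus according to the lexicon can be sketched with forward maximum matching, one simple dictionary-based method. This is an assumption for illustration, not necessarily the segmentation used in the experiments here. The function name, the `max_len` parameter, and the tiny vocabulary are hypothetical.

```python
def segment(sentence, vocab, max_len=4):
    """Greedy forward maximum matching against a set of known words."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + size]
            if size == 1 or cand in vocab:
                words.append(cand)
                i += size
                break
    return words


vocab = {"今天", "天气", "很好"}
tokens = segment("今天天气很好", vocab)
```

Joining the returned words with spaces gives one line of word-level LM training text. Characters that no lexicon word covers come out as single-character tokens, so the lexicon needs entries for them as well.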

local/aishell_train_lms.sh
  # 7.3 Build decoding TLG
  tools/fst/compile_lexicon_token_fst.sh \
    data/local/dict data/local/tmp data/local/lang
  tools/fst/make_tlg.sh data/local/lm data/local/lang data/lang_test || exit 1;
  # 7.4 Decoding with runtime
  chunk_size=-1
  ./tools/decode.sh --nj 1 \
    --beam 15.0 --lattice_beam 7.5 --max_active 7000 \
    --blank_skip_thresh 0.98 --ctc_weight 0.5 --rescoring_weight 1.0 \
    --chunk_size $chunk_size \
    --fst_path data/lang_test/TLG.fst \
    --dict_path data/lang_test/words.txt \
    data/test/wav.scp data/test/text $dir/final.zip \
    data/lang_test/units.txt data/lm_with_runtime
  # WER results are written to the output directory passed above (data/lm_with_runtime)

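This approach needs an intermediate lexicon. I tried segmenting the corpus with the jieba word segmenter and, alternatively, using jieba's existing dictionary. Decoding results improved over the baseline, but the gain was smaller than with the first approach.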
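
To show what the intermediate lexicon looks like, here is a minimal sketch that turns a word list (for example, words taken from a segmenter's dictionary) into lexicon lines of the form "word unit1 unit2 ...". It assumes a character-based model where each unit in units.txt is a single character, and it skips words containing characters the model does not know. The function name and sample data are hypothetical, and this is not WeNet's own tooling.

```python
def build_lexicon(words, units):
    """Map each word to its character units, dropping words with unknown units."""
    entries = []
    for word in words:
        chars = list(word)
        if chars and all(ch in units for ch in chars):
            entries.append(word + " " + " ".join(chars))
    return entries


units = {"今", "天", "气", "很", "好"}
lexicon_lines = build_lexicon(["今天", "天气", "明天"], units)
```

In this example "明天" is dropped because "明" is not among the model units. Words filtered out this way cannot be produced by the TLG decoder, which is one reason the choice of word list affects the result.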

