We experimented with two ways of integrating an LM:
- character level: CTC beam search with LM
- word level: CTC WFST search with LM
CTC beam search with LM
Reference: github.com/wenet-e2e/wenet/blob/main/runtime/gpu/README.md
The LM is trained with the kenlm toolkit.
- Prepare dict
● units.txt: T is the model unit in E2E training
● corpus: word-segmented according to the lexicon
- Train LM: use kenlm to generate the .arpa model (see the sketch after this list)
- Decoding with runtime: set --lm_path in convert_start_server.sh. Note that the language model path must be the path inside docker, and that the LM is character level, i.e. there is a space between every two characters of the corpus.
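A minimal sketch of the corpus preparation and LM training, assuming kenlm's lmplz and build_binary are on PATH, a UTF-8 locale, and a raw corpus with no existing spaces (file names are illustrative):

```bash
# Char-level LM: insert a space between every two characters so that
# LM tokens match the E2E model units (GNU sed splits per character
# under a UTF-8 locale).
sed -e 's/\(.\)/\1 /g' -e 's/ $//' corpus.txt > corpus_char.txt

# Train a 3-gram ARPA model; optionally compile it to kenlm's binary
# format for faster loading at serving time.
lmplz -o 3 --text corpus_char.txt --arpa lm.arpa
build_binary lm.arpa lm.bin
```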
Deployment is GPU inference with NVIDIA's Triton and TensorRT, which is very convenient.
CTC WFST search with LM
Reference: wenet/examples/aishell/s0/run.sh
The LM is trained with the SRILM toolkit.
- Prepare dict
● units.txt: T is the model unit in E2E training
● lexicon.txt: L is the lexicon
● corpus: word-segmented according to the lexicon
- Build decoding TLG.fst (tools/fst/compile_lexicon_token_fst.sh)
- Decoding with runtime (tools/decode.sh)
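For reference, illustrative entries for the two dict files: units.txt maps each modeling unit to an integer id, and each lexicon.txt line maps a word to its unit sequence (the entries below are examples, not the real files):

```
# units.txt
<blank> 0
<unk> 1
今 2
天 3
气 4
...

# lexicon.txt
今天 今 天
天气 天 气
```

The corresponding stages from wenet/examples/aishell/s0/run.sh: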
```bash
local/aishell_train_lms.sh
# 7.3 Build decoding TLG
tools/fst/compile_lexicon_token_fst.sh \
  data/local/dict data/local/tmp data/local/lang
tools/fst/make_tlg.sh data/local/lm data/local/lang data/lang_test || exit 1;
# 7.4 Decoding with runtime
chunk_size=-1
./tools/decode.sh --nj 1 \
  --beam 15.0 --lattice_beam 7.5 --max_active 7000 \
  --blank_skip_thresh 0.98 --ctc_weight 0.5 --rescoring_weight 1.0 \
  --chunk_size $chunk_size \
  --fst_path data/lang_test/TLG.fst \
  --dict_path data/lang_test/words.txt \
  data/test/wav.scp data/test/text $dir/final.zip \
  data/lang_test/units.txt data/lm_with_runtime
# Please see $dir/lm_with_runtime for wer
```
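local/aishell_train_lms.sh wraps SRILM's ngram-count; a rough sketch of the core call, assuming ngram-count is on PATH and the corpus is already word-segmented (file names illustrative, not the script's real paths):

```bash
# Word list = first column of the lexicon.
awk '{print $1}' lexicon.txt | sort -u > wordlist.txt

# 3-gram LM with interpolated Kneser-Ney discounting,
# restricted to in-lexicon words.
ngram-count -order 3 -text corpus_seg.txt \
  -vocab wordlist.txt -limit-vocab \
  -kndiscount -interpolate -lm lm.arpa
```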
This approach requires an intermediate lexicon. We tried segmenting the corpus with the jieba tokenizer and using jieba's existing dictionary as the lexicon; judging by the decoding results, this improves over the baseline, but the gain is smaller than with the first approach.
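A minimal sketch of that segmentation step via jieba's command-line interface (paths illustrative):

```bash
# Segment the raw corpus into words, one space between tokens.
python -m jieba -d ' ' corpus.txt > corpus_seg.txt

# jieba's dictionary has lines of the form "word freq pos_tag";
# its word column can seed the word list / lexicon for TLG building.
awk '{print $1}' /path/to/jieba/dict.txt > wordlist.txt
```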