TensorRT+Triton实战-天翼云开发者社区

import numpy as np
import tritonclient.htt p as htt pclient
from utils import preprocess, postprocess

triton_client = htt pclient.InferenceServerClient(url='localhost:8000')
mapping = {0: 'negtive', 1: 'positive'}

if __name__ == "__main__":

    text_list = ['我今天特别开心', '我今天特别生气']
    token_ids, segment_ids = preprocess(text_list)

    inputs = []
    inputs.append(htt pclient.InferInput('input_ids', [token_ids.shape[0], 512], "INT32"))
    inputs.append(htt pclient.InferInput('segment_ids', [segment_ids.shape[0], 512], "INT32"))

    inputs[0].set_data_from_numpy(token_ids.astype(np.int32), binary_data=False)
    inputs[1].set_data_from_numpy(segment_ids.astype(np.int32), binary_data=False)
    outputs = []
    outputs.append(htt pclient.InferRequestedOutput('output', binary_data=False))

    results = triton_client.infer('sentence_classification', inputs=inputs, outputs=outputs)

    prob = results.as_numpy("output")
    pred = prob.argmax(axis=-1)
    result = [mapping[i] for i in pred]
    print(result)

介绍

TensorRT可促进NVIDIA GPU上的高性能推理。

TensorRT采用一个经过训练的网络，该网络由一个网络定义和一组经过训练的参数组成，并生成一个高度优化的运行时引擎，为该网络执行推理。

TensorRT通过C++和Python提供API，通过使用Network Definition API帮助表达深度学习模型，或通过使用解析器加载预定义模型，从而允许TensorRT在NVIDIA GPU上优化和运行这些模型。

Triton 是服务器端推理引擎，用于模型部署和推理。Triton 可以与NVIDIA深度学习框架 TensorRT 集成，使开发人员能够轻松地将训练的模型部署到生产环境中。Triton 还提供了一个简单的 REST API，可用于在客户端应用程序中轻松调用机器学习模型，从而实现推理功能。

拉取TensorRT镜像

docker pull nvcr.io/nvidia/tensorrt:23.05-py3

版本对应关系

模型文件和配置文件准备

# 文件获取：pan.baidu.com/s/1fZbzd8zRA2tciK47tjdgAQ?pwd=nizv#list/path=%2F
文件目录

.
└── triton-model
    ├── 1
    │   └── model.onnx
    └── config.pbtxt

or onnx转tensorrt权重（见下一节）

启动TensorRT镜像并使用trtexec工具转换模型格式

# 启动tensorrt镜像
docker run --gpus all -it --rm -v ./model_repository:/models nvcr.io/nvidia/tensorrt:23.05-py3
# onnx转tensorrt权重
# 使用trtexec来把onnx转成trt格式
trtexec --onnx=bert_cls.onnx --saveEngine=./model.plan --minShapes=input_ids:1x512,segment_ids:1x512 --optShapes=input_ids:1x512,segment_ids:1x512 --maxShapes=input_ids:20x512,segment_ids:20x512 --device=0
# 情感二分类bert_cls.onnx模型

# 参数解释
# trtexec是TensorRT库中的一个命令行工具，用于在推理阶段执行和优化深度学习模型。这里对命令中的各个参数进行解释：

# --onnx=bert_cls.onnx：指定输入的模型文件为bert_cls.onnx，这是一个使用ONNX格式保存的BERT分类模型。

# --saveEngine=./model.plan：指定将优化后的推理引擎保存为model.plan文件。

# --minShapes=input_ids:1x512,segment_ids:1x512：指定最小的输入形状，其中input_ids和segment_ids是模型的输入名称，1x512表示输入的批量大小为1，每个样本的序列长度为512。

# --optShapes=input_ids:1x512,segment_ids:1x512：指定优化过程中探索的输入形状。与--minShapes相同，这里指定输入批量大小为1，序列长度为512。

# --maxShapes=input_ids:20x512,segment_ids:20x512：指定最大的输入形状，用于动态批处理。这里指定输入批量大小的上限为20，序列长度为512。

# --device=0：指定运行推理引擎的设备编号。这里设备编号为0，表示在第一个可用的GPU设备上执行推理。

# 总之，通过该命令，你将使用TensorRT的trtexec工具加载BERT分类模型，并生成一个经过优化的推理引擎，该引擎将保存为model.plan文件，并可以在GPU设备上执行推理。输入形状可以根据实际情况进行调整，并且支持动态批处理。

shapes和配置文件对齐：

name: "sentence_classification"
platform: "tensorrt_plan"
max_batch_size: 8
version_policy: { latest { num_versions: 1 }}
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "segment_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
# 解释

# 这是一个Triton Inference Server的模型配置文件示例，用于部署一个基于TensorRT推理引擎的BERT文本分类模型。具体解释如下：

# name: "sentence_classification"：模型名称为sentence_classification。

# platform: "tensorrt_plan"：指定了模型使用TensorRT推理引擎生成的plan文件格式。

# max_batch_size: 8：指定模型支持的最大批处理大小为8。

# version_policy: { latest { num_versions: 1 }}：指定模型版本管理策略，这里表示只保留一个最新版本。

# input：指定模型的输入信息，包括名称、数据类型和形状。这个例子中输入包含两个张量，名称分别为“input_ids”和“segment_ids”，数据类型均为INT32，形状为不定长序列（-1表示可以是任意长度，由实际输入确定）。

# output：指定模型的输出信息，包括名称、数据类型和形状。这个例子中输出只有一个张量，名称为“output”，数据类型为FP32，形状为不定长序列。

# 需要保证Triton配置文件中的inputs和outputs与模型中定义的输入和输出信息保持一致，这样才能保证Triton Inference Server正确地将输入发送到模型、将模型的输出返回给客户端。

启动triton服务

docker run -it --gpus=1 --rm --net=host -v ./model_repository:/models tritonserver:23.05-py3 tritonserver --model-repository=/models

triton client请求

从huggingface.co的bert-base-chinese目录获取词表文件
trition client请求推理

import numpy as np
import tritonclient.htt p as htt pclient
from utils import preprocess, postprocess

triton_client = htt pclient.InferenceServerClient(url='localhost:8000')
mapping = {0: 'negtive', 1: 'positive'}

if __name__ == "__main__":

    text_list = ['我今天特别开心', '我今天特别生气']
    token_ids, segment_ids = preprocess(text_list)

    inputs = []
    inputs.append(htt pclient.InferInput('input_ids', [token_ids.shape[0], 512], "INT32"))
    inputs.append(htt pclient.InferInput('segment_ids', [segment_ids.shape[0], 512], "INT32"))

    inputs[0].set_data_from_numpy(token_ids.astype(np.int32), binary_data=False)
    inputs[1].set_data_from_numpy(segment_ids.astype(np.int32), binary_data=False)
    outputs = []
    outputs.append(htt pclient.InferRequestedOutput('output', binary_data=False))

    results = triton_client.infer('sentence_classification', inputs=inputs, outputs=outputs)

    prob = results.as_numpy("output")
    pred = prob.argmax(axis=-1)
    result = [mapping[i] for i in pred]
    print(result)

from bert4torch.tokenizers import Tokenizer
from bert4torch.snippets import sequence_padding
import numpy as np

dict_path = './vocab.txt'
# 建立分词器
tokenizer = Tokenizer(dict_path, do_lower_case=True)

def preprocess(text_list):
    batch_token_ids, batch_segment_ids = [], []
    for text in text_list:
        token_ids, segment_ids = tokenizer.encode(text, maxlen=512)
        batch_token_ids.append(token_ids)
        batch_segment_ids.append(segment_ids)
    batch_token_ids = sequence_padding(batch_token_ids, length=512)
    batch_segment_ids = sequence_padding(batch_segment_ids, length=512)
    return batch_token_ids, batch_segment_ids

def postprocess(res):
    '''后处理
    '''
    mapping = {0: 'negtive', 1: 'positive'}
    result = []
    for item in res['outputs']:
        prob = np.array(item['data']).reshape(item['shape'])
        pred = prob.argmax(axis=-1)
        result.append([mapping[i] for i in pred])
    return result

python3 client_tritonclient.py
# ['positive', 'negtive']

import numpy as np
import tritonclient.htt p as htt pclient
from utils import preprocess, postprocess

triton_client = htt pclient.InferenceServerClient(url='localhost:8000')
mapping = {0: 'negtive', 1: 'positive'}

if __name__ == "__main__":

    text_list = ['我今天特别开心', '我今天特别生气']
    token_ids, segment_ids = preprocess(text_list)

    inputs = []
    inputs.append(htt pclient.InferInput('input_ids', [token_ids.shape[0], 512], "INT32"))
    inputs.append(htt pclient.InferInput('segment_ids', [segment_ids.shape[0], 512], "INT32"))

    inputs[0].set_data_from_numpy(token_ids.astype(np.int32), binary_data=False)
    inputs[1].set_data_from_numpy(segment_ids.astype(np.int32), binary_data=False)
    outputs = []
    outputs.append(htt pclient.InferRequestedOutput('output', binary_data=False))

    results = triton_client.infer('sentence_classification', inputs=inputs, outputs=outputs)

    prob = results.as_numpy("output")
    pred = prob.argmax(axis=-1)
    result = [mapping[i] for i in pred]
    print(result)

介绍

TensorRT可促进NVIDIA GPU上的高性能推理。

TensorRT采用一个经过训练的网络，该网络由一个网络定义和一组经过训练的参数组成，并生成一个高度优化的运行时引擎，为该网络执行推理。

拉取TensorRT镜像

docker pull nvcr.io/nvidia/tensorrt:23.05-py3

版本对应关系

模型文件和配置文件准备

# 文件获取：pan.baidu.com/s/1fZbzd8zRA2tciK47tjdgAQ?pwd=nizv#list/path=%2F
文件目录

.
└── triton-model
    ├── 1
    │   └── model.onnx
    └── config.pbtxt

or onnx转tensorrt权重（见下一节）

启动TensorRT镜像并使用trtexec工具转换模型格式

# 启动tensorrt镜像
docker run --gpus all -it --rm -v ./model_repository:/models nvcr.io/nvidia/tensorrt:23.05-py3
# onnx转tensorrt权重
# 使用trtexec来把onnx转成trt格式
trtexec --onnx=bert_cls.onnx --saveEngine=./model.plan --minShapes=input_ids:1x512,segment_ids:1x512 --optShapes=input_ids:1x512,segment_ids:1x512 --maxShapes=input_ids:20x512,segment_ids:20x512 --device=0
# 情感二分类bert_cls.onnx模型

# 参数解释
# trtexec是TensorRT库中的一个命令行工具，用于在推理阶段执行和优化深度学习模型。这里对命令中的各个参数进行解释：

# --onnx=bert_cls.onnx：指定输入的模型文件为bert_cls.onnx，这是一个使用ONNX格式保存的BERT分类模型。

# --saveEngine=./model.plan：指定将优化后的推理引擎保存为model.plan文件。

# --minShapes=input_ids:1x512,segment_ids:1x512：指定最小的输入形状，其中input_ids和segment_ids是模型的输入名称，1x512表示输入的批量大小为1，每个样本的序列长度为512。

# --optShapes=input_ids:1x512,segment_ids:1x512：指定优化过程中探索的输入形状。与--minShapes相同，这里指定输入批量大小为1，序列长度为512。

# --maxShapes=input_ids:20x512,segment_ids:20x512：指定最大的输入形状，用于动态批处理。这里指定输入批量大小的上限为20，序列长度为512。

# --device=0：指定运行推理引擎的设备编号。这里设备编号为0，表示在第一个可用的GPU设备上执行推理。

# 总之，通过该命令，你将使用TensorRT的trtexec工具加载BERT分类模型，并生成一个经过优化的推理引擎，该引擎将保存为model.plan文件，并可以在GPU设备上执行推理。输入形状可以根据实际情况进行调整，并且支持动态批处理。

shapes和配置文件对齐：

name: "sentence_classification"
platform: "tensorrt_plan"
max_batch_size: 8
version_policy: { latest { num_versions: 1 }}
input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "segment_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
# 解释

# 这是一个Triton Inference Server的模型配置文件示例，用于部署一个基于TensorRT推理引擎的BERT文本分类模型。具体解释如下：

# name: "sentence_classification"：模型名称为sentence_classification。

# platform: "tensorrt_plan"：指定了模型使用TensorRT推理引擎生成的plan文件格式。

# max_batch_size: 8：指定模型支持的最大批处理大小为8。

# version_policy: { latest { num_versions: 1 }}：指定模型版本管理策略，这里表示只保留一个最新版本。

# input：指定模型的输入信息，包括名称、数据类型和形状。这个例子中输入包含两个张量，名称分别为“input_ids”和“segment_ids”，数据类型均为INT32，形状为不定长序列（-1表示可以是任意长度，由实际输入确定）。

# output：指定模型的输出信息，包括名称、数据类型和形状。这个例子中输出只有一个张量，名称为“output”，数据类型为FP32，形状为不定长序列。

# 需要保证Triton配置文件中的inputs和outputs与模型中定义的输入和输出信息保持一致，这样才能保证Triton Inference Server正确地将输入发送到模型、将模型的输出返回给客户端。

启动triton服务

docker run -it --gpus=1 --rm --net=host -v ./model_repository:/models tritonserver:23.05-py3 tritonserver --model-repository=/models

triton client请求

从huggingface.co的bert-base-chinese目录获取词表文件
trition client请求推理

import numpy as np
import tritonclient.htt p as htt pclient
from utils import preprocess, postprocess

triton_client = htt pclient.InferenceServerClient(url='localhost:8000')
mapping = {0: 'negtive', 1: 'positive'}

if __name__ == "__main__":

    text_list = ['我今天特别开心', '我今天特别生气']
    token_ids, segment_ids = preprocess(text_list)

    inputs = []
    inputs.append(htt pclient.InferInput('input_ids', [token_ids.shape[0], 512], "INT32"))
    inputs.append(htt pclient.InferInput('segment_ids', [segment_ids.shape[0], 512], "INT32"))

    inputs[0].set_data_from_numpy(token_ids.astype(np.int32), binary_data=False)
    inputs[1].set_data_from_numpy(segment_ids.astype(np.int32), binary_data=False)
    outputs = []
    outputs.append(htt pclient.InferRequestedOutput('output', binary_data=False))

    results = triton_client.infer('sentence_classification', inputs=inputs, outputs=outputs)

    prob = results.as_numpy("output")
    pred = prob.argmax(axis=-1)
    result = [mapping[i] for i in pred]
    print(result)

from bert4torch.tokenizers import Tokenizer
from bert4torch.snippets import sequence_padding
import numpy as np

dict_path = './vocab.txt'
# 建立分词器
tokenizer = Tokenizer(dict_path, do_lower_case=True)

def preprocess(text_list):
    batch_token_ids, batch_segment_ids = [], []
    for text in text_list:
        token_ids, segment_ids = tokenizer.encode(text, maxlen=512)
        batch_token_ids.append(token_ids)
        batch_segment_ids.append(segment_ids)
    batch_token_ids = sequence_padding(batch_token_ids, length=512)
    batch_segment_ids = sequence_padding(batch_segment_ids, length=512)
    return batch_token_ids, batch_segment_ids

def postprocess(res):
    '''后处理
    '''
    mapping = {0: 'negtive', 1: 'positive'}
    result = []
    for item in res['outputs']:
        prob = np.array(item['data']).reshape(item['shape'])
        pred = prob.argmax(axis=-1)
        result.append([mapping[i] for i in pred])
    return result

python3 client_tritonclient.py
# ['positive', 'negtive']

智算服务

应用商城

合作伙伴

开发者

支持与服务

了解天翼云

TensorRT+Triton实战

介绍

拉取TensorRT镜像

模型文件和配置文件准备

启动TensorRT镜像并使用trtexec工具转换模型格式

启动triton服务

triton client请求

TensorRT+Triton实战

介绍

拉取TensorRT镜像

模型文件和配置文件准备

启动TensorRT镜像并使用trtexec工具转换模型格式

启动triton服务

triton client请求

活动

智算服务

应用商城

合作伙伴

开发者

支持与服务

了解天翼云

TensorRT+Triton实战

介绍

拉取TensorRT镜像

模型文件和配置文件准备

启动TensorRT镜像并使用trtexec工具转换模型格式

启动triton服务

triton client请求

TensorRT+Triton实战

介绍

拉取TensorRT镜像

模型文件和配置文件准备

启动TensorRT镜像并使用trtexec工具转换模型格式

启动triton服务

triton client请求