Llama model conversion from scratch (HF -> ONNX, ONNX -> TRT) and Triton deployment

2023-10-12 08:04:12

Environment setup (NVIDIA / Docker)

Configure the repository

curl -fsSL ****s://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L ****s://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb ****s://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] ****s://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && \
    sudo apt-get update

The commands above configure the apt repository for the NVIDIA Container Toolkit on Ubuntu; the toolkit itself is installed in the next step.

  1. curl -fsSL ****s://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

    This downloads the repository's GPG public key with curl and pipes it to sudo gpg --dearmor, which converts the ASCII-armored key into binary keyring format (no decryption is involved). The result is written to /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg for apt to use later.

  2. curl -s -L ****s://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb ****s://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] ****s://#g'

    This downloads the package source list for the NVIDIA Container Toolkit with curl and rewrites it with sed. Specifically, every deb ****s:// in the list becomes deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] ****s://, so that apt verifies packages from this repository against that keyring.

  3. sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

    This writes the rewritten source list to /etc/apt/sources.list.d/nvidia-container-toolkit.list.

  4. sudo apt-get update

    Finally, this refreshes the package index so that apt can find the NVIDIA Container Toolkit packages.

In short, the command downloads the GPG key for the NVIDIA Container Toolkit repository, registers the repository with a signed-by entry, and runs apt-get update so that the toolkit can be installed.
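
The effect of the sed step can be reproduced in a few lines. The following is a minimal sketch in Python, assuming the masked scheme in the commands above is https; add_signed_by is a hypothetical helper, and the sample source line uses a placeholder host rather than the real repository list.

```python
# Sketch of what the sed substitution does to each "deb" entry.
# Assumption: the masked URL scheme in the article is https.
KEYRING = "/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg"


def add_signed_by(source_list: str, keyring: str = KEYRING) -> str:
    """Insert a [signed-by=...] option after every 'deb ' prefix.

    str.replace substitutes every occurrence, like the 'g' flag in sed.
    """
    return source_list.replace("deb https://", f"deb [signed-by={keyring}] https://")


# Placeholder entry for illustration only (hypothetical host and path).
sample = "deb https://example.invalid/stable/deb/amd64 /\n"
print(add_signed_by(sample), end="")
```

Lines that do not start a deb https:// entry pass through unchanged, which is why the same pipeline is safe to run on a list that also contains blank lines.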

Install the NVIDIA Container Toolkit package

sudo apt-get install -y nvidia-container-toolkit

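Install the NVIDIA driver

  • On Linux, you can identify the graphics card in any of the following ways:
  1. Use the lspci command. Open a terminal and run:

    lspci | grep -i vga

    This lists the graphics-related PCI devices, including the card model.

  2. Use the lshw command. Open a terminal and run:

    sudo lshw -C display

    This shows detailed information about the card, including the model and vendor.

  3. Use a system monitoring tool. Most Linux distributions ship one, such as GNOME System Monitor or KSysGuard. These tools usually provide a graphical interface in which you can look up the card model and other hardware information.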
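
The lspci check can also be scripted. The following is a minimal sketch in Python; find_gpu_lines is a hypothetical helper and the sample text is illustrative, not captured from a real machine. Besides "VGA", it also matches "3D controller", the class under which some NVIDIA data-center GPUs are listed, so a plain grep for vga may miss them.

```python
# Device-class substrings that usually identify a GPU in lspci output.
GPU_CLASSES = ("vga compatible controller", "3d controller", "display controller")


def find_gpu_lines(lspci_output: str) -> list[str]:
    """Return the lspci lines whose device class looks like a GPU."""
    return [
        line.strip()
        for line in lspci_output.splitlines()
        if any(cls in line.lower() for cls in GPU_CLASSES)
    ]


# Illustrative sample (hypothetical devices, not real output).
SAMPLE = """\
00:02.0 Ethernet controller: Example Vendor Device 1234
00:1e.0 3D controller: NVIDIA Corporation Example GPU
"""

for line in find_gpu_lines(SAMPLE):
    print(line)
```

On a real host, the text passed to find_gpu_lines would be the output of lspci itself.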

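Once you know the card model, go to the NVIDIA website at *****://***.nvidia.com/Download/index.aspx and download the matching driver. Make sure the driver version matches both your card model and your Linux distribution. Before installing the driver, make sure the dependencies and prerequisites required for your distribution are in place.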

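  • Run the installation

# Make the installer executable
chmod +x NVIDIA-Linux-x86_64-535.104.12.run
# Install gcc, make, and the headers for the running kernel
sudo apt-get install -y gcc make linux-headers-$(uname -r)
# Run the driver installer
sudo sh NVIDIA-Linux-x86_64-535.104.12.run --disable-nouveau
# Check that the driver was installed successfully
nvidia-smi
  • Install the CUDA Toolkit
wget ****://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
# Install CUDA (the runfile bundles driver 515.43.04; deselect it in the installer menu, since a driver is already installed)
sudo bash cuda_11.7.0_515.43.04_linux.run
# Edit the shell startup file
vi ~/.bashrc
# Add the environment variables
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# Apply the environment variables
source ~/.bashrc
# Check that CUDA was installed successfully
nvcc -V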

Install Miniconda

wget *****://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install Miniconda3
bash Miniconda3-latest-Linux-x86_64.sh
# Configure the conda environment variables
sudo vim /etc/profile
# Add the environment variables
export ANACONDA_PATH=~/miniconda3
export PATH=$PATH:$ANACONDA_PATH/bin
# Apply the environment variables
source /etc/profile
# Check that the installation succeeded
which conda
conda --version
conda info -e
python --version
# List the virtual environments
conda env list

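Install cuDNN

Download the cuDNN archive from *****://developer.nvidia.com/rdp/cudnn-download, upload it to the GPU cloud host, and install it as follows.

# Extract the archive
tar -xf cudnn-linux-x86_64-8.9.4.25_cuda11-archive.tar.xz
# Enter the extracted directory
cd cudnn-linux-x86_64-8.9.4.25_cuda11-archive
# Copy the headers and libraries (-P preserves the library symlinks)
sudo cp ./include/*  /usr/local/cuda-11.7/include/
sudo cp -P ./lib/libcudnn*  /usr/local/cuda-11.7/lib64/
# Make the files readable by all users
sudo chmod a+r /usr/local/cuda-11.7/include/* /usr/local/cuda-11.7/lib64/libcudnn*
# Check that cuDNN was installed successfully
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2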
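
The grep above prints the three version macros from cudnn_version.h. The same check can be done programmatically; the following is a minimal sketch in Python, where parse_cudnn_version is a hypothetical helper and the header excerpt is a hand-written sample with the values expected for the 8.9.4 archive, not the contents of a real file.

```python
import re


def parse_cudnn_version(header_text: str) -> tuple[int, int, int]:
    """Extract (major, minor, patch) from the text of cudnn_version.h."""
    parts = []
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines of the form "#define CUDNN_MAJOR 8".
        match = re.search(rf"^\s*#define\s+{name}\s+(\d+)", header_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"{name} not found in header text")
        parts.append(int(match.group(1)))
    return (parts[0], parts[1], parts[2])


# Hand-written excerpt (assumption: values for the cuDNN 8.9.4 archive).
SAMPLE_HEADER = """\
#define CUDNN_MAJOR 8
#define CUDNN_MINOR 9
#define CUDNN_PATCHLEVEL 4
"""

print(parse_cudnn_version(SAMPLE_HEADER))  # (8, 9, 4)
```

To check a real installation, pass the text of /usr/local/cuda/include/cudnn_version.h instead of the sample.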

Install Docker

To run GPU containers on Linux you need both Docker and the NVIDIA Container Toolkit. The toolkit package was installed above; Docker must also be installed before the runtime configuration step that follows.

The steps are as follows:

1. Install the dependencies

sudo apt-get update
sudo apt-get install -y curl gnupg2 software-properties-common

2. Add Docker's official GPG key

curl *****://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

3. Add the Docker apt repository

sudo add-apt-repository "deb [arch=amd64] *****://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

4. Install Docker

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

Configure Docker

Configure the Docker container runtime. This command modifies /etc/docker/daemon.json:

sudo nvidia-ctk runtime configure --runtime=docker

Restart the Docker service to apply the configuration:

sudo systemctl restart docker

Explanation:

  1. Configure the Docker container runtime: once the NVIDIA Container Toolkit is installed, Docker has to be configured to use it. Use the following command:

    sudo nvidia-ctk runtime configure --runtime=docker
  2. Restart Docker: finally, restart the Docker service so that the change takes effect:

    sudo systemctl restart docker

Docker is now installed in the NVIDIA GPU environment and configured to run containers through the NVIDIA Container Toolkit. You can start a container with docker run to confirm that it works.
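
One way to confirm that the configure step took effect is to look for an nvidia entry under runtimes in /etc/docker/daemon.json. The following is a minimal sketch in Python, assuming the file has the shape that nvidia-ctk typically writes; has_nvidia_runtime is a hypothetical helper and the sample JSON is illustrative.

```python
import json


def has_nvidia_runtime(daemon_json_text: str) -> bool:
    """Return True if the Docker daemon config registers an 'nvidia' runtime."""
    config = json.loads(daemon_json_text)
    return "nvidia" in config.get("runtimes", {})


# Illustrative config (assumption: the typical shape after nvidia-ctk runs).
SAMPLE = '{"runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}}}'

print(has_nvidia_runtime(SAMPLE))  # True
print(has_nvidia_runtime("{}"))    # False
```

On a real host, pass the text of /etc/docker/daemon.json; reading it may require root, depending on the file permissions.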

Verify that Docker can use the GPU

This simply runs nvidia-smi inside a container to list the GPU status:

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

Install the model conversion dependencies

  • Write requirements.txt
onnx==1.14.1
numpy
onnxruntime==1.15.1
torch==2.0.1
transformers==4.31.0
transformers_stream_generator==0.0.4
sentencepiece==0.1.99
tritonclient
gevent****client
gevent


  • Install the dependencies
pip install -r requirements.txt -i ****s://pypi.mirrors.ustc.edu.cn/simple/
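
After the install, it can be useful to confirm that the pinned versions are the ones actually present, since ONNX export is sensitive to the torch and transformers versions. The following is a minimal sketch in Python using importlib.metadata; parse_pins and find_mismatches are hypothetical helpers, and only the pinned lines from the requirements.txt above are checked.

```python
from importlib import metadata
from typing import Optional


def parse_pins(requirements_text: str) -> dict[str, str]:
    """Map package name -> pinned version for every `name==version` line."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" in line:
            name, version = line.split("==", 1)
            pins[name.strip()] = version.strip()
    return pins


def find_mismatches(pins: dict[str, str]) -> dict[str, tuple[str, Optional[str]]]:
    """Return {name: (wanted, installed or None)} for every unsatisfied pin."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            # Ignore local build suffixes such as "+cu117".
            installed = metadata.version(name).split("+", 1)[0]
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


REQUIREMENTS = """\
onnx==1.14.1
numpy
onnxruntime==1.15.1
torch==2.0.1
transformers==4.31.0
transformers_stream_generator==0.0.4
sentencepiece==0.1.99
"""

for name, (wanted, installed) in find_mismatches(parse_pins(REQUIREMENTS)).items():
    print(f"{name}: wanted {wanted}, installed {installed}")
```

An empty output means every pin is satisfied. A package reported as installed None is either missing or registered under a differently normalized name.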

Create llm_models/Llama-2-7b-chat-ms/modeling_llama.py

# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     ****://***.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch LLaMA model."""
import math
from typing import List, Optional, Tuple, Union

import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

from transformers.activations import ACT2FN
from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
from transformers.modeling_utils import PreTrainedModel
from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
from .configuration_llama import LlamaConfig

logger = logging.get_logger(__name__)

_CONFIG_FOR_DOC = "LlamaConfig"


# Copied from transformers.models.bart.modeling_bart._make_causal_mask
def _make_causal_mask(
    input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
):
    """
    Make causal mask used for uni-directional (causal) self-attention.
    """
    bsz, tgt_len = input_ids_shape
    mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
    mask_cond = torch.arange(mask.size(-1), device=device)
    mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
    mask = mask.to(dtype)

    if past_key_values_length > 0:
        mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
    return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)


# Copied from transformers.models.bart.modeling_bart._expand_mask
def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
    """
    Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
    """
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len

    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)

    inverted_mask = 1.0 - expanded_mask

    return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)


class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)


class LlamaRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq)

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)

    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)

        return (
            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
        )


class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
    """LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""

    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
        self.scaling_factor = scaling_factor
        super().__init__(dim, max_position_embeddings, base, device)

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
        t = t / self.scaling_factor

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)


class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
    """LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""

    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
        self.scaling_factor = scaling_factor
        super().__init__(dim, max_position_embeddings, base, device)

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len

        if seq_len > self.max_position_embeddings:
            base = self.base * (
                (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
            ) ** (self.dim / (self.dim - 2))
            inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
            self.register_buffer("inv_freq", inv_freq)

        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)


def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
    # cos and sin arrive as [1, 1, seq_len, dim]; the first two dimensions are always 1.
    # The upstream squeeze-based version is kept here for reference:
    # cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    # sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    # cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    # sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    # This variant uses explicit `view` calls instead, which yields the same shapes.
    cos = cos.view(cos.shape[2], cos.shape[3])   # [seq_len, dim]
    cos = cos[position_ids].view(position_ids.shape[0], 1, position_ids.shape[1], cos.shape[1])  # [bs, 1, seq_len, dim]
    sin = sin.view(sin.shape[2], sin.shape[3])  # [seq_len, dim]
    sin = sin[position_ids].view(position_ids.shape[0], 1, position_ids.shape[1], sin.shape[1])  # [bs, 1, seq_len, dim]

    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


class LlamaMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.pretraining_tp = config.pretraining_tp
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        if self.pretraining_tp > 1:
            slice = self.intermediate_size // self.pretraining_tp
            gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
            up_proj_slices = self.up_proj.weight.split(slice, dim=0)
            down_proj_slices = self.down_proj.weight.split(slice, dim=1)

            gate_proj = torch.cat([F.linear(x, gate_proj_slices[i]) for i in range(self.pretraining_tp)], dim=-1)
            up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.pretraining_tp)], dim=-1)

            intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
            down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.pretraining_tp)]
            down_proj = sum(down_proj)
        else:
            down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

        return down_proj


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


class LlamaAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
        self.pretraining_tp = config.pretraining_tp
        self.max_position_embeddings = config.max_position_embeddings

        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
        self._init_rope()

    def _init_rope(self):
        if self.config.rope_scaling is None:
            self.rotary_emb = LlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)
        else:
            scaling_type = self.config.rope_scaling["type"]
            scaling_factor = self.config.rope_scaling["factor"]
            if scaling_type == "linear":
                self.rotary_emb = LlamaLinearScalingRotaryEmbedding(
                    self.head_dim, max_position_embeddings=self.max_position_embeddings, scaling_factor=scaling_factor
                )
            elif scaling_type == "dynamic":
                self.rotary_emb = LlamaDynamicNTKScalingRotaryEmbedding(
                    self.head_dim, max_position_embeddings=self.max_position_embeddings, scaling_factor=scaling_factor
                )
            else:
                raise ValueError(f"Unknown RoPE scaling type {scaling_type}")

    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        bsz, q_len, _ = hidden_states.size()

        if self.pretraining_tp > 1:
            key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.pretraining_tp
            query_slices = self.q_proj.weight.split((self.num_heads * self.head_dim) // self.pretraining_tp, dim=0)
            key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
            value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)

            query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
            query_states = torch.cat(query_states, dim=-1)

            key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.pretraining_tp)]
            key_states = torch.cat(key_states, dim=-1)

            value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.pretraining_tp)]
            value_states = torch.cat(value_states, dim=-1)

        else:
            query_states = self.q_proj(hidden_states)
            key_states = self.k_proj(hidden_states)
            value_states = self.v_proj(hidden_states)

        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)

        kv_seq_len = key_states.shape[-2]
        if past_key_value is not None:
            kv_seq_len += past_key_value[0].shape[-2]
        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

        if past_key_value is not None:
            # reuse k, v, self_attention
            key_states = torch.cat([past_key_value[0], key_states], dim=2)
            value_states = torch.cat([past_key_value[1], value_states], dim=2)

        past_key_value = (key_states, value_states) if use_cache else None

        # repeat k/v heads if n_kv_heads < n_heads
        key_states = repeat_kv(key_states, self.num_key_value_groups)
        value_states = repeat_kv(value_states, self.num_key_value_groups)

        attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)

        if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
            raise ValueError(
                f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
                f" {attn_weights.size()}"
            )

        if attention_mask is not None:
            if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
                )
            attn_weights = attn_weights + attention_mask

        # upcast attention to fp32
        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
        attn_output = torch.matmul(attn_weights, value_states)

        if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
                f" {attn_output.size()}"
            )

        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

        if self.pretraining_tp > 1:
            attn_output = attn_output.split(self.hidden_size // self.pretraining_tp, dim=2)
            o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.pretraining_tp, dim=1)
            attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.pretraining_tp)])
        else:
            attn_output = self.o_proj(attn_output)

        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights, past_key_value


class LlamaDecoderLayer(nn.Module):
    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = LlamaAttention(config=config)
        self.mlp = LlamaMLP(config)
        self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        """
        Args:
            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
        """

        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs


LLAMA_START_DOCSTRING = r"""
    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its model (such as downloading or saving, resizing the input embeddings, pruning heads
    etc.)

    This model is also a PyTorch [torch.nn.Module](****s://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
    and behavior.

    Parameters:
        config ([`LlamaConfig`]):
            Model configuration class with all the parameters of the model. Initializing with a config file does not
            load the weights associated with the model, only the configuration. Check out the
            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""


@add_start_docstrings(
    "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
    LLAMA_START_DOCSTRING,
)
class LlamaPreTrainedModel(PreTrainedModel):
    config_class = LlamaConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["LlamaDecoderLayer"]
    _skip_keys_device_placement = "past_key_values"

    def _init_weights(self, module):
        std = self.config.initializer_range
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()

    def _set_gradient_checkpointing(self, module, value=False):
        if isinstance(module, LlamaModel):
            module.gradient_checkpointing = value


LLAMA_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
            it.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            [What are attention masks?](../glossary#attention-mask)

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
            `past_key_values`).

            If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
            and modify to your needs. See diagram 1 in [the paper](****s://arxiv.org/abs/1910.13461) for more
            information on the default strategy.

            - 1 indicates the head is **not masked**,
            - 0 indicates the head is **masked**.
        position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
            config.n_positions - 1]`.

            [What are position IDs?](../glossary#position-ids)
        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
            `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
            `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

            Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
            blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.

            If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `decoder_input_ids` of shape `(batch_size, sequence_length)`.
        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
            is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
            model's internal embedding lookup matrix.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`).
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
            tensors for more detail.
        output_hidden_states (`bool`, *optional*):
            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
            more detail.
        return_dict (`bool`, *optional*):
            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""


@add_start_docstrings(
    "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
    LLAMA_START_DOCSTRING,
)
class LlamaModel(LlamaPreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`].

    Args:
        config: LlamaConfig
    """

    def __init__(self, config: LlamaConfig):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
        self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embed_tokens

    def set_input_embeddings(self, value):
        self.embed_tokens = value

    # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
    def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
        # create causal mask
        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
        combined_attention_mask = None
        if input_shape[-1] > 1:
            combined_attention_mask = _make_causal_mask(
                input_shape,
                inputs_embeds.dtype,
                device=inputs_embeds.device,
                past_key_values_length=past_key_values_length,
            )

        if attention_mask is not None:
            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
            expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
                inputs_embeds.device
            )
            combined_attention_mask = (
                expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
            )

        return combined_attention_mask

    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # retrieve input_ids and inputs_embeds
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            batch_size, seq_length = input_ids.shape
        elif inputs_embeds is not None:
            batch_size, seq_length, _ = inputs_embeds.shape
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        seq_length_with_past = seq_length
        past_key_values_length = 0

        if past_key_values is not None:
            past_key_values_length = past_key_values[0][0].shape[2]
            seq_length_with_past = seq_length_with_past + past_key_values_length

        if position_ids is None:
            device = input_ids.device if input_ids is not None else inputs_embeds.device
            position_ids = torch.arange(
                past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
            )
            position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
        else:
            position_ids = position_ids.view(-1, seq_length).long()

        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        # embed positions
        if attention_mask is None:
            attention_mask = torch.ones(
                (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
            )
        attention_mask = self._prepare_decoder_attention_mask(
            attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
        )

        hidden_states = inputs_embeds

        if self.gradient_checkpointing and self.training:
            if use_cache:
                logger.warning_once(
                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                )
                use_cache = False

        # decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        next_decoder_cache = () if use_cache else None

        for idx, decoder_layer in enumerate(self.layers):
            if output_hidden_states:
                all_hidden_states += (hidden_states,)

            past_key_value = past_key_values[idx] if past_key_values is not None else None

            if self.gradient_checkpointing and self.training:

                def create_custom_forward(module):
                    def custom_forward(*inputs):
                        # None for past_key_value
                        return module(*inputs, output_attentions, None)

                    return custom_forward

                layer_outputs = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(decoder_layer),
                    hidden_states,
                    attention_mask,
                    position_ids,
                    None,
                )
            else:
                layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=attention_mask,
                    position_ids=position_ids,
                    past_key_value=past_key_value,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
                )

            hidden_states = layer_outputs[0]

            if use_cache:
                next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)

            if output_attentions:
                all_self_attns += (layer_outputs[1],)

        hidden_states = self.norm(hidden_states)

        # add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = next_decoder_cache if use_cache else None
        if not return_dict:
            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
        )


class LlamaForCausalLM(LlamaPreTrainedModel):
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)
        self.model = LlamaModel(config)
        self.pretraining_tp = config.pretraining_tp
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def set_decoder(self, decoder):
        self.model = decoder

    def get_decoder(self):
        return self.model

    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are
                ignored (masked); the loss is only computed for the tokens with labels in `[0, ...,
                config.vocab_size - 1]`.

        Returns:

        Example:

        ```python
        >>> from transformers import AutoTokenizer, LlamaForCausalLM

        >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
        >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="pt")

        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```"""

        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        hidden_states = outputs[0]
        if self.pretraining_tp > 1:
            lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.pretraining_tp, dim=0)
            logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.pretraining_tp)]
            logits = torch.cat(logits, dim=-1)
        else:
            logits = self.lm_head(hidden_states)
        logits = logits.float()

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Enable model parallelism
            shift_labels = shift_labels.to(shift_logits.device)
            loss = loss_fct(shift_logits, shift_labels)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def prepare_inputs_for_generation(
        self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
    ):
        if past_key_values:
            input_ids = input_ids[:, -1:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -1].unsqueeze(-1)

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
            }
        )
        return model_inputs

    @staticmethod
    def _reorder_cache(past_key_values, beam_idx):
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (
                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
            )
        return reordered_past


@add_start_docstrings(
    """
    The LLaMa Model transformer with a sequence classification head on top (linear layer).

    [`LlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal models
    (e.g. GPT-2) do.

    Since it does classification on the last token, it requires to know the position of the last token. If a
    `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
    no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
    padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
    each row of the batch).
    """,
    LLAMA_START_DOCSTRING,
)
class LlamaForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = LlamaModel(config)
        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss), If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
            else:
                sequence_lengths = -1

        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(pooled_logits, labels)
        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
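
The mask handling in `_prepare_decoder_attention_mask` above is easier to follow with a small framework-free sketch. The code below is only an illustration in plain Python lists, not the transformers implementation: `build_additive_mask` and `NEG` are hypothetical names, and `NEG` stands in for the large negative fill value used by `_make_causal_mask` and `_expand_mask`, which are defined outside this excerpt. It shows how the causal mask and the padding mask are added together, and why the causal part has no effect during single-token decoding.

```python
NEG = float("-inf")  # stand-in for the large negative fill value (assumption)


def build_additive_mask(padding_mask, tgt_len, past_len=0):
    """Return a [tgt_len][src_len] additive mask for one sequence.

    padding_mask: list of 0/1 over all src_len = past_len + tgt_len positions.
    A 0.0 entry means "may attend"; NEG means "blocked".
    """
    src_len = past_len + tgt_len
    assert len(padding_mask) == src_len
    mask = []
    for q in range(tgt_len):
        row = []
        for k in range(src_len):
            # Causal part: query q sits at absolute position past_len + q
            # and may only look at keys at or before that position.
            causal = 0.0 if k <= past_len + q else NEG
            # Padding part: keys marked 0 in the padding mask are blocked.
            pad = 0.0 if padding_mask[k] == 1 else NEG
            row.append(causal + pad)  # the two masks are combined by addition
        mask.append(row)
    return mask


# Prefill: 3 tokens, the first one is left padding.
prefill = build_additive_mask([0, 1, 1], tgt_len=3)
# Decode step: 1 new token attending to 3 cached positions plus itself.
decode = build_additive_mask([0, 1, 1, 1], tgt_len=1, past_len=3)
```

In the decode case the only query is the newest position, so every cached key passes the causal test and only the padding mask blocks anything. This matches the `input_shape[-1] > 1` check above, which skips building the causal mask for a single new token.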

Create llm_models/Llama-2-7b-chat-ms/configuration_llama.py

# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     ****://***.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" LLaMA model configuration"""

from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging


logger = logging.get_logger(__name__)

LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}


class LlamaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the LLaMA-7B.
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`LlamaModel`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 11008):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA) otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details checkout [this
            paper](****s://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
            `num_attention_heads`.
        pretraining_tp (`int`, *optional*, defaults to `1`):
            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
            document](****s://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
            necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
            issue](****s://github.com/pytorch/pytorch/issues/76232).
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-6):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        tie_word_embeddings(`bool`, *optional*, defaults to `False`):
            Whether to tie weight embeddings
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format
            is `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
            `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
            these scaling strategies behave:
            ****s://***.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
            experimental feature, subject to breaking API changes in future versions.
        Example:
    ```python
    >>> from transformers import LlamaModel, LlamaConfig
    >>> # Initializing a LLaMA llama-7b style configuration
    >>> configuration = LlamaConfig()
    >>> # Initializing a model from the llama-7b style configuration
    >>> model = LlamaModel(configuration)
    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""
    model_type = "llama"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=32000,
        hidden_size=4096,
        intermediate_size=11008,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=None,
        hidden_act="silu",
        max_position_embeddings=2048,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=0,
        bos_token_id=1,
        eos_token_id=2,
        pretraining_tp=1,
        tie_word_embeddings=False,
        rope_scaling=None,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_scaling = rope_scaling
        self._rope_scaling_validation()

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )

    def _rope_scaling_validation(self):
        """
        Validate the `rope_scaling` configuration.
        """
        if self.rope_scaling is None:
            return

        if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
            raise ValueError(
                "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
                f"got {self.rope_scaling}"
            )
        rope_scaling_type = self.rope_scaling.get("type", None)
        rope_scaling_factor = self.rope_scaling.get("factor", None)
        if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
            raise ValueError(
                f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
            )
        if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
            raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
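
To see which `rope_scaling` values pass the checks in `_rope_scaling_validation`, here is a minimal standalone sketch. `validate_rope_scaling` and `is_valid` are hypothetical helper names introduced only for illustration; the checks mirror the method above but run without transformers installed. One detail that is easy to miss: the factor must be a Python `float`, so an integer such as `2` is rejected.

```python
def validate_rope_scaling(rope_scaling):
    """Standalone copy of the checks in `_rope_scaling_validation` (sketch)."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(f"expected a dict with `type` and `factor`, got {rope_scaling}")
    scaling_type = rope_scaling.get("type")
    factor = rope_scaling.get("factor")
    if scaling_type not in ("linear", "dynamic"):
        raise ValueError(f"`type` must be 'linear' or 'dynamic', got {scaling_type}")
    # An int such as 2 is rejected; the factor has to be a float > 1.
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(f"`factor` must be a float > 1, got {factor}")


def is_valid(rope_scaling):
    """Return True if the value passes validation, False if it raises."""
    try:
        validate_rope_scaling(rope_scaling)
        return True
    except ValueError:
        return False


results = {
    "none": is_valid(None),
    "linear_2.0": is_valid({"type": "linear", "factor": 2.0}),
    "int_factor": is_valid({"type": "linear", "factor": 2}),
    "unknown_type": is_valid({"type": "ntk", "factor": 2.0}),
    "factor_1.0": is_valid({"type": "dynamic", "factor": 1.0}),
}
```

The config.json used in this article sets `"rope_scaling": null`, which corresponds to the `None` case and skips validation entirely.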

Create llm_models/Llama-2-7b-chat-ms/config.json

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "auto_map": {
    "AutoModelForCausalLM": "modeling_llama.LlamaForCausalLM"
  },
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0.dev0",
  "use_cache": true,
  "vocab_size": 32000
}
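
Several numbers in this config.json matter for the export step that follows. The export script hardcodes a hidden size of 4096, and the KV-cache tensors are sized by the layer and head counts. The sketch below derives those values from a subset of the fields above. `kv_cache_bytes` is a hypothetical helper, and the estimate assumes float16 storage (2 bytes per value) and a cache laid out as one key and one value tensor per layer.

```python
import json

# Only the fields needed here, copied from the config.json above.
config = json.loads("""
{
  "hidden_size": 4096,
  "max_position_embeddings": 4096,
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "torch_dtype": "float16",
  "vocab_size": 32000
}
""")

# Per-head dimension used by every attention layer.
head_dim = config["hidden_size"] // config["num_attention_heads"]
# MHA when key/value heads equal attention heads, otherwise GQA or MQA.
uses_mha = config["num_key_value_heads"] == config["num_attention_heads"]


def kv_cache_bytes(seq_len, batch_size=1, bytes_per_value=2):
    """Rough KV-cache size: 2 tensors (key, value) per layer, float16 assumed."""
    per_layer = 2 * batch_size * config["num_key_value_heads"] * seq_len * head_dim
    return per_layer * config["num_hidden_layers"] * bytes_per_value


# Cache size for a 1024-token sequence at batch size 1.
cache_1024 = kv_cache_bytes(seq_len=1024)
```

With these values `head_dim` is 128 and the model uses plain multi-head attention. A 1024-token cache at batch size 1 comes to 512 MiB under the stated assumptions. Note also that `rms_norm_eps` is 1e-05 in this file, which overrides the 1e-6 default in `LlamaConfig.__init__`.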

hf->onnx

  • Conversion code
import os
import base64
import glob
import shutil
import argparse
import torch
import numpy as np
import onnxruntime as ort
import sentencepiece as spm
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# some wrapper class for export
class Embedding(torch.nn.Module):
    def __init__(self, embed, using_bf16: bool = False):
        super().__init__()
        self.bf16 = using_bf16
        if using_bf16:
            # using bf16 embedding weight
            self.embed = embed.bfloat16()
        else:
            self.embed = embed

    def forward(self, input_ids):
        res = self.embed(input_ids)
        if self.bf16:
            res = res.float()
        return res.view(-1, 1, 4096)

class Lm(torch.nn.Module):
    def __init__(self, lm):
        super().__init__()
        self.lm = lm

    def forward(self, hidden_states):
        m_logits = self.lm(hidden_states)
        token = torch.argmax(m_logits)
        return token

class LLM(torch.nn.Module):
    '''
    Base class for all llm model. Inherits from [`torch.nn.Module`].
    '''

    def __init__(self, args):
        super().__init__()
        self.export_path = args.export_path
        self.export_verbose = args.export_verbose
        self.export_test = args.export_test
        self.embed_bf16 = args.embed_bf16
        tokenizer_model = os.path.join(args.path, 'tokenizer.model')
        if os.path.exists(tokenizer_model):
            self.sp_model = spm.SentencePieceProcessor(tokenizer_model)
        else:
            self.sp_model = None
        self.load_model(args.path)
        self.max_length = 1024

    def load_model(self, model_path: str):
        raise NotImplementedError

    def get_attention_mask(self) -> torch.Tensor:
        raise NotImplementedError

    def get_position_ids(self) -> torch.Tensor:
        raise NotImplementedError

    def export_vocab(self):
        raise NotImplementedError

    def forward(self, input_ids, attention_mask, position_ids, past_key_values):
        hidden_states = self.embed(input_ids)
        presents = []
        for i in range(self.block_nums):
            hidden_states, kv = self.blocks[i](hidden_states, attention_mask, position_ids, past_key_values[i])
            presents.append(kv)
        token_id = self.lm(hidden_states).view(1)
        presents = torch.stack(presents)
        self.seq_len += 1
        self.token_len += 1
        return token_id, presents

    # some test functions
    def build_prompt(self, query):
        if hasattr(self.tokenizer, 'build_prompt'):
            prompt = self.tokenizer.build_prompt(query)
        else:
            prompt = query
        return prompt

    def str_to_ids(self, prompt):
        input_ids = self.tokenizer(prompt, return_tensors="pt")['input_ids']
        return input_ids

    def id_to_str(self, token_id):
        word = self.tokenizer._convert_id_to_token(int(token_id))
        word = self.tokenizer.convert_tokens_to_string([word])
        return word

    def response(self, query):
        prompt = self.build_prompt(query)
        input_ids = self.str_to_ids(prompt)
        self.seq_len = input_ids.numel()
        self.context_len = self.seq_len - 2
        self.token_len = 0
        past_key_values = [None for i in range(self.block_nums)]
        token_id = input_ids
        while self.token_len < self.max_length:
            attention_mask = self.get_attention_mask()
            position_ids = self.get_position_ids()
            token_id, past_key_values = self.forward(token_id, attention_mask, position_ids, past_key_values)
            if token_id == self.stop_id:
                print("", end='\n')
                break
            word = self.id_to_str(token_id)
            print(word, end="", flush=True)

    # some export functions
    def assert_equal(self, torch_outs, onnx_outs):
        if type(torch_outs) not in (list, tuple):
            torch_outs = (torch_outs, )
            onnx_outs = (onnx_outs, )
        same = True
        for orig, onnx in zip(torch_outs, onnx_outs):
            orig = orig.detach().numpy()
            if not np.allclose(orig, onnx, rtol=1e-3, atol=1e-3):
                print('Error: onnx outputs do not match original. [shape = {}] onnx: {}, original: {}'.format(onnx.shape, onnx, orig))
                same = False
                break
        if same:
            print('onnx test SUCCESS')

    def export_lm(self):
        model = self.lm
        hidden_states = torch.randn(1, 4096)
        onnx_model = f'./{self.export_path}/lm.onnx'
        torch.onnx.export(model, (hidden_states),
                        onnx_model,
                        verbose=self.export_verbose,
                        input_names=['hidden_states'],
                        output_names=['token_id'],
                        do_constant_folding=True,
                        opset_version=15)
        # test lm
        if self.export_test:
            original_outs = model(hidden_states)
            ort_session = ort.InferenceSession(onnx_model, providers=['CPUExecutionProvider'])
            inputs = {
                'hidden_states' : hidden_states.numpy(),
            }
            onnx_outs = ort_session.run(None, inputs)
            self.assert_equal(original_outs, onnx_outs)

    def export_embed(self):
        model = self.embed
        input_ids = torch.arange(3, dtype=torch.long)
        onnx_model = f'./{self.export_path}/embedding.onnx'
        torch.onnx.export(model, (input_ids),
                        onnx_model,
                        verbose=self.export_verbose,
                        input_names=['input_ids'],
                        output_names=['inputs_embeds'],
                        dynamic_axes={"input_ids": {
                            0: "length"
                        }},
                        do_constant_folding=True,
                        opset_version=15)
        # test
        if self.export_test:
            original_outs = model(input_ids)
            ort_session = ort.InferenceSession(onnx_model, providers=['CPUExecutionProvider'])
            inputs = {
                'input_ids' : input_ids.numpy(),
            }
            onnx_outs = ort_session.run(None, inputs)
            self.assert_equal(original_outs, onnx_outs)

    def export_block(self, block_id: int):
        self.seq_len = 3
        self.token_len = 0
        inputs_embeds = torch.randn((self.seq_len, 1, 4096))
        attention_mask =  self.get_attention_mask()
        position_ids = self.get_position_ids()
        past_key_values = torch.zeros(self.past_kv_shape[1:])
        model = self.blocks[block_id]
        onnx_model = f'./{self.export_path}/block_{block_id}.onnx'
        torch.onnx.export(
            model, (inputs_embeds, attention_mask, position_ids, past_key_values),
            onnx_model,
            verbose=self.export_verbose,
            input_names=[
                'inputs_embeds', 'attention_mask', 'position_ids', 'past_key_values'
            ],
            output_names=['hidden_states', 'presents'],
            dynamic_axes=self.block_dynamic_axes,
            do_constant_folding=True,
            opset_version=15)
        if self.export_test:
            original_outs = model(inputs_embeds, attention_mask, position_ids, past_key_values)
            ort_session = ort.InferenceSession(onnx_model, providers=['CPUExecutionProvider'])
            inputs = {
                'inputs_embeds' : inputs_embeds.detach().numpy(),
                'attention_mask' : attention_mask.numpy(),
                'position_ids' : position_ids.numpy(),
                'past_key_values' : past_key_values.numpy()
            }
            onnx_outs = ort_session.run(None, inputs)
            self.assert_equal(original_outs, onnx_outs)

    def export_blocks(self):
        for i in range(self.block_nums):
            self.export_block(i)

    def export(self):
        model = self
        self.seq_len = 3
        self.token_len = 0
        input_ids = torch.arange(3, dtype=torch.long)
        attention_mask =  self.get_attention_mask()
        position_ids = self.get_position_ids()
        past_key_values = torch.zeros(self.past_kv_shape)
        onnx_model = f'./{self.export_path}/llm.onnx'
        torch.onnx.export(
            model, (input_ids, attention_mask, position_ids, past_key_values),
            onnx_model,
            verbose=self.export_verbose,
            input_names=[
                'input_ids', 'attention_mask', 'position_ids', 'past_key_values'
            ],
            output_names=['token_id', 'presents'],
            dynamic_axes=self.model_dynamic_axes,
            do_constant_folding=True,
            opset_version=15)
        if self.export_test:
            # test
            original_outs = model(input_ids, attention_mask, position_ids, past_key_values)
            ort_session = ort.InferenceSession(onnx_model, providers=['CPUExecutionProvider'])
            inputs = {
                'input_ids' : input_ids.detach().numpy(),
                'attention_mask' : attention_mask.numpy(),
                'position_ids' : position_ids.numpy(),
                'past_key_values' : past_key_values.numpy()
            }
            onnx_outs = ort_session.run(None, inputs)
            self.assert_equal(original_outs, onnx_outs)

    def export_tokenizer(self):
        file_path = os.path.join(self.export_path, "tokenizer.txt")
        if self.sp_model is not None:
            # sentencepiece
            NORMAL = 1; UNKNOWN = 2; CONTROL = 3
            USER_DEFINED = 4; UNUSED = 5; BYTE = 6
            fp = open(file_path, "w", encoding="utf8")
            for i in range(self.sp_model.GetPieceSize()):
                token = self.sp_model.IdToPiece(i)
                score = self.sp_model.GetScore(i)
                type = NORMAL
                if self.sp_model.IsUnknown(i):
                    type = UNKNOWN
                elif self.sp_model.IsControl(i):
                    type = CONTROL
                elif self.sp_model.IsUnused(i):
                    type = UNUSED
                elif self.sp_model.IsByte(i):
                    type = BYTE
                if self.model_name == 'Chatglm_6b':
                    if '<n>' in token: token = '\n'
                    if '<|tab|>' in token: token = '\t'
                    if '<|blank_' in token: token = ' ' * int(token[8:token.find('|>')])
                if '▁' in token: token = token.replace('▁', ' ')
                token_encode = base64.b64encode(token.encode("utf-8")).decode("utf8")
                fp.write(f'{token_encode} {score} {type}\n')
            fp.close()
        else:
            # tiktoken
            with open(file_path, "w", encoding="utf8") as fp:
                for k, v in self.tokenizer.mergeable_ranks.items():
                    line = base64.b64encode(k).decode("utf8") + "\n"
                    fp.write(line)

# chatglm
class GLMBlock(torch.nn.Module):
    def __init__(self, block, block_id, final_layernorm = None):
        super().__init__()
        self.block = block
        self.block_id = block_id
        self.final_layernorm = final_layernorm

    def forward(self, hidden_states, attention_mask, position_ids, past_kv):
        hidden_states, presents = self.block(hidden_states,
                                             position_ids,
                                             attention_mask,
                                             self.block_id,
                                             past_kv,
                                             use_cache=True)
        if self.final_layernorm is not None:
            hidden_states = self.final_layernorm(hidden_states)
            hidden_states = hidden_states.view(-1, 4096)[-1].view(1, 1, 4096)
        if isinstance(presents, tuple):
            presents = torch.stack(presents)
        return hidden_states, presents

class Chatglm_6b(LLM):
    def __init__(self, args):
        super().__init__(args)
        self.model_name = 'Chatglm_6b'

    def load_model(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        model = AutoModel.from_pretrained(model_path, trust_remote_code=True).float().eval()
        transformer = model.transformer
        self.lm_ = model.lm_head
        self.embed_ = transformer.word_embeddings
        self.blocks_ = transformer.layers
        self.final_layernorm_ = transformer.final_layernorm
        # some wrapper
        self.stop_id = self.tokenizer._convert_token_to_id(self.tokenizer.eos_token)
        self.block_nums = len(self.blocks_)
        self.lm = Lm(self.lm_)
        # chatglm embedding and lm using same param, copy embedding when using bf16
        if self.embed_bf16:
            import copy
            embed_copy = copy.deepcopy(self.embed_)
            self.embed = Embedding(embed_copy, self.embed_bf16)
        else:
            self.embed = Embedding(self.embed_, self.embed_bf16)
        self.blocks = [GLMBlock(self.blocks_[i], i, self.final_layernorm_ if i == len(self.blocks_) - 1 else None) for i in range(self.block_nums)]
        # some config for export
        self.past_kv_shape = [28, 2, 0, 1, 32, 128]
        self.block_dynamic_axes = {
            "inputs_embeds" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 2: "seq_len" },
            "past_key_values" : { 1: "history_len" }
        }
        self.model_dynamic_axes = {
            "input_ids" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 2: "seq_len" },
            "past_key_values" : { 2: "history_len" }
        }

    def get_attention_mask(self) -> torch.Tensor:
        if self.token_len:
            return torch.zeros([1]).bool().reshape([1, 1, 1, 1])
        attention_mask = torch.zeros([self.seq_len, self.seq_len], dtype=torch.bool)
        for i in range(self.seq_len):
            attention_mask[i][-1] = True
        attention_mask = attention_mask.reshape([1, 1, self.seq_len, self.seq_len])
        return attention_mask

    def get_position_ids(self) -> torch.Tensor:
        if self.token_len:
            return torch.tensor([1, self.seq_len - self.context_len]).reshape([1, 2, 1])
        position_ids_0 = torch.arange(self.seq_len, dtype=torch.long)
        position_ids_1 = torch.zeros(self.seq_len, dtype=torch.long)
        position_ids_1[-1] = 1
        position_ids = torch.stack([position_ids_0, position_ids_1]).view(1, 2, -1)
        return position_ids

# chatglm2
class GLM2Block(torch.nn.Module):
    def __init__(self, block, block_id, final_layernorm = None):
        super().__init__()
        self.block = block
        self.block_id = block_id
        self.final_layernorm = final_layernorm

    def forward(self, hidden_states, attention_mask, position_ids, past_kv):
        theta = 1.0 / (10000 ** (torch.arange(0, 64, 2, dtype=torch.float32) / 64))
        position_ids = position_ids.float().reshape(-1, 1)
        idx_theta = position_ids * theta
        rotary_pos_emb = torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1).unsqueeze(0).contiguous()
        hidden_states, presents = self.block(hidden_states,
                                            attention_mask,
                                            kv_cache=past_kv,
                                            rotary_pos_emb=rotary_pos_emb)
        if self.final_layernorm is not None:
            hidden_states = self.final_layernorm(hidden_states)
            hidden_states = hidden_states.view(-1, 4096)[-1].view(1, 1, 4096)
        if isinstance(presents, tuple):
            presents = torch.stack(presents)
        return hidden_states, presents

class Chatglm2_6b(LLM):
    def __init__(self, args):
        super().__init__(args)
        self.model_name = 'Chatglm2_6b'
        if 'codegeex2-6b' in args.path:
            self.model_name = 'Codegeex2_6b'

    def load_model(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        model = AutoModel.from_pretrained(model_path, trust_remote_code=True).float().eval()
        transformer = model.transformer
        self.lm_ = transformer.output_layer
        self.embed_ = transformer.embedding.word_embeddings
        self.blocks_ = transformer.encoder.layers
        self.final_layernorm_ = transformer.encoder.final_layernorm
        # some wrapper
        self.stop_id = self.tokenizer.eos_token_id
        if self.stop_id is None:
            # codegeex2-6b
            self.stop_id = self.tokenizer.tokenizer.eos_id
        self.block_nums = len(self.blocks_)
        self.embed = Embedding(self.embed_, self.embed_bf16)
        self.lm = Lm(self.lm_)
        self.blocks = [GLM2Block(self.blocks_[i], i, self.final_layernorm_ if i == len(self.blocks_) - 1 else None) for i in range(self.block_nums)]
        # some config for export
        self.past_kv_shape = [28, 2, 0, 1, 2, 128]
        self.block_dynamic_axes = {
            "inputs_embeds" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 0: "seq_len" },
            "past_key_values" : { 1: "history_len" }
        }
        self.model_dynamic_axes = {
            "input_ids" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 0: "seq_len" },
            "past_key_values" : { 2: "history_len" }
        }

    def get_attention_mask(self) -> torch.Tensor:
        if self.token_len:
            return torch.zeros([1, 1, 1, 1]).bool()
        attention_mask = ~torch.tril(torch.ones([1, 1, self.seq_len, self.seq_len]).bool())
        return attention_mask

    def get_position_ids(self) -> torch.Tensor:
        if self.token_len:
            return torch.tensor([self.token_len], dtype=torch.long)
        return torch.arange(self.seq_len, dtype=torch.long)

# qwen
class QWENBlock(torch.nn.Module):
    def __init__(self, block, block_id, final_layernorm = None):
        super().__init__()
        self.block = block
        self.block_id = block_id
        self.final_layernorm = final_layernorm

    def forward(self, hidden_states, attention_mask, position_ids, past_kv):
        theta = 1.0 / (10000.0 ** (torch.arange(0, 128, 2, dtype=torch.float32) / 128))
        position_ids = position_ids.float().reshape(-1, 1)
        idx_theta = position_ids * theta
        rotary_pos_emb = torch.cat((idx_theta, idx_theta), dim=-1)
        rotary_pos_emb = rotary_pos_emb.unsqueeze(1).unsqueeze(0)
        hidden_states = hidden_states.view(1, -1, 4096)
        hidden_states, presents = self.block(hidden_states,
                                             past_kv,
                                             attention_mask,
                                             rotary_pos_emb,
                                             use_cache=True)
        if self.final_layernorm is not None:
            hidden_states = self.final_layernorm(hidden_states)
            hidden_states = hidden_states.view(-1, 4096)[-1].view(1, 1, 4096)
        if isinstance(presents, tuple):
            presents = torch.stack(presents)
        return hidden_states, presents

class Qwen_7b_Chat(LLM):
    def __init__(self, args):
        super().__init__(args)

    def load_model(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).float().eval()
        transformer = model.transformer
        self.lm_ = model.lm_head
        self.embed_ = transformer.wte
        self.blocks_ = transformer.h
        self.final_layernorm_ = transformer.ln_f
        # some wrapper
        self.stop_id = self.tokenizer.im_end_id
        self.block_nums = len(self.blocks_)
        self.embed = Embedding(self.embed_, self.embed_bf16)
        self.lm = Lm(self.lm_)
        self.blocks = [QWENBlock(self.blocks_[i], i, self.final_layernorm_ if i == len(self.blocks_) - 1 else None) for i in range(self.block_nums)]
        # some config for export
        self.past_kv_shape = [32, 2, 1, 0, 32, 128]
        self.block_dynamic_axes = {
            "inputs_embeds" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 0: "seq_len" },
            "past_key_values" : { 2: "history_len" }
        }
        self.model_dynamic_axes = {
            "input_ids" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 0: "seq_len" },
            "past_key_values" : { 3: "history_len" }
        }

    def build_prompt(self, query):
        return f'\n<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n'

    def get_attention_mask(self) -> torch.Tensor:
        if self.token_len:
            return torch.ones([1, 1, 1, 1]).bool()
        return torch.tril(torch.ones([1, 1, self.seq_len, self.seq_len]).bool())

    def get_position_ids(self) -> torch.Tensor:
        if self.token_len:
            return torch.tensor([self.seq_len - 1], dtype=torch.long)
        return torch.arange(self.seq_len, dtype=torch.long)

# llama2
class LLAMA2Block(torch.nn.Module):
    def __init__(self, block, block_id, final_layernorm = None):
        super().__init__()
        self.block = block
        self.block_id = block_id
        self.final_layernorm = final_layernorm

    def forward(self, hidden_states, attention_mask, position_ids, past_kv):
        hidden_states = hidden_states.view(1, -1, 4096)
        hidden_states, presents = self.block(hidden_states,
                                             attention_mask,
                                             position_ids,
                                             past_kv,
                                             use_cache=True)
        if self.final_layernorm is not None:
            hidden_states = self.final_layernorm(hidden_states)
            hidden_states = hidden_states.view(-1, 4096)[-1].view(1, 1, 4096)
        if isinstance(presents, tuple):
            presents = torch.stack(presents)
        return hidden_states, presents

class Llama2_7b_Chat(LLM):
    def __init__(self, args):
        super().__init__(args)
        self.model_name = 'Llama2_7b'
        if 'Baichuan2' in args.path:
            self.model_name = 'Baichuan2_7B'

    def load_model(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).float().eval()
        transformer = model.model
        self.lm_ = model.lm_head
        self.embed_ = transformer.embed_tokens
        self.blocks_ = transformer.layers
        self.final_layernorm_ = transformer.norm
        # some wrapper
        self.stop_id = self.tokenizer.eos_token_id
        self.block_nums = len(self.blocks_)
        self.embed = Embedding(self.embed_, self.embed_bf16)
        self.lm = Lm(self.lm_)
        self.blocks = [LLAMA2Block(self.blocks_[i], i, self.final_layernorm_ if i == len(self.blocks_) - 1 else None) for i in range(self.block_nums)]
        # some config for export
        self.past_kv_shape = [32, 2, 1, 32, 0, 128]
        self.block_dynamic_axes = {
            "inputs_embeds" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 0: "seq_len" },
            "past_key_values" : { 3: "history_len" }
        }
        self.model_dynamic_axes = {
            "input_ids" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 1: "seq_len" },
            "past_key_values" : { 4: "history_len" }
        }

    def build_prompt(self, query):
        if 'Baichuan2' in self.model_name:
            return f'<reserved_106>{query}<reserved_107>'
        return f'[INST]{query}[/INST]'


    def get_attention_mask(self) -> torch.Tensor:
        if self.token_len:
            return torch.zeros([1, 1, 1, self.seq_len], dtype=torch.float32)
        return (1 - torch.tril(torch.ones([1, 1, self.seq_len, self.seq_len]))) * torch.finfo(torch.float32).min

    def get_position_ids(self) -> torch.Tensor:
        if self.token_len:
            return torch.tensor([[self.seq_len - 1]], dtype=torch.long)
        return torch.arange(self.seq_len, dtype=torch.long).unsqueeze(0)

if __name__ == '__main__':
    llm_models = {
        'chatglm-6b': Chatglm_6b,
        'chatglm2-6b': Chatglm2_6b,
        'codegeex2-6b': Chatglm2_6b,
        'Qwen-7B-Chat': Qwen_7b_Chat,
        'Baichuan2-7B-Chat': Llama2_7b_Chat,
        'Llama-2-7b-chat-ms': Llama2_7b_Chat
    }
    parser = argparse.ArgumentParser(description='LLMExporter', formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('--path', type=str, default='THUDM/chatglm-6b', required=True,
                        help='path(`str` or `os.PathLike`):\nCan be either:'
                        '\n\t- A string, the *model id* of a pretrained model like `THUDM/chatglm-6b`. [TODO]'
                        '\n\t- A path to a *directory* clone from repo like `../chatglm-6b`.')
    parser.add_argument('--type', type=str, choices=llm_models.keys(), default=None,
                        help='type(`str`, *optional*):'
                        '\n\tThe pretrain llm model type.'
                        )
    parser.add_argument('--export_path', type=str, default='./onnx', help='export onnx model path, default is `./onnx`.')
    parser.add_argument('--export_verbose', action='store_true', default=False, help='Whether or not to export onnx with verbose.')
    parser.add_argument('--export_test', action='store_true', help='Whether or not to export onnx with test using onnxruntime.')
    parser.add_argument('--test', type=str, help='test model inference with query `TEST`.')
    parser.add_argument('--export', action='store_true', help='export model to an `onnx` model.')
    parser.add_argument('--export_split', action='store_true',
                        help='export model split to some `onnx` models:'
                        '\n\t- embedding model.'
                        '\n\t- block models.'
                        '\n\t- lm_head model.'
                        )
    parser.add_argument('--export_token', action='store_true', help='export llm tokenizer to a txt file.')
    parser.add_argument('--export_embed', action='store_true', help='export llm embedding to an `onnx` model.')
    parser.add_argument('--export_lm', action='store_true', help='export llm lm_head to an `onnx` model.')
    parser.add_argument('--export_block', type=int, help='export llm block [id] to an `onnx` model.')
    parser.add_argument('--export_blocks', action='store_true', help='export llm all blocks to `onnx` models.')
    parser.add_argument('--embed_bf16', action='store_true', help='using `bfloat16` replace `float32` in embedding.')


    args = parser.parse_args()
    model_path = args.path
    model_type = args.type
    # model type not specified, infer it from the path
    if model_type is None:
        for model in llm_models:
            if model in model_path:
                model_type = model
    if model_type is None:
        raise RuntimeError('Please specify model type.')

    # copy modeling py file to pretrain model for export
    for file in glob.glob(f'./llm_models/{model_type}/*'):
        shutil.copy2(file, model_path)

    llm_exporter = llm_models[model_type](args)

    # some actions
    if args.test is not None:
        llm_exporter.response(args.test)

    if args.export:
        llm_exporter.export()

    if args.export_token:
        llm_exporter.export_tokenizer()

    if args.export_embed or args.export_split:
        llm_exporter.export_embed()

    if args.export_lm or args.export_split:
        llm_exporter.export_lm()

    if args.export_blocks or args.export_split:
        llm_exporter.export_blocks()

    if args.export_block is not None:
        llm_exporter.export_block(args.export_block)
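
The `export_tokenizer` method above writes one line per sentencepiece token in the form `base64(token) score type`. As a minimal sketch of that format, the snippet below builds a line the same way and reads it back. The helper names `format_line` and `parse_line` are hypothetical and not part of the exporter; the tiktoken branch writes only the base64 field, so this reader covers the sentencepiece branch only.

```python
import base64

# Token type ids as numbered in export_tokenizer above.
TOKEN_TYPES = {1: "NORMAL", 2: "UNKNOWN", 3: "CONTROL",
               4: "USER_DEFINED", 5: "UNUSED", 6: "BYTE"}


def format_line(token: str, score: float, type_id: int) -> str:
    """Build one tokenizer.txt line the same way export_tokenizer does."""
    encoded = base64.b64encode(token.encode("utf-8")).decode("utf8")
    return f"{encoded} {score} {type_id}"


def parse_line(line: str):
    """Hypothetical reader: split a line back into (token, score, type name)."""
    encoded, score, type_id = line.rstrip("\n").split(" ")
    token = base64.b64decode(encoded).decode("utf-8")
    return token, float(score), TOKEN_TYPES[int(type_id)]


line = format_line(" hello", -3.5, 1)
print(line)
print(parse_line(line))
```

Base64 output never contains a space, so splitting each line on single spaces is safe even when the token itself contains spaces.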
  • Conversion command
python3 llm_export.py --path ../Llama2-7b/ --type Llama-2-7b-chat-ms --export_path onnx --export
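
The exported `llm.onnx` takes four inputs whose shapes change between the first (prefill) step and the later single-token steps. The sketch below only mirrors the shape logic of `Llama2_7b_Chat` in the script above; the function name `llama2_input_shapes` is hypothetical, and the defaults (32 layers, 32 heads, head size 128) are read off `past_kv_shape` in that class.

```python
def llama2_input_shapes(seq_len: int, token_len: int,
                        layers: int = 32, heads: int = 32, head_dim: int = 128):
    """Shapes of the four model inputs for one forward step.

    seq_len   -- tokens seen so far, including the one(s) fed in this step
    token_len -- tokens already generated (0 means the prefill step)
    """
    prefill = token_len == 0
    new_tokens = seq_len if prefill else 1      # tokens fed in this step
    history = 0 if prefill else seq_len - 1     # cached positions in the KV cache
    return {
        "input_ids": [new_tokens],
        "attention_mask": [1, 1, new_tokens, seq_len],
        "position_ids": [1, new_tokens],
        "past_key_values": [layers, 2, 1, heads, history, head_dim],
    }


# Prefill with a 3-token prompt, then the first decode step.
print(llama2_input_shapes(seq_len=3, token_len=0))
print(llama2_input_shapes(seq_len=4, token_len=1))
```

These are the same four tensors that the TensorRT shape profile and the Triton `config.pbtxt` later in this article have to describe.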

 

Triton deployment: pull the images

docker pull nvcr.io/nvidia/tritonserver:23.08-py3
docker pull nvcr.io/nvidia/tensorrt:23.08-py3
  • Start the Triton server from the image
docker run -it --gpus=1 --rm --net=host -v $(pwd)/model-repository:/models nvcr.io/nvidia/tritonserver:23.08-py3 tritonserver --model-repository=/models
  • View the model configuration (config.pbtxt)
 curl localhost:8000/v2/models/${model_name}/config
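
The same configuration can be read from Python. This is a minimal sketch using only the standard library; `model_config_url` and `fetch_model_config` are hypothetical helper names, and it assumes the server listens on `localhost:8000` as in the `curl` command.

```python
import json
import urllib.request


def model_config_url(model_name: str, host: str = "localhost:8000") -> str:
    """URL of Triton's model configuration endpoint for one model."""
    return f"http://{host}/v2/models/{model_name}/config"


def fetch_model_config(model_name: str, host: str = "localhost:8000",
                       timeout: float = 5.0) -> dict:
    """Fetch the configuration as a dict; requires a running Triton server."""
    with urllib.request.urlopen(model_config_url(model_name, host), timeout=timeout) as resp:
        return json.load(resp)


# Only the URL is built here, so this runs without a server.
print(model_config_url("trt"))
```

With the server running, `fetch_model_config("trt")` returns the configuration of the model named `trt`.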
  • Start a container from the TensorRT image
 docker run --gpus all -it --rm -v $(pwd):/models nvcr.io/nvidia/tensorrt:23.08-py3

  • ONNX to TensorRT conversion
trtexec --onnx=./model.onnx --saveEngine=./trt/model.plan --optShapes=input_ids:1,attention_mask:1x1x1x1026,position_ids:1x1,past_key_values:32x2x1x32x1025x128 --minShapes=input_ids:1,attention_mask:1x1x1x1,position_ids:1x1,past_key_values:32x2x1x32x0x128 --maxShapes=input_ids:1024,attention_mask:1x1x1024x2049,position_ids:1x1024,past_key_values:32x2x1x32x1025x128 --device=1 --fp16
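
The `--minShapes`, `--optShapes` and `--maxShapes` values are long and easy to mistype. As a small sketch, the hypothetical helper `format_shapes` below turns a dict of shapes into the `name:AxBxC,...` string that `trtexec` expects; the example values are copied from the `--minShapes` flag above.

```python
def format_shapes(shapes: dict) -> str:
    """Format {input name: list of dims} as 'name:AxB,name2:CxD' for trtexec."""
    return ",".join(
        f"{name}:{'x'.join(str(d) for d in dims)}" for name, dims in shapes.items()
    )


# Values copied from the --minShapes flag of the command above.
min_shapes = {
    "input_ids": [1],
    "attention_mask": [1, 1, 1, 1],
    "position_ids": [1, 1],
    "past_key_values": [32, 2, 1, 32, 0, 128],
}
print("--minShapes=" + format_shapes(min_shapes))
```

Keeping the three profiles as dicts also makes it easier to compare the history length used in `attention_mask` with the one used in `past_key_values`.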
  • config.pbtxt
name: "trt",
platform: "tensorrt_plan",
max_batch_size: 0,
input: [{
        name: "past_key_values",
        data_type: TYPE_FP32,
        dims: [32, 2, 1, 32, -1, 128],
}, {
        name: "position_ids",
        data_type: TYPE_INT32,
        dims: [1, -1],
}, {
        name: "attention_mask",
        data_type: TYPE_FP32,
        dims: [1, 1, -1, -1],
}, {
        name: "input_ids",
        data_type: TYPE_INT32,
        dims: [-1],
}],
output: [{
        name: "presents",
        data_type: TYPE_FP32,
        dims: [32, 2, 1, 32, -1, 128],
}, {
        name: "token_id",
        data_type: TYPE_INT32,
        dims: [1],
}]
default_model_filename: "model.plan"               # file name of the TensorRT engine
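
Because `max_batch_size` is 0, each `dims` entry describes the full tensor shape, and `-1` marks a variable-size dimension. The sketch below checks a concrete shape against such a `dims` list before a request is sent; `shape_matches` is a hypothetical helper, not a Triton API.

```python
def shape_matches(dims, shape) -> bool:
    """True if `shape` fits `dims`, where -1 in `dims` accepts any size."""
    if len(dims) != len(shape):
        return False
    return all(d == -1 or d == s for d, s in zip(dims, shape))


# dims copied from the past_key_values entry of the config above.
past_kv_dims = [32, 2, 1, 32, -1, 128]
print(shape_matches(past_kv_dims, [32, 2, 1, 32, 0, 128]))
print(shape_matches(past_kv_dims, [32, 2, 1, 16, 5, 128]))
```

The first call checks the empty cache used in the prefill step; the second uses a different head count, which the fixed dimensions reject.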

Inspect the input and output dimensions of the ONNX model

import onnx

model = onnx.load(r"model.onnx")

# The model is represented as a protobuf structure and it can be accessed
# using the standard python-for-protobuf methods

# iterate through inputs of the graph
for input in model.graph.input:
    print (input.name, end=": ")
    # get type of input tensor
    tensor_type = input.type.tensor_type
    # check if it has a shape:
    if (tensor_type.HasField("shape")):
        # iterate through dimensions of the shape:
        for d in tensor_type.shape.dim:
            # the dimension may have a definite (integer) value or a symbolic identifier or neither:
            if (d.HasField("dim_value")):
                print (d.dim_value, end=", ")  # known dimension
            elif (d.HasField("dim_param")):
                print (d.dim_param, end=", ")  # unknown dimension with symbolic name
            else:
                print ("?", end=", ")  # unknown dimension with no name
    else:
        print ("unknown rank", end="")
    print()


# iterate through outputs of the graph
for output in model.graph.output:
    print (output.name, end=": ")
    # get type of output tensor
    tensor_type = output.type.tensor_type
    # check if it has a shape:
    if (tensor_type.HasField("shape")):
        # iterate through dimensions of the shape:
        for d in tensor_type.shape.dim:
            # the dimension may have a definite (integer) value or a symbolic identifier or neither:
            if (d.HasField("dim_value")):
                print (d.dim_value, end=", ")  # known dimension
            elif (d.HasField("dim_param")):
                print (d.dim_param, end=", ")  # unknown dimension with symbolic name
            else:
                print ("?", end=", ")  # unknown dimension with no name
    else:
        print ("unknown rank", end="")
    print()

Inspect more detailed information

import onnx
from onnx import helper

# Load the model
def loadOnnxModel(path):
    model = onnx.load(path)
    return model

# Get a node and the lists of its input and output names. Inputs that come from
# the previous layer's output are usually first in the list, parameters last.
def getNodeAndIOname(nodename,model):
    for node in model.graph.node:
        if node.name == nodename:
            return node,node.input,node.output
    # fail loudly instead of returning an unbound variable
    raise ValueError("node not found: "+nodename)

# Get the value info for the given input names
def getInputTensorValueInfo(input_name,model):
    in_tvi = []
    for name in input_name:
        for params_input in model.graph.input:
            if params_input.name == name:
               in_tvi.append(params_input)
        for inner_output in model.graph.value_info:
            if inner_output.name == name:
                in_tvi.append(inner_output)
    return in_tvi

# Get the value info for the given output names
def getOutputTensorValueInfo(output_name,model):
    out_tvi = []
    for name in output_name:
        # extend rather than reassign, so earlier names are not discarded
        out_tvi.extend(inner_output for inner_output in model.graph.value_info if inner_output.name == name)
        out_tvi.extend(graph_output for graph_output in model.graph.output if graph_output.name == name)
    return out_tvi

# Get the initializers (parameter values) for the given input names
def getInitTensorValue(input_name,model):
    init_t = []
    for name in input_name:
        # extend rather than reassign, so earlier names are not discarded
        init_t.extend(init for init in model.graph.initializer if init.name == name)
    return init_t

# Build an ONNX model that contains a single node
def createSingelOnnxModel(ModelPath,nodename,SaveType="",SavePath=""):
    model = loadOnnxModel(str(ModelPath))
    Node,input_name,output_name = getNodeAndIOname(nodename,model)
    in_tvi = getInputTensorValueInfo(input_name,model)
    out_tvi = getOutputTensorValueInfo(output_name,model)
    init_t = getInitTensorValue(input_name,model)

    graph_def = helper.make_graph(
                [Node],
                nodename,
                inputs=in_tvi,  # inputs
                outputs=out_tvi,  # outputs
                initializer=init_t,  # initializers
            )
    model_def = helper.make_model(graph_def, producer_name='onnx-example')
    # save only when a path is given
    if SavePath:
        onnx.save(model_def, SavePath)
    print(nodename+" onnx model created")
    return model_def
# Get the number of nodes
def getNodeNum(model):
    return len(model.graph.node)
# Get the distinct node (operator) types
def getNodetype(model):
    op_name = []
    for i in range(len(model.graph.node)):
        if model.graph.node[i].op_type not in op_name:
            op_name.append(model.graph.node[i].op_type)
    return op_name
# Get the list of node names
def getNodeNameList(model):
    NodeNameList = []
    for i in range(len(model.graph.node)):
        NodeNameList.append(model.graph.node[i].name)
    return NodeNameList
# Get the model's first input
def getModelInputInfo(model):
    return model.graph.input[0]
# Get the model's first output
def getModelOutputInfo(model):
    return model.graph.output[0]







model = onnx.load(r"model.onnx")
node_num = getNodeNum(model)
print("Number of nodes:", node_num)

node_types = getNodetype(model)
print("Node types:")
for node_type in node_types:
    print(node_type)

#node_names = getNodeNameList(model)
#print("Node names:")
#for node_name in node_names:
#    print(node_name)

model_input_info = getModelInputInfo(model)
print("Model input info:")
print(model_input_info)

model_output_info = getModelOutputInfo(model)
print("Model output info:")
print(model_output_info)





#output_onnx_file_path="./shape"
#estimated_graph = onnx.shape_inference.infer_shapes(model)
#onnx.save(estimated_graph, output_onnx_file_path)

def parseRepeatedScalarContainer(container):
    # node.input is a repeated string field; copy its entries into a plain list
    values = []
    for element in container:
        print("input name:", element)
        values.append(element)
    return values

def getNodeNameList_test(model):
    NodeNameList = []
    target_nodes = ["/blocks_.0/self_attn/If", "/blocks_.0/self_attn/If_1", "/blocks_.0/self_attn/If_2", "/blocks_.0/self_attn/If_3"]
    for node in model.graph.node:
        if node.name in target_nodes:
            NodeNameList.append(node.name)
            print("Node name:", node.name)
            #print("type:", type(node.input))

            # Print the names of the tensors this node consumes
            input_items = parseRepeatedScalarContainer(node.input)

    return NodeNameList
node_names = getNodeNameList_test(model)
print("Node names:")
for node_name in node_names:
    print(node_name)
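
The helper functions above all walk `model.graph.node` and read each node's `name`, `op_type`, and `input` fields. As a minimal sketch of the same traversal that runs without an ONNX file, the snippet below uses `types.SimpleNamespace` objects as stand-ins for ONNX nodes. The stand-in nodes, their tensor names, and the helpers `count_op_types` and `find_nodes` are assumptions made for illustration and are not part of the original script.

```python
from collections import Counter
from types import SimpleNamespace

# Hypothetical stand-ins for ONNX nodes; a real list would come from
# onnx.load("model.onnx").graph.node
nodes = [
    SimpleNamespace(name="/embed/Gather", op_type="Gather", input=["input_ids", "embed.weight"]),
    SimpleNamespace(name="/blocks_.0/self_attn/MatMul", op_type="MatMul", input=["h0", "q.weight"]),
    SimpleNamespace(name="/blocks_.0/self_attn/If", op_type="If", input=["cond0"]),
    SimpleNamespace(name="/blocks_.0/self_attn/MatMul_1", op_type="MatMul", input=["h0", "k.weight"]),
]


def count_op_types(graph_nodes):
    """Map each op_type to the number of nodes of that type (first-seen order)."""
    return Counter(node.op_type for node in graph_nodes)


def find_nodes(graph_nodes, op_type):
    """Return the names of all nodes with the given op_type."""
    return [node.name for node in graph_nodes if node.op_type == op_type]


op_counts = count_op_types(nodes)
if_nodes = find_nodes(nodes, "If")
print(op_counts)
print(if_nodes)
```

Counting operator types this way is a quick check for control-flow nodes such as `If`, which the script above looks up by name.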


  • Inference test
import os
import base64
import glob
import shutil
import argparse
import torch
import numpy as np
import onnxruntime as ort
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
import tritonclient.http as httpclient


class LLM(torch.nn.Module):
    '''
    Base class for all LLM input generators. Inherits from [`torch.nn.Module`].
    '''

    def __init__(self, path):
        super().__init__()
        self.load_tokenizer(path)
        self.max_length = 1024

    def load_tokenizer(self, model_path: str):
        raise NotImplementedError

    # Subclasses receive the length of the current input, matching the calls below
    def get_attention_mask(self, seq_len) -> torch.Tensor:
        raise NotImplementedError

    def get_position_ids(self, seq_len) -> torch.Tensor:
        raise NotImplementedError

    # some test functions
    def build_prompt(self, query):
        # Use the tokenizer's own prompt template if it provides one
        if hasattr(self.tokenizer, 'build_prompt'):
            prompt = self.tokenizer.build_prompt(query)
        else:
            prompt = query
        return prompt

    def str_to_ids(self, prompt):
        input_ids = self.tokenizer(prompt, return_tensors="pt")['input_ids']
        return input_ids

    def id_to_str(self, token_id):
        # Convert the first id in the returned array to a token, then to text
        word = self.tokenizer._convert_id_to_token(int(token_id[0]))
        # print(word)  # debug output
        word = self.tokenizer.convert_tokens_to_string([word+" "])
        return word

    def generate_inputs(self, query):
        # Build the inputs for the first request: the whole prompt and an empty KV cache
        prompt = self.build_prompt(query)
        # Drop only the batch dimension so a one-token prompt stays 1-D
        input_ids = self.str_to_ids(prompt).squeeze(0)
        self.seq_len = input_ids.numel()
        self.context_len = self.seq_len - 2
        attention_mask = self.get_attention_mask(self.seq_len)
        position_ids = self.get_position_ids(self.seq_len)
        past_key_values = torch.zeros(self.past_kv_shape)

        inputs = {
            'input_ids' : input_ids.detach().numpy(),
            'attention_mask' : attention_mask.numpy(),
            'position_ids' : position_ids.numpy(),
            'past_key_values' : past_key_values.numpy()
        }

        return inputs


class Llama2_7b_input_generator(LLM):

    def load_tokenizer(self, path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
        self.stop_id = self.tokenizer.eos_token_id
        if self.stop_id is None:
            self.stop_id = self.tokenizer.tokenizer.eos_id
        # some config for export
        # [layers, key/value, batch, heads, cached tokens, head_dim]; the cache starts empty
        self.past_kv_shape = [32, 2, 1, 32, 0, 128]

    def get_attention_mask(self, seq_len) -> torch.Tensor:
        # Causal mask: 0 on and below the diagonal, float32 minimum above it
        return (1 - torch.tril(torch.ones([1, 1, seq_len, seq_len]))) * torch.finfo(torch.float32).min

    def get_position_ids(self, seq_len) -> torch.Tensor:
        # Positions 0..seq_len-1 with a leading batch dimension
        return torch.arange(seq_len, dtype=torch.long).unsqueeze(0)



if __name__ == '__main__':

    query = "please introduce java"

    # Upper bound on the number of generated tokens
    max_infer_length = 1024
    #cp -r model_path/tokenizer /XXX/model-repository/trt/1/tokenizer
    tokenizer_path = "/XXX/model-repository/trt/1/tokenizer"
    input_generator = Llama2_7b_input_generator(tokenizer_path)
    inputs_tokenized = input_generator.generate_inputs(query)
    # print(inputs_tokenized)
    input_ids = inputs_tokenized["input_ids"]
    past_key_values = inputs_tokenized['past_key_values']
    # ort_session = ort.InferenceSession(onnx_model, providers=['CUDAExecutionProvider'])
    # onnx_outs = ort_session.run(None, inputs_tokenized)
    # print("input_ids shape: ", input_ids.shape)
    # print("atten mask shape: ", attention_mask.shape)
    # print("position ids shape: ", position_ids.shape)
    # print("past kv shape: ", past_key_values.shape)
    # output_str = input_generator.id_to_str(onnx_outs[0])
    # print(onnx_outs[0],output_str)

    import tritonclient.http as httpclient

    triton_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=False)
    model_name = "trt"

    token_len = 0  # number of tokens generated so far

    # Stop at the length limit; the original loop had no exit condition
    while token_len < max_infer_length:

        cur_input_len = input_ids.shape[0]
        # Number of tokens already in the KV cache (0 on the first request)
        past_len = past_key_values.shape[4]
        attention_mask = input_generator.get_attention_mask(cur_input_len).numpy()
        # Offset by the cached length so each new token gets its true position;
        # without the offset every generated token would be sent as position 0
        position_ids = input_generator.get_position_ids(cur_input_len).numpy() + past_len

        # Triton checks the numpy dtype against the declared datatype
        input_ids = np.array(input_ids, dtype=np.int32)
        position_ids = np.array(position_ids, dtype=np.int32)

        inputs = [
            httpclient.InferInput('input_ids', list(input_ids.shape), "INT32"),
            httpclient.InferInput('attention_mask', list(attention_mask.shape), "FP32"),
            httpclient.InferInput('position_ids', list(position_ids.shape), "INT32"),
            httpclient.InferInput('past_key_values', list(past_key_values.shape), "FP32")
        ]

        inputs[0].set_data_from_numpy(input_ids)
        inputs[1].set_data_from_numpy(attention_mask)
        inputs[2].set_data_from_numpy(position_ids)
        inputs[3].set_data_from_numpy(past_key_values)

        outputs = [
            httpclient.InferRequestedOutput('token_id'),
            httpclient.InferRequestedOutput('presents')
        ]

        results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)

        token_id = results.as_numpy("token_id")
        token_len += 1
        # Stop when the model emits the end-of-sequence token
        if int(token_id[0]) == input_generator.stop_id:
            break

        infer_token_str = input_generator.id_to_str(token_id)

        print(infer_token_str,end="", flush=True)

        # Next request: feed only the new token and reuse the returned KV cache
        input_ids = token_id
        past_key_values = results.as_numpy("presents")
        # print(past_key_values.shape)
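
A generation loop of this kind sends the whole prompt in the first request and then one token per request, reusing the returned `presents` as the next `past_key_values`. The following pure-Python sketch illustrates that bookkeeping without a Triton server: the causal mask pattern, the position ids for each step, and the two usual stop conditions (end-of-sequence token and a length limit). `fake_step`, `STOP_ID`, and the token ids are made-up values for illustration, and `NEG_INF` stands in for `torch.finfo(torch.float32).min`.

```python
NEG_INF = float("-inf")  # stand-in for the float32 minimum used in the real mask
STOP_ID = 2              # assumed end-of-sequence id, for illustration only


def causal_mask(seq_len):
    """Row i may attend to columns 0..i; later columns get a large negative value."""
    return [[0.0 if col <= row else NEG_INF for col in range(seq_len)]
            for row in range(seq_len)]


def position_ids(step_len, past_len):
    """Positions of the tokens fed in this step, counted from the sequence start."""
    return list(range(past_len, past_len + step_len))


def generate(prompt_ids, step_fn, max_new_tokens):
    """Greedy loop: feed the prompt once, then one token at a time."""
    input_ids = list(prompt_ids)
    past_len = 0          # tokens already held in the KV cache
    generated = []
    positions_log = []
    while len(generated) < max_new_tokens:
        positions_log.append(position_ids(len(input_ids), past_len))
        next_token = step_fn(input_ids, past_len)
        past_len += len(input_ids)
        if next_token == STOP_ID:
            break
        generated.append(next_token)
        input_ids = [next_token]
    return generated, positions_log


# Hypothetical stand-in for one inference request: replays scripted token ids
scripted = iter([11, 12, 13, STOP_ID, 99])


def fake_step(input_ids, past_len):
    return next(scripted)


tokens, positions = generate([5, 6, 7], fake_step, max_new_tokens=16)
mask = causal_mask(3)
print(mask)
print(tokens)
print(positions)
```

In this sketch the first step covers positions 0 to 2 and every later step covers exactly one position, equal to the number of tokens already cached.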
从0开始进行llama模型转换(hf->onnx,onnx->trt)和triton部署

2023-10-12 08:04:12
703
0

环境安装(NVIDIA\Docker)

配置仓库

curl -fsSL ****s://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L ****s://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb ****s://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] ****s://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list \
  && \
    sudo apt-get update

以上命令用于在 Ubuntu 系统上安装 NVIDIA Container Toolkit 的命令。

  1. curl -fsSL ****s://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg

    这个命令使用 curl 下载 NVIDIA Container Toolkit 的 GPG 密钥,并通过 sudo gpg --dearmor 对密钥进行解密、转换。然后将解密后的密钥保存到 /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg 文件中,以供后续操作使用。

  2. curl -s -L ****s://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb ****s://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] ****s://#g'

    这一部分使用 curl 下载 NVIDIA Container Toolkit 的软件包源列表,并通过 sed 命令对下载的源列表进行处理。具体来说,它将源列表中的 deb ****s:// 替换为 deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] ****s://,以确保软件包在安装时经过验证。

  3. sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

    这个命令将修改后的软件包源列表写入 /etc/apt/sources.list.d/nvidia-container-toolkit.list 文件中。

  4. sudo apt-get update

    最后,这个命令用于更新软件包列表,以使系统能够识别并安装 NVIDIA Container Toolkit。

综上所述,该命令的目的是下载 NVIDIA Container Toolkit 的 GPG 密钥,配置适当的软件包源列表,并使用 apt-get update 命令更新软件包列表,以便安装 NVIDIA Container Toolkit。

安装NVIDIA Container Toolkit 包

sudo apt-get install -y nvidia-container-toolkit

安装NVIDIA驱动程序

  • 在 Linux 环境下,你可以使用以下方法来查看显卡类型:
  1. 使用 lspci 命令: 打开终端,并执行以下命令:

    lspci | grep -i vga

    这将显示与图形相关的设备信息,包括显卡型号。

  2. 使用 lshw 命令: 打开终端,并执行以下命令:

    sudo lshw -C display

    这将显示详细的显卡信息,包括显卡型号、供应商等。

  3. 使用系统监控工具: 大多数 Linux 发行版都提供了系统监控工具,如 GNOME System Monitor、KSysGuard 等。这些工具通常提供图形化界面,你可以通过它们查看显卡型号和其他硬件信息。

一旦你确定了显卡型号,你可以访问 NVIDIA 官方网站*****://***.nvidia.com/Download/index.aspx并选择正确的驱动程序版本来下载。确保选择与你的显卡型号和 Linux 发行版相匹配的驱动程序版本。在安装驱动程序之前,请确保你已经按照适用于你的 Linux 发行版的指南正确安装了所需的依赖项和预安装要求。

  • 执行安装
 
# 对安装包添加执行权限
chmod +x NVIDIA-Linux-x86_64-535.104.12.run
# 安装gcc和linux-kernel-headers
sudo apt-get install gcc linux-kernel-headers
# 运行驱动安装程序
sudo sh NVIDIA-Linux-x86_64-535.104.12.run --disable-nouveau
# 查看驱动是否安装成功
nvidia-smi
  • 安装NVIDIA Container Toolkit 组件
wget ****://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda_11.7.0_515.43.04_linux.run
# 安装CUDA
bash cuda_11.7.0_515.43.04_linux.run
# 编辑环境变量文件 
vi ~/.bashrc
# 增加环境变量
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
# 使环境变量生效
source ~/.bashrc
# 查看是否安装成功
nvcc -V

安装Miniconda

wget *****://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# 安装Miniconda3
bash Miniconda3-latest-Linux-x86_64.sh
# 配置conda环境变量
vim /etc/profile
# 添加环境变量
export ANACONDA_PATH=~/miniconda3
export PATH=$PATH:$ANACONDA_PATH/bin
# 使环境变量生效
source /etc/profile
# 查看是否安装成功
which anaconda
conda --version
conda info -e
python
# 查看虚拟环境
conda env list
 

安装cudnn

从*****://developer.nvidia.com/rdp/cudnn-download下载cudnn压缩包并上传至GPU云主机,按照如下步骤进行安装。

# 解压
tar -xf cudnn-linux-x86_64-8.9.4.25_cuda11-archive.tar.xz
# 进目录
cd cudnn cudnn-linux-x86_64-8.9.4.25_cuda11-archive
# 复制
cp ./include/*  /usr/local/cuda-11.7/include/
cp ./lib/libcudnn*  /usr/local/cuda-11.7/lib64/ 
# 授权
chmod a+r /usr/local/cuda-11.7/include/* /usr/local/cuda-11.7/lib64/libcudnn*
# 查看是否安装成功
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
​

 

安装docker

在 Linux Nvidia 环境下安装 Docker,需要先安装 Nvidia Container Toolkit,然后再安装 Docker。

以下是具体步骤:

1.安装依赖包

sudo apt-get update
sudo apt-get install -y curl gnupg2 software-properties-common

2.添加 Docker 官方 GPG 密钥

curl *****://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

3.添加 Docker 软件源

sudo add-apt-repository "deb [arch=amd64] *****://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"

4.安装 Docker

sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

配置Docker

configure the docker container runtime , it will modify /etc/docker/daemon.json

sudo nvidia-ctk runtime configure --runtime=docker

Restart docker service to enable the config

sudo systemctl restart docker

 

解释:

  1. 配置 Docker 容器运行时:一旦安装了 NVIDIA Container Toolkit,你需要配置 Docker 容器运行时以使用 NVIDIA Container Toolkit。你可以使用以下命令来配置 Docker 容器运行时:

    sudo nvidia-container-runtime configure --runtime=docker
  2. 重启 Docker:最后,为了使更改生效,你需要重启 Docker 服务。你可以使用以下命令来重新启动 Docker 服务:

    sudo systemctl restart docker

现在,你已经成功地在 NVIDIA GPU 环境中安装了 Docker,并且将其配置为使用 NVIDIA Container Toolkit 运行容器。你可以通过运行 docker run 命令来启动一个 Docker 容器,并确保它能够正常工作。

验证Docker是否可以使用

it just use nvidia-smi to list GPU status

sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi

安装模型转换的依赖

  • 编写requirements.txt
onnx==1.14.1
numpy
onnxruntime==1.15.1
torch==2.0.1
transformers==4.31.0
transformers_stream_generator==0.0.4
sentencepiece==0.1.99
tritonclient
gevent****client
gevent


  • 安装依赖
pip install -r requirements.txt -i ****s://pypi.mirrors.ustc.edu.cn/simple/

创建llm_models/Llama-2-7b-chat-ms/modeling_llama.py

# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     ****://***.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch LLaMA model."""
import math
from typing import List, Optional, Tuple, Union

import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss

from transformers.activations import ACT2FN
from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
from transformers.modeling_utils import PreTrainedModel
from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
from .configuration_llama import LlamaConfig

logger = logging.get_logger(__name__)

_CONFIG_FOR_DOC = "LlamaConfig"


# Copied from transformers.models.bart.modeling_bart._make_causal_mask
def _make_causal_mask(
    input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
):
    """
    Make causal mask used for bi-directional self-attention.
    """
    bsz, tgt_len = input_ids_shape
    mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
    mask_cond = torch.arange(mask.size(-1), device=device)
    mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
    mask = mask.to(dtype)

    if past_key_values_length > 0:
        mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
    return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)


# Copied from transformers.models.bart.modeling_bart._expand_mask
def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
    """
    Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
    """
    bsz, src_len = mask.size()
    tgt_len = tgt_len if tgt_len is not None else src_len

    expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)

    inverted_mask = 1.0 - expanded_mask

    return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)


class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)


class LlamaRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq)

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)

    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)

        return (
            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
        )


class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
    """LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""

    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
        self.scaling_factor = scaling_factor
        super().__init__(dim, max_position_embeddings, base, device)

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
        t = t / self.scaling_factor

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)


class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
    """LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""

    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
        self.scaling_factor = scaling_factor
        super().__init__(dim, max_position_embeddings, base, device)

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len

        if seq_len > self.max_position_embeddings:
            base = self.base * (
                (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
            ) ** (self.dim / (self.dim - 2))
            inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
            self.register_buffer("inv_freq", inv_freq)

        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)


def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2 :]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    # cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    # sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    #cos = torch.squeeze(cos)  # [seq_len, dim]
    #sin = torch.squeeze(sin)  # [seq_len, dim]
    #cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    #sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    cos = cos.view(cos.shape[2], cos.shape[3])   # [seq_len, dim]
    cos = cos[position_ids].view(position_ids.shape[0], 1, position_ids.shape[1], cos.shape[1])  # [bs, 1, seq_len, dim]
    sin = sin.view(sin.shape[2], sin.shape[3])  # [seq_len, dim]
    sin = sin[position_ids].view(position_ids.shape[0], 1, position_ids.shape[1], sin.shape[1])  # [bs, 1, seq_len, dim]

    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


class LlamaMLP(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.pretraining_tp = config.pretraining_tp
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        if self.pretraining_tp > 1:
            slice = self.intermediate_size // self.pretraining_tp
            gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
            up_proj_slices = self.up_proj.weight.split(slice, dim=0)
            down_proj_slices = self.down_proj.weight.split(slice, dim=1)

            gate_proj = torch.cat([F.linear(x, gate_proj_slices[i]) for i in range(self.pretraining_tp)], dim=-1)
            up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.pretraining_tp)], dim=-1)

            intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
            down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.pretraining_tp)]
            down_proj = sum(down_proj)
        else:
            down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))

        return down_proj


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


class LlamaAttention(nn.Module):
    """Multi-headed attention from 'Attention Is All You Need' paper"""

    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
        self.pretraining_tp = config.pretraining_tp
        self.max_position_embeddings = config.max_position_embeddings

        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
        self._init_rope()

    def _init_rope(self):
        if self.config.rope_scaling is None:
            self.rotary_emb = LlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)
        else:
            scaling_type = self.config.rope_scaling["type"]
            scaling_factor = self.config.rope_scaling["factor"]
            if scaling_type == "linear":
                self.rotary_emb = LlamaLinearScalingRotaryEmbedding(
                    self.head_dim, max_position_embeddings=self.max_position_embeddings, scaling_factor=scaling_factor
                )
            elif scaling_type == "dynamic":
                self.rotary_emb = LlamaDynamicNTKScalingRotaryEmbedding(
                    self.head_dim, max_position_embeddings=self.max_position_embeddings, scaling_factor=scaling_factor
                )
            else:
                raise ValueError(f"Unknown RoPE scaling type {scaling_type}")

    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: bool = False,
        use_cache: bool = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        bsz, q_len, _ = hidden_states.size()

        if self.pretraining_tp > 1:
            key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.pretraining_tp
            query_slices = self.q_proj.weight.split((self.num_heads * self.head_dim) // self.pretraining_tp, dim=0)
            key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
            value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)

            query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.pretraining_tp)]
            query_states = torch.cat(query_states, dim=-1)

            key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.pretraining_tp)]
            key_states = torch.cat(key_states, dim=-1)

            value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.pretraining_tp)]
            value_states = torch.cat(value_states, dim=-1)

        else:
            query_states = self.q_proj(hidden_states)
            key_states = self.k_proj(hidden_states)
            value_states = self.v_proj(hidden_states)

        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)

        kv_seq_len = key_states.shape[-2]
        if past_key_value is not None:
            kv_seq_len += past_key_value[0].shape[-2]
        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
        query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)

        if past_key_value is not None:
            # reuse k, v, self_attention
            key_states = torch.cat([past_key_value[0], key_states], dim=2)
            value_states = torch.cat([past_key_value[1], value_states], dim=2)

        past_key_value = (key_states, value_states) if use_cache else None

        # repeat k/v heads if n_kv_heads < n_heads
        key_states = repeat_kv(key_states, self.num_key_value_groups)
        value_states = repeat_kv(value_states, self.num_key_value_groups)

        attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)

        if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
            raise ValueError(
                f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
                f" {attn_weights.size()}"
            )

        if attention_mask is not None:
            if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
                )
            attn_weights = attn_weights + attention_mask

        # upcast attention to fp32
        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
        attn_output = torch.matmul(attn_weights, value_states)

        if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
                f" {attn_output.size()}"
            )

        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

        if self.pretraining_tp > 1:
            attn_output = attn_output.split(self.hidden_size // self.pretraining_tp, dim=2)
            o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.pretraining_tp, dim=1)
            attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.pretraining_tp)])
        else:
            attn_output = self.o_proj(attn_output)

        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights, past_key_value


class LlamaDecoderLayer(nn.Module):
    def __init__(self, config: LlamaConfig):
        super().__init__()
        self.hidden_size = config.hidden_size
        self.self_attn = LlamaAttention(config=config)
        self.mlp = LlamaMLP(config)
        self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
        self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
    ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
        """
        Args:
            hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
            attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
                `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
            output_attentions (`bool`, *optional*):
                Whether or not to return the attentions tensors of all attention layers. See `attentions` under
                returned tensors for more detail.
            use_cache (`bool`, *optional*):
                If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
                (see `past_key_values`).
            past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
        """

        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs


LLAMA_START_DOCSTRING = r"""
    This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
    library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
    etc.)

    This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
    Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
    and behavior.

    Parameters:
        config ([`LlamaConfig`]):
            Model configuration class with all the parameters of the model. Initializing with a config file does not
            load the weights associated with the model, only the configuration. Check out the
            [`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""


@add_start_docstrings(
    "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
    LLAMA_START_DOCSTRING,
)
class LlamaPreTrainedModel(PreTrainedModel):
    config_class = LlamaConfig
    base_model_prefix = "model"
    supports_gradient_checkpointing = True
    _no_split_modules = ["LlamaDecoderLayer"]
    _skip_keys_device_placement = "past_key_values"

    def _init_weights(self, module):
        std = self.config.initializer_range
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=std)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()

    def _set_gradient_checkpointing(self, module, value=False):
        if isinstance(module, LlamaModel):
            module.gradient_checkpointing = value


LLAMA_INPUTS_DOCSTRING = r"""
    Args:
        input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
            Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
            it.

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            [What are input IDs?](../glossary#input-ids)
        attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
            Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

            - 1 for tokens that are **not masked**,
            - 0 for tokens that are **masked**.

            [What are attention masks?](../glossary#attention-mask)

            Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
            [`PreTrainedTokenizer.__call__`] for details.

            If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
            `past_key_values`).

            If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
            and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
            information on the default strategy.
        position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
            config.max_position_embeddings - 1]`.

            [What are position IDs?](../glossary#position-ids)
        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
            Tuple of `tuple(torch.FloatTensor)` of length `config.num_hidden_layers`, with each tuple having 2 tensors
            of shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`.

            Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
            `past_key_values` input) to speed up sequential decoding.

            If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that
            don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
            `input_ids` of shape `(batch_size, sequence_length)`.
        inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
            Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
            is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
            model's internal embedding lookup matrix.
        use_cache (`bool`, *optional*):
            If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
            `past_key_values`).
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
            tensors for more detail.
        output_hidden_states (`bool`, *optional*):
            Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
            more detail.
        return_dict (`bool`, *optional*):
            Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""


@add_start_docstrings(
    "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
    LLAMA_START_DOCSTRING,
)
class LlamaModel(LlamaPreTrainedModel):
    """
    Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]

    Args:
        config: LlamaConfig
    """

    def __init__(self, config: LlamaConfig):
        super().__init__(config)
        self.padding_idx = config.pad_token_id
        self.vocab_size = config.vocab_size

        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
        self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
        self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embed_tokens

    def set_input_embeddings(self, value):
        self.embed_tokens = value

    # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
    def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
        # create causal mask
        # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
        combined_attention_mask = None
        if input_shape[-1] > 1:
            combined_attention_mask = _make_causal_mask(
                input_shape,
                inputs_embeds.dtype,
                device=inputs_embeds.device,
                past_key_values_length=past_key_values_length,
            )

        if attention_mask is not None:
            # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
            expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
                inputs_embeds.device
            )
            combined_attention_mask = (
                expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
            )

        return combined_attention_mask

    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        use_cache = use_cache if use_cache is not None else self.config.use_cache

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # retrieve input_ids and inputs_embeds
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
        elif input_ids is not None:
            batch_size, seq_length = input_ids.shape
        elif inputs_embeds is not None:
            batch_size, seq_length, _ = inputs_embeds.shape
        else:
            raise ValueError("You have to specify either input_ids or inputs_embeds")

        seq_length_with_past = seq_length
        past_key_values_length = 0

        if past_key_values is not None:
            past_key_values_length = past_key_values[0][0].shape[2]
            seq_length_with_past = seq_length_with_past + past_key_values_length

        if position_ids is None:
            device = input_ids.device if input_ids is not None else inputs_embeds.device
            position_ids = torch.arange(
                past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
            )
            position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
        else:
            position_ids = position_ids.view(-1, seq_length).long()

        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        # build the attention mask; by default every position (cached and new) is attended to
        if attention_mask is None:
            attention_mask = torch.ones(
                (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
            )
        attention_mask = self._prepare_decoder_attention_mask(
            attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
        )

        hidden_states = inputs_embeds

        if self.gradient_checkpointing and self.training:
            if use_cache:
                logger.warning_once(
                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                )
                use_cache = False

        # decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        next_decoder_cache = () if use_cache else None

        for idx, decoder_layer in enumerate(self.layers):
            if output_hidden_states:
                all_hidden_states += (hidden_states,)

            past_key_value = past_key_values[idx] if past_key_values is not None else None

            if self.gradient_checkpointing and self.training:

                def create_custom_forward(module):
                    def custom_forward(*inputs):
                        # the trailing None fills `use_cache`; `past_key_value` is the None passed to checkpoint()
                        return module(*inputs, output_attentions, None)

                    return custom_forward

                layer_outputs = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(decoder_layer),
                    hidden_states,
                    attention_mask,
                    position_ids,
                    None,
                )
            else:
                layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=attention_mask,
                    position_ids=position_ids,
                    past_key_value=past_key_value,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
                )

            hidden_states = layer_outputs[0]

            if use_cache:
                next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)

            if output_attentions:
                all_self_attns += (layer_outputs[1],)

        hidden_states = self.norm(hidden_states)

        # add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = next_decoder_cache if use_cache else None
        if not return_dict:
            return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
        )


class LlamaForCausalLM(LlamaPreTrainedModel):
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config):
        super().__init__(config)
        self.model = LlamaModel(config)
        self.pretraining_tp = config.pretraining_tp
        self.vocab_size = config.vocab_size
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    def set_decoder(self, decoder):
        self.model = decoder

    def get_decoder(self):
        return self.model

    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
    @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        r"""
        Args:
            labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
                Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
                config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are
                ignored (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.

        Returns:

        Example:

        ```python
        >>> from transformers import AutoTokenizer, LlamaForCausalLM

        >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
        >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)

        >>> prompt = "Hey, are you conscious? Can you talk to me?"
        >>> inputs = tokenizer(prompt, return_tensors="pt")

        >>> # Generate
        >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
        >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
        "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
        ```"""

        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        hidden_states = outputs[0]
        if self.pretraining_tp > 1:
            lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.pretraining_tp, dim=0)
            logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.pretraining_tp)]
            logits = torch.cat(logits, dim=-1)
        else:
            logits = self.lm_head(hidden_states)
        logits = logits.float()

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Enable model parallelism
            shift_labels = shift_labels.to(shift_logits.device)
            loss = loss_fct(shift_logits, shift_labels)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )

    def prepare_inputs_for_generation(
        self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
    ):
        if past_key_values:
            input_ids = input_ids[:, -1:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -1].unsqueeze(-1)

        # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "position_ids": position_ids,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "attention_mask": attention_mask,
            }
        )
        return model_inputs

    @staticmethod
    def _reorder_cache(past_key_values, beam_idx):
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (
                tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
            )
        return reordered_past


@add_start_docstrings(
    """
    The LLaMA Model transformer with a sequence classification head on top (linear layer).

    [`LlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal models
    (e.g. GPT-2) do.

    Since it does classification on the last token, it needs to know the position of the last token. If a
    `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
    no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
    padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
    each row of the batch).
    """,
    LLAMA_START_DOCSTRING,
)
class LlamaForSequenceClassification(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.model = LlamaModel(config)
        self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
    def forward(
        self,
        input_ids: torch.LongTensor = None,
        attention_mask: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
            Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
            config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
            `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        hidden_states = transformer_outputs[0]
        logits = self.score(hidden_states)

        if input_ids is not None:
            batch_size = input_ids.shape[0]
        else:
            batch_size = inputs_embeds.shape[0]

        if self.config.pad_token_id is None and batch_size != 1:
            raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
        if self.config.pad_token_id is None:
            sequence_lengths = -1
        else:
            if input_ids is not None:
                sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
            else:
                sequence_lengths = -1

        pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]

        loss = None
        if labels is not None:
            labels = labels.to(logits.device)
            if self.config.problem_type is None:
                if self.num_labels == 1:
                    self.config.problem_type = "regression"
                elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
                    self.config.problem_type = "single_label_classification"
                else:
                    self.config.problem_type = "multi_label_classification"

            if self.config.problem_type == "regression":
                loss_fct = MSELoss()
                if self.num_labels == 1:
                    loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
                else:
                    loss = loss_fct(pooled_logits, labels)
            elif self.config.problem_type == "single_label_classification":
                loss_fct = CrossEntropyLoss()
                loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
            elif self.config.problem_type == "multi_label_classification":
                loss_fct = BCEWithLogitsLoss()
                loss = loss_fct(pooled_logits, labels)
        if not return_dict:
            output = (pooled_logits,) + transformer_outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return SequenceClassifierOutputWithPast(
            loss=loss,
            logits=pooled_logits,
            past_key_values=transformer_outputs.past_key_values,
            hidden_states=transformer_outputs.hidden_states,
            attentions=transformer_outputs.attentions,
        )
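
The way `LlamaForSequenceClassification.forward` picks one row of logits per sequence can be checked on a small batch. The sketch below mirrors the expression `torch.ne(input_ids, pad_token_id).sum(-1) - 1` on plain Python lists instead of tensors. The helper name `last_token_indices` and the sample token ids are hypothetical, and a `pad_token_id` of 0 is assumed for illustration.

```python
def last_token_indices(input_ids, pad_token_id):
    """Plain-list mirror of (torch.ne(input_ids, pad_token_id).sum(-1) - 1)."""
    # Count the non-padding tokens in each row, then subtract 1 to get an index.
    return [sum(1 for tok in row if tok != pad_token_id) - 1 for row in input_ids]


batch = [
    [11, 12, 13, 0, 0],     # right-padded: last real token sits at index 2
    [21, 22, 23, 24, 25],   # no padding: last real token sits at index 4
    [0, 0, 31, 32, 33],     # left-padded: the formula still returns 2, not 4
]
print(last_token_indices(batch, pad_token_id=0))  # [2, 4, 2]
```

The third row shows that counting non-padding tokens only lands on the last real token when padding is on the right; with left-padded input the computed index points into the middle of the sequence.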

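The position-id logic in `prepare_inputs_for_generation` above is easier to follow on a concrete batch. The sketch below reproduces `attention_mask.long().cumsum(-1) - 1` followed by `masked_fill_(attention_mask == 0, 1)` with plain Python lists rather than tensors. The helper name `position_ids_from_mask` is hypothetical and not part of the file.

```python
from itertools import accumulate


def position_ids_from_mask(attention_mask):
    """Plain-list mirror of the cumsum / masked_fill steps (hypothetical helper)."""
    position_ids = []
    for row in attention_mask:
        # Running count of real tokens, shifted so the first real token gets position 0.
        ids = [total - 1 for total in accumulate(row)]
        # Padding slots receive the dummy position 1, as masked_fill_(mask == 0, 1) does.
        ids = [1 if keep == 0 else pos for keep, pos in zip(row, ids)]
        position_ids.append(ids)
    return position_ids


# A left-padded row next to an unpadded one.
mask = [[0, 0, 1, 1, 1], [1, 1, 1, 1, 1]]
print(position_ids_from_mask(mask))  # [[1, 1, 0, 1, 2], [0, 1, 2, 3, 4]]
```

With left padding, the first real token of every row starts at position 0, so a prompt gets the same positions in a batch as it would on its own.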
Create llm_models/Llama-2-7b-chat-ms/configuration_llama.py

# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""LLaMA model configuration"""

from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging


logger = logging.get_logger(__name__)

LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}


class LlamaConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the LLaMA-7B.
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.
    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`LlamaModel`].
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 11008):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by mean-pooling all the original heads within that group. For more details check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
            `num_attention_heads`.
        pretraining_tp (`int`, *optional*, defaults to `1`):
            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
            document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
            necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
            issue](https://github.com/pytorch/pytorch/issues/76232).
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used with. Typically set this to something large
            just in case (e.g., 512 or 1024 or 2048).
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-6):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie the input and output word embeddings.
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format
            is `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
            `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
            these scaling strategies behave:
            https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
            experimental feature, subject to breaking API changes in future versions.
        Example:
    ```python
    >>> from transformers import LlamaModel, LlamaConfig
    >>> # Initializing a LLaMA llama-7b style configuration
    >>> configuration = LlamaConfig()
    >>> # Initializing a model from the llama-7b style configuration
    >>> model = LlamaModel(configuration)
    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""
    model_type = "llama"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=32000,
        hidden_size=4096,
        intermediate_size=11008,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=None,
        hidden_act="silu",
        max_position_embeddings=2048,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=0,
        bos_token_id=1,
        eos_token_id=2,
        pretraining_tp=1,
        tie_word_embeddings=False,
        rope_scaling=None,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_scaling = rope_scaling
        self._rope_scaling_validation()

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )

    def _rope_scaling_validation(self):
        """
        Validate the `rope_scaling` configuration.
        """
        if self.rope_scaling is None:
            return

        if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
            raise ValueError(
                "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
                f"got {self.rope_scaling}"
            )
        rope_scaling_type = self.rope_scaling.get("type", None)
        rope_scaling_factor = self.rope_scaling.get("factor", None)
        if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
            raise ValueError(
                f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
            )
        if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
            raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
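
The rules in `_rope_scaling_validation` can be tried out without installing `transformers`. The following is a minimal standalone sketch that mirrors the checks above; `validate_rope_scaling` and `is_valid` are hypothetical helper names, not part of the library.

```python
def validate_rope_scaling(rope_scaling):
    """Mirror the checks in LlamaConfig._rope_scaling_validation (standalone sketch)."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(f"expected a dict with `type` and `factor`, got {rope_scaling}")
    scaling_type = rope_scaling.get("type")
    factor = rope_scaling.get("factor")
    if scaling_type not in ("linear", "dynamic"):
        raise ValueError(f"`type` must be 'linear' or 'dynamic', got {scaling_type}")
    # The check requires a float, so an int such as 2 is rejected.
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(f"`factor` must be a float > 1, got {factor}")


def is_valid(rope_scaling):
    """Return True if the dictionary passes validation, False otherwise."""
    try:
        validate_rope_scaling(rope_scaling)
        return True
    except ValueError:
        return False


print(is_valid(None))                                # no scaling configured
print(is_valid({"type": "linear", "factor": 2.0}))   # accepted
print(is_valid({"type": "linear", "factor": 2}))     # int factor
print(is_valid({"type": "ntk", "factor": 2.0}))      # unsupported strategy
```

Note that the factor has to be written as `2.0` rather than `2`, because the check uses `isinstance(..., float)`.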

Create llm_models/Llama-2-7b-chat-ms/config.json

{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "auto_map": {
    "AutoModelForCausalLM": "modeling_llama.LlamaForCausalLM"
  },
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.31.0.dev0",
  "use_cache": true,
  "vocab_size": 32000
}
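
The `past_kv_shape` that the export script hard-codes for Llama 2 (`[32, 2, 1, 32, 0, 128]`) follows from the fields in this config. A small sketch, assuming only the relevant fields are needed (they are repeated inline here rather than read from disk); `past_kv_shape` is a hypothetical helper:

```python
import json

# Relevant fields copied from the config.json above.
CONFIG_JSON = """
{
  "hidden_size": 4096,
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32
}
"""


def past_kv_shape(config, history_len=0, batch=1):
    """Return [layers, 2 (key/value), batch, kv_heads, history_len, head_dim]."""
    head_dim = config["hidden_size"] // config["num_attention_heads"]
    return [
        config["num_hidden_layers"],
        2,
        batch,
        config["num_key_value_heads"],
        history_len,
        head_dim,
    ]


config = json.loads(CONFIG_JSON)
print(past_kv_shape(config))
```

The zero in the fifth position is the history length, which is the axis the export marks as dynamic.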

hf->onnx

  • Conversion code
import os
import base64
import glob
import shutil
import argparse
import torch
import numpy as np
import onnxruntime as ort
import sentencepiece as spm
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# some wrapper class for export
class Embedding(torch.nn.Module):
    def __init__(self, embed, using_bf16: bool = False):
        super().__init__()
        self.bf16 = using_bf16
        if using_bf16:
            # using bf16 embedding weight
            self.embed = embed.bfloat16()
        else:
            self.embed = embed

    def forward(self, input_ids):
        res = self.embed(input_ids)
        if self.bf16:
            res = res.float()
        return res.view(-1, 1, 4096)

class Lm(torch.nn.Module):
    def __init__(self, lm):
        super().__init__()
        self.lm = lm

    def forward(self, hidden_states):
        m_logits = self.lm(hidden_states)
        token = torch.argmax(m_logits)
        return token

class LLM(torch.nn.Module):
    '''
    Base class for all llm model. Inherits from [`torch.nn.Module`].
    '''

    def __init__(self, args):
        super().__init__()
        self.export_path = args.export_path
        self.export_verbose = args.export_verbose
        self.export_test = args.export_test
        self.embed_bf16 = args.embed_bf16
        tokenizer_model = os.path.join(args.path, 'tokenizer.model')
        if os.path.exists(tokenizer_model):
            self.sp_model = spm.SentencePieceProcessor(tokenizer_model)
        else:
            self.sp_model = None
        self.load_model(args.path)
        self.max_length = 1024

    def load_model(self, model_path: str):
        raise NotImplementedError

    def get_attention_mask(self) -> torch.Tensor:
        raise NotImplementedError

    def get_position_ids(self) -> torch.Tensor:
        raise NotImplementedError

    def export_vocab(self):
        raise NotImplementedError

    def forward(self, input_ids, attention_mask, position_ids, past_key_values):
        hidden_states = self.embed(input_ids)
        presents = []
        for i in range(self.block_nums):
            hidden_states, kv = self.blocks[i](hidden_states, attention_mask, position_ids, past_key_values[i])
            presents.append(kv)
        token_id = self.lm(hidden_states).view(1)
        presents = torch.stack(presents)
        self.seq_len += 1
        self.token_len += 1
        return token_id, presents

    # some test functions
    def build_prompt(self, query):
        if hasattr(self.tokenizer, 'build_prompt'):
            prompt = self.tokenizer.build_prompt(query)
        else:
            prompt = query
        return prompt

    def str_to_ids(self, prompt):
        input_ids = self.tokenizer(prompt, return_tensors="pt")['input_ids']
        return input_ids

    def id_to_str(self, token_id):
        word = self.tokenizer._convert_id_to_token(int(token_id))
        word = self.tokenizer.convert_tokens_to_string([word])
        return word

    def response(self, query):
        prompt = self.build_prompt(query)
        input_ids = self.str_to_ids(prompt)
        self.seq_len = input_ids.numel()
        self.context_len = self.seq_len - 2
        self.token_len = 0
        past_key_values = [None for i in range(self.block_nums)]
        token_id = input_ids
        while self.token_len < self.max_length:
            attention_mask = self.get_attention_mask()
            position_ids = self.get_position_ids()
            token_id, past_key_values = self.forward(token_id, attention_mask, position_ids, past_key_values)
            if token_id == self.stop_id:
                print("", end='\n')
                break
            word = self.id_to_str(token_id)
            print(word, end="", flush=True)

    # some export functions
    def assert_equal(self, torch_outs, onnx_outs):
        if type(torch_outs) not in (list, tuple):
            torch_outs = (torch_outs, )
            onnx_outs = (onnx_outs, )
        same = True
        for orig, onnx in zip(torch_outs, onnx_outs):
            orig = orig.detach().numpy()
            if not np.allclose(orig, onnx, rtol=1e-3, atol=1e-3):
                print('Error: onnx outputs do not match original. [shape = {}] onnx: {}, original: {}'.format(onnx.shape, onnx, orig))
                same = False
                break
        if same:
            print('onnx test SUCCESS')

    def export_lm(self):
        model = self.lm
        hidden_states = torch.randn(1, 4096)
        onnx_model = f'./{self.export_path}/lm.onnx'
        torch.onnx.export(model, (hidden_states),
                        onnx_model,
                        verbose=self.export_verbose,
                        input_names=['hidden_states'],
                        output_names=['token_id'],
                        do_constant_folding=True,
                        opset_version=15)
        # test lm
        if self.export_test:
            original_outs = model(hidden_states)
            ort_session = ort.InferenceSession(onnx_model, providers=['CPUExecutionProvider'])
            inputs = {
                'hidden_states' : hidden_states.numpy(),
            }
            onnx_outs = ort_session.run(None, inputs)
            self.assert_equal(original_outs, onnx_outs)

    def export_embed(self):
        model = self.embed
        input_ids = torch.arange(3, dtype=torch.long)
        onnx_model = f'./{self.export_path}/embedding.onnx'
        torch.onnx.export(model, (input_ids),
                        onnx_model,
                        verbose=self.export_verbose,
                        input_names=['input_ids'],
                        output_names=['inputs_embeds'],
                        dynamic_axes={"input_ids": {
                            0: "length"
                        }},
                        do_constant_folding=True,
                        opset_version=15)
        # test
        if self.export_test:
            original_outs = model(input_ids)
            ort_session = ort.InferenceSession(onnx_model, providers=['CPUExecutionProvider'])
            inputs = {
                'input_ids' : input_ids.numpy(),
            }
            onnx_outs = ort_session.run(None, inputs)
            self.assert_equal(original_outs, onnx_outs)

    def export_block(self, block_id: int):
        self.seq_len = 3
        self.token_len = 0
        inputs_embeds = torch.randn((self.seq_len, 1, 4096))
        attention_mask =  self.get_attention_mask()
        position_ids = self.get_position_ids()
        past_key_values = torch.zeros(self.past_kv_shape[1:])
        model = self.blocks[block_id]
        onnx_model = f'./{self.export_path}/block_{block_id}.onnx'
        torch.onnx.export(
            model, (inputs_embeds, attention_mask, position_ids, past_key_values),
            onnx_model,
            verbose=self.export_verbose,
            input_names=[
                'inputs_embeds', 'attention_mask', 'position_ids', 'past_key_values'
            ],
            output_names=['hidden_states', 'presents'],
            dynamic_axes=self.block_dynamic_axes,
            do_constant_folding=True,
            opset_version=15)
        if self.export_test:
            original_outs = model(inputs_embeds, attention_mask, position_ids, past_key_values)
            ort_session = ort.InferenceSession(onnx_model, providers=['CPUExecutionProvider'])
            inputs = {
                'inputs_embeds' : inputs_embeds.detach().numpy(),
                'attention_mask' : attention_mask.numpy(),
                'position_ids' : position_ids.numpy(),
                'past_key_values' : past_key_values.numpy()
            }
            onnx_outs = ort_session.run(None, inputs)
            self.assert_equal(original_outs, onnx_outs)

    def export_blocks(self):
        for i in range(self.block_nums):
            self.export_block(i)

    def export(self):
        model = self
        self.seq_len = 3
        self.token_len = 0
        input_ids = torch.arange(3, dtype=torch.long)
        attention_mask =  self.get_attention_mask()
        position_ids = self.get_position_ids()
        past_key_values = torch.zeros(self.past_kv_shape)
        onnx_model = f'./{self.export_path}/llm.onnx'
        torch.onnx.export(
            model, (input_ids, attention_mask, position_ids, past_key_values),
            onnx_model,
            verbose=self.export_verbose,
            input_names=[
                'input_ids', 'attention_mask', 'position_ids', 'past_key_values'
            ],
            output_names=['token_id', 'presents'],
            dynamic_axes=self.model_dynamic_axes,
            do_constant_folding=True,
            opset_version=15)
        if self.export_test:
            # test
            original_outs = model(input_ids, attention_mask, position_ids, past_key_values)
            ort_session = ort.InferenceSession(onnx_model, providers=['CPUExecutionProvider'])
            inputs = {
                'input_ids' : input_ids.detach().numpy(),
                'attention_mask' : attention_mask.numpy(),
                'position_ids' : position_ids.numpy(),
                'past_key_values' : past_key_values.numpy()
            }
            onnx_outs = ort_session.run(None, inputs)
            self.assert_equal(original_outs, onnx_outs)

    def export_tokenizer(self):
        file_path = os.path.join(self.export_path, "tokenizer.txt")
        if self.sp_model is not None:
            # sentencepiece
            NORMAL = 1; UNKNOWN = 2; CONTROL = 3
            USER_DEFINED = 4; UNUSED = 5; BYTE = 6
            # use a context manager so the file is closed even if an error occurs
            with open(file_path, "w", encoding="utf8") as fp:
                for i in range(self.sp_model.GetPieceSize()):
                    token = self.sp_model.IdToPiece(i)
                    score = self.sp_model.GetScore(i)
                    token_type = NORMAL
                    if self.sp_model.IsUnknown(i):
                        token_type = UNKNOWN
                    elif self.sp_model.IsControl(i):
                        token_type = CONTROL
                    elif self.sp_model.IsUnused(i):
                        token_type = UNUSED
                    elif self.sp_model.IsByte(i):
                        token_type = BYTE
                    if self.model_name == 'Chatglm_6b':
                        if '<n>' in token: token = '\n'
                        if '<|tab|>' in token: token = '\t'
                        if '<|blank_' in token: token = ' ' * int(token[8:token.find('|>')])
                    if '▁' in token: token = token.replace('▁', ' ')
                    # one line per token: base64(token) score type
                    token_encode = base64.b64encode(token.encode("utf-8")).decode("utf8")
                    fp.write(f'{token_encode} {score} {token_type}\n')
        else:
            # tiktoken
            with open(file_path, "w", encoding="utf8") as fp:
                for k, v in self.tokenizer.mergeable_ranks.items():
                    line = base64.b64encode(k).decode("utf8") + "\n"
                    fp.write(line)

# chatglm
class GLMBlock(torch.nn.Module):
    def __init__(self, block, block_id, final_layernorm = None):
        super().__init__()
        self.block = block
        self.block_id = block_id
        self.final_layernorm = final_layernorm

    def forward(self, hidden_states, attention_mask, position_ids, past_kv):
        hidden_states, presents = self.block(hidden_states,
                                             position_ids,
                                             attention_mask,
                                             self.block_id,
                                             past_kv,
                                             use_cache=True)
        if self.final_layernorm is not None:
            hidden_states = self.final_layernorm(hidden_states)
            hidden_states = hidden_states.view(-1, 4096)[-1].view(1, 1, 4096)
        if isinstance(presents, tuple):
            presents = torch.stack(presents)
        return hidden_states, presents

class Chatglm_6b(LLM):
    def __init__(self, args):
        super().__init__(args)
        self.model_name = 'Chatglm_6b'

    def load_model(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        model = AutoModel.from_pretrained(model_path, trust_remote_code=True).float().eval()
        transformer = model.transformer
        self.lm_ = model.lm_head
        self.embed_ = transformer.word_embeddings
        self.blocks_ = transformer.layers
        self.final_layernorm_ = transformer.final_layernorm
        # some wrapper
        self.stop_id = self.tokenizer._convert_token_to_id(self.tokenizer.eos_token)
        self.block_nums = len(self.blocks_)
        self.lm = Lm(self.lm_)
        # chatglm embedding and lm using same param, copy embedding when using bf16
        if self.embed_bf16:
            import copy
            embed_copy = copy.deepcopy(self.embed_)
            self.embed = Embedding(embed_copy, self.embed_bf16)
        else:
            self.embed = Embedding(self.embed_, self.embed_bf16)
        self.blocks = [GLMBlock(self.blocks_[i], i, self.final_layernorm_ if i == len(self.blocks_) - 1 else None) for i in range(self.block_nums)]
        # some config for export
        self.past_kv_shape = [28, 2, 0, 1, 32, 128]
        self.block_dynamic_axes = {
            "inputs_embeds" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 2: "seq_len" },
            "past_key_values" : { 1: "history_len" }
        }
        self.model_dynamic_axes = {
            "input_ids" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 2: "seq_len" },
            "past_key_values" : { 2: "history_len" }
        }

    def get_attention_mask(self) -> torch.Tensor:
        if self.token_len:
            return torch.zeros([1]).bool().reshape([1, 1, 1, 1])
        attention_mask = torch.zeros([self.seq_len, self.seq_len], dtype=torch.bool)
        for i in range(self.seq_len):
            attention_mask[i][-1] = True
        attention_mask = attention_mask.reshape([1, 1, self.seq_len, self.seq_len])
        return attention_mask

    def get_position_ids(self) -> torch.Tensor:
        if self.token_len:
            return torch.tensor([1, self.seq_len - self.context_len]).reshape([1, 2, 1])
        position_ids_0 = torch.arange(self.seq_len, dtype=torch.long)
        position_ids_1 = torch.zeros(self.seq_len, dtype=torch.long)
        position_ids_1[-1] = 1
        position_ids = torch.stack([position_ids_0, position_ids_1]).view(1, 2, -1)
        return position_ids

# chatglm2
class GLM2Block(torch.nn.Module):
    def __init__(self, block, block_id, final_layernorm = None):
        super().__init__()
        self.block = block
        self.block_id = block_id
        self.final_layernorm = final_layernorm

    def forward(self, hidden_states, attention_mask, position_ids, past_kv):
        theta = 1.0 / (10000 ** (torch.arange(0, 64, 2, dtype=torch.float32) / 64))
        position_ids = position_ids.float().reshape(-1, 1)
        idx_theta = position_ids * theta
        rotary_pos_emb = torch.stack([torch.cos(idx_theta), torch.sin(idx_theta)], dim=-1).unsqueeze(0).contiguous()
        hidden_states, presents = self.block(hidden_states,
                                            attention_mask,
                                            kv_cache=past_kv,
                                            rotary_pos_emb=rotary_pos_emb)
        if self.final_layernorm is not None:
            hidden_states = self.final_layernorm(hidden_states)
            hidden_states = hidden_states.view(-1, 4096)[-1].view(1, 1, 4096)
        if isinstance(presents, tuple):
            presents = torch.stack(presents)
        return hidden_states, presents

class Chatglm2_6b(LLM):
    def __init__(self, args):
        super().__init__(args)
        self.model_name = 'Chatglm2_6b'
        if 'codegeex2-6b' in args.path:
            self.model_name = 'Codegeex2_6b'

    def load_model(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        model = AutoModel.from_pretrained(model_path, trust_remote_code=True).float().eval()
        transformer = model.transformer
        self.lm_ = transformer.output_layer
        self.embed_ = transformer.embedding.word_embeddings
        self.blocks_ = transformer.encoder.layers
        self.final_layernorm_ = transformer.encoder.final_layernorm
        # some wrapper
        self.stop_id = self.tokenizer.eos_token_id
        if self.stop_id is None:
            # codegeex2-6b
            self.stop_id = self.tokenizer.tokenizer.eos_id
        self.block_nums = len(self.blocks_)
        self.embed = Embedding(self.embed_, self.embed_bf16)
        self.lm = Lm(self.lm_)
        self.blocks = [GLM2Block(self.blocks_[i], i, self.final_layernorm_ if i == len(self.blocks_) - 1 else None) for i in range(self.block_nums)]
        # some config for export
        self.past_kv_shape = [28, 2, 0, 1, 2, 128]
        self.block_dynamic_axes = {
            "inputs_embeds" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 0: "seq_len" },
            "past_key_values" : { 1: "history_len" }
        }
        self.model_dynamic_axes = {
            "input_ids" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 0: "seq_len" },
            "past_key_values" : { 2: "history_len" }
        }

    def get_attention_mask(self) -> torch.Tensor:
        if self.token_len:
            return torch.zeros([1, 1, 1, 1]).bool()
        attention_mask = ~torch.tril(torch.ones([1, 1, self.seq_len, self.seq_len]).bool())
        return attention_mask

    def get_position_ids(self) -> torch.Tensor:
        if self.token_len:
            return torch.tensor([self.token_len], dtype=torch.long)
        return torch.arange(self.seq_len, dtype=torch.long)

# qwen
class QWENBlock(torch.nn.Module):
    def __init__(self, block, block_id, final_layernorm = None):
        super().__init__()
        self.block = block
        self.block_id = block_id
        self.final_layernorm = final_layernorm

    def forward(self, hidden_states, attention_mask, position_ids, past_kv):
        theta = 1.0 / (10000.0 ** (torch.arange(0, 128, 2, dtype=torch.float32) / 128))
        position_ids = position_ids.float().reshape(-1, 1)
        idx_theta = position_ids * theta
        rotary_pos_emb = torch.cat((idx_theta, idx_theta), dim=-1)
        rotary_pos_emb = rotary_pos_emb.unsqueeze(1).unsqueeze(0)
        hidden_states = hidden_states.view(1, -1, 4096)
        hidden_states, presents = self.block(hidden_states,
                                             past_kv,
                                             attention_mask,
                                             rotary_pos_emb,
                                             use_cache=True)
        if self.final_layernorm is not None:
            hidden_states = self.final_layernorm(hidden_states)
            hidden_states = hidden_states.view(-1, 4096)[-1].view(1, 1, 4096)
        if isinstance(presents, tuple):
            presents = torch.stack(presents)
        return hidden_states, presents

class Qwen_7b_Chat(LLM):
    def __init__(self, args):
        super().__init__(args)

    def load_model(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).float().eval()
        transformer = model.transformer
        self.lm_ = model.lm_head
        self.embed_ = transformer.wte
        self.blocks_ = transformer.h
        self.final_layernorm_ = transformer.ln_f
        # some wrapper
        self.stop_id = self.tokenizer.im_end_id
        self.block_nums = len(self.blocks_)
        self.embed = Embedding(self.embed_, self.embed_bf16)
        self.lm = Lm(self.lm_)
        self.blocks = [QWENBlock(self.blocks_[i], i, self.final_layernorm_ if i == len(self.blocks_) - 1 else None) for i in range(self.block_nums)]
        # some config for export
        self.past_kv_shape = [32, 2, 1, 0, 32, 128]
        self.block_dynamic_axes = {
            "inputs_embeds" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 0: "seq_len" },
            "past_key_values" : { 2: "history_len" }
        }
        self.model_dynamic_axes = {
            "input_ids" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 0: "seq_len" },
            "past_key_values" : { 3: "history_len" }
        }

    def build_prompt(self, query):
        return f'\n<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n'

    def get_attention_mask(self) -> torch.Tensor:
        if self.token_len:
            return torch.ones([1, 1, 1, 1]).bool()
        return torch.tril(torch.ones([1, 1, self.seq_len, self.seq_len]).bool())

    def get_position_ids(self) -> torch.Tensor:
        if self.token_len:
            return torch.tensor([self.seq_len - 1], dtype=torch.long)
        return torch.arange(self.seq_len, dtype=torch.long)

# llama2
class LLAMA2Block(torch.nn.Module):
    def __init__(self, block, block_id, final_layernorm = None):
        super().__init__()
        self.block = block
        self.block_id = block_id
        self.final_layernorm = final_layernorm

    def forward(self, hidden_states, attention_mask, position_ids, past_kv):
        hidden_states = hidden_states.view(1, -1, 4096)
        hidden_states, presents = self.block(hidden_states,
                                             attention_mask,
                                             position_ids,
                                             past_kv,
                                             use_cache=True)
        if self.final_layernorm is not None:
            hidden_states = self.final_layernorm(hidden_states)
            hidden_states = hidden_states.view(-1, 4096)[-1].view(1, 1, 4096)
        if isinstance(presents, tuple):
            presents = torch.stack(presents)
        return hidden_states, presents

class Llama2_7b_Chat(LLM):
    def __init__(self, args):
        super().__init__(args)
        self.model_name = 'Llama2_7b'
        if 'Baichuan2' in args.path:
            self.model_name = 'Baichuan2_7B'

    def load_model(self, model_path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True).float().eval()
        transformer = model.model
        self.lm_ = model.lm_head
        self.embed_ = transformer.embed_tokens
        self.blocks_ = transformer.layers
        self.final_layernorm_ = transformer.norm
        # some wrapper
        self.stop_id = self.tokenizer.eos_token_id
        self.block_nums = len(self.blocks_)
        self.embed = Embedding(self.embed_, self.embed_bf16)
        self.lm = Lm(self.lm_)
        self.blocks = [LLAMA2Block(self.blocks_[i], i, self.final_layernorm_ if i == len(self.blocks_) - 1 else None) for i in range(self.block_nums)]
        # some config for export
        self.past_kv_shape = [32, 2, 1, 32, 0, 128]
        self.block_dynamic_axes = {
            "inputs_embeds" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 0: "seq_len" },
            "past_key_values" : { 3: "history_len" }
        }
        self.model_dynamic_axes = {
            "input_ids" : { 0: "seq_len" },
            "attention_mask" : { 2: "seq_len", 3: "seq_len" },
            "position_ids" : { 1: "seq_len" },
            "past_key_values" : { 4: "history_len" }
        }

    def build_prompt(self, query):
        if 'Baichuan2' in self.model_name:
            return f'<reserved_106>{query}<reserved_107>'
        return f'[INST]{query}[/INST]'


    def get_attention_mask(self) -> torch.Tensor:
        if self.token_len:
            return torch.zeros([1, 1, 1, self.seq_len], dtype=torch.float32)
        return (1 - torch.tril(torch.ones([1, 1, self.seq_len, self.seq_len]))) * torch.finfo(torch.float32).min

    def get_position_ids(self) -> torch.Tensor:
        if self.token_len:
            return torch.tensor([[self.seq_len - 1]], dtype=torch.long)
        return torch.arange(self.seq_len, dtype=torch.long).unsqueeze(0)

if __name__ == '__main__':
    llm_models = {
        'chatglm-6b': Chatglm_6b,
        'chatglm2-6b': Chatglm2_6b,
        'codegeex2-6b': Chatglm2_6b,
        'Qwen-7B-Chat': Qwen_7b_Chat,
        'Baichuan2-7B-Chat': Llama2_7b_Chat,
        'Llama-2-7b-chat-ms': Llama2_7b_Chat
    }
    parser = argparse.ArgumentParser(description='LLMExporter', formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('--path', type=str, default='THUDM/chatglm-6b', required=True,
                        help='path(`str` or `os.PathLike`):\nCan be either:'
                        '\n\t- A string, the *model id* of a pretrained model like `THUDM/chatglm-6b`. [TODO]'
                        '\n\t- A path to a *directory* clone from repo like `../chatglm-6b`.')
    parser.add_argument('--type', type=str, choices=llm_models.keys(), default=None,
                        help='type(`str`, *optional*):'
                        '\n\tThe pretrain llm model type.'
                        )
    parser.add_argument('--export_path', type=str, default='./onnx', help='export onnx model path, default is `./onnx`.')
    parser.add_argument('--export_verbose', action='store_true', default=False, help='Whether or not to export onnx with verbose.')
    parser.add_argument('--export_test', action='store_true', help='Whether or not to export onnx with test using onnxruntime.')
    parser.add_argument('--test', type=str, help='test model inference with query `TEST`.')
    parser.add_argument('--export', action='store_true', help='export model to an `onnx` model.')
    parser.add_argument('--export_split', action='store_true',
                        help='export model split to some `onnx` models:'
                        '\n\t- embedding model.'
                        '\n\t- block models.'
                        '\n\t- lm_head model.'
                        )
    parser.add_argument('--export_token', action='store_true', help='export llm tokenizer to a txt file.')
    parser.add_argument('--export_embed', action='store_true', help='export llm embedding to an `onnx` model.')
    parser.add_argument('--export_lm', action='store_true', help='export llm lm_head to an `onnx` model.')
    parser.add_argument('--export_block', type=int, help='export llm block [id] to an `onnx` model.')
    parser.add_argument('--export_blocks', action='store_true', help='export llm all blocks to `onnx` models.')
    parser.add_argument('--embed_bf16', action='store_true', help='using `bfloat16` replace `float32` in embedding.')


    args = parser.parse_args()
    model_path = args.path
    model_type = args.type
    # model type not specified, infer it from the path
    if model_type is None:
        for model in llm_models:
            if model in model_path:
                model_type = model
    if model_type is None:
        raise RuntimeError('Please specify model type.')

    # copy modeling py file to pretrain model for export
    for file in glob.glob(f'./llm_models/{model_type}/*'):
        shutil.copy2(file, model_path)

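    # make sure the output directory exists before any export step writes to it
    os.makedirs(args.export_path, exist_ok=True)
    llm_exporter = llm_models[model_type](args)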

    # some actions
    if args.test is not None:
        llm_exporter.response(args.test)

    if args.export:
        llm_exporter.export()

    if args.export_token:
        llm_exporter.export_tokenizer()

    if args.export_embed or args.export_split:
        llm_exporter.export_embed()

    if args.export_lm or args.export_split:
        llm_exporter.export_lm()

    if args.export_blocks or args.export_split:
        llm_exporter.export_blocks()

    if args.export_block is not None:
        llm_exporter.export_block(args.export_block)
  • Conversion command
python3 llm_export.py --path ../Llama2-7b/ --type Llama-2-7b-chat-ms --export_path onnx --export
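
A client that drives the exported `llm.onnx` (or a TensorRT engine built from it) has to construct `attention_mask` and `position_ids` the same way `Llama2_7b_Chat` does in the script above. The sketch below restates that logic with plain Python lists so the shapes are easy to inspect; a real client would presumably build NumPy arrays instead. The helper names `prefill_inputs` and `decode_inputs` are hypothetical.

```python
FLOAT32_MIN = -3.4028234663852886e38  # same value as torch.finfo(torch.float32).min


def prefill_inputs(prompt_len):
    """First call: all prompt tokens at once, empty KV cache."""
    # Causal mask of shape [1, 1, prompt_len, prompt_len]:
    # 0 where a token may attend (column <= row), float32 min elsewhere.
    mask = [[[[0.0 if col <= row else FLOAT32_MIN for col in range(prompt_len)]
              for row in range(prompt_len)]]]
    position_ids = [list(range(prompt_len))]  # shape [1, prompt_len]
    return mask, position_ids


def decode_inputs(total_len):
    """Later calls: one new token; total_len counts cached tokens plus the new one."""
    mask = [[[[0.0] * total_len]]]    # shape [1, 1, 1, total_len]
    position_ids = [[total_len - 1]]  # shape [1, 1]
    return mask, position_ids


mask, pos = prefill_inputs(3)
for row in mask[0][0]:
    print(["0" if value == 0.0 else "min" for value in row])
print(pos)
print(decode_inputs(4)[1])
```

In the decode step the mask is all zeros because the single new token may attend to every cached position.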

 

Triton deployment

  • Pull the images

docker pull nvcr.io/nvidia/tritonserver:23.08-py3
docker pull nvcr.io/nvidia/tensorrt:23.08-py3
  • Load the image and start the server
docker run -it --gpus=1 --rm --net=host -v $(pwd)/model-repository:/models nvcr.io/nvidia/tritonserver:23.08-py3 tritonserver --model-repository=/models
  • View the model's config.pbtxt
 curl localhost:8000/v2/models/${model_name}/config
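
The same check can be scripted. The sketch below uses only the Python standard library and assumes Triton's HTTP endpoint is reachable on `localhost:8000`, as in the `curl` command; the model name `trt` and all helper names are illustrative assumptions.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: Triton's HTTP port on this host


def ready_url(base_url=BASE_URL):
    """URL of the server readiness endpoint."""
    return f"{base_url}/v2/health/ready"


def config_url(model_name, base_url=BASE_URL):
    """URL that returns a model's configuration as JSON."""
    return f"{base_url}/v2/models/{model_name}/config"


def server_ready(base_url=BASE_URL, timeout=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def fetch_config(model_name, base_url=BASE_URL, timeout=2.0):
    """Fetch and decode the model configuration."""
    with urllib.request.urlopen(config_url(model_name, base_url), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    if server_ready():
        print(json.dumps(fetch_config("trt"), indent=2))
    else:
        print("Triton is not ready at", BASE_URL)
```

Polling `server_ready` before sending requests avoids confusing connection errors while the server is still loading models.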
  • Load the TensorRT image and start a container
 docker run --gpus all -it --rm -v $(pwd):/models nvcr.io/nvidia/tensorrt:23.08-py3

  • ONNX to TensorRT conversion
trtexec --onnx=./model.onnx --saveEngine=./trt/model.plan \
  --minShapes=input_ids:1,attention_mask:1x1x1x1,position_ids:1x1,past_key_values:32x2x1x32x0x128 \
  --optShapes=input_ids:1,attention_mask:1x1x1x1026,position_ids:1x1,past_key_values:32x2x1x32x1025x128 \
  --maxShapes=input_ids:1024,attention_mask:1x1x1024x2049,position_ids:1x1024,past_key_values:32x2x1x32x1025x128 \
  --device=1 --fp16
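
The three shape arguments follow one pattern: with `seq_len` new tokens and `history_len` cached tokens, the mask is `1x1xseq_lenx(seq_len + history_len)`. The sketch below regenerates the strings used in the command above from those two numbers, which may make it easier to change the maximum context length consistently. `shape_spec` is a hypothetical helper, and the layer, head and head-dimension defaults are the Llama-2-7B values from `past_kv_shape`.

```python
def shape_spec(seq_len, history_len, layers=32, kv_heads=32, head_dim=128):
    """Build one trtexec shape string for a given number of new and cached tokens."""
    total_len = seq_len + history_len
    return ",".join([
        f"input_ids:{seq_len}",
        f"attention_mask:1x1x{seq_len}x{total_len}",
        f"position_ids:1x{seq_len}",
        f"past_key_values:{layers}x2x1x{kv_heads}x{history_len}x{head_dim}",
    ])


# Smallest case: one token, empty cache.
min_shapes = shape_spec(seq_len=1, history_len=0)
# Typical decode step: one new token against a full cache.
opt_shapes = shape_spec(seq_len=1, history_len=1025)
# Upper bound for every dynamic axis.
max_shapes = shape_spec(seq_len=1024, history_len=1025)

print(f"--minShapes={min_shapes}")
print(f"--optShapes={opt_shapes}")
print(f"--maxShapes={max_shapes}")
```

Choosing the decode step for `--optShapes` reflects that most calls during generation process a single token.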
  • config.pbtxt
name: "trt",
platform: "tensorrt_plan",
max_batch_size: 0,
input: [{
        name: "past_key_values",
        data_type: TYPE_FP32,
        dims: [32, 2, 1, 32, -1, 128],
}, {
        name: "position_ids",
        data_type: TYPE_INT32,
        dims: [1, -1],
}, {
        name: "attention_mask",
        data_type: TYPE_FP32,
        dims: [1, 1, -1, -1],
}, {
        name: "input_ids",
        data_type: TYPE_INT32,
        dims: [-1],
}],
output: [{
        name: "presents",
        data_type: TYPE_FP32,
        dims: [32, 2, 1, 32, -1, 128],
}, {
        name: "token_id",
        data_type: TYPE_INT32,
        dims: [1],
}]
default_model_filename: "model.plan"               # TensorRT engine file name
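In this configuration a dimension of -1 is dynamic and every other value must match exactly. Below is a small sketch of a client-side check that a request's array shapes fit these dims before it is sent. `CONFIG_DIMS` is copied from the configuration above, and the helpers `matches_dims` and `mismatched_inputs` are hypothetical, not part of the Triton client.

```python
# dims copied from the config.pbtxt above; -1 marks a dynamic dimension
CONFIG_DIMS = {
    "past_key_values": [32, 2, 1, 32, -1, 128],
    "position_ids": [1, -1],
    "attention_mask": [1, 1, -1, -1],
    "input_ids": [-1],
}


def matches_dims(shape, dims):
    """True if `shape` has the same rank as `dims` and agrees on every fixed dimension."""
    return len(shape) == len(dims) and all(
        d == -1 or d == s for s, d in zip(shape, dims)
    )


def mismatched_inputs(shapes):
    """Return the names in {name: shape} whose shape does not fit CONFIG_DIMS."""
    return [
        name for name, shape in shapes.items()
        if name not in CONFIG_DIMS or not matches_dims(shape, CONFIG_DIMS[name])
    ]
```

For NumPy inputs, the shapes could be collected as `{name: array.shape for name, array in inputs.items()}` before calling `mismatched_inputs`.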

Inspect the input and output shapes of the ONNX model

import onnx

model = onnx.load(r"model.onnx")

# The model is represented as a protobuf structure and it can be accessed
# using the standard python-for-protobuf methods

# iterate through inputs of the graph
for input in model.graph.input:
    print (input.name, end=": ")
    # get type of input tensor
    tensor_type = input.type.tensor_type
    # check if it has a shape:
    if (tensor_type.HasField("shape")):
        # iterate through dimensions of the shape:
        for d in tensor_type.shape.dim:
            # the dimension may have a definite (integer) value or a symbolic identifier or neither:
            if (d.HasField("dim_value")):
                print (d.dim_value, end=", ")  # known dimension
            elif (d.HasField("dim_param")):
                print (d.dim_param, end=", ")  # unknown dimension with symbolic name
            else:
                print ("?", end=", ")  # unknown dimension with no name
    else:
        print ("unknown rank", end="")
    print()


# iterate through outputs of the graph
for output in model.graph.output:
    print (output.name, end=": ")
    # get type of output tensor
    tensor_type = output.type.tensor_type
    # check if it has a shape:
    if (tensor_type.HasField("shape")):
        # iterate through dimensions of the shape:
        for d in tensor_type.shape.dim:
            # the dimension may have a definite (integer) value or a symbolic identifier or neither:
            if (d.HasField("dim_value")):
                print (d.dim_value, end=", ")  # known dimension
            elif (d.HasField("dim_param")):
                print (d.dim_param, end=", ")  # unknown dimension with symbolic name
            else:
                print ("?", end=", ")  # unknown dimension with no name
    else:
        print ("unknown rank", end="")
    print()

Inspect more detailed information

import onnx
from onnx import helper
import sys,getopt

# Load the model
def loadOnnxModel(path):
    model = onnx.load(path)
    return model

# Get a node together with its input and output name lists. In general, inputs that come from
# the previous layer's outputs appear first in the list, followed by the parameters
def getNodeAndIOname(nodename,model):
    for node in model.graph.node:
        if node.name == nodename:
            return node, node.input, node.output
    # Fail with a clear message instead of hitting an unbound local variable
    raise ValueError("node not found: " + nodename)

# Get the value info of the matching inputs
def getInputTensorValueInfo(input_name,model):
    in_tvi = []
    for name in input_name:
        for params_input in model.graph.input:
            if params_input.name == name:
               in_tvi.append(params_input)
        for inner_output in model.graph.value_info:
            if inner_output.name == name:
                in_tvi.append(inner_output)
    return in_tvi

# Get the value info of the matching outputs
def getOutputTensorValueInfo(output_name,model):
    out_tvi = []
    for name in output_name:
        # Extend rather than reassign, so matches for earlier names are not discarded
        out_tvi.extend(inner_output for inner_output in model.graph.value_info if inner_output.name == name)
        out_tvi.extend(graph_output for graph_output in model.graph.output if graph_output.name == name)
    return out_tvi

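# Get the matching parameter (initializer) values
def getInitTensorValue(input_name,model):
    init_t = []
    for name in input_name:
        # Extend rather than reassign, so matches for earlier names are not discarded
        init_t.extend(init for init in model.graph.initializer if init.name == name)
    return init_t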

# Build a single-node ONNX model
def createSingelOnnxModel(ModelPath,nodename,SaveType="",SavePath=""):
    model = loadOnnxModel(str(ModelPath))
    Node,input_name,output_name = getNodeAndIOname(nodename,model)
    in_tvi = getInputTensorValueInfo(input_name,model)
    out_tvi = getOutputTensorValueInfo(output_name,model)
    init_t = getInitTensorValue(input_name,model)

    graph_def = helper.make_graph(
                [Node],
                nodename,
                inputs=in_tvi,  # inputs
                outputs=out_tvi,  # outputs
                initializer=init_t,  # initializers
            )
    model_def = helper.make_model(graph_def, producer_name='onnx-example')
    # Write the model to disk when a target path is given
    if SavePath:
        onnx.save(model_def, SavePath)
    print(nodename+" ONNX model generated successfully!")
    return model_def
# Get the number of nodes
def getNodeNum(model):
    return len(model.graph.node)
# Get the list of distinct node (operator) types
def getNodetype(model):
    op_name = []
    for i in range(len(model.graph.node)):
        if model.graph.node[i].op_type not in op_name:
            op_name.append(model.graph.node[i].op_type)
    return op_name
# Get the list of node names
def getNodeNameList(model):
    NodeNameList = []
    for i in range(len(model.graph.node)):
        NodeNameList.append(model.graph.node[i].name)
    return NodeNameList
# Get the model's input info (first input only)
def getModelInputInfo(model):
    return model.graph.input[0]
# Get the model's output info (first output only)
def getModelOutputInfo(model):
    return model.graph.output[0]







model = onnx.load(r"model.onnx")
node_num = getNodeNum(model)
print("Number of nodes:", node_num)

node_types = getNodetype(model)
print("Node types:")
for node_type in node_types:
    print(node_type)


#node_names = getNodeNameList(model)
#print("Node name list:")
#for node_name in node_names:
#    print(node_name)


model_input_info = getModelInputInfo(model)
print("Model input info:")
print(model_input_info)



model_output_info = getModelOutputInfo(model)
print("Model output info:")
print(model_output_info)





#output_onnx_file_path="./shape"
#estimated_graph = onnx.shape_inference.infer_shapes(model)
#onnx.save(estimated_graph, output_onnx_file_path)

def parseRepeatedScalarContainer(container):
    values = []
    for element in container:
        print("element.name",element)
        values.append(element)
    return values

def getNodeNameList_test(model):
    NodeNameList = []
    target_nodes = ["/blocks_.0/self_attn/If", "/blocks_.0/self_attn/If_1", "/blocks_.0/self_attn/If_2", "/blocks_.0/self_attn/If_3"]
    for node in model.graph.node:
        if node.name in target_nodes:
            NodeNameList.append(node.name)
            print("Node name:", node.name)
            #print("Type:", type(node.input))

            input_items=parseRepeatedScalarContainer(node.input)
            #print("Type:", type(input_items))

    return NodeNameList
node_names = getNodeNameList_test(model)
print("Node name list:")
for node_name in node_names:
    print(node_name)


 

  • Inference test
import os
import base64
import glob
import shutil
import argparse
import torch
import numpy as np
import onnxruntime as ort
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
import tritonclient.http as httpclient


class LLM(torch.nn.Module):
    '''
    Base class for all llm model. Inherits from [`torch.nn.Module`].
    '''

    def __init__(self, path):
        super().__init__()
        self.load_tokenizer(path)
        self.max_length = 1024

    def load_tokenizer(self, model_path: str):
        raise NotImplementedError

    def get_attention_mask(self, seq_len) -> torch.Tensor:
        raise NotImplementedError

    def get_position_ids(self, seq_len) -> torch.Tensor:
        raise NotImplementedError

    # some test functions
    def build_prompt(self, query):
        if hasattr(self.tokenizer, 'build_prompt'):
            prompt = self.tokenizer.build_prompt(query)
        else:
            prompt = query
        return prompt

    def str_to_ids(self, prompt):
        input_ids = self.tokenizer(prompt, return_tensors="pt")['input_ids']
        return input_ids

    def id_to_str(self, token_id):
        word = self.tokenizer._convert_id_to_token(int(token_id[0]))
        print(word)
        word = self.tokenizer.convert_tokens_to_string([word+" "])
        return word


    def generate_inputs(self,query):
        prompt = self.build_prompt(query)
        input_ids = self.str_to_ids(prompt).squeeze()
        self.seq_len = input_ids.numel()
        self.context_len = self.seq_len - 2
        attention_mask =  self.get_attention_mask(self.seq_len)
        position_ids = self.get_position_ids(self.seq_len)
        past_key_values = torch.zeros(self.past_kv_shape)



        inputs = {
            'input_ids' : input_ids.detach().numpy(),
            'attention_mask' : attention_mask.numpy(),
            'position_ids' : position_ids.numpy(),
            'past_key_values' : past_key_values.numpy()
        }




        return inputs


class Llama2_7b_input_generator(LLM):

    def load_tokenizer(self, path: str):
        self.tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
        self.stop_id = self.tokenizer.eos_token_id
        if self.stop_id is None:
            self.stop_id = self.tokenizer.tokenizer.eos_id
        # some config for export
        self.past_kv_shape = [32, 2, 1, 32, 0, 128]




    def get_attention_mask(self, seq_len) -> torch.Tensor:
        return (1 - torch.tril(torch.ones([1, 1, seq_len, seq_len]))) * torch.finfo(torch.float32).min

    def get_position_ids(self, seq_len) -> torch.Tensor:
        return torch.arange(seq_len, dtype=torch.long).unsqueeze(0)




if __name__ == '__main__':

    query = "please introduce java"

    max_infer_length = 1024
    #cp -r model_path/tokenizer /XXX/model-repository/trt/1/tokenizer
    tokenizer_path = "/XXX/model-repository/trt/1/tokenizer"
    input_generator = Llama2_7b_input_generator(tokenizer_path)
    inputs_tokenized = input_generator.generate_inputs(query)
    # print(inputs_tokenized)
    input_ids = inputs_tokenized["input_ids"]
    past_key_values = inputs_tokenized['past_key_values']
    # ort_session = ort.InferenceSession(onnx_model, providers=['CUDAExecutionProvider'])
    # onnx_outs = ort_session.run(None, inputs_tokenized)
    # print("input_ids shape: ", input_ids.shape)
    # print("atten mask shape: ", attention_mask.shape)
    # print("position ids shape: ", position_ids.shape)
    # print("past kv shape: ", past_key_values.shape)
    # output_str = input_generator.id_to_str(onnx_outs[0])
    # print(onnx_outs[0],output_str)


    import tritonclient.http as httpclient

    triton_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=False)
    model_name = "trt"

    token_len = 0
    stop_stream = False

    while not stop_stream:

        cur_input_len = input_ids.shape[0]
        # Number of tokens already held in the KV cache (0 on the first pass)
        past_len = past_key_values.shape[4]
        attention_mask = input_generator.get_attention_mask(cur_input_len).numpy()
        # Offset by the cache length so each new token gets its absolute position
        position_ids = input_generator.get_position_ids(cur_input_len).numpy() + past_len

        input_ids = np.array(input_ids, dtype=np.int32)
        position_ids = np.array(position_ids, dtype=np.int32)

        inputs = [
            httpclient.InferInput('input_ids', list(input_ids.shape), "INT32"),
            httpclient.InferInput('attention_mask', list(attention_mask.shape), "FP32"),
            httpclient.InferInput('position_ids', list(position_ids.shape), "INT32"),
            httpclient.InferInput('past_key_values', [32, 2, 1, past_key_values.shape[3],past_key_values.shape[4], 128], "FP32")
        ]

        inputs[0].set_data_from_numpy(input_ids)
        inputs[1].set_data_from_numpy(attention_mask)
        inputs[2].set_data_from_numpy(position_ids)
        inputs[3].set_data_from_numpy(past_key_values)

        outputs = [
            httpclient.InferRequestedOutput('token_id'),
            httpclient.InferRequestedOutput('presents')
        ]

        results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)

        token_id = results.as_numpy("token_id")
        past_key_values = results.as_numpy("presents")
        token_len += 1

        # Print the new token unless it is the end-of-sequence token
        is_eos = int(token_id[0]) == input_generator.stop_id
        if not is_eos:
            infer_token_str = input_generator.id_to_str(token_id)
            print(infer_token_str,end="", flush=True)

        # Stop at end of sequence, or once the KV cache reaches the length limit
        stop_stream = is_eos or past_key_values.shape[4] >= max_infer_length

        # The next pass feeds only the newly generated token
        input_ids = token_id
        # print(past_key_values.shape)
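To make the relationship between the KV-cache length and the per-step inputs concrete, here is a dependency-free sketch of the attention mask rows and position ids for a prefill pass and a decode pass. It is an illustration under assumptions rather than the exporter's code: `step_inputs` is a hypothetical helper, the mask is shown as plain nested lists without the two leading singleton dimensions, and the decode-step mask width `past_len + 1` is inferred from the shape profiles passed to trtexec.

```python
# Same value as torch.finfo(torch.float32).min, used to mask out future positions
FLOAT32_MIN = -3.4028234663852886e38


def step_inputs(new_tokens, past_len):
    """Return (mask_rows, position_ids) for one pass over `new_tokens` tokens.

    mask_rows[i][k] is 0.0 when query i may attend to key k, FLOAT32_MIN otherwise.
    Keys cover the cached tokens first, then the new tokens.
    """
    total = past_len + new_tokens
    mask_rows = [
        [0.0 if key <= past_len + row else FLOAT32_MIN for key in range(total)]
        for row in range(new_tokens)
    ]
    position_ids = [past_len + i for i in range(new_tokens)]
    return mask_rows, position_ids


# Prefill: 4 prompt tokens, empty cache -> 4x4 causal mask, positions 0..3
prefill_mask, prefill_pos = step_inputs(new_tokens=4, past_len=0)

# Decode: 1 new token on top of 4 cached tokens -> 1x5 all-zero mask, position 4
decode_mask, decode_pos = step_inputs(new_tokens=1, past_len=4)
```

In the prefill pass each token sees only itself and earlier tokens; in a decode pass the single new token sees the whole cache, and its position continues from the cache length.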