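Model Adaptation Workflow Overview

Code Changes

torch_npu is now mature enough that, in most cases, no library code needs to be modified. Adding the following imports to the main entry script is usually enough to get the model running.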

import torch
import torch_npu
import deepspeed
import deepspeed_npu
from torch_npu.contrib import transfer_to_npu

Note:

The error module 'torch._C' has no attribute '_cuda_setDevice' usually means that from torch_npu.contrib import transfer_to_npu is missing, so device calls are not redirected to the NPU.

Next, modify the model's forward-pass file, modeling_telechat.py.

In class FlashSelfAttention(torch.nn.Module), replace the call to flash_attn_unpadded_func with torch_npu.npu_fusion_attention:

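# Requires "import math" at the top of modeling_telechat.py.
# Causal mask: True above the diagonal marks positions that are excluded from attention.
atten_mask_ = torch.triu(torch.ones(q.shape[1], q.shape[1]), 1).to(torch.float)
atten_mask_npu = atten_mask_.clone().bool().to(q.device)
# With the "BSND" layout, q has shape [batch, seq_len, head_num, head_dim].
head_num = q.shape[2]
output = torch_npu.npu_fusion_attention(
    q,
    k,
    v,
    head_num,
    "BSND",
    keep_prob=1.0,
    atten_mask=atten_mask_npu,
    scale=1.0 / math.sqrt(q.shape[-1]),
    pre_tockens=q.shape[1],   # parameter name is spelled this way in torch_npu
    next_tockens=0,
    inner_precise=0)[0]       # the first returned element is the attention output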

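Performance Optimization

The Huawei community documentation generates the mask as follows:

atten_mask_npu= torch.from_numpy(np.triu(np.ones([max_seqlen_q, max_seqlen_k]), k=1))

Building the mask with torch functions only skips the intermediate NumPy array and saves the time spent copying data from one memory region to another along the torch.from_numpy path: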

atten_mask_ = torch.triu(torch.ones(q.shape[1], q.shape[1]),1).to(torch.float)
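As a minimal sketch of the same idea, the boolean causal mask can also be built directly, without a float intermediate. The helper name causal_mask is hypothetical and not part of torch_npu. The device argument defaults to CPU so that the snippet runs anywhere; inside the model it would be q.device.

```python
import torch

def causal_mask(seq_len: int, device: str = "cpu") -> torch.Tensor:
    """Hypothetical helper: True above the diagonal, i.e. at future positions."""
    idx = torch.arange(seq_len, device=device)
    # A column index greater than the row index means the key lies in the
    # future of the query, so that position is masked out.
    return idx[None, :] > idx[:, None]

mask = causal_mask(4)
# Same values as the torch.triu(...) expression used in this document.
reference = torch.triu(torch.ones(4, 4), 1).bool()
```

Because the result only depends on the sequence length, it could also be computed once and reused across layers instead of being rebuilt in every forward call.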

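Two input layouts of torch_npu.npu_fusion_attention are relevant here: TND and BSND.

TND:

T is the total number of tokens in the batch (equal to B × S when every sequence has the same length), N is head_num, the number of attention heads, and D is the hidden size divided by head_num.

BSND:

B is the batch size and S is the sequence length in tokens; N and D are the same as above.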
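To make the relation between the two layouts concrete, the sketch below computes the tensor shapes in plain Python. The helper names and the example sizes are hypothetical. It also computes cumulative sequence lengths, on the assumption that a packed TND batch needs the end offset of each sequence; the torch_npu documentation should be checked for the exact argument that carries them.

```python
from itertools import accumulate

def bsnd_shape(batch, seq_len, hidden_size, head_num):
    """Shape of q/k/v in BSND layout (hypothetical helper)."""
    assert hidden_size % head_num == 0, "hidden size must split evenly across heads"
    return (batch, seq_len, head_num, hidden_size // head_num)

def tnd_shape(seq_lens, hidden_size, head_num):
    """Shape in TND layout: all sequences are packed along one token axis."""
    assert hidden_size % head_num == 0, "hidden size must split evenly across heads"
    return (sum(seq_lens), head_num, hidden_size // head_num)

def cumulative_lengths(seq_lens):
    """End offset of each sequence inside the packed token axis."""
    return list(accumulate(seq_lens))

print(bsnd_shape(2, 8, 64, 4))        # (2, 8, 4, 16)
print(tnd_shape([8, 8], 64, 4))       # (16, 4, 16)
print(tnd_shape([5, 3, 8], 64, 4))    # (16, 4, 16)
print(cumulative_lengths([5, 3, 8]))  # [5, 8, 16]
```

With equal-length sequences T is simply B × S; with sequences of different lengths, TND avoids padding every sequence to the longest one.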

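Profiling Collection

LEVEL2 is the recommended collection level (profiler_level = Constant.LEVEL2) because it collects the most data. The configuration call has the following form, shown here with LEVEL0:

torch_npu.profiler._ExperimentalConfig(profiler_level = Constant.LEVEL0, aic_metrics = Constant.AicMetricsNone, l2_cache = False, record_op_args = False)

wait, skip_first, and warmup are all numbers of steps that are not collected; active is the number of steps that are collected; repeat is the number of times the active collection is repeated.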

torch_npu.profiler.schedule (wait, active, warmup = 0, repeat = 0, skip_first = 0)
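The following plain-Python sketch illustrates how these parameters divide the training steps. It assumes that torch_npu.profiler.schedule orders the phases the same way torch.profiler.schedule does (skip_first once, then repeated cycles of wait, warmup, active) and that repeat = 0 means the cycle continues indefinitely. The function step_phase is hypothetical and only mimics the bookkeeping; it does not call the profiler.

```python
def step_phase(step, wait, active, warmup=0, repeat=0, skip_first=0):
    """Return the phase name for a 0-based training step (illustrative only)."""
    if step < skip_first:
        return "skip"
    step -= skip_first
    cycle = wait + warmup + active
    # Assumption: repeat == 0 means the cycle repeats without limit.
    if repeat > 0 and step // cycle >= repeat:
        return "done"
    pos = step % cycle
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"
    return "active"

phases = [
    step_phase(s, wait=1, active=2, warmup=1, repeat=2, skip_first=1)
    for s in range(10)
]
print(phases)
```

In this example only steps 3, 4, 7, and 8 fall in the active phase, so a run must last at least skip_first + repeat × (wait + warmup + active) steps to collect every requested cycle.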

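Data Format

Data for large language models generally falls into pre-training data and fine-tuning data: pre-training data is plain text, and fine-tuning data consists of question-answer pairs.

Both training methods teach the LLM to predict the next token, and both need the tokens concatenated into sequences of the user-specified max_length. One sequence of max_length tokens is one sample.

Pre-training simply concatenates the text tokens.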
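The concatenation step can be sketched as follows. The function name pack_tokens and the token ids are hypothetical, and dropping the incomplete tail is an assumption made for this sketch; the source does not say how leftover tokens are handled.

```python
def pack_tokens(documents, max_length):
    """Concatenate tokenized documents and cut the stream into max_length samples."""
    stream = [tok for doc in documents for tok in doc]
    # Assumption: tokens that do not fill a whole sample are dropped.
    usable = len(stream) - len(stream) % max_length
    return [stream[i:i + max_length] for i in range(0, usable, max_length)]

docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
samples = pack_tokens(docs, max_length=4)
print(samples)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that a sample may span document boundaries, as the first sample does here.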

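For fine-tuning, special tokens marking the question, the answer, and the end of the exchange are added to each question-answer pair, for example <_user>English name of the Tianyi Cloud company.<_bot>state cloud.<end>. Multiple dialogues are then concatenated into a token sequence of length max_length, and any remaining positions are filled with pad_token.

The two methods differ in how the loss is computed: pre-training computes the loss over all tokens, whereas full-parameter fine-tuning counts only the loss on the answer part (a mask covers the loss on the question part).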
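A minimal sketch of building one fine-tuning sample together with its loss mask is given below. The token ids, the helper name build_sample, and the choices to count the end token in the loss, to mask padding, and to truncate overlong input are all assumptions for illustration; a real tokenizer and training script define their own.

```python
# Hypothetical special-token ids standing in for <_user>, <_bot>, <end>, and pad_token.
USER, BOT, END, PAD = 1, 2, 3, 0

def build_sample(dialogues, max_length):
    """dialogues: list of (question_ids, answer_ids). Returns (input_ids, loss_mask)."""
    input_ids, loss_mask = [], []
    for question, answer in dialogues:
        prompt = [USER] + question + [BOT]
        reply = answer + [END]
        input_ids += prompt + reply
        # 0 hides the question from the loss, 1 keeps the answer tokens.
        loss_mask += [0] * len(prompt) + [1] * len(reply)
    # Assumption: overlong input is truncated, short input is padded and masked.
    input_ids, loss_mask = input_ids[:max_length], loss_mask[:max_length]
    pad = max_length - len(input_ids)
    return input_ids + [PAD] * pad, loss_mask + [0] * pad

ids, mask = build_sample([([11, 12], [21]), ([13], [22, 23])], max_length=14)
print(ids)   # [1, 11, 12, 2, 21, 3, 1, 13, 2, 22, 23, 3, 0, 0]
print(mask)  # [0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0]
```

For pre-training the same structure applies with a mask of all ones, which matches the statement that the loss is computed over every token.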
