searchusermenu
  • 发布文章
  • 消息中心
点赞
收藏
评论
分享
原创

【问题实操】银河高级服务器操作系统实例分享,开机之后反复重启

2024-10-29 09:17:25
4
0

1.服务器环境以及配置

image.png

2.问题现象描述

服务器开机之后反复重启,用户现场已经收集了日志,需要分析日志。

3.问题分析

3.1 查看系统环境

cat ./sos_commands/networking/ethtool_-i_enp6s0f1
driver: mlx5_core
version: 5.0-0
firmware-version: 14.29.2002 (MT_2420110004)
expansion-rom-version:
bus-info: 0000:06:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

3.2 vmcore-dmesg分析

分析127.0.0.1-2023-09-19-07:58:37的vmcore-dmesg

[1865270.773139] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865271.773142] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865272.773093] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865273.773195] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865274.773168] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865275.773147] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865276.773151] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865277.773127] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865278.773147] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865279.773174] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865280.773149] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865281.773120] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865282.773112] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865283.773139] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865284.773150] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865285.773097] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865286.773067] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865287.773095] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry

分析127.0.0.1-2023-10-29-14:17:27的vmcore-dmesg

[3478639.193325] mlx5_core 0000:06:00.0: wait_func:1034:(pid 4068733): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193333] mlx5_core 0000:06:00.0: wait_func:1034:(pid 4166044): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193334] mlx5_core 0000:06:00.0: wait_func:1034:(pid 2812292): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193342] mlx5_core 0000:06:00.0: wait_func:1034:(pid 3401828): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193344] mlx5_core 0000:06:00.0: wait_func:1034:(pid 3822606): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193354] mlx5_core 0000:06:00.0: wait_func:1034:(pid 188898): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193363] mlx5_core 0000:06:00.0: wait_func:1034:(pid 3288787): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.194469] mlx5_core 0000:06:00.0: wait_func:1038:(pid 192618): MODIFY_CQ(0x403) canceled on out of queue timeout.
[3478639.194487] mlx5_core 0000:06:00.0: wait_func:1038:(pid 192616): MODIFY_CQ(0x403) canceled on out of queue timeout.
[3478639.194495] mlx5_core 0000:06:00.0: wait_func:1038:(pid 192615): MODIFY_CQ(0x403) canceled on out of queue timeout.
[3478639.194505] mlx5_core 0000:06:00.0: wait_func:1038:(pid 192613): MODIFY_CQ(0x403) canceled on out of queue timeout.
[3478639.194515] mlx5_core 0000:06:00.0: wait_func:1038:(pid 192610): MODIFY_CQ(0x403) canceled on out of queue timeout.
[3478640.183135] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1268656): failed to allocate command entry

从日志可见,不断打印failed to allocate command entry,可见该打印来自于drivers/net/ethernet/mellanox/mlx5/core/cmd.c文件的cmd_work_handler函数。
image.png
分析异常打印时的函数调用流程如下所示:

-> mlx5\_cmd\_exec
 -> cmd\_exec
  -> mlx5\_cmd\_invoke
   -> cmd\_work\_handler
    -> alloc\_ent
     -> find\_first\_bit
static int alloc_ent(struct mlx5_cmd *cmd)
{
    unsigned long flags;
    int ret;

    spin_lock_irqsave(&cmd->alloc_lock, flags);
    ret = find_first_bit(&cmd->bitmask, cmd->max_reg_cmds);
    if (ret < cmd->max_reg_cmds)
        clear_bit(ret, &cmd->bitmask);
    spin_unlock_irqrestore(&cmd->alloc_lock, flags);

    return ret < cmd->max_reg_cmds ? ret : -ENOMEM;
}
/*
 * Find the first set bit in a memory region.
 */
unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
{
    unsigned long idx;

    for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
        if (addr[idx])
            return min(idx * BITS_PER_LONG + __ffs(addr[idx]), size);
    }

    return size;
}
static void free_ent(struct mlx5_cmd *cmd, int idx)
{
    unsigned long flags;

    spin_lock_irqsave(&cmd->alloc_lock, flags);
    set_bit(idx, &cmd->bitmask);
    spin_unlock_irqrestore(&cmd->alloc_lock, flags);
}

从上述代码可见,通过cmd->bitmask进行标记是否有还可以申请,cmd->bitmask的初始化如下,假设获取出来的cmd->log_sz=0x80,则cmd->max_reg_cmds=0xFF,cmd->bitmask=0x1FD。应该是cmd->bitmask初始化的时候有多少bit位为1,就可以alloc_ent多少次。

int mlx5_cmd_init(struct mlx5_core_dev *dev)
{
    int size = sizeof(struct mlx5_cmd_prot_block);
    int align = roundup_pow_of_two(size);
    struct mlx5_cmd *cmd = &dev->cmd;
    u32 cmd_h, cmd_l;
    u16 cmd_if_rev;
    int err;
    int i;

    memset(cmd, 0, sizeof(*cmd));
    cmd_if_rev = cmdif_rev(dev);
    if (cmd_if_rev != CMD_IF_REV) {
        dev_err(&dev->pdev->dev,
            "Driver cmdif rev(%d) differs from firmware's(%d)\n",
            CMD_IF_REV, cmd_if_rev);
        return -EINVAL;
    }

    cmd->pool = dma_pool_create("mlx5_cmd", &dev->pdev->dev, size, align, 0);
    if (!cmd->pool)
        return -ENOMEM;
    err = alloc_cmd_page(dev, cmd);
    if (err)
        goto err_free_pool;

    cmd_l = ioread32be(&dev->iseg->cmdq_addr_l_sz) & 0xff;
    cmd->log_sz = cmd_l >> 4 & 0xf;
    cmd->log_stride = cmd_l & 0xf;
    if (1 << cmd->log_sz > MLX5_MAX_COMMANDS) {
        dev_err(&dev->pdev->dev, "firmware reports too many outstanding commands %d\n",
            1 << cmd->log_sz);
        err = -EINVAL;
        goto err_free_page;
    }

    if (cmd->log_sz + cmd->log_stride > MLX5_ADAPTER_PAGE_SHIFT) {
        dev_err(&dev->pdev->dev, "command queue size overflow\n");
        err = -EINVAL;
        goto err_free_page;
    }

    cmd->checksum_disabled = 1;
    cmd->max_reg_cmds = (1 << cmd->log_sz) - 1;
    cmd->bitmask = (1UL << cmd->max_reg_cmds) - 1;

执行cmd_exec的时候,会去调用alloc_ent,找到cmd->bitmask中最近为1的位,并调用clear_bit清除为0;cmd传到固件执行完以后,要去释放command entry,调用free_ent,将对应位重新标志为1。所以出现问题,有两种可能性:

1、当cmd在传到固件,一直未能执行完,而一直有cmd_exec执行,被积压,导致cmd->bitmask的位使用完,cmd->bitmask所有位被置为了0。alloc_ent无法申请到cmd->bitmask中的标识位。

2、cmd_exec执行完后,由于某种原因,没能调用free_ent,导致cmd->bitmask的所有位被使用完。

3.3 vmcore分析

日志中提示Internal error: SP/PC alignment exception,PC寄存器出现了未对齐的打印。

[3478640.184122] Internal error: SP/PC alignment exception: 8a000000 [#1] SMP
[3478640.184720] Modules linked in: fuse tun bridge stp llc fccore(POE) fchook(OE) bonding ip_set_hash_ip ip_set nfnetlink rfkill ipmi_ssif vfat fat rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser rdma_cm ib_umad iw_cm ib_ipoib libiscsi scsi_transport_iscsi ib_cm mlx5_ib joydev ib_uverbs ib_core ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel mlx5_core megaraid_sas igb ast mlxfw dm_mirror dm_region_hash dm_log [last unloaded: ip_tables]
[3478640.188127] Process kworker/u128:0 (pid: 1268656, stack limit = 0x0000000074e0af67)
[3478640.188820] CPU: 15 PID: 1268656 Comm: kworker/u128:0 Kdump: loaded Tainted: P           OE     4.19.90-52.22.v2207.ky10.aarch64 #1
[3478640.189835] Source Version: 52f32c50138f4eec54ab90ce9775f0d75e1d6968
[3478640.190505] Hardware name: HIKVISION DS-V/N/A, BIOS 601FBE02HK 2020/04/20
[3478640.191173] Workqueue: mlx5_cmd_0000:06:00.0 cmd_work_handler [mlx5_core]
[3478640.191790] pstate: 40000005 (nZcv daif -PAN -UAO)
[3478640.192235] pc : 0x42
[3478640.192539] lr : cmd_work_handler+0x434/0x578 [mlx5_core]
[3478640.193029] sp : ffffda857e113d20
[3478640.193348] x29: ffffda857e113d20 x28: ffffda8553b4c400
[3478640.193838] x27: 0000000000000000 x26: ffff5607dffd1000
[3478640.194341] x25: 0000000000000000 x24: ffffda8511e001c8
[3478640.194830] x23: ffffda8511e000e0 x22: ffffda8511e00160
[3478640.195345] x21: ffffda8553b4c548 x20: 0000000000000001
[3478640.195840] x19: ffffda8511e001d8 x18: 0000000000000001
[3478640.196332] x17: 0000fffc58585308 x16: ffff5607def87070
[3478640.196812] x15: ffffffffffffffff x14: 742064656c696166
[3478640.197283] x13: 203a293635363836 x12: 323120646970283a
[3478640.197769] x11: 3630393a72656c64 x10: ffff5607dffd5b70
[3478640.198278] x9 : 5f646d63203a302e x8 : ffff5607deee95d8
[3478640.198769] x7 : ffff5607e07230f8 x6 : 000000000000056b
[3478640.199307] x5 : 0000000000000000 x4 : 0000000000000000
[3478640.199801] x3 : 0000000000000000 x2 : 0000000000000042
[3478640.200358] x1 : 0000000000000000 x0 : 00000000fffffff5
[3478640.200835] Call trace:
[3478640.201084]  0x42
[3478640.201307]  process_one_work+0x1fc/0x4a8
[3478640.201705]  worker_thread+0x50/0x4d8
[3478640.202068]  kthread+0x134/0x138
[3478640.202386]  ret_from_fork+0x10/0x18
[3478640.202781] Code: bad PC value
[3478640.203165] SMP: stopping secondary CPUs
[3478640.205196] Starting crashdump kernel...

crash /usr/lib/debug/lib/modules/4.19.90-52.22.v2207.ky10.aarch64/vmlinux vmcore分析vmcore

LOAD AVERAGE: 102.08, 41.38, 22.56
       TASKS: 7790
    NODENAME: localhost.localdomain
     RELEASE: 4.19.90-52.22.v2207.ky10.aarch64
     VERSION: #1 SMP Tue Mar 14 11:52:45 CST 2023
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 126.4 GB
       PANIC: ""
         PID: 1268656
     COMMAND: "kworker/u128:0"
        TASK: ffffdc8435076a00  [THREAD_INFO: ffffdc8435076a00]
         CPU: 15
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 1268656  TASK: ffffdc8435076a00  CPU: 15  COMMAND: "kworker/u128:0"
 #0 [ffffda857e1138c0] machine_kexec at ffff5607de7b2e8c
 #1 [ffffda857e113920] __crash_kexec at ffff5607de8ca148
 #2 [ffffda857e113a90] crash_kexec at ffff5607de8ca258
 #3 [ffffda857e113ac0] die at ffff5607de79f65c
 #4 [ffffda857e113b00] arm64_notify_die at ffff5607de79f964
 #5 [ffffda857e113b30] do_sp_pc_abort at ffff5607de7914f8
 #6 [ffffda857e113d10] el1_pc at ffff5607de792f94
 #7 [ffffda857e113d20] (null) at 3e
 #8 [ffffda857e113db0] process_one_work at ffff5607de8226f8
 #9 [ffffda857e113e00] worker_thread at ffff5607de8229f4
#10 [ffffda857e113e70] kthread at ffff5607de82a1e0

根据vmcore-dmesg中,pc : 0x42 lr : cmd_work_handler+0x434/0x578 [mlx5_core],查看lr寄存器执行的位置可见执行 blr x2的时候触发了空指针。从而可知问题发生在drivers/net/ethernet/mellanox/mlx5/core/cmd.c: 908

crash> mod -s mlx5_core
crash> dis -rl cmd_work_handler+0x434 |tail -n 20
0xffff56077bcf6808 <cmd_work_handler+1016>:     mrs     x1, sp_el0
/usr/src/debug/kernel-4.19.90/linux-4.19.90-52.22.v2207.ky10.aarch64/drivers/net/ethernet/mellanox/mlx5/core/cmd.c: 906
0xffff56077bcf680c <cmd_work_handler+1020>:     ldr     x0, [x22,#-128]
0xffff56077bcf6810 <cmd_work_handler+1024>:     adrp    x2, 0xffff56077bd70000
0xffff56077bcf6814 <cmd_work_handler+1028>:     ldr     w4, [x1,#1712]
0xffff56077bcf6818 <cmd_work_handler+1032>:     add     x2, x2, #0xe50
0xffff56077bcf681c <cmd_work_handler+1036>:     add     x2, x2, #0x7c0
0xffff56077bcf6820 <cmd_work_handler+1040>:     mov     w3, #0x38a                      // #906
0xffff56077bcf6824 <cmd_work_handler+1044>:     adrp    x1, 0xffff56077bd7e000
0xffff56077bcf6828 <cmd_work_handler+1048>:     add     x1, x1, #0xcb0
0xffff56077bcf682c <cmd_work_handler+1052>:     bl      0xffff56077bd65e60 <plt:_dev_err>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-52.22.v2207.ky10.aarch64/drivers/net/ethernet/mellanox/mlx5/core/cmd.c: 907
0xffff56077bcf6830 <cmd_work_handler+1056>:     ldr     x2, [x28,#40]
0xffff56077bcf6834 <cmd_work_handler+1060>:     cbz     x2, 0xffff56077bcf6970 <cmd_work_handler+1376>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-52.22.v2207.ky10.aarch64/drivers/net/ethernet/mellanox/mlx5/core/cmd.c: 908
0xffff56077bcf6838 <cmd_work_handler+1064>:     ldr     x1, [x21,#-88]
0xffff56077bcf683c <cmd_work_handler+1068>:     mov     w0, #0xfffffff5                 // #-11
0xffff56077bcf6840 <cmd_work_handler+1072>:     blr     x2
/usr/src/debug/kernel-4.19.90/linux-4.19.90-52.22.v2207.ky10.aarch64/drivers/net/ethernet/mellanox/mlx5/core/cmd.c: 909
0xffff56077bcf6844 <cmd_work_handler+1076>:     ldr     x1, [x28,#16]
crash>

image.png
struct mlx5_cmd_work_ent是一个自定义的结构体,它包含了额外的字段和数据,用于处理MLX5驱动程序中的命令工作项。可见callback是一个有问题的值,该值出现问题应是网卡固件导致。

crash> struct mlx5_cmd_work_ent 0xffffda8553b4c400
struct mlx5_cmd_work_ent {
  state = 1,
  in = 0x383ef300000001,
  out = 0x1,
  uout = 0x0,
  uout_size = 0,
  callback = 0x42,
  cb_timeout_work = {
    work = {
      data = {
        counter = 0
      },
0条评论
作者已关闭评论
李****堃
126文章数
2粉丝数
李****堃
126 文章 | 2 粉丝
原创

【问题实操】银河高级服务器操作系统实例分享,开机之后反复重启

2024-10-29 09:17:25
4
0

1.服务器环境以及配置

image.png

2.问题现象描述

服务器开机之后反复重启,用户现场已经收集了日志,需要分析日志。

3.问题分析

3.1 查看系统环境

cat ./sos_commands/networking/ethtool_-i_enp6s0f1
driver: mlx5_core
version: 5.0-0
firmware-version: 14.29.2002 (MT_2420110004)
expansion-rom-version:
bus-info: 0000:06:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

3.2 vmcore-dmesg分析

分析127.0.0.1-2023-09-19-07:58:37的vmcore-dmesg

[1865270.773139] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865271.773142] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865272.773093] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865273.773195] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865274.773168] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865275.773147] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865276.773151] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865277.773127] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865278.773147] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865279.773174] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865280.773149] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865281.773120] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865282.773112] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865283.773139] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865284.773150] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865285.773097] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865286.773067] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry
[1865287.773095] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1026764): failed to allocate command entry

分析127.0.0.1-2023-10-29-14:17:27的vmcore-dmesg

[3478639.193325] mlx5_core 0000:06:00.0: wait_func:1034:(pid 4068733): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193333] mlx5_core 0000:06:00.0: wait_func:1034:(pid 4166044): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193334] mlx5_core 0000:06:00.0: wait_func:1034:(pid 2812292): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193342] mlx5_core 0000:06:00.0: wait_func:1034:(pid 3401828): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193344] mlx5_core 0000:06:00.0: wait_func:1034:(pid 3822606): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193354] mlx5_core 0000:06:00.0: wait_func:1034:(pid 188898): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.193363] mlx5_core 0000:06:00.0: wait_func:1034:(pid 3288787): MODIFY_CQ(0x403) timeout. Will cause a leak of a command resource
[3478639.194469] mlx5_core 0000:06:00.0: wait_func:1038:(pid 192618): MODIFY_CQ(0x403) canceled on out of queue timeout.
[3478639.194487] mlx5_core 0000:06:00.0: wait_func:1038:(pid 192616): MODIFY_CQ(0x403) canceled on out of queue timeout.
[3478639.194495] mlx5_core 0000:06:00.0: wait_func:1038:(pid 192615): MODIFY_CQ(0x403) canceled on out of queue timeout.
[3478639.194505] mlx5_core 0000:06:00.0: wait_func:1038:(pid 192613): MODIFY_CQ(0x403) canceled on out of queue timeout.
[3478639.194515] mlx5_core 0000:06:00.0: wait_func:1038:(pid 192610): MODIFY_CQ(0x403) canceled on out of queue timeout.
[3478640.183135] mlx5_core 0000:06:00.0: cmd_work_handler:906:(pid 1268656): failed to allocate command entry

从日志可见,不断打印failed to allocate command entry,可见该打印来自于drivers/net/ethernet/mellanox/mlx5/core/cmd.c文件的cmd_work_handler函数。
image.png
分析异常打印时的函数调用流程如下所示:

-> mlx5\_cmd\_exec
 -> cmd\_exec
  -> mlx5\_cmd\_invoke
   -> cmd\_work\_handler
    -> alloc\_ent
     -> find\_first\_bit
static int alloc_ent(struct mlx5_cmd *cmd)
{
    unsigned long flags;
    int ret;

    spin_lock_irqsave(&cmd->alloc_lock, flags);
    ret = find_first_bit(&cmd->bitmask, cmd->max_reg_cmds);
    if (ret < cmd->max_reg_cmds)
        clear_bit(ret, &cmd->bitmask);
    spin_unlock_irqrestore(&cmd->alloc_lock, flags);

    return ret < cmd->max_reg_cmds ? ret : -ENOMEM;
}
/*
 * Find the first set bit in a memory region.
 */
unsigned long find_first_bit(const unsigned long *addr, unsigned long size)
{
    unsigned long idx;

    for (idx = 0; idx * BITS_PER_LONG < size; idx++) {
        if (addr[idx])
            return min(idx * BITS_PER_LONG + __ffs(addr[idx]), size);
    }

    return size;
}
static void free_ent(struct mlx5_cmd *cmd, int idx)
{
    unsigned long flags;

    spin_lock_irqsave(&cmd->alloc_lock, flags);
    set_bit(idx, &cmd->bitmask);
    spin_unlock_irqrestore(&cmd->alloc_lock, flags);
}

从上述代码可见,通过cmd->bitmask进行标记是否有还可以申请,cmd->bitmask的初始化如下,假设获取出来的cmd->log_sz=0x80,则cmd->max_reg_cmds=0xFF,cmd->bitmask=0x1FD。应该是cmd->bitmask初始化的时候有多少bit位为1,就可以alloc_ent多少次。

int mlx5_cmd_init(struct mlx5_core_dev *dev)
{
    int size = sizeof(struct mlx5_cmd_prot_block);
    int align = roundup_pow_of_two(size);
    struct mlx5_cmd *cmd = &dev->cmd;
    u32 cmd_h, cmd_l;
    u16 cmd_if_rev;
    int err;
    int i;

    memset(cmd, 0, sizeof(*cmd));
    cmd_if_rev = cmdif_rev(dev);
    if (cmd_if_rev != CMD_IF_REV) {
        dev_err(&dev->pdev->dev,
            "Driver cmdif rev(%d) differs from firmware's(%d)\n",
            CMD_IF_REV, cmd_if_rev);
        return -EINVAL;
    }

    cmd->pool = dma_pool_create("mlx5_cmd", &dev->pdev->dev, size, align, 0);
    if (!cmd->pool)
        return -ENOMEM;
    err = alloc_cmd_page(dev, cmd);
    if (err)
        goto err_free_pool;

    cmd_l = ioread32be(&dev->iseg->cmdq_addr_l_sz) & 0xff;
    cmd->log_sz = cmd_l >> 4 & 0xf;
    cmd->log_stride = cmd_l & 0xf;
    if (1 << cmd->log_sz > MLX5_MAX_COMMANDS) {
        dev_err(&dev->pdev->dev, "firmware reports too many outstanding commands %d\n",
            1 << cmd->log_sz);
        err = -EINVAL;
        goto err_free_page;
    }

    if (cmd->log_sz + cmd->log_stride > MLX5_ADAPTER_PAGE_SHIFT) {
        dev_err(&dev->pdev->dev, "command queue size overflow\n");
        err = -EINVAL;
        goto err_free_page;
    }

    cmd->checksum_disabled = 1;
    cmd->max_reg_cmds = (1 << cmd->log_sz) - 1;
    cmd->bitmask = (1UL << cmd->max_reg_cmds) - 1;

执行cmd_exec的时候,会去调用alloc_ent,找到cmd->bitmask中最近为1的位,并调用clear_bit清除为0;cmd传到固件执行完以后,要去释放command entry,调用free_ent,将对应位重新标志为1。所以出现问题,有两种可能性:

1、当cmd在传到固件,一直未能执行完,而一直有cmd_exec执行,被积压,导致cmd->bitmask的位使用完,cmd->bitmask所有位被置为了0。alloc_ent无法申请到cmd->bitmask中的标识位。

2、cmd_exec执行完后,由于某种原因,没能调用free_ent,导致cmd->bitmask的所有位被使用完。

3.3 vmcore分析

日志中提示Internal error: SP/PC alignment exception,PC寄存器出现了未对齐的打印。

[3478640.184122] Internal error: SP/PC alignment exception: 8a000000 [#1] SMP
[3478640.184720] Modules linked in: fuse tun bridge stp llc fccore(POE) fchook(OE) bonding ip_set_hash_ip ip_set nfnetlink rfkill ipmi_ssif vfat fat rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser rdma_cm ib_umad iw_cm ib_ipoib libiscsi scsi_transport_iscsi ib_cm mlx5_ib joydev ib_uverbs ib_core ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel mlx5_core megaraid_sas igb ast mlxfw dm_mirror dm_region_hash dm_log [last unloaded: ip_tables]
[3478640.188127] Process kworker/u128:0 (pid: 1268656, stack limit = 0x0000000074e0af67)
[3478640.188820] CPU: 15 PID: 1268656 Comm: kworker/u128:0 Kdump: loaded Tainted: P           OE     4.19.90-52.22.v2207.ky10.aarch64 #1
[3478640.189835] Source Version: 52f32c50138f4eec54ab90ce9775f0d75e1d6968
[3478640.190505] Hardware name: HIKVISION DS-V/N/A, BIOS 601FBE02HK 2020/04/20
[3478640.191173] Workqueue: mlx5_cmd_0000:06:00.0 cmd_work_handler [mlx5_core]
[3478640.191790] pstate: 40000005 (nZcv daif -PAN -UAO)
[3478640.192235] pc : 0x42
[3478640.192539] lr : cmd_work_handler+0x434/0x578 [mlx5_core]
[3478640.193029] sp : ffffda857e113d20
[3478640.193348] x29: ffffda857e113d20 x28: ffffda8553b4c400
[3478640.193838] x27: 0000000000000000 x26: ffff5607dffd1000
[3478640.194341] x25: 0000000000000000 x24: ffffda8511e001c8
[3478640.194830] x23: ffffda8511e000e0 x22: ffffda8511e00160
[3478640.195345] x21: ffffda8553b4c548 x20: 0000000000000001
[3478640.195840] x19: ffffda8511e001d8 x18: 0000000000000001
[3478640.196332] x17: 0000fffc58585308 x16: ffff5607def87070
[3478640.196812] x15: ffffffffffffffff x14: 742064656c696166
[3478640.197283] x13: 203a293635363836 x12: 323120646970283a
[3478640.197769] x11: 3630393a72656c64 x10: ffff5607dffd5b70
[3478640.198278] x9 : 5f646d63203a302e x8 : ffff5607deee95d8
[3478640.198769] x7 : ffff5607e07230f8 x6 : 000000000000056b
[3478640.199307] x5 : 0000000000000000 x4 : 0000000000000000
[3478640.199801] x3 : 0000000000000000 x2 : 0000000000000042
[3478640.200358] x1 : 0000000000000000 x0 : 00000000fffffff5
[3478640.200835] Call trace:
[3478640.201084]  0x42
[3478640.201307]  process_one_work+0x1fc/0x4a8
[3478640.201705]  worker_thread+0x50/0x4d8
[3478640.202068]  kthread+0x134/0x138
[3478640.202386]  ret_from_fork+0x10/0x18
[3478640.202781] Code: bad PC value
[3478640.203165] SMP: stopping secondary CPUs
[3478640.205196] Starting crashdump kernel...

crash /usr/lib/debug/lib/modules/4.19.90-52.22.v2207.ky10.aarch64/vmlinux vmcore分析vmcore

LOAD AVERAGE: 102.08, 41.38, 22.56
       TASKS: 7790
    NODENAME: localhost.localdomain
     RELEASE: 4.19.90-52.22.v2207.ky10.aarch64
     VERSION: #1 SMP Tue Mar 14 11:52:45 CST 2023
     MACHINE: aarch64  (unknown Mhz)
      MEMORY: 126.4 GB
       PANIC: ""
         PID: 1268656
     COMMAND: "kworker/u128:0"
        TASK: ffffdc8435076a00  [THREAD_INFO: ffffdc8435076a00]
         CPU: 15
       STATE: TASK_RUNNING (PANIC)

crash> bt
PID: 1268656  TASK: ffffdc8435076a00  CPU: 15  COMMAND: "kworker/u128:0"
 #0 [ffffda857e1138c0] machine_kexec at ffff5607de7b2e8c
 #1 [ffffda857e113920] __crash_kexec at ffff5607de8ca148
 #2 [ffffda857e113a90] crash_kexec at ffff5607de8ca258
 #3 [ffffda857e113ac0] die at ffff5607de79f65c
 #4 [ffffda857e113b00] arm64_notify_die at ffff5607de79f964
 #5 [ffffda857e113b30] do_sp_pc_abort at ffff5607de7914f8
 #6 [ffffda857e113d10] el1_pc at ffff5607de792f94
 #7 [ffffda857e113d20] (null) at 3e
 #8 [ffffda857e113db0] process_one_work at ffff5607de8226f8
 #9 [ffffda857e113e00] worker_thread at ffff5607de8229f4
#10 [ffffda857e113e70] kthread at ffff5607de82a1e0

根据vmcore-dmesg中,pc : 0x42 lr : cmd_work_handler+0x434/0x578 [mlx5_core],查看lr寄存器执行的位置可见执行 blr x2的时候触发了空指针。从而可知问题发生在drivers/net/ethernet/mellanox/mlx5/core/cmd.c: 908

crash> mod -s mlx5_core
crash> dis -rl cmd_work_handler+0x434 |tail -n 20
0xffff56077bcf6808 <cmd_work_handler+1016>:     mrs     x1, sp_el0
/usr/src/debug/kernel-4.19.90/linux-4.19.90-52.22.v2207.ky10.aarch64/drivers/net/ethernet/mellanox/mlx5/core/cmd.c: 906
0xffff56077bcf680c <cmd_work_handler+1020>:     ldr     x0, [x22,#-128]
0xffff56077bcf6810 <cmd_work_handler+1024>:     adrp    x2, 0xffff56077bd70000
0xffff56077bcf6814 <cmd_work_handler+1028>:     ldr     w4, [x1,#1712]
0xffff56077bcf6818 <cmd_work_handler+1032>:     add     x2, x2, #0xe50
0xffff56077bcf681c <cmd_work_handler+1036>:     add     x2, x2, #0x7c0
0xffff56077bcf6820 <cmd_work_handler+1040>:     mov     w3, #0x38a                      // #906
0xffff56077bcf6824 <cmd_work_handler+1044>:     adrp    x1, 0xffff56077bd7e000
0xffff56077bcf6828 <cmd_work_handler+1048>:     add     x1, x1, #0xcb0
0xffff56077bcf682c <cmd_work_handler+1052>:     bl      0xffff56077bd65e60 <plt:_dev_err>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-52.22.v2207.ky10.aarch64/drivers/net/ethernet/mellanox/mlx5/core/cmd.c: 907
0xffff56077bcf6830 <cmd_work_handler+1056>:     ldr     x2, [x28,#40]
0xffff56077bcf6834 <cmd_work_handler+1060>:     cbz     x2, 0xffff56077bcf6970 <cmd_work_handler+1376>
/usr/src/debug/kernel-4.19.90/linux-4.19.90-52.22.v2207.ky10.aarch64/drivers/net/ethernet/mellanox/mlx5/core/cmd.c: 908
0xffff56077bcf6838 <cmd_work_handler+1064>:     ldr     x1, [x21,#-88]
0xffff56077bcf683c <cmd_work_handler+1068>:     mov     w0, #0xfffffff5                 // #-11
0xffff56077bcf6840 <cmd_work_handler+1072>:     blr     x2
/usr/src/debug/kernel-4.19.90/linux-4.19.90-52.22.v2207.ky10.aarch64/drivers/net/ethernet/mellanox/mlx5/core/cmd.c: 909
0xffff56077bcf6844 <cmd_work_handler+1076>:     ldr     x1, [x28,#16]
crash>

image.png
struct mlx5_cmd_work_ent是一个自定义的结构体,它包含了额外的字段和数据,用于处理MLX5驱动程序中的命令工作项。可见callback是一个有问题的值,该值出现问题应是网卡固件导致。

crash> struct mlx5_cmd_work_ent 0xffffda8553b4c400
struct mlx5_cmd_work_ent {
  state = 1,
  in = 0x383ef300000001,
  out = 0x1,
  uout = 0x0,
  uout_size = 0,
  callback = 0x42,
  cb_timeout_work = {
    work = {
      data = {
        counter = 0
      },
文章来自个人专栏
kk
126 文章 | 1 订阅
0条评论
作者已关闭评论
作者已关闭评论
1
0