啥都吃的豆芽 · March 16, 2023 · Beijing

Yitian 710 Performance Monitoring: The DDR PMU Subsystem

Preface

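In the article "Yitian 710 Performance Monitoring: CMN Flit Traffic Trace with Watchpoint Event", we showed how to use the PMU of the Yitian 710 CMN to measure cross-die and cross-socket bandwidth on the interconnect. This article shows how to use the PMU of the Yitian 710 DDR subsystem (DDR Sub-System Performance Monitoring Unit) to measure DDR bandwidth.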

1. The Yitian 710 DDR5 Subsystem

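Yitian 710 supports state-of-the-art DDR5 DRAM, providing very high memory bandwidth for cloud computing and HPC. The chip has 8 DDR5 channels, 4 on each die. Each channel serves the system's memory requests independently, and supports DDR5-4400 at 1DPC (DIMM Per Channel) and DDR5-4000 at 2DPC.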

1.1 DDR5 Architecture

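A major change in DDR5 is the new DIMM channel structure (see Channel Architecture in Fig 2). A DDR4 DIMM has a 72-bit bus made up of 64 data bits and 8 ECC bits. Each DDR5 DIMM instead has two independent sub-channels, each with a 40-bit bus: 32 data bits and 8 ECC bits. Although the total data width is the same for DDR4 and DDR5 (64 bits), the two independent sub-channels can improve memory access efficiency and reduce latency. A single channel can only read or write at any given time, whereas with the two sub-channels of DDR5 a read and a write can proceed at the same time.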

1.2 DDR5 Theoretical Bandwidth

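The theoretical bandwidth of Yitian 710 with DDR5-4000 at 2DPC is:

  • 4000 MHz × 32 bit / 8 × 8 × 2 = 128 × 10^9 × 2 bytes/s = 128 GB/s × 2 = 256 GB/s
  • That is: effective memory frequency (4000 MHz) × sub-channel width (32 bit) / 8 × sub-channels per die (8) × dies (2)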
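
The arithmetic above can be sketched in a few lines of Python. The function and parameter names are made up for illustration; the input values are the ones quoted in this section (DDR5-4000 at 2DPC, DDR5-4400 at 1DPC, 32-bit sub-channels, 8 sub-channels per die, 2 dies).

```python
def peak_bandwidth_bytes_per_s(rate_mega, subchannel_bits, subchannels_per_die, dies):
    """Peak bandwidth = transfers per second * bytes per transfer * sub-channel count."""
    bytes_per_transfer = subchannel_bits / 8
    return rate_mega * 1e6 * bytes_per_transfer * subchannels_per_die * dies


# 2DPC: DDR5-4000, 8 sub-channels per die (4 channels x 2), 2 dies
peak_2dpc = peak_bandwidth_bytes_per_s(4000, 32, 8, 2)
# 1DPC: DDR5-4400, same topology
peak_1dpc = peak_bandwidth_bytes_per_s(4400, 32, 8, 2)

print(f"2DPC DDR5-4000: {peak_2dpc / 1e9:.1f} GB/s")
print(f"1DPC DDR5-4400: {peak_1dpc / 1e9:.1f} GB/s")
```

These are upper bounds; the measured bandwidth of a single-threaded benchmark such as the one used later in this article is far lower.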

Note the difference between GB and GiB:

  • 1 GB = 1000000000 bytes (= 1000^3 B = 10^9 B)
  • 1 GiB = 1073741824 bytes (= 1024^3 B = 2^30 B).
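
As a small illustration of why the distinction matters, the sketch below converts the 256 GB/s theoretical figure into GiB/s. The helper name `gb_to_gib` is hypothetical.

```python
GB = 1000 ** 3   # 10^9 bytes
GIB = 1024 ** 3  # 2^30 bytes


def gb_to_gib(value_gb):
    """Convert a quantity expressed in GB (decimal) to GiB (binary)."""
    return value_gb * GB / GIB


print(f"256 GB/s = {gb_to_gib(256):.2f} GiB/s")
```

The same number of bytes therefore looks roughly 7% smaller when reported in GiB, which is worth keeping in mind when comparing tools that use different units.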

2. The Yitian 710 DDRSS PMU

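The Yitian 710 DDRSS implements an independent PMU for each sub-channel, used for performance and functional debugging. Each sub-channel PMU contains 16 general-purpose counters. The supported events are: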

[Figures: tables listing the events supported by the DDRSS PMU]

The bandwidth formulas are:

  • DRAM Read Bandwidth = perf_hif_rd × DDRC_WIDTH × DDRC_Freq / DDRC_Cycle
  • DRAM Write Bandwidth = (perf_hif_wr + perf_hif_rmw) × DDRC_WIDTH × DDRC_Freq / DDRC_Cycle
  • DDRC_WIDTH: 64 bytes (each counted request transfers 64 bytes)
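
A minimal sketch of these two formulas in Python follows. The function names are hypothetical, and the DDRC frequency is passed in as a parameter because this article does not state it; the 500 MHz used in the usage lines is only an illustrative assumption.

```python
DDRC_WIDTH_BYTES = 64  # from the formula above: each counted request is 64 bytes


def read_bandwidth(hif_rd, ddrc_cycles, ddrc_freq_hz):
    """DRAM read bandwidth in bytes/s for one sub-channel."""
    return hif_rd * DDRC_WIDTH_BYTES * ddrc_freq_hz / ddrc_cycles


def write_bandwidth(hif_wr, hif_rmw, ddrc_cycles, ddrc_freq_hz):
    """DRAM write bandwidth in bytes/s; read-modify-write requests count as writes."""
    return (hif_wr + hif_rmw) * DDRC_WIDTH_BYTES * ddrc_freq_hz / ddrc_cycles


# Illustrative numbers only: 1,000,000 read requests observed over
# 500,000,000 DDRC cycles, with an ASSUMED DDRC frequency of 500 MHz.
bw = read_bandwidth(1_000_000, 500_000_000, 500e6)
print(f"{bw / 1e6:.1f} MB/s")
```

Note that DDRC_Cycle / DDRC_Freq is simply the length of the measurement window in seconds, so when the window is known (for example a 1-second perf run) the formula reduces to requests × 64 bytes / seconds.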

3. Cloud-kernel Support for the DDRSS PMU

#lscpu
Architecture:          aarch64
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    1
Core(s) per socket:    128
Socket(s):             1
NUMA node(s):          2
...

The test environment has 1 socket with 2 dies, which appear as two NUMA nodes.

#numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 257416 MB
node 0 free: 187991 MB
node 1 cpus: 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 257014 MB
node 1 free: 194504 MB
node distances:
node   0   1
  0:  10  15
  1:  15  10

Each NUMA node has 256 GB of memory.

#dmidecode|grep -P -A5 "Memory\s+Device"|grep Size|grep -v Range
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: 32 GB
        Size: No Module Installed
 ...

#dmidecode -t memory | grep Speed:
        Speed: 4000 MHz
        Configured Clock Speed: 4000 MHz

The system is populated at 2DPC with 16 DIMMs in total, 8 per die, running at an effective frequency of 4000 MHz.

#ls /sys/bus/event_source/devices/ | grep drw
ali_drw_21000
ali_drw_21080
ali_drw_23000
ali_drw_23080
ali_drw_25000
ali_drw_25080
ali_drw_27000
ali_drw_27080
ali_drw_40021000
ali_drw_40021080
ali_drw_40023000
ali_drw_40023080
ali_drw_40025000
ali_drw_40025080
ali_drw_40027000
ali_drw_40027080

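With all 2DPC slots populated there are 16 PMU devices in total. ali_drw_21000 and ali_drw_21080 are the two sub-channels of the same DIMM channel on Die 0. Devices named ali_drw_2X000 and ali_drw_2X080 belong to Die 0, and devices named ali_drw_4002X000 and ali_drw_4002X080 belong to Die 1.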
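
The naming scheme can be checked programmatically. The sketch below lists the `ali_drw_*` devices from sysfs and groups them by die. Treating any hexadecimal suffix at or above 0x40000000 as Die 1 is an assumption inferred from the device listing above, not a documented rule, and the helper names are hypothetical.

```python
import os
import re

SYSFS_DIR = "/sys/bus/event_source/devices"
DIE1_BASE = 0x40000000  # ASSUMPTION: inferred from the ali_drw_4002Xxxx names above


def group_by_die(names):
    """Split ali_drw_<hex> device names into Die 0 and Die 1 lists."""
    groups = {0: [], 1: []}
    for name in sorted(names):
        match = re.fullmatch(r"ali_drw_([0-9a-fA-F]+)", name)
        if not match:
            continue  # skip unrelated PMUs
        addr = int(match.group(1), 16)
        groups[1 if addr >= DIE1_BASE else 0].append(name)
    return groups


def list_pmu_devices(sysfs_dir=SYSFS_DIR):
    """Return the PMU device names, or an empty list if sysfs is unavailable."""
    try:
        return os.listdir(sysfs_dir)
    except OSError:
        return []


if __name__ == "__main__":
    for die, devices in group_by_die(list_pmu_devices()).items():
        print(f"Die {die}: {len(devices)} DDR PMU devices: {' '.join(devices)}")
```

On the test machine described here, this would be expected to report 8 devices per die, one for each sub-channel.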

4. Verifying DDR Bandwidth Accuracy

4.1 TL;DR

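[Figure: summary table of the bandwidth results]

Bandwidth unit: MB/s

As the results show, the bandwidth reported by the DDR PMU differs from the bw_mem result by no more than 1%. For the test methodology, see "Yitian 710 Performance Monitoring: CMN Flit Traffic Trace with Watchpoint Event".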
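
The "no more than 1%" claim can be reproduced from the numbers reported in sections 4.2 to 4.5. The sketch below takes the PMU-derived bandwidth and the bw_mem bandwidth of each test case from those sections and prints the relative error; the dictionary layout and function name are illustrative choices.

```python
# (PMU-derived MB/s, bw_mem MB/s) pairs copied from sections 4.2 to 4.5
results = {
    "C0M0 rd": (20561.675, 20507.82),
    "C1M1 rd": (20516.303, 20492.53),
    "C0M0 fwr": (21969.226, 21936.50),
    "C1M1 fwr": (21966.889, 21934.51),
}


def relative_error_percent(measured, reference):
    """Relative deviation of measured from reference, in percent."""
    return abs(measured - reference) / reference * 100.0


for case, (pmu, bw_mem) in results.items():
    print(f"{case}: {relative_error_percent(pmu, bw_mem):.2f}%")
```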

4.2 C0M0 rd

# First, run bw_mem as a background workload
# numactl --cpubind=0 --membind=0 ./bw_mem 40960M rd

# Then run perf command in another console
perf stat \
  -e ali_drw_21000/perf_hif_wr/ \
  -e ali_drw_21000/perf_hif_rd/ \
  -e ali_drw_21000/perf_hif_rmw/ \
  -e ali_drw_21000/perf_cycle/ \
  -e ali_drw_21080/perf_hif_wr/ \
  -e ali_drw_21080/perf_hif_rd/ \
  -e ali_drw_21080/perf_hif_rmw/ \
  -e ali_drw_21080/perf_cycle/ \
  -e ali_drw_23000/perf_hif_wr/ \
  -e ali_drw_23000/perf_hif_rd/ \
  -e ali_drw_23000/perf_hif_rmw/ \
  -e ali_drw_23000/perf_cycle/ \
  -e ali_drw_23080/perf_hif_wr/ \
  -e ali_drw_23080/perf_hif_rd/ \
  -e ali_drw_23080/perf_hif_rmw/ \
  -e ali_drw_23080/perf_cycle/ \
  -e ali_drw_25000/perf_hif_wr/ \
  -e ali_drw_25000/perf_hif_rd/ \
  -e ali_drw_25000/perf_hif_rmw/ \
  -e ali_drw_25000/perf_cycle/ \
  -e ali_drw_25080/perf_hif_wr/ \
  -e ali_drw_25080/perf_hif_rd/ \
  -e ali_drw_25080/perf_hif_rmw/ \
  -e ali_drw_25080/perf_cycle/ \
  -e ali_drw_27000/perf_hif_wr/ \
  -e ali_drw_27000/perf_hif_rd/ \
  -e ali_drw_27000/perf_hif_rmw/ \
  -e ali_drw_27000/perf_cycle/ \
  -e ali_drw_27080/perf_hif_wr/ \
  -e ali_drw_27080/perf_hif_rd/ \
  -e ali_drw_27080/perf_hif_rmw/ \
  -e ali_drw_27080/perf_cycle/ \
  -a -- sleep 1

Performance counter stats for 'system wide':

             12398      ali_drw_21000/perf_hif_wr/
          40160751      ali_drw_21000/perf_hif_rd/
               743      ali_drw_21000/perf_hif_rmw/
         500620725      ali_drw_21000/perf_cycle/
             12252      ali_drw_21080/perf_hif_wr/
          40161013      ali_drw_21080/perf_hif_rd/
               767      ali_drw_21080/perf_hif_rmw/
         500619340      ali_drw_21080/perf_cycle/
             11960      ali_drw_23000/perf_hif_wr/
          40159522      ali_drw_23000/perf_hif_rd/
               737      ali_drw_23000/perf_hif_rmw/
         500613505      ali_drw_23000/perf_cycle/
             12044      ali_drw_23080/perf_hif_wr/
          40159066      ali_drw_23080/perf_hif_rd/
               773      ali_drw_23080/perf_hif_rmw/
         500607620      ali_drw_23080/perf_cycle/
             12698      ali_drw_25000/perf_hif_wr/
          40160138      ali_drw_25000/perf_hif_rd/
               709      ali_drw_25000/perf_hif_rmw/
         500601240      ali_drw_25000/perf_cycle/
             12521      ali_drw_25080/perf_hif_wr/
          40160169      ali_drw_25080/perf_hif_rd/
               727      ali_drw_25080/perf_hif_rmw/
         500594755      ali_drw_25080/perf_cycle/
             12171      ali_drw_27000/perf_hif_wr/
          40159404      ali_drw_27000/perf_hif_rd/
               706      ali_drw_27000/perf_hif_rmw/
         500589945      ali_drw_27000/perf_cycle/
             12290      ali_drw_27080/perf_hif_wr/
          40157620      ali_drw_27080/perf_hif_rd/
               710      ali_drw_27080/perf_hif_rmw/
         500583305      ali_drw_27080/perf_cycle/

       1.000923276 seconds time elapsed

>>> 40159522*8*64/1000/1000.0
20561.675

# Bind the CPU and memory to the same NUMA node
numactl --cpubind=0 --membind=0 ./bw_mem 40960M rd
40960.00 20507.82
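
The calculation above multiplies one sub-channel's count by 8. An alternative, sketched below, is to parse the perf output and sum the counts of all sub-channels. This sketch assumes the output format shown above and normalizes by the elapsed time rather than by DDRC_Freq / DDRC_Cycle; the function names are hypothetical. Since `perf stat` prints its report to stderr by default, the text would typically be captured with `2>` redirection.

```python
import re
from collections import defaultdict

# Matches lines such as "   40160751      ali_drw_21000/perf_hif_rd/"
LINE_RE = re.compile(r"^\s*([\d,]+)\s+(ali_drw_[0-9a-f]+)/(\w+)/")


def sum_events(perf_output):
    """Sum each event's count over all ali_drw PMU devices in the output."""
    totals = defaultdict(int)
    for line in perf_output.splitlines():
        match = LINE_RE.match(line)
        if match:
            totals[match.group(3)] += int(match.group(1).replace(",", ""))
    return dict(totals)


def read_bandwidth_mb_s(totals, elapsed_s, width_bytes=64):
    """Aggregate read bandwidth in MB/s (1 MB = 10^6 bytes)."""
    return totals["perf_hif_rd"] * width_bytes / elapsed_s / 1e6


# Two lines copied from the output above, as a short usage example
sample = """
          40160751      ali_drw_21000/perf_hif_rd/
          40161013      ali_drw_21080/perf_hif_rd/
"""
totals = sum_events(sample)
print(totals["perf_hif_rd"])
```

Summing the real counts avoids assuming that traffic is spread perfectly evenly across sub-channels, although in this test the eight counts differ by well under 0.01%.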

4.3 C1M1 rd

# First, run bw_mem as a background workload
# numactl --cpubind=1 --membind=1 ./bw_mem 40960M rd

# Then run perf command in another console
perf stat \
  -e ali_drw_40021000/perf_hif_wr/ \
  -e ali_drw_40021000/perf_hif_rd/ \
  -e ali_drw_40021000/perf_hif_rmw/ \
  -e ali_drw_40021000/perf_cycle/ \
  -e ali_drw_40021080/perf_hif_wr/ \
  -e ali_drw_40021080/perf_hif_rd/ \
  -e ali_drw_40021080/perf_hif_rmw/ \
  -e ali_drw_40021080/perf_cycle/ \
  -e ali_drw_40023000/perf_hif_wr/ \
  -e ali_drw_40023000/perf_hif_rd/ \
  -e ali_drw_40023000/perf_hif_rmw/ \
  -e ali_drw_40023000/perf_cycle/ \
  -e ali_drw_40023080/perf_hif_wr/ \
  -e ali_drw_40023080/perf_hif_rd/ \
  -e ali_drw_40023080/perf_hif_rmw/ \
  -e ali_drw_40023080/perf_cycle/ \
  -e ali_drw_40025000/perf_hif_wr/ \
  -e ali_drw_40025000/perf_hif_rd/ \
  -e ali_drw_40025000/perf_hif_rmw/ \
  -e ali_drw_40025000/perf_cycle/ \
  -e ali_drw_40025080/perf_hif_wr/ \
  -e ali_drw_40025080/perf_hif_rd/ \
  -e ali_drw_40025080/perf_hif_rmw/ \
  -e ali_drw_40025080/perf_cycle/ \
  -e ali_drw_40027000/perf_hif_wr/ \
  -e ali_drw_40027000/perf_hif_rd/ \
  -e ali_drw_40027000/perf_hif_rmw/ \
  -e ali_drw_40027000/perf_cycle/ \
  -e ali_drw_40027080/perf_hif_wr/ \
  -e ali_drw_40027080/perf_hif_rd/ \
  -e ali_drw_40027080/perf_hif_rmw/ \
  -e ali_drw_40027080/perf_cycle/ \
  -a -- sleep 1

 Performance counter stats for 'system wide':

              2329      ali_drw_40021000/perf_hif_wr/
          40071983      ali_drw_40021000/perf_hif_rd/
                58      ali_drw_40021000/perf_hif_rmw/
         500572165      ali_drw_40021000/perf_cycle/
              2374      ali_drw_40021080/perf_hif_wr/
          40071737      ali_drw_40021080/perf_hif_rd/
                39      ali_drw_40021080/perf_hif_rmw/
         500569615      ali_drw_40021080/perf_cycle/
              2330      ali_drw_40023000/perf_hif_wr/
          40071063      ali_drw_40023000/perf_hif_rd/
                74      ali_drw_40023000/perf_hif_rmw/
         500565635      ali_drw_40023000/perf_cycle/
              2372      ali_drw_40023080/perf_hif_wr/
          40070344      ali_drw_40023080/perf_hif_rd/
                54      ali_drw_40023080/perf_hif_rmw/
         500561355      ali_drw_40023080/perf_cycle/
              2362      ali_drw_40025000/perf_hif_wr/
          40070906      ali_drw_40025000/perf_hif_rd/
                45      ali_drw_40025000/perf_hif_rmw/
         500557480      ali_drw_40025000/perf_cycle/
              2385      ali_drw_40025080/perf_hif_wr/
          40070168      ali_drw_40025080/perf_hif_rd/
                46      ali_drw_40025080/perf_hif_rmw/
         500552550      ali_drw_40025080/perf_cycle/
              2333      ali_drw_40027000/perf_hif_wr/
          40069233      ali_drw_40027000/perf_hif_rd/
                28      ali_drw_40027000/perf_hif_rmw/
         500548745      ali_drw_40027000/perf_cycle/
              2211      ali_drw_40027080/perf_hif_wr/
          40068227      ali_drw_40027080/perf_hif_rd/
                30      ali_drw_40027080/perf_hif_rmw/
         500544450      ali_drw_40027080/perf_cycle/

       1.000863258 seconds time elapsed

>>> 40070906*8*64/1000/1000.0
20516.303

numactl --cpubind=1 --membind=1 ./bw_mem 40960M rd
40960.00 20492.53
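
The perf commands in sections 4.2 and 4.3 differ only in the device names, so they can be generated rather than typed by hand. The sketch below builds the same argument list for either die; the helper names are hypothetical, and the device names follow the listing in section 3.

```python
import shlex

EVENTS = ["perf_hif_wr", "perf_hif_rd", "perf_hif_rmw", "perf_cycle"]


def drw_devices(die):
    """Return the eight sub-channel PMU device names of the given die (0 or 1)."""
    prefix = "" if die == 0 else "400"
    return [
        f"ali_drw_{prefix}{channel}{sub}"
        for channel in ("21", "23", "25", "27")
        for sub in ("000", "080")
    ]


def build_perf_command(die, seconds=1):
    """Build a system-wide 'perf stat' command covering every event on one die."""
    cmd = ["perf", "stat"]
    for device in drw_devices(die):
        for event in EVENTS:
            cmd += ["-e", f"{device}/{event}/"]
    cmd += ["-a", "--", "sleep", str(seconds)]
    return cmd


# Print the command for Die 1; it can be pasted into a shell as-is.
print(shlex.join(build_perf_command(1)))
```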

4.4 C0M0 fwr

# First, run bw_mem as a background workload
# numactl --cpubind=0 --membind=0 ./bw_mem 40960M fwr

# Then run perf command in another console
perf stat \
  -e ali_drw_21000/perf_hif_wr/ \
  -e ali_drw_21000/perf_hif_rd/ \
  -e ali_drw_21000/perf_hif_rmw/ \
  -e ali_drw_21000/perf_cycle/ \
  -e ali_drw_21080/perf_hif_wr/ \
  -e ali_drw_21080/perf_hif_rd/ \
  -e ali_drw_21080/perf_hif_rmw/ \
  -e ali_drw_21080/perf_cycle/ \
  -e ali_drw_23000/perf_hif_wr/ \
  -e ali_drw_23000/perf_hif_rd/ \
  -e ali_drw_23000/perf_hif_rmw/ \
  -e ali_drw_23000/perf_cycle/ \
  -e ali_drw_23080/perf_hif_wr/ \
  -e ali_drw_23080/perf_hif_rd/ \
  -e ali_drw_23080/perf_hif_rmw/ \
  -e ali_drw_23080/perf_cycle/ \
  -e ali_drw_25000/perf_hif_wr/ \
  -e ali_drw_25000/perf_hif_rd/ \
  -e ali_drw_25000/perf_hif_rmw/ \
  -e ali_drw_25000/perf_cycle/ \
  -e ali_drw_25080/perf_hif_wr/ \
  -e ali_drw_25080/perf_hif_rd/ \
  -e ali_drw_25080/perf_hif_rmw/ \
  -e ali_drw_25080/perf_cycle/ \
  -e ali_drw_27000/perf_hif_wr/ \
  -e ali_drw_27000/perf_hif_rd/ \
  -e ali_drw_27000/perf_hif_rmw/ \
  -e ali_drw_27000/perf_cycle/ \
  -e ali_drw_27080/perf_hif_wr/ \
  -e ali_drw_27080/perf_hif_rd/ \
  -e ali_drw_27080/perf_hif_rmw/ \
  -e ali_drw_27080/perf_cycle/ \
  -a -- sleep 1

 Performance counter stats for 'system wide':

          42910737      ali_drw_21000/perf_hif_wr/
            108397      ali_drw_21000/perf_hif_rd/
               495      ali_drw_21000/perf_hif_rmw/
         500708510      ali_drw_21000/perf_cycle/
          42911223      ali_drw_21080/perf_hif_wr/
            117280      ali_drw_21080/perf_hif_rd/
               515      ali_drw_21080/perf_hif_rmw/
         500706780      ali_drw_21080/perf_cycle/
          42910038      ali_drw_23000/perf_hif_wr/
            109179      ali_drw_23000/perf_hif_rd/
               516      ali_drw_23000/perf_hif_rmw/
         500702100      ali_drw_23000/perf_cycle/
          42911620      ali_drw_23080/perf_hif_wr/
            111038      ali_drw_23080/perf_hif_rd/
               523      ali_drw_23080/perf_hif_rmw/
         500697340      ali_drw_23080/perf_cycle/
          42910435      ali_drw_25000/perf_hif_wr/
            111748      ali_drw_25000/perf_hif_rd/
               469      ali_drw_25000/perf_hif_rmw/
         500692500      ali_drw_25000/perf_cycle/
          42908786      ali_drw_25080/perf_hif_wr/
            110177      ali_drw_25080/perf_hif_rd/
               456      ali_drw_25080/perf_hif_rmw/
         500686595      ali_drw_25080/perf_cycle/
          42908903      ali_drw_27000/perf_hif_wr/
            114093      ali_drw_27000/perf_hif_rd/
               490      ali_drw_27000/perf_hif_rmw/
         500681405      ali_drw_27000/perf_cycle/
          42908156      ali_drw_27080/perf_hif_wr/
            109668      ali_drw_27080/perf_hif_rd/
               489      ali_drw_27080/perf_hif_rmw/
         500676420      ali_drw_27080/perf_cycle/

       1.001100811 seconds time elapsed

>>> (42908156+489)*8*64/1000/1000.0
21969.226

numactl --cpubind=0 --membind=0 ./bw_mem 40960M fwr
40960.00 21936.50
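
For the write case, perf_hif_wr and perf_hif_rmw are added together, as the formula in section 2 specifies. The sketch below sums the counts of all eight Die 0 sub-channels from the output above instead of scaling one sub-channel by 8. Dividing by the elapsed time reported by perf is an assumption of this sketch; the calculation above simply treats the window as 1 second.

```python
# perf_hif_wr and perf_hif_rmw counts of the eight Die 0 sub-channels,
# copied from the perf output above.
hif_wr = [42910737, 42911223, 42910038, 42911620,
          42910435, 42908786, 42908903, 42908156]
hif_rmw = [495, 515, 516, 523, 469, 456, 490, 489]
elapsed_s = 1.001100811  # "seconds time elapsed" from the perf output
DDRC_WIDTH_BYTES = 64


def write_bandwidth_mb_s(wr_counts, rmw_counts, seconds):
    """Aggregate write bandwidth in MB/s (1 MB = 10^6 bytes)."""
    requests = sum(wr_counts) + sum(rmw_counts)
    return requests * DDRC_WIDTH_BYTES / seconds / 1e6


total_requests = sum(hif_wr) + sum(hif_rmw)
bandwidth = write_bandwidth_mb_s(hif_wr, hif_rmw, elapsed_s)
print(total_requests)
print(f"{bandwidth:.1f} MB/s")
```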

4.5 C1M1 fwr

# First, run bw_mem as a background workload
# numactl --cpubind=1 --membind=1 ./bw_mem 40960M fwr

# Then run perf command in another console
perf stat \
  -e ali_drw_40021000/perf_hif_wr/ \
  -e ali_drw_40021000/perf_hif_rd/ \
  -e ali_drw_40021000/perf_hif_rmw/ \
  -e ali_drw_40021000/perf_cycle/ \
  -e ali_drw_40021080/perf_hif_wr/ \
  -e ali_drw_40021080/perf_hif_rd/ \
  -e ali_drw_40021080/perf_hif_rmw/ \
  -e ali_drw_40021080/perf_cycle/ \
  -e ali_drw_40023000/perf_hif_wr/ \
  -e ali_drw_40023000/perf_hif_rd/ \
  -e ali_drw_40023000/perf_hif_rmw/ \
  -e ali_drw_40023000/perf_cycle/ \
  -e ali_drw_40023080/perf_hif_wr/ \
  -e ali_drw_40023080/perf_hif_rd/ \
  -e ali_drw_40023080/perf_hif_rmw/ \
  -e ali_drw_40023080/perf_cycle/ \
  -e ali_drw_40025000/perf_hif_wr/ \
  -e ali_drw_40025000/perf_hif_rd/ \
  -e ali_drw_40025000/perf_hif_rmw/ \
  -e ali_drw_40025000/perf_cycle/ \
  -e ali_drw_40025080/perf_hif_wr/ \
  -e ali_drw_40025080/perf_hif_rd/ \
  -e ali_drw_40025080/perf_hif_rmw/ \
  -e ali_drw_40025080/perf_cycle/ \
  -e ali_drw_40027000/perf_hif_wr/ \
  -e ali_drw_40027000/perf_hif_rd/ \
  -e ali_drw_40027000/perf_hif_rmw/ \
  -e ali_drw_40027000/perf_cycle/ \
  -e ali_drw_40027080/perf_hif_wr/ \
  -e ali_drw_40027080/perf_hif_rd/ \
  -e ali_drw_40027080/perf_hif_rmw/ \
  -e ali_drw_40027080/perf_cycle/ \
  -a -- sleep 1

 Performance counter stats for 'system wide':

          42906048      ali_drw_40021000/perf_hif_wr/
             33939      ali_drw_40021000/perf_hif_rd/
                76      ali_drw_40021000/perf_hif_rmw/
         500629355      ali_drw_40021000/perf_cycle/
          42905967      ali_drw_40021080/perf_hif_wr/
             34018      ali_drw_40021080/perf_hif_rd/
                63      ali_drw_40021080/perf_hif_rmw/
         500631900      ali_drw_40021080/perf_cycle/
          42905422      ali_drw_40023000/perf_hif_wr/
             33843      ali_drw_40023000/perf_hif_rd/
                75      ali_drw_40023000/perf_hif_rmw/
         500628540      ali_drw_40023000/perf_cycle/
          42905547      ali_drw_40023080/perf_hif_wr/
             33858      ali_drw_40023080/perf_hif_rd/
                68      ali_drw_40023080/perf_hif_rmw/
         500623970      ali_drw_40023080/perf_cycle/
          42905230      ali_drw_40025000/perf_hif_wr/
             34028      ali_drw_40025000/perf_hif_rd/
                56      ali_drw_40025000/perf_hif_rmw/
         500620630      ali_drw_40025000/perf_cycle/
          42904734      ali_drw_40025080/perf_hif_wr/
             34141      ali_drw_40025080/perf_hif_rd/
                61      ali_drw_40025080/perf_hif_rmw/
         500615840      ali_drw_40025080/perf_cycle/
          42903390      ali_drw_40027000/perf_hif_wr/
             33712      ali_drw_40027000/perf_hif_rd/
                84      ali_drw_40027000/perf_hif_rmw/
         500610635      ali_drw_40027000/perf_cycle/
          42903975      ali_drw_40027080/perf_hif_wr/
             33916      ali_drw_40027080/perf_hif_rd/
               106      ali_drw_40027080/perf_hif_rmw/
         500606645      ali_drw_40027080/perf_cycle/

       1.000953335 seconds time elapsed

>>> (42903975+106)*8*64/1000/1000.0
21966.889

#numactl --cpubind=1 --membind=1 ./bw_mem 40960M fwr
40960.00 21934.51

Source: OpenAnolis community (龙蜥社区)