STREAM: Investigating -DSTREAM_ARRAY_SIZE

2022-12-30 11:03:24

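Test plan

In this experiment, all on the same server:

  1. If STREAM_ARRAY_SIZE is smaller than 4 times the L3 cache, what does the memory test report?
  2. If STREAM_ARRAY_SIZE equals 4 times the L3 cache, what does the memory test report?
  3. If STREAM_ARRAY_SIZE keeps growing until it is far larger than 4 times the L3 cache, how much do the results change?

Comparing the results of these three scenarios shows how STREAM_ARRAY_SIZE affects the memory test and which value to recommend, so that the results are more objective and accurate.

Formula

According to explanations found online and the comments in the source code: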

STREAM requires different amounts of memory to run on different systems, depending on both the system cache size(s) and the granularity of the system timer.
    You should adjust the value of 'STREAM_ARRAY_SIZE' (below) to meet *both* of the following criteria:
(a) Each array must be at least 4 times the size of the available cache memory. I don't worry about the difference between 10^6 and 2^20, so in practice the minimum array size is about 3.8 times the cache size.
    Example 1: One Xeon E3 with 8 MB L3 cache
        STREAM_ARRAY_SIZE should be >= 4 million, giving
        an array size of 30.5 MB and a total memory requirement
        of 91.5 MB.  
    Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP)
        STREAM_ARRAY_SIZE should be >= 20 million, giving
        an array size of 153 MB and a total memory requirement
        of 458 MB.  
(b) The size should be large enough so that the 'timing calibration'
    output by the program is at least 20 clock-ticks.  
    Example: most versions of Windows have a 10 millisecond timer
        granularity.  20 "ticks" at 10 ms/tic is 200 milliseconds.
        If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec.
        This means the each array must be at least 1 GB, or 128M elements.
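Criterion (b) can be worked through numerically. The sketch below is only an illustration and is not part of STREAM: the function name `min_elements_for_timer` and its parameters are hypothetical. It assumes 8-byte (double) elements, assumes that one Copy pass moves two arrays' worth of data (one read and one write), and uses decimal units.

```python
def min_elements_for_timer(bandwidth_bytes_per_s, tick_ms,
                           min_ticks=20, bytes_per_element=8):
    """Smallest per-array element count so that one Copy pass lasts
    at least `min_ticks` timer ticks (hypothetical helper)."""
    # Shortest duration the timer can measure reliably, in milliseconds.
    min_duration_ms = min_ticks * tick_ms
    # Bytes the memory system moves in that time.
    bytes_moved = bandwidth_bytes_per_s * min_duration_ms // 1000
    # Copy reads one array and writes another, so each array is half of that.
    bytes_per_array = -(-bytes_moved // 2)               # ceiling division
    return -(-bytes_per_array // bytes_per_element)      # ceiling division


# The Windows example from the quoted comment: 10 ms ticks, 10 GB/s.
print(min_elements_for_timer(10_000_000_000, 10))  # 125000000
```

In decimal units this gives 125,000,000 elements per array; the "128M elements" in the quoted comment corresponds to 2^30 bytes divided by 8.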

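The formula is: STREAM_ARRAY_SIZE = L3 cache (MB) x 4 x 1,000,000 x sockets / 8

The division by 8 converts bytes into 8-byte (double) array elements. Plugging in the figures from the examples above:

(a) With an 8 MB L3 cache and 1 socket, STREAM_ARRAY_SIZE = 8 x 4 x 1,000,000 x 1 / 8 = 4,000,000

(b) With a 20 MB L3 cache and 2 sockets, STREAM_ARRAY_SIZE = 20 x 4 x 1,000,000 x 2 / 8 = 20,000,000

Both results match the figures in the examples.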
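The same arithmetic can be written as a small helper. This is a minimal sketch of the formula above, not code from STREAM: the name `stream_array_size` is hypothetical, and the `multiplier` parameter is an assumption added so that factors other than 4 (such as 4.1) can be tried.

```python
def stream_array_size(l3_cache_mb, sockets, multiplier=4, bytes_per_element=8):
    """Element count per array so that each array is `multiplier` times
    the combined L3 cache (hypothetical helper, decimal megabytes)."""
    cache_bytes = l3_cache_mb * 1_000_000
    return round(cache_bytes * multiplier * sockets / bytes_per_element)


print(stream_array_size(8, 1))    # 4000000  (Example 1)
print(stream_array_size(20, 2))   # 20000000 (Example 2)
```

The result is what gets passed to the compiler as -DSTREAM_ARRAY_SIZE=<value>.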

At this point it looks as if we could simply apply the formula and start running STREAM.

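Question:

In the spirit of splitting hairs, one small question remains:

Once STREAM_ARRAY_SIZE is above 4 times the L3 cache, does increasing it further change the results much? Does the value have to be set at exactly 4 times the L3 cache?
Every run consumes a fixed amount of memory, so what happens when the arrays grow too large for the memory that is available?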
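To reason about the memory question it helps to estimate how much memory a given array size needs. The sketch below assumes the standard STREAM layout of three arrays of 8-byte doubles, which is consistent with the "total memory requirement" figures in the quoted comment. The function names are hypothetical.

```python
def stream_footprint_bytes(array_size, bytes_per_element=8, num_arrays=3):
    """Approximate memory needed for STREAM's arrays, in bytes
    (hypothetical helper; ignores the program's other small allocations)."""
    return array_size * bytes_per_element * num_arrays


def fits_in_memory(array_size, available_bytes):
    """True if the estimated footprint does not exceed the memory given."""
    return stream_footprint_bytes(array_size) <= available_bytes


# Example 1 from the quoted comment: 4 million elements.
print(stream_footprint_bytes(4_000_000))  # 96000000 bytes, about 91.5 MiB
```

By the same estimate, 50,000,000,000 elements would need 1.2 x 10^12 bytes, which is more than 1024 G of physical memory.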

Test environment:

                               Physical machine     Virtual machine
                               (Ice Lake 8378C)     (4C8G, 1:1 core pinning)
L3 cache                       114 MB               64 MB
Sockets                        2                    4
ARRAY_SIZE at 4.1x L3 cache    120,000,000          131,199,999
Available physical memory      1024 G               8 G
0.6x available memory          614.4 G              4.8 G

Test command:

gcc -mtune=native -march=native -O3 -fno-pic -ffp-contract=fast -mcmodel=large -fopenmp -DSTREAM_ARRAY_SIZE=${array_size} -DNTIMES=80 stream.c -o stream.o

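The -mcmodel=large flag is needed at compile time: without it, the build fails for very large array sizes. STREAM declares its arrays statically, and the default code model cannot address that much static data.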
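When sweeping many array sizes, the compile command can be generated rather than typed by hand. The sketch below only builds and prints the command from the test above and does not run gcc. The function name `build_stream_compile_command` and the list of sizes are illustrative assumptions.

```python
import shlex


def build_stream_compile_command(array_size, ntimes=80,
                                 source="stream.c", output="stream.o"):
    """Return the gcc argument list used in this test for one array size
    (hypothetical helper; flags copied from the command above)."""
    return [
        "gcc", "-mtune=native", "-march=native", "-O3", "-fno-pic",
        "-ffp-contract=fast", "-mcmodel=large", "-fopenmp",
        f"-DSTREAM_ARRAY_SIZE={array_size}", f"-DNTIMES={ntimes}",
        source, "-o", output,
    ]


# Print one command line per array size in an illustrative sweep.
for size in (200_000_000, 300_000_000, 400_000_000):
    print(shlex.join(build_stream_compile_command(size)))
```

Each printed line can be pasted into a shell, or the list can be handed to `subprocess.run` if the build should be automated.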

Test data:

Physical machine (as = STREAM_ARRAY_SIZE; rates in MB/s, the unit STREAM prints; as_1 and as_2 are two runs at the same size):

as                        Copy        Scale       Add         Triad
200,000,000               301173      296823.7    304744.8    305986.1
300,000,000               302532.9    298080.6    308521.3    309146.6
400,000,000               298493.8    298202      306699.3    307486.2
500,000,000               300164      299726.9    307306.9    308879.8
600,000,000               299337      301497.7    307185.9    309005.9
700,000,000               299537.7    300799.8    307579.5    308036.6
1,000,000,000             300983.8    300366.9    308760.4    310090.1
2,000,000,000             288498.8    302240.4    309881      309133
4,000,000,000             290693.6    302537.5    309878.4    309914.4
10,000,000,000            289813.6    302987.8    309665      309669.5
20,000,000,000            290101.5    302704.6    309491.3    309799.8
40,000,000,000 (as_1)     288578.3    302374      309764.8    309975.3
40,000,000,000 (as_2)     288767.5    302335.7    309816.2    309967.1
50,000,000,000            segmentation fault

Deviation (Copy): (max_copy - min_copy) / max_copy x 100% = 4.64%

Deviation (Scale): (max_scale - min_scale) / max_scale x 100% = 2.03%

Deviation (Add): (max_add - min_add) / max_add x 100% = 1.65%

Deviation (Triad): (max_triad - min_triad) / max_triad x 100% = 1.32%
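The deviation figures can be recomputed from the measured rates. The sketch below applies the (max - min) / max formula to the physical-machine data. The function name `relative_spread_percent` is hypothetical; the numbers are the ones listed above.

```python
def relative_spread_percent(values):
    """(max - min) / max, expressed as a percentage."""
    highest, lowest = max(values), min(values)
    return (highest - lowest) / highest * 100


# Physical-machine rates, in the order the array sizes are listed.
physical = {
    "Copy": [301173, 302532.9, 298493.8, 300164, 299337, 299537.7, 300983.8,
             288498.8, 290693.6, 289813.6, 290101.5, 288578.3, 288767.5],
    "Scale": [296823.7, 298080.6, 298202, 299726.9, 301497.7, 300799.8,
              300366.9, 302240.4, 302537.5, 302987.8, 302704.6, 302374,
              302335.7],
    "Add": [304744.8, 308521.3, 306699.3, 307306.9, 307185.9, 307579.5,
            308760.4, 309881, 309878.4, 309665, 309491.3, 309764.8, 309816.2],
    "Triad": [305986.1, 309146.6, 307486.2, 308879.8, 309005.9, 308036.6,
              310090.1, 309133, 309914.4, 309669.5, 309799.8, 309975.3,
              309967.1],
}

for kernel, rates in physical.items():
    print(f"{kernel}: {relative_spread_percent(rates):.2f}%")
```

The printed values agree with the deviations listed above to within rounding of the last digit.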

Virtual machine (as = STREAM_ARRAY_SIZE; rates in MB/s, the unit STREAM prints; as_1, as_2, as_3 are repeated runs at the same size):

as                        Copy        Scale       Add         Triad
140,000,000 (as_1)        65921.3     47594.7     54715.1     54798.9
140,000,000 (as_2)        64040       48140       55048.5     54616.5
140,000,000 (as_3)        63892.3     48105.7     55178.7     54893.1
200,000,000 (as_1)        64075.5     46939.5     54151.6     54284.5
200,000,000 (as_2)        64183.7     47096.3     54242.7     54434.7
300,000,000 (as_1)        64189.1     45766.1     53583       53258.7
300,000,000 (as_2)        64107.8     46016.7     53667.2     53239
300,000,000               segmentation fault

Deviation (Copy): (max_copy - min_copy) / max_copy x 100% = (65921.3 - 63892.3) / 65921.3 x 100% = 3.08%

Deviation (Scale): (max_scale - min_scale) / max_scale x 100% = 4.93%

Deviation (Add): (max_add - min_add) / max_add x 100% = 2.89%

Deviation (Triad): (max_triad - min_triad) / max_triad x 100% = 3.01%

Test results:

Deviation summary:

Test      Physical machine    Virtual machine (1:1 vCPU pinning)
Copy      4.64%               3.08%
Scale     2.03%               4.93%
Add       1.65%               2.89%
Triad     1.32%               3.01%

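Conclusion

With -DSTREAM_ARRAY_SIZE set above 4.1 times the L3 cache, the STREAM results differ very little: most deviations stay within about 3%, and the largest stays under 5%. So as long as the arrays are larger than 4 times the L3 cache, the STREAM results can be trusted, and increasing STREAM_ARRAY_SIZE further only lengthens the run.

When the memory STREAM needs exceeds 0.6 times the available memory, the effect on the results is also small. Once the size grows past a certain point, however, the program fails at run time with a segmentation fault and the test does not complete.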
