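Test plan
This experiment runs on a single server and asks three questions:
- How do the memory test results behave when STREAM_ARRAY_SIZE is smaller than 4 times the L3 cache?
- What are the results when STREAM_ARRAY_SIZE equals 4 times the L3 cache?
- How much do the results change when STREAM_ARRAY_SIZE keeps growing, far beyond 4 times the L3 cache?
Comparing these three scenarios shows how STREAM_ARRAY_SIZE affects the memory test and which value to recommend, so that the results are more objective and accurate.
Formula
According to explanations found online and the comments in the STREAM source code: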
STREAM requires different amounts of memory to run on different systems, depending on both the system cache size(s) and the granularity of the system timer.
You should adjust the value of 'STREAM_ARRAY_SIZE' (below) to meet *both* of the following criteria:
(a) Each array must be at least 4 times the size of the available cache memory. I don't worry about the difference between 10^6 and 2^20, so in practice the minimum array size is about 3.8 times the cache size.
Example 1: One Xeon E3 with 8 MB L3 cache
STREAM_ARRAY_SIZE should be >= 4 million, giving
an array size of 30.5 MB and a total memory requirement
of 91.5 MB.
Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP)
STREAM_ARRAY_SIZE should be >= 20 million, giving
an array size of 153 MB and a total memory requirement
of 458 MB.
(b) The size should be large enough so that the 'timing calibration'
output by the program is at least 20 clock-ticks.
Example: most versions of Windows have a 10 millisecond timer
granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds.
If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec.
This means the each array must be at least 1 GB, or 128M elements.
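The arithmetic in criterion (b) can be sketched as a small helper. This is only an illustration of the quoted example: the function name and parameters are made up here, the bandwidth is taken in binary units so that "128M" comes out exactly, and halving the traffic between two arrays is my reading of "it moves 2 GB ... each array must be at least 1 GB".

```python
def min_elements_for_timer(bandwidth_bytes_per_s, tick_ms, ticks=20, bytes_per_element=8):
    """Smallest per-array element count so that one pass lasts `ticks` timer ticks.

    Hypothetical helper that mirrors the worked example in criterion (b).
    """
    # Bytes moved during the required interval (integer arithmetic, ms -> s).
    bytes_moved = bandwidth_bytes_per_s * ticks * tick_ms // 1000
    # Assumption: the traffic is split evenly between two arrays
    # (one read, one written), as the quoted example implies.
    bytes_per_array = bytes_moved // 2
    return bytes_per_array // bytes_per_element


GIB = 2**30
# 10 GB/s chip, 10 ms timer granularity, 20 ticks
print(min_elements_for_timer(10 * GIB, tick_ms=10))
```

On a system with a finer timer the same helper gives a much smaller bound, which is why criterion (a) is usually the one that decides the array size.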
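This gives the formula: STREAM_ARRAY_SIZE = L3 cache (MB) x 4 x 1,000,000 x sockets / 8
The division by 8 converts bytes to elements, since each array element is an 8-byte double. The value is passed to the compiler as -DSTREAM_ARRAY_SIZE=..., where -D is gcc's macro definition flag.
Plugging the data from the examples into the formula:
(a) With an 8 MB L3 cache and 1 socket: STREAM_ARRAY_SIZE = 8 x 4 x 1,000,000 x 1 / 8 = 4,000,000
(b) With a 20 MB L3 cache and 2 sockets: STREAM_ARRAY_SIZE = 20 x 4 x 1,000,000 x 2 / 8 = 20,000,000
Both results match the numbers in the examples.
At this point it looks as if we could simply apply the formula and start the STREAM test.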
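The formula is easy to wrap in a helper, which avoids slips when the cache size or socket count changes. A minimal sketch, assuming 1 MB = 10^6 bytes and 8-byte elements as in the formula; the function name is hypothetical.

```python
def stream_array_size(l3_cache_mb, sockets, factor=4, bytes_per_element=8):
    """Minimum STREAM_ARRAY_SIZE (in elements) according to the formula.

    Hypothetical helper: L3 cache (MB) x factor x 1,000,000 x sockets / 8.
    """
    return l3_cache_mb * factor * 1_000_000 * sockets // bytes_per_element


# The two examples from the STREAM source comment
print(stream_array_size(8, 1))
print(stream_array_size(20, 2))
```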
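Question:
In the spirit of splitting hairs, one small question remains:
Once STREAM_ARRAY_SIZE is above 4 times the L3 cache, does increasing it further change the results much? Does the value have to be exactly 4 times the L3 cache?
Every run consumes a fixed amount of memory. What happens when the array is so large that memory runs out?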
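How much memory a run consumes follows directly from the array size: STREAM allocates three arrays of 8-byte elements. The sketch below reproduces the memory figures from the two examples in the source comment; the helper name is hypothetical, and "MB" is computed as 2^20 bytes because that is what matches the quoted numbers.

```python
def stream_memory_mib(array_size, bytes_per_element=8, arrays=3):
    """Return (size of one array, total for all arrays) in MiB.

    Hypothetical helper; ignores the small OFFSET padding and everything
    else the process allocates.
    """
    per_array = array_size * bytes_per_element / 2**20
    return per_array, per_array * arrays


for size in (4_000_000, 20_000_000):
    per_array, total = stream_memory_mib(size)
    print(f"{size}: {per_array:.1f} MiB per array, {total:.1f} MiB total")
```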
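Test environment:
| Test platform | L3 cache | Sockets | ARRAY_SIZE at 4.1x L3 cache | Available physical memory | 0.6x available memory |
| --- | --- | --- | --- | --- | --- |
| Physical machine, Icelake 8378C | 114 MB | 2 | 120,000,000 | 1024 GB | 614.4 GB |
| Virtual machine, 4C8G, vCPUs pinned 1:1 | 64 MB | 4 | 131,199,999 | 8 GB | 4.8 GB |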
Test command:
gcc -mtune=native -march=native -O3 -fno-pic -ffp-contract=fast -mcmodel=large -fopenmp -DSTREAM_ARRAY_SIZE=${array_size} -DNTIMES=80 stream.c -o stream.o
The -mcmodel=large option is required so that the build does not fail when a very large array size is used.
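For a sweep over many array sizes it helps to generate the build command instead of editing it by hand. The sketch below only assembles the command string with the flags shown above; the function name and its defaults are assumptions, and actually running the build is left to the caller.

```python
import shlex


def stream_build_command(array_size, ntimes=80, source="stream.c", output="stream.o"):
    """Assemble the gcc command line for one array size (hypothetical helper)."""
    args = [
        "gcc", "-mtune=native", "-march=native", "-O3", "-fno-pic",
        "-ffp-contract=fast", "-mcmodel=large", "-fopenmp",
        f"-DSTREAM_ARRAY_SIZE={array_size}", f"-DNTIMES={ntimes}",
        source, "-o", output,
    ]
    # shlex.join quotes arguments where the shell would need it.
    return shlex.join(args)


for size in (200_000_000, 300_000_000):
    print(stream_build_command(size))
```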
Test data:
Physical machine (as = STREAM_ARRAY_SIZE, rates in MB/s as reported by STREAM):
| as | Copy | Scale | Add | Triad |
| --- | --- | --- | --- | --- |
| 200,000,000 | 301173 | 296823.7 | 304744.8 | 305986.1 |
| 300,000,000 | 302532.9 | 298080.6 | 308521.3 | 309146.6 |
| 400,000,000 | 298493.8 | 298202 | 306699.3 | 307486.2 |
| 500,000,000 | 300164 | 299726.9 | 307306.9 | 308879.8 |
| 600,000,000 | 299337 | 301497.7 | 307185.9 | 309005.9 |
| 700,000,000 | 299537.7 | 300799.8 | 307579.5 | 308036.6 |
| 1,000,000,000 | 300983.8 | 300366.9 | 308760.4 | 310090.1 |
| 2,000,000,000 | 288498.8 | 302240.4 | 309881 | 309133 |
| 4,000,000,000 | 290693.6 | 302537.5 | 309878.4 | 309914.4 |
| 10,000,000,000 | 289813.6 | 302987.8 | 309665 | 309669.5 |
| 20,000,000,000 | 290101.5 | 302704.6 | 309491.3 | 309799.8 |
| 40,000,000,000 (run 1) | 288578.3 | 302374 | 309764.8 | 309975.3 |
| 40,000,000,000 (run 2) | 288767.5 | 302335.7 | 309816.2 | 309967.1 |
| 50,000,000,000 | segmentation fault | | | |
Copy deviation: (max_copy - min_copy) / max_copy x 100% = 4.64%
Scale deviation: (max_scale - min_scale) / max_scale x 100% = 2.03%
Add deviation: (max_add - min_add) / max_add x 100% = 1.65%
Triad deviation: (max_triad - min_triad) / max_triad x 100% = 1.32%
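The deviation metric is simply the spread between the best and worst run relative to the best run. A minimal sketch that recomputes it from the physical machine's Copy and Triad results; the function name is hypothetical, the numbers are the measured values listed above.

```python
def spread_percent(values):
    """(max - min) / max x 100, the deviation metric used in this test."""
    highest, lowest = max(values), min(values)
    return (highest - lowest) / highest * 100


# Physical machine, successful runs only
copy_physical = [301173, 302532.9, 298493.8, 300164, 299337, 299537.7, 300983.8,
                 288498.8, 290693.6, 289813.6, 290101.5, 288578.3, 288767.5]
triad_physical = [305986.1, 309146.6, 307486.2, 308879.8, 309005.9, 308036.6,
                  310090.1, 309133, 309914.4, 309669.5, 309799.8, 309975.3, 309967.1]

print(f"Copy:  {spread_percent(copy_physical):.2f}%")
print(f"Triad: {spread_percent(triad_physical):.2f}%")
```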
Virtual machine (as = STREAM_ARRAY_SIZE, rates in MB/s as reported by STREAM):
| as | Copy | Scale | Add | Triad |
| --- | --- | --- | --- | --- |
| 140,000,000 (run 1) | 65921.3 | 47594.7 | 54715.1 | 54798.9 |
| 140,000,000 (run 2) | 64040 | 48140 | 55048.5 | 54616.5 |
| 140,000,000 (run 3) | 63892.3 | 48105.7 | 55178.7 | 54893.1 |
| 200,000,000 (run 1) | 64075.5 | 46939.5 | 54151.6 | 54284.5 |
| 200,000,000 (run 2) | 64183.7 | 47096.3 | 54242.7 | 54434.7 |
| 300,000,000 (run 1) | 64189.1 | 45766.1 | 53583 | 53258.7 |
| 300,000,000 (run 2) | 64107.8 | 46016.7 | 53667.2 | 53239 |
| 300,000,000 | segmentation fault | | | |
Copy deviation: (max_copy - min_copy) / max_copy x 100% = 3.08%
Scale deviation: (max_scale - min_scale) / max_scale x 100% = 4.93%
Add deviation: (max_add - min_add) / max_add x 100% = 2.89%
Triad deviation: (max_triad - min_triad) / max_triad x 100% = 3.01%
Test results:
Deviation summary:
| Test | Physical machine | Virtual machine (vCPUs pinned 1:1) |
| --- | --- | --- |
| Copy | 4.64% | 3.08% |
| Scale | 2.03% | 4.93% |
| Add | 1.65% | 2.89% |
| Triad | 1.32% | 3.01% |
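Conclusion
With STREAM_ARRAY_SIZE above 4.1 times the L3 cache, the STREAM results differ only slightly: most deviations are around 3% or less, and the largest stays under 5%. So as long as the array is larger than 4 times the L3 cache, the STREAM results can be trusted. Increasing STREAM_ARRAY_SIZE further only makes the test run longer.
When the memory needed by the run exceeds 0.6 times the available memory, the effect on the results is also small. Beyond a certain size, however, the run fails outright with a segmentation fault and produces no result.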
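A quick check before a run is to compare the three arrays' footprint with the machine's memory. The sketch below is an estimate only: it assumes three arrays of 8-byte elements, takes 1 GB as 10^9 bytes (which matches the 0.6x column of the environment table), and ignores memory used by the operating system and other processes, so a run can still fail below a ratio of 1.0. The function name is hypothetical.

```python
def memory_fraction(array_size, memory_gb, bytes_per_element=8, arrays=3):
    """Fraction of `memory_gb` that the STREAM arrays would occupy (estimate)."""
    required_bytes = array_size * bytes_per_element * arrays
    return required_bytes / (memory_gb * 10**9)


# (array size, memory in GB) pairs taken from the tests above
cases = [
    (40_000_000_000, 1024),  # physical machine, largest size that ran
    (50_000_000_000, 1024),  # physical machine, segmentation fault
    (200_000_000, 8),        # virtual machine
    (300_000_000, 8),        # virtual machine, ran twice, failed once
]
for size, mem in cases:
    print(f"as={size}: {memory_fraction(size, mem):.2f}x of {mem} GB")
```

Under these assumptions the failing physical-machine size needs more memory than the machine has, while the virtual machine's largest size sits close to the limit, which fits the mixed outcome seen there.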