STREAM: exploring STREAM_ARRAY_SIZE

Test plan

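This experiment runs on a single server and asks three questions:

How does the memory test result look when the STREAM arrays are smaller than 4x the L3 cache?
What is the memory test result when the arrays are exactly 4x the L3 cache?
How much does the result change as STREAM_ARRAY_SIZE keeps growing far beyond 4x the L3 cache?

Comparing these three scenarios shows how STREAM_ARRAY_SIZE affects the memory test and which value to recommend, so that the memory test results are more objective and accurate.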

Formula

According to online write-ups and the comments in the STREAM source code:

STREAM requires different amounts of memory to run on different systems, depending on both the system cache size(s) and the granularity of the system timer.
You should adjust the value of 'STREAM_ARRAY_SIZE' (below) to meet *both* of the following criteria:
(a) Each array must be at least 4 times the size of the available cache memory. I don't worry about the difference between 10^6 and 2^20, so in practice the minimum array size is about 3.8 times the cache size.
Example 1: One Xeon E3 with 8 MB L3 cache
STREAM_ARRAY_SIZE should be >= 4 million, giving
an array size of 30.5 MB and a total memory requirement
of 91.5 MB.
Example 2: Two Xeon E5's with 20 MB L3 cache each (using OpenMP)
STREAM_ARRAY_SIZE should be >= 20 million, giving
an array size of 153 MB and a total memory requirement
of 458 MB.
(b) The size should be large enough so that the 'timing calibration'
output by the program is at least 20 clock-ticks.
Example: most versions of Windows have a 10 millisecond timer
granularity. 20 "ticks" at 10 ms/tic is 200 milliseconds.
If the chip is capable of 10 GB/s, it moves 2 GB in 200 msec.
This means the each array must be at least 1 GB, or 128M elements.
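
Criterion (b) can be worked through numerically. The following is a minimal sketch in Python and is not part of STREAM; the function name min_elements_for_timer and its parameters are made up for illustration. It assumes 8-byte elements and a Copy-like kernel that reads one array and writes another.

```python
def min_elements_for_timer(bandwidth_bytes_per_s, tick_seconds,
                           min_ticks=20, element_bytes=8):
    """Smallest array length (in elements) that keeps one kernel busy for min_ticks timer ticks."""
    min_runtime = min_ticks * tick_seconds               # e.g. 20 ticks x 10 ms = 0.2 s
    bytes_moved = bandwidth_bytes_per_s * min_runtime    # memory traffic generated in that time
    bytes_per_array = bytes_moved / 2                    # assumption: one array read, one array written
    return int(bytes_per_array // element_bytes)


# The Windows example from the comment: 10 GB/s and a 10 ms timer granularity.
print(min_elements_for_timer(10e9, 0.010))
```

In decimal units this comes to roughly 125 million elements. The 128M in the source comment corresponds to treating 1 GB as 2^30 bytes.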

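This gives the formula: STREAM_ARRAY_SIZE = L3 cache (MB) x 4 x 1,000,000 x sockets / 8, where the division by 8 converts bytes into 8-byte double elements. On the command line the value is passed as -DSTREAM_ARRAY_SIZE; the leading -D is the compiler's macro-definition flag, not part of the name.

Plugging in the data from the examples:

(a) With an 8 MB L3 cache and 1 socket, STREAM_ARRAY_SIZE = 8 x 4 x 1,000,000 x 1 / 8 = 4,000,000

(b) With a 20 MB L3 cache and 2 sockets, STREAM_ARRAY_SIZE = 20 x 4 x 1,000,000 x 2 / 8 = 20,000,000

Both results match the examples.

At this point it seems we could simply apply the formula and start running STREAM.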
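
The formula can also be checked with a few lines of Python. This is only a sketch: the helper names stream_array_size and array_mib are hypothetical, and the code assumes 8-byte double elements and three arrays, which is what the sizes quoted in the source comment imply.

```python
def stream_array_size(l3_cache_mb, sockets, factor=4, element_bytes=8):
    """Elements per array so that each array is `factor` times the total L3 cache."""
    cache_bytes = l3_cache_mb * 1_000_000 * sockets
    return factor * cache_bytes // element_bytes


def array_mib(elements, element_bytes=8):
    """Size of one array in MiB (2**20 bytes)."""
    return elements * element_bytes / 2**20


# The two examples from the STREAM source comment.
for l3_mb, sockets in [(8, 1), (20, 2)]:
    n = stream_array_size(l3_mb, sockets)
    print(f"{n:,} elements, {array_mib(n):.1f} MiB per array, "
          f"{3 * array_mib(n):.1f} MiB for all three arrays")
```

The printed sizes land close to the 30.5 MB / 91.5 MB and 153 MB / 458 MB figures in the comment; small differences in the last digit come from rounding.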

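Questions:

In the spirit of splitting hairs, two small questions remain:

Once STREAM_ARRAY_SIZE is above 4x the L3 cache, does increasing it further change the result much? Does the value have to be exactly 4x the L3 cache?
Each run consumes a fixed amount of memory. What happens when the arrays grow so large that memory runs out?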
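
The memory a run needs can be estimated up front. The sketch below is illustrative only: the function names are hypothetical, it assumes STREAM's three arrays of 8-byte doubles, and it assumes that the "G" used for memory sizes in this article means GiB. The 1024 G limit and the two array sizes are the ones used for the physical machine in this article.

```python
def stream_footprint_bytes(array_size, element_bytes=8, arrays=3):
    """Memory taken by STREAM's arrays (a, b, c) for a given STREAM_ARRAY_SIZE."""
    return array_size * element_bytes * arrays


def fits(array_size, available_gib):
    """True if the three arrays fit into available_gib GiB of memory."""
    return stream_footprint_bytes(array_size) <= available_gib * 2**30


for n in (40_000_000_000, 50_000_000_000):
    gib = stream_footprint_bytes(n) / 2**30
    print(f"{n:,} elements -> {gib:.0f} GiB, fits in 1024 GiB: {fits(n, 1024)}")
```

Under these assumptions the smaller size fits into 1024 GiB and the larger one does not, which is consistent with (though not proof of) where the physical-machine runs stop succeeding.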

Test environment:

Platform	L3 cache	Sockets	STREAM_ARRAY_SIZE at 4.1x L3 cache	Available physical memory	0.6x available memory
Physical machine, Ice Lake 8378C	114 MB	2	120,000,000	1024 G	614.4 G
Virtual machine, 4C8G, vCPUs pinned 1:1	64 MB	4	131,199,999	8 G	4.8 G

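Test command:

gcc -mtune=native -march=native -O3 -fno-pic -ffp-contract=fast -mcmodel=large -fopenmp -DSTREAM_ARRAY_SIZE=${array_size} -DNTIMES=80 stream.c -o stream.o

The -mcmodel=large option is needed so that the build does not fail with very large array sizes: STREAM declares its arrays statically, and once they exceed 2 GB they no longer fit the default (small) x86-64 code model.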
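
To sweep several array sizes, the command can be generated rather than typed by hand. The following Python sketch only builds and prints the command line; the helper name stream_build_cmd is hypothetical, and the flags are copied from the command above without any additions.

```python
import shlex


def stream_build_cmd(array_size, ntimes=80, source="stream.c", output="stream.o"):
    """Return the gcc argument list from this article for one STREAM_ARRAY_SIZE."""
    return [
        "gcc", "-mtune=native", "-march=native", "-O3", "-fno-pic",
        "-ffp-contract=fast", "-mcmodel=large", "-fopenmp",
        f"-DSTREAM_ARRAY_SIZE={array_size}", f"-DNTIMES={ntimes}",
        source, "-o", output,
    ]


for size in (200_000_000, 300_000_000, 400_000_000):
    # Print the command; pass the list to subprocess.run(..., check=True) to build for real.
    print(shlex.join(stream_build_cmd(size)))
```

Keeping the arguments as a list avoids shell quoting issues if the command is later handed to subprocess.run.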

Test data:

Physical machine (as = STREAM_ARRAY_SIZE; rates in MB/s, the unit STREAM reports; as_1 and as_2 are repeated runs at the same size):

Metric	as=200,000,000	as=300,000,000	as=400,000,000	as=500,000,000	as=600,000,000	as=700,000,000	as=1,000,000,000	as=2,000,000,000	as=4,000,000,000	as=10,000,000,000	as=20,000,000,000	as_1=40,000,000,000	as_2=40,000,000,000	as=50,000,000,000
Copy	301173	302532.9	298493.8	300164	299337	299537.7	300983.8	288498.8	290693.6	289813.6	290101.5	288578.3	288767.5	segmentation fault
Scale	296823.7	298080.6	298202	299726.9	301497.7	300799.8	300366.9	302240.4	302537.5	302987.8	302704.6	302374	302335.7
Add	304744.8	308521.3	306699.3	307306.9	307185.9	307579.5	308760.4	309881	309878.4	309665	309491.3	309764.8	309816.2
Triad	305986.1	309146.6	307486.2	308879.8	309005.9	308036.6	310090.1	309133	309914.4	309669.5	309799.8	309975.3	309967.1

Deviation: (max_copy - min_copy) / max_copy x 100% = 4.64%

Deviation: (max_scale - min_scale) / max_scale x 100% = 2.03%

Deviation: (max_add - min_add) / max_add x 100% = 1.65%

Deviation: (max_triad - min_triad) / max_triad x 100% = 1.32%

Virtual machine (as = STREAM_ARRAY_SIZE; rates in MB/s, the unit STREAM reports; as_1, as_2 and as_3 are repeated runs at the same size):

Metric	as_1=140,000,000	as_2=140,000,000	as_3=140,000,000	as_1=200,000,000	as_2=200,000,000	as_1=300,000,000	as_2=300,000,000	as=300,000,000
Copy	65921.3	64040	63892.3	64075.5	64183.7	64189.1	64107.8	segmentation fault
Scale	47594.7	48140	48105.7	46939.5	47096.3	45766.1	46016.7
Add	54715.1	55048.5	55178.7	54151.6	54242.7	53583	53667.2
Triad	54798.9	54616.5	54893.1	54284.5	54434.7	53258.7	53239

Deviation: (max_copy - min_copy) / max_copy x 100% = 3.08%

Deviation: (max_scale - min_scale) / max_scale x 100% = 4.93%

Deviation: (max_add - min_add) / max_add x 100% = 2.89%

Deviation: (max_triad - min_triad) / max_triad x 100% = 3.01%

Test results:

Data deviation:

Metric	Physical machine	Virtual machine, vCPUs pinned 1:1
Copy	4.64%	3.08%
Scale	2.03%	4.93%
Add	1.65%	2.89%
Triad	1.32%	3.01%
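
The deviation figures can be recomputed directly from the raw rates. The short Python sketch below does this for the virtual-machine rows, using the (max - min) / max definition from this article; the names vm_rates and deviation_percent are illustrative only, and the numbers are copied from the virtual-machine table above.

```python
# Rates for the successful virtual-machine runs, one list per kernel.
vm_rates = {
    "Copy":  [65921.3, 64040, 63892.3, 64075.5, 64183.7, 64189.1, 64107.8],
    "Scale": [47594.7, 48140, 48105.7, 46939.5, 47096.3, 45766.1, 46016.7],
    "Add":   [54715.1, 55048.5, 55178.7, 54151.6, 54242.7, 53583, 53667.2],
    "Triad": [54798.9, 54616.5, 54893.1, 54284.5, 54434.7, 53258.7, 53239],
}


def deviation_percent(rates):
    """Spread of the measured rates relative to the best one, in percent."""
    return (max(rates) - min(rates)) / max(rates) * 100


for kernel, rates in vm_rates.items():
    print(f"{kernel}: {deviation_percent(rates):.2f}%")
```

The same function can be applied to the physical-machine rows by swapping in those rates.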

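Conclusion

With STREAM_ARRAY_SIZE above 4.1x the L3 cache, the STREAM results differ very little: most fluctuate within about 3%, and the largest deviation stays under 5%. So as long as the arrays are larger than 4x the L3 cache, the STREAM results can be trusted; increasing STREAM_ARRAY_SIZE further only lengthens the run time.

When the memory STREAM needs exceeds 0.6x the available memory, the results are still barely affected. Once the arrays grow past a certain point, however, the run aborts with a segmentation fault and the test fails.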

