Original post is here: eklausmeier.goip.de
I wanted to write a short benchmark for my son to demonstrate that the 8-core AMD Bulldozer CPU is better than a 6-core CPU from AMD when computing with integers. So I wrote a short C program that computes a recurrence relation using integers only; see the C code below. When I ran this program on one core, then two cores, then three cores, and so on, I was a little surprised to see that the CPU usage time per run grew. Indeed, it grew quite significantly: by 20 to 60%! Once all cores of the CPU are in use, the CPU usage time no longer increases; it seems to have reached its equilibrium.
The AMD CPU is an FX 8120, 3.1GHz. For comparison I used an Intel i7-2640, 2.8GHz, 4 core CPU.
I compiled
cc -Wall -O3 intpoly.c -o intpoly
and then ran the program as follows
for i in `seq 1 24`; do echo 2 -1 0 -2 | time -f "%e %U %S" ./intpoly -f -n0 & done
I varied the number of runs from 1 to 24, as indicated in the line above.
The average CPU usage times, in seconds per run, were as follows. AMD is the 8 core AMD Bulldozer, i7 is the Intel CPU, and AMD/fp is the AMD in floating-point mode:
runs   AMD    i7     AMD/fp
   1   2.44   2.14   5.86
   2   2.48   2.25   6.02
   3   2.84   2.85   6.84
   4   2.96   4.28   6.91
   5   3.00   4.34   7.00
   6   3.44   4.35   7.12
   7   3.82   4.35   7.22
   8   4.10   4.34   7.30
   9   4.11   4.35   7.28
  10   4.11   4.36   7.28
  16   4.11   4.38   7.30
  24   4.11   4.38   7.31
  64   4.12   4.39   7.30
The times given above are not elapsed times! The elapsed time can, of course, be estimated as
elapsed time = number of runs / number of cores * equilibrium time
For example, the AMD used 33 seconds of elapsed time for 64 tasks, while the i7 used 69 seconds, so more than two times as long.
So, I got my result, stating that the more cores you have in your CPU the smaller the elapsed time will be. What I did not expect was, that in the range from 1 to the number of cores in the CPU, usage and elapsed time increase. So some distribution of work must be going on within the CPU, or some kind of contention. If somebody knows why, please leave a note. This behaviour does not depend on whether the CPU is from AMD or Intel, nor on whether one uses integer or double precision arithmetic.
The C program, named intpoly, reads some initial values from stdin. With the -f switch it computes all values in double precision; without -f it uses integer arithmetic. Switch -n tells how many iterations, -m the number of rounds.
/* Evaluate difference scheme.
   It has generating polynomial

       x^4 - 2 x^2 + 1

   and has real roots:

       (x - 1)^2 (x + 1)^2

   Both are double roots on the unit circle, so values stay bounded
   only for suitable start values; otherwise they grow linearly.

   Floating point case: Polynomial is

       x^4 - 25/72 x^3 - 311/216 x^2 + 29/108 x + 14/27

   with roots

       (x - 1) (x - 8/9) (x + 2/3) (x + 7/8)


   Elmar Klausmeier, 29-Dec-2012
*/

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <limits.h>


int main (int argc, char *argv[]) {
    register int i;
    int x[5] = {0};        // not "register": indexing a register array is undefined
    int c, i0=0, m=1, xin[5], fp=0, prtflag = 0;
    double xd[5] = {0};

    while ((c = getopt(argc,argv,"fm:n:p")) != -1) {
        switch (c) {
        case 'f':    // floating-point instead of integer arithmetic
            fp = 1;
            break;
        case 'm':    // number of rounds
            m = atoi(optarg);
            break;
        case 'n':    // number of iterations per round
            i0 = atoi(optarg);
            break;
        case 'p':    // print every value (only if compiled with -DPRTFLAG)
            prtflag = 1;
            break;
        default:     // getopt returned '?'; the offending flag is in optopt
            printf("%s: unknown flag %c\n",argv[0],optopt);
            return 1;
        }
    }

    if (i0 == 0) i0 = INT_MAX;
    if (m == 0) m = 1;
    if (fp) goto fp_section;

    // Integer computation
    if ((c = scanf("%d %d %d %d", &xin[0],&xin[1],&xin[2],&xin[3])) != 4) {
        printf("%s: Need exactly 4 input arguments via stdin, read only %d\n",argv[0],c);
        return 3;
    }
    x[0] = xin[0];
    x[1] = xin[1];
    x[2] = xin[2];
    x[3] = xin[3];
    //x[4] = xin[4];

    while (m-- > 0) {
        i = i0;
        while (i-- > 0) {
            x[4] = 2*x[2] - x[0];
#ifdef PRTFLAG
            //x[4] = (75*x[3] + 311*x[2] - 58*x[1] - 112*x[0]) / 216;
            if (prtflag || x[4] > 9999999999 || x[4] < -9999999999)
                printf("%d\n",x[4]);
#endif
            x[0] = x[1];
            x[1] = x[2];
            x[2] = x[3];
            x[3] = x[4];
            //x[4] = x[5];
        }
    }
    printf("%d\n",x[4]);

    return 0;

fp_section:
    // Floating-point computation: read the same 4 start values as above
    if ((c = scanf("%lf %lf %lf %lf", &xd[0],&xd[1],&xd[2],&xd[3])) != 4) {
        printf("%s: Need exactly 4 input arguments via stdin, read only %d\n",argv[0],c);
        return 3;
    }
    while (m-- > 0) {
        i = i0;
        while (i-- > 0) {
            xd[4] = 25.0/72*xd[3] + 311.0/216*xd[2] - 29.0/108*xd[1] - 14.0/27*xd[0];
#ifdef PRTFLAG
            if (prtflag || xd[4] > 9999999999 || xd[4] < -9999999999)
                printf("%f\n",xd[4]);
#endif
            xd[0] = xd[1];
            xd[1] = xd[2];
            xd[2] = xd[3];
            xd[3] = xd[4];
        }
    }
    printf("%f\n",xd[4]);

    return 0;

}
A version of this program can also be found on GitHub in intpoly.c.
1. Comment. By brucedawson | April 21, 2013 at 18:19.
You say “the more cores you have in your CPU the smaller the elapsed time will be. What I did not expect was, that in the range from 1 to the number of cores in the CPU, usage and elapsed time increase.” — isn’t that contradicting yourself, since elapsed time can’t both get smaller and increase?
Also, as my counting-to-ten post that you linked to says, you can’t trust the CPU usage information from ‘time’. You should use a separate program to monitor total system CPU usage, in a more reliable way. If you redo your tests with that (and label your graph and table axes!) then maybe it will all make sense.
2. Comment. eklausmeier | April 21, 2013 at 21:34
Thank you for your post. It is indeed not easy to get the point. Assume the time command is correct in reporting elapsed time, which you also state in your article. Even if you do not trust the time command, when you do something like (date;command;date) it will give you similar results.
Now assume a fixed number of jobs for our artificial problem at hand, say 64. The statement is: The more cores you have, the quicker you can solve the problem. Every job can run independently. A machine with two cores will take longer (elapsed time over all processes) than a machine with eight cores, all else being equal. At first I wanted to prove just that, which I did.
The problem I now observed is that you cannot just divide the number of jobs by your number of cores. The relation is apparently not linear, see the graph. What is puzzling: even when you have enough free cores available on your machine, elapsed time per job increases in the range from one to the number of cores in your CPU, 8 in my case. In the Phoronix forum (Thread: Slowdown when computing in parallel on multicore CPU) I got a reply which explained this phenomenon with CPU migration. This seems to be a deficiency/bug in releases before kernel version 3.8.