Original post is here: eklausmeier.goip.de
On 20-Dec-2011 Ico Doornekamp asked why a C version of a Lua program ran more slowly than the Lua program itself. I could not reproduce this discrepancy, neither on an AMD FX-8120 nor on an Intel i5-4250U processor. In general, a C program is expected to be faster than the equivalent Lua program.
Here is the Lua program, called lua_perf.lua:
local N = 4000
local S = 1000

local t = {}

for i = 0, N do
    t[i] = {
        a = 0,
        b = 1,
        f = i * 0.25
    }
end

for j = 0, S-1 do
    for i = 0, N-1 do
        t[i].a = t[i].a + t[i].b * t[i].f
        t[i].b = t[i].b - t[i].a * t[i].f
    end
    print(string.format("%.6f", t[1].a))
end
It computes values for a circle.
The mathematics is covered in The perfect (sine) wave, or Numerical Solutions of Differential Equations (dead link).
The same program in C, called lua_perf.c:
#include <stdio.h>

#define N 4000
#define S 1000

struct t {
    double a, b, f;
};


int main (int argc, char **argv) {
    int i, j;
    struct t t[N];

    for(i=0; i<N; i++) {
        t[i].a = 0;
        t[i].b = 1;
        t[i].f = i * 0.25;
    }

    for(j=0; j<S; j++) {
        for(i=0; i<N; i++) {
            t[i].a += t[i].b * t[i].f;
            t[i].b -= t[i].a * t[i].f;
        }
        printf("%.6f\n", t[1].a);
    }

    return 0;
}
The same program in Java, called lua_perf.java:
class lua_perf {
    public double a, b, f;

    static final int N=4000;
    static final int S=1000;

    public static void main (String[] argv) {
        int i, j;
        lua_perf[] t = new lua_perf[N];

        for(i=0; i<N; i++) {
            t[i] = new lua_perf();
            t[i].a = 0;
            t[i].b = 1;
            t[i].f = i * 0.25;
        }

        for(j=0; j<S; j++) {
            for(i=0; i<N; i++) {
                t[i].a += t[i].b * t[i].f;
                t[i].b -= t[i].a * t[i].f;
            }
            System.out.println(t[1].a);
        }
    }
}
Compile for your machine:
cc -Wall -march=native -O3 lua_perf.c -o lua_perf
javac lua_perf.java
Then run the programs multiple times and record the best value.
time lua lua_perf.lua > /dev/null

real 0m1.027s
user 0m1.023s
sys 0m0.000s

time luajit lua_perf.lua > /dev/null

real 0m0.042s
user 0m0.040s
sys 0m0.000s

time ./lua_perf > /dev/null

real 0m0.014s
user 0m0.013s
sys 0m0.000s

time java lua_perf > /dev/null

real 0m0.108s
user 0m0.160s
sys 0m0.013s
The result is pretty much as expected: the C program runs three times faster than the LuaJIT program (0.014s vs. 0.042s). The LuaJIT program runs almost 25 times faster than the ordinary Lua program (0.042s vs. 1.027s).
The Java program needs almost three times as long as LuaJIT. This was totally unexpected. Even when avoiding all the new statements in the for-loop, the run-time is way higher than with LuaJIT. What brings Java back into the range of LuaJIT is subtracting the Java startup-time. Java startup-time was measured with a program called lua_perf_empty.java:
class lua_perf_empty {
    public static void main (String[] argv) {
        System.out.println("Hello, world.");
    }
}
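This simple program needs 0m0.067s, i.e., startup-time dominates. Subtracting it from the 0m0.108s measured for lua_perf leaves about 0.041s for the computation itself, which is roughly the 0.042s of LuaJIT.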
time java lua_perf_empty > /dev/null

real 0m0.067s
user 0m0.067s
sys 0m0.007s
Startup-time for Lua and LuaJIT is 0m0.002s, i.e., negligible.
C is gcc 5.3.0, Lua is 5.3.2, LuaJIT is 2.0.4, Java is openjdk full version "1.8.0_74-b02".
I also checked all output files for C, Lua, and LuaJIT, i.e., not redirecting to /dev/null: all files were identical.
These findings are in line with the results given in the Julia Benchmarks, and with similar results from the LuaJIT website.
Comment from Gert Vierman, 23-Apr-2016: Hi, I am the original poster of the message on the Lua list here. The issue was real and reproducible.
From http://lua-users.org/lists/lua-l/2011-12/msg00615.html:
“it seems that the code caused a lot of calculations resulting in denormal numbers, which tend to be handled much slower on some hardware [1]. My solution (workaround?) was to enable SSE and add the -ffast-math flag to gcc to tell the compiler I don’t really care about very precise answers.
I’m not sure how denormals affect luajit, but it seems that in this case this is no problem for the luajit implementation.”
Comment from Sennie Son, 11-Jul-2019: You are not using the FFI in LuaJIT – Mike Pall has a nice article on his page explaining why using FFI primitives are much faster and memory effective (they are statically typed and fixed size after initialization and thus are way better at being optimized by the JIT) here: https://luajit.org/ext_ffi.html
Resulting code:
local ffi = require("ffi")
ffi.cdef[[
    typedef struct { double a, b, f; } table_elem;
]]

local N = 4000
local S = 1000
local t = ffi.new("table_elem[?]", N)

for i = 0, N-1 do
    t[i].a = 0.0
    t[i].b = 1.0
    t[i].f = i * 0.25
end

for j = 0, S-1 do
    for i = 0, N-1 do
        t[i].a = t[i].a + t[i].b * t[i].f
        t[i].b = t[i].b - t[i].a * t[i].f
    end
    print(string.format("%.6f", t[1].a))
end
Which for me creates a ~4.7x speedup overall.