Performance Comparison of mmap() versus read() versus fread()

Original post is here: eklausmeier.goip.de

I recently read in Computers are fast! by Julia Evans about a comparison between fread() and mmap() suggesting that both calls deliver roughly the same performance. Unfortunately, the programs mentioned there, bytesum.c for fread() and bytesum_mmap.c for mmap(), do not really compare the same thing: the first adds up size_t words, the second adds up uint8_t values. On my computer these programs behave differently and therefore give different performance.
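
For illustration only (this is not the code from the linked repositories), the following toy program shows why the two approaches are not comparable: summing a buffer as size_t words and summing it byte by byte as uint8_t neither does the same work nor yields the same number.

/* Toy example, not the original bytesum.c / bytesum_mmap.c: summing the
 * same 32 bytes as size_t words versus as uint8_t values runs an eighth
 * of the loop iterations (on a 64-bit machine) and gives a different sum. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    uint8_t buf[32];
    for (size_t i = 0; i < sizeof(buf); ++i) buf[i] = (uint8_t)(i + 1);

    size_t words[sizeof(buf) / sizeof(size_t)];
    memcpy(words, buf, sizeof(buf));          /* copy to avoid aliasing issues */
    size_t wordsum = 0;
    for (size_t i = 0; i < sizeof(words) / sizeof(words[0]); ++i)
        wordsum += words[i];                  /* size_t-style sum, as in bytesum.c */

    uint64_t bytesum = 0;
    for (size_t i = 0; i < sizeof(buf); ++i)
        bytesum += buf[i];                    /* uint8_t-style sum, as in bytesum_mmap.c */

    printf("word sum = %zu, byte sum = %llu\n",
           wordsum, (unsigned long long)bytesum);
    return 0;
}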

I reprogrammed the comparison, adding read() to fread() and mmap(). The code is on GitHub. Compile with:

cc -Wall -O3 tbytesum1.c -o tbytesum1
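
The full source is on GitHub; for orientation, here is a minimal sketch of what the three code paths might look like. This is a reconstruction, not the actual tbytesum1.c (buffer size and error handling are assumptions), but the function names match those appearing in the objdump output further below, and the sum runs over plain char, which is consistent with the negative answers and the sign-extending movsbl loads in the disassembly. The -f, -r, and -m switches select fread(), read(), and mmap(), respectively, as in the timings below.

/* Sketch of tbytesum1.c (reconstruction, not the original code).
 * On x86-64 Linux plain char is signed, so the sums can go negative. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

static int readtst(const char *fname) {      /* -r: read() into a buffer */
    static char buf[1 << 16];                /* buffer size is an assumption */
    int fd = open(fname, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }
    int sum = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        for (ssize_t i = 0; i < n; ++i) sum += buf[i];
    close(fd);
    return sum;
}

static int freadtst(const char *fname) {     /* -f: fread() into a buffer */
    static char buf[1 << 16];
    FILE *fp = fopen(fname, "rb");
    if (fp == NULL) { perror("fopen"); exit(1); }
    int sum = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof(buf), fp)) > 0)
        for (size_t i = 0; i < n; ++i) sum += buf[i];
    fclose(fp);
    return sum;
}

static int mmaptst(const char *fname) {      /* -m: mmap() the whole file */
    int fd = open(fname, O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); exit(1); }
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); exit(1); }
    int sum = 0;
    for (off_t i = 0; i < st.st_size; ++i) sum += p[i];
    munmap(p, st.st_size);
    close(fd);
    return sum;
}

int main(int argc, char *argv[]) {
    if (argc != 3) { fprintf(stderr, "usage: %s -f|-r|-m file\n", argv[0]); return 1; }
    int sum = 0;
    if (strcmp(argv[1], "-f") == 0)      sum = freadtst(argv[2]);
    else if (strcmp(argv[1], "-r") == 0) sum = readtst(argv[2]);
    else if (strcmp(argv[1], "-m") == 0) sum = mmaptst(argv[2]);
    printf("The answer is: %d\n", sum);
    return 0;
}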

For this program the results are as follows:

/home/klm/c: time ./tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.187s
user    0m0.077s
sys     0m0.110s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.193s
user    0m0.100s
sys     0m0.090s
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.186s
user    0m0.080s
sys     0m0.103s
/home/klm/c: time tbytesum1 -r ~ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.196s
user    0m0.110s
sys     0m0.083s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.152s
user    0m0.110s
sys     0m0.040s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.159s
user    0m0.113s
sys     0m0.043s

The file in question is the Ubuntu ISO image for the server edition, roughly 564 MB, stored on a classical hard drive (a Seagate 2 TB drive). fread() and read() make no difference. The mmap()'ed version, however, needs roughly half the system time (83 ms down to 40 ms), which reduces the total running time by roughly 20% (186 ms down to 152 ms).

A similar test with a short video read from an SSD:

/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217

real    0m0.097s
user    0m0.033s
sys     0m0.063s
/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217

real    0m0.097s
user    0m0.050s
sys     0m0.043s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217

real    0m0.093s
user    0m0.050s
sys     0m0.040s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217

real    0m0.098s
user    0m0.043s
sys     0m0.053s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217

real    0m0.079s
user    0m0.050s
sys     0m0.027s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217

real    0m0.084s
user    0m0.050s
sys     0m0.030s

The AVI file is roughly 259 MB. Again, fread() and read() don't differ, but mmap() needs roughly 30% less system time.

These tests were conducted on Arch Linux, kernel 4.3.3-3-ARCH x86_64, on an AMD FX-8120 eight-core processor running at up to 3.1 GHz. The compiler was gcc 5.3.0.

Linus Torvalds gave the following remarks on read() versus mmap() in mmap/mlock performance versus read:

People love `mmap()` and other ways to play with the page tables to optimize away a copy operation, and sometimes it is worth it.

HOWEVER, playing games with the virtual memory mapping is very expensive in itself. It has a number of quite real disadvantages that people tend to ignore because memory copying is seen as something very slow, and sometimes optimizing that copy away is seen as an obvious improvement.

Downsides to mmap():

  1. quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
  2. page faulting is expensive. That's how the mapping gets populated, and it's quite slow.

Upsides of mmap():

  1. if the data gets re-used over and over again (within a single map operation), or if you can avoid a lot of other logic by just mapping something in, mmap() is just the greatest thing since sliced bread.
    This may be a file that you go over many times (the binary image of an executable is the obvious case here - the code jumps all around the place), or a setup where it's just so convenient to map the whole thing in without regard of the actual usage patterns that mmap() just wins. You may have random access patterns, and use mmap() as a way of keeping track of what data you actually needed.
  2. if the data is large, mmap() is a great way to let the system know what it can do with the data-set. The kernel can forget pages as memory pressure forces the system to page stuff out, and then just automatically re-fetch them again.

And the automatic sharing is obviously a case of this..

It is interesting to note that the performance is kernel-dependent. The same tests conducted on Ubuntu 14.04.3 LTS, kernel version 3.13.0-74-generic #118-Ubuntu SMP, x86_64, on the exact same hardware, give a more blurred picture:

/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.191s
user    0m0.072s
sys     0m0.119s
/home/klm/c: time tbytesum1 -r ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.177s
user    0m0.073s
sys     0m0.104s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.187s
user    0m0.092s
sys     0m0.095s
/home/klm/c: time tbytesum1 -f ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.181s
user    0m0.077s
sys     0m0.104s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.174s
user    0m0.104s
sys     0m0.072s
/home/klm/c: time tbytesum1 -m ubuntu-14.04-server-amd64.iso
The answer is: -98049011

real    0m0.175s
user    0m0.102s
sys     0m0.071s

Again, read() and fread() make no real difference. The difference in system time between read() and mmap() is about 25% (95 ms down to 71 ms). The compiler was gcc 4.8.4.

Testing a file on SSD:

root@chieftec:~# time ~klm/c/tbytesum1 -r CLIP0627.AVI
The answer is: -122687217

real    0m0.098s
user    0m0.039s
sys     0m0.059s
/home/klm/c: time tbytesum1 -r CLIP0627.AVI
The answer is: -122687217

real    0m0.093s
user    0m0.047s
sys     0m0.047s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217

real    0m0.092s
user    0m0.036s
sys     0m0.056s
/home/klm/c: time tbytesum1 -f CLIP0627.AVI
The answer is: -122687217

real    0m0.087s
user    0m0.040s
sys     0m0.047s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217

real    0m0.091s
user    0m0.047s
sys     0m0.043s
/home/klm/c: time tbytesum1 -m CLIP0627.AVI
The answer is: -122687217

real    0m0.086s
user    0m0.047s
sys     0m0.039s

With the file on the SSD the difference fades even more than with the file stored on the hard disk: running times for read() and mmap() are almost identical, contrary to the result on kernel 4.3.3.

A kernel dependency was also hinted at in CPU Usage Time Is Dependant on Load.

Both binaries, compiled with gcc 5.3.0 on Arch and gcc 4.8.4 on Ubuntu, use loop unrolling for all three functions mmaptst(), freadtst(), and readtst(), as can be seen with:

objdump -d tbytesum1

Here is an excerpt of the assembler code for mmaptst():

00000000004008e0 <mmaptst>:
  4008e0:	41 55                	push   %r13
  4008e2:	41 54                	push   %r12
  4008e4:	31 c0                	xor    %eax,%eax
  4008e6:	55                   	push   %rbp
  4008e7:	53                   	push   %rbx
  4008e8:	48 89 f5             	mov    %rsi,%rbp
. . .
  400a5a:	66 0f fe c1          	paddd  %xmm1,%xmm0
  400a5e:	66 0f 7e c0          	movd   %xmm0,%eax
  400a62:	01 c3                	add    %eax,%ebx
  400a64:	89 f8                	mov    %edi,%eax
  400a66:	48 01 c2             	add    %rax,%rdx
  400a69:	41 39 fa             	cmp    %edi,%r10d
  400a6c:	0f 84 a7 00 00 00    	je     400b19 <mmaptst+0x239>
  400a72:	0f be 02             	movsbl (%rdx),%eax
  400a75:	01 c3                	add    %eax,%ebx
  400a77:	83 fe 01             	cmp    $0x1,%esi
  400a7a:	0f 84 99 00 00 00    	je     400b19 <mmaptst+0x239>
  400a80:	0f be 42 01          	movsbl 0x1(%rdx),%eax
  400a84:	01 c3                	add    %eax,%ebx
  400a86:	83 fe 02             	cmp    $0x2,%esi
  400a89:	0f 84 8a 00 00 00    	je     400b19 <mmaptst+0x239>
  400a8f:	0f be 42 02          	movsbl 0x2(%rdx),%eax
. . .
  400b06:	74 11                	je     400b19 <mmaptst+0x239>
  400b08:	0f be 42 0d          	movsbl 0xd(%rdx),%eax
  400b0c:	01 c3                	add    %eax,%ebx
  400b0e:	83 fe 0e             	cmp    $0xe,%esi
  400b11:	74 06                	je     400b19 <mmaptst+0x239>
  400b13:	0f be 42 0e          	movsbl 0xe(%rdx),%eax
  400b17:	01 c3                	add    %eax,%ebx
  400b19:	44 89 ef             	mov    %r13d,%edi
. . .

Compiling with gcc 5.3.0 and the option -march=native, i.e.,

cc -Wall -O3 -march=native tbytesum1.c -o tbytesum1N

and reading the Ubuntu ISO file from the hard disk reduces the real time by roughly 10% (152 ms down to 138 ms) and the user time by roughly 15% (110 ms down to 93 ms). The generated code uses AMD's vpadd, vpsrldq, and vpmovsxwd instructions.

Added 05-Nov-2017: In the blog article "ripgrep is faster than {grep, ag, git grep, ucg, pt, sift}", Andrew Gallant compares the performance of ag (The Silver Searcher) and rg (ripgrep); about (1) memory-mapping each file versus (2) reading it into an intermediate buffer, he says:

Naively, it seems like (1) would be obviously faster. Surely, all of the bookkeeping and copying in (2) would make it much slower! In fact, this is not at all true. (1) may not require much bookkeeping from the perspective of the programmer, but there is a lot of bookkeeping going on inside the Linux kernel to maintain the memory map. (That link goes to a mailing list post that is quite old, but it still appears relevant today.)

When I first started writing ripgrep, I used the memory map approach. It took me a long time to be convinced enough to start down the second path with an intermediate buffer (because neither a CPU profile nor the output of strace ever showed any convincing evidence that memory maps were to blame), but as soon as I had a prototype of (2) working, it was clear that it was much faster than the memory map approach.

With all that said, memory maps aren’t all bad. They just happen to be bad for the particular use case of “rapidly open, scan and close memory maps for thousands of small files.” For a different use case, like, say, “open this large file and search it once,” memory maps turn out to be a boon. We’ll see that in action in our single-file benchmarks later.

Added 30-Oct-2023: This post is referenced in Prequel to liburing_b3sum. That article goes into much detail about O_DIRECT and io_uring and shows many performance benchmarks. Highly recommended reading.

Added 23-Sep-2024: This post is cited in An Evaluation of Bandwidth of Different Storage Types (HDD vs. SSD vs. LustreFS) for Different Block Sizes and Different Parallel Read Methods (mmap vs pread vs read), published at Queen's University Belfast.