Original post is here: eklausmeier.goip.de
This blog uses Simplified Saaze as its static site generator. Generating all 561 HTML pages takes 0.25 seconds. The environment is listed in the table below.
Type | Value |
---|---|
CPU | AMD Ryzen 7 5700G |
RAM | 64 GB |
OS | Arch Linux 6.7.6-arch1-1 #1 SMP PREEMPT_DYNAMIC |
PHP | PHP 8.3.3 (cli) |
PHP with JIT | PHP 8.3.3 (cli), Zend Engine v4.3.3 with Zend OPcache v8.3.3 |
Simplified Saaze | 2.0 |
1. Runtimes in serial mode. In the following we use PHP with no JIT. So far runtimes for this very blog are as below:
$ time php saaze -mortb /tmp/build
Building static site in /tmp/build...
    execute(): filePath=./content/aux.yml, nSIentries=7, totalPages=1, entries_per_page=20
    execute(): filePath=./content/blog.yml, nSIentries=452, totalPages=23, entries_per_page=20
    execute(): filePath=./content/gallery.yml, nSIentries=7, totalPages=1, entries_per_page=20
    execute(): filePath=./content/music.yml, nSIentries=69, totalPages=4, entries_per_page=20
    execute(): filePath=./content/error.yml, nSIentries=0, totalPages=0, entries_per_page=20
Finished creating 5 collections, 4 with index, and 561 entries (0.25 secs / 24.46MB)
#collections=5, parseEntry=0.0103/563-5, md2html=0.0201, MathParser=0.0141/561, renderEntry=0.1573/561, renderCollection=0.0058/33, content=561/0, excerpt=0/0
real 0.28s
user 0.16s
sys 0
swapped 0
total space 0
It can be seen that the renderEntry() function uses 0.1573 seconds of the overall 0.25 seconds, i.e., more than 60%. These 561 calls will now be parallelized. The rest stays serial.
For the Lemire blog we have:
$ time php saaze -rb /tmp/buildLemire
Building static site in /tmp/buildLemire...
    execute(): filePath=/home/klm/php/saaze-lemire/content/blog.yml, nSIentries=2771, totalPages=139, entries_per_page=20
Finished creating 1 collections, 1 with index, and 4483 entries (1.01 secs / 97.18MB)
#collections=1, parseEntry=0.0702/4483-1, md2html=0.1003, MathParser=0.0594/4483, renderEntry=0.4121/4483, renderCollection=0.0225/140, content=4483/0, excerpt=0/0
real 1.03s
user 0.64s
sys 0
swapped 0
total space 0
In this case the output template processing takes 0.4121 seconds of the overall 1.01 seconds, that's 40%. This shows that the Lemire templates are simpler. No wonder: they do not use categories and tags, or the many other gimmicks that I use in this blog. But still, 40% of the runtime is spent on output rendering.
In Performance Comparison Saaze vs. Hugo vs. Zola I wrote:
It would be quite easy to use threads in Saaze, i.e., so-called entries and the chunks of collections could easily be processed in parallel.
It is even easier to parallelize the generation of the output files when the PHP templating is in place. We will see that parallelizing can be done in less than 20 lines of PHP code.
2. Runtimes in serial mode with JIT enabled. Below are the runtimes with JIT and OPCache enabled for PHP.
time php saaze -mortb /tmp/build
Building static site in /tmp/build...
    execute(): filePath=./content/aux.yml, nSIentries=7, totalPages=1, entries_per_page=20
    execute(): filePath=./content/blog.yml, nSIentries=453, totalPages=23, entries_per_page=20
    execute(): filePath=./content/gallery.yml, nSIentries=7, totalPages=1, entries_per_page=20
    execute(): filePath=./content/music.yml, nSIentries=69, totalPages=4, entries_per_page=20
    execute(): filePath=./content/error.yml, nSIentries=0, totalPages=0, entries_per_page=20
Finished creating 5 collections, 4 with index, and 562 entries (0.16 secs / 20.36MB)
#collections=5, parseEntry=0.0104/564-5, md2html=0.0219, MathParser=0.0203/562, renderEntry=0.0521/562, renderCollection=0.0022/33, content=562/0, excerpt=0/0
real 0.19s
user 0.11s
sys 0
swapped 0
total space 0
The previously massive renderEntry() part of the runtime shrank from 0.1573 seconds to 0.0521 seconds. I think this is mainly due to the OPCache, which now avoids recompiling and reparsing the PHP output template.
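Whether OPcache and the JIT are really active for the command-line binary can be checked from within PHP. The following is a minimal sketch and not part of Simplified Saaze: jitSummary() is a hypothetical helper, and it assumes the PHP 8 layout of the status array (a jit entry with an on flag and a buffer_size). For the CLI, OPcache is only used when opcache.enable_cli is set, and the JIT needs a non-zero opcache.jit_buffer_size.

```php
<?php
// Hypothetical helper: report whether OPcache and the JIT are active
// in the PHP process that runs this script.
function jitSummary(): string {
    if (!function_exists('opcache_get_status')) {
        return 'OPcache extension not loaded';
    }
    $status = @opcache_get_status(false);   // false: no per-script details
    if ($status === false) {
        return 'OPcache loaded but not enabled (for the CLI see opcache.enable_cli)';
    }
    $jit = $status['jit'] ?? [];            // assumed PHP 8 layout of the status array
    return sprintf(
        'OPcache enabled, JIT %s, JIT buffer %d bytes',
        !empty($jit['on']) ? 'on' : 'off',
        $jit['buffer_size'] ?? 0
    );
}

echo jitSummary(), "\n";
```

Running this script with the same php binary and ini settings as the build shows which of the two configurations is being measured.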
For the Lemire blog with JIT enabled we have:
time php saaze -rb /tmp/buildLemire
Building static site in /tmp/buildLemire...
    execute(): filePath=/home/klm/php/saaze-lemire/content/blog.yml, nSIentries=2771, totalPages=139, entries_per_page=20
Finished creating 1 collections, 1 with index, and 4483 entries (0.62 secs / 96.24MB)
#collections=1, parseEntry=0.0655/4483-1, md2html=0.0974, MathParser=0.0586/4483, renderEntry=0.0707/4483, renderCollection=0.0110/140, content=4483/0, excerpt=0/0
real 0.65s
user 0.40s
sys 0
swapped 0
total space 0
Similar picture to the above: the renderEntry() part dropped from 0.4121 seconds to 0.0707 seconds. That's massive.
3. Unix forks in PHP. As a preliminary introduction to pcntl_fork() in PHP, consider the simple PHP code below.
<?php
for ($i=1; $i<=4; ++$i) {
    if (($pid = pcntl_fork())) {    // non-zero: we are the forking process
        printf("i=%d, pid=%d\n",$i,$pid);
        sleep(1);
        exit(0);
    }
    // zero: we are the new child and continue the loop
}
Running this script:
$ php forktst.php
i=1, pid=15082
i=2, pid=15083
i=3, pid=15084
i=4, pid=15085
The fork and join method of parallelization is easy to use, but it has the disadvantage that communicating results from the children to the parent is "difficult". Communicating data from the parent to its children is "easy": everything is copied over.
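The asymmetry can be seen in a few lines. This is a minimal sketch that is not taken from Simplified Saaze, and the variable names are made up: the child starts with a copy of the parent's variable, its change stays private, and the only thing that travels back here is the exit status.

```php
<?php
// Sketch: data flows from parent to child by copy, but not back.
$counter = 41;                      // set before the fork, so both processes have it

$pid = pcntl_fork();
if ($pid === -1) {
    fwrite(STDERR, "fork failed\n");
    exit(1);
}
if ($pid === 0) {                   // return value 0: we are the child
    $counter++;                     // changes the child's private copy only
    exit($counter === 42 ? 0 : 1);  // the exit status is a tiny back channel
}

// non-zero return value: we are the parent and wait for ("join") the child
pcntl_waitpid($pid, $status);
$childOk = pcntl_wifexited($status) && pcntl_wexitstatus($status) === 0;
printf("parent counter=%d, child saw 42: %s\n", $counter, $childOk ? 'yes' : 'no');
```

The parent still prints counter=41. Anything larger than an exit status needs files, pipes, sockets, or shared memory. For a static site generator that is no problem, because each child simply writes its HTML files to disk.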
4. Implementation in BuildCommand.php. The command-line version of Simplified Saaze calls buildAllStatic(). This routine iterates through all collections, and for each collection it iterates through all entries.
- Function getEntries() reads Markdown files into memory and converts them to HTML by using MD4C, all in memory
- Function buildEntry() uses the entry in question and writes the HTML to disk by processing it through our PHP templates.
PHP function buildEntry() is essentially:
private function buildEntry(Collection $collection, Entry $entry, string $dest) : void {
    ...
    file_put_contents($entryDir, $this->templateManager->renderEntry($entry));
}
buildEntry() is now encapsulated within beginParallel() and endParallel(). That's it.
foreach ($collections as $collection) {
    $entries = $collection->getEntries();    # finally calls getContentAndExcerpt() and sorts
    $nentries = count($entries);
    $nSIentries = count($collection->entriesSansIndex);
    $entries_per_page = $collection->data['entries_per_page'] ?? \Saaze\Config::$H['global_config_entries_per_page'];
    $totalPages = ceil($nSIentries / $entries_per_page);
    printf("\texecute(): filePath=%s, nSIentries=%d, totalPages=%d, entries_per_page=%d\n",$collection->filePath,$nSIentries,$totalPages,$entries_per_page);

    $this->beginParallel($nentries,$aprocs);
    $i = 0;
    foreach ($entries as $entry) {
        if ($this->nprocs > 0 && ($i++ % $this->nprocs) != $this->procnr) continue;    // distribute work among nprocs processes
        if ($entry->data['entry'] ?? true) {
            $this->buildEntry($collection, $entry, $dest);
            $entryCount++;
        }
    }
    $this->endParallel();

    if ($tags) {    // populate cat_and_tag[][] array
        foreach ($entries as $entry) {
            if ($entry->data['entry'] ?? true)
                $this->build_cat_and_tag($entry,$collection->draftOverride);
        }
    }

    ++$totalCollection;
    if ($this->buildCollectionIndex($collection, 0, $dest)) $collectionCount++;

    for ($page=1; $page <= $totalPages; $page++)
        $this->buildCollectionIndex($collection, $page, $dest);
}
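The modulo test in the inner loop deals the entries out round-robin. To illustrate which entries a given process picks, here is a small sketch without any forking. entriesForWorker() is a hypothetical helper that does not exist in Simplified Saaze; it only replays the same test on plain indices.

```php
<?php
// Hypothetical helper: list the entry indices that the process
// with number $procnr would build out of $nentries entries.
function entriesForWorker(int $nentries, int $nprocs, int $procnr): array {
    $mine = [];
    for ($i = 0; $i < $nentries; ++$i) {
        if ($nprocs > 0 && ($i % $nprocs) != $procnr) continue;   // same test as in the build loop
        $mine[] = $i;
    }
    return $mine;
}

// Ten entries dealt out to four processes
foreach ([0, 1, 2, 3] as $procnr) {
    printf("process %d: %s\n", $procnr, implode(',', entriesForWorker(10, 4, $procnr)));
}
```

Process 0 gets entries 0, 4, 8, process 1 gets 1, 5, 9, and so on. Every entry is built by exactly one process, and no coordination between the processes is needed because the assignment depends only on the index.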
The two PHP functions for fork and join are thus:
protected function beginParallel(int $nentries, int $aprocs) : void {
    $this->pid = 0;
    $this->procnr = 0;
    $this->nprocs = 1;
    if ($nentries < 128) return;    // too few entries to warrant forking
    $this->nprocs = $aprocs;    // aprocs = allowed procs, specified on command-line
    for ($this->procnr=0; $this->procnr<$this->nprocs; ++$this->procnr)
        // pcntl_fork() returns non-zero in the forking process: it returns to work as number procnr.
        // The newly created process gets 0 and continues the loop.
        if (($this->pid = pcntl_fork())) return;
}

protected function endParallel() : void {
    if ($this->pid) exit(0);    // processes that did the work exit; the one with pid=0 carries on with the serial part
}
This fork and join via pcntl_fork() does not work on Microsoft Windows.
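A common alternative layout for fork and join is that one parent forks all workers and then waits for them with pcntl_waitpid(). The sketch below shows that pattern on a toy workload, with a serial fallback when pcntl_fork() is not available. It is a hypothetical variant and not the code in BuildCommand.php; buildInParallel(), the file names, and the temporary directory are all made up for illustration.

```php
<?php
// Sketch: fork $nprocs workers, let each write its share of files, then join.
function buildInParallel(array $items, int $nprocs, string $dest): int {
    if (!function_exists('pcntl_fork')) $nprocs = 1;    // e.g. on Windows: stay serial
    if ($nprocs <= 1) {
        foreach ($items as $i => $item) file_put_contents("$dest/$i.html", $item);
        return count(glob("$dest/*.html"));
    }
    $children = [];
    for ($procnr = 0; $procnr < $nprocs; ++$procnr) {
        $pid = pcntl_fork();
        if ($pid === -1) throw new RuntimeException('fork failed');
        if ($pid === 0) {                                // child: build own share, then exit
            foreach ($items as $i => $item) {
                if ($i % $nprocs === $procnr) file_put_contents("$dest/$i.html", $item);
            }
            exit(0);
        }
        $children[] = $pid;                              // parent: remember the child
    }
    foreach ($children as $pid) pcntl_waitpid($pid, $status);   // join all children
    return count(glob("$dest/*.html"));
}

$dest = sys_get_temp_dir() . '/forkjoin-' . getmypid();
mkdir($dest);
$items = array_map(fn($n) => "<p>entry $n</p>", range(0, 9));
$written = buildInParallel($items, 4, $dest);
printf("%d files written\n", $written);

array_map('unlink', glob("$dest/*.html"));               // clean up the toy output
rmdir($dest);
```

The results come back through the file system only, which is exactly what a static site generator needs.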
5. Benchmarking. How much of an improvement do we get from this? For this very blog with 561 entries, the runtimes can be more than halved. This is in line with the 60% runtime used by the output template processing. It should be noted that this blog consists of five collections:
- aux: 7 entries
- blog: 452 entries, only these are parallelized!
- gallery: 7 entries
- music: 69 entries
- error: 1 entry
The parallelization kicks in only for collections with at least 128 entries, i.e., only the blog part is parallelized; the music part and the other parts are not.
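What can be expected at best follows from Amdahl's law: if a fraction f of the runtime is spread over n processes and the rest stays serial, the speedup is at most 1 / ((1 - f) + f / n). The sketch below plugs in the serial measurements from above. It is a rough estimate under simplifying assumptions: all of the renderEntry() time is treated as parallelizable, although only the blog collection is actually forked, and fork overhead is ignored. amdahlSpeedup() is a name chosen here for illustration.

```php
<?php
// Amdahl's law: upper bound for the speedup when the fraction $parallel of the
// runtime is spread over $nprocs processes and the rest stays serial.
function amdahlSpeedup(float $parallel, int $nprocs): float {
    return 1.0 / ((1.0 - $parallel) + $parallel / $nprocs);
}

$total    = 0.25;               // seconds, internal timing of the serial run
$render   = 0.1573;             // seconds spent in renderEntry()
$fraction = $render / $total;   // roughly 0.63

foreach ([2, 4, 8, 16] as $n) {
    $speedup = amdahlSpeedup($fraction, $n);
    printf("p=%2d: speedup at most %.2f, runtime at least %.2f s\n", $n, $speedup, $total / $speedup);
}
```

Even with arbitrarily many processes the speedup cannot exceed 1 / (1 - 0.63), which is about 2.7, so more than halving the runtime is already close to the limit for this blog.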
Another benchmark is the Lemire blog converted to Simplified Saaze, see Example Theme for Simplified Saaze: Lemire.
Command-lines are:
time php saaze -p16 -mortb /tmp/build
time php saaze -p16 -rb /tmp/buildLemire
We then vary the parameter -p. All output goes to /tmp, which is a RAM disk in Arch Linux. Obviously, I do not want to measure disk read or write speed. I want to measure the processing speed of Simplified Saaze. Timings are from time, taking the real time in seconds.
Blog entries | p=1 | p=2 | p=4 | p=8 | p=16 |
---|---|---|---|---|---|
561 posts / this blog | 0.28 | 0.18 | 0.16 | 0.13 | 0.12 |
561 posts with JIT | 0.19 | 0.17 | 0.14 | 0.13 | 0.12 |
4,483 posts in Lemire | 1.03 | 1.02 | 0.65 | 0.54 | 0.52 |
4,483 posts with JIT | 0.65 | 0.64 | 0.53 | 0.47 | 0.46 |
Overall, with just 20 lines of PHP we can halve the runtime. With JIT enabled, the drop in runtime is less pronounced: roughly 30 to 40 percent at p=16 (0.19 to 0.12 seconds for this blog, 0.65 to 0.46 seconds for the Lemire blog).
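The gains can be read off the table directly. The sketch below recomputes the speedup factor and the relative reduction from the p=1 and p=16 columns; the numbers are copied from the table, while the array layout and reductionPercent() are only illustrative.

```php
<?php
// Real-time measurements in seconds, copied from the p=1 and p=16 columns.
$timings = [
    'this blog'        => ['p1' => 0.28, 'p16' => 0.12],
    'this blog, JIT'   => ['p1' => 0.19, 'p16' => 0.12],
    'Lemire blog'      => ['p1' => 1.03, 'p16' => 0.52],
    'Lemire blog, JIT' => ['p1' => 0.65, 'p16' => 0.46],
];

// Relative runtime reduction in percent when going from $serial to $parallel.
function reductionPercent(float $serial, float $parallel): float {
    return 100.0 * ($serial - $parallel) / $serial;
}

foreach ($timings as $label => $t) {
    printf("%-18s speedup %.2f, runtime reduced by %.0f%%\n",
        $label, $t['p1'] / $t['p16'], reductionPercent($t['p1'], $t['p16']));
}
```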
The very good performance of JIT, which we can see here, is in line with the findings in Phoronix: PHP 8.0 JIT Is Offering Very Compelling Performance Ahead Of Its Alpha.