Testing Static HTML Files with htmltest

· klm's blog

Once a static site generator has generated static HTML files, it is time to test all links and images. That's what htmltest does.

Original post is here: eklausmeier.goip.de

Task at hand: You have generated static HTML files with a static site generator. You want to check whether the links in all your files point to valid URLs, and whether all your images and PDFs are present.

There is a utility written in the Go programming language called htmltest. I first read about this utility here: Rohan “Seirdy” Kumar: Software I use.

Clearly, you don't want your users to see errors like the ones below:

HTTP code  Meaning
400        Bad request
401        Unauthorized
402        Payment required
403        Forbidden
404        Not found
405        Method not allowed
406        Not acceptable
407        Proxy authentication required
408        Request timeout
414        URI too long
415        Unsupported media type
429        Too many requests
500        Internal server error
502        Bad gateway
503        Service unavailable
504        Gateway timeout
999        Non-standard

1. Installation and configuration

Installation of htmltest is easy under Arch Linux as there is an AUR package.

Help message from htmltest:

$ htmltest -h
htmltest - Test generated HTML for problems
           https://github.com/wjdp/htmltest

Usage:
  htmltest [options] [<path>]
  htmltest -v --version
  htmltest -h --help

Options:
  <path>                       Path to directory or file to test, if omitted we
                               attempt to read from .htmltest.yml.
  -c FILE, --conf FILE         Custom path to config file.
  -h, --help                   Show this text.
  -l LEVEL, --log-level LEVEL  Logging level, 0-3: debug, info, warning, error.
  -s, --skip-external          Skip external link checks, may shorten execution
                               time considerably.
  -v, --version                Show version and build time.

Without a configuration file, htmltest produces far too many warnings. I use the following configuration:

IgnoreAltEmpty: true
IgnoreAltMissing: true
IgnoreDirectoryMissingTrailingSlash: true
CheckMailto: false
IgnoreURLs: [ "^/(pagefind|jpilot|koehntopp|mobility|lemire|nukeklaus|panorama|paternoster|saaze-example|vonhoff|wendt|xhprof)", "share.flipboard.com" ]
CheckInternalHash: false
CheckFavicon: true

The IgnoreDirectoryMissingTrailingSlash setting is particularly important: it suppresses warnings about supposedly unreachable URLs that merely lack the trailing slash (/). Strictly speaking, the two lines below denote two different URLs:

http://example.com/abc
http://example.com/abc/

In most cases, though, they are treated as the same.

My images mostly do not have any alt attributes, so there is no point in checking them. I therefore set IgnoreAltEmpty and IgnoreAltMissing.

Setting CheckInternalHash to false turns off the checking of internal hash links within a page. It seems that htmltest has a bug here and cannot check them properly, so I disabled that check.
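Before relying on the IgnoreURLs patterns, it can help to try them against a few sample URLs. The following is only a rough sketch: the helper name is_ignored is made up for illustration, and it uses grep -E as a stand-in for the regular expression engine of htmltest. For simple patterns like the ones above both should agree, but that is an assumption, not a guarantee.

```shell
# Hypothetical helper: succeed if the URL matches at least one pattern.
# grep -E only approximates how htmltest itself evaluates IgnoreURLs.
is_ignored() {
    url=$1
    shift
    for pattern in "$@"; do
        if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
            return 0
        fi
    done
    return 1
}
```

For example, is_ignored "/pagefind/pagefind.js" '^/(pagefind|jpilot)' "share.flipboard.com" succeeds, while the same call with "/blog/index.html" fails, because that path matches neither pattern.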

I generated all my HTML files with Simplified Saaze:

time php saaze -mortb /tmp/build

That way all the HTML files end up in /tmp/build. The build lacks images, PDFs, JavaScript, CSS, etc., so I simply symlinked those directories into it:

cd /tmp/build
ln -s /srv/http/img
ln -s /srv/http/pdf
ln -s /srv/http/jscss
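A dangling symbolic link would make every image or PDF look missing, so it may be worth checking the links right after creating them. Below is a minimal sketch of that idea. The function name link_assets is hypothetical, and the sketch assumes an ln that understands the -n option (GNU and BSD ln both do).

```shell
# Hypothetical helper: link the named directories from SRC into DST
# and stop with an error if a link does not resolve.
link_assets() {
    src=$1
    dst=$2
    shift 2
    for name in "$@"; do
        ln -sfn "$src/$name" "$dst/$name"
        # [ -e ] follows the link, so it is false for a dangling one.
        if [ ! -e "$dst/$name" ]; then
            echo "dangling link: $dst/$name" >&2
            return 1
        fi
    done
}
```

With the directories used here, the call would be link_assets /srv/http /tmp/build img pdf jscss.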

Checking only internal links, using option -s:

htmltest -c ~/etc/htmltest.yml -s .

Output might look like this:

Skipping the checking of external links.
htmltest started at 09:34:32 on .
========================================================================
✔✔✔ passed in 995.04718ms
tested 650 documents

Checking internal and external links, by dropping -s:

htmltest -c ~/etc/htmltest.yml .

Output might look like this:

htmltest started at 04:26:30 on .
========================================================================
404.html
  Not Found [rel="canonical"] --- 404.html --> https://eklausmeier.goip.de/404.html
music/2021/08-09-music-played-by-kristina-varlid/index.html
  Non-OK status: 401 --- music/2021/08-09-music-played-by-kristina-varlid/index.html --> https://player.vimeo.com/video/126529871
music/2024/03-25-music-played-by-patil-zakarian/index.html
  Non-OK status: 999 --- music/2024/03-25-music-played-by-patil-zakarian/index.html --> https://www.linkedin.com/in/patilzw/
gallery/2021/09-26-montpellier/index.html
  Non-OK status: 403 --- gallery/2021/09-26-montpellier/index.html --> https://www.ialc.org/find-a-school/institut-linguistique-adenet/
gallery/2023/09-17-montreal/index.html
  Non-OK status: 404 --- gallery/2023/09-17-montreal/index.html --> https://play.google.com/store/apps/details?id=quebec.artm.chrono
  Non-OK status: 403 --- gallery/2023/09-17-montreal/index.html --> https://pubs.acs.org/doi/10.1021/acsomega.1c04196
gallery/2024/09-22-toulouse/index.html
  Non-OK status: 404 --- gallery/2024/09-22-toulouse/index.html --> https://play.google.com/store/apps/details?id=com.thebing.app.ts.student.lsf&hl=en_US
blog/2012/12-29-cpu-usage-time-dependant-on-load/index.html
  Non-OK status: 404 --- blog/2012/12-29-cpu-usage-time-dependant-on-load/index.html --> https://phoronix.com/forums/showthread.php?78970-Slowdown-when-computing-in-parallel-on-multicore-CPU
blog/2013/01-07-number-of-combinations-for-german-tax-id/index.html
  Non-OK status: 500 --- blog/2013/01-07-number-of-combinations-for-german-tax-id/index.html --> https://www1.osci.de/sixcms/media.php/13/Pr%FCfziffernberechnung.pdf
...
========================================================================
✘✘✘ failed in 29m35.511051392s
376 errors in 650 documents
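With several hundred errors it helps to see first which HTTP status codes dominate. The sketch below counts the "Non-OK status" lines per code. It assumes the report has been saved to a file, and the name status_tally is made up for illustration. Lines in another format, such as the "Not Found" entry for 404.html above, are not counted.

```shell
# Hypothetical helper: read an htmltest report on standard input and
# print "code count" pairs, most frequent code first.
status_tally() {
    awk 'match($0, /Non-OK status: [0-9]+/) {
             # Skip the 15 characters of "Non-OK status: " to get the code.
             code = substr($0, RSTART + 15, RLENGTH - 15)
             count[code]++
         }
         END { for (code in count) print code, count[code] }' |
        sort -k2,2nr -k1,1n
}
```

If the report was stored in, say, /tmp/htmltest.log (a hypothetical file name), then status_tally < /tmp/htmltest.log prints one line per status code.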

4. Summary

Overall, the tool is fast when analyzing internal links only: the run above took about one second for 650 documents. It takes much longer when reaching out to external URLs: almost 30 minutes for the same 650 documents.

There are, however, a number of bugs or shortcomings with htmltest:

  1. It produces a number of false alarms, e.g., reuters.com or intel.com are reported as 404.
  2. Internal links within a page are not handled properly.
  3. When htmltest encounters a 403, it reports it as an error. It would be advantageous if one could configure it to ignore this kind of warning.
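Until something like that is configurable, the third point can be worked around by post-processing a saved report. The following is only a sketch of such a filter, not a feature of htmltest, and the name drop_status is hypothetical. It removes the "Non-OK status" lines for the given codes and prints a document name only if that document still has errors left. The banner and the final totals are dropped as well, since the totals would no longer match.

```shell
# Hypothetical helper: drop_status CODE... reads an htmltest report on
# standard input and removes the error lines for the given status codes.
drop_status() {
    awk -v codes="$*" '
        BEGIN { n = split(codes, c, " "); for (i = 1; i <= n; i++) skip[c[i]] = 1 }
        NF == 0 { next }                     # ignore blank lines
        /^[^ \t]/ { header = $0; next }      # unindented line: remember it, print later
        $1 == "Non-OK" && $2 == "status:" && ($3 in skip) { next }
        { if (header != "") { print header; header = "" } print }
    '
}
```

For example, drop_status 403 999 applied to a saved report hides the 403 and 999 entries and keeps everything else that was reported per document.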