Original post is here: eklausmeier.goip.de
Task at hand: You have generated static HTML files with a static site generator. You want to check whether the links in all your files point to valid URLs, and that all your images and PDFs are present.
There is a utility written in the Go programming language called htmltest. I first read about this utility here: Rohan “Seirdy” Kumar: Software I use.
Clearly, you don't want your users to see errors like the ones below:
HTTP code | meaning |
---|---|
400 | Bad request |
401 | Unauthorized |
402 | Payment required |
403 | Forbidden |
404 | Not found |
405 | Method not allowed |
406 | Not acceptable |
407 | Proxy authentication required |
408 | Request timeout |
414 | URI too long |
415 | Unsupported media type |
429 | Too many requests |
500 | Internal server error |
502 | Bad gateway |
503 | Service unavailable |
504 | Gateway timeout |
999 | Non-standard |
1. Installation and configuration
Installation of htmltest is easy under Arch Linux as there is an AUR package.
Help message from htmltest:
$ htmltest -h
htmltest - Test generated HTML for problems
           https://github.com/wjdp/htmltest

Usage:
  htmltest [options] [<path>]
  htmltest -v --version
  htmltest -h --help

Options:
  <path>                       Path to directory or file to test, if omitted we
                               attempt to read from .htmltest.yml.
  -c FILE, --conf FILE         Custom path to config file.
  -h, --help                   Show this text.
  -l LEVEL, --log-level LEVEL  Logging level, 0-3: debug, info, warning, error.
  -s, --skip-external          Skip external link checks, may shorten execution
                               time considerably.
  -v, --version                Show version and build time.
Without a configuration file, htmltest will produce far too many warnings. My configuration file:
IgnoreAltEmpty: true
IgnoreAltMissing: true
IgnoreDirectoryMissingTrailingSlash: true
CheckMailto: false
IgnoreURLs: [ "^/(pagefind|jpilot|koehntopp|mobility|lemire|nukeklaus|panorama|paternoster|saaze-example|vonhoff|wendt|xhprof)", "share.flipboard.com" ]
CheckInternalHash: false
CheckFavicon: true
In particular, IgnoreDirectoryMissingTrailingSlash is important to suppress warnings about unreachable URLs that merely lack the trailing slash (/).
The two lines below denote two different URLs:
http://example.com/abc
http://example.com/abc/
In most cases, though, they are treated as the same.
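To make the distinction concrete, here is a tiny shell sketch of the relaxed comparison. The function name `same_ignoring_trailing_slash` is made up for illustration, and this is not a claim about how htmltest implements the option internally.

```shell
# Hypothetical helper: succeed when two URLs differ at most by one trailing slash.
same_ignoring_trailing_slash() {
    # ${var%/} removes a single trailing "/" if there is one.
    [ "${1%/}" = "${2%/}" ]
}
```

For example, `same_ignoring_trailing_slash http://example.com/abc http://example.com/abc/` succeeds, while a plain string comparison of the two would fail.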
My images mostly do not have any alt attributes, so there is no need to check for them, and I set IgnoreAltEmpty and IgnoreAltMissing.
Setting CheckInternalHash to false turns off the checking of internal links within a page.
It seems that htmltest has a bug here and cannot check them properly, so I disabled that check.
2. Checking internal links
I generated all my HTML files with Simplified Saaze:
time php saaze -mortb /tmp/build
That way all the HTML files are in /tmp/build.
They lack images, PDFs, JavaScript, CSS, etc., so I symlinked those directories into the build directory:
cd /tmp/build
ln -s /srv/http/img
ln -s /srv/http/pdf
ln -s /srv/http/jscss
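If the build directory is regenerated often, the linking step can be wrapped in a small function that is safe to run repeatedly. This is only a sketch: `link_assets` is a hypothetical name, and the `-n` flag is assumed to be available in your `ln` (it is in GNU coreutils).

```shell
# link_assets is a hypothetical helper, not part of htmltest or Simplified Saaze.
# Usage: link_assets BUILD_DIR ASSET_DIR...
link_assets() {
    build_dir=$1
    shift
    for src in "$@"; do
        # -s: create a symbolic link
        # -f: replace an existing link of the same name
        # -n: do not follow an existing link that points to a directory
        ln -sfn "$src" "$build_dir/$(basename "$src")"
    done
}
```

With the paths from above, the call would be `link_assets /tmp/build /srv/http/img /srv/http/pdf /srv/http/jscss`.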
Check internal links only, using -s:
htmltest -c ~/etc/htmltest.yml -s .
Output might look like this:
Skipping the checking of external links.
htmltest started at 09:34:32 on .
========================================================================
✔✔✔ passed in 995.04718ms
tested 650 documents
3. Checking internal and external links
Dropping -s checks both internal and external links:
htmltest -c ~/etc/htmltest.yml .
Output might look like this:
htmltest started at 04:26:30 on .
========================================================================
404.html
Not Found [rel="canonical"] --- 404.html --> https://eklausmeier.goip.de/404.html
music/2021/08-09-music-played-by-kristina-varlid/index.html
Non-OK status: 401 --- music/2021/08-09-music-played-by-kristina-varlid/index.html --> https://player.vimeo.com/video/126529871
music/2024/03-25-music-played-by-patil-zakarian/index.html
Non-OK status: 999 --- music/2024/03-25-music-played-by-patil-zakarian/index.html --> https://www.linkedin.com/in/patilzw/
gallery/2021/09-26-montpellier/index.html
Non-OK status: 403 --- gallery/2021/09-26-montpellier/index.html --> https://www.ialc.org/find-a-school/institut-linguistique-adenet/
gallery/2023/09-17-montreal/index.html
Non-OK status: 404 --- gallery/2023/09-17-montreal/index.html --> https://play.google.com/store/apps/details?id=quebec.artm.chrono
Non-OK status: 403 --- gallery/2023/09-17-montreal/index.html --> https://pubs.acs.org/doi/10.1021/acsomega.1c04196
gallery/2024/09-22-toulouse/index.html
Non-OK status: 404 --- gallery/2024/09-22-toulouse/index.html --> https://play.google.com/store/apps/details?id=com.thebing.app.ts.student.lsf&hl=en_US
blog/2012/12-29-cpu-usage-time-dependant-on-load/index.html
Non-OK status: 404 --- blog/2012/12-29-cpu-usage-time-dependant-on-load/index.html --> https://phoronix.com/forums/showthread.php?78970-Slowdown-when-computing-in-parallel-on-multicore-CPU
blog/2013/01-07-number-of-combinations-for-german-tax-id/index.html
Non-OK status: 500 --- blog/2013/01-07-number-of-combinations-for-german-tax-id/index.html --> https://www1.osci.de/sixcms/media.php/13/Pr%FCfziffernberechnung.pdf
...
========================================================================
✘✘✘ failed in 29m35.511051392s
376 errors in 650 documents
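With several hundred errors, a tally per HTTP status helps to decide what to look at first. The following is a minimal sketch, assuming the report contains `Non-OK status: NNN` lines as in the output above. `summarize_status` is a hypothetical name, and the report is read from standard input.

```shell
# Hypothetical helper: count "Non-OK status" lines per HTTP status code.
# Prints "<count> <code>" lines, most frequent first.
summarize_status() {
    grep -o 'Non-OK status: [0-9]*' |
        awk '{ count[$NF]++ } END { for (code in count) print count[code], code }' |
        sort -k1,1nr -k2,2n
}
```

One possible invocation is `htmltest -c ~/etc/htmltest.yml . 2>&1 | summarize_status`. Lines such as the `Not Found` entry for 404.html do not match the pattern and are not counted.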
4. Summary
Overall, the tool is quick when analyzing internal links only: the run above took about one second for 650 documents. Checking external URLs takes much longer, almost 30 minutes in the run above.
There are, however, a number of bugs or shortcomings with htmltest:
- It produces a number of false alarms, e.g., reuters.com or intel.com are reported as 404.
- Internal links within a page are not handled properly.
- When htmltest encounters a 403, it reports it as an error. It would be helpful if one could configure it to ignore these kinds of errors.
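As a stopgap for the last point, the report can be post-processed to hide selected status codes. This is a sketch only: `drop_status` is a hypothetical name, it relies on the `Non-OK status: NNN` line format shown earlier, and it expects at least one status code as argument. It does not change the exit status or the error count that htmltest itself reports.

```shell
# Hypothetical filter: remove report lines for the given HTTP status codes.
# Usage: drop_status CODE [CODE...]   (report on standard input)
drop_status() {
    # Join the arguments into an alternation such as "403|999".
    codes=$(printf '%s|' "$@")
    # -E: extended regular expressions, -v: print only non-matching lines
    grep -Ev "Non-OK status: (${codes%|}) "
}
```

For instance, `htmltest -c ~/etc/htmltest.yml . 2>&1 | drop_status 403 999` would hide the 403 and 999 entries. File names whose only errors were filtered out still appear in the output.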