Original post is here: eklausmeier.goip.de
1. Problem statement. When you run a web-server on your machine, many bots and crawlers will visit it. When analysing how many "real" visitors you have, you should therefore suppress these entries from the web-server's log file in your analysis.
This blog is served by the Hiawatha web-server. Every visitor writes at least one entry into the web-server's log file, which Hiawatha and many other web-servers call access.log. Hiawatha has further log files, e.g., error.log, garbage.log, and system.log. An entry in access.log looks something like this:
192.168.178.24|Mon 13 Dec 2021 11:37:03 +0100|200|6148|GET /blog/2014/09-20-advances-in-automotive-technology/ HTTP/1.1||Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36|Host: nucsaaze|Connection: keep-alive|Cache-Control: max-age=0|Upgrade-Insecure-Requests: 1|Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9|Sec-GPC: 1|Accept-Encoding: gzip, deflate|Accept-Language: en-US,en;q=0.9
192.168.178.24|Mon 13 Dec 2021 11:37:04 +0100|200|168499|GET /img/RynoSingleWheel.jpg HTTP/1.1|http://nucsaaze/blog/2014/09-20-advances-in-automotive-technology/|Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36|Host: nucsaaze|Connection: keep-alive|Accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8|Sec-GPC: 1|Accept-Encoding: gzip, deflate|Accept-Language: en-US,en;q=0.9
192.168.178.24|Mon 13 Dec 2021 11:37:04 +0100|304|182|GET /img/LitMotorBike.jpg HTTP/1.1|http://nucsaaze/blog/2014/09-20-advances-in-automotive-technology/|Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36|Host: nucsaaze|Connection: keep-alive|Accept: image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8|Sec-GPC: 1|Accept-Encoding: gzip, deflate|Accept-Language: en-US,en;q=0.9|If-Modified-Since: Mon, 13 Dec 2021 10:14:34 GMT
171.25.193.77|Mon 13 Dec 2021 12:58:31 +0100|200|6419|GET / HTTP/1.1||Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36|Host: 94.114.1.108:8443|Connection: close|Accept: */*|Accept-Encoding: gzip
185.220.100.242|Mon 13 Dec 2021 12:58:36 +0100|200|2456|GET /favicon.ico HTTP/1.1||Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36|Host: 94.114.1.108:8443|Connection: close|Accept: */*|Accept-Encoding: gzip
Hiawatha's access.log has the following fields, separated by | (a short parsing sketch follows the list):
- host
- date
- code
- size
- URL
- referer
- user agent
- and other fields
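Given this layout, each entry can be split into its fields with a single split on the | character. Below is a minimal parsing sketch, not part of the filter script discussed later, which merely prints host, status code, and request line of every entry (the field names are mine):

use strict;
use warnings;

# Minimal sketch: split each access.log entry at '|' and pick out
# host (field 0), status code (field 2) and request line (field 4).
while (<>) {
    chomp;
    my ($host, $date, $code, $size, $url, $referer, $ua) = split /\|/;
    printf "%-16s %s %s\n", $host, $code // '-', $url // '-';
}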
2. Examples. Here is the Google-Bot, as seen in access.log; it is a highly welcome crawler:
66.249.76.169|Fri 10 Dec 2021 12:32:14 +0100|200|7037|GET /blog/2021/07-13-performance-comparison-c-vs-java-vs-javascript-vs-luajit-vs-pypy-vs-php-vs-python-vs-perl/ HTTP/1.1||Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)|Host: eklausmeier.goip.de|Connection: keep-alive|Accept: text/html,application/xhtml+xml,application/signed-exchange;v=b3,application/xml;q=0.9,*/*;q=0.8|From: googlebot(at)googlebot.com|Accept-Encoding: gzip, deflate, br|If-Modified-Since: Fri, 19 Nov 2021 04:23:35 GMT
Here are two entries of the Yandex-Bot, also highly welcome:
5.255.253.106|Wed 08 Dec 2021 12:06:43 +0100|200|2761|GET /blog/2014/01-19-cisco-2014-annual-security-report-java-continues-to-be-most-vulnerable-of-all-web-exploits HTTP/1.1||Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)|Host: eklausmeier.goip.de|Connection: keep-alive|From: support@search.yandex.ru|Accept-Encoding: gzip,deflate|Accept: */*
213.180.203.141|Wed 08 Dec 2021 12:06:44 +0100|200|2434|GET /blog/2013/12-08-surfing-the-internet-with-100-mbits HTTP/1.1||Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots)|Host: eklausmeier.goip.de|Connection: keep-alive|From: support@search.yandex.ru|Accept-Encoding: gzip,deflate|Accept: */*
Here is an example of a bot whose purpose is doubtful and probably just nonsense:
45.155.205.233|Wed 08 Dec 2021 19:45:48 +0100|303|228|GET /index.php?s=/Index/\think\app/invokefunction&function=call_user_func_array&vars[0]=md5&vars[1][]=HelloThinkPHP21 HTTP/1.1||Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36|Host: 94.114.1.108:443|Accept-Encoding: gzip|Connection: close
Some of the bots and crawlers identify themselves in the user-agent field, e.g., Google and Yandex. Unfortunately, many do not; quite the contrary, they try to disguise themselves as normal browsers.
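Before deciding which strings belong in the filter lists, it helps to tally the user-agent field over the whole log; bots which do identify themselves then show up immediately. A small sketch, independent of the filter script below:

use strict;
use warnings;

# Count how often each user-agent string (field 6) appears in access.log
# and print them, most frequent first.
my %ua;
while (<>) {
    my @F = split /\|/;
    ++$ua{$F[6]} if defined $F[6] && $F[6] ne '';
}
printf "%6d  %s\n", $ua{$_}, $_
    for sort { $ua{$b} <=> $ua{$a} } keys %ua;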
The Perl script accesslogFilter below filters access.log so that only the "real" visitors remain. The script filters according to:
- IP addresses: some bots and crawlers do not identify themselves, so we have to resort to their bare IP addresses
- HTTP status codes
- identifying string in the user-agent field
3. Configuration. Here is the list of IP addresses I found to be "annoying" in any analysis:
my %ips = (
    '18.212.118.57' => 0, '18.232.89.176' => 0, '3.94.81.106' => 0,
    '34.239.184.105' => 0, '54.80.126.99' => 0, '54.162.60.209' => 0,  # compute-1.amazonaws.com
    '62.138.2.14' => 0, '62.138.2.214' => 0, '62.138.2.160' => 0,
    '62.138.3.52' => 0, '62.138.6.15' => 0, '85.25.210.23' => 0,  # startdedicated.de
    '66.240.192.138' => 0, '66.240.219.133' => 0, '66.240.219.146' => 0, '66.240.236.119' => 0,
    '71.6.135.131' => 0, '71.6.146.185' => 0, '71.6.158.166' => 0, '71.6.165.200' => 0,
    '71.6.167.142' => 0, '71.6.199.23' => 0, '80.82.77.33' => 0, '80.82.77.139' => 0,
    '82.221.105.6' => 0, '82.221.105.7' => 0,  # shodan.io
    '138.246.253.24' => 0, '106.55.250.60' => 0,  # various robots.txt readers
    '127.0.0.1' => 0,  # localhost
    '192.168.178.2' => 0, '192.168.178.20' => 0, '192.168.178.24' => 0,
    '192.168.178.118' => 0, '192.168.178.249' => 0  # local network
);
For example, shodan.io does not identify itself in the user-agent string. Therefore I simply identified them by using nslookup:
$ nslookup 82.221.105.7
7.105.221.82.in-addr.arpa    name = census11.shodan.io.
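The same reverse lookup can also be done in bulk from Perl with the core Socket module; here is a small sketch, with the IP addresses merely as examples taken from the list above:

use strict;
use warnings;
use Socket qw(inet_aton AF_INET);

# Reverse-resolve a list of suspicious IP addresses, like nslookup does.
my @suspects = ('82.221.105.6', '82.221.105.7', '85.25.210.23');
for my $ip (@suspects) {
    my $name = gethostbyaddr(inet_aton($ip), AF_INET);
    printf "%-16s %s\n", $ip, defined($name) ? $name : '(no PTR record)';
}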
Here is the list of HTTP status codes which are filtered out:
my %errorCode = ( 301 => 0, 302 => 0, 400 => 0, 403 => 0, 404 => 0, 405 => 0, 500 => 0, 501 => 0, 503 => 0, 505 => 0 );
Status code 404 would also be very useful for finding links on my web-server which point to nowhere. But the bots are so numerous that it seems more beneficial to just filter it out.
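If one nevertheless wants to keep an eye on broken internal links, the 404 entries can be tallied separately before they are thrown away. A small sketch along these lines, again independent of accesslogFilter:

use strict;
use warnings;

# Tally requested URLs which returned 404, most frequent first,
# so that broken internal links can still be spotted.
my %notFound;
while (<>) {
    my @F = split /\|/;
    ++$notFound{$F[4]} if defined $F[2] && $F[2] eq '404' && defined $F[4];
}
printf "%6d  %s\n", $notFound{$_}, $_
    for sort { $notFound{$b} <=> $notFound{$a} } keys %notFound;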
Here is the list of user-agent strings which identify a bot or crawler:
my %bots = (
    adsbot => 0, adscanner => 0, ahrefsbot => 0, applebot => 0, 'archive.org_bot' => 0,
    baiduspider => 0, bingbot => 0, blexbot => 0,
    ccbot => 0, censysinspect => 0,
    'clark-crawler2' => 0, crawler => 0, crawler2 => 0, 'crawler.php' => 0,
    criteobot => 0, curl => 0,
    dataforseobot => 0, dotbot => 0, 'feedsearch-crawler' => 0,
    facebookexternalhit => 0, fluid => 0, 'go-http-client' => 0,
    googlebot => 0, ioncrawl => 0, lighthouse => 0, ltx71 => 0,
    mediapartners => 0, 'mediapartners-google' => 0,
    netestate => 0, nuclei => 0, nicecrawler => 0, petalbot => 0,
    proximic => 0, 'pulsepoint-ads.txt-crawler' => 0, 'python-requests' => 0,
    semrushbot => 0, 'semrushbot-ba' => 0, sitecheckerbotcrawler => 0, sitelockspider => 0,
    twitterbot => 0, ucrawl => 0,
    virustotal => 0, 'web-crawler' => 0, 'webmeup-crawler' => 0,
    'x-fb-crawlerbot' => 0, yandexbot => 0,
);
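The keys are single lowercase tokens because the script, shown in the next section, splits the user-agent string at separator characters and looks each resulting token up in %bots. Here is a standalone illustration of that lookup, using the Googlebot user-agent from above and only a few of the keys:

use strict;
use warnings;

# Illustration: break a raw user-agent string into lowercase tokens
# and look each token up in %bots.
my %bots = ( googlebot => 0, bingbot => 0, semrushbot => 0 );
my $ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
for ( split(/[ :;,\/\(\)\@]/, lc $ua) ) {
    next if $_ eq '';
    print "bot token found: $_\n" if defined $bots{$_};    # prints: bot token found: googlebot
}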
4. Perl script. With those configurations in place, the rest of the Perl script accesslogFilter is pretty straightforward:
use strict;

use Getopt::Std;
my %opts = ();
getopts('o:',\%opts);
my $statout = (defined($opts{'o'}) ? $opts{'o'} : undef);
my ($emptyUA,$badURL,$smallUA) = (0,0,0);
Place the above configuration, i.e., the hash tables, after these declarations. Now the actual filtering code:
W: while (<>) {
    my @F = split /\|/;
    if (defined($ips{$F[0]})) { $ips{$F[0]} += 1; next; }
    if ($#F <= 5) { $emptyUA += 1; next; }
    if (defined($errorCode{$F[2]})) { $errorCode{$F[2]} += 1; next; }  # not found errors are ignored
    if ($F[4] =~ /XDEBUG_SESSION_START|HelloThink(CMF|PHP)/) { $badURL += 1; next; }
    if ($#F >= 6) {  # Is UA field available?
        if (length($F[6]) <= 3) { $smallUA += 1; next W; }
        for ( split(/[ :;,\/\(\)\@]/,lc $F[6]) ) {
            if (defined($bots{$_})) { $bots{$_} += 1; next W; }  # skip bots
        }
    }
    print;
}
The following part is for reporting purposes only, i.e., when the command line option -o report.txt is given; it is not required for filtering:
if (defined($statout)) {
    open(F,">$statout") || die("Cannot write to $statout");
    for (sort keys %ips) {
        next if ($ips{$_} == 0);
        printf(F "IP\t%d\t%s\n",$ips{$_},$_);
    }
    printf(F "eUA\t%d\n",$emptyUA);
    printf(F "badURL\t%d\n",$badURL);
    for (sort keys %errorCode) {
        next if ($errorCode{$_} == 0);
        printf(F "code\t%d\t%d\n",$errorCode{$_},$_);
    }
    printf(F "sUA\t%d\n",$smallUA);
    for (sort keys %bots) {
        next if ($bots{$_} == 0);
        printf(F "bot\t%d\t%s\n",$bots{$_},$_);
    }
}
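The script reads the log files given on the command line (or stdin) and writes the surviving entries to stdout, so a typical invocation would be along the lines of: accesslogFilter -o report.txt access.log > access-filtered.log (the name of the output file is of course arbitrary).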
The script is in GitHub: eklausme/bin/accesslogFilter.
5. Reporting. Now to get a feeling for how much filtering actually happens when applying the above rules:
- Unmodified access.log for ca. one year: 181,282 entries
- Filtered access.log with accesslogFilter: 32,451 entries remain, i.e., less than 20%
So over 80% of the visits to my web-server stem from bots, crawlers, junk, or myself.
Unfiltered output of goaccess looks like this:
Filtered output of goaccess looks like this:
I have written about goaccess here: Using GoAccess with Hiawatha Web-Server. Unfortunately, goaccess is not good at filtering.
The statistics of accesslogFilter are below. First the statistics for filtering according to IP address. Host "startdedicated" is at the top:
IP 704 106.55.250.60
IP 154 127.0.0.1
IP 419 138.246.253.24
IP 335 18.212.118.57
IP 333 18.232.89.176
IP 1753 192.168.178.118
IP 1341 192.168.178.2
IP 1554 192.168.178.20
IP 6180 192.168.178.24
IP 496 192.168.178.249
IP 331 3.94.81.106
IP 335 34.239.184.105
IP 335 54.162.60.209
IP 329 54.80.126.99
IP 282 62.138.2.14
IP 6757 62.138.2.160
IP 286 62.138.2.214
IP 1353 62.138.3.52
IP 631 62.138.6.15
IP 11 66.240.192.138
IP 14 66.240.219.133
IP 6 66.240.219.146
IP 17 66.240.236.119
IP 32 71.6.135.131
IP 10 71.6.146.185
IP 14 71.6.158.166
IP 5 71.6.165.200
IP 10 71.6.167.142
IP 17 71.6.199.23
IP 32 80.82.77.139
IP 44 80.82.77.33
IP 52 82.221.105.6
IP 41 82.221.105.7
IP 15688 85.25.210.23
The above numbers are depicted in the pie chart below:
Special filtering according to user-agent or bad/silly URL:
eUA 143
badURL 1835
sUA 3580
Distribution of the filtered HTTP status codes. As mentioned above, code 404 dominates by far:
code 16033 301
code 15 302
code 16 400
code 67 403
code 48222 404
code 3877 405
code 91 500
code 126 501
code 981 503
The above numbers as a pie chart:
Statistics on filtered entries according to user-agent strings. Google is at the top, followed by Semrush:
bot 326 adsbot
bot 932 adscanner
bot 5163 ahrefsbot
bot 325 applebot
bot 18 archive.org_bot
bot 426 baiduspider
bot 1011 bingbot
bot 1105 blexbot
bot 2 ccbot
bot 1131 censysinspect
bot 182 clark-crawler2
bot 39 crawler
bot 88 crawler.php
bot 4 criteobot
bot 196 curl
bot 379 dataforseobot
bot 1688 dotbot
bot 8 facebookexternalhit
bot 1 feedsearch-crawler
bot 16 fluid
bot 226 go-http-client
bot 7016 googlebot
bot 34 ioncrawl
bot 19 ltx71
bot 2059 mediapartners-google
bot 410 netestate
bot 1 nicecrawler
bot 346 nuclei
bot 1088 petalbot
bot 77 proximic
bot 4 pulsepoint-ads.txt-crawler
bot 342 python-requests
bot 6609 semrushbot
bot 159 semrushbot-ba
bot 26 sitecheckerbotcrawler
bot 699 sitelockspider
bot 68 twitterbot
bot 2 ucrawl
bot 4 virustotal
bot 1715 yandexbot
The above data as a pie chart.
This gives a good graphical indication of why Bing or Baidu are way inferior to Google search.
6. References. The links below might provide further information on bots & crawlers.
- Web Crawlers: Love the Good, but Kill the Bad and the Ugly: This post talks about limiting the bots & crawlers to your website as they are slowing down the entire server. The author mentions facebookexternalhit visiting his site 541 times per hour!
- IAB/ABC International Spiders and Bots List: A commercial list with bots, spiders, and crawlers. The list costs 15,000 USD.
- Bots and the Adobe Experience Cloud: AEC uses IAB.
- List of bots in StopBadBots: aBots.php
- AWStats robot list: robots.pm.
Added 14-Mar-2022: I added further IP addresses, HTTP status codes, and bot names to the script accesslogFilter. Then I compared the ratio of the original log file to the filtered log file. The table below shows the results. One can see that bots, crawlers, junk, and myself make up almost 90% of the traffic. The table is in reverse chronological order, i.e., the lowest number is the youngest.
Log file | #entries | after filtering | ratio |
---|---|---|---|
0 | 1813 | 375 | 0.207 |
1 | 10913 | 1605 | 0.147 |
2 | 8631 | 1411 | 0.163 |
3 | 8146 | 1287 | 0.158 |
4 | 10319 | 1380 | 0.134 |
5 | 6287 | 1276 | 0.203 |
6 | 8064 | 1239 | 0.154 |
7 | 9684 | 1023 | 0.106 |
8 | 6317 | 1110 | 0.176 |
9 | 8115 | 1096 | 0.135 |
10 | 7334 | 1302 | 0.178 |
11 | 6835 | 1262 | 0.185 |
12 | 7239 | 922 | 0.127 |
13 | 10457 | 1297 | 0.124 |
14 | 8897 | 1102 | 0.124 |
15 | 9554 | 1051 | 0.110 |
16 | 8945 | 1020 | 0.114 |
17 | 4873 | 891 | 0.183 |
18 | 4961 | 753 | 0.152 |
19 | 5865 | 611 | 0.104 |
20 | 4850 | 501 | 0.103 |
21 | 4290 | 492 | 0.115 |
22 | 4863 | 514 | 0.106 |
23 | 4652 | 485 | 0.104 |
24 | 4289 | 520 | 0.121 |
25 | 3940 | 706 | 0.179 |
26 | 4292 | 634 | 0.148 |
27 | 3044 | 608 | 0.200 |
28 | 4390 | 538 | 0.123 |
29 | 3149 | 410 | 0.130 |
30 | 4402 | 470 | 0.107 |
31 | 3894 | 456 | 0.117 |
32 | 3018 | 518 | 0.172 |
33 | 5170 | 646 | 0.125 |
34 | 3980 | 581 | 0.146 |
35 | 3325 | 457 | 0.137 |
36 | 2979 | 478 | 0.160 |
37 | 5648 | 596 | 0.106 |
38 | 3139 | 470 | 0.150 |
39 | 2859 | 360 | 0.126 |
40 | 3157 | 700 | 0.222 |
41 | 2370 | 280 | 0.118 |
42 | 3457 | 286 | 0.083 |
43 | 8597 | 403 | 0.047 |
44 | 2414 | 258 | 0.107 |
45 | 2913 | 318 | 0.109 |
46 | 1835 | 308 | 0.168 |
47 | 2027 | 420 | 0.207 |
48 | 1777 | 319 | 0.180 |
49 | 1369 | 443 | 0.324 |
50 | 2860 | 415 | 0.145 |
51 | 2965 | 292 | 0.098 |
52 | 1325 | 264 | 0.199 |
Added 29-Mar-2022: Added even more IP addresses and bot names to accesslogFilter. The table is now in chronological order, i.e., a lower number is older.
Log file | #entries | after filtering | ratio |
---|---|---|---|
1 | 1325 | 254 | 0.192 |
2 | 2965 | 286 | 0.096 |
3 | 2860 | 402 | 0.141 |
4 | 1369 | 416 | 0.304 |
5 | 1777 | 303 | 0.171 |
6 | 2027 | 404 | 0.199 |
7 | 1835 | 298 | 0.162 |
8 | 2913 | 297 | 0.102 |
9 | 2414 | 253 | 0.105 |
10 | 8597 | 400 | 0.047 |
11 | 3457 | 284 | 0.082 |
12 | 2370 | 277 | 0.117 |
13 | 3157 | 699 | 0.221 |
14 | 2859 | 358 | 0.125 |
15 | 3139 | 451 | 0.144 |
16 | 5648 | 556 | 0.098 |
17 | 2979 | 456 | 0.153 |
18 | 3325 | 442 | 0.133 |
19 | 3980 | 559 | 0.140 |
20 | 5170 | 626 | 0.121 |
21 | 3018 | 497 | 0.165 |
22 | 3894 | 441 | 0.113 |
23 | 4402 | 447 | 0.102 |
24 | 3149 | 394 | 0.125 |
25 | 4390 | 509 | 0.116 |
26 | 3044 | 580 | 0.191 |
27 | 4292 | 605 | 0.141 |
28 | 3940 | 675 | 0.171 |
29 | 4289 | 491 | 0.114 |
30 | 4652 | 466 | 0.100 |
31 | 4863 | 498 | 0.102 |
32 | 4290 | 464 | 0.108 |
33 | 4850 | 489 | 0.101 |
34 | 5865 | 591 | 0.101 |
35 | 4961 | 739 | 0.149 |
36 | 4873 | 861 | 0.177 |
37 | 8945 | 964 | 0.108 |
38 | 9554 | 994 | 0.104 |
39 | 8897 | 1067 | 0.120 |
40 | 10457 | 1270 | 0.121 |
41 | 7239 | 889 | 0.123 |
42 | 6835 | 1233 | 0.180 |
43 | 7334 | 1252 | 0.171 |
44 | 8115 | 1056 | 0.130 |
45 | 6317 | 1068 | 0.169 |
46 | 9684 | 974 | 0.101 |
47 | 8064 | 1119 | 0.139 |
48 | 6287 | 1251 | 0.199 |
49 | 10319 | 1341 | 0.130 |
50 | 8146 | 1107 | 0.136 |
51 | 8631 | 1237 | 0.143 |
52 | 10913 | 1309 | 0.120 |
53 | 7086 | 1364 | 0.192 |
54 | 8863 | 2121 | 0.239 |
55 | 5241 | 436 | 0.083 |
The script for generating this table is:
for k in `seq 54 -1 0`; do
    let km="55-$k"; i=access.log.$k; wci=`cat $i | wc -l`;
    filt=`accesslogFilter $i | wc -l`; let pr="1.0*$filt/$wci";
    printf " %2d | %5d | %4d | %6.3f\n" $km $wci $filt $pr;
done
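Note that the floating-point let in this loop works in zsh, but not with bash's integer-only arithmetic. Alternatively, roughly the same table could be produced with a short Perl sketch, assuming rotated log files access.log.0 through access.log.54 and accesslogFilter being in the PATH:

use strict;
use warnings;

# For each rotated log file: count the entries before and after filtering
# and print the ratio, mirroring the shell loop above.
for my $k (reverse 0 .. 54) {
    my $file = "access.log.$k";
    next unless -r $file;
    my $total = () = do { open(my $fh, '<', $file) or die("$file: $!"); <$fh> };
    my $kept  = () = qx(accesslogFilter $file);
    printf " %2d | %5d | %4d | %6.3f\n", 55 - $k, $total, $kept, $total ? $kept / $total : 0;
}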
Added 10-Dec-2023: Mitchell Krog also has lists of IP addresses, bot-names, class C nets, etc.