Line Length Distribution in Files

· klm's blog


Original post is here: eklausmeier.goip.de

When processing input files, I have to check whether they share a common record format. To do so, I compute the length of each line (record) in the input file.

1. Perl solution. The program below reads the input file and prints a histogram: each line length together with its frequency. Since the lines are not chomped, each length includes the trailing newline.

#!/bin/perl -W
# Histogram of line lengths

use strict;

my %H;    # line length => number of lines with that length

while (<>) {
	$H{length($_)} += 1;    # length includes the trailing newline
}

for (sort {$a <=> $b} keys %H) {
	printf("%5d\t%d\n",$_,$H{$_});
}

2. Perl one-liner. Often a simple Perl program can be condensed into a Perl one-liner. See for example Introduction to Perl one-liners, written by Peteris Krumins. Also see Useful One-Line Scripts for Perl.

perl -ne '$H{length($_)} += 1; END { printf("%5d\t%d\n",$_,$H{$_}) for (sort {$a <=> $b} keys %H); }' <yourFile>

Example usage:

printf "\n\na\nab\nabc\n" | perl -ne '$H{length($_)} += 1; END { printf("%5d\t%d\n",$_,$H{$_}) for (sort {$a <=> $b} keys %H); }'

gives

    1	2
    2	1
    3	1
    4	1

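3. Awk solution. If Perl is not available, then hopefully Awk is installed. The Awk program below accomplishes much the same. Note that Awk's length($0) does not count the trailing newline, so each length is one less than in the Perl output, and every length from 0 up to the maximum is listed, including those that never occur.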

#!/bin/awk -f

function max(a,b) {
	return  a>b ? a : b
}

	# Track the longest line seen and count the lines per length
	{ m = max(length($0),m); x[length($0)] += 1 }

END {
	# Lengths that never occur are printed with an empty count
	for (i=0; i<=m; ++i)
		print i, x[i]
}
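For the original question, whether all records share one format, the full histogram is not always needed. The following sketch only answers yes or no through its exit status. The function name has_common_record_length is hypothetical, and it assumes that "common record format" simply means that all lines have equal length.

```shell
# Hypothetical helper: exit status 0 if every input line has the same
# length as the first line, 1 otherwise. Empty input counts as uniform.
has_common_record_length() {
    awk 'NR == 1 { n = length($0) }            # remember the first length
         length($0) != n { bad = 1; exit }      # stop at the first mismatch
         END { exit bad + 0 }' "$@"
}
```

It could be used as `has_common_record_length yourFile || echo "mixed record lengths"`. Stopping at the first mismatch avoids reading the rest of a large file.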