Rewriting Perl to plain C

· klm's blog

A Perl script for analyzing COBOL programs was running for more than 30 minutes. The rewritten program using the C programming language uses less t...

Original post is here: eklausmeier.goip.de

Perl script was running too slow. Rewriting it in C made it 20 times faster.

1. Problem statement. Analyze call-tree dependency in COBOL programs. There are 77 million lines of COBOL code in ca. 30,000 files. These 30,000 COBOL programs could potentially include 74,000 COPY-books comprising 10 million lines of additional code. COBOL COPY-books are analogous to C header-files. So in total there are around 88 million lines of COBOL code. Just for comparison: the Linux kernel has ca. 20 million lines of code.

COBOL program analysis started with a simple Perl script. This Perl script is less than 200 lines, including comments. This script produced the desired dependency information.

Wading through all this COBOL code took up to 25 minutes in serial mode, and 13 minutes using 4 cores on an HP EliteBook notebook using Intel Skylake i7-6600U clocked 2.8 GHz. It took 36 minutes on an AMD FX-8120 clocked with 3.1 GHz. This execution time was deemed too long to see any changes in the output changing something in the Perl script. All runs are on Arch Linux 4.14.11-1 SMP PREEMPT.

2. Result. Rewriting the Perl script in C resulted in a speed improvement of factor 20 when run in serial mode, i.e., run times are now 110s on one core. It runs in 32s when using 8 cores on an AMD FX-8120. C program uses taylormade hashing routines. [more_WP_Tag]

3. Runtime. How much time is spent in file processing alone? Word-counting through all COBOL code and COPY-books using wc took around 40s and 7s for source and COPY-books respectively.

 1$ /bin/time -p wc * | tail -1            (in directory with COBOL source)
 2  77249523  228229442 3656045826 total
 3        real 40.94
 4        user 39.87
 5        sys 1.02
 6
 7$ /bin/time -p wc * | tail -1            (in directory with COPY-books)
 8 10403819  39061598 575793581 total
 9        real 7.08
10        user 6.32
11        sys 0.54

Above timings on AMD FX-8120 clearly show that all the source code is already completely cached in memory as sys=0.54. Above commands were run multiple times. Timings using zsh-internal time command always show sys=0, and are probably wrong. These timings give a lower bound on expected runtime of any serial C program. It is interesting to compare wc performance with grep performance on the same data:

1$ /bin/time -p grep -c DIVISION * | tail -1
2ZZZ99:4
3        real 2.46
4        user 1.63
5        sys 0.82

There is a large gap, a factor of 10, when grep needs to do some heavy duty processing on the input:

1$ /bin/time -p grep -P "\s+CALL\s+'?([\-\w]+)" * | tail -1
2ZZZ99:                 CALL abc USING uvw
3        real 22.83
4        user 22.08
5        sys 0.74

Perl script used 36 minutes on AMD FX-8120.

1$ time siriusDynUpro * > ../resultPerl
2        real 2185.76s
3        user 2178.23s
4        sys 0

I.e., 2185s=36m. Perl is version v5.26.1 built for x86_64-linux-thread-multi.

C program used less than 2 minutes and is therefore 20-times faster than the Perl script.

1$ /bin/time -p siriusDynCall * > ../resultC
2        real 110.24
3        user 106.19
4        sys 4.02

4. Hashing in C. While Perl has hashing already built-in, in C one has to either program this oneself, or find a suitable library, e.g., GLib. I wrote on hashing in Hashing Just Random Numbers and Hash functions: An empirical comparison — article by Peter Kankowski. This time I needed lookup, insert, and iterating over all entries. I wrote on this in Hashing with Iterator in C. Also see Hardening Perl's Hash Function by Yves Orton.

5. Scanning COBOL. There are a number of dialects of COBOL. In our case we had classical COBOL with

  1. Sequence number area: columns 1 to 6
  2. Indicator area: column 7
  3. Area A and B: columns 8 to 72
  4. Space or numbering: columns above 72

In our case the COBOL source code might be of mixed case, which must be interpreted as meaning the exact same thing. The next C function cobolGets(), similar to fgets(), fills two arrays: One array containing the original source code, and another where all characters are upper case. The latter is used for our analysis.

 1// Read one line of COBOL Code: ignore first 6 chars, ignore everything over column 72
 2// Do not return '\n', rtrim line
 3// Return number of 'useful' chars in buf[]
 4static int cobolGets (char buf[], char bufUp[], int maxbuf, FILE *fp) {
 5        int c, i, spc;
 6        char *p, *pUp, *q, *qUp;        // we always have: p <= q, pUp <= qUp
 7newline:
 8        p = q = buf;
 9        pUp = qUp = bufUp;
10        buf[0] = '\0';
11        bufUp[0] = '\0';
12        spc = 1;        // line contains only space so far
13        for (i=0; i<maxbuf && (c = fgetc(fp)) != EOF; ++i) {
14                if (c == '\n') {
15                        if (i < 6 || spc) goto newline; // line too short or spaces only, therefore read next line
16                        else {
17                                *p = '\0';
18                                *pUp = '\0';
19                                return p - buf; // dropped first 6 chars
20                        }
21                } else if (i < 6 || i > 71) continue;   // first six chars in COBOL line are usually digits or space
22                else {
23                        *q++ = c;
24                        *qUp++ = toupper(c);
25                        if (c != ' ') {
26                                spc = 0;        // at least one non-space
27                                p = q;  // move p to q
28                                pUp = qUp;
29                        }
30                }
31        }
32        return (-1);    // no more 'useful' chars
33}

The following Perl code is roughly analogous, but not the same:

1        chomp;
2        $_ = substr($_,0,72) if (length($_) >= 72);     # delete numbering on the right
3        s/\s+$//g;      # rtrim
4        next if (length($_) <= 6);
5        my $c6 = substr($_,6,1);
6        next if ($c6 eq '*' || $c6 eq '/');     # COBOL comment
7        $_ = substr($_,7);      # delete numbering on the left + delete c6
8        my $line = uc $_;

Getting individual words or tokens is done with strtok(). Function strtok() alters the original string. This is another reason to have the input in two arrays, see discussion above. The C code does not use PCRE, while the Perl script uses regular expressions in multiple ways, at least in 8 places. COBOL words are extracted by

1// Split line into tokens/words, return number of tokens found.
2static int cobolToken (char line[], char *token[], int maxtoken) {
3        int n = 1;
4        if ((token[0] = strtok(line," ().*/'\"")) == NULL) return 0;
5        while ((token[n] = strtok(NULL," ().*/'\"")) != NULL)
6                if (++n >= maxtoken) break;
7        return n;
8}

This is roughly analogous to the Perl \s+ idiom.

Excerpt from readCobol():

 1static int readCobol (int dirFlag, const char *fname) {
 2        ...
 3        enum { MAXBUF = 1048576, MAXTOKEN=MAXBUF/2 };
 4        static char buf[MAXBUF], bufUp[MAXBUF], *token[MAXTOKEN];
 5        char *pBuf=buf, *pBufUp=bufUp;
 6        struct hashElem *e;
 7        FILE *fp;
 8        ...
 9        if ((fp = fopen(fname,"rb")) == NULL) ...
10        while ((n = cobolGets(pBuf,pBufUp,MAXBUF,fp)) >= 0) {
11                if (pBuf[0] == '*' || pBuf[0] == '/') continue; // ignore COBOL comment line
12                pBufUp[0] = ' ';        // cope for debugging line: 'D' in column 6
13                if (strcmp(pBufUp+n-6," EJECT") == 0    // ignore compiler-directing statements
14                ||  strcmp(pBufUp+n-6," TITLE") == 0
15                ||  strcmp(pBufUp+n-6," SKIP1") == 0
16                ||  strcmp(pBufUp+n-6," SKIP2") == 0
17                ||  strcmp(pBufUp+n-6," SKIP3") == 0) continue;
18                if ((n = cobolToken(bufUp,token,MAXTOKEN)) <= 1) continue;      // line not relevant for our purpose
19                ...
20                if (strcmp(token[1],"DIVISION") == 0) {
21                        if (strcmp(token[0],"DATA") == 0) division = 1;
22                        else if (strcmp(token[0],"PROCEDURE") == 0) division = 2;
23                        else division = 3;
24                } else if (strcmp(token[0],"COPY") == 0) {
25                        readCobol(strlen(token[1]) > 5,token[1]);       // recursive call
26                } else if (division == 1) {     // DATA DIVISION
27                ... work von VALUE clause ...
28                } else if (division == 2) {     // PROCEDURE DIVISION
29                        // check for: MOVE abc TO uvw, or MOVE abc(k) TO uvw
30                        // abc might itself result from spreading, so look in dynHash
31                        p = NULL;
32                        if (n >= 4  &&  strcmp(token[0],"MOVE") == 0
33                        &&  (  (product = hashLookup(pgmHash,token[1])) != NULL
34                                ||  (p = hashLookup(dynHash,token[1])) != NULL  )
35                        ) {
36                        ...
37                        } else if (n >= 2  &&  strcmp(token[0],"CALL") == 0) {
38                                hashInsert(callHash,token[1],"1",1);   // token[1] has actually been called
39                                ...
40                        }
41                }
42        }
43        if (fclose(fp)) return 3;
44
45        return 0;
46}

It was at first unexpected that there are real COBOL programs where the VALUE clause spans multiple thousand lines! That is the reason why the buffers buf[] and bufUp[] (uppercase of buf[]) for a single line/statement are so exceptionally large, here 1 MB.

6. Profiling. To detect which C function consumes most of the CPU time I used gprof. This requires to compile with flags -pg -no-pie. Flag -pg alone does not work due to gcc Bug 21189 - gprof doesn't work with code built with PIE

1cc -Wall -g -pg -no-pie siriusDynCall.c -o siriusDynCall

Result using gprof siriusDynCall gmon.out

 1Each sample counts as 0.01 seconds.
 2  %   cumulative   self              self     total
 3 time   seconds   seconds    calls  ms/call  ms/call  name
 4 82.18     54.76    54.76 167506298     0.00     0.00  cobolGets
 5 11.61     62.49     7.74    24809     0.31     2.67  readCobol
 6  2.78     64.35     1.85 50861689     0.00     0.00  hashKR
 7  1.74     65.51     1.16 89742068     0.00     0.00  cobolToken
 8  1.19     66.30     0.79 50480957     0.00     0.00  hashLookup
 9  0.54     66.66     0.36    74427     0.00     0.00  hashClear
10  0.05     66.69     0.04        2    17.52    21.76  readZZTab
11  0.00     66.69     0.00   380732     0.00     0.00  hashInsert
12  0.00     66.69     0.00   380732     0.00     0.00  sHashInsert
13  0.00     66.69     0.00        7     0.00     0.00  hashInit
14  0.00     66.69     0.00        7     0.00     0.00  sHashInit

Clearly I/O dominates, as expected from wc timings above. Functions sHash*() are just wrappers around hash*().

I also checked with Valgrind whether I messed up with storage allocations.

7. Dead end. Using strace for wc and grep I noticed that read() was reading in larger chunks than above C program, e.g.,

1read(3, "      * PROGRAMM: xyz       "..., 32768) = 32768

Furthermore I noticed some hitherto unknown fadvise64() statements, like

1stat("xyz", {st_mode=S_IFREG|0644, st_size=216375, ...}) = 0
2openat(AT_FDCWD, "uvw", O_RDONLY)     = 3
3fadvise64(3, 0, 0, POSIX_FADV_SEQUENTIAL) = 0

statements for wc. I tried

1        if (posix_fadvise(fileno(fp),0,0,POSIX_FADV_SEQUENTIAL)) ...
2        setbuffer(fp,fbuf[0],MAXFBUF);

directly after fopen(). But this did not change runtime in any form.

8. Outlook. There is still some room for performance improvement. One could use gperftools memory allocator, instead of malloc/calloc/free from libc. Although, one can see from the profiling results above that memory allocation is clearly not dominating runtime, rather I/O is.