Parallel Mass-File Processing

· klm's blog


Original post is here: eklausmeier.goip.de

Task at hand: Process ca. 400,000 files. In our case each file needed to be converted from EBCDIC to ASCII.

Obviously, you could do this sequentially. But on a multiprocessor machine you should make use of all the processing power. The chosen approach is as follows:

  1. Generate a list of all files to be processed, i.e., a file with all filenames, henceforth called fl. For example: find . -mindepth 2 > fl
  2. Split fl into 32 parts ("chunks") without breaking lines: split -nl/32 fl fl.
  3. Process the chunks in parallel, one background job per chunk: for i in fl.??; do processEachChunk "$i" & done
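The three steps can be tried end to end on throwaway data. The following is a minimal sketch, not the original setup: the scratch directory, the sample files, and the function convert_chunk are assumptions for illustration. convert_chunk is only a stand-in that upper-cases text with tr instead of converting EBCDIC, and just 4 chunks are used. It relies on GNU split.

```shell
#!/bin/bash
# Sketch: list files, split the list, process the chunks in parallel.
work=$(mktemp -d)                 # scratch directory (assumption)
cd "$work" || exit 1
mkdir -p data/a data/b
for n in 1 2 3 4 5 6 7 8; do
	echo "file $n" > "data/a/f$n"
	echo "file $n" > "data/b/f$n"
done

# Stand-in for processEachChunk: upper-cases every file named in chunk $1.
convert_chunk() {
	local t
	t=$(mktemp)                   # private temporary file for this job
	while IFS= read -r fn; do
		[ -f "$fn" ] || continue  # skip directories and vanished files
		mv "$fn" "$t" && tr a-z A-Z < "$t" > "$fn"
	done < "$1"
	rm -f "$t"
}

find . -mindepth 2 > fl           # step 1: list of all files
split -n l/4 fl fl.               # step 2: 4 chunks, fl.aa to fl.ad
for i in fl.??; do                # step 3: one background job per chunk
	convert_chunk "$i" &
done
wait                              # block until all background jobs are done
cat data/a/f1
```

The wait at the end matters when later commands depend on the converted files: without it the script would continue while the background jobs are still running.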

In our case each file is processed as below, i.e., processEachChunk looks like:

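T=/tmp/mvscvtInp.$$	# temporary file, unique per process
while IFS= read -r fn; do
	#echo "$fn"
	if [ -f "$fn" ]; then
		# skip the file if the move fails, so it is never overwritten with stale or empty data
		mv "$fn" "$T"  ||  { echo "Error: fn=|$fn|, T=|$T|" >&2; continue; }
		mvscvt -a < "$T" > "$fn"
	fi
done < "$1"	# read the filenames from the chunk file given as first argument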

Here mvscvt is the homegrown program to convert EBCDIC to ASCII. If your EBCDIC files are not special in any way then you can use

dd conv=ascii if=...

instead of mvscvt.
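To see what dd does here, one can build a small EBCDIC sample and convert it back. This is only a sketch on made-up data: the scratch directory and the file names are assumptions, and dd's built-in conversion table may not match the code page of real mainframe files.

```shell
#!/bin/bash
# Sketch: make an EBCDIC sample with dd, then convert it back to ASCII.
dir=$(mktemp -d)                  # scratch directory (assumption)
printf 'hello world 123\n' > "$dir/ascii.txt"

# ASCII -> EBCDIC, to have something to convert (dd statistics go to stderr)
dd conv=ebcdic if="$dir/ascii.txt" of="$dir/ebcdic.bin" 2>/dev/null
# EBCDIC -> ASCII, the direction needed for the task at hand
dd conv=ascii if="$dir/ebcdic.bin" of="$dir/back.txt" 2>/dev/null

cmp "$dir/ascii.txt" "$dir/back.txt" && echo "round trip ok"
```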

If possible, i.e., if all data fits into main memory, do this operation on a RAM disk. On Arch Linux /tmp is mounted as tmpfs, i.e., a RAM disk.
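Whether /tmp really is a RAM disk on a given machine, and how much room it offers, can be checked before starting. A small sketch, assuming GNU stat and df:

```shell
#!/bin/bash
# Sketch: report the filesystem type behind /tmp and the space available there.
fstype=$(stat -f -c %T /tmp)      # GNU stat: filesystem type in human-readable form
echo "/tmp is on: $fstype"
df -h /tmp                        # size and free space of that filesystem
```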

Added 13-Apr-2023: Alternative route. Split into 64 files with all the filenames:

find . -type f > fl
split -nl/64 fl flsp

Now run a program that can handle multiple arguments and therefore does not need to be started over and over again.

for i in flsp*; do mvscvt -ar `cat $i` & done
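Command substitution with cat splits the list on any whitespace, so filenames containing spaces would be torn apart. A different technique, not the one used above, is to let xargs do both the batching and the parallel start. The sketch below assumes GNU xargs and GNU sed. Here sed -i is merely a stand-in for a multi-argument program such as mvscvt -ar, and the scratch directory and sample files are made up.

```shell
#!/bin/bash
# Sketch: let xargs hand batches of filenames to parallel workers.
work=$(mktemp -d)                 # scratch directory (assumption)
cd "$work" || exit 1
mkdir d
for n in 1 2 3 4 5 6; do
	echo "file $n" > "d/name with space $n"
done

find d -type f > fl               # list of all filenames, one per line
# -d '\n' : one filename per line, so spaces in names survive (GNU xargs)
# -n 2    : at most 2 filenames per invocation
# -P 3    : up to 3 invocations running at the same time
# sed -i  : stand-in for "mvscvt -ar", edits each file in place
xargs -d '\n' -n 2 -P 3 sed -i 's/file/FILE/' < fl
cat "d/name with space 1"
```

For the real job, larger values for -n and a -P close to the number of cores would be the natural choice; xargs returns only after all workers have finished.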