Converting Journal Article from LaTeX to Markdown

Describes two Perl scripts to convert from LaTeX with BibTeX to Markdown.

Original post is here: eklausmeier.goip.de

1. Problem statement. You have a scientific journal article in LaTeX format on arXiv but want it in Markdown format for a personal blog. In our case we take the article "A Parsec-Scale Galactic 3D Dust Map out to 1.25 kpc from the Sun" by Gordian Edenhofer et al. The original paper is here: https://arxiv.org/abs/2308.01295

If the article is in Markdown format, it can then be easily transformed into HTML. Having an article in Markdown format has a number of advantages over having the article in LaTeX format:

  1. It is much easier to write Markdown than LaTeX
  2. Reading HTML is easier than reading a PDF
  3. The notion of a page, i.e., a paper-sized page, does not have a good meaning in the world of smartphones, tablets, etc.

Of course, the math in the LaTeX document will be converted to MathJax.

2. Overview of the content of the scientific article. The article briefly describes the importance of dust:

Interstellar dust comprises only 1% of the interstellar medium by mass, but absorbs and re-radiates more than 30% of starlight at infrared wavelengths. As such, dust plays an outsized role in the evolution of galaxies, catalyzing the formation of molecular hydrogen, shielding complex molecules from the UV radiation field, coupling the magnetic field to interstellar gas, and regulating the overall heating and cooling of the interstellar medium.

Dust's ability to scatter and absorb starlight is precisely the reason why we can probe it in three spatial dimensions.

A novel $\cal O(n)$ method called Iterative Charted Refinement (ICR) was used to analyze the more than 122 billion data points from the Gaia mission.


The algorithm ran for 4 weeks using the SLURM workload manager.

We employ a new Python framework called NIFTy.re for deploying NIFTy models to GPUs. NIFTy.re is part of the NIFTy Python package and internally uses JAX to run models on the GPU. We are able to speed up the evaluation of the value and gradient of ... by two orders of magnitude by transitioning from CPUs to GPUs. Our reconstruction ran on a single NVIDIA A100 GPU with 80 GB of memory for about four weeks.

Needless to say, this four-week run was only one of many runs needed to actually produce the final result.

The result is a 3D dust map

achieving an angular resolution of ${14'}$ ($N_\text{side}=256$). We sample the dust extinction in 516 distance bins spanning 69 pc to 1250 pc. We obtain a maximum distance resolution of 0.4 pc at 69 pc and a minimum distance resolution of 7 pc at 1.25 kpc.

3. Solution. Initially a Pandoc approach was tried. Pandoc and all its dependencies on Arch Linux need more than half a GB (gigabyte!) of space, just for the installation. Even after installation, the Pandoc approach failed.

Perl, the workhorse, had to do the job again. For the conversion I created two Perl scripts:

  1. blogparsec: converts main.tex, i.e., the actual paper
  2. blogbibtex: converts the BibTeX-formatted file literature.bib

Using those two scripts, creating the Markdown file goes like this:

```sh
blogparsec main.tex > 08-03-a-parsec-scale-galactic-3d-dust-map-out-to-1-25-kpc-from-the-sun.md
blogbibtex literature.bib >> 08-03-a-parsec-scale-galactic-3d-dust-map-out-to-1-25-kpc-from-the-sun.md
```

This file still needs some manual editing. One prominent case is moving the table of contents to the top, as it is appended at the end.

4. blogparsec script. Some notes on this Perl script. The input to this script is the actual LaTeX text with all the formulas etc.

First define some variables and use strict mode.

```perl
#!/bin/perl -W
# Convert paper in "Astronomy & Astrophysics" LaTeX format to something resembling Markdown
# Manual post-processing is still necessary but a lot easier

use strict;
my ($ignore,$sectionCnt,$subSectionCnt,$replaceAlgo,$replaceTable) = (1,0,0,0,0);
my (@sections) = ();
```

The frontmatter header is a simple here-document:

```perl
print <<'EOF';
---
date: "2023-08-03 14:00:00"
title: "A Parsec-Scale Galactic 3D Dust Map out to 1.25 kpc from the Sun"
description: "A 3D map of the spatial distribution of interstellar dust extinction out to a distance of 1.25 kpc from the Sun"
MathJax: true
categories: ["mathematics", "astronomy"]
tags: ["interstellar dust", "interstellar medium", "Milky Way", "Gaia", "Gaussian processes", "Bayesian inference"]
---

EOF
```

The main loop looks at each line of main.tex; the $ignore flag skips everything before the \author line, i.e., the LaTeX preamble. After the loop the Literature heading is printed, then all table-of-contents entries collected in @sections, and finally a TOC entry for the Literature section itself.

```perl
while (<>) {
	$ignore = 0 if (/\\author\{Gordian~Edenhofer/);
	next if ($ignore);

    (...)

	print;

	print "\$\$\n" if (/(\\end\{equation\}|\\end\{align\})/);	# enclose with $$ #2
}

print "## Literature<a id=Literature></a>\n";
for (@sections) {
	print $_ . "\n";
}
++$sectionCnt;
print "- [$sectionCnt. Literature](#Literature)\n";
```

What follows is the part marked as (...) in the above code.

Here are the special cases for processing the algorithms and the table in the paper: the two algorithms are simply replaced with screenshots of the original PDF, the table with a here-document:

```perl
	# In this particular case we replace the two algorithms with a corresponding screenshot
	if (/^\\begin\{algorithm/) {
		$replaceAlgo = 1;
		next;
	} elsif (/^\s+Pseudocode for ICR creating a GP/) {
		s/^(\s+)//;
		s/(\\left|right)\\/$1\\\\/g;	# probably MathJax bug
		$replaceAlgo = 0;
		print "![](*<?=\$rbase?>*/img/parsec_res/Algorithm1.webp)\n\n";
	} elsif (/^\s+Pseudocode for our expansion point variational/) {
		s/^(\s+)//;
		$replaceAlgo = 0;
		print "![](*<?=\$rbase?>*/img/parsec_res/Algorithm2.webp)\n\n";
	} elsif ($replaceAlgo == 1) { next; }

	if (/^\\begin\{table/) {
		$replaceTable = 1;
		next;
	} elsif (/^\\end\{table/) {
		$replaceTable = 0;
		print <<'EOF';

Parameters of the prior distributions.
The parameters $s$, $\mathrm{scl}$, and $\mathrm{off}$ fully determine $\rho$.
They are jointly chosen to a priori yield the kernel reconstructed in [Leike2020][].


 Name | Distribution | Mean | Standard Deviation | Degrees of Freedom
 -----|--------------|------|--------------------|--------------------
_s_   | Normal       | 0.0  | Kernel from [Leike2020][] | 786,432 &times; 772
scl   | Log-Normal   | 1.0  | 0.5                |  1
off   |  Normal      | $-6.91\left(\approx\ln10^{-3}\right)$ <br>prior median extinction <br>from [Leike2020][] | 1.0 | 1
      |              |      | Shape Parameter    | Scale Parameter
$n_\sigma$ | Inverse Gamma | 3.0 | 4.0 | #Stars = 53,880,655

EOF
		next;
	} elsif ($replaceTable == 1) { next; }
```

The header with its authors and institutions needs some extra handling:

```perl
s/^\\(author|institute)\{/\n<p>\u$1s:<\/p>\n\n1. /;

s/\~/ /g;

# Authors, institutions, abstract, etc.
s/\(\\begin\{CJK\*.+?CJK\*\}\)//;
s/\\inst\{(.+?)\}/ \($1\)/g;
if (/^\s+\\and/) { print "1. "; next; }
s/^\{% (\w+) heading \(.+$/\n\n_\u$1._ /;
s/^\\abstract/## Abstract/;
s/^\\keywords\{/__Key words.__ /;
```
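
To illustrate the effect, here is a minimal standalone sketch that applies the three author-related substitutions to a single, shortened input line (the line is made up, modeled on the paper's author block):

```perl
#!/bin/perl -W
# Minimal illustration of the author handling; the input line is shortened and made up.
use strict;
$_ = "\\author{Gordian~Edenhofer\\inst{1,2}\n";
s/^\\(author|institute)\{/\n<p>\u$1s:<\/p>\n\n1. /;
s/\~/ /g;
s/\\inst\{(.+?)\}/ \($1\)/g;
print;
# prints:
#
# <p>Authors:</p>
#
# 1. Gordian Edenhofer (1,2)
```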

Many lines are simply no longer needed in Markdown and are therefore dropped:

```perl
# Lines to drop, not relevant
next if (/(^\\maketitle|^%\s+|^%In general|^\\date|^\\begin\{figure|^\\end\{figure|\s+\\centering|\s+\\begin\{split\}|\s+\\end\{split\}|^\s*\\label|^\\end\{acknowledgements\}|^\\FloatBarrier|^\\bibliograph|^\\end\{algorithm\}|^\\begin\{appendix|^\\end\{appendix\}|^\\end\{document\})/);

s/\s+%\s+[^%].+$//;	# Drop LaTeX comments
s/\\fnmsep.+$//;	# drop e-mail
```

Display math is enclosed in double dollars:

1print "\$\$\n" if (/(\\begin\{equation\}|\\begin\{align\})/);	# enclose with $$a #1

Images are replaced with the usual Markdown code ![]():

```perl
# images
s/\s+\\includegraphics.+res\/(\w+)\}/!\[Photo\]\(\*<\?=\$rbase\?>\*\/img\/parsec_res\/$1\.png)/;
s/\s+\\subcaptionbox\{(.+?)\}\{\%/\n__$1__\n/g;
```
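
As a quick check, the \includegraphics rule applied to a made-up input line (the file name dustmap is only an example) yields the expected Markdown image tag:

```perl
#!/bin/perl -W
# Effect of the image substitution; the \includegraphics line and file name are made up.
use strict;
$_ = "    \\includegraphics[width=\\columnwidth]{res/dustmap}\n";
s/\s+\\includegraphics.+res\/(\w+)\}/!\[Photo\]\(\*<\?=\$rbase\?>\*\/img\/parsec_res\/$1\.png)/;
print;    # prints: ![Photo](*<?=$rbase?>*/img/parsec_res/dustmap.png)
```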

Some LaTeX macros are not present in MathJax and therefore need to be replaced.

```perl
# MathJax doesn't know \nicefrac
s/\\nicefrac\{(.+?)\}\{(.+?)\}/\{$1\}\/\{$2\}/g;
s/\\coloneqq/:=/g;	# MathJax doesn't know \coloneqq + \argmin + \SI
s/\\argmin/\\mathop\{\\hbox\{arg min\}\}/g;
s/\\SI(|\[parse\-numbers=false\])\{(.+?)\}/$2/g;
s/\\SIrange\{(.+?)\}\{(.+?)\}\{(|\\)([^\\]+?)\}/$1 $4 to $2 $4/g;
s/\\nano\\meter/nm/g;
s/\{\\pc\}/pc/g;
s/\{\\kpc\}/kpc/g;
s/(kpc|pc)\$/\\\\,\\hbox\{$1\}\$/g;
s/\{\\cubic\\pc\}/\\\\,\\hbox\{pc\}^3/g;
```

What looks good in LaTeX does not necessarily look good in Markdown:

```perl
s/i\.e\.\\ /i.e., /g;

# Special cases
s/``([A-Za-z])/"$1/g;	# double backquotes in LaTeX have an entirely different meaning than in Markdown
```

More MathJax specialities:

```perl
# These are probably MathJax bugs, which we correct here
s/\$\\tilde\{Q\}_\{\\bar\{\\xi\}\}\$/\$\\tilde\{Q\}\\_\{\\bar\{\\xi\}\}\$/g;
s/\$\\mathcal\{D\}_/\$\\mathcal\{D\}\\_/g;
s/\$P\(d\|\\mathcal\{D\}_/\$P\(d\|\\mathcal\{D\}\\_/g;
s/\$\\mathrm\{sf\}_/\$\\mathrm\{sf\}\\_/g;
```
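
Each of these substitutions merely escapes an underscore inside an inline-math expression. A minimal sketch with a made-up input sentence:

```perl
#!/bin/perl -W
# The \mathcal{D} rule only inserts a backslash before the underscore; the sentence is made up.
use strict;
$_ = 'the data $\mathcal{D}_i$ of star $i$' . "\n";
s/\$\\mathcal\{D\}_/\$\\mathcal\{D\}\\_/g;
print;    # prints: the data $\mathcal{D}\_i$ of star $i$
```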

Various LaTeX text-macros:

```perl
s/\\url\{(.+?)\}/$1/g;	# Markdown automatically URL-ifies URLs, so we can dispense with \url{}

# Thousands separator, see https://stackoverflow.com/questions/33442240/perl-printf-to-use-commas-as-thousands-separator
s/\\num\[group-separator=\{,\}\]\{(\d+)\}/scalar reverse(join(",",unpack("(A3)*", reverse int($1))))/eg;

# Code
s/\\lstinline\|(.+?)\|/`$1`/g;
s/\\texttt\{(.+?)\}/`$1`/g;
s/quality\\_flags\$<\$8/quality_flags<8/g;	# special case

# Special cases for preventing code blocks because of indentation
s/   (The angular resolution)/$1/;
s/   (The stated highest r)/$1/;
```
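
The thousands-separator substitution with its eval-ed replacement is worth a closer look. A standalone sketch, using the star count from the table above as input:

```perl
#!/bin/perl -W
# Standalone check of the thousands-separator substitution; the sentence is made up,
# the number is the star count quoted in the paper.
use strict;
my $line = 'a total of \num[group-separator={,}]{53880655} stars';
$line =~ s/\\num\[group-separator=\{,\}\]\{(\d+)\}/scalar reverse(join(",",unpack("(A3)*", reverse int($1))))/eg;
print "$line\n";    # prints: a total of 53,880,655 stars
```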

Section and subsection headers become ## and ### in Markdown:

```perl
# sections + subsections
if (/\\section\{(.+?)\}\s*$/) {
    my $s = $1;
    ++$sectionCnt; $subSectionCnt = 0;
    push @sections, "- [$sectionCnt. $s](#s$sectionCnt)";
    $_ = "\n## $sectionCnt. $s<a id=s$sectionCnt></a>\n";
} elsif (/\\subsection\{(.+?)\}\s*$/) {
    my $s = $1;
    ++$subSectionCnt;
    push @sections, "\t- [$sectionCnt.$subSectionCnt $s](#s${sectionCnt}_$subSectionCnt)";
    $_ = "\n### $sectionCnt.$subSectionCnt $s<a id=s${sectionCnt}_$subSectionCnt></a>\n";
}
```
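
For example, a \section line (the title "Methods" below is chosen for illustration) with $sectionCnt already at 1 produces a numbered Markdown heading with an anchor, plus a matching table-of-contents entry in @sections:

```perl
#!/bin/perl -W
# Illustration of the section handling; the section title is made up.
use strict;
my ($sectionCnt,$subSectionCnt,@sections) = (1,0);
$_ = "\\section{Methods}\n";
if (/\\section\{(.+?)\}\s*$/) {
	my $s = $1;
	++$sectionCnt; $subSectionCnt = 0;
	push @sections, "- [$sectionCnt. $s](#s$sectionCnt)";
	$_ = "\n## $sectionCnt. $s<a id=s$sectionCnt></a>\n";
}
print;                     # prints (after a blank line): ## 2. Methods<a id=s2></a>
print "$sections[0]\n";    # prints: - [2. Methods](#s2)
```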

For footnotes I used block quotes in Markdown.

```perl
if (/(\\footnotetext\{%|^\\begin\{acknowledgements\})/) { print "> "; next; }
```

I fought a little bit with citations and initially had something like:

```perl
# Citations
#s/\\citep(|\[.*?\]\[\])\{(\w+)\}/'('.(length($1)>4?substr($1,1,-3).' ':'').'['.join('], [',split(',',$2)).'][])'/eg;
# First approach, now obsolete through eval()-approach
#s/\\citep\{(\w+)\}/([$1][])/g;
#s/\\citep\{(\w+),(\w+)\}/([$1][], [$2][])/g;
#s/\\citep\{(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][], [$6][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][], [$6][], [$7][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][], [$6][], [$7][], [$8][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][], [$6][], [$7][], [$8][], [$9][])/g;
#s/\\citep\{(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+),(\w+)\}/([$1][], [$2][], [$3][], [$4][], [$5][], [$6][], [$7][], [$8][], [$9][], [$10][])/g;
#s/\\citet\{(\w+)\}/[$1][]/g;
```

Luckily this can be handled by an eval inside the regex substitution; watch out for the s///eg, the e flag is important:

```perl
s!\\citep\{([,\w]+)\}!'(['.join('][], [',split(/,/,$1)).'][])'!eg;	# cite-parenthesis without any prefix text
s!\\citep\[(.+?)\]\[\]\{(\w+)\}!'('.$1.' ['.join('][], [',split(/,/,$2)).'][])'!eg;	# citep with prefix text
s!\\(citet|citeauthor)\{([,\w]+)\}!'['.join('][], [',split(/,/,$2)).'][]'!eg;	# we handle citet+citeauthor the same
```
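
A standalone sketch of the first rule, using a made-up sentence and the two BibTeX keys that appear in the example in section 5:

```perl
#!/bin/perl -W
# Quick check of the eval-based \citep substitution; the sentence is made up,
# the keys are taken from the BibTeX example in section 5.
use strict;
my $line = 'as shown previously \citep{Draine2011,Popescu2002}';
$line =~ s!\\citep\{([,\w]+)\}!'(['.join('][], [',split(/,/,$1)).'][])'!eg;
print "$line\n";    # prints: as shown previously ([Draine2011][], [Popescu2002][])
```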

During development of this Perl script I used Beyond Compare quite intensively to compare the original against the changed file.

5. blogbibtex script. The input to this script is the BibTeX file with all literature references. The BibTeX file looks something like this:

```bibtex
@book{Draine2011,
  author  = {{Draine}, Bruce T.},
  title   = {{Physics of the Interstellar and Intergalactic Medium}},
  year    = 2011,
  adsurl  = {https://ui.adsabs.harvard.edu/abs/2011piim.book.....D},
  adsnote = {Provided by the SAO/NASA Astrophysics Data System}
}
@article{Popescu2002,
  author        = {{Popescu}, Cristina C. and {Tuffs}, Richard J.},
  title         = {{The percentage of stellar light re-radiated by dust in late-type Virgo Cluster galaxies}},
  journal       = {\mnras},
  keywords      = {galaxies: clusters: individual: Virgo Cluster, galaxies: fundamental parameters, galaxies: photometry, galaxies: spiral, galaxies: statistics, infrared: galaxies, Astrophysics},
  year          = 2002,
  month         = sep,
  volume        = {335},
  number        = {2},
  pages         = {L41-L44},
  doi           = {10.1046/j.1365-8711.2002.05881.x},
  archiveprefix = {arXiv},
  eprint        = {astro-ph/0208285},
  primaryclass  = {astro-ph},
  adsurl        = {https://ui.adsabs.harvard.edu/abs/2002MNRAS.335L..41P},
  adsnote       = {Provided by the SAO/NASA Astrophysics Data System}
}
```

The Perl script has some journal names preloaded:

```perl
#!/bin/perl -W
# Convert BibTeX to Markdown. Produce the following:
#    1. List of URL targets
#    2. Sorted list of literature entries

use strict;
my ($inArticle,$entry,$entryOrig,$type) = (0,"","","");
my %H;	# hash of hashes (each element in hash is yet another hash)
my %Journals = (	# see http://cdsads.u-strasbg.fr/abs_doc/aas_macros.html
	'\aap'   => 'Astronomy & Astrophysics',
	'\aj'    => 'Astronomical Journal',
	'\apj'   => 'The Astrophysical Journal',
	'\apjl'  => 'Astrophysical Journal, Letters',
	'\apjs'  => 'Astrophysical Journal, Supplement',
	'\mnras' => 'Monthly Notices of the RAS',
	'\nat'   => 'Nature'
);
```

The actual loop populates the hash %H:

```perl
while (<>) {
	if (/^@(article|book|inproceedings|misc|software)\{(\w+),$/) {
		($type,$entry,$entryOrig,$inArticle) = ($1,uc $2,$2,1);
		$H{$entry}{'entry'} = $entryOrig;
		$H{$entry}{'type'} = $type;
		#printf("\t\tentry = |%s|, type = |%s|\n",$entry,$type);
	} elsif ($inArticle) {
		if (/^}\s*$/) { $inArticle = 0; next; }
		if (/^\s+(\w+)\s*=\s*(.+)(|,)$/) {
			my ($key,$value) = ($1,$2);

			# LaTeX foreign language character handling
			$value =~ s/\{\\ss\}/ß/g;
			$value =~ s/\{\\"A\}/Ä/g;
			$value =~ s/\{\\"U\}/Ü/g;
			$value =~ s/\{\\"O\}/Ö/g;
			$value =~ s/\{\\"a\}/ä/g;
			$value =~ s/\{\\"u\}/ü/g;
			$value =~ s/\{\\"i\}/ï/g;
			$value =~ s/\{\\H\{o\}\}/ő/g;
			$value =~ s/\{\\"\\i\}/ï/g;
			$value =~ s/\{\\"o\}/ö/g;
			$value =~ s/\{\\'A\}/Á/g;	# accent aigu
			$value =~ s/\{\\'E\}/É/g;	# accent aigu
			$value =~ s/\{\\'O\}/Ó/g;	# accent aigu
			$value =~ s/\{\\'U\}/Ú/g;	# accent aigu
			$value =~ s/\{\\'a\}/á/g;	# accent aigu
			$value =~ s/\{\\'e\}/é/g;	# accent aigu
			$value =~ s/\{\\'o\}/ó/g;	# accent aigu
			$value =~ s/\{\\'u\}/ú/g;	# accent aigu
			$value =~ s/\{\\`a\}/à/g;	# accent grave
			$value =~ s/\{\\`e\}/è/g;	# accent grave
			$value =~ s/\{\\`u\}/ù/g;	# accent grave
			$value =~ s/\{\\^a\}/â/g;	# accent circonflexe
			$value =~ s/\{\\^e\}/ê/g;	# accent circonflexe
			$value =~ s/\{\\^i\}/î/g;	# accent circonflexe
			$value =~ s/\{\\^\\i\}/î/g;	# accent circonflexe
			$value =~ s/\{\\^o\}/ô/g;	# accent circonflexe
			$value =~ s/\{\\^u\}/û/g;	# accent circonflexe
			$value =~ s/\{\\~A\}/Ã/g;	# tilde A
			$value =~ s/\{\\~a\}/ã/g;	# tilde a
			$value =~ s/\{\\~O\}/Õ/g;	# tilde O
			$value =~ s/\{\\~o\}/õ/g;	# tilde o
			$value =~ s/\{\\~n\}/ñ/g;	# palatal n
			$value =~ s/\{\\v\{C\}/Č/g;	# grapheme C
			$value =~ s/\{\\v\{c\}/č/g;	# grapheme c
			$value =~ s/\{\\v\{S\}/Š/g;	# grapheme S
			$value =~ s/\{\\v\{s\}/š/g;	# grapheme s
			$value =~ s/\{\\v\{Z\}/Ž/g;	# grapheme Z
			$value =~ s/\{\\v\{z\}/ž/g;	# grapheme z

			$value =~ s/\{|\}|\~//g;	# drop {}~
			$value =~ s/,$//;	# drop last comma
			$H{$entry}{$key} = $value;
			#printf("\t\t\tentry = |%s|, key = |%s|\n", $entry, $key);
		}
	}
}
```
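
The heart of the loop is the key=value regex. A minimal sketch of how one line of the Draine2011 entry above is split and cleaned:

```perl
#!/bin/perl -W
# How a single BibTeX line is split into key and value; the input line is taken
# from the Draine2011 example above.
use strict;
$_ = "  title   = {{Physics of the Interstellar and Intergalactic Medium}},\n";
if (/^\s+(\w+)\s*=\s*(.+)(|,)$/) {
	my ($key,$value) = ($1,$2);
	$value =~ s/\{|\}|\~//g;	# drop {}~
	$value =~ s/,$//;	# drop last comma
	printf("key = |%s|, value = |%s|\n", $key, $value);
	# prints: key = |title|, value = |Physics of the Interstellar and Intergalactic Medium|
}
```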

Once everything is loaded into the hash, the hash is printed out in formatted form.

 1print("\n");
 2for my $e (sort keys %H) {
 3	my $He = \%H{$e};
 4	my $url =
 5	printf("[%s]: %s\n", $H{$e}{'entry'},
 6		exists($H{$e}{'doi'}) ? 'https://doi.org/'.$H{$e}{'doi'}
 7		: exists($H{$e}{'url'}) ? $H{$e}{'url'} : '#Literature');
 8}
 9print("\n");
10
11for my $e (sort keys %H) {
12	my ($He,$date,$journal) = (\$H{$e},"","");
13	if (exists($$He->{'year'}) && exists($$He->{'month'}) && exists($$He->{'day'})) {
14		$date = sprintf("%02d-%s-%d", $$He->{'year'}, $$He->{'month'}, $$He->{'day'});
15	} elsif (exists($$He->{'year'}) && exists($$He->{'month'})) {
16		my $m = $$He->{'month'};
17		$date = "\u$m" . "-" . 	$$He->{'year'};
18	} elsif (exists($$He->{'year'})) {
19		$date = $$He->{'year'};
20	}
21	if (exists($$He->{'journal'})) {
22		my $t = $$He->{'journal'};
23		$journal = ", " . ((substr($t,0,1) eq '\\') ? $Journals{$t} : $t);
24		$journal .= ", Vol. " . $$He->{'volume'} if (exists($$He->{'volume'}));
25		$journal .= ", Nr. " . $$He->{'number'} if (exists($$He->{'number'}));
26		$journal .= ", pp. " . $$He->{'pages'} if (exists($$He->{'pages'}));
27	}
28
29	printf("1. \\[%s\\] %s: _%s_, %s%s%s\n", $H{$e}{'entry'}, $H{$e}{'author'},
30		defined($H{$e}{'title'}) ? $H{$e}{'title'} : $H{$e}{'howpublished'},
31		$date, $journal,
32		exists($H{$e}{'doi'}) ? ', https://doi.org/'.$H{$e}{'doi'}
33		: exists($H{$e}{'url'}) ? ', ' . $H{$e}{'url'} : ''
34	);
35}

The output of this blogbibtex script is then appended to the output of the previous script blogparsec.
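
For the two example entries from the BibTeX snippet above, the generated output should look roughly like this (first the URL targets, then the sorted literature list):

```
[Draine2011]: #Literature
[Popescu2002]: https://doi.org/10.1046/j.1365-8711.2002.05881.x

1. \[Draine2011\] Draine, Bruce T.: _Physics of the Interstellar and Intergalactic Medium_, 2011
1. \[Popescu2002\] Popescu, Cristina C. and Tuffs, Richard J.: _The percentage of stellar light re-radiated by dust in late-type Virgo Cluster galaxies_, Sep-2002, Monthly Notices of the RAS, Vol. 335, Nr. 2, pp. L41-L44, https://doi.org/10.1046/j.1365-8711.2002.05881.x
```

Since Markdown renumbers repeated `1.` items automatically, these lines render as an ordered literature list.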

6. Open issues. I had already worked for two days on these two Perl scripts and wanted to finish. Therefore the following topics are not addressed, but they can be solved quite easily.

  1. There are still some stray curly braces, which should be removed.
  2. Back and forward references, i.e., all the still visible \Cref tags, should be converted to link references in Markdown.
  3. The LaTeX table was converted manually; this should be fully automatic.
  4. Converting the \begin{algorithm} and \end{algorithm} environments is probably a lot trickier, as it needs extra CSS to work properly.