Pandoc filters for chemistry

One of the great things about pandoc is that it is very extensible through the use of filters. The best example of this is pandoc-citeproc, which is how references are processed in the native pandoc syntax. However, there are many other filters available, and they are fairly easy to write if you’re passingly familiar with any one of a number of different programming languages (although Haskell—pandoc’s native language—and Python appear to be most common).

As a chemist, this sort of extensibility is both tremendously useful and sometimes very necessary. There are, I’ve realized, some real idiosyncrasies to our writing (for example, our insistence on having at least two and sometimes three different categories of figures that are numbered separately). In LaTeX, these are taken care of with different packages, like mhchem. I like to use LaTeX for Supporting Info files and these sorts of packages are very useful. In pandoc, a lot of similar functionality can be added through short filters that are applied when the files are processed.

In the next few posts, I’ll outline a few examples of short little filters I’ve recently put together to smooth the path for writing chemistry in pandoc. Here’s the first one: pandoc-chem-struct. I’ve put this up as a GitHub repo, although there’s not a whole lot to it.

The basic issue is this: pandoc’s support of sub/superscripts means that writing chemistry is very possible, but the raw markdown can look a bit awkward. For example, butanol comes out like this: CH~3~CH~2~CH~2~CH~2~OH, and sulfate is SO~4~^2-^. These are both very awkward to type, and more importantly they just look terrible and are hard to read.

In LaTeX, the mhchem package allows one to simply write \ce{CH3CH2CH2CH2OH} or \ce{SO4^2-}. There are many other powerful features as well, but this basic functionality was what I missed the most. The pandoc-chem-struct filter co-opts the most basic form of this syntax to allow condensed formulas to be entered quickly.

#! /usr/bin/env python3
"""Pandoc filter to format simple chemical structures.

Structures specified as in s:{CH3CH2O-}, s:{SO4^2-}
are converted to formatted structures such as CH~3~CH~2~O^−^,
SO~4~^2−^.

"""

from pandocfilters import toJSONFilter, Str, Subscript, Superscript
import re

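# Pattern for structures in md (raw strings avoid invalid-escape warnings).
ID_PAT = re.compile(r'(.*)s:\{(.*)\}(.*)')
# Used to identify charges at end of formula; "$" keeps a sign in the
# middle of a formula from being treated as a charge.
CHARGE_PAT = re.compile(r'(\w*)\^?([0-9]*[-–−+])$')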

def chem_struct (key, val, fmt, meta):
    if key == 'Str' and ID_PAT.match(val):
        # Store punctuation after formula in end.
        start, raw_formula, end = ID_PAT.match(val).groups()
        
        if CHARGE_PAT.match(raw_formula):
            formula, charge = CHARGE_PAT.match(raw_formula).groups()
            # Replace hyphen or en dash with a proper minus sign
            charge = charge.replace('-', '−').replace('–', '−')
        else:
            formula, charge = raw_formula, None

        formatted_formula = []

        for d in formula:
            if d.isdigit():
                formatted_formula.append(Subscript([Str(d)]))
            else:
                formatted_formula.append(Str(d))

        if charge:
            formatted_charge = [Superscript([Str(charge)])]
        else:
            formatted_charge = []

        formatted_start = [Str(start)]
        formatted_end = [Str(end)]

        return formatted_start + formatted_formula + formatted_charge \
               + formatted_end

if __name__ == '__main__':
    toJSONFilter(chem_struct)

What does this actually do? If passed to pandoc when processing a document (pandoc -F pandoc-chem-struct.py etc.), it converts text of the form “s:{CH3OH}” to CH~3~OH. There’s not a lot to it: every digit is subscripted, and a trailing charge is superscripted. A “^” can be used to flag trailing numbers as part of the charge: s:{SO4^2-} is interpreted as SO~4~^2−^. The filter also replaces hyphens with proper minus signs.
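
To see the rules in isolation, here is a minimal sketch that applies the same idea to plain strings and returns pandoc markdown rather than AST nodes, so it runs without pandoc or pandocfilters. The helper name to_markdown is my own invention for illustration, and the pattern is only modelled on the filter's CHARGE_PAT (it is anchored at the end of the formula). Unlike the filter, it groups consecutive digits into one subscript, because "~1~~0~" would collide with markdown's strikeout syntax.

```python
import re

# Modelled on the filter's CHARGE_PAT: optional "^", optional digits,
# then a sign, all at the very end of the formula.
CHARGE_PAT = re.compile(r'(\w*)\^?([0-9]*[-–−+])$')


def to_markdown(raw_formula):
    """Return pandoc markdown for a condensed formula (illustrative helper)."""
    match = CHARGE_PAT.match(raw_formula)
    if match:
        formula, charge = match.groups()
        # Normalise hyphen and en dash to a true minus sign.
        charge = charge.replace('-', '−').replace('–', '−')
    else:
        formula, charge = raw_formula, ''
    # Wrap each run of digits in subscript markers.
    body = re.sub(r'(\d+)', r'~\1~', formula)
    return body + ('^' + charge + '^' if charge else '')


for raw in ['CH3OH', 'SO4^2-', 'NH4+']:
    print(raw, '->', to_markdown(raw))
```

Without the “^”, the digits before the sign stay with the formula, so s:{SO42-} would come out as SO~42~^−^ rather than SO~4~^2−^.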

When I was first experimenting with pandoc, I wasted a lot of time working against the program, using templates and little hacks to achieve these same results. For example, I’d use a LaTeX template for PDF output that loaded the mhchem package, and then write structures as raw LaTeX code (“\ce{CH3OH}”). This works just fine if you’re only interested in output to PDF/LaTeX, but it’s a terrible approach because it means you can never translate the document to Word, HTML, etc. (raw LaTeX is ignored in these other formats). These pandoc filters play nice with the program and let you retain its great flexibility.
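
A filter can also have it both ways, since it is told the output format (the fmt argument in the filter above). The sketch below is not part of pandoc-chem-struct. It is a hypothetical helper, render_formula, that builds the AST nodes as plain dicts in the shape pandocfilters produces. It assumes pandoc reports LaTeX-based targets as "latex" or "beamer", and that the LaTeX template loads mhchem. Charges are left out of the fallback branch to keep it short.

```python
def render_formula(raw, fmt):
    """Return pandoc inline nodes (as plain dicts) for a condensed formula.

    Hypothetical helper: emits raw mhchem markup for LaTeX-based output
    (assumes the template loads mhchem) and subscripted digits otherwise.
    """
    if fmt in ('latex', 'beamer'):
        # Raw LaTeX is only emitted where it will actually be understood.
        return [{'t': 'RawInline', 'c': ['latex', '\\ce{' + raw + '}']}]
    nodes = []
    for ch in raw:
        if ch.isdigit():
            nodes.append({'t': 'Subscript', 'c': [{'t': 'Str', 'c': ch}]})
        else:
            nodes.append({'t': 'Str', 'c': ch})
    return nodes


print(render_formula('CH3OH', 'latex'))
print(render_formula('H2O', 'html'))
```

The source document stays format-neutral either way: the s:{...} markup is the only thing written in the markdown, and the choice of representation is deferred to conversion time.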
