A life in plain text with pandoc

Back in the dark ages, when working on the first (now defunct) html version of the group website, I started to wonder a bit about making more use of plain text in my everyday workflows. As much as html isn’t exactly a convenient format for writing, I was tired of trying to herd figures into the right places in Word.

So, I started looking into markup languages, and eventually LaTeX. For the uninitiated, LaTeX (yes, it’s spelled that way, but pronounced lah-tek or lay-tek) is a markup language very commonly used by folks in math-heavy fields. Better yet, it’s free. It’s not, as far as I can tell, particularly common in organic chemistry; however, it does a wonderful job of typesetting, and it was appealing to the aspiring-but-inept programmer in me.

I fooled around with LaTeX for a year or two, especially when writing “fancy” documents like letters of recommendation. On the one hand, it’s great, and the final product is outstanding. LaTeX does a wonderful job of typesetting and typography, much better than Word. It is also highly customizable through the loading of optional packages. For example, the mhchem package allows for rapid writing of chemical formulae and reactions.

On the other hand, I have a hard time writing in it. As much as I love the result, I find the syntax too distracting. The other problem for me is that LaTeX is a bit of a lifestyle choice. I was very reluctant to write my high-value documents in it because we live in an MS Word world, and I was worried about investing a lot of time in a project and then being stuck trying to convert it into a different format. So, while I still use LaTeX for a few specific applications (my CV, the Supporting Information for papers), for me, it’s not a long-term solution.

The ultimate solution to this problem was pandoc. At the most basic level, pandoc is a document converter, allowing various formats (docx, LaTeX, html, etc.) to be converted into many other formats (all of the above and many others, including pdf). Pandoc was developed with academic writing in mind, having been created by John MacFarlane, of all things the Chair of the Department of Philosophy at Berkeley.

The beauty of pandoc is that it has a robust markdown syntax that is easy to work with and legible. Like other markup formats, writing in markdown (pandoc’s or otherwise) forces you to structure documents carefully and worry about aesthetic choices later. Further, it easily allows for more modern documents with seamless linking to urls. Here’s an example:

# This is a heading

## This is a subheading

This is normal text. *This text is in italics.* This **word** is in bold.

- Bullet point 1.
- Bullet point 2.

![Figure caption](path/to/figure.pdf)

The files are in plain text, which appeals to my sense of organization, enables tools for version control (like git), and is the ultimate in future-proofing. Most importantly, though, files in pandoc markdown can be easily converted into essentially any format you can imagine. So, documents can be written without fear that they will be unshareable or unsuitable for submission. It provides a unified platform that can be used to create pdfs, Word documents, simple html pages, etc., all by making minor changes at compile time. Plus, by default it uses LaTeX as the backend for pdf generation, so one gets all of its benefits while being able to hide most (if not all) of the clutter in template files.
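As a rough sketch of what “minor changes at compile time” can look like, the commands below write a small markdown file and convert it to two formats by changing little more than the output file name. The file name `note.md` and the title are placeholders, and the sketch assumes pandoc is installed and on the PATH.

```shell
#!/bin/sh
# Write a small pandoc-markdown source file (note.md is a placeholder name).
cat > note.md <<'EOF'
# This is a heading

This is normal text. *This text is in italics.* This **word** is in bold.

- Bullet point 1.
- Bullet point 2.
EOF

# Convert only if pandoc is available; pandoc infers the output
# format from the extension given to -o.
if command -v pandoc >/dev/null 2>&1; then
    # -s asks for a standalone page; the title is a placeholder.
    pandoc note.md -s --metadata title="Note" -o note.html
    pandoc note.md -o note.docx      # Word document
    # pandoc note.md -o note.pdf     # PDF; needs a LaTeX installation
else
    echo "pandoc not found; skipping conversion" >&2
fi
```

The source file never changes; only the target does, which is what makes it safe to defer the choice of final format.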

Pandoc works nicely with the BibTeX system for citations, which integrates well with literature management systems (like Papers). In fact, I actually like this workflow better than EndNote or Papers alone: I create separate databases for each manuscript, which allows me to easily include footnotes without contaminating my larger database.
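A minimal sketch of that per-manuscript setup: a small BibTeX file sits next to the manuscript, citations are written as `[@key]`, and pandoc resolves them at conversion time. The file names and the bibliography entry are placeholders invented for illustration, and the `--citeproc` flag assumes pandoc 2.11 or later (older versions relied on a separate pandoc-citeproc filter).

```shell
#!/bin/sh
# A per-manuscript bibliography (the entry is a made-up placeholder).
cat > refs.bib <<'EOF'
@article{example2020,
  author  = {Example, Alice},
  title   = {A Placeholder Title},
  journal = {Journal of Placeholders},
  year    = {2020},
  volume  = {1},
  pages   = {1--10}
}
EOF

# The manuscript cites the entry by its key.
cat > manuscript.md <<'EOF'
---
title: Manuscript draft
---

This claim needs support [@example2020].

# References
EOF

# --citeproc formats the citation and appends a reference list.
if command -v pandoc >/dev/null 2>&1; then
    pandoc manuscript.md --citeproc --bibliography=refs.bib -s -o manuscript.html \
        || echo "conversion failed (is pandoc older than 2.11?)" >&2
else
    echo "pandoc not found; skipping conversion" >&2
fi
```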

For my workflow, the added effort is minimal. Honestly, there are just two drawbacks. First, pandoc’s markdown does not give a lot of flexibility in constructing complex tables. That’s not generally an issue for me, but it is a potential concern. Second, because pandoc uses LaTeX to generate pdfs, it’s tricky to carefully place floating figures. The one area where this is a problem for me is in writing proposals: with a hard 15-page limit, you sometimes need to really finesse figure placement, and that’s not easy if you don’t have direct control. So I’m stuck with Word (or Pages) for some tasks, but almost everything else has been moved over.

Those cases aside, pandoc is, for me, the essential tool for creating a work life structured around plain text as the format for documents. From one perspective this isn’t all that big a deal; who ultimately cares if a document is saved as .txt or .docx? However, over time I think that it makes for bigger changes: the distinction between short notes, web content, manuscripts, etc., really starts to blur; simple command-line tools can be used to search, filter, and assemble larger documents that will end up as well-structured pdfs; anything can have simple templates applied to it to adapt it to whatever format is needed.
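To make the “search, filter, and assemble” idea concrete, here is a small sketch using only standard tools. The `notes` directory, its file names, and their contents are hypothetical; the point is just that plain-text files can be selected with `grep` and stitched together with `cat` before pandoc ever sees them.

```shell
#!/bin/sh
# Hypothetical notes directory with three small markdown files.
mkdir -p notes
printf '# Pandoc tips\n\nUse templates for letters.\n' > notes/01-pandoc.md
printf '# Groceries\n\nMilk, eggs.\n' > notes/02-groceries.md
printf '# Figures\n\nCheck figure placement after running pandoc.\n' > notes/03-figures.md

# Search: list the notes that mention pandoc (case-insensitive).
grep -il 'pandoc' notes/*.md > matches.txt

# Assemble: join the matching notes, separated by blank lines.
: > combined.md
while IFS= read -r f; do
    cat "$f" >> combined.md
    printf '\n' >> combined.md
done < matches.txt

# combined.md can then be handed to pandoc like any other source file,
# e.g. pandoc combined.md -o combined.pdf
```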
