pandoc for web pages

Pandoc is a document format converter with support for several markdown flavors, including its own. Markdown is easier to read and write than html, and pandoc has thoughtful extensions and options that made me consider it as a source format for simple web design. This page is both a test and a record of my approach to producing the same kind of result as my hand-written html. The markdown source file for this page is pandoctest.md.

pandoc configuration

For this page (and maybe more to follow) I created a custom html template for pandoc and a yaml file with pandoc arguments. On my Debian-based system these user files are supposed to be located in subdirs of the pandoc data dir (which had to be created first):

~/.local/share/pandoc/templates/katjaaspandoc.html
~/.local/share/pandoc/defaults/katjaas.yaml

However, instead of storing the files there, I made the above paths symlinks, because I like to keep these particular config files in the web page source tree on my data partition. That avoids config differences between systems (dual boot, fresh install etcetera).
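A minimal sketch of that symlink setup, under a few assumptions: the helper name link_pandoc_config is hypothetical, the source directory is whatever folder holds the real files, and the default data dir follows the XDG convention (~/.local/share/pandoc unless XDG_DATA_HOME is set).

```shell
# Sketch: link pandoc config files from a source tree into the user data dir.
# link_pandoc_config is a hypothetical helper, not part of pandoc.
# $1 = directory holding the real files, $2 = pandoc user data dir (optional)
link_pandoc_config() {
  src_dir=$1
  data_dir=${2:-${XDG_DATA_HOME:-$HOME/.local/share}/pandoc}

  # the data dir and its subdirs may not exist yet
  mkdir -p "$data_dir/templates" "$data_dir/defaults"

  # -s makes a symbolic link, -f replaces an existing one so the
  # function can be run again after a fresh install
  ln -sf "$src_dir/katjaaspandoc.html" "$data_dir/templates/katjaaspandoc.html"
  ln -sf "$src_dir/katjaas.yaml" "$data_dir/defaults/katjaas.yaml"
}
```

It would be called with the path of the web source tree as the only argument. Running pandoc --version should print the user data directory that pandoc actually looks in, which is worth checking before linking.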

html template

Markdown text defines only the body content for an html file. When pandoc is called with the -s option it will produce a “standalone” output file, and the <head> section etcetera will be included from an html template. The template can also expand metadata and other variables. Print the content of pandoc’s default template in a terminal to see what it does:

$ pandoc -D html

My custom html template lives within the web source tree so it can be opened from here as katjaaspandoc.html. Rendered as a web page in a browser it looks almost empty but in html source view there is more to see. To use a non-default template with pandoc it must be specified as a command argument. I have included this argument in the config file katjaas.yaml but otherwise it would go like:

$ pandoc pandoctest.md -s --template=katjaaspandoc.html -o pandoctest.html

yaml file

Sets of command arguments for pandoc can be specified in so-called “defaults files”. My defaults/katjaas.yaml for rendering markdown to html contains these arguments defined as key:value pairs:

standalone: true
toc: true
toc-depth: 2
template: katjaaspandoc.html

No matter how you name a “defaults file”, pandoc will not read it automatically as a list of default arguments. You can store multiple defaults files even for a single output format, and then specify a defaults file as pandoc argument, like for the case of katjaas.yaml:

$ pandoc pandoctest.md -dkatjaas -o pandoctest.html

Pandoc arguments and their equivalent yaml formats are listed in the pandoc manual.

yaml block

Pandoc supports blocks of yaml code embedded within a markdown text file, but for different purposes than arguments in yaml files. Embedded yaml blocks can be used to define variables for the html template and will not be visible in rendered html. Use cases for embedded yaml code:

as literal metadata strings to be parsed anywhere but the html body
as values to be evaluated in conditional tests
as markdown code to be converted to html

Example of a yaml block with literal strings and markdown code (note the pipe character for the latter use case):

---
title: pandoc for web pages
date: 2023-04-14
markdown-flavor: pandoc

homepage: |
  [my home page](https://www.katjaas.nl)
---

It seems that pandoc arguments aren’t meant to be passed from within a markdown text file even though yaml code is supported there. I’ve been trying to specify arguments like toc-depth and template in yaml blocks and it just would not work. Maybe there are ways after all but now I’m fine as it encouraged me to only write content in the markdown file and not formatting instructions.

link syntax

A hyperlink or anchor in its basic form consists of a visible and clickable text (or image) plus a URL definition that is invisible in the rendered form of the document. The URL can be a web address or a relative path on the local filesystem, but it can also point to a destination in the document through a unique identifier. As if this were not complex enough, markdown adds the concept of the reference link, which can be understood as a link to a link. Moreover, shorthand notation is sometimes allowed. This is all meant to keep your markdown text tidy but it introduces extra levels of abstraction. Below is my understanding of the concepts.

link text and URL
document-internal link
reference link
extra sugar

link text and URL

The basic link type in markdown (“inline link”) is written with the clickable link text between brackets and the URL between parentheses. Optionally a title is written between double quotes to create a mouseover tooltip. Here is an example without and with title, and below that the rendered links:

- [katjaas.nl](https://www.katjaas.nl)
- [katjaas.nl](https://www.katjaas.nl "katja's home page")

An inline link where clickable text and URL are identical can be written in a simpler form called “automatic link”:

- <https://www.katjaas.nl>

https://www.katjaas.nl

document-internal link

Links to other places in the document are indispensable for navigation in large web pages and pdf files. Commonmark has no syntax for attaching an identifier to a heading or other element (incredible omission!) but pandoc and some others do. An internal link works through an identifier that is unique within the file; it is mostly used for section headers but also allowed for other elements. Both the link destination and the identifier are written in markdown with the # (hash) character. In generic form, and as an example for the section yaml block:

[link text](#id)
[destination]{#id}

[yaml block](#yaml-block)
## yaml block {#yaml-block}

The first line describes the visible link text between brackets and the destination between parentheses. The second line describes the visible heading title with its identifier attribute between curly braces. And this is the html equivalent:

<a href="#yaml-block">yaml block</a>
<h2 id="yaml-block">yaml block</h2>

Pandoc will automatically create such identifiers for all headings, with whitespace replaced by dashes, upper case replaced by lower case, and most non-alphanumeric characters omitted (underscores, hyphens and periods are kept). When multiple headings of the same name exist, their identifiers will be distinguished by numbers appended to them in their order of appearance. This is useful when sections have identically named subsections.
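The identifier rules can be approximated in a few lines of shell. This is only a sketch: pandoc_id is a hypothetical name, it handles plain ASCII headings only, and it ignores the numbering of duplicate headings. Pandoc itself also strips formatting and footnotes first and copes with non-ASCII letters.

```shell
# Sketch of pandoc's automatic heading identifiers (ASCII headings only).
# pandoc_id is a hypothetical name; pandoc does all this internally.
pandoc_id() {
  id=$(printf '%s' "$1" \
    | tr -cd '[:alnum:]_. -' \
    | tr ' ' '-' \
    | tr '[:upper:]' '[:lower:]' \
    | sed 's/^[^a-z]*//')
  # steps above: drop punctuation except _ . - (spaces are kept for now),
  # turn spaces into dashes, lowercase, remove everything before the
  # first letter; a heading without any letters falls back to "section"
  printf '%s\n' "${id:-section}"
}
```

For example, pandoc_id 'editor: Geany' prints editor-geany, which is the identifier to use in a link like [Geany](#editor-geany).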

reference link

The URL for a reference link is not written directly after the visible / clickable link text but can appear anywhere in the file, which can be convenient to keep paragraph text tidy in the case of long URLs. Instead of:

[link text](URL "title")

link text and URL are bound via a string functioning as a handle. The link definition specifies URL and optional title. Note the syntax with colon instead of parentheses:

[link text][handle]
           [handle]: URL "title"

Example of a reference link definition:

[DSPwiki]: https://en.wikipedia.org/wiki/Digital_signal_processing
      "https://en.wikipedia.org/wiki/Digital_signal_processing"

This created a handle [DSPwiki] in the markdown document which can be referenced as signal processing, digital signal processing or any link text from anywhere within the file, by writing link text and handle like so:

[signal processing][DSPwiki]
[digital signal processing][DSPwiki]

Link text, URL and title will be united in html source and the reference link definition will not be seen separately in rendered html. The handle [DSPwiki] only had a function in markdown and will disappear during conversion, as illustrated here:

<a href="https://en.wikipedia.org/wiki/Digital_signal_processing" title="https://en.wikipedia.org/wiki/Digital_signal_processing">signal processing</a>

Think of the link reference concept as a link to a link definition, where the handle name, “DSPwiki” in the example, must be a unique identifier. Like all identifiers meant to be unique it is easy to accidentally redefine them, with confusing results. It helps to choose a descriptive handle name even when that is a bit longer to type, unless there can be no mistake. As I learned the hard way.

extra sugar

The previous section described an explicit reference link, where link text and link definition were bound together via a handle name. The commonmark spec defines a shortcut form where link text is equal to handle name. Using handle [DSPwiki] as shortcut link text we get this link: DSPwiki.

Pandoc also allows multiple forms for links pointing to headings. The regular form is called explicit and can be used for other element types as well. The other two will only work with the auto-generated heading identifiers. Especially the short version which equals the exact string of the heading title can be convenient in a text. But beware that these forms are very specific pandoc sugar. Generic forms and examples:

[link text](#heading-title)
[link text][heading title]
[heading title]

[using pandoc to create web pages](#pandoc-for-web-pages)
[pandoc][pandoc for web pages]
[pandoc for web pages]

using pandoc to create web pages
pandoc
pandoc for web pages

autogenerated TOC

Further exploiting auto-generated heading identifiers, pandoc can also create a table of contents from them. This does not happen by default but only when pandoc is called with the --toc argument, and only for standalone output files. The argument --toc-depth <level> specifies the maximum heading level to include in the auto-generated table of contents. Deeper levels are omitted from the table of contents but they still get their identifier.
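Written out on the command line instead of in a defaults file, the table of contents options could look like the sketch below. The wrapper name build_page is hypothetical and the file names simply follow the earlier examples on this page.

```shell
# Sketch: the table of contents options spelled out on the command line.
# build_page is a hypothetical wrapper around a single pandoc call.
build_page() {
  src=$1
  out=${src%.md}.html   # pandoctest.md -> pandoctest.html
  # --toc has no effect unless --standalone is given as well
  pandoc "$src" --standalone --toc --toc-depth=2 \
    --template=katjaaspandoc.html --output="$out"
}
```

Calling build_page pandoctest.md should then give the same result as the -dkatjaas call, assuming the defaults file contains exactly the four keys listed earlier.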

For in-page navigation I want to use my hamburger style drop down menu with a button in the top right position in the browser. That menu is a javascript-free css construction based on a class nav.menu. The menu is to be filled mainly with links to heading level 1 and 2 sections from the auto-generated table of contents. But a few links to other pages must also be included in this menu. This was the most complicated puzzle I solved for my markdown-to-html use case. But once solved, it is reusable for any number of web pages.

To start with, all menu items must be values for variables to be expanded in the html template. My html template has two menu variables, $related$ for pages elsewhere on my website and $toc$ for pandoc’s autogenerated table of contents.

<nav class="menu">
  <button>menu</button>
  <div>
    $related$
    <span>on this page:</span>
    $toc$
  </div>
</nav>

The $toc$ variable will be handled by pandoc, if given the --toc argument. Links for related pages on my site are declared in a yaml block and expanded in the $related$ variable in the html template. Note the pipe character which makes pandoc expand subsequent string(s) as markdown:

---
related: |
  [home]
  [html by hand]
---

The css rules to make menu items drop down from the button (another huge puzzle in itself) were already developed earlier. But now, when pandoc generates the table of contents, this becomes a list with bullets and indentation for heading levels. Such styling is useful for a static table of contents but not in the drop-down list, which is supposed to be small and vertically aligned. So I had to write a few extra css rules to undo all list styling for the nav.menu class:

nav.menu li {
  list-style: none;
}

nav.menu ul {
  padding-left: 0;
}

/* both selectors are scoped to the menu;
   a bare "ul" would reset every list on the page */
nav.menu p,
nav.menu ul {
  margin: 0;
}

attributes

Identifier, class and other attributes can be appended to an element, in between curly braces. Since heading identifiers are already defined by default, and can even be aliased as [alias][heading title], we already have good options for in-file navigation. Custom identifiers would mainly be useful to create anchors in long pages without headings. I don’t foresee making such pages but will just note that the syntax for link and destination is:

[text of link to](#id)   [destination]{#id}

Class attributes can be very useful for layout purposes if the included css has those classes defined. Pandoc has some class attributes predefined but I have not found an overview and don’t know if there is a risk of name clash. Hopefully the stylesheets included by the html template have precedence. Anyway I do use this predefined class to exclude a heading from the auto-generated table of contents:

## related pages {.unlisted}

Attributes in the form of key=value pairs will also work in some cases, depending on the output format.

divs and spans

Pandoc has a pretty syntax alternative for the <div> element called “fenced div” which can be used if you have at least one attribute to define. These are equivalent:

<div class="myclass">
This is a division.
</div>

::: {.myclass}
This is a division
:::

::: myclass
This is a division
:::

A span can be written as a string between brackets to give it an attribute, if you would ever need it:

[this is a span with attribute]{style=background-color:Indigo}

this is a span with attribute

images

An image in markdown is defined as a link to an image file preceded by an exclamation mark. Normally an image in html is an inline element, meaning it can sit inside a block together with other elements like text or other images. However pandoc can also put it inside a figure block, depending on how you define the image link.

image within a figure

Pandoc will automatically put an image together with a caption in a figure block on these two conditions:

the image link has a link text between the brackets
the image link is surrounded by newlines above and below

The link text will then appear as figure caption below the image, instead of being used as alt text. Any attribute will apply to the image and not to the figure parent. This is how that would look in markdown and html:

![image text](testimage.png){.dummy}

<figure>
<img src="testimage.png" class="dummy" alt="" /><figcaption>image text</figcaption>
</figure>

The newline condition between figures means they can never appear side by side, unless you would restyle the figure element as display: inline-block (but that would have other side effects). Also there can be only one image in a figure. I can live with that because there are other ways to get images side by side. But I want to have the image centered within the same width as text paragraphs. So my css defines a figure with the same geometry as a paragraph, and descendants img and figcaption are centered therein.

Captions in html are often superfluous or even annoying. Mostly the image and surrounding text speak for themselves because an image can appear at the exact point in the text where you want it (in contrast with a pdf, where page length constrains freedom of layout). However, without image text pandoc will not make a figure and the html will have no alt text. The image will then appear in a regular paragraph and figure-related css will not apply. As an easy compromise I styled figcaption to be less conspicuous in size and color.

If I really wanted to avoid the caption while still centering the image I could use a class attribute which was already defined in the context of hand-written html. When a backslash (a hard line break) is appended to the image link, pandoc will nest it as an inline image in a paragraph with nothing else.

![image text](testimage.png){.block}\

<p><img src="testimage.png" class="block" alt="image text" /><br />
</p>

image text

images side by side

Sometimes there is a reason to display images side by side. I’ve used tables for that purpose extensively in my older pages. But a table with fixed size images dictates the viewport of an entire web page, as I only started to realize when viewing my pages on mobile devices. If images cannot flow, text will not flow either and font size tends to get smaller to the point of being unreadable.

For my page about hand-written html I had defined a container class “div.iblock” for image plus caption with the “display: inline-block” property. Multiple such containers can sit next to each other in a parent (a paragraph most often). I don’t expect to use this method often, but it can be done in pandoc markdown with fenced divs and class attributes:

::: iblock
![test image](testimage.png)\
first image
:::

::: iblock
![test image](testimage.png)\
second image
:::

test image
first image

test image
second image

clickable image

An image can also be written as a reference link. As such it can be nested in another reference link to make a clickable image link. Actually the html equivalent is shorter to write and much more descriptive.

[![click to see section images][imagefile]][gotosection]

[imagefile]: testimage.png "click me"
[gotosection]: #images

<p><a href="#images"><img src="testimage.png" title="click me" alt="click me" /></a></p>

verbatim text

Text which must not be parsed as markdown but presented verbatim can be written inline (a span) or surrounded by blank lines (a block). It could be used for any kind of text but the html tag for verbatim text is <code>. In markdown a code span is written between backticks. A code block can be written as indented text (at least four spaces indentation). Such a block may contain one or more text lines and blank lines. A code block is called “preformatted text” and the html tag is <pre>, combined with the <code> tag. Thus a code span and code block will appear like this in markdown, html source and rendered html:

`this is a code span`

    this is a code block

<code>this is a code span</code>
<pre><code>this is a code block</code></pre>

Alternatively a code block can be written between lines containing at least three backticks or tildes. This is called “fenced code block”. A fenced code block has the option to add an “info string” after the opening code fence. This can be used to specify code language in the hope that syntax coloring will be applied. However I prefer indentation because any code editor is helpful with that and it looks cleaner in markdown.

Code is conventionally written in a monospace font. A slightly smaller font size will better distinguish a code span from regular text. My code blocks are displayed with a distinct background and the customary overflow: auto; property, so that long lines get a horizontal scroll bar if needed.

editors

Markdown text can be edited in any code editor, but some editors have special markdown facilities like syntax coloring, symbols list, live preview or even WYSIWYG. The more facilities you want, the less choice you have, especially if pandoc support is a requirement.

editor: Geany

Geany GTK+ IDE can recognize markdown (but not pandoc’s extensions) via a plugin which must be installed separately. A variable width sidepane can show a symbols list (section headers in the case of markdown) and live preview. Side pane configuration:

Edit > Preferences > Interface

Markdown plugin must be enabled and configured via menu:

Tools > Plugin Manager > Markdown

Live preview uses an html template of your choice. Geany does not understand pandoc’s template variables and I simply created a modified version of Geany’s default template to load my stylesheets:

<html>
  <head>
    <link rel="stylesheet" type="text/css" href="../css/layout.css">
    <link rel="stylesheet" type="text/css" href="../css/blackbase.css">
  </head>
  <body>
    @@markdown@@
  </body>
</html>

Copying the above template code to markdown with live preview made Geany freeze. Also, while editing, live preview jumps back to the top of the page every time (a known bug which may be resolved in newer versions). Much more useful is the option to configure user-defined build commands associated with menu buttons and shortcut keys.

Conversion is then a matter of two clicks and the resulting html can be (re)loaded in any web browser window. Separate tabs can have all related files (css, html source, templates) open for inspection or modification. Especially useful during development phase of a markdown-to-html workflow.

editor: Ghostwriter

Ghostwriter is a KDE Qt app for markdown editing with live preview in a dual pane setup. When pandoc is installed all its supported markdown flavors can be selected for preview.

Ghostwriter does not display “standalone” output in live preview, but just the html body part as reflected in the markdown code. If you want to see it with your own styling you can select a custom css stylesheet for preview. My pages use two .css files, for layout and colors separately. Fortunately it is possible to have one stylesheet include others. So my custom stylesheet has just this text to include the real stylesheets for preview in Ghostwriter:

@import "layout-pandoc.css"; 
@import "blackbase.css";

Because Ghostwriter will only preview the html body and not use my pandoc template, an auto-generated navigation menu cannot be presented here. We cannot check if the menu is complete, what it looks like and if it works as intended. Fortunately Ghostwriter does provide an “outline” popup menu for section navigation, which works for markdown source and html preview together. And it supports image drag&drop to create an image link.

I have not found how to specify pandoc arguments when exporting to standalone html file through Ghostwriter. Yes you can select markdown flavor and output file type, but without pandoc arguments this is not useful at all. Pandoc will use its default template and styling. No one-click-build here! Command line is needed after all, unless you’re happy with default options and bland styling.

editor: Zettlr

Zettlr is a What-You-See-Is-What-You-Mean markdown editor, document organizer and converter with pandoc under the hood. What you see is markdown code as you type it, while categories of non-text content (such as images and LaTeX math) can be selected for inline rendering, meaning their markdown definition will be hidden behind a What-You-Get presentation (with generic styling though). Drag&drop of such items will automatically create the markdown definition for them.

Zettlr has sidebars with section navigator amongst other things. With all inline rendering disabled except images, and a theme with monospace font, it almost looks like a regular code editor but with image preview and headings printed big and bold. Rendering links inline isn’t very useful because not all of them will be resolved. To check if navigation in your file will work as intended the output must be built.

Zettlr does not let you specify a pandoc command directly like Geany does, but instead gives the option to type or paste the equivalent of a pandoc defaults file in a window under File > Preferences > Assets Manager > Exporting.

Only one set of arguments can be defined per output type, so if you want to use different arguments you have to change preferences or else call pandoc on the command line. That is a missed opportunity since pandoc itself can have as many defaults files as you want and Zettlr could let you select one of them rather than type the content. But anyway, with the arguments set in Zettlr, export to html is a matter of two mouseclicks. Exporting will automatically open the file in a new browser tab every time, which is a small nuisance.

What bothers me more is the weight of the app as perceived in excessive load time (files and also the app itself), response to user input, memory consumption and disk occupation, even though CPU load is not bad at all. Maybe this is due to its Electron base. But Zettlr is also an ambitious app, meaning to be a “personal knowledge manager” for an academic audience as explained by its author on hackmd.io.

other editors

Kate is a KDE/Qt generic code editor. Like Geany it does support a sidebar for code navigation, but… not for markdown in this case. And like Geany it can present live preview but not handle pandoc extensions or templates for that purpose.

Panwriter is a specialized pandoc-markdown editor with live preview. I’ve tried the app very briefly just to discover that it does not provide a menu or sidebar for section navigation either.

RStudio is an IDE for R Markdown, with the R computer language on top of pandoc. It is designed for academic writing and data plotting with the aim of reproducible science. A fascinating concept. The RStudio app can convert markdown notebooks to publishable documents, but it can do so much more. It’s for scientists, not for me.

Visual Studio Code advertises markdown support with section navigation and live preview, including custom stylesheets. That all sounds promising, however VSC doesn’t embrace pandoc but uses the markdown-it parser instead.

Many more markdown editors exist and you cannot expect an app to do everything: code navigation, css customization, html source view, inline or live html rendering, one-click-build. I find myself using different apps depending on “workflow phase”.

html source

Although the rendered web page looks as good as my hand-written html, the underlying html does not. According to the online manual, pandoc should wrap html source by default for readability. This does not work for me. And indeed my pandoc man page says “Automatic wrapping does not currently work in HTML output.” There is no indentation either. It looks like this:

Pandoc’s html source formatting is a disappointment for me. Hard to read does not mean indecipherable but I had expected something else from the converter. Although html is not the source format in the current context, I would like it to be optimally readable.

By way of experiment I converted the html standalone back to markdown. That gives an interesting result. More readable than my original, it’s fair to say. It shows pandoc’s preferred syntax with level 1 and 2 headings underlined (equal signs and dashes respectively). As a bonus I got my complete menu printed on top, including the auto-generated table of contents.
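The round trip can be reproduced with a single pandoc call. In this sketch the wrapper name html_to_markdown and the output file name are hypothetical.

```shell
# Sketch: convert the generated standalone html back to pandoc markdown.
# html_to_markdown and the "-roundtrip" suffix are hypothetical names.
html_to_markdown() {
  src=$1
  out=${src%.html}-roundtrip.md   # pandoctest.html -> pandoctest-roundtrip.md
  pandoc "$src" --from=html --to=markdown --output="$out"
}
```

Newer pandoc versions write # style headings by default; as far as I know the underlined style then has to be requested with --markdown-headings=setext.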

markdown flavors

Markdown seems to be the Esperanto of document formats because it is easy to learn and not proprietary. However there is no universal agreement about syntax so we see different flavors. Some flavors define themselves as a superset of commonmark. Since commonmark has no syntax for heading identifiers I opted for a flavor with extensions. GitHub’s rendering of GFM generates heading anchors automatically, but then I found pandoc with many more useful extensions like {#id} attributes and embedded yaml blocks.

I want to avoid redundant use of pandoc sugar, but pandoc can also convert pandoc markdown to other flavors if needed. Pandoc is embraced by the academic world and will probably be maintained for many years to come.
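Such a flavor conversion might look like the sketch below. The wrapper name to_flavor and the output naming are hypothetical, and gfm is only the default picked for this example; pandoc --list-output-formats shows which targets are available.

```shell
# Sketch: convert pandoc markdown to another markdown flavor.
# to_flavor is a hypothetical wrapper; $2 is the target flavor.
to_flavor() {
  src=$1
  flavor=${2:-gfm}   # e.g. gfm, commonmark, markdown_strict
  pandoc "$src" --from=markdown --to="$flavor" \
    --output="${src%.md}.$flavor.md"
}
```

Pandoc-specific constructs without an equivalent in the target flavor, such as fenced divs and attributes, may come out as raw html or get lost, so the result needs a check.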

evaluation

At the moment of writing, this page is the first one I have written with markdown as the source format. I tried to optimally separate content from styling. Of course html and css are already designed to do that. But pandoc’s markdown, yaml code and html templates let you create a source document which is much easier to write or read and need not even specify an output format, let alone styling directives like which css files to include.

While figuring out how markdown translates to html I learned to appreciate html tags as indicators for the role of components in a text structure, rather than pointers to css code for styling. In markdown you write those indicators in an intuitive way, just like intonation in spoken word and punctuation in written text. I found that explicit styling attributes in markdown can be avoided as long as you’re happy with a simple layout. Markdown makes writing scalable on a continuum from brainstorm to web page.

My markdown workflow still leaves an important thing to be desired. This is not related to pandoc but to the editors. I would so much like to see images displayed in the editor window, just below their link definition. Markdown is a perfectly readable source format and I don’t need or want to see full html rendering constantly. However an image title doesn’t make an image visible. In order to feel the flow of a document while writing it I need to see images in their context. They are not decoration but part of the story. Zettlr implements this concept of inline rendering but will autohide the image link behind its display, like other “smart” features leading to confusion and slow response. Hopefully the inline rendering concept will get more attention and discussion soon.