Note: the Tartan project is not under active development. We'll leave the code here on RubyForge, so feel free to do with it as you will. If you are interested in taking over the project, please contact one of the developers.

Welcome to Tartan

Tartan is a general-purpose text parsing engine whose main target is wiki text parsing (see c2.com and Wikipedia). It doesn’t implement one specific mark-up but instead provides a way to specify a variety of mark-ups. So, Tartan is a bit more "involved" than a purpose-built parser like RedCloth or BlueCloth, but provides the following benefits:

  1. separates the specific wiki syntax specification from the implementation
  2. allows layering and extension of parsing rules
  3. allows multiple output formats from the same syntax specification

The current implementation of Tartan is in Ruby and includes a full Markdown parser (described in YAML). The format of the parsing specification has been created with an eye to having a language-independent definition of wiki (and possibly other) mark-ups. That’s a lofty goal, and Tartan hasn’t quite gotten there yet, but we think there’s a clear path. In any case, even if it is only available in Ruby, it will hopefully be helpful for projects needing to do something more than just convert wiki text directly into HTML.

Usage

So, really all you want to do is generate HTML from Markdown text. Here’s how you do it:

  # require 'rubygems' # if you are pulling Tartan in as a gem
  require 'tartan_markdown'

  html = TartanMarkdown.new("* howdy\n* doody").to_html
  # => "<ul>\n<li>howdy</li>\n<li>doody</li>\n</ul>"

Other parsers would have similar names and the same usage: require the parser class file, create a new instance of the parser, and call to_html on that instance.

You can also have other output methods, say to_xml, which would be called in the same way on the instance of the parser object.

Layering Parsers

You can add parsing syntax to existing parsers. This is done by building up a set of parser specifications that work together.

In the Tartan distribution you have a specification for Markdown and you also have a specification for table mark-up. You can combine them by creating a new class that layers the tables onto the Markdown definition as follows in a file called tartan_markdown_tables.rb:

  require 'tartan_markdown_def'
  require 'tartan_table_def'

  class TartanMarkdownTables < Tartan
    include TartanMarkdownDef
    include TartanTableDef
  end

In another file you could use this new parser:

  require 'tartan_markdown_tables'

  html = TartanMarkdownTables.new("[|*happy*||**days**|]").to_html
  # => "<table class=\"\">
  #     <tr><td><em>happy</em></td><td><strong>days</strong></td></tr>
  #     </table>"

The Parsing Specification

Each specific parser (Markdown to HTML, Textile to HTML, your wiki to XML, etc.) needs a parsing specification to tell Tartan how to convert the text into HTML (or whatever other format you need).

Overall Structure

Each parser is made up of a parsing definition and optional helper methods. The specification is defined in YAML and the helper methods are defined in a parser definition class.

The parsing definition in YAML has the following general structure:

  block:
    - <parsing rule>
    - <parsing rule>

  <parsing context>:
    - <parsing rules>

So the parsing rules are defined as a set of contexts, and each context is a list of parsing rules. The base context defaults to block; that is, the parser starts with the block context, which may point the parser off to other contexts to parse blocks of the parsed text. More on this after the explanation of the parsing rules.
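
To make that shape concrete, here is a minimal sketch that loads a small specification with Ruby's standard YAML library and lists each context with its rule titles. It only inspects the data and does not use Tartan itself; the inline SPEC string and the CONTEXT_TITLES name are illustrative stand-ins, not part of the distribution.

```ruby
require 'yaml'

# Illustrative stand-in for a specification file (not shipped with Tartan).
SPEC = <<~'YAML_DOC'
  block:
    - title: code
      match: /(^[ ]{2,}\S.+?$\n)+^[ ]{2,}\S.+?$/m
    - title: paragraph
      match: /(^[^\n]+$\n)*^[^\n]+$/m
  paragraph:
    - title: emphasis
      match: /\*(.*?)\*/
YAML_DOC

spec = YAML.safe_load(SPEC)

# Each top-level key is a context; each value is an ordered list of rules.
CONTEXT_TITLES = spec.transform_values { |rules| rules.map { |rule| rule['title'] } }

CONTEXT_TITLES.each do |context, titles|
  puts "#{context}: #{titles.join(', ')}"
end
```

Note that YAML loads each match value as a plain string; turning it into a regular expression is left to the parser.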

Parsing Rules

The following is a simple parsing rule to match paragraphs and mark them up in HTML:

  title: paragraph
  match: /(^[^\n]+$\n)*^[^\n]+$/m
  html:
    start_mark: <p>
    end_mark:   </p>

A paragraph, in this case, is any grouping of non blank lines.

The parser will repeatedly apply the match regular expression, and wherever it matches, the html output sub-rule will put the start_mark, <p>, and the end_mark, </p>, around the text that is matched as a paragraph.
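
The effect of this rule can be tried out in plain Ruby. The sketch below applies the same regular expression with gsub and wraps each match in the two marks; mark_paragraphs is a hypothetical helper written for this illustration, not something Tartan provides.

```ruby
# Plain-Ruby illustration of the paragraph rule; this is not Tartan's engine.
PARAGRAPH = /(^[^\n]+$\n)*^[^\n]+$/m

# Wrap every group of non-blank lines in the rule's start_mark and end_mark.
def mark_paragraphs(text, start_mark: '<p>', end_mark: '</p>')
  text.gsub(PARAGRAPH) { |match| "#{start_mark}#{match}#{end_mark}" }
end

puts mark_paragraphs("first line\nsecond line\n\nanother paragraph")
# <p>first line
# second line</p>
#
# <p>another paragraph</p>
```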

If we wanted to also mark off blocks of code that are indented by say 2 or more spaces at the beginning of the line, we could use the following rule:

  title: code
  match: /(^[ ]{2,}\S.+?$\n)+^[ ]{2,}\S.+?$/m
  html:
    start_mark: <pre><code>
    end_mark: </code></pre>

When we want to add the code rule, the ordering becomes important. If we put the paragraph rule first, it will gobble up both the paragraphs and the code blocks since it’s just looking for groups of non blank lines. To prevent this we need to put the code rule first. So the overall definition would be:

  block:
    - title: code
      match: /(^[ ]{2,}\S.+?$\n)+^[ ]{2,}\S.+?$/m
      html:
        start_mark: <pre><code>
        end_mark: </code></pre>
    - title: paragraph
      match: /(^[^\n]+$\n)*^[^\n]+$/m
      html:
        start_mark: <p>
        end_mark:   </p>
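
The ordering effect can be reproduced with a small stand-alone sketch. It assumes a simplified model in which rules run in order and each match "claims" its text so that later rules only see what is left; Rule, apply_in_order and claim are hypothetical names for this illustration and say nothing about how Tartan is implemented.

```ruby
# Hypothetical mini-engine: rules run in order and each match "claims" its
# text, so later rules only see what is left. An assumption for illustration.
Rule = Struct.new(:title, :pattern, :start_mark, :end_mark)

CODE      = Rule.new('code', /(^[ ]{2,}\S.+?$\n)+^[ ]{2,}\S.+?$/m, '<pre><code>', '</code></pre>')
PARAGRAPH = Rule.new('paragraph', /(^[^\n]+$\n)*^[^\n]+$/m, '<p>', '</p>')

def apply_in_order(text, rules)
  segments = [[text, false]] # pairs of [string, claimed?]
  rules.each do |rule|
    segments = segments.flat_map do |string, claimed|
      claimed ? [[string, true]] : claim(string, rule)
    end
  end
  segments.map(&:first).join
end

# Split one unclaimed string into claimed (marked-up) and unclaimed pieces.
def claim(string, rule)
  pieces = []
  rest = string
  while (m = rule.pattern.match(rest)) && !m[0].empty?
    pieces << [m.pre_match, false] unless m.pre_match.empty?
    pieces << ["#{rule.start_mark}#{m[0]}#{rule.end_mark}", true]
    rest = m.post_match
  end
  pieces << [rest, false] unless rest.empty?
  pieces
end

TEXT = "Some text.\n\n  puts 1\n  puts 2\n\nMore text."
puts apply_in_order(TEXT, [CODE, PARAGRAPH]) # code block is recognised
puts apply_in_order(TEXT, [PARAGRAPH, CODE]) # paragraph rule claims everything
```

With the code rule first, the indented lines end up inside <pre><code>; with the paragraph rule first, they are wrapped in <p> and the code rule never sees them.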

Now, let’s say we want to be able to mark up text with emphasis (HTML <em>) and strong emphasis (HTML <strong>) in paragraph text, but not in code blocks. We’ll use an asterisk (*) around text we want to have emphasis and a double asterisk (**) around text we want to have strong emphasis.

To do this, we set up a new parsing context for paragraph body text and "point" the parser to the context when it recognizes a paragraph.

First, we create the paragraph parsing context:

  paragraph:
    - title: strong
      match: /\*\*(.*?)\*\*/
      html:
        replace: <strong>\1</strong>

    - rescan

    - title: emphasis
      match: /\*(.*?)\*/
      html:
        replace: <em>\1</em>

The rescan directive between the strong and emphasis rules tells the parser to "start over". This is needed because otherwise the strong rule would "claim" all the text it matched and the emphasis rule wouldn’t have a chance to parse any of it. This would come into play if we had a paragraph such as:

  Now listen to this **I want *you* to really hear me**.

This should get marked up as:

  <p>Now listen to this <strong>I want <em>you</em> to really hear me</strong>.</p>

but we would get the following without the rescan:

  <p>Now listen to this <strong>I want *you* to really hear me</strong>.</p>

You might also note that the ordering here, again, is important. If we put the emphasis rule before the strong rule, we would get the following output instead:

  <p>Now listen to this <em></em>I want <em>you</em> to really hear me<em></em>.</p>
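
The three outputs above (without the surrounding <p> tags) can be reproduced with a small stand-alone sketch. It assumes a simplified model: each rule claims the text it replaces, and a rescan step turns everything produced so far back into unclaimed text. The names run_rules and claim_matches are hypothetical, and the model is a guess at the described behaviour rather than Tartan's actual implementation.

```ruby
# Hypothetical mini-engine illustrating "claimed" text and rescan.
# It mimics the behaviour described above; it is not Tartan's implementation.
STRONG   = [/\*\*(.*?)\*\*/, '<strong>\1</strong>'].freeze
EMPHASIS = [/\*(.*?)\*/, '<em>\1</em>'].freeze

def run_rules(text, steps)
  segments = [[text, false]] # pairs of [string, claimed?]
  steps.each do |step|
    if step == :rescan
      # Start over: everything produced so far becomes unclaimed text again.
      segments = [[segments.map(&:first).join, false]]
      next
    end
    pattern, replacement = step
    segments = segments.flat_map do |string, claimed|
      claimed ? [[string, true]] : claim_matches(string, pattern, replacement)
    end
  end
  segments.map(&:first).join
end

# Replace each match in an unclaimed string and mark the result as claimed.
def claim_matches(string, pattern, replacement)
  pieces = []
  rest = string
  while (m = pattern.match(rest)) && !m[0].empty?
    pieces << [m.pre_match, false] unless m.pre_match.empty?
    pieces << [m[0].sub(pattern, replacement), true]
    rest = m.post_match
  end
  pieces << [rest, false] unless rest.empty?
  pieces
end

SAMPLE = 'Now listen to this **I want *you* to really hear me**.'
puts run_rules(SAMPLE, [STRONG, :rescan, EMPHASIS]) # nested <em> inside <strong>
puts run_rules(SAMPLE, [STRONG, EMPHASIS])          # *you* is left untouched
puts run_rules(SAMPLE, [EMPHASIS, :rescan, STRONG]) # ** is read as empty emphasis
```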

Now, we also need to modify the paragraph rule in the block context to use the new paragraph context:

  # . . .
    - title: paragraph
      match: /(^[^\n]+$\n)*^[^\n]+$/m
      subparse: paragraph
      html:
        start_mark: <p>
        end_mark:   </p>
  # . . .

Here the subparse directive tells the parser that the contents of the paragraph should be parsed by the paragraph context.
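
As a rough plain-Ruby analogy for subparse, the sketch below hands the body of each matched paragraph to a second function that plays the role of the paragraph context. parse_block and parse_paragraph are hypothetical names for this illustration only, and only the emphasis rule is included to keep it short.

```ruby
# Plain-Ruby analogy for subparse; not Tartan's implementation.
PARAGRAPH = /(^[^\n]+$\n)*^[^\n]+$/m
EMPHASIS  = /\*(.*?)\*/

# Stands in for the "paragraph" context: rules applied inside a paragraph.
def parse_paragraph(body)
  body.gsub(EMPHASIS, '<em>\1</em>')
end

# Stands in for the "block" context: finds paragraphs, then subparses them.
def parse_block(text)
  text.gsub(PARAGRAPH) { |body| "<p>#{parse_paragraph(body)}</p>" }
end

puts parse_block("plain *loud* text\n\nsecond")
# <p>plain <em>loud</em> text</p>
#
# <p>second</p>
```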

Creating a Mix-in

It’s possible to mix-in or layer a parsing specification with a base parser. This allows you to add additional markup or change the markup of an existing syntax. You could use this to add table mark-up to Markdown (in fact, this mix-in to Markdown is available as part of the Tartan code distribution).

To show how this works, we’ll look at how to specify and then add character element markup to the parser example we’ve been working with. We want to turn things like "<", "&" and "->" into "&lt;", "&amp;" and "&rarr;".

We want these transformations to be done in the context of parsing paragraphs, so we’ll only want to add to the paragraph context in our previous example.

So, to add this syntax parsing, you would create the following specification:

  paragraph:
    - rescan
    - title: amp
      match: /&/
      html:
        replace: '&amp;'
      rescan: true
    - title: rightArrow
      match: /->/
      html:
        replace: '&rarr;'
      rescan: true
    - title: lessThan
      match: /</
      html:
        replace: '&lt;'
      rescan: true
    - title: greaterThan
      match: />/
      html:
        replace: '&gt;'
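
One way to see why the rules are listed in this order is to apply the same replacements one after another in plain Ruby. This is an analogy only: ENTITY_RULES and escape_entities are hypothetical names, and Tartan's own handling of rescan may differ from simple sequential substitution.

```ruby
# Plain-Ruby analogy for the entity rules, applied one after another in the
# same order as the specification. This is not Tartan's engine.
ENTITY_RULES = [
  [/&/,  '&amp;'],
  [/->/, '&rarr;'],
  [/</,  '&lt;'],
  [/>/,  '&gt;']
].freeze

def escape_entities(text, rules = ENTITY_RULES)
  rules.reduce(text) { |result, (pattern, replacement)| result.gsub(pattern, replacement) }
end

puts escape_entities('a & b -> c < d')
# a &amp; b &rarr; c &lt; d

# Reversing the order escapes ">" before "->" is seen, and then escapes the
# ampersands that the earlier replacements introduced.
puts escape_entities('a & b -> c < d', ENTITY_RULES.reverse)
# a &amp; b -&amp;gt; c &amp;lt; d
```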

That’s it for the mix-in specification. Now we add these to the previous set. We didn’t touch on file naming of specifications before, but now we need to. Let’s say that we put the previous specification in a file called example-parser.yml and we put the new spec in entities.yml. To combine them, we would create a new Ruby class like this:

  class ExampleParserWithEntities < Tartan
    yaml "example-parser.yml"
    yaml "entities.yml"
  end

By default, the rules of a mix-in are added to the end of any given context. So, the effective resulting specification once the two sets of rules are combined would be:

  block:
    - title: code
      match: /(^[ ]{2,}\S.+?$\n)+^[ ]{2,}\S.+?$/m
      html:
        start_mark: <pre><code>
        end_mark: </code></pre>
    - title: paragraph
      match: /(^[^\n]+$\n)*^[^\n]+$/m
      subparse: paragraph
      html:
        start_mark: <p>
        end_mark:   </p>
  paragraph:
    - title: strong
      match: /\*\*(.*?)\*\*/
      html:
        replace: <strong>\1</strong>
    - rescan
    - title: emphasis
      match: /\*(.*?)\*/
      html:
        replace: <em>\1</em>
    - rescan
    - title: amp
      match: /&/
      html:
        replace: '&amp;'
      rescan: true
    - title: rightArrow
      match: /->/
      html:
        replace: '&rarr;'
      rescan: true
    - title: lessThan
      match: /</
      html:
        replace: '&lt;'
      rescan: true
    - title: greaterThan
      match: />/
      html:
        replace: '&gt;'
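
The default append behaviour can be sketched with plain Ruby hashes standing in for the two loaded specifications. merge_specs is a hypothetical helper written for this illustration, not part of Tartan's API, and the rules are abbreviated to their titles.

```ruby
# Sketch of the default layering behaviour: mix-in rules are appended to the
# end of each context. merge_specs is a hypothetical helper, not Tartan API.
def merge_specs(base, mixin)
  base.merge(mixin) { |_context, base_rules, mixin_rules| base_rules + mixin_rules }
end

# Abbreviated stand-ins for example-parser.yml and entities.yml.
BASE = {
  'block'     => [{ 'title' => 'code' }, { 'title' => 'paragraph', 'subparse' => 'paragraph' }],
  'paragraph' => [{ 'title' => 'emphasis' }]
}.freeze

MIXIN = {
  'paragraph' => ['rescan', { 'title' => 'amp' }, { 'title' => 'rightArrow' }]
}.freeze

MERGED = merge_specs(BASE, MIXIN)

MERGED.each do |context, rules|
  titles = rules.map { |rule| rule.is_a?(Hash) ? rule['title'] : rule }
  puts "#{context}: #{titles.join(', ')}"
end
# block: code, paragraph
# paragraph: emphasis, rescan, amp, rightArrow
```

Contexts that appear only in the base specification are carried over unchanged, and a context that appears only in the mix-in would simply be added.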

Going Further

Honestly, this brief tutorial just provides you with the basics of Tartan. If you want to know more, for now, the best thing is to look at the Markdown and table extension specifications in the code. They will show you a real-world example of how to create a base parser and a mix-in.

There will be additional documentation to follow, in particular a reference guide that covers all the parser rule directives one at a time.

If you need some help in getting Tartan to work for your project, please don’t hesitate to post to the Tartan help forum or write me directly at bitherder@rubyforge.org.

The Name

Tartan is intended to weave together different parsing elements. It’s intended to be an alternative to both RedCloth and BlueCloth. Tartan is a kind of cloth that weaves different colors together in an interesting pattern.