Split

Split.hs

Copyright © 2007 Dave Bayer. Subject to a BSD-style license.

This module is part of the Annote project.

module Split (Split(..),Delims,split,unsplit,joinDoc,joinCode,joinDebug) where

Split divides an input file into code, documentation, and external documentation.


Regex provides regular expression matching. It is a wrapper around Text.Regex.

import Regex (mkRegex,isMatch)

Split

Split is the type of a line of text, marked as Code, Delim, Doc, Blank, Ext, or Shell.

A Code, Delim, Doc, Blank, or Ext line is expected to be a single line of text, not terminated with a newline. A Shell line can evaluate to many lines of text, and is expected to be terminated with a newline.

data Split
    = Code  String
    | Delim String
    | Doc   String
    | Blank String
    | Ext   String
    | Shell (IO String)

Delims

Delims is the type of a tuple holding the start and end delimiters for Doc and Ext blocks. These are the strings corresponding to the option keys (DocStart, DocEnd, ExtStart, ExtEnd).

type Delims = (String,String,String,String)

split

split is a finite state machine which transforms input text into a list of Split lines.

split :: Bool → Delims → String → [Split]
split isCode (start,end,xstart,xend) text = startFilt $ lines text
    where

Note that the indented functions below are contained within the where clause of split, so its arguments are in scope.

startFilt is the initial line-oriented filter to apply to text. If isCode is true, then we are processing annotated source code, and text starts off as code. Otherwise, we are processing markup for a supporting web page, and text starts off as documentation. One can still include code fragments in such a page by surrounding them with inside-out delimiters: the end delimiter first, then the start delimiter.

    startFilt = if isCode then inCode else inDoc

startDoc, endDoc, startExt, endExt are predicates matching the corresponding delimiter lines. startExt never matches when xstart is empty, which disables Ext blocks.

    startDoc, endDoc, startExt, endExt :: String → Bool

    startDoc x = isMatch x $ mkRegex start
    endDoc   x = isMatch x $ mkRegex end
    endExt   x = isMatch x $ mkRegex xend
    startExt x = if null xstart
        then False
        else isMatch x $ mkRegex xstart

isBlank is a predicate matching blank lines.

    isBlank :: String → Bool
    isBlank  x = all (`elem` " \t") x

inCode, inDoc, inExt are line-oriented filters that call one another; they can be thought of as the states of the finite state machine.

    inCode, inDoc, inExt :: [String] → [Split]

    inCode [] = []
    inCode (x:xt)
        | startDoc x = Delim x : inDoc  xt
        | startExt x = Ext   x : inExt  xt
        | isBlank  x = Blank x : inCode xt
        | otherwise  = Code  x : inCode xt

    inDoc [] = []
    inDoc (x:xt)
        | endDoc x  = Delim x : inCode xt
        | isBlank x = Blank x : inDoc  xt
        | otherwise = Doc   x : inDoc  xt

    inExt [] = []
    inExt (x:xt)
        | endExt x  = Ext x : inCode xt
        | otherwise = Ext x : inExt  xt
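To make the state machine concrete, here is a minimal self-contained sketch. It is not the module's code: it assumes literal prefix matching with isPrefixOf in place of the Regex wrapper, omits the Shell constructor so that results can derive Eq and Show, and uses the hypothetical names splitSketch, demoDelims, and demo along with made-up delimiters.

```haskell
import Data.List (isPrefixOf)

-- Simplified line type: Shell is omitted so values can be compared and shown.
data Split
    = Code  String
    | Delim String
    | Doc   String
    | Blank String
    | Ext   String
    deriving (Eq, Show)

type Delims = (String, String, String, String)

-- Assumption: delimiters are matched as literal line prefixes,
-- standing in for the regular expressions used by the real module.
splitSketch :: Bool -> Delims -> String -> [Split]
splitSketch isCode (start, end, xstart, xend) text = startFilt (lines text)
  where
    startFilt = if isCode then inCode else inDoc

    startDoc x = start `isPrefixOf` x
    endDoc   x = end   `isPrefixOf` x
    endExt   x = xend  `isPrefixOf` x
    startExt x = not (null xstart) && xstart `isPrefixOf` x
    isBlank  x = all (`elem` " \t") x

    inCode [] = []
    inCode (x:xt)
        | startDoc x = Delim x : inDoc  xt
        | startExt x = Ext   x : inExt  xt
        | isBlank  x = Blank x : inCode xt
        | otherwise  = Code  x : inCode xt

    inDoc [] = []
    inDoc (x:xt)
        | endDoc x  = Delim x : inCode xt
        | isBlank x = Blank x : inDoc  xt
        | otherwise = Doc   x : inDoc  xt

    inExt [] = []
    inExt (x:xt)
        | endExt x  = Ext x : inCode xt
        | otherwise = Ext x : inExt  xt

-- Hypothetical delimiters chosen only for this example.
demoDelims :: Delims
demoDelims = ("/*", "*/", "<<<", ">>>")

-- Evaluates to:
--   [Code "x = 1", Delim "/*", Doc "note", Blank "", Delim "*/", Code "y = 2"]
demo :: [Split]
demo = splitSketch True demoDelims "x = 1\n/*\nnote\n\n*/\ny = 2\n"
```

Loading the block in GHCi and evaluating demo shows each input line tagged by the state that consumed it. Note that the empty line inside the documentation block is tagged Blank, not Doc, because isBlank is tested before the fallback case.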

unsplit

unsplit is the inverse to split; it transforms a list of Split lines into output text of type IO String. The IO monad is necessary because of Shell lines that involve external computations.

unsplit :: [Split] → IO String
unsplit xs =

f turns a Split into an IO ShowS; the IO monad is needed because of Shell lines that involve external computations.

Recall the type

type ShowS = String -> String

found in Prelude to facilitate constant-time concatenation using function composition. We use ShowS because a naive implementation of unsplit causes Annote to spend the majority of its time concatenating.
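The ShowS idiom can be seen in isolation in the following sketch. The names line and concatLines are hypothetical and do not appear in Annote; the block only illustrates building each piece as a function that prepends its text, composing the functions, and applying the result to the empty string once.

```haskell
-- Turn one line into a function that prepends it, newline included.
line :: String -> ShowS
line s = showString s . showChar '\n'

-- Compose all the pieces, then apply once to the empty string.
-- concatLines ["a", "b"] == "a\nb\n"
concatLines :: [String] -> String
concatLines xs = foldr (.) id (map line xs) ""
```

Composing two ShowS values is a single function composition regardless of how much text each side will produce, so the cost of copying characters is paid only when the final function is applied.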

    let f :: Split → IO ShowS
        f x = let ln s = s ++ "\n"
                  io s  = return (ln s ++)
              in case x of
                  Code s  → io s
                  Delim s → io s
                  Doc s   → io s
                  Blank s → io s
                  Ext s   → io s
                  Shell y → do { s ← y; io s }

g is the IO ShowS analog to string concatenation:

        g :: IO ShowS → IO ShowS → IO ShowS
        g x y = do { s ← x; t ← y; return (s . t) }

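We use f to convert each Split into an IO ShowS, then use g to fold these into a single IO ShowS. The fold starts from the identity function, so an empty list yields the empty string rather than an error. We evaluate s on the empty string to return an IO String.

    in do s ← foldr g (return id) $ map f xs
          return (s [])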
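The same pattern can be tried without the Split type. In this sketch each piece is a plain IO String standing in for a line; joinPieces and lineS are hypothetical names, and the fold is seeded with return id so that an empty list is well defined.

```haskell
-- g as in the text: run two actions in order and compose their results.
g :: IO ShowS -> IO ShowS -> IO ShowS
g x y = do { s <- x; t <- y; return (s . t) }

-- Assumption: each piece is an IO String standing in for a Split line;
-- a real Shell line would run an external computation here.
joinPieces :: [IO String] -> IO String
joinPieces xs = do
    s <- foldr g (return id) (map (fmap lineS) xs)
    return (s "")
  where
    -- Prepend the line and its newline.
    lineS str = showString str . showChar '\n'
```

For example, joinPieces [return "a", return "b"] yields "a\nb\n". The actions run left to right, because g sequences its first argument before its second.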

joinDoc

joinDoc recombines a split list as documentation output, combining blank lines, leaving out Delim and Ext lines, and delimiting code using startCode and endCode. The IO monad is necessary because of Shell lines that involve external computations.

joinDoc :: [Split] → IO String
joinDoc text = (unsplit . preCode) text
    where

Note that the indented functions below are contained within the where clause of joinDoc, so its arguments are in scope.

startCode, endCode, and blank are strings used to delimit code, or replace blank lines.

The Split constructors will be stripped by unsplit, so the ambiguity as to whether blank should be constructed using Doc or Code turns out not to matter.

    startCode, endCode, blank :: Split
    startCode = Doc "\n<pre class=\"code\">"
    endCode   = Doc "</pre>\n"
    blank     = Doc ""

preCode, preDoc, inCode, inDoc, skipCode, skipDoc can again be thought of as the states of a finite state machine. We skip when reading blanks, writing blank lines only as needed.

Note the invariant that Doc and Shell constructors get identical treatment in each function.

    preCode, preDoc, inCode, inDoc, skipCode, skipDoc :: [Split] → [Split]

preCode, preDoc: We are potentially reading code or documentation, but we have not yet read a non-blank line.

    preCode [] = [blank]
    preCode (x:xt) = case x of
        Code  _ → startCode : x : inCode xt
        Delim _ → preDoc xt
        Doc   _ → x : inDoc xt
        Shell _ → x : inDoc xt
        _       → preCode xt

    preDoc [] = [blank]
    preDoc (x:xt) = case x of
        Code  _ → startCode : x : inCode xt
        Delim _ → preCode xt
        Doc   _ → x : inDoc xt
        Shell _ → x : inDoc xt
        _       → preDoc xt

inCode, inDoc: We are reading code or documentation. The most recently read lines were non-blank.

    inCode [] = [endCode]
    inCode (x:xt) = case x of
        Code  _ → x : inCode xt
        Delim _ → endCode : preDoc xt
        Doc   _ → endCode : x : inDoc xt
        Shell _ → endCode : x : inDoc xt
        Blank _ → skipCode xt
        Ext   _ → inCode xt

    inDoc [] = [blank]
    inDoc (x:xt) = case x of
        Code  _ → startCode : x : inCode xt
        Delim _ → preCode xt
        Doc   _ → x : inDoc xt
        Shell _ → x : inDoc xt
        Blank _ → skipDoc xt
        Ext   _ → inDoc xt

skipCode, skipDoc: We are reading code or documentation. We have read a non-blank line; the most recently read lines were blank.

    skipCode [] = [endCode]
    skipCode (x:xt) = case x of
        Code  _ → blank : x : inCode xt
        Delim _ → endCode : preDoc xt
        Doc   _ → endCode : x : inDoc xt
        Shell _ → endCode : x : inDoc xt
        _       → skipCode xt

    skipDoc [] = [blank]
    skipDoc (x:xt) = case x of
        Code  _ → startCode : x : inCode xt
        Delim _ → preCode xt
        Doc   _ → blank : x : inDoc xt
        Shell _ → blank : x : inDoc xt
        _       → skipDoc xt
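The blank-skipping behaviour is easier to follow on a reduced model. The sketch below is an illustration only, not the module's code: it uses a hypothetical Line type without Delim, Ext, or Shell, folds preCode and preDoc into one state, renders directly to strings, and leaves out the newlines embedded in startCode and endCode as well as the trailing blank written at end of input.

```haskell
-- Reduced line type for illustration: code, documentation, or blank.
data Line = C String | D String | B
    deriving (Eq, Show)

-- Wrap runs of code in <pre> tags and collapse each run of blank
-- lines into at most one empty line.
render :: [Line] -> [String]
render = pre
  where
    open  = "<pre class=\"code\">"
    close = "</pre>"

    -- No non-blank line read yet: leading blanks are dropped.
    pre []         = []
    pre (C s : xt) = open : s : inCode xt
    pre (D s : xt) = s : inDoc xt
    pre (B   : xt) = pre xt

    -- In a code block; the last line read was non-blank.
    inCode []         = [close]
    inCode (C s : xt) = s : inCode xt
    inCode (D s : xt) = close : s : inDoc xt
    inCode (B   : xt) = skipCode xt

    -- In a code block; a blank is written only if more code follows.
    skipCode []         = [close]
    skipCode (C s : xt) = "" : s : inCode xt
    skipCode (D s : xt) = close : s : inDoc xt
    skipCode (B   : xt) = skipCode xt

    -- In documentation; the last line read was non-blank.
    inDoc []         = []
    inDoc (C s : xt) = open : s : inCode xt
    inDoc (D s : xt) = s : inDoc xt
    inDoc (B   : xt) = skipDoc xt

    -- In documentation; a blank is written only if more documentation follows.
    skipDoc []         = []
    skipDoc (C s : xt) = open : s : inCode xt
    skipDoc (D s : xt) = "" : s : inDoc xt
    skipDoc (B   : xt) = skipDoc xt
```

For the input [B, D "Intro", B, B, D "More", C "x = 1", B, C "y = 2", B, D "End"], render returns "Intro", an empty line, "More", the opening tag, "x = 1", an empty line, "y = 2", the closing tag, and "End". The leading blank, the doubled blank, and the blank before "End" are all absorbed by the skip states.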

joinCode

joinCode recombines a split list as code output, leaving out documentation. We avoid unsplit in order to directly return a String.

joinCode :: [Split] → String
joinCode text = (unlines . inCode) text
    where

    inCode, inDoc :: [Split] → [String]

    inCode [] = []
    inCode (x:xt) = case x of
        Code  s → s : inCode xt
        Blank s → s : inCode xt
        _       → inDoc xt

    inDoc [] = []
    inDoc (x:xt) = case x of
        Code  s → s : inCode xt
        Delim _ → inCode xt
        _       → inDoc xt
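A small usage sketch may help here. It repeats the type and the function in ASCII syntax so that the block loads on its own; the sample input and its delimiters are hypothetical.

```haskell
data Split
    = Code  String
    | Delim String
    | Doc   String
    | Blank String
    | Ext   String
    | Shell (IO String)

-- Same logic as joinCode in the text, repeated for self-containment.
joinCode :: [Split] -> String
joinCode = unlines . inCode
  where
    inCode [] = []
    inCode (x:xt) = case x of
        Code  s -> s : inCode xt
        Blank s -> s : inCode xt
        _       -> inDoc xt

    inDoc [] = []
    inDoc (x:xt) = case x of
        Code  s -> s : inCode xt
        Delim _ -> inCode xt
        _       -> inDoc xt

-- Hypothetical input: code, a documentation block, then more code.
sample :: [Split]
sample =
    [ Code "x = 1", Blank ""
    , Delim "/*", Doc "note", Blank "", Delim "*/"
    , Code "y = 2" ]

-- joinCode sample == "x = 1\n\ny = 2\n"
```

The blank line between the two code lines survives because it was read in the inCode state, while the blank line inside the documentation block is dropped along with the documentation.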

joinDebug

joinDebug recombines a split list, tagged with Split constructor names for debugging purposes. We avoid unsplit in order to directly return a String.

joinDebug :: [Split] → String
joinDebug text = (unlines . tag) text
    where

    tag :: [Split] → [String]
    tag [] = []
    tag (x:xt) = t : tag xt where
        t = case x of
            Code  s → "C " ++ s
            Delim s → "= " ++ s
            Doc   s → "D " ++ s
            Blank s → "B " ++ s
            Ext   s → "E " ++ s
            Shell _ → "S"
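The tagging can be sketched equivalently with map in place of explicit recursion. This is a variant for illustration, not the module's code; tagLine and debugSketch are hypothetical names, and the type is repeated so the block loads on its own.

```haskell
data Split
    = Code  String
    | Delim String
    | Doc   String
    | Blank String
    | Ext   String
    | Shell (IO String)

-- Prefix one line with a tag naming its constructor.
-- A Shell action is not run; only its tag is shown.
tagLine :: Split -> String
tagLine x = case x of
    Code  s -> "C " ++ s
    Delim s -> "= " ++ s
    Doc   s -> "D " ++ s
    Blank s -> "B " ++ s
    Ext   s -> "E " ++ s
    Shell _ -> "S"

-- Tag every line and join the results.
debugSketch :: [Split] -> String
debugSketch = unlines . map tagLine
```

For example, debugSketch [Code "x = 1", Delim "/*", Doc "note", Blank "", Shell (return "date")] yields "C x = 1\n= /*\nD note\nB \nS\n". A Blank line keeps the space after its tag, so whitespace-only lines remain visible in the debug output.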