lili.lili

<!--- @:~ --->
# lili: the little literate programming tool

`lili` is a simple command line program written in portable dependency-free C
that aims to provide the absolute bare minimum of functionality needed to
enable writing programs in a literate programming style.  All `lili` does is
scan a plain-text file for a handful of control sequences, extract source code
delimited by those control sequences, and output source code files by
recursively expanding the chunks of code thus found. 

This minimalist approach ensures users of `lili` can depend on it to continue
working reliably and easily modify the source code to tailor the program to
their specific needs, such as adding support for a particular typesetting
markup or programming language.

# what is literate programming?

>Literate programming is a very simple idea: Write a book. And oh, by the
>way, the book has the actual executable source code in it.

>Timothy Daly, 2013. "[Literate Programming in the Large](https://youtu.be/Av0PQDVTP4A)"

Literate programming (first described by [Donald Knuth in his paper introducing
the idea](http://www.literateprogramming.com/knuthweb.pdf)) is a style of
computer programming that prioritizes the relationship between the program
writer and program reader.  The essence of this approach is the position that
computer programs should be written primarily as a form of communication
between humans and only incidentally as a means of instructing computing
machines to perform useful jobs.

Few would disagree that excellent computer code, in addition to whatever other
positive qualities it may possess, should be easy for any programmer to read
and understand.  Conventionally, this ideal is approached by carefully choosing
variable and class names, breaking up complex subroutines into well-named
functions, writing unit tests that demonstrate how a given subroutine is used,
and so on;  basically, the code itself is written to be self-describing.  After
all, the source code has the final word on what the program does and how it
does it.  Comments, documentation, and any other things besides are all
irreconcilably seperate from the actual logic of the program that the code
describes, and they even run the risk of misleading the reader of a program. As
Robert Martin wrote, "[the proper use of comments is to compensate for our
failure to express ourself in
code.](https://www.goodreads.com/book/show/3735293-clean-code)"

On the other hand, there is much that the code itself may struggle to
communicate or which may even be impossible for the code to communicate.
Programming languages are constrained by the systems they are meant to control,
by the systems that translate from source code to machine code, and so on.
These constraints in turn influence the writers of the source code, the ideas
that they can express through the language, and the ways in which they can
present those ideas.  Source code excels at expressing what the program does
and how it does it. Code often struggles when it comes to communicating why it
does what it does and why it does it in that particular way rather than taking
some other approach.  Almost every programming language provides some mechanism
for embedding natural language comments in the source code, precisely to help
programmers to express that which the code cannot.  Documentation is also
increasingly integrated in programming language tooling.  But both comments and
documentation remain subordinate to the source code. They serve to help the
reader decode the source code.  The source code itself remains supreme, and
often also remains opaque, esoteric, and difficult to understand.

Literate programming proposes to invert this relationship: rather than write a
program with some explanations embedded in it, you craft a piece of writing
with a program embedded in it.  Rather than write comments and documentation
interspersed in source code, write source code interspersed in a natural
language narrative.  Rather than write a program primarily to control the
behavior of a computer and only incidentally to be read by humans, write a
piece of prose primarily to communicate with humans and only incidentally
to be executed by a computer.  This is the theory of literate programming.

In order to write a literate program, the expressive power of most programming
languages is insufficient.  For this reason, the practice of literate
programming usually involves the use of specialized tooling that allows for
expressing source code embedded in a piece of literature.  These tools allow
the source code to be ordered and explained using the structure that most
naturally expresses the meaning of the program, rather than the order and
structure that is most convenient to the compiler or interpreter.  This
involves, for instance, allowing code chunks to be expressed in any order.

# should I write literate programs?

Proponents of literate programming argue that the style provides numerous
benefits.  The prioritization of communication over execution allows
programmers unfamiliar with a program to understand and take ownership of it,
allowing new team members in a collaborative environment to become productive
contributors to the project more quickly.  Literate programs are
also purportedly more resilient to the passage of time and changing of hands,
as new maintainers of a program are endowed with the knowledge of not only what
and how a program works but also why and what for.  Furthermore, writing the
natural language exposition of a literate program naturally forces the writer
to not only better comprehend the program they are writing but also to write a
program that is fundamentally more comprehensible.  As Donald Knuth writes, "my
programs are not only explained better than ever before; they also are better
programs, because the new methodology encourages me to do a better job."
Literate programming also seems especially well suited to environments in which
understanding the program is as important as the program itself, such as in
academic research where a program is part of research findings.

However, despite the purported benefits, literate programming has not seen much
mainstream adoption.  Although there have been numerous remarkable success
stories (notably [one book written in a literate programming style that won an
Academy Award](https://www.pbrt.org/)), most programmers do not program in a
literate style.  This may be explained in part by the disadvantages of literate
programming.

As Knuth recognized in the paper introducing the style, literate programming
practice usually entails writing literate programs in a document that combines
typesetting markup, literate programming specific markup, and source code.
This introduces the opportunity for syntax errors in three different languages
in addition to normal program logic errors, with the associated increase in the
complexity of debugging a literate program. More generally, literate
programming practice introduces additional tooling that necessarily adds some
complexity to writing programs.

Programming in a literate style requires authoring and maintaining not only the
program itself but also the literary work in which the program is embedded.  As
the program changes, it is essential to ensure that the associated literature
is updated with discipline lest it drift out of sync with the actual program
logic it is meant to describe. The cost of maintaining the prose in addition to
the program can be magnified in environments where the requirements for the
program and/or its design may frequently and drastically change.

However, the most damning challenge to the adoption of literate programming
comes not from the tooling nor the maintenance of prose, but simply from the
fact that programming is hard, writing is hard, and very few people are good at
both.  This means that programmers with aptitude for literate programming are
naturally somewhat rare.

In summary:

Pros:

- Improved readability decreases time for new programmers to integrate into collaborative environments
- Program maintenance is easier especially as old maintainers become separated from the project
- Authoring prose alongside code encourages writing higher quality programs
- Emphasis on communication is a natural fit for e.g. programs produced as part of research

Cons:

- Additional tooling imposes increased debugging and programming environment complexity
- Prose introduces additional maintenance overhead, especially when requirements change quickly
- Writing good programs is hard enough without also having to write good literature

If program requirements are reasonably well defined, if it is common for
newcomers to contribute to a project, if code maintainers are likely to change
over time, if the people involved in a project are reasonably good writers as
well as programmers, and especially if the full and varied meanings of the
source code are as important as the code itself, it may be worthwhile to
consider adopting a literate programming style.

Writing in a literate programming style may seem less useful for beginners,
hobbyists, and other amateurs, who may not have the well define program
requirements and deep understanding of the programs they are writing, and who
generally work alone and so miss out on the collaborative advantages of
literate programming.  However, the same people are very likely to benefit from
more widespread adoption of literate programming, because reading and
understanding a literate program is much easier than doing the same with
programs written in a more conventional style. This is the advantage of
literate programming most exciting to me personally. Literate programs have
great potential as educational tools. In the best case, literate programs can
elevate and empower the reader in a way conventional programs simply can't.

# how does `lili` work?

The remainder of this document (the literate source code for `lili`) describes
the functionality and implementation of `lili`.

`lili` knows nothing about the machine source code language, has no runtime
configuration or flags, and makes as few assumptions as possible about the
typesetting markup used for prose passages.  Furthermore, `lili` provides no
functionality for weaving, i.e.  producing typesetting markup from the literate
source code. This functionality is not essential for literate programming,
which is about organization and communication rather than typesetting and
generating cross references (see [this
article](http://akkartik.name/post/literate-programming) for example).  If the
writer wishes to produce a typeset document from the source code, typesetting
markup may be written directly in the lili-formatted document and the
typesetting system can be run directly on this document. For example, this
literate source document for `lili` is typeset in Github-flavored Markdown.
Alternatively, the user may edit the program to introduce support for weaving.

Despite its intentional limitations, `lili` provides enough functionality to
enable writing in a literate programming style.  Code chunks can be presented
to the reader in any order, and code chunks can be referred to by name inside
of other code chunks, allowing for high levels of abstraction in the
presentation of the program.  When outputting source code, chunks are
recursively expanded into the output source files while retaining indentation
for compatibility with white space sensitive languages such as python.

The help text explains the control sequences `lili` searches for to extract
the machine code embedded in your natural-language document:

```c
// ~='help text'
const char * help =
"lili: the little literate programming tool -- version %s\n\
\n\
    USAGE: %s file\n\
\n\
    lili extracts machine source code from literate source code.\n\
\n\
All control sequences begin with a special character called ATSIGN, which is \n\
set to '@' by default. Except for escaped ATSIGNs, all control sequences consume\n\
(i.e. cause to be ignored) the remainder of the line following the end of the\n\
control sequence.\n\
\n\
ATSIGN:new control character  Redefines ATSIGN to whatever immediately follows\n\
                              the : sign. This is useful if your machine source\n\
                              language uses lots of @ signs, for example. This\n\
                              control sequence is ignored if it occurs inside\n\
                              a code chunk definition.\n\
\n\
ATSIGN='chunk name'           Begin a regular chunk definition. The chunk name\n\
                              must be surrounded by matching single (') or\n\
                              double (\") quotes (no backticks). The chunk\n\
                              definition itself begins on the next line.\n\
                              Anything preceeding the ATSIGN and following the\n\
                              final quote sign is ignored, allowing this control\n\
                              sequence to be wrapped in typesetting markup.\n\
\n\
ATSIGN#'chunk name'           Begin a tangle chunk definition. This is similar\n\
                              to a regular chunk, except the name of the chunk\n\
                              is also interpreted as a file name, and the chunk\n\
                              is recursively expanded into the file with that\n\
                              name, overwriting any existing file.\n\
\n\
ATSIGN+'chunk name'           Append to a chunk. The code starting on the\n\
                              next line will be added at the end of the chunk\n\
                              named by 'chunk name'. This is useful e.g. for\n\
                              adding #include directives to the top of a c file.\n\
                              If no such chunk already exists, then this control\n\
                              sequence is equivalent to a normal chunk\n\
                              definition (ATSIGN='name').\n\
\n\
ATSIGN{chunk invocation}      Invoke a chunk to be recursively expanded into any\n\
                              tangled output files. Anything preceeding the\n\
                              ATSIGN is considered as indentation, and every\n\
                              line of code in the recursively expanded chunk\n\
                              will have this indent prepended to it. Anything\n\
                              following the closing brace is ignored, however it\n\
                              is recommended not to put anything there for\n\
                              compatibility with possible future extensions that\n\
                              may make use of this space (i.e. text following\n\
                              the closing brace is reserved).\n\
\n\
ATSIGNATSIGN                  Escape sequence. The rest of the line following\n\
                              the initial ATSIGN (including the second ATSIGN)\n\
                              is treated as normal code to copy to output\n\
                              tangled documents. This allows ATSIGN and special\n\
                              control sequences to be passed through to the\n\
                              output documents.\n\
\n\
ATSIGN/                       End an ongoing chunk declaration.\n\
\n\
ATSIGN[anything else]         An ATSIGN followed by any other character is\n\
                              ignored. However, it is recommended to avoid such\n\
                              character sequences for compatibility with future\n\
                              extensions.\n\
\n";
// end help text ~/
```

## program overview

`lili` is implemented with three main
processing stages: setup, extraction, and output. The setup phase reads the
file to be processed into memory. During extraction, the file is scanned for
code chunk definitions, and these are logged in data structures convenient for
output. The output phase then recursively expands code chunks into files.

```c
// ~#'lili.c'
~{definitions}

void lili(char * file, dict * d, list ** tangles)
{
    char * source;

    ~{load file into `char * source`}

    ~{extract code chunks}
}

int main(int argc, char ** argv)
{
    dict * d;
    list * tangles = NULL;
    char * file;

    ~{setup}

    lili(file, d, &tangles);

    ~{output tangle chunks recursively}

    return 0;
}
// end lili.c ~/
```

## code chunk extraction

### code chunk representation

A chunk of code is represented by the following structure:

```c
// ~='code chunk struct'
typedef struct CodeChunk
{
    char * name;
    list * contents;
    int invocations;
    int tangle;
} code_chunk;
// ~/
```

The `invocations` member is used when tangling output to enforce that each chunk
may only be used once (since `lili` is not meant to act as a text preprocessor
and should not support "function call by copy-and-paste" usage idioms), and to
allow a warning to be raised if a chunk is defined but never used.

The list of contents is populated by entries in the form of the following
structure:

```c
// ~+'code chunk struct'
typedef struct ChunkContents
{
    char * string;
    code_chunk * reference;
    int partial_line;
} chunk_contents;
// ~/
```

There are two different types of chunk contents, which can be differentiated on
the basis of how the fields of the `chunk_contents` struct are populated. The
most common kind of contents is a single line of code. In this case the
`string` field represents the line of code, and the `reference` field is left
empty (i.e. points to `NULL`). The other kind of contents is a chunk
invocation.  In this case the `string` field represents the indentation
preceeding the invocation, while the `reference` field points to the chunk
referred to by the invocation. An enum type and a function are provided which
formalise the distinction, and subroutines are provided for constructing
contents of one or the other type, as well as for constructing a code chunk.

```c
// ~+'code chunk struct'
typedef enum ContentType {code, reference} content_t;

content_t contents_type(chunk_contents * c)
{
    if (c->reference != NULL) return reference;
    else return code;
}

chunk_contents * code_contents_new(char * code)
{
    chunk_contents * c = malloc(sizeof(chunk_contents));
    c->string = code;
    c->reference = NULL;
    c->partial_line = 0;
    return c;
}

chunk_contents * reference_contents_new(char * indent, code_chunk * ref)
{
    chunk_contents * c = malloc(sizeof(chunk_contents));
    c->string = indent;
    c->reference = ref;
    c->partial_line = 0;
    return c;
}

code_chunk * code_chunk_new(char * name)
{
    code_chunk * chunk = malloc(sizeof(code_chunk));
    chunk->name     = name;
    chunk->contents = NULL;
    chunk->invocations = 0;
    chunk->tangle = 0;
    return chunk;
}
// ~/
```

### scanning for code

The outer-most loop of the extraction phase ignores your literate prose while
scanning for control sequences and newline characters. Anytime a newline
character is encountered the variable `line_number` is incremented.
`line_number` is currently only used when printing error messages to help the
user find the location of the error.  When an `ATSIGN` is encountered, control
flow switches depending on the character following the ATSIGN. Redefining
ATSIGN, recursing with a referenced file, and printing a warning when
encountering an unknown control sequence, these are simple cases and copied
below. Extracting chunk definitions is more involved, so the description
follows.

```c
// ~+'globals'
char ATSIGN = '@';
int line_number = 1;
// ~/

// ~='extract code chunks'
{
    char * s = source;
    while (*s != '\0')
    {
        if (*s == '\n') ++s, ++line_number;
        else if (*s++ == ATSIGN)
        {
            switch (*s)
            {
            case '#':
            case '=':
            case '+':
                ~{extract chunk definition}
                break;
            case ':':
                ++s;
                exit_fail_if ( (  *s == '=' || *s == '#' || *s == '+'
                               || *s == '{' || *s == ':' || *s == '/'
                               || *s == '\n'
                               )
                             , "Error: Cannot redefine ATSIGN to a character "
                               "used in control sequences on line %d\n"
                             , line_number
                             );
                ATSIGN = *s++;
                break;
            default:
                exit_fail_if(1
                            , "Error: Unrecognized control sequence ATSIGN%c "
                              "while scanning prose on line %d\n"
                            , *s, line_number
                            );
            }
        }
    }
}
// ~/
```

### parsing and extracting a chunk

When a well formed chunk definition is encountered, a new chunk is created or
an existing one is retrieved for appending to, and an inner loop
parses the chunk definition. 

```c
// ~='extract chunk definition'
if (*(s + 1) != '\'' && *(s + 1) != '\"') 
{
    exit_fail_if(1
                , "Error: Chunk definition sequence on line %d is missing a "
                  "quote-delimited name. Ignoring\n"
                , line_number
                );
    break;
}
else
{
    code_chunk * chunk;
    ~{prepare chunk}
    ~{parse chunk}
}
// ~/
```

When a chunk definition is found, `s` points to the character following the
ATSIGN, which determines what type of definition to extract (1).  The next
character is either a single or double quote, and `s` is advanced to this
character (2.a) before being passed to `extract_name` to fulfill the invariant
of that function (that `s` should point to the delimeter surrounding the name
when it is passed into the function) (2.b). 

Once the name is extracted, any existing chunk by that name is retrieved from
the chunk dictionary (3). If no chunk by the given name already exists, then a
new one is allocated (4). This is the expected behaviour of regular and tangle
definitions, and append definitions if the named chunk hasn't been defined
already.

If the named chunk already exists and the definition is a regular or tangle
chunk (5), then the contents list of the chunk has to be examined. If it is not
empty, this indicates that the chunk is being redefined, which is considered an
error (6). If the contents list is empty, then the chunk must have been
allocated in response to its invocation in a chunk defined before the present
definition. In this case, or in case the existing chunk was explicitly
retrieved for appending to, nothing else needs to be done. If requested, the
selected chunk is marked (7.a) and added to the list of chunks to tangle into
output machine code (7.b), and then `s` is advanced to point to the start of
the next line (8).  Chunk parsing and extraction can now proceed to populate
the contents list of `chunk`.

```c
// ~='prepare chunk'
{
    /* (1) */
    int tangle = *s == '#';
    int append = *s == '+';

    ++s; /* (2.a) */
    char * name = extract_name(&s); /* (2.b) */
    chunk = dict_get(d, name); /* (3) */

    if (chunk == NULL) /* (4) new chunk definition */
    {
        chunk = code_chunk_new(name);
        dict_add(d, chunk);
    }
    else if (!append) /* (5) */
    {
        exit_fail_if(chunk->contents != NULL /* (6) */
                    , "Error: Redefinition of chunk '%s' on line %d.\n"
                      "       Maybe you meant to use a + chunk or accidentally "
                      "used the same name twice?\n"
                    , name, line_number
                    );
        /* todo: free existing chunk? */
    }
    if (tangle)
    {
        chunk->tangle = 1; /* (7.a) */
        list_push(tangles, (void *)chunk); /* (7.b) */
    }
}

exit_fail_if(!advance_to_next_line(&s) /* (8) */
            , "Error: File ended before beginning of definition of chunk '%s' "
              "on line '%d'\n"
            , chunk->name, line_number
            );
// ~/
```

Once the `chunk` is prepared and pointing to the chunk to edit, parsing takes
place. `s` points to the beginning of the first line of the chunk on entry to
the parse loop (1). The loop scans until it encounters a newline (2.a) or a
control sequence (2.b). In these cases, a full line can be extracted, and the
pointer to the `start_of_line` can be updated (3.a, 3.b). An exception is when
an unrecognized control sequence is encountered (3.c), in which case the loop
continues processing the line from the character after the false-alarm ATSIGN
(3.d). The loop may also scan to the end of the file if there is an error, such
as if the author forgets to put a termination sequence at the end of the last
chunk definition in the file (4).

```c
// ~='parse chunk'
{
    char * start_of_line = s; /* (1) */
    for (;;)
    {
        if (*s == '\n') /* (2.a) */
        {
            ~{extract code line}
            start_of_line = s; /* (3.a) */
        }
        else if (*s == ATSIGN) /* (2.b) */
        {
            ++s;
            if (*s == '/')
            {
                ~{end chunk extraction}
            }
            else if (*s == '{')
            {
                ~{extract reference line}
            }
            else if (*s == ATSIGN)
            {
                ~{extract line with escape sequence}
            }
            else /* (3.c) */
            {
                exit_fail_if(1, "Error: Unrecognized control sequence ATSIGN%c "
                                "while parsing chunk on line %d\n"
                            , *s, line_number
                            );
                continue; /* (3.d) */
            }
            start_of_line = s; /* (3.b) */
        }
        else /* (4) */ 
        {
            exit_fail_if(*s == '\0'
                        , "Error: File ended during definition of chunk %s"
                        , chunk->name
                        );
            ++s;
        }
    }
}
// ~/
```

A line of code is extracted when the parse scans a whole line without
encountering any control sequences. The pointer to the start of the line is
copied to a new contents list entry (1), and the entry is added to the list
(2).  The line number is incremented manually (3).

The newline character at the end of the line is then changed in place to a
null character (4), effectively terminating the string pointed to by the
contents list entry; this strategy is used throughout the program, for
extracting code, indents, and chunk names, and saves from having to allocate
more memory to copy strings that are already represented in the memory region
pointed to by `s`. 

```c
// ~='extract code line'
chunk_contents * full_line = code_contents_new(start_of_line); /* (1) */
list_push_back(&chunk->contents, (void *)full_line); /* (2) */
++line_number; /* (3) */
*s++ = '\0'; /* (4) */
// ~/
```

When the control sequence for the end of the chunk definition is encountered,
`s` is simply advanced to the next line and the extraction loop is escaped with
a `break` statement. No further action is necessary.

```c
// ~='end chunk extraction'
advance_to_next_line(&s);
break;
// ~/
```

When a chunk invocation sequence is encountered, three things need to happen:

    (1) the indent string needs to be extracted, 
    (2) a pointer to the chunk referred to by the invocation needs to be retrieved, 
    (3) and both of these need to be added to the contents list.

The `start_of_line` pointer already marks the beginning of the indent (1.a).
The ATSIGN marks the end of the indent, so it is converted to a null character
(1.b, 1.c), terminating the indent string in place.

To get a pointer to the chunk referred to by the invocation, the name between
the braces of the invocation (2.a) is looked up in the chunk dictionary (2.b).
If the chunk doesn't already exist, a new chunk is constructed using the name
from the invocation (2.c). This chunk can be defined later, at which point it
will be recognized as having been invoked before its definition by the fact
that its contents list is empty.

With a pointer to the chunk and a pointer to the null terminated indent ready,
the new contents entry can be constructed and appended to the contents list (3).
The rest of the line is ignored (4).

```c
// ~='extract reference line'
code_chunk * ref;
char * indent = start_of_line; /* (1.a) */

{
    char * end_of_indent = s - 1; /* (1.b) */
    *end_of_indent = '\0'; /* (1.c) */
}

{
    char * name = extract_name(&s); /* (2.a) */
    ref = dict_get(d, name); /* (2.b) */
    if (ref == NULL) /* chunk hasn't been defined yet */
    {
        ref = code_chunk_new(name); /* (2.c) */
        dict_add(d, ref);
    }
    exit_fail_if(!advance_to_next_line(&s) /* (4) */
                , "Error: File ended during definition of chunk '%s'\n"
                  "       following invocation of chunk '%s' on line '%d'\n"
                , chunk->name, name, line_number
                );
}

list_push_back(&chunk->contents, (void *)reference_contents_new(indent, ref)); /* (3) */
// ~/
```

A line with an escape sequence is very similar to a regular line of code,
except with an extra ATSIGN in the middle that shouldn't be tangled to outputs.
This is represented by adding two entries to the contents list, the beginning part spans
from the beginning of the line up to (but not including) the first ATSIGN. The
ending part spans from the second ATSIGN to the end of the line, including the
second ATSIGN in its range. These are terminated in place at the end of the
subroutine after `s` has been advanced to the end of the line (1).

The beginning part is marked as a partial line. This flag is picked up by the
tangling subroutine to suppress the newline that is normally printed after code
entries (2).

```c
// ~='extract line with escape sequence'
chunk_contents * beginning_part = code_contents_new(start_of_line);
chunk_contents * ending_part = code_contents_new(s);
char * at_the_atsign = s - 1;

exit_fail_if(!advance_to_next_line(&s)
        , "Error: File ended during definition of chunk '%s'\n"
          "       following the escape sequence on line '%d'\n"
        , chunk->name, line_number);

/* (1) */
*at_the_atsign = '\0'; /* terminate beginning part */
*(s - 1) = '\0'; /* terminate end part; s - 1 points to a newline character */

beginning_part->partial_line = 1; /* (2) */

list_push_back(&chunk->contents, (void *)beginning_part);
list_push_back(&chunk->contents, (void *)ending_part);
// ~/
```

## code tangling and output

Each code chunk recorded in the list of chunks to tangle is output to a file
named the same as the code chunk. The top level loop iterates over all the
tangle chunks (1), opens their file (2), and calls into the recursive print
function (3):

```c
// ~='output tangle chunks recursively'
for(; tangles != NULL; tangles = tangles->successor) /* (1) */
{
    FILE * f;
    code_chunk * c = tangles->data;

    ~{prevent invocation of tangle chunks}

    f = fopen(c->name, "w"); /* (2) */
    if (f == NULL)
    {
        exit_fail_if(1, "Error: Failed to open file '%s', skipping tangle\n"
                    , c->name
                    );
        continue;
    }
    code_chunk_print(f, d, c, NULL, 1); /* (3) */
    fclose(f);
}
// ~/
```

The representation of code chunks is designed mainly for the convenience of
this function. Every contents entry is either a line of code or a reference to
another chunk. If a chunk contains code (1), it is printed. Otherwise, the
function recurses.

```c
// ~='code chunk print'
void code_chunk_print(FILE * f, dict * d, code_chunk * c, list * indents, int tangle)
{
    list * l;

    ~{prevent multiple invocations}

    for (l = c->contents; l != NULL; l = l->successor)
    {
        chunk_contents * contents = l->data;
        if (contents_type(contents) == code)
        {
            ~{print code}
        }
        else if (contents_type(contents) == reference)
        {
            ~{recurse}
        }
    }
}
// ~/
```


Chunk invocation contents entries keep track of indentation in their struct.
In order to ensure that nested invocations end up with nested indentation, all
of the indents have to be kept track of. A linked list of indents is used for
this purpose. As the print function recurses deeper, indents are appended to
the list. As the layers peel back again, the indents are removed from the list.

```c
// ~='recurse'
code_chunk * next_c = contents->reference;
list_push_back(&indents, (void *)contents->string);
code_chunk_print(f, d, next_c, indents, 0);
list_pop_back(&indents);
// ~/
```

When it comes time to print a line of code, the list of indents is expanded and
the contents of the line are printed, except when the line is empty (1). When a
partial line is encountered, its successor is immediately printed; this is
necessary to avoid indentation being printed after escape sequences (2).
Finally, a newline terminates the line of code (3).

```c
// ~='print code'
if (*contents->string != '\0') /* (1) */
{
    /* print indents on non-empty lines */
    list * i;
    for (i = indents; i != NULL; i = i->successor)
    {
        fputs((char *)i->data, f);
    }
    fputs(contents->string, f); 
}


if (contents->partial_line) /* (2) TODO should this be while? */
{
    l = l->successor;
    exit_fail_if(l == NULL
                , "Error: Partial line without successor in chunk '%s':\n"
                  "       %s"
                , c->name, contents->string
                );
    contents = l->data;
    fputs(contents->string, f);
}

fputc('\n', f); /* (3) */
// ~/
```

Multiple invocations of any given chunk are not permitted by `lili`, which is
not a text preprocessor and does not wish to encourage users to duplicate code
by making it easy to do so. Disallowing multiple invocations also simplifies
the prospect of untangling machine source to synchronize changes back to
literate source, which may be implemented in the future.

```c
// ~='prevent multiple invocations'
if (c->invocations != 0)
{
    if (c->invocations == 1)
        exit_fail_if(1, "Error: Ignoring multiple invocations of chunk %s.\n"
                    , c->name
                    );
    c->invocations += 1;
    return;
}
else if (c->tangle)
{
    if (tangle) c->invocations = 1;
    else
    {
        exit_fail_if(1
                    , "Error: Ignoring invocation of tangle chunk %s within "
                      "another chunk.\n"
                    , c->name
                    );
        return;
    }
}
else c->invocations = 1;
// ~/
```

## extra details

Read on if you are interested in further details, such as the definition of the
list and dict datatypes, the helper functions, and other minutiae.

### list type

This is a very simple implementation of a linked list. The user is responsible
for ensuring the list entries are appropriately converted to and from void
pointers, and for managing the memory and lifetime of the data in the entries,
but the list nodes themselves are allocated and freed appropriately
automatically.

```c
// ~='data types'
typedef struct List
{
    void * data;
    struct List * successor;
} list;

/* a list must be initialized with data */
list * list_new(void * d)
{
    list * l = malloc(sizeof(list));
    l->data = d;
    l->successor = NULL;
    return l;
}

/* we need a double pointer so that if we are passed lst == NULL we mutate *lst
 * so that it points to a new list. This happens in dict_add if the new chunk
 * is hashed into a bucket not already containing a pointer to a list */
void list_push_back(list ** lst, void * elem)
{
    list * a = list_new(elem);
    if (*lst == NULL)
    {
        *lst = a;
    }
    else
    {
        list * l = *lst;
        while (l->successor != NULL) l = l->successor; /* go to the end of l */
        l->successor = a;
    }
}

void list_pop_back(list ** lst)
{
    if (*lst == NULL) return;
    if ((*lst)->successor == NULL)
    {
        free((void *)(*lst));
        *lst = NULL;
        return;
    }
    else
    {
        list * l1 = *lst;
        list * l2 = (*lst)->successor;
        while (l2->successor != NULL) 
        {
            l1 = l2;
            l2 = l2->successor;
        }
        free((void *)l2);
        l1->successor = NULL;
        return;
    }
}

void list_push(list ** lst, void * elem)
{
    list * p = list_new(elem);
    if (*lst == NULL)
    {
        *lst = p;
    }
    else
    {
        list * l = *lst;
        *lst = p;
        p->successor = l;
    }
}

void list_pop(list ** lst)
{
    list * p = *lst;
    if (p == NULL) return;
    list * l = p->successor;
    *lst = l;
    free((void *)p);
}
// ~/
```

### dict type

This is a very simple hash map dictionary. It only holds code chunks, but code
chunks are technically just named lists, so really it could hold anything. The
above linked list implementation is used to hold entries, so the memory etc.
management features are basically the same as for lists (i.e. basically none),
except there's currently no way to remove entries from the dict.

```c
// ~+'data types'
~{code chunk struct}

/* http://www.cse.yorku.ca/~~oz/hash.html */
unsigned long hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while ((c = *str++)) hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

typedef struct Dict
{
    list ** array;
    size_t size;
} dict;

dict * dict_new(size_t size)
{
    dict * d = malloc(sizeof(dict));
    d->array = malloc(size * sizeof(list));
    memset(d->array, (size_t)NULL, size * sizeof(list));
    d->size = size;
    return d;
}

void dict_add(dict * d, code_chunk * c)
{
    unsigned long h = hash((unsigned char *)c->name) % d->size;
    list_push_back(&(d->array[h]), (void *) c);
}

code_chunk * dict_get(dict * d, char * name)
{
    unsigned long h = hash((unsigned char *)name) % d->size;
    list * l = d->array[h];
    while (l != NULL)
    {
        code_chunk * c = l->data;
        if (strcmp(name, c->name) == 0) return c;
        else l = l->successor;
    }
    return NULL;
}
~/
```

### helper functions

This one is a wrapper around `printf` for killing the program if an
unrecoverable error is encountered.

```c
// ~='functions'
void exit_fail_if(int condition, char * message, ...)
{
    if (!condition) return;
    va_list args;
    va_start(args, message);
    vfprintf(stderr, message, args);
    va_end(args);
    exit(EXIT_FAILURE);
}
// ~/
```

This one null terminates in place a name delimited by matching identical
characters (e.g. 'name', "name", .name., $name$) or braces (e.g. {name})
returns the name, and as a side effect advances the `char *` to the character
after the terminal name delimiter. The pointer argument must point to the first
delimiter, and the function fails if the second delimiter can't be found on
the line, or if the name is empty.

```c
// ~+'functions'
char * extract_name(char ** source)
{
    char * s = *source;
    char terminus = *s++;
    char * destination = s;

    switch (terminus)
    {
        case '{': terminus = '}'; break;
        case '[': terminus = ']'; break;
        case '(': terminus = ')'; break;
        case '<': terminus = '>'; break;
    }

    for (; *s != terminus; ++s)
        exit_fail_if ( (*s == '\n' || *s == '\0')
                     , "Error: Unterminated name on line %d\n"
                     , line_number
                     );
    exit_fail_if ( destination == s
                 , "Error: Empty name on line %d\n"
                 , line_number
                 );

    *s = '\0';
    *source = s + 1;
    return destination;
}
// ~/
```

This helper scans for the end of the line, incrementing `line_number` if
successful, and returning 0 if unsuccessful. Failure, in this case, means that
the end of the file was encountered before the end of the line, and the fail
state is returned instead of just dying so the caller can provide more detail
in the error message.

```c
// ~+'functions'
int advance_to_next_line(char ** source)
{
    char * s = *source;
    while (*s != '\n') if (*s++ == '\0') return 0;
    *source = s + 1;
    ++line_number;
    return 1;
}
// ~/
```

Code chunk print is described above, and appended here in the program.

```c
// ~+'functions'
~{code chunk print}
// ~/
```

### setup routine

```c
// ~='setup'
~{parse command line arguments}

~{allocate dict memory}
// ~/
```

Presently, lili takes no flags and accepts only one file to tangle at a time.

```c
// ~='parse command line arguments'
if (argc < 2 || argc > 2 || *argv[1] == '-' /* assume -h */) 
{
    fprintf(stderr, help, VERSION, argv[0]);
    exit(EXIT_SUCCESS);
}
file = argv[1];
// ~/
```

```c
// ~='load file into `char * source`'
{
    int file_size;
    size_t fread_size;
    FILE * source_file = fopen(file, "r");
    exit_fail_if ( (source_file == NULL)
                 , "Error: Could not open source file %s\n", file
                 );
    
    /* get file size */
    fseek(source_file, 0L, SEEK_END);
    file_size = ftell(source_file);
    rewind(source_file);
    
    /* copy file into memory */
    source = malloc(1 + file_size); /* one added for null termination */
    fread_size = fread(source, 1, file_size, source_file);
    if (fread_size != file_size) fprintf(stderr, "Warning: fread_size doesn't match file_size???\n");
    fclose(source_file);
    
    source[file_size] = 0;
}
// ~/
```

```c
// ~='allocate dict memory'
d = dict_new(128); /* for storing chunks */
// ~/
```

### front matter structure

These last chunks are invoked by the main tangle chunk and serve mainly to
abstract away some detail at that high level.

```c
// ~='includes'
#include <ctype.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <stddef.h>
#include <stdarg.h>
// ~/
```

```c
// ~='definitions'
~{includes}

~{globals}

~{data types}

~{functions}

~{help text}
// ~/
```

## tests

The main test case for `lili` is its own literate source (i.e. this document).
Producing `lili.c` by tangling `lili.lili` exercises all of the control
sequences specified in the help text. `lili` should produce a source file when
run, and the same resulting source file every time it is run. This result
should be the same regardless of which version of lili is run. These assertions
are tested in the Makefile included in this repository. Simple run `make test`
to ensure `lili` is working as expected.