-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathlili.lili
1197 lines (1036 loc) · 41.4 KB
/
lili.lili
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!--- @:~ --->
# lili: the little literate programming tool
`lili` is a simple command line program written in portable dependency-free C
that aims to provide the absolute bare minimum of functionality needed to
enable writing programs in a literate programming style. All `lili` does is
scan a plain-text file for a handful of control sequences, extract source code
delimited by those control sequences, and output source code files by
recursively expanding the chunks of code thus found.
This minimalist approach ensures users of `lili` can depend on it to continue
working reliably and easily modify the source code to tailor the program to
their specific needs, such as adding support for a particular typesetting
markup or programming language.
# what is literate programming?
>Literate programming is a very simple idea: Write a book. And oh, by the
>way, the book has the actual executable source code in it.
>Timothy Daly, 2013. "[Literate Programming in the Large](https://youtu.be/Av0PQDVTP4A)"
Literate programming (first described by [Donald Knuth in his paper introducing
the idea](http://www.literateprogramming.com/knuthweb.pdf)) is a style of
computer programming that prioritizes the relationship between the program
writer and program reader. The essence of this approach is the position that
computer programs should be written primarily as a form of communication
between humans and only incidentally as a means of instructing computing
machines to perform useful jobs.
Few would disagree that excellent computer code, in addition to whatever other
positive qualities it may possess, should be easy for any programmer to read
and understand. Conventionally, this ideal is approached by carefully choosing
variable and class names, breaking up complex subroutines into well-named
functions, writing unit tests that demonstrate how a given subroutine is used,
and so on; basically, the code itself is written to be self-describing. After
all, the source code has the final word on what the program does and how it
does it. Comments, documentation, and any other things besides are all
irreconcilably seperate from the actual logic of the program that the code
describes, and they even run the risk of misleading the reader of a program. As
Robert Martin wrote, "[the proper use of comments is to compensate for our
failure to express ourself in
code.](https://www.goodreads.com/book/show/3735293-clean-code)"
On the other hand, there is much that the code itself may struggle to
communicate or which may even be impossible for the code to communicate.
Programming languages are constrained by the systems they are meant to control,
by the systems that translate from source code to machine code, and so on.
These constraints in turn influence the writers of the source code, the ideas
that they can express through the language, and the ways in which they can
present those ideas. Source code excels at expressing what the program does
and how it does it. Code often struggles when it comes to communicating why it
does what it does and why it does it in that particular way rather than taking
some other approach. Almost every programming language provides some mechanism
for embedding natural language comments in the source code, precisely to help
programmers to express that which the code cannot. Documentation is also
increasingly integrated in programming language tooling. But both comments and
documentation remain subordinate to the source code. They serve to help the
reader decode the source code. The source code itself remains supreme, and
often also remains opaque, esoteric, and difficult to understand.
Literate programming proposes to invert this relationship: rather than write a
program with some explanations embedded in it, you craft a piece of writing
with a program embedded in it. Rather than write comments and documentation
interspersed in source code, write source code interspersed in a natural
language narrative. Rather than write a program primarily to control the
behavior of a computer and only incidentally to be read by humans, write a
piece of prose primarily to communicate with humans and only incidentally
to be executed by a computer. This is the theory of literate programming.
In order to write a literate program, the expressive power of most programming
languages is insufficient. For this reason, the practice of literate
programming usually involves the use of specialized tooling that allows for
expressing source code embedded in a piece of literature. These tools allow
the source code to be ordered and explained using the structure that most
naturally expresses the meaning of the program, rather than the order and
structure that is most convenient to the compiler or interpreter. This
involves, for instance, allowing code chunks to be expressed in any order.
# should I write literate programs?
Proponents of literate programming argue that the style provides numerous
benefits. The prioritization of communication over execution allows
programmers unfamiliar with a program to understand and take ownership of it,
allowing new team members in a collaborative environment to become productive
contributors to the project more quickly. Literate programs are
also purportedly more resilient to the passage of time and changing of hands,
as new maintainers of a program are endowed with the knowledge of not only what
and how a program works but also why and what for. Furthermore, writing the
natural language exposition of a literate program naturally forces the writer
to not only better comprehend the program they are writing but also to write a
program that is fundamentally more comprehensible. As Donald Knuth writes, "my
programs are not only explained better than ever before; they also are better
programs, because the new methodology encourages me to do a better job."
Literate programming also seems especially well suited to environments in which
understanding the program is as important as the program itself, such as in
academic research where a program is part of research findings.
However, despite the purported benefits, literate programming has not seen much
mainstream adoption. Although there have been numerous remarkable success
stories (notably [one book written in a literate programming style that won an
Academy Award](https://www.pbrt.org/)), most programmers do not program in a
literate style. This may be explained in part by the disadvantages of literate
programming.
As Knuth recognized in the paper introducing the style, literate programming
practice usually entails writing literate programs in a document that combines
typesetting markup, literate programming specific markup, and source code.
This introduces the opportunity for syntax errors in three different languages
in addition to normal program logic errors, with the associated increase in the
complexity of debugging a literate program. More generally, literate
programming practice introduces additional tooling that necessarily adds some
complexity to writing programs.
Programming in a literate style requires authoring and maintaining not only the
program itself but also the literary work in which the program is embedded. As
the program changes, it is essential to ensure that the associated literature
is updated with discipline lest it drift out of sync with the actual program
logic it is meant to describe. The cost of maintaining the prose in addition to
the program can be magnified in environments where the requirements for the
program and/or its design may frequently and drastically change.
However, the most damning challenge to the adoption of literate programming
comes not from the tooling nor the maintenance of prose, but simply from the
fact that programming is hard, writing is hard, and very few people are good at
both. This means that programmers with aptitude for literate programming are
naturally somewhat rare.
In summary:
Pros:
- Improved readability decreases time for new programmers to integrate into collaborative environments
- Program maintenance is easier especially as old maintainers become separated from the project
- Authoring prose alongside code encourages writing higher quality programs
- Emphasis on communication is a natural fit for e.g. programs produced as part of research
Cons:
- Additional tooling imposes increased debugging and programming environment complexity
- Prose introduces additional maintenance overhead, especially when requirements change quickly
- Writing good programs is hard enough without also having to write good literature
If program requirements are reasonably well defined, if it is common for
newcomers to contribute to a project, if code maintainers are likely to change
over time, if the people involved in a project are reasonably good writers as
well as programmers, and especially if the full and varied meanings of the
source code are as important as the code itself, it may be worthwhile to
consider adopting a literate programming style.
Writing in a literate programming style may seem less useful for beginners,
hobbyists, and other amateurs, who may not have the well define program
requirements and deep understanding of the programs they are writing, and who
generally work alone and so miss out on the collaborative advantages of
literate programming. However, the same people are very likely to benefit from
more widespread adoption of literate programming, because reading and
understanding a literate program is much easier than doing the same with
programs written in a more conventional style. This is the advantage of
literate programming most exciting to me personally. Literate programs have
great potential as educational tools. In the best case, literate programs can
elevate and empower the reader in a way conventional programs simply can't.
# how does `lili` work?
The remainder of this document (the literate source code for `lili`) describes
the functionality and implementation of `lili`.
`lili` knows nothing about the machine source code language, has no runtime
configuration or flags, and makes as few assumptions as possible about the
typesetting markup used for prose passages. Furthermore, `lili` provides no
functionality for weaving, i.e. producing typesetting markup from the literate
source code. This functionality is not essential for literate programming,
which is about organization and communication rather than typesetting and
generating cross references (see [this
article](http://akkartik.name/post/literate-programming) for example). If the
writer wishes to produce a typeset document from the source code, typesetting
markup may be written directly in the lili-formatted document and the
typesetting system can be run directly on this document. For example, this
literate source document for `lili` is typeset in Github-flavored Markdown.
Alternatively, the user may edit the program to introduce support for weaving.
Despite its intentional limitations, `lili` provides enough functionality to
enable writing in a literate programming style. Code chunks can be presented
to the reader in any order, and code chunks can be referred to by name inside
of other code chunks, allowing for high levels of abstraction in the
presentation of the program. When outputting source code, chunks are
recursively expanded into the output source files while retaining indentation
for compatibility with white space sensitive languages such as python.
The help text explains the control sequences `lili` searches for to extract
the machine code embedded in your natural-language document:
```c
// ~='help text'
const char * help =
"lili: the little literate programming tool -- version %s\n\
\n\
USAGE: %s file\n\
\n\
lili extracts machine source code from literate source code.\n\
\n\
All control sequences begin with a special character called ATSIGN, which is \n\
set to '@' by default. Except for escaped ATSIGNs, all control sequences consume\n\
(i.e. cause to be ignored) the remainder of the line following the end of the\n\
control sequence.\n\
\n\
ATSIGN:new control character Redefines ATSIGN to whatever immediately follows\n\
the : sign. This is useful if your machine source\n\
language uses lots of @ signs, for example. This\n\
control sequence is ignored if it occurs inside\n\
a code chunk definition.\n\
\n\
ATSIGN='chunk name' Begin a regular chunk definition. The chunk name\n\
must be surrounded by matching single (') or\n\
double (\") quotes (no backticks). The chunk\n\
definition itself begins on the next line.\n\
Anything preceeding the ATSIGN and following the\n\
final quote sign is ignored, allowing this control\n\
sequence to be wrapped in typesetting markup.\n\
\n\
ATSIGN#'chunk name' Begin a tangle chunk definition. This is similar\n\
to a regular chunk, except the name of the chunk\n\
is also interpreted as a file name, and the chunk\n\
is recursively expanded into the file with that\n\
name, overwriting any existing file.\n\
\n\
ATSIGN+'chunk name' Append to a chunk. The code starting on the\n\
next line will be added at the end of the chunk\n\
named by 'chunk name'. This is useful e.g. for\n\
adding #include directives to the top of a c file.\n\
If no such chunk already exists, then this control\n\
sequence is equivalent to a normal chunk\n\
definition (ATSIGN='name').\n\
\n\
ATSIGN{chunk invocation} Invoke a chunk to be recursively expanded into any\n\
tangled output files. Anything preceeding the\n\
ATSIGN is considered as indentation, and every\n\
line of code in the recursively expanded chunk\n\
will have this indent prepended to it. Anything\n\
following the closing brace is ignored, however it\n\
is recommended not to put anything there for\n\
compatibility with possible future extensions that\n\
may make use of this space (i.e. text following\n\
the closing brace is reserved).\n\
\n\
ATSIGNATSIGN Escape sequence. The rest of the line following\n\
the initial ATSIGN (including the second ATSIGN)\n\
is treated as normal code to copy to output\n\
tangled documents. This allows ATSIGN and special\n\
control sequences to be passed through to the\n\
output documents.\n\
\n\
ATSIGN/ End an ongoing chunk declaration.\n\
\n\
ATSIGN[anything else] An ATSIGN followed by any other character is\n\
ignored. However, it is recommended to avoid such\n\
character sequences for compatibility with future\n\
extensions.\n\
\n";
// end help text ~/
```
## program overview
`lili` is implemented with three main
processing stages: setup, extraction, and output. The setup phase reads the
file to be processed into memory. During extraction, the file is scanned for
code chunk definitions, and these are logged in data structures convenient for
output. The output phase then recursively expands code chunks into files.
```c
// ~#'lili.c'
~{definitions}
void lili(char * file, dict * d, list ** tangles)
{
char * source;
~{load file into `char * source`}
~{extract code chunks}
}
int main(int argc, char ** argv)
{
dict * d;
list * tangles = NULL;
char * file;
~{setup}
lili(file, d, &tangles);
~{output tangle chunks recursively}
return 0;
}
// end lili.c ~/
```
## code chunk extraction
### code chunk representation
A chunk of code is represented by the following structure:
```c
// ~='code chunk struct'
typedef struct CodeChunk
{
char * name;
list * contents;
int invocations;
int tangle;
} code_chunk;
// ~/
```
The `invocations` member is used when tangling output to enforce that each chunk
may only be used once (since `lili` is not meant to act as a text preprocessor
and should not support "function call by copy-and-paste" usage idioms), and to
allow a warning to be raised if a chunk is defined but never used.
The list of contents is populated by entries in the form of the following
structure:
```c
// ~+'code chunk struct'
typedef struct ChunkContents
{
char * string;
code_chunk * reference;
int partial_line;
} chunk_contents;
// ~/
```
There are two different types of chunk contents, which can be differentiated on
the basis of how the fields of the `chunk_contents` struct are populated. The
most common kind of contents is a single line of code. In this case the
`string` field represents the line of code, and the `reference` field is left
empty (i.e. points to `NULL`). The other kind of contents is a chunk
invocation. In this case the `string` field represents the indentation
preceeding the invocation, while the `reference` field points to the chunk
referred to by the invocation. An enum type and a function are provided which
formalise the distinction, and subroutines are provided for constructing
contents of one or the other type, as well as for constructing a code chunk.
```c
// ~+'code chunk struct'
typedef enum ContentType {code, reference} content_t;
content_t contents_type(chunk_contents * c)
{
if (c->reference != NULL) return reference;
else return code;
}
chunk_contents * code_contents_new(char * code)
{
chunk_contents * c = malloc(sizeof(chunk_contents));
c->string = code;
c->reference = NULL;
c->partial_line = 0;
return c;
}
chunk_contents * reference_contents_new(char * indent, code_chunk * ref)
{
chunk_contents * c = malloc(sizeof(chunk_contents));
c->string = indent;
c->reference = ref;
c->partial_line = 0;
return c;
}
code_chunk * code_chunk_new(char * name)
{
code_chunk * chunk = malloc(sizeof(code_chunk));
chunk->name = name;
chunk->contents = NULL;
chunk->invocations = 0;
chunk->tangle = 0;
return chunk;
}
// ~/
```
### scanning for code
The outer-most loop of the extraction phase ignores your literate prose while
scanning for control sequences and newline characters. Anytime a newline
character is encountered the variable `line_number` is incremented.
`line_number` is currently only used when printing error messages to help the
user find the location of the error. When an `ATSIGN` is encountered, control
flow switches depending on the character following the ATSIGN. Redefining
ATSIGN, recursing with a referenced file, and printing a warning when
encountering an unknown control sequence, these are simple cases and copied
below. Extracting chunk definitions is more involved, so the description
follows.
```c
// ~+'globals'
char ATSIGN = '@';
int line_number = 1;
// ~/
// ~='extract code chunks'
{
char * s = source;
while (*s != '\0')
{
if (*s == '\n') ++s, ++line_number;
else if (*s++ == ATSIGN)
{
switch (*s)
{
case '#':
case '=':
case '+':
~{extract chunk definition}
break;
case ':':
++s;
exit_fail_if ( ( *s == '=' || *s == '#' || *s == '+'
|| *s == '{' || *s == ':' || *s == '/'
|| *s == '\n'
)
, "Error: Cannot redefine ATSIGN to a character "
"used in control sequences on line %d\n"
, line_number
);
ATSIGN = *s++;
break;
default:
exit_fail_if(1
, "Error: Unrecognized control sequence ATSIGN%c "
"while scanning prose on line %d\n"
, *s, line_number
);
}
}
}
}
// ~/
```
### parsing and extracting a chunk
When a well formed chunk definition is encountered, a new chunk is created or
an existing one is retrieved for appending to, and an inner loop
parses the chunk definition.
```c
// ~='extract chunk definition'
if (*(s + 1) != '\'' && *(s + 1) != '\"')
{
exit_fail_if(1
, "Error: Chunk definition sequence on line %d is missing a "
"quote-delimited name. Ignoring\n"
, line_number
);
break;
}
else
{
code_chunk * chunk;
~{prepare chunk}
~{parse chunk}
}
// ~/
```
When a chunk definition is found, `s` points to the character following the
ATSIGN, which determines what type of definition to extract (1). The next
character is either a single or double quote, and `s` is advanced to this
character (2.a) before being passed to `extract_name` to fulfill the invariant
of that function (that `s` should point to the delimeter surrounding the name
when it is passed into the function) (2.b).
Once the name is extracted, any existing chunk by that name is retrieved from
the chunk dictionary (3). If no chunk by the given name already exists, then a
new one is allocated (4). This is the expected behaviour of regular and tangle
definitions, and append definitions if the named chunk hasn't been defined
already.
If the named chunk already exists and the definition is a regular or tangle
chunk (5), then the contents list of the chunk has to be examined. If it is not
empty, this indicates that the chunk is being redefined, which is considered an
error (6). If the contents list is empty, then the chunk must have been
allocated in response to its invocation in a chunk defined before the present
definition. In this case, or in case the existing chunk was explicitly
retrieved for appending to, nothing else needs to be done. If requested, the
selected chunk is marked (7.a) and added to the list of chunks to tangle into
output machine code (7.b), and then `s` is advanced to point to the start of
the next line (8). Chunk parsing and extraction can now proceed to populate
the contents list of `chunk`.
```c
// ~='prepare chunk'
{
/* (1) */
int tangle = *s == '#';
int append = *s == '+';
++s; /* (2.a) */
char * name = extract_name(&s); /* (2.b) */
chunk = dict_get(d, name); /* (3) */
if (chunk == NULL) /* (4) new chunk definition */
{
chunk = code_chunk_new(name);
dict_add(d, chunk);
}
else if (!append) /* (5) */
{
exit_fail_if(chunk->contents != NULL /* (6) */
, "Error: Redefinition of chunk '%s' on line %d.\n"
" Maybe you meant to use a + chunk or accidentally "
"used the same name twice?\n"
, name, line_number
);
/* todo: free existing chunk? */
}
if (tangle)
{
chunk->tangle = 1; /* (7.a) */
list_push(tangles, (void *)chunk); /* (7.b) */
}
}
exit_fail_if(!advance_to_next_line(&s) /* (8) */
, "Error: File ended before beginning of definition of chunk '%s' "
"on line '%d'\n"
, chunk->name, line_number
);
// ~/
```
Once the `chunk` is prepared and pointing to the chunk to edit, parsing takes
place. `s` points to the beginning of the first line of the chunk on entry to
the parse loop (1). The loop scans until it encounters a newline (2.a) or a
control sequence (2.b). In these cases, a full line can be extracted, and the
pointer to the `start_of_line` can be updated (3.a, 3.b). An exception is when
an unrecognized control sequence is encountered (3.c), in which case the loop
continues processing the line from the character after the false-alarm ATSIGN
(3.d). The loop may also scan to the end of the file if there is an error, such
as if the author forgets to put a termination sequence at the end of the last
chunk definition in the file (4).
```c
// ~='parse chunk'
{
char * start_of_line = s; /* (1) */
for (;;)
{
if (*s == '\n') /* (2.a) */
{
~{extract code line}
start_of_line = s; /* (3.a) */
}
else if (*s == ATSIGN) /* (2.b) */
{
++s;
if (*s == '/')
{
~{end chunk extraction}
}
else if (*s == '{')
{
~{extract reference line}
}
else if (*s == ATSIGN)
{
~{extract line with escape sequence}
}
else /* (3.c) */
{
exit_fail_if(1, "Error: Unrecognized control sequence ATSIGN%c "
"while parsing chunk on line %d\n"
, *s, line_number
);
continue; /* (3.d) */
}
start_of_line = s; /* (3.b) */
}
else /* (4) */
{
exit_fail_if(*s == '\0'
, "Error: File ended during definition of chunk %s"
, chunk->name
);
++s;
}
}
}
// ~/
```
A line of code is extracted when the parse scans a whole line without
encountering any control sequences. The pointer to the start of the line is
copied to a new contents list entry (1), and the entry is added to the list
(2). The line number is incremented manually (3).
The newline character at the end of the line is then changed in place to a
null character (4), effectively terminating the string pointed to by the
contents list entry; this strategy is used throughout the program, for
extracting code, indents, and chunk names, and saves from having to allocate
more memory to copy strings that are already represented in the memory region
pointed to by `s`.
```c
// ~='extract code line'
chunk_contents * full_line = code_contents_new(start_of_line); /* (1) */
list_push_back(&chunk->contents, (void *)full_line); /* (2) */
++line_number; /* (3) */
*s++ = '\0'; /* (4) */
// ~/
```
When the control sequence for the end of the chunk definition is encountered,
`s` is simply advanced to the next line and the extraction loop is escaped with
a `break` statement. No further action is necessary.
```c
// ~='end chunk extraction'
advance_to_next_line(&s);
break;
// ~/
```
When a chunk invocation sequence is encountered, three things need to happen:
(1) the indent string needs to be extracted,
(2) a pointer to the chunk referred to by the invocation needs to be retrieved,
(3) and both of these need to be added to the contents list.
The `start_of_line` pointer already marks the beginning of the indent (1.a).
The ATSIGN marks the end of the indent, so it is converted to a null character
(1.b, 1.c), terminating the indent string in place.
To get a pointer to the chunk referred to by the invocation, the name between
the braces of the invocation (2.a) is looked up in the chunk dictionary (2.b).
If the chunk doesn't already exist, a new chunk is constructed using the name
from the invocation (2.c). This chunk can be defined later, at which point it
will be recognized as having been invoked before its definition by the fact
that its contents list is empty.
With a pointer to the chunk and a pointer to the null terminated indent ready,
the new contents entry can be constructed and appended to the contents list (3).
The rest of the line is ignored (4).
```c
// ~='extract reference line'
code_chunk * ref;
char * indent = start_of_line; /* (1.a) */
{
char * end_of_indent = s - 1; /* (1.b) */
*end_of_indent = '\0'; /* (1.c) */
}
{
char * name = extract_name(&s); /* (2.a) */
ref = dict_get(d, name); /* (2.b) */
if (ref == NULL) /* chunk hasn't been defined yet */
{
ref = code_chunk_new(name); /* (2.c) */
dict_add(d, ref);
}
exit_fail_if(!advance_to_next_line(&s) /* (4) */
, "Error: File ended during definition of chunk '%s'\n"
" following invocation of chunk '%s' on line '%d'\n"
, chunk->name, name, line_number
);
}
list_push_back(&chunk->contents, (void *)reference_contents_new(indent, ref)); /* (3) */
// ~/
```
A line with an escape sequence is very similar to a regular line of code,
except with an extra ATSIGN in the middle that shouldn't be tangled to outputs.
This is represented by adding two entries to the contents list, the beginning part spans
from the beginning of the line up to (but not including) the first ATSIGN. The
ending part spans from the second ATSIGN to the end of the line, including the
second ATSIGN in its range. These are terminated in place at the end of the
subroutine after `s` has been advanced to the end of the line (1).
The beginning part is marked as a partial line. This flag is picked up by the
tangling subroutine to suppress the newline that is normally printed after code
entries (2).
```c
// ~='extract line with escape sequence'
chunk_contents * beginning_part = code_contents_new(start_of_line);
chunk_contents * ending_part = code_contents_new(s);
char * at_the_atsign = s - 1;
exit_fail_if(!advance_to_next_line(&s)
, "Error: File ended during definition of chunk '%s'\n"
" following the escape sequence on line '%d'\n"
, chunk->name, line_number);
/* (1) */
*at_the_atsign = '\0'; /* terminate beginning part */
*(s - 1) = '\0'; /* terminate end part; s - 1 points to a newline character */
beginning_part->partial_line = 1; /* (2) */
list_push_back(&chunk->contents, (void *)beginning_part);
list_push_back(&chunk->contents, (void *)ending_part);
// ~/
```
## code tangling and output
Each code chunk recorded in the list of chunks to tangle is output to a file
named the same as the code chunk. The top level loop iterates over all the
tangle chunks (1), opens their file (2), and calls into the recursive print
function (3):
```c
// ~='output tangle chunks recursively'
for(; tangles != NULL; tangles = tangles->successor) /* (1) */
{
FILE * f;
code_chunk * c = tangles->data;
~{prevent invocation of tangle chunks}
f = fopen(c->name, "w"); /* (2) */
if (f == NULL)
{
exit_fail_if(1, "Error: Failed to open file '%s', skipping tangle\n"
, c->name
);
continue;
}
code_chunk_print(f, d, c, NULL, 1); /* (3) */
fclose(f);
}
// ~/
```
The representation of code chunks is designed mainly for the convenience of
this function. Every contents entry is either a line of code or a reference to
another chunk. If a chunk contains code (1), it is printed. Otherwise, the
function recurses.
```c
// ~='code chunk print'
void code_chunk_print(FILE * f, dict * d, code_chunk * c, list * indents, int tangle)
{
list * l;
~{prevent multiple invocations}
for (l = c->contents; l != NULL; l = l->successor)
{
chunk_contents * contents = l->data;
if (contents_type(contents) == code)
{
~{print code}
}
else if (contents_type(contents) == reference)
{
~{recurse}
}
}
}
// ~/
```
Chunk invocation contents entries keep track of indentation in their struct.
In order to ensure that nested invocations end up with nested indentation, all
of the indents have to be kept track of. A linked list of indents is used for
this purpose. As the print function recurses deeper, indents are appended to
the list. As the layers peel back again, the indents are removed from the list.
```c
// ~='recurse'
code_chunk * next_c = contents->reference;
list_push_back(&indents, (void *)contents->string);
code_chunk_print(f, d, next_c, indents, 0);
list_pop_back(&indents);
// ~/
```
When it comes time to print a line of code, the list of indents is expanded and
the contents of the line are printed, except when the line is empty (1). When a
partial line is encountered, its successor is immediately printed; this is
necessary to avoid indentation being printed after escape sequences (2).
Finally, a newline terminates the line of code (3).
```c
// ~='print code'
if (*contents->string != '\0') /* (1) */
{
/* print indents on non-empty lines */
list * i;
for (i = indents; i != NULL; i = i->successor)
{
fputs((char *)i->data, f);
}
fputs(contents->string, f);
}
if (contents->partial_line) /* (2) TODO should this be while? */
{
l = l->successor;
exit_fail_if(l == NULL
, "Error: Partial line without successor in chunk '%s':\n"
" %s"
, c->name, contents->string
);
contents = l->data;
fputs(contents->string, f);
}
fputc('\n', f); /* (3) */
// ~/
```
Multiple invocations of any given chunk are not permitted by `lili`, which is
not a text preprocessor and does not wish to encourage users to duplicate code
by making it easy to do so. Disallowing multiple invocations also simplifies
the prospect of untangling machine source to synchronize changes back to
literate source, which may be implemented in the future.
```c
// ~='prevent multiple invocations'
if (c->invocations != 0)
{
if (c->invocations == 1)
exit_fail_if(1, "Error: Ignoring multiple invocations of chunk %s.\n"
, c->name
);
c->invocations += 1;
return;
}
else if (c->tangle)
{
if (tangle) c->invocations = 1;
else
{
exit_fail_if(1
, "Error: Ignoring invocation of tangle chunk %s within "
"another chunk.\n"
, c->name
);
return;
}
}
else c->invocations = 1;
// ~/
```
## extra details
Read on if you are interested in further details, such as the definition of the
list and dict datatypes, the helper functions, and other minutiae.
### list type
This is a very simple implementation of a linked list. The user is responsible
for ensuring the list entries are appropriately converted to and from void
pointers, and for managing the memory and lifetime of the data in the entries,
but the list nodes themselves are allocated and freed appropriately
automatically.
```c
// ~='data types'
typedef struct List
{
void * data;
struct List * successor;
} list;
/* a list must be initialized with data */
list * list_new(void * d)
{
list * l = malloc(sizeof(list));
l->data = d;
l->successor = NULL;
return l;
}
/* we need a double pointer so that if we are passed lst == NULL we mutate *lst
* so that it points to a new list. This happens in dict_add if the new chunk
* is hashed into a bucket not already containing a pointer to a list */
void list_push_back(list ** lst, void * elem)
{
list * a = list_new(elem);
if (*lst == NULL)
{
*lst = a;
}
else
{
list * l = *lst;
while (l->successor != NULL) l = l->successor; /* go to the end of l */
l->successor = a;
}
}
void list_pop_back(list ** lst)
{
if (*lst == NULL) return;
if ((*lst)->successor == NULL)
{
free((void *)(*lst));
*lst = NULL;
return;
}
else
{
list * l1 = *lst;
list * l2 = (*lst)->successor;
while (l2->successor != NULL)
{
l1 = l2;
l2 = l2->successor;
}
free((void *)l2);
l1->successor = NULL;
return;
}
}
void list_push(list ** lst, void * elem)
{
list * p = list_new(elem);
if (*lst == NULL)
{
*lst = p;
}
else
{
list * l = *lst;
*lst = p;
p->successor = l;
}
}
void list_pop(list ** lst)
{
list * p = *lst;
if (p == NULL) return;
list * l = p->successor;
*lst = l;
free((void *)p);
}
// ~/
```
### dict type
This is a very simple hash map dictionary. It only holds code chunks, but code
chunks are technically just named lists, so really it could hold anything. The
above linked list implementation is used to hold entries, so the memory etc.
management features are basically the same as for lists (i.e. basically none),
except there's currently no way to remove entries from the dict.
```c
// ~+'data types'
~{code chunk struct}
/* http://www.cse.yorku.ca/~~oz/hash.html */
unsigned long hash(unsigned char *str)
{
unsigned long hash = 5381;
int c;
while ((c = *str++)) hash = ((hash << 5) + hash) + c; /* hash * 33 + c */
return hash;
}
typedef struct Dict
{
list ** array;
size_t size;
} dict;
dict * dict_new(size_t size)
{
dict * d = malloc(sizeof(dict));
d->array = malloc(size * sizeof(list));
memset(d->array, (size_t)NULL, size * sizeof(list));
d->size = size;
return d;
}
void dict_add(dict * d, code_chunk * c)
{
unsigned long h = hash((unsigned char *)c->name) % d->size;
list_push_back(&(d->array[h]), (void *) c);