forked from cisco/hash-sigs
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathprogramming.notes
674 lines (652 loc) · 40.7 KB
/
programming.notes
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
These are some notes about the internal programming style of the HSS
subsystem. These are of no interest to someone who just wants to use LMS
signatures; they might give some insight to someone spelunking through the
sources.
- General philosophy
This subsystem implements the HSS siganture scheme, which is a moderately
complex data structure, involving multiple Merkle trees. To make things
even more fun, we implement restarting, that is, loading a private key
into memory, even if that private key has already generated some signatures
(and constructing the Merkle tree structures to reflect those signatures),
and multithreading (that is, we can spread many of the computations over
multiple processors).
Now, there is a rather annoying amount of complexity within the package;
we strive to isolate that from the user. When the user creates a private
key, he has to tell us the parameter set (we can't make that up ourselves);
however, after that, he just loads the keys, and signs messages, without
worrying about the internal structure of the trees. The parts that we
do leave to the user (such as the functions to load/store the private key)
are there in attempt to make it easier to use the system securely.
- Merkle trees, subtrees
The core of this system is the signer, and the working key. To sign
a message, we generate the OTS signature for that message, and generate
the authentication path to the root; this authentication path are the
nodes adjacent to the actual path from the OTS public key to the Merkle
tree root; hence when we generate the signature, we need to have those
internal node values computed. We could compute them on the fly, but
that's be expensive. We could store the entire Merkle tree in memory
and retrieve them that way, but that'd take too much memory (for H-25,
that's 2Gigabytes) [1]. Instead, what we do is implement a Merkle tree
walk, which recomputes a few OTS public keys (and Merkle internal nodes)
per signature, and ensures that we have the data needed for the
authentication path when we need them. The algorithm we actually use for
this is inspired by the paper "Fractal Merkle Tree Representation and
Traversal", by Jakobsson et al. Now, we don't do the exact algorithm
found in the paper (Algorithm 3); we go for "constant OTS pubkey
computation", and not "constant hash evaluations" (the algorithm in the
paper assumes that generating a leaf node takes one hash; it's more
like a few thousand). In addition, we need to deal with the complexity
of dealing with multiple trees at once, their algorithm considers only
one. If you do go through the paper, their $Exist_i$ tree is our
ACTIVE_TREE, and their $Desire_i$ tree is our BUILDING_TREE (and they
don't have a NEXT_TREE). This algorithm allows a time/memory trade-off,
and even if we don't give it much memory, it's still decently efficient.
This algorithm is based on a "subtree"; a subtree is a triangular subsection
of the tree that consists of nodes from level A to level A+h-1 of the tree
which are rooted by a specific level A node; 'h is the height of the
subtree. If the Merkle tree is level H, then we divide up the tree into H/h
(rounded up) levels, and deal with the subtrees at each levels (if h doesn't
divide H cleanly, then the top level subtree will be shorter). For each
such level, we track an ACTIVE_TREE, a BUILDING_TREE and a NEXT_TREE
(exceptions: the top-most subtree doesn't get a BUILDING_TREE, and the
subtrees for the top level Merkle tree don't get NEXT_TREE's; they wouldn't
be used).
The ACTIVE_TREE lives on the current authentication path (that is, on the
path from the current (actually, next-to-be-used) leaf node to the Merkle
tree root. The ACTIVE_TREE is always fully populated (that is, it contains
the correct value for all the internal nodes within the subtree), and so we
can read the authentication path by looking at the nodes within the
ACTIVE_TREE.
The BUILDING_TREE is the subtree to the immediate right of the ACTIVE_TREE,
within the same Merkle tree. That is, the root of the BUILDING_TREE is the
node next to the root of the ACTIVE_TREE. As we generate signatures from
the ACTIVE_TREE, we also incrementally compute nodes in the BUILDING tree.
If the root of the two subtrees are height L from the leaf nodes, then we
generate 2**L signatures from the current ACTIVE_TREE, and we need 2**L
OTS pubkey generations to fully build the BUILDING tree; hence for each
signature generated, we compute one OTS pubkey (and do the other work
involved with constructing the tree); when we complete the ACTIVE_TREE,
the BUILDING_TREE is all constructed, and ready to become the new active.
We need to do this update 1 time for each building tree in the bottom
merkle tree; there are H/h-1 (rounded up) such levels (the topmost
subtree doesn't get a building tree, as there's no subtree adjacent
to it), and so the building subtrees involve H/h-1 OTS pubkey gens.
The NEXT_TREE is the first subtree that's in the next Merkle tree. Except
for the topmost Merkle level, when we exhaust one Merkle tree, we're
expected to roll over to the next. The NEXT_TREE is there so that when
we do, we'll have a fresh set of ACTIVE_TREEs are ready to go. If the
height of the Merkle tree is H, then the current Merkle tree can sign
2**H signatures, and it takes 2**H OTS pubkey gens to construct the
next Merkle tree, hence we do one OTS pubkey gen for the next tree
per signature (in addition to the building tree pubkey gens listed
above).
Also, cryptographically, all levels of the HSS hierarchy are the same, but
from an implementation standpoint, they're not. Almost all the work
computing the next authentication path is done with the bottom level Merkle
tree, and so if the user allows us extra memory, we'll devote all that
memory to expanding the bottom subtrees (which reduces the number of OTS
pubkey computations per signature; by increasing h, we decrease H/h). For
Merkle trees other than the bottommost, we always use the subtree height
that gives us the minimal memory (which isn't always the smallest subtrees)
[2]. Because we hardly ever need to update the subtrees there, any extra
memory we use would be wasted.
Another difference between the bottom-most Merkle tree and higher trees
is the update strategy. For the bottom-most Merkle tree, we perform the
needed OTS pubkey gens as a part of the signature call (as that's the only
time we have to perform any operations). However, for higher level trees,
we don't gen those OTS pubkeys when we generate the signature (as that
would cause that signature to be unexpectedly expensive); instead we
spread the OTS pubkey gens over a number of previous signature operations.
Also, we deliberately do those when the bottom-most tree is doing fewer
operations than expected (e.g. one of its ACTIVE_TREEs is on the right
side, and so there's no need for us to update the BUILDING_TREE), and so
the caller doesn't see any unexpected expense at all.
When we load a private key in memory, the bulk of the work is initializing
the subtrees to be what we'd expect them to hold, based on what the current
count is. Actually, we advance the building and next trees to be slightly
in advance of what they'd be if we incremented them manually (that turns
out to be somewhat simpler).
[1] Actually for the bottom level Merkle tree, if the user allows us enough
memory, we will explicitly represent the entire Merkle tree in memory; we
just have it implement one huge subtree, that is, H=h. However, we don't
mandate that the user allow us that much memory, if he allows us less,
we'll go with a more compact (and slower) representation.
[2] Except in those cases where the immediately below Merkle tree won't
give us enough updates, which can happen only in rather obscure parameter
sets, e.g. 25/5.
- Aux data; what is it, really?
When we first generate the public key, we must compute the entire Merkle
tree contents for the top level Merkle tree (in order to compute the public
key). When we load the private key into memory, we must compute the
authentication path for active Merkle trees (which includes the top level
one); this involves computing the entire Merkle tree contents. Obviously,
there is a lot of repeated computation going on; our answer to that it
auxiliary data. When we generate the Merkle tree the first time (when we
generate the public key), we save some of the contents of this tree; when we
load the key into memory, we use these saved contents (rather than
recomputing those nodes); this significantly decreases the amount of
recomputation we need. We actually save the values at the bottom of the
subtrees; that means that those bottom nodes for the subtrees we've saved
are free; we recompute the internal nodes for those subtrees ourselves (but
that's comparatively cheap). The higher level subtrees cost us the least
amount of disk space (there are fewer of them), and they save us the most
computation time, and so those get priority. We know (given the current
allocation algorithm) what subtree heights would be (and hence where their
bottom levels would be; based on the amount of aux data we're allowed, we'll
save as many bottom levels as can fit.
Also, we protect the aux data with an HMAC (and if that doesn't validate on
reload, we ignore it); this means that an attacker can't cause us to
generate invalide signatures by message with the aux data; they can make
key reloading take more time.
- Extra info, what's up with that?
There are some information that the application may want to specify, but
most applications really don't care. Some examples are "how many threads
should I use", or "have we just signed the last signature?" or "why did
that call just fail?". Instead how having independant parameters for each
one (and having to modify the API every time we think of another one), we
bundled them all into a structure (struct hss_extra_info), and allow the
application to set it, pass in a pointer to it, and on return, check the
results. Of course, if the application is OK with the defaults, it can
just pass in a NULL. Currently, we have three things that can be passed:
- Number of threads that we can use
- Whether we just used up the last signature
- Most recent failure reason (hss_error_code, you'll see that there are
a lot of possible reaosns)
In the future, we'll likely to add support for partial trees, and possibly
other things we haven't thought of yet; one nice thing is that we can add
something here to add an obscure (== "not of interest to most applications")
parameter without modifying the API to existing applications.
- Data structures; this is a review of some of the structures used within
this subsystem, and what they really mean
private key
This is what the raw private key looks like. It's not a formal C
structure; instead, it is a byte array, currently 48 bytes long,
split up into:
- 8 bytes of count; this is the state that gets updated with every
signature, and consists of a bigendian count of the number of
signatures so far. By convention, a setting of 0xffffffffffffffff
means 'we're used up all our signatures'
- 8 bytes of parameter set, in a compressed format. This is here so
that the application needn't tell us what the parmaeter set when
loading a key (and can't get it wrong)
- 32 bytes of random seed; this is where all the security comes from.
It is 32 bytes (256 bits) so Grover's algorthm can't recover it.
This is a flat set of bytes, because we hand it to read_private_key
and update_private_key routines, which are expected to read/write
them to long term storage.
Random musing: should we have included a version parameter (so we
could change the format without breaking things???)
struct hss_working_key
This structure holds all the current state of a loaded private key.
It contains a copy of the private key (so we can write it out as
required; we have the entire thing in memory, in case the write
routine isn't able to do a partial update), the current and reserve
counts (the reserve count is what we've last written to the private
key; we use it to implement the 'reserve' functionality), the
current signed public keys that we place into the signature,
and all the levels that make up this hierarchy.
One nonobvious member of this structure is 'stack'. Some of the
subtrees (the nonbottom building and next subtrees) require a
stack to hold intermediate hash values (as we compute them one
OTS pubkey at a time). On the other hand, other subtrees (the
active ones) have no such need; so, to save a bit of space, we
consolidate all the stacks into one contiguous region (and have
each subtree point into the part of the region that's theirs).
And, when we swap the active subtree with a building/next, we move
the stack pointer from the old building/next subtree to the new
(as the new active one doesn't need it).
struct merkle_level
Actually, this is not a Merkle tree (even though the code typically
names variables of this type 'tree'). Instead, it stands for a
specific tree level within the HSS hierarchy, and all the trees
that might live at that level. It contains the parameter set
that is used for this level, how the trees are implemented (subtree
sizes), and also two different trees at this level; the one the active
path is going through, and the next tree that will be used at this
level (once the current tree all the signatures it is allowed). This
structure has pointers to the various subtrees that hold the known node
values for the two trees.
struct subtree
This contains the node values for a particular subtree. The
height of this subtree is implicit (we use values in the merkle_level
to recompute it), it does contain the location of the subtree within
the larger tree. There are three different flavors of subtrees,
active, building and next; we tell which one this is based on
the pointer from the merkle_level. For building and next subtrees,
the subtree may not be complete; the current_index value tells
us where in the building process we are. For nonbottom subtrees,
this building process involves a stack (to combine nodes that
are lower than the bottom of this subtree); the stack member is
a pointer to a region dedicated for this rebuild for this subtree.
The actual node values are in the array nodes[] (and the structure
will be malloc'ed large enough to hold all the nodes); the root
of the subtree will be at location 0.
struct thread_collection
This is the abstract structure that stands for a collection of threads.
Its contents are specific to the actual threading implementation (the
trivial implementation doesn't actually bother defining it, as it never
allocates an object of this type). It holds information about the
threads, any necessary locks, and the tasks that have been asked for.
In addition, any such pointer to a (struct thread_collection) may be
NULL; this is never considered an error condition, rather it is an
indication that we're running in single-threaded mode (either because
that's how we're linked, or because we got an error spawning the
thread).
struct expanded_aux_data
This is an array of pointers to the various node values that occur
at various sublevels within the aux data; for any level that isn't
stored in the aux data, the pointer will be NULL. This will always
point into an application provided buffer, hence we don't need to
worry about memory allocation.
This is used in two slightly different ways: if we're building the
aux data (during key generation time), this will point to where the
aux data should go (that is, into the application-provided buffer).
If we're loading the aux data (during key load time), this will point
into where in the buffer the various levels actually are (and if the
buffer doesn't validate (wrong length or bad HMAC), we NULL out all
the pointers (and so the generation code treats it as if we had no
aux data).
- Types
We try to use types in a stereotypical way; we do this not so much to allow
the compiler to do type checking (as most of the types of flavors of int,
which the compiler will allow to mix-and-match freely), intead, it's
intended as a hint to the maintainer what this variable is supposed to be.
Of course, it counts as a hint only if the reader actually knows what we use
which types for :-)
sequence_t
This is the internal sequence number across an entire HSS tree structure.
That is, it's used to represent the total number of signatures that an
HSS private key has signed so far. This is a 64 bit type, as we allow
parameter sets that can sign more than 2**32 signatures.
merkle_index_t
This is the 'index' within a single Merkle tree; however we do give it
four distinct meanings:
- This is the count of the number of signatures generated with a single
Merkle tree so far
- The 'address' of a Merkle node that we use when computing its hash;
that is, the four bytes we include in the hash. In the draft, the
variable 'r' is this type.
- The offset of a node from the left side of the Merkle tree. In the
draft, the variable 'q' in the OTS signatures is this type, however we
use it for internal nodes within the Merkle tree as well.
- The offset of a node from the left side of the subtree it is in. This
is different from the previous definition if the subtree we're in isn't
the leftmost. Obviously, this meaning is never referenced in the draft
(as the draft never discusses subtrees).
If we were doing a rewrite, we'd probably give different typedef's for
the distinct meanings.
size_t
This is the size of an object (buffer, signature, whatever). Two notes:
- If we know that the size cannot be greater than 65535 (e.g. the size
of a hash), we sometimes use 'unsigned' instead
- In a couple of places, we want the size, however we want to allow
negative values as well. In those places, we use the type
'signed long' (and, yes, I know that 'signed' is redundant; I want
to emphesize the signedness).
bool
We use a bool in two different ways; the first is the obvious (a value
which is true or false); the other is a success value (did the operation
work or not?); we use the convention "true == it worked",
"false == it failed". It might make sense to use the opposite convention
(0 means it worked, nonzero means it failed; the exact nonzero value
might give an indication as to why it failed); however to me, having
0 meaning success is sufficiently nonintuitive that I just can't do it;
we have added another way to communicating the failure reason to the
application, in case it cares.
param_set_t
This is either an LM or an OTS parameter set (32 bits).
aux_level_t
This is the 32 bit flag that goes in front of the auxilary data; bits
within this flag indicate which Merkle tree levels we actual have
auxilary data for (and whether we have any aux data at all). If the
msbit of the initial byte of the aux data (which corresponds to bit 31
of the aux_level_t) is clear, then we assume we have no aux data (even
if the aux data consists of only that one byte; that means for any
real aux_level_t, bit 31 will be set.
hss_error_code
This is the code that stands for an error reason; for some structures,
we also use this as a 'is-this-structure-usable' flag (with something
other than "no error" (hss_error_none) meaning 'if someone tries to
use this structure, report this error'.
We've divided the various error codes into ranges, to stand for the
general types of errors (essentially, who we think was at fault):
- hss_range_normal_failures; these are the types of errors you get
when running the package normally (signature validation failure,
private key expired)
- hss_range_bad_parameters; these are the types of errors caused by
the application misusing the package (unsupported parameter set,
passed buffer not big enough, etc)
- hss_range_processing_error; these are errors caused by something in
the environment (rng failure, nvread/write failure, malloc
failure)
- hss_range_my_problem; these are errors caused by something internal
to this package; currently, they're all dubbed hss_error_internal,
and are caused by either something scribbling over our memory
or a bug somewhere
struct seed_derive
This is the structure we use to derive keys in a potentially side
channel resistant manner. There are two different versions of this
structure (with the same API, controlled by #define's) that map
the current seed, I value, q value (Merkle tree node index) and
j value (Winternitz digit index) into unpredictable values for the
LM-OTS private values (and the C values and the child tree I, seed
values). We make this into a structure because, when we're doing
a tree based derivation method (SECRET_METHOD==1), adjacent values
usually share most of the nodes of the tree, and this structure is
a convienent place to store those shared values (allowing us to
avoid recomputation).
This structure is used as follows:
- hss_seed_derive_init to set the I, seed values
- hss_seed_derive_set_q to set the q value (and if you call this again
to set it to a different q value, this tries to reuse as many nodes
as possible to minimize computation
- hss_seed_derive_set_j to set the initial j value.
- hss_seed_derive to get the next seed value. If increment_j is set,
this also sets up the structure for the next j value, and in a way
that minimized the number of hashes done.
- hss_seed_derive_done when we're done (to zeroize any internal state
information).
The programmer can modify the SECRET_METHOD/SECRET_MAX settings to
change the efficiency/side channel resistance mix; however any
such change modifies the mapping between seeds and private LM-OTS
values (that is, your private keys no longer work).
- Side channel resistance and key derivation.
We are inherently resistant to timing and cache-based side channel attacks
(as those behaviors are uncorrelated to any secret information). However,
when we come to DPA-style attacks, we do try to have some protection. To
perform a DPA attack, the attacker would need to see us use the same secret
value in a number of distinct hashes (which would allow them to build up
the statistics). To prevent this from being a problem, we can be configured
so that no secure is ever used for more than a limited number of hashes
(and so the attacker cannot get enough information to reconstruct any
secret). This is controlled by the SECRET_METHOD and SECRET_MAX #defines
found in hss_derive.h. SECRET_METHOD==0 means that we don't worry that much
(and we use the LM-OTS key generation procedure found in Appendix A).
SECRET_METHOD==2 means that we use the same procedure that ACVP expects to
translate seed values to the random values used. It's a variation of the
SECRET_METHOD==0 method, and isn't designed to be side-channel resistant.
SECRET_METHOD==1 means that the number of times we use any secret is
bounded to a maximum of 2**SECRET_MAX times. In this method, we derive
keys using a tree-based process, with each node having up to 2**SECRET_MAX
children. Decreasing SECRET_MAX takes a bit more time (as the tree becomes
deeper), however even SECERET_MAX==1 (smallest allowed value) isn't that
expensive.
- Threading architecture
We support doing operations on multiple threads, however we'd rather not
have the majority of the code know whether we're actually doing threading
or not. The compromise we came up with is the hss_thread.h API; this
compromise is not perfect (at least half of the complexity within
hss_generate.c is logic to try to work with multiple threads efficiently),
however it's better than embedding conditional pthread calls throughout
the code. This hss_thread API allows us to issue "tasks"; these tasks may
be run by the main thread or may be run in a spawned thread; the only
guarantee is that, by the time hss_thread_done returns, all the tasks have
completed. Now, we provide two different implementations of this API; a
trivial one (hss_thread_single.c), which doesn't try to spawn any threads,
and one that assumes the POSIX pthread API (hss_thread_pthread.c), which
makes calls to the pthread library to do this. You're expected to link
either one or the other when compiling the subsystem (and the Makefile we
provide generates two libraries, hss_lib.a and hss_lib_thread.a, which does
that for you). We do this, rather than attempting some internal switch,
because hss_thread_pthread.c makes direct calls to pthread (hence, would be
a link error if the OS didn't provide some implementation of those). Also,
if you're interesting in supporting some other threading library (such as
the one defined in C11), that shouldn't be that hard to add; we really use
only the basics of what a threading library ought to provide.
Also, as for avoiding race conditions (always a good idea), we have the
convention that, when a task is writing the result, and the area it's
writing to is malloc'ed or automatic region that might be shared by
other threads, that task must call hss_thread_before_write(col) before the
write (and hss_thread_after_write(col) afterwards); this makes sure that the
thread is the only one to write into that region. We do this even if two
threads aren't actually updating the exact same bytes; because we're not
careful to make sure our fields are aligned with the natural word size
of the CPU (whatever that is), then what an update of one field might
end up doing is a read/modify/write of a larger word, which might end
up overlapping the target of another thread (which might be doing a
read/modify/write of the same larger word simultaneously). This is a rather
unlikely event, however the race condition it would cause if it did happen
would be *really* hard to track down. Now, hss_thread_before_write does a
lock, which means that no other thread can write into the common region
until we release the lock, hence the time any thread holds the lock ought to
be short; the current code abides by that.
Now, we allow the application to specify the number of threads; if we're
using the pthread library, that's the number of child threads we'll
attempt to spawn (we don't count the parent thread, however while child
threads are active, the parent thread doesn't do much). Of course, it
is subject to a sane maximum. If the number of threads is specified as
1, we'll fall back to single-threaded mode. If it specifies 0, it gets
a supposedly-reasonable default (DEFAULT_THREAD). On the other hand, if
we're linked with the nonthreaded library, this doesn't do anything (we're
always in single-threaded mode).
The obvious question is: how is the application supposed to know how
many threads are appropriate? I don't have a great answer to that.
The number of threads you want are dependent on the number of cores
you have on the system (spawning more than that just gives the OS a
bit of a workout without speeding anything; after all, all the threads
are CPU-bound), and how much of the system resources you want to devote to
this task. The pthread virtualization doesn't give us a hint on the first
one; the second one is something the application might have a clue about.
Also, while the key generation and loading can take advantage of as many
threads as they can get, the signature generation and verification logic
can't. The signature generation logic is limited by the number of subtree
levels in the bottom merkle tree (plus one); the signature verification
logic is limited to the number of merkle levels.
- Use of malloc
If you go through the code, you'll see an occasional call to malloc. In
allocate_working_key (hss_alloc.c), we use malloc to build the working key
structure, and we'll fail (return 0) on a malloc failure. The working key
structure will contain everything we need to generate signatures (so we
never *have* to do a malloc later). Now, we will try to perform malloc's
elsewhere, however they are always strictly optional; we'll never fail
because malloc objected. Instead, the code will step to a "plan B" on a
failure, which will do the exact same job (but slower; if it wasn't slower,
it wouldn't be plan B).
The testing code (and demo.c) does use malloc, and will fail if the
malloc fails; these tools can be expected to run only when we have plenty
of memory, hence we feel we don't need a plan B there.
- Use of Variable Length Arrays
Now, we used to use VLAs (a C99 language feature) at places. However,
someone's compiler couldn't handle them (even though they implemented the
rest of the C99 features we used), and so we went and reworked the code to
remove them. In any case, removing the VLAs might make this code a bit more
small-end-device friendly (as those small devices tend not to have huge
stacks).
- Zeroization
Whenever we're done with a value whose leakage might allow someone to
generate a forgery, we zeroize it (hss_zeroize) before we release the
memory. Now, it turns out that most of the values we compute wouldn't
actually allow a forgery; the ones that do are: the seeds, the OTS private
keys and the OTS previous winternitz chain values. We also zeroize the
aux data HMAC values (those wouldn't allow a forgery; they would allow
someone to cause us to misbehave by modifying the aux data).
- Use of globals
There are no globals (other than the optional debugging flag hss_verbose).
All memory is either a buffer provided by the calling application,
dynamically allocated (malloc), or automatic (stack). Globals are evil,
reentrancy is good. The regression code does have globals (for things like
coordinating with the randomness generator; no normal program has any need
for that); the regression code isn't intended for use for other programs...
- Use of floating point
Crypto code hardly ever uses floating point. However, we're an exception;
in the hss_generate.c function, we do actually do some float point
computations; we do this to figure out a reasonable way to split the
building task between threads (and for this task, the imprecision inherent
in floating point is not a problem; if two ways of splitting the task are
so close in cost that the rounding error actually makes a difference, it
doesn't really matter which way we go). Now, we include a macro
(DO_FLOATING_POINT) which disables the use of floating point; a platform
that does not support floating point can set it to 0, and that code is
commented out. Now, if you use threading, you really want DO_FLOATING_POINT
If you don't, it doesn't matter for performance, and actually, turing it off
comments out quite a bit of code that you doesn't actually buy you anything;
it doesn't matter how we divide tasks between threads if the same thread
will end up performing them all anyways...
We also use floating point in the regression code; to figure out when to
update the displayed % completed.
- Debugging
Good luck...
- Regression tests
This package includes the test_hss executable, which is meant to be a set
of regression tests for this package. It ought to be run early and often
(with "test_hss all" being a good default), if not in -full mode, it's
relatively quick.
The usage is:
test_hss [-f] [-q] [-full] test_1 test_2 test_3
Without any parameters, it gives a usage message (and the list of the
supported tests)
The parameters are:
-f Normally, this test suite stops at the first failure; with this flag
it keeps on going
-q Normally, some of the longer tests some minimal progress messages
(percent complete); with this flag, it just lists the test being run
and a pass/fail message (and possibly a failure reason).
-full Normally, each test takes no more than 15 seconds to run (to
encourage you to run it early and often). With this flag, we allow
the tests to run much longer; warning, it'll take several hours to
run the full test suite in -full mode. I'm also not convinced that
-full mode gives you that much better coverage, on the other hand,
it really has found problems that the short tests haven't, and so it
should be run occasionally
all This is a shorthand to specify every test the test suite knows about.
On my test machine, 'test_hss all' currently takes about 70 seconds.
Now, there are things that the regression tests currently don't test:
- Do we assume malloc gives us an initiallized buffer?
- Do we handle malloc failures as designed?
- How about thread spawn failures? We're supposed to handle those
transparently
- Do we have any memory leaks?
- We're supposed to be able to limit the number of times we hash any
specific secret; do we actually abide by that?
Testing those would require more infrastructure than we have right now.
Also, it might not be that bad of an idea to run a code-coverage tool to
check out how much of the code the regression tests actually tests.
- Files; this is a listing of the files that make up this subsystem, and a
brief description of what's in them. Note that for many .c files, we have a
.h file with the prototypes; we list those together.
common_defs.h This is a central spot to put definitions of general
interest of the entire subsystem
demo.c This is an example program that uses this subsystem; it
implements a simple file signer. Note: because it
doesn't get that great of randomness (due to the need
to restrict ourselves to standard C), it probably
shouldn't be used as is. It does try to use
/dev/urandom; that's of help only on OS's that
actually implement /dev/urandom
endian.[ch] Routines to read/write values in bigendian format
hash.[ch] Routines to implement the hashing API, as used by the
rest of the subsystem. This currently only implements
SHA-256, it'll support other hashes once LMS does.
Note that there really are three separate APIs to do
hashing (hash this entire string, hash this string
using this context variable as a temp, do on-line
hashing); those are all used at various times.
hss.c This used to be where all the code lived; however, we
have since migrated the vast majority of the routines
to more appropriate source files (so we don't have a
huge .c file). Now, it just has a handful of routines
that don't have a better home.
hss.h This is the public include file for the entire
subsystem; this is the file that we expect an
application that uses this subsystem to include.
hss_alloc.c This is the routine whose job it is to allocate a
working key (struct hss_working_key). Note that it
doesn't actually put anything in there, it just
allocates the memory (and initializes some
key-independent fields).
hss_aux.[ch] These are the routines that handle auxiliary data (that
is, data that holds part of the top level Merkle tree,
and is used to speed up the key load process)
hss_common.c These are routines that are of interest to both an
implementation that generates signatures, and one that
only does signature verification.
hss_common.h These are the prototypes for the above routines; we
list it separately to emphesize that this may be
included by an application.
hss_compute.[ch] These are routines that do some common computation;
these are shared between multiple source files.
hss_derive.[ch] This is the structure that does key derivation. It
allows a trade-off between efficiency, and side channel
resistance.
hss_generate.c This is the routine that takes an allocated working key
(hss_alloc.c), and loads a private key into it. Sound
simple? Well, if you go through this, you'll find out
that it isn't.
hss_internal.h These are the prototypes and structures that are common
to this subsystem, but shouldn't be used outside of it.
hss_keygen.c This is the routine that generates a public/private
keypair.
hss_param.c These are routines that deal with parameter sets.
hss_reserve.[ch] These are routines that deal with reservations, and
updating the sequence number in a private key.
hss_sign.c This is the routine that generates an HSS signature.
hss_sign_inc.c This is the routine that generates an HSS signature,
in an incremental fashion.
hss_sign_inc.h This is the public include file for the incremental
signature routines. It's in its own file because it
needs to pull in some internal files (e.g. hash.h)'
that we generally don't need to hand to people
hss_thread.h This is the internal prototype for our internal
threading abstraction. We have two implementations of
the abstraction, we expect to link with one of the two.
hss_thread_pthread.c This is the implementation of the threading API that
links with the POSIX pthread library, and uses that to
to multithreading.
hss_thread_single.c This is the implementation of the threading API that
assumes that we don't have any threading support at all
(and so the main thread does all the work).
hss_verify.c This is the routine that verifies an HSS signature. It
is in its own file so thar someone who wants to only
verify signatures doesn't need to pull in the signing
logic.
hss_verify.h This is the public API for the verifier.
hss_verify_inc.c This is the routine that verifies an HSS signature in
an incremental fashion; that is, you can hand it
pieces of the message in succession (so we don't need
to assume the entire message fits in memory). It
is in its own file so thar someone who wants to only
verify signatures doesn't need to pull in the signing
logic. This API is somewhat less efficient that the
hss_verify.c logic if you're multithreaded (we can't
paralleize the check of the bottom signature with
the upper ones), however it's not a huge delta.
hss_verify_inc.h This is the public API for the incremental verifier.
It's in its own file because it needs to pull in some
internal files (e.g. hash.h) that we generally don't
need to hand to people
hss_zeroize.[ch] This is a routine to clear out memory; it is used to
make sure we don't accidently leak any secrets by
free()ing them, or having them go out of scope.
lm_common.[ch] These are routines that support the (single level) LMS
routines that are of interest to both an implementation
that generates signatures, and one that only does
signature verification.
lm_ots.h Prototype for the OTS signature routines.
lm_ots_common.[ch] These are routines that support the OTS routines that
are of interest to both an implementation that
generates signatures, and one that only does
signature verification.
lm_ots_sign.c Routines that generate OTS public keys, and OTS
signatures
lm_ots_verify.c Routine that computes the public key given an OTS
signature and a message.
lm_verify.[ch] Routine that verifies an LMS signature
sha256.c Pure C implementation of SHA-256; it is included if
USE_OPENSSL is 0. This is provided in case you don't
have OpenSSL available.
sha256.h Routine that computes the SHA-256 hash. This is the
same interface that OpenSSL presents. We also
include a #define (USE_OPENSSL); if 1, these are
direct calls to OpenSSL; if 0, we use our own
implementation (in case you don't have OpenSSL
handy). If OpenSSL is available, use that - it has
an assembly language SHA-256 implementation, and that
performs better.
test_hss.c This is the main driver code for the regression tests.
It doesn't actually implement any tests itself;
instead, it deals with handling the test run
test_hss.h This has the prototypes for all the actual tests
test_*.c These are the actual tests.
And, one final note: I would claim to be a competent C programmer, however my
skills at generating makefiles are laughable. I would ask that you keep the
taunting about my lack of ability there to reasonable bounds.