forked from villevoutilainen/papers
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathhashing.html
3478 lines (3003 loc) · 108 KB
/
hashing.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Types Don't Know #</title>
<style>
p {text-align:justify}
li {text-align:justify}
blockquote.note
{
background-color:#E0E0E0;
padding-left: 15px;
padding-right: 15px;
padding-top: 1px;
padding-bottom: 1px;
}
ins {color:#00A000}
del {color:#A00000}
</style>
</head>
<body>
<address align=right>
Document number: D3980
<br/>
<br/>
<a href="mailto:[email protected]">Howard E. Hinnant</a><br/>
<a href="mailto:[email protected]">Vinnie Falco</a><br/>
<a href="mailto:[email protected]">John Bytheway</a><br/>
2014-05-24
</address>
<hr/>
<h1 align=center>Types Don't Know #</h1>
<h2>Contents</h2>
<ul>
<li><a href="#Introduction">Introduction</a></li>
<li><a href="#Example">The Example</a></li>
<li><a href="#Solution1">Solution 1: Specialize <code>std::hash<X></code></a>
<ul>
<li><a href="#Solution1B">What about implementing it with N3333?</a></li>
</ul>
</li>
<li><a href="#generalpurpose">How to get X to use a general purpose hashing algorithm</a>
<ul>
<li><a href="#Universal">Introducing the Universal hash function!</a></li>
<li><a href="#hash_append">What is <code>hash_append</code>?</a>
<ul>
<li><a href="#hash_append_rules">Rules Relating <code>hash_append</code> to <code>operator==</code></a></li>
</ul>
</li>
<li><a href="#hash_append_vector"><code>hash_append</code> for <code>vector<T, A></code></a></li>
<li><a href="#hash_append_pair"><code>hash_append</code> for <code>std::pair<T, U></code></a></li>
<li><a href="#hash_append_int"><code>hash_append</code> for <code>int</code></a></li>
<li><a href="#is_contiguously_hashable">An Optimization: <code>is_contiguously_hashable<T></code>:</a></li>
<li><a href="#hash_combine">Wait a minute. Isn't <code>hash_append</code> the same thing as <code>boost::hash_combine</code>?</a></li>
<li><a href="#serialization">Wait a minute. Isn't <code>hash_append</code> the same thing as serialization?</a></li>
<li><a href="#variadic">Is there a variadic version of <code>hash_append</code>?</a></li>
<li><a href="#adapt_algorithm">How easily can algorithms other than FNV-1a be used?</a></li>
<li><a href="#switch_algorithm">What is involved in switching hashing algorithms?</a></li>
<li><a href="#pimpl">How does one <code>hash_append</code> Pimpl designs?</a></li>
<li><a href="#seeding">How does one apply random seeding?</a></li>
<li><a href="#unordered">What about unordered containers?</a></li>
<li><a href="#testing">How does the quality of the resulting hash codes compare to the <code>hash_combine</code> solution?</a></li>
</ul>
</li>
<li><a href="#Summary">Summary</a>
<ul>
<li><a href="#proposedinfrastructure">Summary of proposed infrastructure</a></li>
</ul>
</li>
<li><a href="#type_erased_hasher">Appendix A: <code>type_erased_hasher</code></a></li>
<li><a href="#bikeshed">Appendix B: B is for Bike Shed</a></li>
<li><a href="#debugHasher">Appendix C: <code>debugHasher</code></a></li>
<li><a href="#wording">Appendix D: Proposed Wording</a></li>
<li><a href="#Acknowledgments">Acknowledgments</a></li>
</ul>
<a name="Introduction"></a><h2>Introduction</h2>
<p>
This paper proposes a new hashing infrastructure that completely decouples
hashing algorithms from individual types that need to be hashed. This
decoupling divides the hashing computation among 3 different programmers who
need not coordinate with each other:
</p>
<ol>
<li><p>
Authors of hashable types (keys of type <code>K</code>) write their hashing
support just once, using no specific hashing algorithm. This code resembles
(and is approximately the same amount of work as) <code>operator==</code> and
<code>swap</code> for a type.
</p></li>
<li><p>
Authors of hashing algorithms write a functor (e.g. <code>H</code>) that
operates on a contiguous chunk of generic memory, represented by a <code>void
const*</code> and a number of bytes. This code has no concept of a specific key
type, only of bytes to be hashed.
</p></li>
<li><p>
Clients who want to hash keys of type <code>K</code> using hashing algorithm
<code>H</code> will form a functor of type <code>std::uhash<H></code> to
give to an unordered container.
</p>
<blockquote><pre>
unordered_set<K, uhash<H>> my_set;
</pre></blockquote>
<p>
Naturally, there could be a default hashing algorithm supplied by the std::lib:
</p>
<blockquote><pre>
unordered_set<K, uhash<>> my_set;
</pre></blockquote>
</li>
</ol>
<p>
To start off with, we emphasize: there is nothing in this proposal that changes
the existing <code>std::hash</code>, or the unordered containers. And there is
also nothing in this proposal that would prohibit the committee from standardizing
both this proposal, and either one of
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html">N3333</a>
or
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf">N3876</a>.
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html">N3333</a>
and
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf">N3876</a>
contradict each other, and thus compete with each other. Both cannot be
standardized. This proposal, on the other hand, addresses a problem not addressed
by
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html">N3333</a>
or
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf">N3876</a>.
Nor does this proposal depend upon anything in
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html">N3333</a>
or
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf">N3876</a>.
</p>
<p>
This paper simply takes a completely different approach to producing hash
codes from types, in order to solve a problem that was beyond the scope of
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html">N3333</a>
and
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf">N3876</a>.
The problem solved herein is how to support the hashing of N different types of
keys using M different hashing algorithms, using an amount of source code that
is proportional to N+M, as opposed to the current system based on
<code>std::hash<T></code> which requires an amount of source code
proportional to N*M. And consequently in practice today M==1, and the single
hashing algorithm is supplied only by the std::lib implementor. As it is too
difficult and error prone for the client to supply alternative algorithms for
all of the built-in scalar types (<code>int</code>, <code>long</code>,
<code>double</code>, etc.). Indeed, it has even been too difficult for the
committee to supply hashing support for all of the types our clients might
reasonably want to use as keys: <code>pair</code>, <code>tuple</code>,
<code>vector</code>, <code>complex</code>, <code>duration</code>,
<code>forward_list</code> etc.
</p>
<p>
This paper makes ubiquitous hash support for most types as easy and as
practical as is today's support for <code>swap</code> and
<code>operator==</code>.
</p>
<p>
This paper starts with an assertion:
</p>
<blockquote class=note><p>
Types should not know how to hash themselves.
</p></blockquote>
<p>
The rest of this paper begins with demonstrating the problems created when
software systems assume that types do know how to hash themselves, and what
can be done to solve these problems.
</p>
<a name="Example"></a><h2>The Example</h2>
<p>
Instead of starting with a basic example like <code>std::string</code> or
<code>int</code>, this paper will introduce an example class X
that is meant to be representative of a type that a programmer would write,
and would want to create a hash code for:
</p>
<blockquote><pre>
namespace mine
{
class X
{
std::tuple<short, unsigned char, unsigned char> date_;
std::vector<std::pair<int, int>> data_;
public:
X();
// ...
friend bool operator==(X const& x, X const& y)
{
return std::tie(x.date_, x.data_) == std::tie(y.date_, y.data_);
}
};
} // mine
</pre></blockquote>
<blockquote class=note><p>
How do we write the hash function for X?
</p></blockquote>
<a name="Solution1"></a><h2>Solution 1: Specialize <code>std::hash<X></code></h2>
<p>
If we standardize
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf">N3876</a>
which gives us <code>hash_combine</code> and <code>hash_val</code> from
<a href="http://www.boost.org/doc/libs/1_55_0/doc/html/hash/combine.html">boost</a>,
then this is relatively doable:
</p>
<blockquote><pre>
} // mine
namespace std
{
template <>
struct hash<mine::X>
{
size_t
operator()(mine::X const& x) const noexcept
{
size_t h = hash<tuple_element<0, decltype(x.date_)>::type>{}(get<0>(x.date_));
hash_combine(h, get<1>(x.date_), get<2>(x.date_));
for (auto const& p : x.data_)
hash_combine(h, p.first, p.second);
return h;
}
};
} // std
</pre></blockquote>
<p>
First we need to break out of our own namespace, and then specialize
<code>std::hash</code> in <code>namespace std</code>. And we also need to add
a <code>friend</code> statement to our class X:
</p>
<blockquote><pre>
friend struct std::hash<X>;
</pre></blockquote>
<p>
Without <code>hash_combine</code> from
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf">N3876</a>
we would have to write our own <code>hash_combine</code>. This could easily
result in a bad hash function as aptly described in
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf">N3876</a>.
</p>
<a name="Solution1B"><h3>What about implementing it with
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html">N3333</a>?</h3>
<p>
In our first attempt to use the tools presented in
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html">N3333</a>,
we were surprised at the difficulty, as we were expecting it to be easier.
However after studying the reference implementation in
<a href="http://llvm.org">LLVM</a>, we succeeded in writing the following
<code>friend</code> function of X:
</p>
<blockquote><pre>
friend
std::hash_code
hash_value(X const& x)
{
using std::hash_value;
return std::hash_combine
(
hash_value(std::get<0>(x.date_)),
hash_value(std::get<1>(x.date_)),
hash_value(std::get<2>(x.date_)),
std::hash_combine_range(x.data_.begin(), x.data_.end())
);
}
</pre></blockquote>
<p>
We also strongly suspect that with a little more work on the proposal, this
could be simplified down to:
</p>
<blockquote><pre>
friend
std::hash_code
hash_value(X const& x)
{
using std::hash_value;
return std::hash_combine(hash_value(x.date_), std::hash_combine_range(x.data_.begin(), x.data_.end()));
}
</pre></blockquote>
<p>
Or possibly even:
</p>
<blockquote><pre>
friend
std::hash_code
hash_value(X const& x) noexcept
{
using std::hash_value;
return std::hash_combine(hash_value(x.date_), hash_value(x.data_));
}
</pre></blockquote>
<p>
The reduced burden on the author of X on writing the code to hash X is very
much welcomed! However, hashing algorithms are notoriously difficult to write.
Has the author of X written a <i>good</i> hashing algorithm?
</p>
<p>
The answer is that the author of X does not know, until he experiments with
his data. The hashing algorithm is supplied by the std::lib implementor. If
testing reveals that the algorithm chosen by the std::lib implementor is not
appropriate for the client's data set, then <b>everything</b> offered by both
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html">N3333</a>
and
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf">N3876</a>
is for naught. The author of X is on his own, starting from scratch, to
build an alternate hashing algorithm -- even if just to experiment.
</p>
<p>
This concern is not theoretical. If the keys to be hashed can be influenced by
a malicious attacker, it is quite possible for the attacker to arrange for
<i>many distinct</i> keys that all hash to the <i>same</i> hash code. Even
some <i>seeded</i> hashing algorithms are vulnerable to such an attack.
</p>
<p>
<a href="https://131002.net/siphash/murmur3collisions-20120827.tar.gz">Here</a>
is a very short and fast C++ program that can generate as many distinct keys as
you like which all hash to the same hash code using
<a href="http://code.google.com/p/smhasher/wiki/MurmurHash3">MurmurHash3</a>,
even with a randomized seed.
<a href="https://131002.net/siphash/citycollisions-20120730.tar.gz">Here</a> is
another such C++ program demonstrating a similar seed-independent attack on
<a href="https://code.google.com/p/cityhash/">CityHash64</a>. These attacks
do not mean that these are bad hashing algorithms. They simply are evidence
that it is not wise to tie yourself down to a single hashing algorithm. And
if the effort to change, or experiment with, hashing algorithms takes effort
that is O(N) (where N is the number of types or sub-types to be hashed), then
one is tied down.
</p>
<p>
This paper demonstrates infrastructure allowing the author of X to switch
hashing algorithms with O(1) work, regardless of how many sub-types of X need
to be hashed. No matter what hashing algorithm is used, the C++ code to hash
X is the same:
</p>
<blockquote><pre>
template <class HashAlgorithm>
friend
void
hash_append(HashAlgorithm& h, X const& x) noexcept
{
using std::hash_append;
hash_append(h, x.date_, x.data_);
}
</pre></blockquote>
<p>
With this proposal, the author of X gets simplicity, without being heavily invested
in any single hashing algorithm. The hashing algorithm is completely encapsulated
in the templated parameter <code>HashAlgorithm</code>, and the author of X remains
fully and gratefully ignorant of any specific hashing algorithm.
</p>
<a name="generalpurpose"></a><h2>How to get X to use a general purpose hashing algorithm</h2>
<p>
The key to solving this problem is the recognition of one simple observation:
</p>
<blockquote class=note><p>
Types should not know how to hash themselves. However types do know what parts
of their state should be exposed to a hashing algorithm.
</p></blockquote>
<p>
The question now becomes: How do you present X to a general purpose hashing
algorithm without binding it to any specific algorithm?
</p>
<p>
Just as an example, here is a very simple hashing algorithm that many have used
with great success:
</p>
<blockquote><pre>
std::size_t
fnv1a (void const* key, std::size_t len)
{
unsigned char const* p = static_cast<unsigned char const*>(key);
unsigned char const* const e = p + len;
std::size_t h = 14695981039346656037u;
for (; p < e; ++p)
h = (h ^ *p) * 1099511628211u;
return h;
}
</pre></blockquote>
<p>
Although most modern hashing algorithms are much more complicated than
<a href="http://www.isthe.com/chongo/tech/comp/fnv/index.html"><code>fnv1a</code></a>
shown above, there are similarities among them.
</p>
<ul>
<li>They generally take a stream of bytes as their input. This is often
specified as a <code>void const*</code> and a <code>size_t</code> length.</li>
<li>
This interface implies that they work on a contiguous array of bytes.
</li>
<li>
The algorithms generally have an initialization stage, often taking an
optional seed, followed by an accumulation stage which depends on the
supplied bytes, followed by a finalization stage after all of the bytes
are consumed.
</li>
</ul>
<p>
Not all, but most of the algorithms also have the property that they consume
bytes in the order that they are received, possibly with a fixed sized
internal buffer. This characteristic can be taken advantage of in order to
hash <i>discontiguous</i> memory.
</p>
<p>
For example consider this minor repackaging of the FNV-1a algorithm:
</p>
<blockquote><pre>
class fnv1a
{
std::size_t state_ = 14695981039346656037u;
public:
using result_type = std::size_t;
void
operator()(void const* key, std::size_t len) noexcept
{
unsigned char const* p = static_cast<unsigned char const*>(key);
unsigned char const* const e = p + len;
for (; p < e; ++p)
state_ = (state_ ^ *p) * 1099511628211u;
}
explicit
operator result_type() noexcept
{
return state_;
}
};
</pre></blockquote>
<p>
Now the algorithm can be accessed in 3 stages:
</p>
<ol>
<li>The algorithm is initialized in a constructor, in this case the
implicit default constructor. Other constructors / initializations
should be possible. But we start out with this simplest of algorithms.</li>
<li>The algorithm consumes bytes in the <code>operator()(void const* key,
std::size_t len)</code> function. Note that this function can be called any
number of times. In each call the hashed memory is contiguous. But there is
no requirement at all that separate calls refer to a single block of memory.
On each call, the state of the algorithm is recalled from the previous call
(or from the initialization step) and updated with the new <code>len</code>
bytes located at <code>key</code>.</li>
<li>The algorithm is finalized when the object is converted to a
<code>result_type</code> (in this case a <code>size_t</code>).
This is the finalization stage, which in this
case is trivial, but could be arbitrarily complex.</li>
</ol>
<p>
With the FNV-1a algorithm divided into its 3 stages like this, one can call
it in various ways, for example:
</p>
<blockquote><pre>
fnv1a::result_type
hash_contiguous(int (&data)[3])
{
fnv1a h;
h(data, sizeof(data));
return static_cast<fnv1a::result_type>(h);
}
</pre></blockquote>
<p>
or
</p>
<blockquote><pre>
fnv1a::result_type
hash_discontiguous(int data1, int data2, int data3)
{
fnv1a h;
h(&data1, sizeof(data1));
h(&data2, sizeof(data2));
h(&data3, sizeof(data3));
return static_cast<fnv1a::result_type>(h);
}
</pre></blockquote>
<p>
But either way it is called, given the same inputs, the algorithm outputs
the exact same result:
</p>
<blockquote><pre>
int data[] = {5, 3, 8};
assert((hash_contiguous(data) == hash_discontiguous(5, 3, 8)));
</pre></blockquote>
<p>
We can say that <code>fnv1a</code> meets the requirements of a
<code>HashAlgorithm</code>. A <code>HashAlgorithm</code> is a class type
that can be constructed (default, or possibly with seeding), has an
<code>operator()</code> member function with the signature represented above.
The <code>operator()</code> member function processes bytes, updating the
internal state of the <code>HashAlgorithm</code>. This internal state can be
arbitrarily complex. Indeed an extreme example of internal state could be a
copy of every chunk of memory supplied to the <code>HashAlgorithm</code>.
And finally a <code>HashAlgorithm</code> can be explicitly converted to the
nested type <code>result_type</code>, which when used with the unordered
containers should be an alias for <code>size_t</code>.
</p>
<p>
At all times during its lifetime, a <code>HashAlgorithm</code> is
<code>CopyConstructible</code> and <code>CopyAssignable</code>, with each
copy getting an independent copy of the state from the right-hand-side of the
copy (value semantics -- no aliasing among copies). Thus if one knew that
two sequences of data shared a common prefix, one could hash the prefix in
just one sequence, make a copy of the <code>HashAlgorithm</code>, and then
continue after the prefix in each sequence with the two independent
<code>HashAlgorithm</code>s. This would be a pure optimization, getting the
same results as if one had hashed each sequence in full.
</p>
<a name="Universal"></a><h3>Introducing the Universal hash function!</h3>
<p>
Given the concept of <code>HashAlgorithm</code>, a universal hash functor,
which takes <b>any</b> type <code>T</code> can now be written (almost):
</p>
<blockquote><pre>
template <class HashAlgorithm>
struct uhash
{
using result_type = typename HashAlgorithm::result_type;
template <class T>
result_type
operator()(T const& t) const noexcept
{
HashAlgorithm h;
using std::hash_append;
hash_append(h, t);
return static_cast<result_type>(h);
}
};
</pre></blockquote>
<p>
Now one can use <code>uhash<fnv1a></code> as the hash function for
<code>std::unordered_map</code>, for example:
</p>
<blockquote><pre>
std::unordered_map<MyKey, std::string, uhash<fnv1a>> the_map;
</pre></blockquote>
<p>
First note several important attributes of <code>uhash</code>:
</p>
<ol>
<li>
<code>uhash</code> depends only on the hashing algorithm, which is
encapsulated in the <code>HashAlgorithm</code>. <code>uhash</code> does not
depend upon the type <code>T</code> being hashed.
</li>
<li>
<code>uhash</code> is simple. Though such a utility should certainly be
supplied by the std::lib, any programmer can very easily implement their own
variant of <code>uhash</code> for desired customizations (e.g.
<a href="http://en.wikipedia.org/wiki/Random_seed">random seeding</a>,
<a href="http://en.wikipedia.org/wiki/Salt_(cryptography)">salting</a>,
or <a href="http://en.wikipedia.org/wiki/Padding_(cryptography)">padding</a>),
<b>without</b> having to revisit the hashing code for distinct types.
</li>
<li>
For applications other than unordered containers, and for hashing algorithms
that support it, the programmer can easily create a hash functor that returns
something besides a <code>size_t</code>. For example, this could come in
handy for computing a
<a href="http://en.wikipedia.org/wiki/SHA-2">SHA-256</a> result. And all
without having to revisit each individual type!
</li>
</ol>
<p>
Let's walk through <code>uhash</code> one step at a time.
</p>
<ol>
<li><p>
The <code>HashAlgorithm</code> is constructed (default constructed in this
example, but that is not the only possibility). This step initializes the
hashing algorithm encapsulated in the <code>HashAlgorithm</code>.
</p></li>
<li><p>
It is appended to using <code>t</code> as a key. The function
<code>hash_append</code> is implemented for each type that supports hashing.
We will see below that such support code need be written only once per type in
order to support many hashing algorithms. It is implemented in the type's own
namespace, but there are implementations in namespace std for most scalars
(just like <code>swap</code>).
</p></li>
<li><p>
And then the <code>HashAlgorithm</code> is explicitly converted to the
desired result. This is where the algorithm is "finalized."
</p></li>
</ol>
<p>
The above <code>hash</code> functor is accessing the generic hashing algorithm
by its 3 distinct phases. Additionally, this <code>hash</code> functor could
even be defaulted to use your favorite hash algorithm:
</p>
<blockquote><pre>
template <class HashAlgorithm = fnv1a> struct uhash;
</pre></blockquote>
<p>
The question usually arises now: Are you proposing that <code>uhash<></code>
replace <code>hash<T></code> as the default hash functor in the
unordered containers? The answer is it really almost doesn't matter. With
templated using declarations, it is just so easy for programmers to specify
their own defaults:
</p>
<blockquote><pre>
namespace my
{
template <class Key, class T, class Hash = std::uhash<>, class Pred = std::equal_to<Key>,
class Alloc = std::allocator<std::pair<Key const, T>>>
using unordered_map = std::unordered_map<Key, T, Hash, Pred, Alloc>;
} // my
// ...
my::unordered_map<MyKey, std::string> the_map; // uses std::uhash<> instead of std::hash<MyKey>
</pre></blockquote>
<a name="hash_append"></a><h3>What is <code>hash_append</code>?</h3>
<p>
The <code>hash_append</code> function is the way that individual types
communicate with the <code>HashAlgorithm</code>.
</p>
<p>
Each type <code>T</code> is responsible only for exposing its hash-worthy state
to the <code>HashAlgorithm</code> in the function <code>hash_append</code>.
<code>T</code> is <i>not</i> responsible for combining hash codes. Nor is it
responsible for any hashing arithmetic whatsoever. It is only responsible for
pointing out where its data is, how many different chunks of data there are,
and what order they should be presented to the <code>HashAlgorithm</code>.
</p>
<p>
For example, here is how X might implement <code>hash_append</code>:
</p>
<blockquote><pre>
class X
{
std::tuple<short, unsigned char, unsigned char> date_;
std::vector<std::pair<int, int>> data_;
public:
// ...
friend bool operator==(X const& x, X const& y)
{
return std::tie(x.date_, x.data_) == std::tie(y.date_, y.data_);
}
<b>// Hook into the system like this
template <class HashAlgorithm>
friend void hash_append(HashAlgorithm& h, X const& x) noexcept
{
using std::hash_append;
hash_append(h, x.date_);
hash_append(h, x.data_);
}</b>
}
</pre></blockquote>
<p>
Like <code>swap</code>, <code>hash_append</code> is a customization point for
each type. Only a type knows what parts of itself it should expose to a
<code>HashAlgorithm</code>, even though the type has no idea what algorithm
is being used to do the hashing. Note that X need not concern itself with
details like whether or not its sub-types are <i>contiguously hashable</i>.
Those details will be handled by the <code>hash_append</code> for the
individual sub-types. The <i>only</i> information the
<code>hash_append</code> overload for X must implement is what sub-types need
to be presented to the <code>HashAlgorithm</code>, and in what order.
Furthermore the <code>hash_append</code> function is intimately tied to the
<code>operator==</code> for the same type. For example if for some reason
<code>x.data_</code> did not participate in the equality computation, then it
should also not participate in the <code>hash_append</code> computation.
</p>
<a name="hash_append_rules"></a><h4>Rules Relating <code>hash_append</code> to <code>operator==</code></h4>
<p>
For all combination of two values of X, <code>x</code> and <code>y</code>, there
are two rules to follow in designing <code>hash_append</code> for type X.
Actually the second rule is more of a guideline. But it should be followed as
closely as possible:
</p>
<ol>
<li><p>
If <code>x == y</code>, then both <code>x</code> and <code>y</code>
<i>shall</i> send the same message to the <code>HashAlgorithm</code> in
<code>hash_append</code>.
</p></li>
<li><p>
If <code>x != y</code>, then <code>x</code> and <code>y</code> <i>should</i>
send different messages to the <code>HashAlgorithm</code> in
<code>hash_append</code>.
</p></li>
</ol>
<p>
It is very important to keep these two rules in mind when designing the
<code>hash_append</code> function for any type, or for any instantiation of a
class template. Failure to follow the first rule will mean that equal values
hash to different codes. Clients such as unordered containers will simply
fail to work, resulting in run time errors if this rule is violated. Failure
to follow the second guideline will result in hash collisions for the two
different values that send identical messages to the
<code>HashAlgorithm</code>, and will thus degrade the performance of clients
such as unordered containers.
</p>
<a name="hash_append_vector"></a><h3><code>hash_append</code> for <code>vector<T, A></code></h3>
<p>
For example <code>std::vector<T, A></code> would
never expose its <code>capacity()</code>, since <code>capacity()</code> can be
different for <code>vector</code>'s that otherwise compare equal. Likewise
it should not expose its <code>allocator_type</code> to <code>hash_append</code>,
since this value also does not participate in the equality computation.
</p>
<p>
Should <code>vector</code> expose its <code>size()</code> to the
<code>HashAlgorithm</code>? To find out, lets look closer at the
<code>operator==</code> for <code>vector</code>:
</p>
<blockquote><p>
Two <code>vector</code>'s <code>x</code> and <code>y</code> compare equal if
<code>x.size() == y.size()</code> and if <code>x[i] == y[i]</code> for
<code>i</code> in the range of 0 to <code>x.size()</code>.
</p></blockquote>
<p>
To meet <a href="#hash_append_rules">rule 1</a>, it is sufficient that every
element in the <code>vector</code> be sent to the <code>HashAlgorithm</code>
as part of the <code>vector</code>'s message. A logical convention is that
the elements will be sent in order from <code>begin()</code> to
<code>end()</code>. But this alone will not satisfy <a
href="#hash_append_rules">rule 2</a>. Consider:
</p>
<blockquote><pre>
std::vector<std::vector<int>> v1{};
std::vector<std::vector<int>> v2{1};
assert(v1 != v2);
</pre></blockquote>
<p>
<code>v1</code> and <code>v2</code> are not equal. <code>v1.size() ==
0</code> and <code>v2.size() == 1</code>. However <code>v2.front().size()
== 0</code>. If an empty <code>vector<int></code> sends no message at
all to the <code>HashAlgorithm</code>, then <code>v2</code>, even though it
is not empty, also sends no message to the <code>HashAlgorithm</code>. And
therefore <code>v1</code> and <code>v2</code> send the same (0 length)
message to the <code>HashAlgorithm</code>, violating <a
href="#hash_append_rules">rule 2</a>.
</p>
<p>
One idea for fixing this is to special case 0-length <code>vector</code>s to
output a special value such as "empty" or 0. However in the first case the
result would be ambiguous with a <code>vector<string></code> of length
1 containing the string "empty". The second case has the exact same problem
but for <code>vector<int></code>.
</p>
<p>
The right way to fix this problem is to have <code>vector<T></code>
send its <code>size()</code> in addition to sending all of its members to the
<code>HashAlgorithm</code>. Now the only question is: Should it send its
<code>size</code> before or after sending its members to the
<code>HashAlgorithm</code>?
</p>
<p>
To answer this last question, consider another sequence container:
<code>forward_list<T></code>. It has the exact same issues as we have
been discussing for <code>vector<T></code>, but
<code>forward_list<T></code> has no <code>size()</code> member. In
order to send its <code>size()</code>, <code>forward_list<T></code> has
to loop through all of its members to first compute <code>size()</code>. In
order to avoid the requirement that <code>hash_append</code> for
<code>forward_list<T></code> make two passes through the list, we
should specify that the <code>size()</code> of the container is sent to the
<code>HashAlgorithm</code> <i>after</i> all of the elements are sent. And
for consistency, we should do this for all std-containers for which
<code>hash_append</code> is defined.
</p>
<blockquote><pre>
template <class HashAlgorithm, class T, class Alloc>
void
hash_append(HashAlgorithm& h, std::vector<T, Alloc> const& v) noexcept
{
for (auto const& t : v)
hash_append(h, t);
hash_append(h, v.size());
}
</pre></blockquote>
<p>
I.e. <code>vector</code> considers itself a message composed of 0 or more
sub-messages, and appends each sub-message (in order) to the state of the
generic <code>HashAlgorithm</code>. And this is followed with a final
message consisting of the <code>size()</code> of the <code>vector</code>.
</p>
<p>
Note that as
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3333.html">N3333</a>
and
<a href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3876.pdf">N3876</a>
both stand today, this critically important but subtle detail is not treated,
and is left up to the client (the author of X) to get right. This proposal
states that this is a detail that the <code>hash_append</code> for
<code>vector</code> (and every other hashable std-container) is responsible for.
</p>
<p><b>Emphasis</b></p>
<blockquote><p>
The message a type sends to a <code>HashAlgorithm</code> is part of its
public API. E.g. whether or not a container includes its <code>size()</code>
in its <code>hash_append</code> message, and if so, whether the
<code>size()</code> is prepended or appended to the message, is critical
information a type's client needs to know, in order to ensure that their
composition of some type's message with another type's message doesn't
produce an ambiguous message (doesn't create collisions).
</p><p>
The standard should clearly document the message emanating from every
<code>hash_append</code> it defines, to the extent possible. It might not be
possible to nail down that an implementation is using IEEE floating point or
two's complement signed integers. But the standard can certainly document
the message produced by a <code>vector</code> or any other std-defined class
type.
</p></blockquote>
<a name="hash_append_pair"></a><h3><code>hash_append</code> for <code>std::pair<T, U></code></h3>
<p>
The situation is simpler for <code>std::pair<T, U></code>:
</p>
<blockquote><pre>
template <class HashAlgorithm, class T, class U>
void
hash_append (HashAlgorithm& h, std::pair<T, U> const& p) noexcept
{
hash_append (h, p.first);
hash_append (h, p.second);
}
</pre></blockquote>
<p>
All there is to do is to just <code>hash_append</code> the first and second
members of the pair.
</p>
<a name="hash_append_int"></a><h3><code>hash_append</code> for <code>int</code></h3>
<p>
Eventually <code>hash_append</code> will drill down to a scalar type such as
<code>int</code>:
</p>
<blockquote><pre>
template <class HashAlgorithm>
void
hash_append(HashAlgorithm& h, int const& i) noexcept
{
h(&i, sizeof(i));
}
</pre></blockquote>
<p>
Whereupon a contiguous chunk of memory is actually accumulated by the
<code>HashAlgorithm</code>, using the <code>HashAlgorithm</code>'s
<code>operator()</code>. Recall that the <code>HashAlgorithm</code> has a
member function <code>operator()(const void* key, std::size_t len)
noexcept</code>. And the <code>int</code> is just a chunk of
<i>contiguous</i> memory that is <i>hashable</i>. It is now prudent to
deeply consider what it means to say that a type (such as <code>int</code>)
is <i>contiguously hashable</i>.
</p>
<p>
A type <code>T</code> is <i>contiguously hashable</i> if for all combinations of
two values of a type, say <code>x</code> and <code>y</code>, if <code>x ==
y</code>, then it must also be true that <code>memcmp(addressof(x),
addressof(y), sizeof(T)) == 0</code>. I.e. if <code>x == y</code>, then
<code>x</code> and <code>y</code> have the same bit pattern representation. A
2's complement <code>int</code> satisfies this property because every bit
pattern an <code>int</code> can have results in a distinct value (<a
href="#hash_append_rules">rule 2</a>). And there are no "padding bits" which might take on
random values. This property is necessary because if two values are equal, then
they must hash to the same hash code (<a href="#hash_append_rules">rule 1</a>).
</p>
<a name="is_contiguously_hashable"></a><h3>An Optimization: <code>is_contiguously_hashable<T></code>:</h3>
<p>
With that in mind we can easily imagine a type trait:
</p>
<blockquote><pre>
template <class T> struct is_contiguously_hashable;
</pre></blockquote>
<p>
which derives from either <code>true_type</code> or <code>false_type</code>. And
on 2's complement systems,
<code>is_contiguously_hashable<int>::value</code> is <code>true</code>.
And we might anticipate that some other types, such as <code>char</code> and
<code>long long</code> are also <i>contiguously hashable</i>. With this
tool we can now easily write <code>hash_append</code> for all contiguously
hashable types:
</p>
<blockquote><pre>
template <class HashAlgorithm, class T>
inline
std::enable_if_t
<
is_contiguously_hashable<T>::value
>
hash_append(HashAlgorithm& h, T const& t) noexcept
{
h(addressof(t), sizeof(t));