<!DOCTYPE html>
<html>
<head>
<title>L2F</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<link rel="stylesheet" href="fonts/quadon/quadon.css">
<link rel="stylesheet" href="fonts/gentona/gentona.css">
<link rel="stylesheet" href="slides_style.css">
<script type="text/javascript" src="assets/plotly/plotly-latest.min.js"></script>
</head>
<body>
<textarea id="source">
name:opening
**A Theory and Practice of the Lifelong Learnable Forest**<br>
Hayden Helm | Ronak Mehta <br>
Carey E. Priebe | Raman Arora | [Joshua T. Vogelstein](https://neurodata.io) (JHU) <br>
<!-- {[BME](https://www.bme.jhu.edu/),[ICM](https://icm.jhu.edu/),[CIS](http://cis.jhu.edu/),[KNDI](http://kavlijhu.org/)}@[JHU](https://www.jhu.edu/) -->
<a href="https://neurodata.io"><img src="images/neurodata_purple.png" style="height:430px;"/></a>
<!-- <img src="images/funding/jhu_bme_blue.png" STYLE="HEIGHT:95px;"/> -->
<!-- <img src="images/funding/KNDI.png" STYLE="HEIGHT:95px;"/> -->
<!-- <font color="grey"></font> -->
.foot[[jovo@jhu.edu](mailto:jovo@jhu.edu) | <http://neurodata.io/talks> | [@neuro_data](https://twitter.com/neuro_data)]
---
### Program goals
1. Formalize a general theory of (lifelong) learning
2. Develop the Lifelong Learning Forests (L2F) algorithm
3. Demonstrate that L2F achieves state-of-the-art lifelong learning
4. Prove that L2F theoretically "solves" lifelong learning
5. Discuss future plans
---
class: middle, inverse
## .center[A General Theory of Learning]
---
## What is Learning?
<img src="images/Vapnik71.png" style="width:500px;"/>
--
<img src="images/Valiant84.png" style="width:500px;"/>
---
## What is Learning?
<img src="images/Mitchell97.png" style="width:600px;"/>
---
## What is Learning?
Tom Mitchell 1997 (not exact quote):
"An algorithm $f$ learns from data $D_n$ with respect to task $T$ with performance measure $\mathcal{E}$, if $f$'s performance at task $T$ improves with $D_n$."
---
### What is a Task?
A task is defined by a septuple $T = \lbrace \mathcal{Z}, P, \mathcal{P}, \mathcal{A}, \mathcal{H}, \ell, {R} \rbrace$
| Definition | Notation | Example
|:--- |:--- |:--- |
| Measurements | $ \mathcal{Z}$ | $z_i = (x_i,y_i): i \in [n]$ |
| Density | $P\_\theta : \mathcal{Z} \rightarrow \mathbb{R}_{\geq 0}$ | standard normal
| Model | $\mathcal{P} := \lbrace P_\theta : \theta \in \Theta \rbrace$ | Gaussian
| Action space | $\mathcal{A}$ | turn left
| Hypotheses | $\mathcal{H} = \lbrace h: \mathcal{Z} \to \mathcal{A} \rbrace$ | $a = \mathbb{I}(z>0)$
| Loss Function | $\ell: \mathcal{H} \times \mathcal{A} \to \mathbb{R}_+$ | $ (a - y)^2$
| Risk Functional | $R: \mathcal{P} \times \mathcal{L} \times \mathcal{H} \to \mathbb{R}_+$ | $\mathbb{E}_P[\ell( h(z), a)]$
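
As a purely illustrative sketch (ours, not an L2F API), the seven ingredients can be bundled into a small container and instantiated for a toy 1-D classification task:

```python
# Purely illustrative (not an L2F API): a container for the task septuple
# T = {Z, P, calP, A, H, loss, R}; all names here are ours.
import math
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class Task:
    measurements: Sequence[Any]          # Z: e.g., (x_i, y_i) pairs
    density: Callable[[Any], float]      # P: z -> nonnegative density value
    model: str                           # calP: e.g., "Gaussian"
    actions: Sequence[Any]               # A: available actions
    hypothesis: Callable[[Any], Any]     # a representative h in H: Z -> A
    loss: Callable[[Any, Any], float]    # ell: (action, truth) -> R_+
    risk: Callable[..., float]           # R: expected loss under P

# Toy 1-D two-class task with squared loss and h(z) = I(z > 0)
task = Task(
    measurements=[(0.3, 1), (-1.2, 0)],
    density=lambda x: math.exp(-x ** 2 / 2) / math.sqrt(2 * math.pi),
    model="Gaussian",
    actions=[0, 1],
    hypothesis=lambda x: int(x > 0),
    loss=lambda a, y: (a - y) ** 2,
    risk=lambda h, data, loss: sum(loss(h(x), y) for x, y in data) / len(data),
)
print(task.risk(task.hypothesis, task.measurements, task.loss))  # empirical risk
```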
---
### What is an Algorithm?
$f$ is a sequence of learning procedures, $f_1, f_2, \ldots$, where each $f_n \in \mathcal{F}$ maps from a data corpus $\mathcal{D}_n$ to a hypothesis $h$, $f\_n : \mathcal{Z}^n \to \mathcal{H}$.
### What is Performance?
Generalization error $\mathcal{E}$ is the expected risk with respect to the training dataset:
$$ \mathcal{E}\_T(f\_n) = \mathbb{E}_{P}[R(f_n(D_n))].$$
---
## What is Learning?
"An algorithm $f$ learns from data $D_n$ with respect to task $T$ with performance measure $\mathcal{E}$, if $f$'s performance at task $T$ improves with $D_n$."
<br>
$f$ learns from data iff $\mathcal{E}_T(f_n)$ is better than $\mathcal{E}_T(f_0)$, or more formally
<br>
$$\frac{\mathcal{E}_T(f_0)}{\mathcal{E}_T(f_n)} > 1$$
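
As a toy illustration (ours, not from the L2F codebase), the snippet below estimates this ratio on a simple 2-class problem, taking $f_0$ to be the "no data" (chance) learner and $f_n$ a forest trained on $n$ samples.

```python
# Toy sketch (ours): estimate the learning ratio E_T(f_0) / E_T(f_n), with
# f_0 = chance (no data) and f_n = a forest trained on n samples.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

def sample(n):
    y = rng.integers(0, 2, n)
    x = rng.normal(loc=y[:, None], scale=1.0, size=(n, 2))  # class means (0,0), (1,1)
    return x, y

x_train, y_train = sample(500)
x_test, y_test = sample(5000)

err_f0 = 0.5                                                  # chance error, no data
f_n = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)
err_fn = 1.0 - f_n.score(x_test, y_test)

print("learning ratio:", err_f0 / err_fn)                     # > 1  =>  f learned
```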
---
### What is Transfer Learning?
- Given
- data $D_0$, and an algorithm $f_0$ trained using only $D_0$.
- data $D_n$ from possibly another distribution,
- algorithm $f_n$ trained using both $D_0$ and $D_n$
"An algorithm $f$ .pu[transfer] learns from data $D_n$ with respect to transfer learning task $T$ with performance measure $\mathcal{E}$, if $f$'s performance at task $T$ improves over with $D_n$.'"
--
$f$ .pu[transfer] learns from data iff
$$\frac{\mathcal{E}\_T(f\_0)}{\mathcal{E}\_T(f\_{n})} > 1.$$
---
### What is Multi-Task (MT) Learning?
- Given
- for $j=1,\ldots, J$
- data $D_j$, possibly from different distributions,
- algorithm $f_j$ trained using only $D_j$,
- algorithm $f_n$ trained using all $D_j$'s,
- a multi-task risk (e.g., average risk across tasks).
"An algorithm $f$ .pu[multi-task] learns from data $D_n$ with respect to multi-task $T$ with overall performance measure $\mathcal{E}_T$, if $f$'s performance at task $T$ improves with $D_n$."
--
$f$ .pu[multi-task] learns from data iff
$$\frac{\mathcal{E}\_T( \lbrace f\_j \rbrace )}{\mathcal{E}\_T(f\_{n})} > 1.$$
---
### What is Lifelong Learning?
- Given
- for $j=1,\ldots$
- data $D_j$, possibly from different distributions,
- algorithm $f_j$ trained using only $D_j$,
- algorithm $f_n$ .pu[sequentially] trained using all $D_j$'s,
- a multi-task risk (e.g., average risk across tasks).
"An algorithm $f$ .pu[lifelong] learns from data $D_n$ with respect to lifelong task $T$ with overall performance measure $\mathcal{E}_T$, if $f$'s performance at task $T$ improves with $D_n$."
--
$f$ .pu[lifelong] learns from data iff
$$\frac{\mathcal{E}\_T( \lbrace f\_j \rbrace )}{\mathcal{E}\_T(f\_{n})} > 1.$$
---
### Generalized Learning
<br>
| | $J>1$ | Sequential |
| :---: | :---: | :---: |
| Machine | 0 | 0 |
| Multi-task | 1 | 0 |
| Lifelong | 1 | 1 |
---
### Three Kinds of Lifelong Learning
1. Supervised: setting is provided for each data sample
2. Unsupervised: setting is not provided for any data sample
3. Semi-supervised: setting is provided for only some data samples
--
We largely focus on the supervised setting hereafter.
---
class: middle, inverse
## .center[General (Lifelong) Learning Machines]
---
### Machine Learning: Basic Set-Up
Given
1. A task $ T $
2. Observed data $ z\_1, z\_2, \ldots, z\_n $
Assume
1. The $ z\_t $ are independent
---
### Constituents of Learning Machines
1. .pu[representation]: data are transformed into a representation space
2. .pu[transformer]: maps each point to the representation space
3. .pu[decider]: takes an action using the transformed data
---
### Example 1
Linear 2-class classification (we want to label the black dot)
<img src="images/l2m_18mo/cml_all_data.png" style="width:500px" class="center"/>
---
#### Example 1.1 - Linear Discriminant Analysis
1. .pu[representation]: the real number line
2. .pu[transformer]: projection matrix that maps each feature vector onto real number line
3. .pu[decider]: threshold such that values above threshold are one class, and below are the other
<img src="images/l2m_18mo/cml_linear.jpg" style="width:600px" class="center"/>
<!-- diagram of linear classifier (projection from R^d to R then threshold) for two class, spherical gaussians -->
<!-- figure of problem -->
<!-- figure with representation -->
<!-- figure showing transformation of one data point to the line -->
<!-- figure showing decider acting on a the transformed data point -->
---
#### Example 1.2 - Decision Tree
1. .pu[representation]: a partition of $ \mathbb{R}^{d} $
2. .pu[transformer]: maps a data point to a cell of the partition
3. .pu[deciders]: plurality (or average) of elements per cell
<img src="images/l2m_18mo/cml-tree.jpg" style="width:650px" class="center"/>
<!-- diagram of a partition for two class, spherical gaussians -->
<!-- figure of representation (partition) -->
<!-- figure of a data point in a cell of the partition -->
<!-- figure of classifying a data point in a cell of the partition -->
---
#### Example 1.3 - Decision Forests
1. .pu[representation]: a set of partitions of $ \mathbb{R}^{d} $
2. .pu[transformer]: maps a data point to a cell of each of the partitions
3. .pu[deciders]: average of decision trees (posterior probabilities)
<img src="images/l2m_18mo/cml_forest.jpg" style="width:600px" class="center"/>
Each tree uses a different subset of data to transform (partition) space
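
A rough sketch (ours) of this decomposition: each tree's partition is learned from a different subsample (transformer), and the decider averages the per-tree posteriors.

```python
# Rough sketch (ours) of Example 1.3: one partition per subsample, averaged posteriors.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 600)
x = rng.normal(loc=y[:, None], scale=0.5, size=(600, 2))

trees = []
for _ in range(10):
    idx = rng.choice(len(x), size=len(x) // 2, replace=False)  # different subset per tree
    trees.append(DecisionTreeClassifier(max_depth=5).fit(x[idx], y[idx]))

def forest_posterior(points):
    return np.mean([t.predict_proba(points) for t in trees], axis=0)

print(forest_posterior(np.array([[0.9, 1.1]])))                # averaged P(class | x)
```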
<!-- Each partition learned from the various subsamples is a **representation**, and the decision tree **transforms** data into those representations. The **decider** chooses the class based on the cell probability. -->
<!-- --- -->
<!-- #### Example 1.3 - Decision Forests (Tree 1)
Subsample the data and learn a partition of the input space. The input space is $\mathbb{R}^2$ and the classes are green or purple. Each cell of the partition has an associated probability of class green. A new data point is first placed into a cell, and then its class is predicted based on the cell's probability.
<img src="images/l2m_18mo/decisiontree_for_forest1.png" style="width:400px" class="center"/> -->
<!-- --- -->
<!-- #### Example 1.3 - Decision Forests (Tree 2) -->
<!-- Similarly, another random subsample will yield a different decision tree, with different cells and cell probabilities. -->
<!-- <img src="images/l2m_18mo/decisiontree_for_forest2.png" style="width:500px" class="center"/> -->
<!-- --- -->
<!-- #### Example 1.3 - Decision Forests -->
<!-- Each partition learned from the various subsamples is a **representation**, and the decision tree **transforms** data into those representations. The **decider** chooses the class based on the cell probability. -->
<!-- <img src="images/l2m_18mo/cml_forest.jpg" style="width:600px" class="center"/> -->
<!-- diagram for two class, spherical gaussians -->
<!-- figure of representtaion (multiple partitions) -->
<!-- figure of data point in a cell of each of the partitions -->
<!-- figure of classifying a data point -->
<!-- ---
### Decision Forests
-.pu[transformer]: partition space using a subset of the $ n $ data samples
-.pu[decider]: majority vote on remaining samples
<img src="images/l2m_rf_cartoon.jpg" style="width:475px" class="center"/> -->
<!-- diagram 1: all data
diagram 2: transformer data
diagram 3: partitions learned on transformer data
diagram 4: posteriors estimated using transformer data (continue to show data)
diagram 5: posteriors estimated using decider data (continue to show data)
diagram 6: show true posterior
diagram 7: consider showing posterior errors
- explan n, d
-->
<!--
new slide: add error + posterior plots for linear example
-->
---
### Example 2
XOR (we want to label the black dot)
- Samples in the (+, +) and (-, -) quadrants are class 1
- Samples in the (-, +) and (+, -) quadrants are class 0
- The optimal decision boundary is given by the coordinate axes
<img src="images/l2m_18mo/xor.png" style="width:400px" class="center"/>
---
### Example 2.1 - Honest Forests
<img src="images/l2m_18mo/just_xor.png" alt="Snow" style="height:300px;">
<!-- <img src="images/l2m_18mo/posterior_heatmap_xor_nxor_rf.png" style="height:300px;"> -->
- Error is estimated generalization error, $\hat{\mathbb{E}}_{P}[R(f_n(D_n))]$
- transformer (partition) learned using one subset of data from XOR
- decider (posteriors) learned using .pu[other] data from XOR
- Error monotonically decreases with increasing sample size
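
A simplified sketch (ours, not the actual L2F implementation) of the honest split described above: the partition comes from one half of the data, the per-cell posteriors from the held-out half.

```python
# Simplified sketch (ours) of an "honest" tree on XOR: partition from one half
# of the data, per-cell posteriors from the held-out half.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 400
x = rng.uniform(-1, 1, size=(n, 2))
y = ((x[:, 0] > 0) == (x[:, 1] > 0)).astype(int)   # XOR: (+,+) and (-,-) are class 1

half = n // 2
tree = DecisionTreeClassifier(max_depth=8).fit(x[:half], y[:half])   # transformer

leaf_ids = tree.apply(x[half:])                                      # decider data
posterior = {leaf: y[half:][leaf_ids == leaf].mean() for leaf in np.unique(leaf_ids)}

def predict_proba(points):
    return np.array([posterior.get(leaf, 0.5) for leaf in tree.apply(points)])

print(predict_proba(np.array([[0.5, 0.5], [-0.5, 0.5]])))   # ~[1, 0] = P(class 1)
```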
---
### Example 2.1 - Honest Forests
<img src="images/l2m_18mo/just_xor.png" alt="Snow" style="height:300px;">
<img src="images/l2m_18mo/posterior_heatmap_xor_nxor_rf.png" style="height:300px;">
- learned probability of an observation being green
- using 100 XOR training samples
---
### Transfer Learning: Basic Set-Up
Given
1. a task $ T $
2. observed source data $ z'\_1, z'\_2, \ldots, z'\_s $
3. observed target data $ z\_{s+1}, z\_{s+2}, \ldots, z\_n $
Assume
1. the $ z\_t $, $ z'\_t $ are all mutually independent
<!-- Make the following assumptions:
1. We are given $m$ training samples from a .r[source] task
2. At each time step $t$, we observe
- a target sample $z_t \in \mathcal{Z}$, and
- a task identifier $j_t \in \mathcal{J}$ (which includes $\emptyset$).
3. The number of target samples $n$ increases.
4. All samples are independent of each other.
Assumption 4 can be relaxed. -->
---
### Constituents of Transfer Machines
1. .pu[representation]: source and target data are transformed into this space
2. .pu[transformers]: map each source and target point to the representation space
3. .pu[decider]: takes an action using all the transformed data
---
### Example 3
A 2-class classification problem (left), along with data from a second, related 2-class problem (right)
<div style="display:inline-flex">
<div>
<img src="images/l2m_18mo/cml_linear_problem1.png" alt="Snow" style="height:300px; width:300px">
</div>
<div>
<img src="images/l2m_18mo/cml_linear_problem_dataset2.png" alt="nah" style="height:300px; width:300px">
</div>
</div>
<!-- 1. .pu[representation]: the real number line
2. .pu[transformer]: projection matrix that maps data onto real number line
3. .pu[decider]: threshold such that target values above threshold are one class, and below are the other -->
<!-- task 1:
- mean1 = (1,1) orange
- mean0 = (0,0) purple
task 2:
- mean1 = (0,0) green
- and mean0 = (1,1) magenta
-->
---
### Constituents of Transfer Trees
1. .pu[representation]: a partition of $ \mathbb{R}^{d} $
2. .pu[transformer]: places data point into a cell of the partition learned on **source** data
3. .pu[decider]: majority vote of **target** data in the cell of the partition
<img src="images/l2m_18mo/transfer-tree.jpg" class="center" alt="Snow" style="height:300px; width:550px">
<!-- update above image as above: label arrows transformer and decider -->
<!--
new slide:
show error and posteriors
but, x-axis in # of source data samples (for fixed target)
y-axis is target error
2 lines: one using only target, and one also using increasing source
give title to posterior plot
http://colorbrewer2.org/#type=diverging&scheme=PuOr&n=11
-->
<!-- ---
## Transfer Tree
- Use **source** data to learn a transformer (estimate partitions)
- Use **source** and **target** data to learn deciders (estimate posteriors)
- Average predictions from each decider.
<img src="images/l2m_18mo/transfer-tree.jpg" class="center" alt="Snow" style="height:300px; width:550px"> -->
---
### Example 4: .blue[Not-XOR (N-XOR)] to .r[XOR]
- .r[XOR]
- Samples in the (+, +) and (-, -) quadrants are green
- Samples in the (-, +) and (+, -) quadrants are purple
- .blue[N-XOR]
- Samples in the (+, +) and (-, -) quadrants are purple
- Samples in the (-, +) and (+, -) quadrants are green
- The optimal decision boundaries for both problems are the coordinate axes.
<img src="images/l2m_18mo/xor_nxor.png" style="width:475px" class="center"/>
---
### Transfer Forests
<div style="display:inline-flex">
<div>
<img src="images/l2m_18mo/L2M_18mo_xor_nxor_transfer.png" alt="Snow" style="height:300px; width:300px">
</div>
<div>
<img src="images/l2m_18mo/posterior_heatmap_xor_nxor_transfer.png" alt="nah" style="height:300px; width:300px">
</div>
</div>
- (Left) Decision Forest .blue[N-XOR] (the transfer forest) uses:
- 100 samples from .r[XOR] to learn the decider (posterior)
- $n$ samples from .blue[N-XOR] to learn the transformer (partition)
- (Right) learned probability of an observation being green
- used 100 .r[XOR] and 300 .blue[N-XOR] training samples
<!-- change posterior heatmap colors -->
---
### Transfer Forests
<div style="display:inline-flex">
<div>
<img src="images/l2m_18mo/L2M_18mo_xor_nxor_transfer.png" alt="Snow" style="height:300px; width:300px">
</div>
<div>
<img src="images/l2m_18mo/posterior_heatmap_xor_nxor_transfer.png" alt="nah" style="height:300px; width:300px">
</div>
</div>
- Learning partitions is "harder" than learning posteriors:
- DF .r[XOR] only gets 100 samples to learn partitions,
- DF .blue[N-XOR] gets up to 400
- DF .blue[N-XOR] exhibits transfer learning: its error on .r[XOR] decreases as the number of samples from .blue[N-XOR] increases, for a fixed amount of data from .r[XOR]
<!-- change posterior heatmap colors -->
---
### Lifelong Learning: Basic Set-up
Given
1. a universe of potential tasks, $\mathcal{J}$
1. a sequence of data and task pairs $(z_1, j_1), (z_2,j_2),\ldots$, where each $j_t \in \mathcal{J}$
<!-- 2. data and t$ z'\_1, z'\_2, .., z'\_s $ -->
<!-- 3. observed target data $ z\_1, z\_2, .., z\_n $ -->
Assume
1. the $ z\_t $ are all independent
1. total \# samples > \# unique tasks
---
### Constituents of Lifelong LM (L2M)
1. .pu[representation]: data from all tasks are transformed into this space
2. .pu[transformers]: map each data point from all tasks to the representation space
3. .pu[deciders]: take an action using transformed data from all tasks
---
### Example 5
$J$ linear 2-class classification tasks, with $J$ increasing
<div style="display:inline-flex">
<div>
<img src="images/l2m_18mo/cml_linear_problem1.png" alt="Snow" style="height:200px; width:200px">
</div>
<div>
<img src="images/l2m_18mo/cml_linear_problem_dataset2.png" alt="nah" style="height:200px; width:200px">
</div>
<div>
<img src="images/l2m_18mo/cml_linear_problem3.png" alt="Snow" style="height:200px; width:200px">
</div>
</div> ...
<!-- 1. .pu[representation]: the real number line
2. .pu[transformers]: $J$ different projection matrix that maps data onto real number line
3. .pu[deciders]: $J$ thresholds such that values above threshold are one class, and below are the other -->
<!-- means are +/- j (potentially with jitter added) -->
---
### Lifelong Trees
1. .pu[representation]: set of partitions
2. .pu[transformers]: a tree to partition space for each task
3. .pu[deciders]: average of majority vote in each node from each tree applied to each task's data
<img src="images/l2m_18mo/lifelong-tree.jpg" style="height:400px;width:550px">
---
### Lifelong Trees
1. .pu[representation]: set of partitions
2. .pu[transformers]: a tree to partition space for each task
3. .pu[deciders]: average of majority vote in each node from each tree applied to each task's data
- **each task** yields a representation,
- the data are transformed into each representation,
- the deciders make predictions in each representation,
- the prediction is averaged across representations.
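
A rough sketch (ours; one tree per task rather than a full forest) of this decision rule: every task contributes a partition, the current task's data estimate posteriors in each partition, and the posteriors are averaged.

```python
# Rough sketch (ours) of the lifelong decision rule, with one tree per task.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def honest_posteriors(tree, x, y):
    """Per-leaf P(class 1) estimated from (x, y) pushed through `tree`."""
    leaves = tree.apply(x)
    return {leaf: y[leaves == leaf].mean() for leaf in np.unique(leaves)}

def lifelong_predict_proba(trees, x_task, y_task, points):
    """Average, over every task's tree, of posteriors estimated on this task's data."""
    probs = []
    for tree in trees:
        post = honest_posteriors(tree, x_task, y_task)
        probs.append([post.get(leaf, 0.5) for leaf in tree.apply(points)])
    return np.mean(probs, axis=0)

rng = np.random.default_rng(2)
def make_xor(n, flip=False):
    x = rng.uniform(-1, 1, size=(n, 2))
    y = ((x[:, 0] > 0) == (x[:, 1] > 0)).astype(int)
    return x, (1 - y) if flip else y

x1, y1 = make_xor(200)              # task 1: XOR
x2, y2 = make_xor(200, flip=True)   # task 2: N-XOR
trees = [DecisionTreeClassifier(max_depth=8).fit(x1, y1),
         DecisionTreeClassifier(max_depth=8).fit(x2, y2)]

print(lifelong_predict_proba(trees, x1, y1, np.array([[0.5, 0.5]])))  # XOR posterior
```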
<!-- update a la other ones -->
---
### Example 6: .r[XOR] and .blue[Not-XOR (N-XOR)]
- .r[XOR]
- Samples in the (+, +) and (-, -) quadrants are green
- Samples in the (-, +) and (+, -) quadrants are purple
- .blue[N-XOR]
- Samples in the (+, +) and (-, -) quadrants are purple
- Samples in the (-, +) and (+, -) quadrants are green
- The optimal decision boundaries for both problems are the coordinate axes.
<img src="images/l2m_18mo/xor_nxor.png" style="width:475px" class="center"/>
<!-- update colors -->
---
### Lifelong Forests
<div style="display:flex">
<div>
<img src="images/l2m_18mo/L2M_18mo_xor_nxor_lifelong.png" alt="Snow" style="height:300px; width:300px">
</div>
<div>
<img src="images/l2m_18mo/posterior_heatmap_xor_nxor_lifelong.png" alt="nah" style="height:300px; width:300px">
</div>
</div>
- .green[Lifelong Forests] learn the optimal decision boundary using both
- 100 samples from .r[XOR], and
- $n$ samples from .blue[N-XOR].
- Posteriors are learned using only 300 samples from .blue[N-XOR] on both partitions.
- Prediction is based on the learned posterior averaged over both partitions.
<!-- make same as transfer learning plots but add new line
update text accordingly
-->
---
### Lifelong Forests
<div style="display:flex">
<div>
<img src="images/l2m_18mo/L2M_18mo_xor_nxor_lifelong.png" alt="Snow" style="height:300px; width:300px">
</div>
<div>
<img src="images/l2m_18mo/posterior_heatmap_xor_nxor_lifelong.png" alt="nah" style="height:300px; width:300px">
</div>
</div>
- .green[Lifelong Forests] perform as well as the best method regardless of the amount of .blue[N-XOR] data
- .green[Lifelong Forests] therefore exhibit .pu[adaptation to new tasks and circumstances] more effectively than a naïve transfer forest
<!-- may want to show plots for other transfer direction -->
---
### Adversarial Tasks
- .r[XOR] vs .blue[N-XOR] is the easiest possible sequence of tasks
- Consider an adversarial task:
- a second task, .blue[Rotated-XOR (R-XOR)], which has zero information about the first task;
- therefore, its data can only add noise and increase error
<img src="images/l2m_18mo/xor_rxor.png" style="width:475px" class="center"/>
<!-- add two linear two-class classification problem w adverserial source task -->
<!-- adverserial source task = either means reflected across x-axis or two gaussians centered at 0,0 -->
---
### Example 7: .r[XOR] and .blue[R-XOR]
<!-- integrate this content into slide with figure -->
- Left figures:
- y-axis is error on .r[XOR], learned using 100 .r[XOR] samples
- x-axis is # of .blue[R-XOR] samples
- Right figures:
- estimated probability of an observation being purple
- learned using 100 .r[XOR] and 300 .blue[R-XOR] training samples
---
### Decision Forests
<div style="display:flex">
<div>
<img src="images/l2m_18mo/L2M_18mo_xor_rxor_rf.png" alt="Snow" style="height:300px; width:300px">
</div>
<div>
<img src="images/l2m_18mo/posterior_heatmap_xor_rxor_rf.png" alt="nah" style="height:300px; width:300px">
</div>
</div>
- Decision Forest .r[XOR] uses
- data from .r[XOR] to learn transformer (partition)
- other data from .r[XOR] to learn decider (posterior)
- Since the number of samples from .r[XOR] is fixed, the error of DF .r[XOR] is constant
<!-- fix posterior plots as above, with titles -->
---
### Transfer Forests
<div style="display:flex">
<div>
<img src="images/l2m_18mo/L2M_18mo_xor_rxor_transfer.png" alt="Snow" style="height:300px; width:300px">
</div>
<div>
<img src="images/l2m_18mo/posterior_heatmap_xor_rxor_transfer.png" alt="nah" style="height:300px; width:300px">
</div>
</div>
- Decision Forest .blue[R-XOR] uses
- data from .blue[R-XOR] to learn the transformer (partition)
- data from .r[XOR] to learn decider (posterior)
- Error decreases only slightly because the partitions are nearly uninformative (any apparent informativeness is due to the noisy estimate of the partition; the optimal partition is fully uninformative)
<!-- fix posterior plots as above -->
---
### Lifelong Forests
<div style="display:flex">
<div>
<img src="images/l2m_18mo/L2M_18mo_xor_rxor_lifelong.png" alt="Snow" style="height:300px; width:300px">
</div>
<div>
<img src="images/l2m_18mo/posterior_heatmap_xor_rxor_lifelong.png" alt="nah" style="height:300px; width:300px">
</div>
</div>
- .green[Lifelong Forests] learn the optimal decision boundary using data from both (1) .r[XOR], and (2) .blue[R-XOR].
- Posteriors are learned using only data from .r[XOR] on both partitions.
- Prediction is based on the estimated posterior averaged over both partitions.
<!-- fix posterior plots as above -->
---
### Lifelong Forests
<div style="display:flex">
<div>
<img src="images/l2m_18mo/L2M_18mo_xor_rxor_lifelong.png" alt="Snow" style="height:300px; width:300px">
</div>
<div>
<img src="images/l2m_18mo/posterior_heatmap_xor_rxor_lifelong.png" alt="nah" style="height:300px; width:300px">
</div>
</div>
- .green[Lifelong Forests] do not dramatically degrade performance on past tasks, despite maximally adversarial tasks
- Therefore, .green[Lifelong Forests] avoid .pu[catastrophic forgetting] regardless of new task distributions, making them fundamentally .pu[safe]
<!-- fix poster plots as above -->
---
class: center, middle, inverse
## .center[L2F Lifelong Learns Empirically on Real Data]
---
## "Real" Data: CIFAR 10-by-10
- CIFAR-100 is a popular image classification dataset with 100 classes, each containing 600 colour images of size 32x32.
- There are 500 training images and 100 testing images per class.
- CIFAR 10x10 breaks the 100-class problem into 10 tasks, each a 10-class problem.
<img src="images/l2m_18mo/cifar-10.png" style="position:absolute; left:250px; width:400px;"/>
<!-- show images of each task/class
eg, https://www.cs.toronto.edu/~kriz/cifar.html for 1 task,
make a table left column is superclass and next 5 columns are tasks
include this info: This dataset is just like the CIFAR-10, except it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a "fine" label (the class to which it belongs) and a "coarse" label (the superclass to which it belongs).
Here is the list of classes in the CIFAR-100:
-->
---
### Previous State-of-the-Art #1
- David Lopez-Paz, Marc'Aurelio Ranzato. "[Gradient Episodic Memory for Continual Learning](https://papers.nips.cc/paper/7225-gradient-episodic-memory-for-continual-learning.pdf)." NIPS 2017.
- 127 Citations
<img src="images/l2m_18mo/evoplot_cifar100.png" style="width:750px;"/>
Figure 1: evolution of the test accuracy at the first task, as more tasks are learned.
---
### Previous State-of-the-Art #2
- Seungwon Lee, James Stokes, and Eric Eaton. "[Learning Shared Knowledge for Deep Lifelong Learning Using Deconvolutional Networks](https://www.ijcai.org/proceedings/2019/393)." IJCAI, 2019.
<img src="images/l2m_18mo/progressive_nets.png" style="width:500px;"/>
Figure 1: No approaches demonstrate reverse transfer, and most demonstrate catastrophic forgetting.
---
#### Key Algorithm Parameters Per Tree
- To learn partitions:
- bootstrap = true (sklearn default)
- min_samples_leaf=1 (sklearn default; maximally pure trees)
- max_features = $\sqrt{p}$ (sklearn default; not relevant for 2D data)
- max_depth=30 (to accelerate run time; typically makes accuracy worse)
- max_samples = 0.32 (to enable samples for learning posteriors)
- To learn posteriors:
- finite sample correction = 1 / ( 2 $\times$ # samples in cell)
- Nothing special about our tree construction, basically:
- default sklearn parameters, plus
- modifications to enable learning posteriors (deciders)
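
For concreteness, the sketch below (ours) instantiates these settings with sklearn; the finite-sample correction is shown as we read it, i.e. each leaf posterior is kept at least 1/(2 × # samples in the cell) away from 0 and 1.

```python
# Sketch (ours) of the per-tree settings above, plus one reading of the
# finite-sample correction applied to a leaf's estimated posterior.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=100,
    bootstrap=True,          # sklearn default
    min_samples_leaf=1,      # sklearn default; maximally pure trees
    max_features="sqrt",     # sklearn default
    max_depth=30,            # cap depth to accelerate run time
    max_samples=0.32,        # leave held-out samples for learning posteriors
)                            # fitting omitted; configuration only

def corrected_posterior(p_hat, n_cell):
    """Keep a leaf's class posterior at least 1/(2*n_cell) away from 0 and 1."""
    eps = 1.0 / (2.0 * n_cell)
    return np.clip(p_hat, eps, 1.0 - eps)

print(corrected_posterior(1.0, n_cell=10))   # 0.95 rather than a hard 1.0
```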
---
<img src="images/l2m_18mo/cifar-100-TLE-Accuracy.png" style="width:750px;"/>
- Lifelong Forests exhibit "reverse" (or "backward") transfer: their performance on previously observed tasks increases as more tasks are observed.
<!--
keep these 2 plots in appendix
add error normalized each line by dividing by single task performance
with errorbars
-->
<!-- ---
<img src="images/initial-cifar-100-debugged-1-subsample-recall.png" style="width:750px;"/>
-->
---
class: middle, inverse
## .center[Hardware Design Implications]
---
### Forest > Neural Nets (single threaded)
<img src="images/s-rerf_6plot_times.png" style="width:750px;"/>
- Forests are red, Neural nets are blue
---
### Forest > Neural Nets (single threaded)
- Forests can be trained in a massively parallel fashion
- unlike DeepNets and ConvNets
- Forests can make predictions in a massively parallel fashion
- unlike DeepNets and ConvNets
- Even single threaded,
- forest training is faster than NN
- forest prediction is faster than NN
- Forests need only access a small fraction of the stored model to make a prediction
- because most leaves are not activated for any given input
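
A quick check (ours) of that last point with sklearn: a single prediction visits only one root-to-leaf path per tree, a small fraction of all stored nodes.

```python
# Quick sketch (ours): count how many nodes one prediction touches vs. the
# total number of nodes stored in the forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 10))
y = (x[:, 0] + x[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x, y)

indicator, _ = forest.decision_path(x[:1])             # nodes visited by one sample
total_nodes = sum(t.tree_.node_count for t in forest.estimators_)
print(indicator.sum(), "of", total_nodes, "nodes touched for one prediction")
```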
---
### Implications
- GPUs/TPUs are not required
- Forests can be deployed on IoT/mobile devices, using hard drives to store the model
- Additional layers of memory (L1/L2/L3 Cache, etc.) will immediately accelerate prediction time
- Incremental updating of trees is a new access pattern that no modern hardware is designed to efficiently support
---
class: middle, inverse
## .center[L2F Lifelong Learns Consistently]
---
### What is Consistency?
$\mathcal{E}\_{S}(f\_n,P) \rightarrow \min\_f \; \mathcal{E}\_{S}(f,P)$ as $n \to \infty$.
### What is Universal Consistency?
$\forall P \in \mathcal{P}: \mathcal{E}\_{S}(f\_n,P) \rightarrow \min\_f \; \mathcal{E}\_{S}(f,P)$ as $n \to \infty$.
---
### Key Insight of Uncertainty Forests
- Sub-sample data into two sets, one to learn transformers, and one to learn deciders
- This yields consistent estimates of the probabilities averaged per cell:
$$\hat{P}(f\_n) \rightarrow P$$
---
### L2F Theorems
Lemma 1: Uncertainty Forests produce universally consistent estimates of posterior probabilities for k-class classification problems.
Lemma 2: Posteriors estimated on sufficiently deep and random trees are also universally consistent for k-class classification problems.
Conj 3: For a specific $\vec{S}$, where $\mathcal{Z}\_j = (\mathcal{X} \times \mathcal{Y}\_j)$, L2F is a universal lifelong learner.
Proof: Stone's Theorem, 1977 + Biau et al. 2008 + a little more.
---
class: middle, inverse
## .center[Future Plans]
---
## Test and Evaluation Process
1. Test dataset 1: psychopathy data
- ~1000 subjects
- Cognitive, brain imaging, and other measurements acquired over a 10-year span
- Location, personnel, hardware, software, etc. changed over time.
- Goal: predict recidivism (important social and moral considerations)
2. Test dataset 2: lifespan human connectome project data
- ~1000 subjects
- Cognitive, brain imaging, and other measurements acquired from people ranging from 5 to 81 years old
- Goal: predict cognitive decline (important public health considerations)
- Metrics: LLE as a function of sample size
- Status: we have the data; it is being processed.
---
### Concrete Examples and Limitations
- Code for any supervised lifelong learning classification problem with a shared feature space will run today.
- Code for any supervised lifelong learning .pu[regression] problem will run soon.
- If all feature spaces are "typical" .pu[manifolds] (images, time-series), improved code will run soon.
- We will soon have an implementation that does not require the data or model to fit in main memory.
- If different tasks have different feature spaces (e.g., some only have images, others only have text), code will run in the near future.
- We do not plan to support reinforcement learning in the near future.
---
## 5 Core Capabilities
1. Continual Learning: ✓ (can stream updating probabilities)
2. Adaptation to New Tasks: ✓ (no catastrophic forgetting)
3. Goal Driven Perception: ✓ (all conditional on $S$)
4. Selective Plasticity: ✓ (uses consistent posterior estimates)
5. Monitoring & Safety: ✗
--
Will incorporate streaming output of probabilities and changes in held-out error rate as new tasks are included.
---
## Other Next Steps
- Extend empirical and theoretical results to more general settings
- Extend empirical and theoretical results to deep learning
---
### Polytope space partitioning algorithms
<img src="images/deep-polytopes.png" style="width:750px;"/>
- NNs with ReLU nodes also partition the feature space into polytopes ([NIPS, 2014](http://papers.nips.cc/paper/5422-on-the-number-of-linear-regions-of-deep-neural-networks.pdf)).