<!DOCTYPE html>
<html>
<head>
<title>Learning</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<link rel="stylesheet" href="fonts/quadon/quadon.css">
<link rel="stylesheet" href="fonts/gentona/gentona.css">
<link rel="stylesheet" href="slides_style_i.css">
<script type="text/javascript" src="assets/plotly/plotly-latest.min.js"></script>
</head>
<body>
<textarea id="source">
### What is learning?
[JHU](https://www.jhu.edu/): Hayden Helm | Jayanta Dey | Ronak Mehta | Will LeVine |
Carey E. Priebe | Joshua T. Vogelstein <br>
[Microsoft Research](https://www.microsoft.com/en-us/research/): Weiwei Yang | Jonathan Larson | Bryan Tower | Chris White
![:scale 40%](images/neurodata_blue.png)
---
class:middle
| Biological Learning | Machine Learning |
| :--- | :---
| old | new
| little | big
| light | heavy
| free | expensive
| imprecise | precise
| energy efficient | hog
| data efficient | glutton
| remembers | usually forgets
| builds models | sometimes
| .r[extrapolates] | interpolates
---
### Motivating questions
1. Are biological and machine learning trying to do the same thing?
2. Do they use the same algorithms? Could they?
3. Can we talk about them using the same terminology?
4. Can we characterize their abilities using the same units?
--
They are both about .ye[learning], so....should they?
---
### What is learning?
--
<br>
"The acquisition of knowledge or skills through experience, study, or by being taught."
-- Google, 2020
--
"A computer .ye[program] is said to learn from an .ye[experience] E with respect to some .ye[task] T and some .ye[performance measure] P if its performance on T as measured by P .ye[improves] with experience E."
-- Tom Mitchell, 1997
--
".ye[$f$] learns from .ye[data] $\mathbf{Z}_n$ w.r.t. .ye[task] $t$ when its .ye[performance] at $t$ improves due to $\mathbf{Z}_n$."
-- jovo, 2020
---
<!-- \mathbb{E}\left[\frac{R(f(\bold{Z}_n))}{R(f(\bold{Z}_0))}\right] = \frac{\mathbb{E}[R(f(\bold{Z}_n))]}{R(f(\bold{Z}_0))} -->
### What are the data?
The data are determined by physical implementation of the system:
- .ye[Measurement space]: $\mathcal{Z}$, determined by available sensors
  - visual, auditory, tactile, text, vectors, networks, etc.
  - could also be priors, inductive bias of the hypotheses, estimation bias of the algorithm, pre-training, etc., or any combination thereof
- .ye[Action space]: $\mathcal{A}$, determined by available actuators
  - →, ←, ↑, ↓, etc.
<!-- , {reject, fail to reject}, $\mathbb{R}$ -->
- .ye[Query space]: $\mathcal{Q}$, determined by system's "interface"
  - in which cluster is $z$? what is this object? etc.
Classification Example
- $z_i = (x_i,y_i)$ where $\mathcal{X}=\mathbb{R}^p$ and $\mathcal{Y}=\lbrace 0,1\rbrace$
- $a_i \in \lbrace 0, 1 \rbrace = \mathcal{Y}$ are class labels
- $q_i \in \mathcal{X}=\mathbb{R}^p$ are possible feature vectors with unknown class labels
<!-- TODO we play fast and loose with whether the dataset is in \mathcal{D} vs \mathcal{Z}. i'd like to be consistent -->
<!-- TODO in XOR experiments, replace purple with orange -->
<!-- TODO@jovo the supertask learning slides are too complex, i'll simplify dramatically -->
<!-- TODO@jovo move some supertask slides to appendix -->
<!-- TODO@jovo reorganize compositional hypothesis slides -->
<!-- TODO@jovo more about internal models, generalization, motivation, etc. -->
<!-- TODO@jovo 2nd what is learning slide uses old notation -->
<!-- TODO@jovo maybe update xor/nxor figure to show forgetting instead of RF? -->
---
### What is $f$?
We get to choose the learning algorithm $f$:
- $\vec{\mathcal{Z}} = \bigcup_{n = 0}^{\infty}\mathcal{Z}^n$, a .ye[data corpus] $\mathbf{Z}_n= \lbrace Z_1, Z_2, \ldots, Z_n \rbrace$, and $\mathcal{Z}^0 = \emptyset$ is the empty set,
meaning no data
- $\mathcal{H} = \lbrace h : \mathcal{Q} \rightarrow \mathcal{A} \rbrace$, where a .ye[hypothesis] $h$ takes an action on the basis of a query
- $f: \vec{\mathcal{Z}} \times \mathcal{Z} \to \mathcal{H}$, a learning .ye[algorithm] with the first parameter being the data set and the second parameter being a hyperparameter/initialization for the algorithm
- For notational convenience, we will suppress the second parameter, as it is always present
---
### Supervised machine learning example
- $\mathbf{Z}_n = (X_1, Y_1), \ldots, (X_n, Y_n)$
- $h$ is *RandomForestClassifier.predict*
- $f$ is *RandomForestClassifier.fit*
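
The fit/predict split above can be sketched without any library. This is a minimal illustration that swaps the random forest for a nearest-centroid rule; `fit_nearest_centroid` and the toy data are assumptions made for this example, not part of the original experiments.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]           # z_i = (x_i, y_i)
Hypothesis = Callable[[Sequence[float]], int]  # h: Q -> A


def fit_nearest_centroid(data: List[Sample]) -> Hypothesis:
    """Learning algorithm f: maps a data corpus Z_n to a hypothesis h."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for x, y in data:
        if y not in sums:
            sums[y] = [0.0] * len(x)
            counts[y] = 0
        sums[y] = [s + xi for s, xi in zip(sums[y], x)]
        counts[y] += 1
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def h(query: Sequence[float]) -> int:
        # Hypothesis h: choose the label of the closest class centroid.
        def sq_dist(centroid: List[float]) -> float:
            return sum((q - c) ** 2 for q, c in zip(query, centroid))
        return min(centroids, key=lambda y: sq_dist(centroids[y]))

    return h


Z_n = [((0.0, 0.1), 0), ((0.2, 0.0), 0), ((1.0, 0.9), 1), ((0.8, 1.0), 1)]
h = fit_nearest_centroid(Z_n)   # plays the role of RandomForestClassifier.fit
prediction = h((0.9, 0.8))      # plays the role of RandomForestClassifier.predict
```

The point of the design is that $f$ returns a function: the hypothesis is the output of learning, not the learner itself.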
---
### What is the model?
- We care about future performance on unseen queries, not past performance
- To make any out-of-data claims requires .ye[assumptions]
- .ye[data] $\mathbf{Z}\_n$ is sampled according to $P\_{\mathbf{Z}\_n}$ and observed in $\mathcal{Z}^n$
- $\mathcal{P}\_{Z} = (\mathcal{P}\_{\mathbf{Z}\_n})\_{n = 1}^{\infty}$ is the data model, where $\mathcal{P}\_{\mathbf{Z}\_n} = \lbrace P\_{\mathbf{Z}\_n} \rbrace$ is the family of
distributions that characterizes the $n$ samples
- A .ye[query], $q \in \mathcal{Q}$ is sampled iid from some true but unknown distribution $P_Q \in \, \mathcal{P}_Q$
- An optimal .ye[action], $a \in \mathcal{A}$ given $q$, is sampled iid from some true but unknown distribution $P\_{A \mid Q} \in \, \mathcal{P}_{A \mid Q}$
- $\mathcal{P} = \lbrace P\_{\mathbf{Z}\_n} \otimes P\_{A, Q} \rbrace$ is the task model, which is the set of joint .ye[distributions] over samples, queries, and optimal actions
- $\mathcal{P}$ is called the .ye[statistical model]
---
### What is performance?
- .ye[Loss],
e.g., 0-1 loss: $ \ell(a, a') := \mathbb{I}[a \neq a'] $
- .ye[Risk]
e.g., expected loss: $ R(h) := \ \mathbb{E}\_{Q, A}[\ell(h(Q), A)] $
- For an algorithm, the risk is $ R(f(\bold{Z}_n)) $
- The .ye[generalization error] of the algorithm $f$ given the data set $\bold{Z}\_n$
$$\mathbb{E}_P[R(f(\bold{Z}_n))]$$
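
As a rough illustration of loss and risk, the sketch below estimates $R(h)$ by averaging the 0-1 loss over held-out query/action pairs. The hypothesis `threshold_rule` and the four held-out pairs are hypothetical values chosen only for this example.

```python
from typing import Callable, Iterable, Tuple


def zero_one_loss(action: int, optimal_action: int) -> int:
    """0-1 loss: 1 if the chosen action differs from the optimal one, else 0."""
    return int(action != optimal_action)


def empirical_risk(h: Callable[[float], int],
                   queries_and_actions: Iterable[Tuple[float, int]]) -> float:
    """Average loss of h over (q, a) pairs, a Monte Carlo estimate of R(h)."""
    pairs = list(queries_and_actions)
    return sum(zero_one_loss(h(q), a) for q, a in pairs) / len(pairs)


def threshold_rule(q: float) -> int:
    # Hypothetical hypothesis h: predict class 1 when the query exceeds 0.5.
    return int(q > 0.5)


held_out = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 0)]
risk_estimate = empirical_risk(threshold_rule, held_out)   # one error in four
```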
---
### What is the setting?
$$s = (\mathcal{Q}, \mathcal{A}, \vec{\mathcal{Z}}, \mathcal{P}, \mathcal{H}, R, \mathcal{F})$$
- $\mathcal{Q}$ is the query space
- $\mathcal{A}$ is the action space
- $\vec{\mathcal{Z}}$ is the data set space
- $\mathcal{P}$ is the task model
- $\mathcal{H}$ is the hypothesis space
- $R$ is the risk function
- $\mathcal{F}$ is the algorithm space
---
### What is the task?
Given a setting $s$ and $n$ samples $\mathbf{Z}\_n$ drawn according to $P$, the task $t$ is to find the algorithm that minimizes the generalization error,
$$f\_n^* = \text{argmin}\_{f \in \mathcal{F}}\mathbb{E}[R\left(f(\mathbf{Z}\_n)\right)]$$
---
### What is performance?
- For an algorithm, the risk is $ R(f(\bold{Z}_n)) $
- .ye[Performance] is defined as expected risk (or .ye[generalization error]) minus the optimal (Bayes) risk:
$$\mathcal{E}_f^t(\mathbf{Z}\_A) := \mathbb{E}_P[R(f(\bold{Z}_A))] - R^* $$
- Performance prior to acquiring data $\mathbf{Z}_n$ is $\mathcal{E}\_f^t(\mathbf{Z}\_0)$, where $\mathbf{Z}\_0 = \varnothing$ is the empty set, meaning no data
---
### What is learning?
.ye[$f$] learns from .ye[data] $\mathbf{Z}_n$ with respect to .ye[task] $t$ when its .ye[performance] at $t$ improves due to $\mathbf{Z}_n$.
Define .ye[learning efficiency]: $$LE^t(\mathbf{Z}\_A, \mathbf{Z}\_B, f) := \frac{\mathcal{E}_f^t(\mathbf{Z}\_A)}{\mathcal{E}_f^t(\mathbf{Z}\_B)}$$
<br>
Since lower error is better, $LE^t > 1$ means $f$ performs better on task $t$ with $\mathbf{Z}\_B$ than with $\mathbf{Z}\_A$.
$f$ learns from $\mathbf{Z}_n$ with respect to task $t$ when $LE^t(\mathbf{Z}\_0, \mathbf{Z}\_n, f) > 1$.
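
A small worked example of the ratio, assuming made-up risks for a binary task: chance-level risk 0.50 with no data, risk 0.15 after $n$ samples, and Bayes risk 0.05. The function names are illustrative.

```python
def excess_error(expected_risk: float, bayes_risk: float) -> float:
    """Performance as defined above: expected risk minus the Bayes risk."""
    return expected_risk - bayes_risk


def learning_efficiency(error_a: float, error_b: float) -> float:
    """LE(Z_A, Z_B, f): error with Z_A divided by error with Z_B."""
    return error_a / error_b


# Hypothetical numbers, not measured results.
bayes_risk = 0.05
error_no_data = excess_error(0.50, bayes_risk)     # error with Z_0
error_with_data = excess_error(0.15, bayes_risk)   # error with Z_n

le = learning_efficiency(error_no_data, error_with_data)
f_learns = le > 1   # the error shrank, so f learned from Z_n
```

With these numbers the ratio is about 4.5, so the data reduced the excess error more than fourfold.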
---
### What is transfer learning?
- Given .ye[side information]
- other data, pre-trained model, priors
- Side information is sampled according to some distribution $P_{0}$, from $\mathcal{Z}_0$
- Task data is sampled according to $P_{1}$, from $\mathcal{Z}_1$
- We get data $\mathbf{Z}_n$ to be the combined data, $\mathbf{Z}_n = \lbrace (Z_i, S_i) \rbrace$, for $i \in [n]$, where $S\_i = s_j$ if $Z\_i$ is sampled according to $P_j$
- $j = 0, 1$, and the $s_0$ can be considered as the null setting corresponding to side information where we only know about the data space
---
### What is transfer learning?
- Let $\mathbf{Z}^t_n = \lbrace (Z_i, S_i) \in \mathbf{Z}_n: S_i = s(t) \rbrace$, $i \in [n]$, $s(t)$ is the setting of task $t$
- Transfer learning algorithm $f$ takes in data of the form $(Z, S)$, where $Z$ is observed in $\mathcal{Z}_0\cup\mathcal{Z}_1$
- The model is now the set $\lbrace P\_{\mathbf{Z}\_n} \otimes P\_{Q, A} \rbrace$, where $P\_{\mathbf{Z}\_n}$ is now a distribution over the new data $\mathbf{Z}\_n$, while $P\_{Q, A}$ is still a distribution over the queries and actions from $s\_1$
- Learning efficiency is now $LE^{t}(\mathbf{Z}_n^t, \mathbf{Z}_n, f) = \frac{\mathcal{E}_f^t(\mathbf{Z}_n^t)}{\mathcal{E}_f^t(\mathbf{Z}_n)}$
$f$ .ye[transfer learns] from $\mathbf{Z}\_{n}^0$ with respect to task $t$ when $LE^t(\mathbf{Z}_n^t, \mathbf{Z}_n, f) > 1$
---
### What is multitask learning?
- Now we have an environment of tasks $\mathcal{T}\_m = \lbrace t_1, t_2, \cdots, t_m \rbrace$ , each with their own setting $\mathcal{S}\_m = \lbrace s_1, s_2, \cdots, s_m \rbrace$ along
with some side information $\mathbf{Z}_n^0$
- Make same definitions as transfer learning
- $\mathbf{Z}_n^t$ is data associated with task $t$
- Learning efficiency for task $t$: $LE^{t}(\mathbf{Z}_n^t, \mathbf{Z}_n, f) = \frac{\mathcal{E}_f^{t}(\mathbf{Z}_n^t)}{\mathcal{E}_f^{t}(\mathbf{Z}_n)}$
---
### What is multitask learning?
- Define .ye[weak] multitask learning efficiency as:
$$ \text{WMLE}\_n(f) := \sum\_{t \in \mathcal{T}_m}w_t \cdot LE^{t}(\mathbf{Z}_n^t, \mathbf{Z}_n, f) $$
- $\lbrace w_t: t \in \mathcal{T}_m\rbrace$ is a set of weights that form a probability distribution
- Define .ye[strong] multitask learning efficiency as:
$$ \text{SMLE}\_n(f) := \min\_{t \in \mathcal{T}_m} LE^{t}(\mathbf{Z}_n^t, \mathbf{Z}_n, f) $$
$f$ (strongly or weakly) .ye[multitask learns] from $\mathbf{Z}\_{\mathbf{n}}$ with respect to tasks $\mathcal{T}_m$ if $(\text{S or W})\text{MLE}_n(f) > 1$
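
The two aggregates can differ on the same set of per-task efficiencies. A minimal sketch, with hypothetical efficiencies for three tasks and uniform weights:

```python
from typing import Dict


def weak_mle(le_per_task: Dict[str, float], weights: Dict[str, float]) -> float:
    """WMLE: weighted average of the per-task learning efficiencies."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must form a probability distribution")
    return sum(weights[t] * le_per_task[t] for t in le_per_task)


def strong_mle(le_per_task: Dict[str, float]) -> float:
    """SMLE: the worst per-task learning efficiency."""
    return min(le_per_task.values())


# Hypothetical per-task learning efficiencies, not measured results.
le_per_task = {"t1": 1.4, "t2": 0.9, "t3": 1.3}
uniform = {t: 1 / len(le_per_task) for t in le_per_task}

wmle = weak_mle(le_per_task, uniform)   # average above 1: weak multitask learning
smle = strong_mle(le_per_task)          # task t2 got worse: not strong
```

Strong multitask learning requires every task to benefit, so it implies weak multitask learning but not the reverse.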
---
### What is efficient learning?
- Let .ye[$c$] be the upper bound on the .ye[computational cost] of updating $h$ given a new $Z \in \mathcal{Z}$ and choosing an $a \in \mathcal{A}$.
- Let $\mathcal{F}_e = \lbrace f : \vec{\mathcal{Z}} \to \mathcal{H} \text{ such that } f \in \text{poly}(n,c) \rbrace$
$f$ .ye[efficiently learns] from $\mathbf{Z}\_n$ with respect to task $t$ when $LE^t(\mathbf{Z}_n^t, \mathbf{Z}_n, f) > 1$ and $f \in \mathcal{F}_e$.
---
### What is lifelong learning?
- Same set up as multitask learning, except now we have a sequence of tasks $\mathcal{T}_m$, called a .ye[curriculum], rather than the set, or environment, of tasks as before
- Same definitions and quantities (e.g. $\mathbf{Z}_n^t$, strong and weak learning, etc.)
$f$ (strongly, weakly) .ye[lifelong learns] from $\mathbf{Z}\_{n}$ with respect to tasks $\mathcal{T}_m$ when $(\text{S or W})\text{MLE}_n(f) > 1$ and $f \in o(n^2,c) $
---
### What is lifelong cheating?
- Store every sample you've ever seen
- Every time we are faced with a new $z$, $q$, or $t$, just update everything in batch mode
- Now just run your favorite multitask $f$
- Doing so consumes $\mathcal{O}(n^2)$ resources because $ \sum_{i =1}^n i = n(n+1)/2$
- So, to differentiate lifelong learning from multitask learning requires a particularly efficient algorithm
- $f$ must consume less than quadratic resources as a function of $n$, $f \in o(n^2,c) $
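
The quadratic cost comes from the arithmetic series $1 + 2 + \cdots + n = n(n+1)/2$. The sketch below counts samples processed, under the simplifying assumption that each batch update touches every stored sample exactly once.

```python
def batch_retraining_cost(n: int) -> int:
    """Samples processed if each new sample triggers a full batch update."""
    return sum(range(1, n + 1))


def streaming_cost(n: int) -> int:
    """Samples processed if each new sample is touched exactly once."""
    return n


n = 1000
cheating = batch_retraining_cost(n)   # grows like n^2 / 2
closed_form = n * (n + 1) // 2        # the same count in closed form
single_pass = streaming_cost(n)       # grows like n
```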
---
### What is forward learning?
- Let $n\_t$ be the last occurrence of task $t$ in $\mathbf{Z}\_n$
- Let $\mathbf{Z}\_n^{< t} = \lbrace Z\_1, Z\_2, \ldots, Z\_{n_t} \rbrace$
- .ye[Forward] learning efficiency is the improvement on task $t$ resulting from all data .ye[preceding] task $t$
$$FLE^t\_{\mathbf{n}}(f) := LE^t(\mathbf{Z}_n^t, \mathbf{Z}_n^{< t}, f)$$
$$ \text{(weak) } \text{FLE}\_{\mathbf{n}}(f) := \sum\_{t \in \mathcal{T}\_m} w\_t \cdot \text{FLE}\_{\mathbf{n}}^t(f) $$
$$ \text{(strong) } \text{FLE}\_{\mathbf{n}}(f) := \min\_{t \in \mathcal{T}\_m} \text{FLE}\_{\mathbf{n}}^t(f) $$
<br>
$f$ (strongly, weakly) .ye[forward learns] if $FLE_{\mathbf{n}}(f) > 1$ and $f \in o(n^2,c)$
---
### What is backward learning?
.ye[Backward] learning efficiency is the improvement on task $t$ resulting from all data .ye[after] task $t$
$$ BLE^t\_{\mathbf{n}}(f) := LE^t(\mathbf{Z}_n^{< t}, \mathbf{Z}_n, f) $$
$$ \text{(weak) } BLE\_{\mathbf{n}}(f) := \sum\_{t \in \mathcal{T}\_m} w\_t \cdot BLE\_{\mathbf{n}}^t(f) $$
$$ \text{(strong) } BLE\_{\mathbf{n}}(f) := \min\_{t \in \mathcal{T}\_m} BLE\_{\mathbf{n}}^t(f) $$
<br>
$f$ (strongly, weakly) .ye[backward learns] if $BLE_{\mathbf{n}}(f) > 1$ and $f \in o(n^2,c)$
---
### Learning efficiency factorizes
$$ LE^t(\mathbf{Z}_n^t, \mathbf{Z}_n, f) = FLE^t_n(f) \times BLE^t_n(f) $$
$$ LE^t(\mathbf{Z}_n^t, \mathbf{Z}_n, f) = LE^t(\mathbf{Z}_n^t, \mathbf{Z}_n^{< t}, f) \times LE^t(\mathbf{Z}_n^{< t}, \mathbf{Z}_n, f) $$
$$ \frac{\mathcal{E}_f^t(\mathbf{Z}_n^t)}{\mathcal{E}_f^t(\mathbf{Z}_n)} = \frac{\mathcal{E}_f^t(\mathbf{Z}_n^t)}{\mathcal{E}_f^t(\mathbf{Z}_n^{< t})} \times
\frac{\mathcal{E}_f^t(\mathbf{Z}_n^{< t})}{\mathcal{E}_f^t(\mathbf{Z}_n)} $$
<br>
We therefore have a single metric to quantify transfer.
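
The factorization is a telescoping product, which a quick numeric check makes concrete. The three error values are hypothetical.

```python
def learning_efficiency(error_a: float, error_b: float) -> float:
    """LE(Z_A, Z_B, f): error with Z_A divided by error with Z_B."""
    return error_a / error_b


# Hypothetical excess errors on task t for three training sets.
error_task_only = 0.30    # f trained on Z_n^t (task data alone)
error_up_to_task = 0.20   # f trained on Z_n^{<t} (data through task t)
error_all_data = 0.12     # f trained on Z_n (all data)

fle = learning_efficiency(error_task_only, error_up_to_task)   # forward
ble = learning_efficiency(error_up_to_task, error_all_data)    # backward
le = learning_efficiency(error_task_only, error_all_data)      # overall

# The middle error cancels, so the product of the two factors is the overall ratio.
product_matches = abs(fle * ble - le) < 1e-9
```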
---
### What is progressive learning?
$f$ weakly/strongly .ye[progressively] learns from $\mathbf{Z}\_{\mathbf{n}}$ with respect to tasks $\mathcal{T}_m$ when
it weakly/strongly learns both forward and backward
.center[$ \min \left(FLE\_{\mathbf{n}}(f),BLE\_{\mathbf{n}}(f) \right) > 1 $ and $f \in o(n^2,c) $
]
<!-- .center[$\left( FLE\_{\mathbf{n}}^t(f) > 1 \right) \times \left( BLE\_{\mathbf{n}}^t(f) > 1 \right) , \ \ t \in [T]$
] -->
---
### A taxonomy of approaches
| Par. | → | ← | space | time | Examples
| :---: | :---: | :---: | :---: | :---: | :---: |
| par | + | - | 1 | n | O-EWC, SI, TL
| par | + | - | T | n | SI
| par | + | - | T | nT+T<sup>2</sup>| EWC
| par | + | + | 1 | nT<sup>a</sup>, a ≤ 2 | TL + replay
| semipar | + | 0 | T | n T | ProgNN, DEN
| semipar | + | + | T | n T<sup>2</sup> | Sequential Multitask
| semipar | + | + | T | nT | ProgL Networks
| nonpar | + | + | n | nT | ProgL Forests
<!-- parametric, replay, space, time, forward, backwards, examples -->
<!-- apples to apples comparisons are only possible within a row of the table -->
---
### Some nuances
- Sometimes, $f$ might not know that the task has changed
- This framework cannot easily deal with distribution drift, or reinforcement learning
---
name:learn
### Outline
- Learning
- [Ensembling](#rep)
- [Experiments](#exp)
- [Theory](#theory)
- [Brains](#neuro)
- [Discussion](#disc)
---
### Learning Taxonomy
![:scale 100%](images/learning-taxonomy.svg)
---
### Ways Tasks can Differ
| Component | Notation | Examples |
| :--- | :--- | :---
| Query Space | $\mathcal{Q}$ | new keyboard introduced
| Action Space | $\mathcal{A}$ | class incremental, task incremental
| Measurement Space | $\mathcal{Z}$ | another modality
| Statistical Model | $\mathcal{P}$ | Gaussian to Log-Gaussian
| Hypotheses | $\mathcal{H}$ | linear functions
| Risk | $R$ | expected loss
| Algorithm Space | $\mathcal{F}$ | SVM
| Distribution | $P$ | mean shift
| Task Awareness | $T_i$ | {aware, oblivious, ambivalent}
$2^8 \times 3 = 768$ ways tasks can differ.
---
name:rep
### Outline
- [Learning](#learn)
- Ensembling
- [Experiments](#exp)
- [Theory](#theory)
- [Brains](#neuro)
- [Discussion](#disc)
---
### Composable Hypotheses
.center[ .ye[$h(\cdot) := w \circ v \circ u (\cdot) = w(v(u(\cdot)))$]]
- Let $u$ be a .ye[transformer] that maps data to a new representation,
$$ u : \mathcal{Q} \to \tilde{\mathcal{Q}}$$
- Let $v$ be a .ye[voter] that operates on the transformed data and outputs votes (score functions, posteriors) on all possible actions,
$$ v : \tilde{\mathcal{Q}} \to \mathcal{V}$$
- Let $w$ be a .ye[decider] that decides which action to take on the basis of the votes,
$$ w : \mathcal{V} \to \mathcal{A}$$
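
The composition $h = w \circ v \circ u$ can be sketched with three small functions. The cut at zero and the stored posteriors are hypothetical values for illustration.

```python
from typing import Callable, Dict


def compose(u: Callable, v: Callable, w: Callable) -> Callable:
    """Build the hypothesis h(q) = w(v(u(q)))."""
    return lambda query: w(v(u(query)))


def u(x: float) -> int:
    # Transformer: map a real-valued query to one of two cells.
    return int(x >= 0.0)


def v(cell: int) -> Dict[str, float]:
    # Voter: hypothetical class posteriors stored for each cell.
    posteriors = {0: {"green": 0.8, "purple": 0.2},
                  1: {"green": 0.3, "purple": 0.7}}
    return posteriors[cell]


def w(votes: Dict[str, float]) -> str:
    # Decider: pick the action with the largest vote.
    return max(votes, key=votes.get)


h = compose(u, v, w)
```

Keeping the three stages separate is what later allows transformers and voters to be mixed across tasks.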
---
![:scale 100%](images/single_decomposable_hypothesis.png)
<!-- TODO@ali: can we use an svg here? or a higher res png if you can't get a vector graphic? -->
---
### Simple Examples
- Linear Discriminant Analysis (shallow)
  - $u$: projection onto a line
  - $v$: fraction of points per class on each side of the threshold
  - $w$: maximum a posteriori class
--
- Decision Tree (deep)
  - $u$: union of polytopes
  - $v$: fraction of points per class per leaf node
  - $w$: maximum a posteriori class
---
### Predictive Ensembling
- Ensemble votes from multiple voters in a decider
$$
w \circ
\begin{bmatrix}
v_1 \circ u_1 \\\\
v_2 \circ u_2 \\\\
\vdots \\\\
v_m \circ u_m
\end{bmatrix}
$$
---
![:scale 100%](images/predictive_ensembling.png)
---
#### Predictive Ensembling Example
- Decision Forest
  - $u_b$ for $B$ trees: union of overlapping polytopes
  - $v_b$ for $B$ trees: fraction of points per class per leaf node
  - $w$: maximum a posteriori class averaging over trees
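
A toy version of this forest, assuming one-split "trees" with made-up leaf posteriors. Each pipeline is one $v_b \circ u_b$, and the decider averages their votes before taking the arg max.

```python
from typing import Callable, Dict, List

Votes = Dict[str, float]


def average_votes(all_votes: List[Votes]) -> Votes:
    """Average the class posteriors produced by several voters."""
    classes = all_votes[0].keys()
    return {c: sum(v[c] for v in all_votes) / len(all_votes) for c in classes}


def ensemble_decider(pipelines: List[Callable[[float], Votes]]) -> Callable[[float], str]:
    """Decider w applied to the stacked outputs of each v_b . u_b."""
    def h(query: float) -> str:
        averaged = average_votes([p(query) for p in pipelines])
        return max(averaged, key=averaged.get)
    return h


def make_stump(threshold: float) -> Callable[[float], Votes]:
    # Hypothetical one-split tree: u_b is the split, v_b the leaf posteriors.
    def pipeline(x: float) -> Votes:
        return {"a": 0.9, "b": 0.1} if x < threshold else {"a": 0.2, "b": 0.8}
    return pipeline


forest = ensemble_decider([make_stump(0.3), make_stump(0.5), make_stump(0.7)])
```

For a query of 0.4, two of the three stumps favor class "a", and the averaged posterior follows them.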
---
### Key Idea
- .ye[Different transformers can be composed with voters]
- Learn many different transformers $u_t(\cdot)$'s
- For each $u\_t$, learn a voter per task $v\_{t,t'}$'s
- Use the decider to weight the various options
- This is .ye[ensembling representations].
### Notes
- We learn a new representation for each task.
- Dimensionality of internal representation grows linearly with number of tasks.
---
### Representational Ensembling
- Ensemble representations from multiple transformers in a voter
- Assume $m$ transformers and $n$ voters
- Let $u =
\begin{bmatrix}
u_1 \\\\
u_2 \\\\
\vdots \\\\
u_m
\end{bmatrix}$, and
$
w \circ
\begin{bmatrix}
v_1 \circ u \\\\
v_2 \circ u \\\\
\vdots \\\\
v_n \circ u
\end{bmatrix}
$
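
In contrast to predictive ensembling, every voter here sees the same stacked representation. A minimal sketch, with two hypothetical transformers that cut the real line at different points and two hypothetical voters:

```python
from typing import Callable, List, Sequence, Tuple

Transformer = Callable[[float], int]


def stack_transformers(transformers: Sequence[Transformer]) -> Callable[[float], Tuple[int, ...]]:
    """u = [u_1, ..., u_m]: the joint representation is the tuple of all outputs."""
    return lambda x: tuple(u_k(x) for u_k in transformers)


u = stack_transformers([lambda x: int(x > 0.0), lambda x: int(x > 1.0)])


def voter_fraction(rep: Tuple[int, ...]) -> float:
    # Hypothetical voter: score for class 1 is the fraction of cuts exceeded.
    return sum(rep) / len(rep)


def voter_all(rep: Tuple[int, ...]) -> float:
    # Hypothetical voter: score 1 only if every cut is exceeded.
    return float(all(rep))


def decider(scores: List[float]) -> int:
    # w: average the voters' scores and threshold at one half.
    return int(sum(scores) / len(scores) > 0.5)


def h(x: float) -> int:
    rep = u(x)   # all voters operate on the same stacked representation
    return decider([voter_fraction(rep), voter_all(rep)])
```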
---
![:scale 100%](images/representational_ensembling.png)
---
#### Representational Ensembling Examples
- Uncertainty Forests
  - $u$: tree structures
  - $v$: posterior estimators
  - $w$: max
- Deep Nets
  - $u$: "backbone" (all but last layer)
  - $v$: softmax layer
  - $w$: max
---
### Composable Learning
<br>
| Scenario | Composition
| :--- | :---
| Single task learning | $ h(\cdot) = w \circ v \circ u (\cdot)$
| Multiple independent task learning | $ h_t(\cdot) = w_t \circ v_t \circ u_t (\cdot)$
| Single task ensemble learning |$ h(\cdot) = w \circ \bigcup_t [ v_t \circ u_t (\cdot)] $
| Multitask learning | $ h_t(\cdot) = w_t \circ v \circ \bigcup_t u_t (\cdot)$
| .ye[Multitask ensemble representation learning] | $ h\_t(\cdot) = w\_t \circ \bigcup\_{t'} [v\_{t,t'} \circ u\_{t'} (\cdot) ] $
---
### Lifelong Learning Schema
![:scale 100%](images/learning-schemas.png)
- Any learner with an explicit internal representation is ok,
- e.g., decision trees, decision forests, deep networks
<!-- - SVM's are not obviously -->
---
### General Representations
- Transformers learn representations
- We desire representations that are sufficient for one task, and useful for other tasks
- Decision trees, decision forests, and deep nets (with ReLU nodes) .ye[partition] feature space into polytopes
![:scale 100%](images/deep-polytopes.png)
<!-- <img src="images/deep-polytopes.png" style="width:500px;"/> -->
---
### Partition and Vote
<!-- TODO@ali make this slide-->
- Given query space $\mathcal{Q}$, say $\mathbb{R}^d$ and $n$ samples, for example $(x\_i, y\_i)_{i = 1}^{n}$, where $x_i \in \mathbb{R}^d$ and $y_i$ could be labels
- Transformer $u$ partitions $\mathcal{Q}$, mapping $x_i$ to its corresponding cell in the partition
- Voter $v$ scores actions in each cell (e.g. empirical posterior distribution) given how the cells are populated by the transformer
- For a given test query $x$,
  - $u$ maps $x$ to its cell
  - $v$ votes on actions in this cell
  - decider $w$ chooses action based on $v$'s votes (e.g. arg max)
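
The three steps can be sketched with the simplest possible partition, a regular grid. The grid width, the sample points, and the color labels are assumptions for illustration; trees and ReLU networks would produce data-dependent polytopes instead.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Cell = Tuple[int, ...]


def u(x: Sequence[float], width: float = 1.0) -> Cell:
    """Transformer: map x to the index of its grid cell."""
    return tuple(int(xi // width) for xi in x)


def fit_voter(samples: List[Tuple[Sequence[float], Hashable]]) -> Dict[Cell, Counter]:
    """Populate each cell with the labels of the training points that land in it."""
    counts: Dict[Cell, Counter] = defaultdict(Counter)
    for x, y in samples:
        counts[u(x)][y] += 1
    return dict(counts)


def v(counts: Dict[Cell, Counter], cell: Cell) -> Dict[Hashable, float]:
    """Voter: empirical posterior over labels in a cell (empty if unpopulated)."""
    labels = counts.get(cell, Counter())
    total = sum(labels.values())
    return {y: c / total for y, c in labels.items()}


def w(votes: Dict[Hashable, float]) -> Optional[Hashable]:
    """Decider: arg max over the votes, or None when there are no votes."""
    return max(votes, key=votes.get) if votes else None


samples = [((0.2, 0.3), "green"), ((0.7, 0.1), "green"), ((0.5, 0.9), "purple"),
           ((1.4, 0.2), "purple"), ((1.8, 0.6), "purple")]
counts = fit_voter(samples)
action = w(v(counts, u((0.4, 0.4))))   # the query lands in a mostly green cell
```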
---
### Ensemble, Partition and Vote
- Get a transformer class $U$, with each $u \in U$ partitioning $\mathbb{R}^d$ differently
- Get a voter class $V$, with each $v \in V$ voting differently on each cell in a given partition, and make it vote on every partition
- Decider $w$ ensembles the votes on actions from each voter to decide on an action given a test query $x$
---
### Lifelong Learning Algorithm
For each new task,
1. learn a new representation function,
2. apply it to all data from all tasks: the updated representation for everything is the composition of this new representation with existing representations,
3. update all decision rules using this representation.
Notes:
- This linearly increases representation capacity.
- Without increasing representation capacity, performance on all tasks will necessarily drop to chance levels eventually as number of tasks increases.
- Thus, fixed capacity systems can only lifelong learn insofar as they are inefficient (unnecessarily big) for individual tasks.
<!-- TODO@jv: somewhere must introduce the concept of adjusting representations -->
---
### Pseudocode
- Given $\color{magenta}{j-1}$ transformers learned from the previous $\color{magenta}{j-1}$ datasets and a new $\color{yellow}{j^{th}}$ dataset with task label $\color{yellow}{t_j}$, do:
  - learn a new transformer using $\color{yellow}{j^{th}}$ data
  - .magenta[reverse transfer update] for each of the $\color{magenta}{j-1}$ previous tasks:
    1. transform a subset of the data through the $\color{yellow}{j^{th}}$ transformer (this requires having stored some of the data)
    2. learn a new voter using the $\color{yellow}{j^{th}}$ representation of data
    3. update decision rules by appending this additional voter
  - .ye[forward transfer update] for all data associated with $\color{yellow}{j^{th}}$ task:
    1. transform a subset of the data through the $\color{yellow}{j^{th}}$ transformer
    2. transform through each of the $\color{magenta}{j-1}$ existing transformers
    3. learn a new voter for all $j$ transformers
    4. make decision rule by averaging over $j$ voters
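
The update loop can be sketched on 1-D data. This is a toy under stated assumptions, not the authors' implementation: the class `ProgressiveLearner`, its single-cut transformer, and the decision to store all data are all hypothetical simplifications.

```python
from collections import Counter, defaultdict
from statistics import mean


class ProgressiveLearner:
    def __init__(self):
        self.transformers = []   # one transformer per task seen so far
        self.data = {}           # task id -> stored (x, y) samples
        self.voters = {}         # (task id, transformer index) -> {cell: Counter}

    @staticmethod
    def _learn_transformer(samples):
        # Hypothetical transformer: cut the line at the average of the class means.
        by_class = defaultdict(list)
        for x, y in samples:
            by_class[y].append(x)
        cut = mean(mean(xs) for xs in by_class.values())
        return lambda x: int(x > cut)

    def _learn_voter(self, task, k):
        # Count labels of this task's stored data in each cell of transformer k.
        counts = defaultdict(Counter)
        for x, y in self.data[task]:
            counts[self.transformers[k](x)][y] += 1
        self.voters[(task, k)] = counts

    def add_task(self, task, samples):
        self.data[task] = list(samples)
        self.transformers.append(self._learn_transformer(samples))
        newest = len(self.transformers) - 1
        # Reverse transfer update: each earlier task gains a voter on the new transformer.
        for old_task in self.data:
            if old_task != task:
                self._learn_voter(old_task, newest)
        # Forward transfer update: the new task gets a voter on every transformer.
        for k in range(len(self.transformers)):
            self._learn_voter(task, k)

    def predict(self, task, x):
        # Decider: sum the posteriors of all voters for this task, then arg max.
        scores = Counter()
        for k, u in enumerate(self.transformers):
            cell = self.voters[(task, k)][u(x)]
            total = sum(cell.values())
            for y, c in cell.items():
                scores[y] += c / total
        return scores.most_common(1)[0][0] if scores else None


learner = ProgressiveLearner()
learner.add_task("A", [(-2, 0), (-1, 0), (1, 1), (2, 1)])
learner.add_task("B", [(0, 1), (1, 1), (3, 0), (4, 0)])
```

After two tasks the learner holds two transformers and four voters, one per (task, transformer) pair, which is the linear growth in representation noted earlier.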
---
name:results
### Outline
- [Learning](#learn)
- [Ensembling](#rep)
- Experiments
- [Theory](#theory)
- [Brains](#neuro)
- [Discussion](#disc)
---
### A Transfer Example
- .ye[XOR]
  - Samples in the (0,0) and (1,1) quadrants are purple
  - Samples in the (0,1) and (1,0) quadrants are green
- .lb[N-XOR]
  - Samples in the (0,0) and (1,1) quadrants are green
  - Samples in the (0,1) and (1,0) quadrants are purple
- Optimal decision boundaries for both problems are coordinate axes
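
A minimal sketch of how such data could be generated. The blob centers at $(\pm 1, \pm 1)$, the spread, and the encoding purple = 0, green = 1 are assumptions for this example; the slides do not specify them.

```python
import random
from typing import List, Tuple

Sample = Tuple[Tuple[float, float], int]


def sample_task(n: int, nxor: bool = False, spread: float = 0.25, seed: int = 0) -> List[Sample]:
    """Draw n points from four Gaussian blobs, labeled by XOR or N-XOR."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        cx, cy = rng.choice((-1, 1)), rng.choice((-1, 1))
        # The label depends on the blob a point is drawn from:
        # same-sign blobs are class 0 (purple), mixed-sign blobs are class 1 (green).
        label = int(cx != cy)
        if nxor:
            label = 1 - label   # N-XOR swaps the two colors
        point = (rng.gauss(cx, spread), rng.gauss(cy, spread))
        samples.append((point, label))
    return samples


xor_data = sample_task(100, seed=1)
nxor_data = sample_task(100, nxor=True, seed=1)
```

With the same seed the two tasks share their points and differ only in labels, which makes explicit why a partition learned on one task is reusable on the other.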
<img src="images/gaussian-xor-nxor.svg" style="width:475px" class="center"/>
---
### XOR vs NXOR Transfer Efficiency
![:scale 100%](images/xor-te.svg)
---
### Lots of Transfer Efficiency
![:scale 100%](images/lotsa-te.svg)
<!--
### Different # of Classes
<img src="images/spiral-all.png" style="height:500px;"> -->
<!-- ## Consider an example -->
---
### CIFAR 10x10
.pull-left[
- *CIFAR 100* is a popular image classification dataset with 100 classes of images.
- 500 training images and 100 testing images per class.
- All images are 32x32 color images.
- CIFAR 10x10 breaks the 100-class problem into 10 tasks, each with 10 classes.
]
.pull-right[
<img src="images/l2m_18mo/cifar-10.png" style="position:absolute; left:450px; width:400px;"/>
]
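
One way to form the ten tasks is to group class indices consecutively. This grouping is an assumption for illustration; the slides do not say which classes go into which task.

```python
from typing import List, Tuple


def split_classes(num_classes: int = 100, classes_per_task: int = 10) -> List[List[int]]:
    """Group class indices into consecutive, equally sized tasks."""
    return [list(range(start, start + classes_per_task))
            for start in range(0, num_classes, classes_per_task)]


def to_task_label(label: int, classes_per_task: int = 10) -> Tuple[int, int]:
    """Map a 100-way label to (task index, label within that task)."""
    return divmod(label, classes_per_task)


tasks = split_classes()
train_images_per_task = 500 * len(tasks[0])   # 500 training images per class
```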
<!--
### Forward Transfer Efficiency
- y-axis indicates .ye[forward transfer efficiency] (FTE),
- which is the ratio of "single task error" to "error using past tasks"
- each algorithm has a line
- if the line .ye[increases], that means it is doing "forward transfer"
-->
---
Lifelong Forests and Networks consistently demonstrate .ye[forward transfer] for every task.
![:scale 100%](images/cifar-100-FTE.svg)
- left: resource building
- right: resource recruiting
<!--
### Backward Transfer Efficiency
- y-axis indicates .ye[backward transfer efficiency] (BTE),
- which is the ratio of "single task error" to "error using future tasks"
- each task will have a line
- if the line .ye[increases], that means it is doing "backward transfer"
-->
---
Lifelong Forests and Networks .ye[uniquely exhibit backward transfer].
![:scale 100%](images/cifar-100-BTE.svg)
- left: resource building
- right: resource recruiting
---
### L2F & L2N transfer on .ye[every task]
![:scale 60%](images/TE.svg)
---
### Language Identification
- 8,194,317 sentences from Wikipedia (downloaded from Facebook).
- 156 languages
- Trained using unsupervised FastText embedding
- words, 2-4 char n-grams embedded into 16 dimensions
- selected 30 languages
- break into batches of 3 "related" languages
![:scale 100%](images/30-languages.png)
---
### Backward Transfer
![:scale 60%](images/language.svg)
<!-- Note RTE >5 for task 4.
-->
---
### Web-Search Categorization
.pull-left[
- Same data as above
- labels now correspond to Microsoft Bing "dominant type"
- 10k training
- 1k testing entities
- 20 classes
- each with ≥11k samples
- 4 classes per task
]
.pull-right[
![:scale 100%](images/bing-dominant-types.png)
]
---
### Backward Transfer
![:scale 60%](images/web.svg)
---
### Outline
- [Learning](#learn)
- [Ensembling](#rep)
- [Experiments](#exp)
- Theory
- [Brains](#neuro)
- [Discussion](#disc)
---
### What do classifiers do?
<br>
learn: given $(x_i,y_i)$, for $i \in [n]$, where $y_i \in \lbrace 0,1 \rbrace$
1. partition feature space into "parts",
2. compute the plurality label of the points in each part.

predict: given $x$
1. find its part,
2. report the plurality vote in its part.
---
### What can regressors do?
<br>
learn: given $(x_i,y_i)$, for $i \in [n]$, where $y_i \in \mathbb{R}$
1. partition feature space into "parts",
2. compute the average response of the points in each part.

predict: given $x$
1. find its part,
2. report the average vote in its part.
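
The learn/predict recipe for regression can be sketched as a histogram regressor on 1-D inputs. The bin width and the four sample points are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, List, Optional, Tuple


def fit_histogram_regressor(samples: List[Tuple[float, float]],
                            width: float = 1.0) -> Callable[[float], Optional[float]]:
    """Learn: bin the inputs into parts and average the responses in each part."""
    parts = defaultdict(list)
    for x, y in samples:
        parts[int(x // width)].append(y)
    averages = {part: mean(ys) for part, ys in parts.items()}

    def predict(x: float) -> Optional[float]:
        # Predict: find the part containing x and report its average (None if empty).
        return averages.get(int(x // width))

    return predict


samples = [(0.1, 1.0), (0.6, 3.0), (1.2, 10.0), (1.9, 14.0)]
predict = fit_histogram_regressor(samples)
```

Replacing the average with a plurality vote turns the same structure into the classifier from the previous slide.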
---
### The fundamental theorem of statistical pattern recognition
If each part is:
1. small enough, and
2. has enough points in it,
then given enough data, one can learn *perfectly, no matter what*!
$$\mathcal{E}\(f_n) \rightarrow \mathcal{E}^*,$$
where $\mathcal{E}^*$ is Bayes optimal.
-- Stone, 1977
<!-- NB: the parts can be overlapping (as in kNN) or not (as in histograms) -->
---
### The fundamental .ye[theorem] of transfer learning
If each cell is:
- small enough, and
- has enough points in it,
then given enough data, one can .ye[transfer learn] *no matter what*!