%!TEX root=separability.tex
%====================================================================
\section{Introduction}
\label{sec:intro}
A common approach to discovery in biology
is to construct experiments or analyses that directly
contrast two specific classes of biological objects.
Examples of this include examining patient samples
contrasting tumor versus normal tissue~\cite{tang2017gepia},
studying the differences in molecular effects
of two competing drug treatments~\cite{lamb2006connectivity},
or characterizing differentially expressed genes
versus genes with unaltered gene expression
in a carefully designed experiment~\cite{liberzon2015molecular}.
As is often the case with genomics, these
biological objects (e.g., tissue samples or drug experiments)
are frequently represented by high dimensional
numeric feature vectors (e.g., tens of thousands of transcript abundance measurements).
To understand the mechanisms that determine
these object classes,
researchers often employ statistical and
machine learning tools to identify a
manageable subset of features that
accentuate or explain the differences between them.
We term this the {\em separability problem}.
Often, researchers address this problem
by ranking each individual feature
using statistical tests~\cite{dudoit2002statistical}
and univariate classifiers~\cite{lai2006comparison}.
These approaches scale linearly in the number of features,
but do not offer insight into the interplay
between important features that can provide a better
characterization of what
distinguishes the two object classes.
Alternatively, larger subsets of separating
features can be identified by more complex
machine learning approaches, such as multivariate
regression with LASSO regularization~\cite{peng2010regularized}
or pattern mining from random forest models~\cite{breiman2001random}.
However, running such methods on datasets with
tens of thousands of features is extremely time-intensive.
Motivated by these observations,
we sought to build a tool that strikes a balance between the
two above-mentioned ``extreme''
strategies, enabling us to retrieve more than just the best {\em singular} features, and to do so in an {\em interactive} manner.
We present \genviz,
an interactive data exploration tool to address the separability problem.
Given a matrix of values for object features
and two labeled object sets, \genviz rapidly identifies
the top ranking pairs of features
that most clearly separate the objects of
the different classes.
We show that feature pairs identified by \genviz often
discriminate between the object classes more significantly than the corresponding
best-ranking individual features, even after accounting for the larger search space.
Thus, by focusing on separating feature pairs, \genviz
offers researchers the ability to gain additional
insight beyond singular features, without the prolonged duration
needed to train a complex machine learning model.
Moreover, identified feature pairs can be effortlessly
visualized in two-dimensional graphical interfaces,
leading to easy interpretation of
the interplay between the features and the object classes.
\begin{figure*}[t]
\centering
\includegraphics[width=0.9\linewidth]{fig/workflow2.eps}
\vspace{-15pt}
\caption{\genviz Workflow.
Given (left) a feature-object matrix and green positive and red negative class labels on the objects, \genviz (center) evaluates all pairs of features using several optimizations to identify (right) the top feature pair and its corresponding visualization that best separates the object classes.}
\label{fig:workflow}
\vspace{-15pt}
\end{figure*}
Quickly identifying top feature pairs
in typical datasets
is a challenging task due to some fundamental limitations:
first, the number of feature pairs can be large;
second, the feature-object matrix must be examined
multiple times to evaluate the separability of
each feature pair;
and third, we cannot compute the result in advance, since
object sets may be provided on-the-fly.
For example, if objects corresponded to experimental
conditions, and features were gene expression measurements,
having about $20K$ gene features
means that we need to evaluate nearly $200M$ feature pairs,
and for each feature pair we need to at least scan all the objects, if using a na\"ive approach.
Therefore, sophisticated methods are required
to make the task of ranking feature pairs interactive.
In developing \genviz,
we identified a measure to score the separability
of a given feature pair for two object classes.
This measure is conducive to
optimization techniques for rapid ranking of feature
pairs, while providing an intuitive way
to interpret
the resulting visualizations.
Satisfying both of these requirements
enables \genviz to support interactive separability analysis
in a manner that promotes the generation and prioritization
of hypotheses for further investigation.
Using this separability measure, we develop a suite of novel
optimizations to make \genviz interactive for realistic data set sizes,
targeting: {\em (a)} elimination of repeated computation;
{\em (b)} pruning feature pairs during early execution;
{\em (c)} sampling to further reduce running time;
and {\em (d)} cleverly traversing the search space
of feature pairs for improved efficiency.
\tr{While similar optimizations have been used in other settings,
e.g., sampling for approximate computation\needcite{},
the techniques need to be adapted for the separability problem.}
We applied \genviz to two genomic datasets.
In one, called \lincs, we find pairs of genes
whose expression discriminates between perturbagen experiments
involving different drug treatments, and in the other, called \msig,
we find pairs of annotations (such as pathway membership)
that separate differentially expressed
cancer genes from others.
Both of these applications are on
objects with high-dimensional feature vectors,
so it is computationally expensive to score the separability
for all possible feature pairs.
With the carefully designed
separability metric of \genviz and its suite of sophisticated optimizations
that accelerates evaluation,
we are able to {\emph{accurately return the
highest ranking separating
feature pairs for both datasets within two minutes}}.
This reflects a \emph{180X} and \emph{400X} speedup
over a competitive baseline for the \msig and \lincs data sets (respectively).
With such speedups, \genviz enables researchers to quickly explore their data,
identify the strongest, most compelling features,
and form hypotheses about the interplay between them and with the object classes.
It also allows researchers to interactively investigate
different definitions of the object classes on the fly,
explore alternative hypotheses,
and set up more in-depth,
longer machine learning-based analyses
that build upon the \genviz results.
We further performed an in-depth analysis
for nine distinct drug treatments in the \lincs dataset and
found 1070 feature (gene) pairs that had significant separability scores and outperformed
their corresponding single features. These gene pairs were enriched
in literature support for known relationships between the genes and the drug,
as well as known interactions between the genes themselves.
%==================================================================
\section{Methods}
\label{sec:method}
We begin by formally defining the {\em separability} problem,
introducing our separability metric,
and finally detailing optimizations that enable
the rapid identification of the best separating feature pairs.
%==================================================================
\subsection{Problem Definition}\label{sec:prob}
Let $\mm$ be a feature-object matrix of size $m\times N$,
where each row is a feature and each column is an
object as shown in Figure~\ref{fig:workflow}.
One example is a feature-object matrix in which
each object corresponds to a tissue sample
from a cancer patient and each feature corresponds to a gene,
so that the $(i,j)^{th}$ entry represents the
expression level of the $i^{th}$ gene in the $j^{th}$
tissue sample.
We denote the $m$ features as $\ff=\{f_1,f_2,\cdots,f_m\}$
and $N$ objects as $\oo=\{o_1,o_2,\cdots,o_N\}$.
Each entry $\mm_{i,j}$ in $\mm$ corresponds
to the value of feature $f_i$ for object $o_j$ as illustrated in Figure~\ref{fig:workflow}.
We are also given two non-overlapping sets of objects,
one with a positive label, $\oo_+$ and the other with a negative label, $\oo_-$.
\eat{Both sets are disjoint subsets of all objects,
i.e., $\oo_+, \oo_-\subset \oo$, and $\oo_+\cap \oo_- = \emptyset$.}
In our example, tumor samples, $\oo_+$,
may be assigned the positive label,
and the healthy tissue samples, $\oo_-$,
the negative label.
The number of labeled objects, $n$, is equal to $|\widehat{\oo}|$ where $\widehat{\oo}=\oo_+\cup \oo_-$. \eat{$\widehat{\oo}$ be the union of positive and negative objects; $n$ is the total number of labeled objects, i.e., } Also, let $l_k$ be the label of object $o_k\in \widehat{\oo}$, i.e., $l_k=1$ if $o_k$ is positive and $l_k=-1$ if $o_k$ is negative.
\eat{Now, given $\mm$ and sets
$\oo_+$ and $\oo_-$ as depicted in Figure~\ref{fig:workflow}(a),}
\genviz aims to find feature pairs that
best separate the objects
in $\oo_+$ from those in $\oo_-$ using only those features,
and then output a visualization that demonstrates the separability
(see Figure~\ref{fig:workflow}).
(We will define the metric for separability subsequently.)
A feature pair that leads to a good ``visual'' separation
between the positive and the negative sets may be able
to explain or characterize their differences via an interesting,
non-trivial relationship among the features.
The overall workflow is depicted in Figure~\ref{fig:workflow}.
% Since we intend \genviz to be an interactive data exploration tool to be employed before more time-consuming machine learning methods, we allow for minor sacrifices to its accuracy in exchange for substantial improvements in its running time.
We now formally define the separability problem.
\begin{formulation}[Separability]\label{prob:separability}
Given a feature-object matrix $\mm$ and two labeled object sets $(\oo_+,\oo_-)$, identify the \topk feature pairs $(f_i,f_j)$ that separate $\oo_+$ from $\oo_-$ based on a given separability metric.
\end{formulation}
\noindent %Here a separability metric is a scoring function for any given feature pair, and \topk feature pairs refer to the $k$ feature pairs with the highest score under the separability metric.
We will describe our separability metric in Section~\ref{sec:metric}, and then discuss optimization techniques in Section~\ref{sec:opt}. The notation used in the description of the method is summarized in Supplementary Table~\ref{appT:notation}.
%==================================================================
\subsection{Separability Metric}\label{sec:metric}
Given a feature pair $(f_i,f_j)$ as axes, we can visualize the object sets $\oo_+$ and $\oo_-$ in a 2-D space, where each object corresponds to a point whose x-value and y-value are the object's values on features $f_i$ and $f_j$ respectively. A desirable (i.e., both interesting and interpretable) visualization would be one in which the objects are {\em linearly separated}, defined as follows. Two sets of objects, i.e., $\oo_+$ and $\oo_-$, are said to be \emph{linearly separable}~\cite{medin1981linear} if there exists at least one straight line such that $\oo_+$ and $\oo_-$ lie on opposite sides of it.
We focus on metrics that capture this linear separation, since it corresponds to an intuitive 2-D visualization.
\eat{Given a feature pair $(f_i,f_j)$, we can represent a line $\ell$ as $w_i\cdot x + w_j\cdot y +w_0 =0$, where $x$ and $y$ represent an object's value on feature $f_i$ and $f_j$ respectively, and $w_0$, $w_i$ and $w_j$ are coefficients, where $w_j>0$.
\eat{
\begin{equation}\label{eqn:line}
\begin{split}
\ell: \hspace{4mm} w_i\cdot x + w_j\cdot y +w_0 =0
\end{split}
\end{equation}
}
}
Given a feature pair $(f_i,f_j)$ and a line $\ell$, we can predict the label of an object $o_k$, denoted as $\eta_{i,j}^{\ell,k}$, using Equation~\ref{eqn:est_label} below, where $w_0$, $w_i$ and $w_j$ are coefficients of $\ell$ and $w_j>0$:
%based on the separating line in Equation~\ref{eqn:line}. Let $\eta_{i,j}^{\ell,k}$ be the predicted label of $o_k$, calculated using Equation~\ref{eqn:est_label} below:
\begin{equation}\label{eqn:est_label}
{\sf Predicted \hspace{1mm} Label:}\hspace{3mm} \eta_{i,j}^{\ell,k}=sign(w_i\cdot \mm_{i,k} + w_j\cdot \mm_{j,k} +w_0)
\end{equation}
\noindent If $o_k$ lies above the line $\ell$, i.e., $o_k$ has a higher y-value than the point on $\ell$ with the same x-value as $o_k$, then $\eta_{i,j}^{\ell,k}=1$; otherwise, $\eta_{i,j}^{\ell,k}=-1$. Let $\theta_{i,j}^{\ell,k}$ be the indicator variable denoting whether \eat{an object $o_k$ is correctly separated (i.e.,} the sign of the predicted label matches the real label $l_k$\eat{)}: if $\eta_{i,j}^{\ell,k}\cdot l_k = 1$, then $\theta_{i,j}^{\ell,k} = 1$; otherwise, $\theta_{i,j}^{\ell,k} = 0$.
\eat{
\begin{equation}\label{eqn:s_object}
\vspace{-5mm}
\small
\theta_{i,j}^{\ell,k}=\left\{
\begin{array}{ll}
1 \textit{\hspace{2mm} if \hspace{2mm}} \eta_{i,j}^{\ell,k}\cdot l_k = 1\\
0 \textit{\hspace{2mm} otherwise \hspace{2mm}}
\end{array}
\right.
\end{equation}
}
\eat{\noindent If there is a line $\ell$ such that for every object $o_k\in \widehat{\oo}$, the object is correctly separated (i.e., $\theta_{i,j}^{\ell,k}$ = 1), then we say $\oo_+$ and $\oo_-$ are linearly separable.}
% \begin{equation}\label{eqn:linear}
% \eta_{i,j}^{\ell,k} = \left\{
% \begin{array}{ll}
% 1 \textit{\hspace{2mm} if \hspace{2mm}} o_k\in \oo_+, \textit{ i.e., } l_k=1 \\
% -1 \textit{\hspace{2mm} if \hspace{2mm}} o_k\in \oo_-, \textit{ i.e., } l_k=-1
% \end{array}
% \right.
% \end{equation}
\genviz's separability metric captures \eat{For \genviz, we rank feature pairs by a separability metric based on }{\em how well the objects in the feature pair's 2-D visualization can be linearly separated}, formally defined next.
%The basic intuition is that the users can easily recognize and interpret linear separability in a 2-D visualization. Thus if a visualization is perfectly linearly separated, it should have high score in our proposed separability metric; otherwise, we would find the best separating line $\ell$ in the visualization and report the largest number of correctly separated objects as the separability score.
Given a feature pair $(f_i, f_j)$ and a line $\ell$, the separability score of the line (denoted $\theta_{i, j}^\ell$) is defined as the sum of the indicators ($\theta_{i,j}^{\ell,k}$) for all objects: $\theta_{i,j}^{\ell}= \sum_{k}{\theta_{i,j}^{\ell,k}}$.
\eat{
as shown in Equation~\ref{eqn:s_line}:
\begin{equation}\label{eqn:s_line}
\hspace{3mm} \theta_{i,j}^{\ell}= \sum_{k}{\theta_{i,j}^{\ell,k}}
\end{equation}
}
\noindent Figure~\ref{fig:brute_force} shows separability scores $\theta_{i, j}^\ell$ for different separating lines. For example, the separating line with $\theta_{i, j}^\ell=12$ correctly separates six green points and six red points. The final separability score for a feature pair $(f_i,f_j)$ (denoted as $\theta_{i, j}$) is defined as the best separability score $\theta_{i, j}^{\ell}$ among all possible lines $\ell$. Accordingly, we define the overall separability error of the feature pair as $err_{i,j}=n-\theta_{i, j}$.
%As shown in Equation~\ref{eqn:s_viz},
%\begin{equation}\label{eqn:s_viz}
%\hspace{3mm} \theta_{i,j}= \max_{\ell}\{\theta_{i,j}^{\ell}\} %{\sf Separability \hspace{1mm} %Score:}
%\end{equation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%NEED TOGGLE%%%%%%%%%%%%%%%%%%%%%%
% \begin{figure}[h]
% \centering
% \begin{subfigure}{0.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/metrics.pdf}
% \vspace{-5mm}
% \caption{Brute Force}%{$\theta_{i,j}^{\ell}$ with Different Lines[e]}
% \label{fig:brute_force}
% \end{subfigure}
% \begin{subfigure}{.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/rocchio.pdf}
% \vspace{-5mm}
% \caption{Rocchio-Based}%{$\theta_{i,j}^{\lhat}$ with Representative Line[f]}
% \label{fig:rocchio}
% \end{subfigure}
% \caption{Different methods to Calculate Separability Score $\theta_{i,j}$}
% \label{fig:metric}
% \end{figure}
\stitle{Brute Force Calculation of $\theta_{i,j}$.} As suggested in Figure~\ref{fig:brute_force}, the simplest way to calculate $\theta_{i,j}$ is to first enumerate all possible separating lines $\ell$ and calculate $\theta_{i,j}^\ell$ for each of them. \eat{This is infeasible as there are an infinite number of possible lines. However, }We can easily trim down the search space to $O(n^2)$ lines by considering only the lines through the points of every pair of objects in the 2-D plane. This is because the results of all other possible lines are covered by these $O(n^2)$ lines~\cite{vapnik1998statistical}. Nevertheless, it is still very time-consuming to consider $O(n^2)$ lines for each feature pair $(f_i,f_j)$.
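For concreteness, a minimal Python sketch of this brute-force enumeration could look as follows (we assume NumPy arrays \texttt{x} and \texttt{y} holding the labeled objects' values on $f_i$ and $f_j$, and \texttt{labels} holding their $\pm 1$ labels; the names are illustrative only, not part of \genviz):
\begin{verbatim}
import numpy as np
from itertools import combinations

def brute_force_separability(x, y, labels):
    """theta_{i,j}: best number of correctly separated objects over the
    O(n^2) candidate lines, one through each pair of object points."""
    n = len(labels)
    best = 0
    for a, b in combinations(range(n), 2):
        # Line through points a and b: w_i*x + w_j*y + w_0 = 0
        w_i = y[b] - y[a]
        w_j = x[a] - x[b]
        w_0 = -(w_i * x[a] + w_j * y[a])
        pred = np.sign(w_i * x + w_j * y + w_0)
        # Count correct objects for both orientations of the line
        best = max(best,
                   int(np.sum(pred == labels)),
                   int(np.sum(-pred == labels)))
    return best
\end{verbatim}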
\eat{
\begin{table}[t!]
\centering
\small
\begin{tabular}{|c|c|c|c|}
\hline
Symb. & Description & Symb. & Description\\
\hline
\hline
$\mm$ & feature-object matrix & $\ff$ & feature set in $\mm$ \\
\hline
$f_i$ & feature $i$ in $\ff$ & $m$ & number of features in $\ff$\\
\hline
$\oo$ & object set in $\mm$ & $N$ & number of objects in $\oo$\\
\hline
$\oo_+$ & positive object set & $\oo_-$ & negative object set\\
\hline
$\widehat{\oo}$ & labelled object set & $n$ & number of labelled objects in $\widehat{\oo}$\\
\hline
$o_k$ & object $k$ in $\widehat{\oo}$ & $l_k$ & label of object $o_k$\\
\hline
$\ell$ & separating line in 2-D & $\lhat$ & representative line in 2-D\\
\hline
$\eta_{i,j}^{\ell,k}$ & predicted label of $o_k$ & $\theta_{i,j}^{\ell,k}$ & $o_k$ is correctly separated? \\
\hline
$\theta_{i,j}^{\ell}$ & \# correctly separated $o_k$ & $\theta_{i,j}$ & separability score\\
\hline
$\widehat{\mm}$ & $\mm$ after transformation & $\tilde{\theta}_{i,j}$ & estimated $\theta_{i,j}$\\
\hline
\end{tabular}
\caption{Notation used in this paper.}
\label{tbl:notation}
\vspace{-18pt}
\end{table}
}
\stitle{Rocchio-based Measure.} \eat{Rather than enumerating the $O(n^2)$ possible lines to find the one with maximum $\theta_{i,j}^{\ell}$, }We can speed up the process by selecting a single {\em representative line} $\lhat$ that provides an estimate of the true separability score $\theta_{i, j}$\eat{estimated separability score, $\theta_{i,j}^{\lhat}$, that approximates the true linear separability score $\theta_{i, j}$}.
%as shown in Equation~\ref{eqn:s_viz_appr}, .
%\begin{equation}\label{eqn:s_viz_appr}
%\theta_{i,j} \leftarrow \theta_{i,j}^{\lhat} %\approx
%\end{equation}
In order to achieve a fast and reliable estimate, we select our representative line based on Rocchio's algorithm~\cite{rocchio1971relevance}.
%As we will show later in Section~\ref{sec:exp}, $\theta_{i,j}^{\lhat}$ is comparable to $\theta_{i,j}$ when using Rocchio-based representative line $\lhat$. Let us describe in detail about the Rocchio-based representative line.
%The intuition behind using Rocchio's algorithm is that each object's predicted label should be the class label of its nearest class centroid.
Let us denote the centroids of positive objects $\oo_+$ and negative objects $\oo_-$ for a given $(f_i,f_j)$ as $\mu_{i,j}^+=(\mm_i^+,\mm_j^+)$ and $\mu_{i,j}^-=(\mm_i^-,\mm_j^-)$ respectively, where $\mm_i^+$ and $\mm_j^+$ are the values of the centroids of the positive objects on feature $f_i$ and $f_j$, and $\mm_i^-$ and $\mm_j^-$ are the values of the centroids of the negative objects on feature $f_i$ and $f_j$. The perpendicular bisector of the line joining the two centroids is selected as the representative separating line $\lhat$ (see Figure~\ref{fig:rocchio}), with its coefficients corresponding to Equation~\ref{eqn:est_label} defined as {\small $w_i = \mm_i^+-\mm_i^-$}, {\small $w_j = \mm_j^+-\mm_j^-$}, and {\small $w_0 = -(\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2}+\frac{(\mm_j^+)^2-(\mm_j^-)^2}{2})$}.
\eat{following the template of Equation~\ref{eqn:line} with coefficients as in Equation~\ref{eqn:rep_line}:
%\lhat: \hspace{4mm} (\mm_i^+-\mm_i^-)\cdot x + (\mm_j^+-\mm_j^-)\cdot y -(\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2}+\frac{(\mm_j^+)^2-(\mm_j^-)^2}{2}) =0
\begin{equation}\label{eqn:rep_line}
\begin{split}
& w_i = \mm_i^+-\mm_i^- \\
& w_j = \mm_j^+-\mm_j^- \\
& w_0 = -(\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2}+\frac{(\mm_j^+)^2-(\mm_j^-)^2}{2})
\end{split}
\end{equation}}
\eat{In Figure~\ref{fig:rocchio}, the ``Rocchio-based'' measure with the representative separating line $\theta_{i,j}^{\lhat}$ equals $13$ with one negative object (red point) mis-predicted as positive, while the true $\theta_{i,j}$ also equals $13$.}
\stitle{Brute-force vs. Rocchio-based.} Compared to the brute force calculation, the Rocchio-based measure is much more lightweight, but at the cost of some accuracy in calculating $\theta_{i,j}$. Intuitively, the representative line is a reasonable proxy for the best separating line since the Rocchio-based measure \eat{computes the centroids of the two classes as their representatives and }assigns each object to its nearest centroid. We further empirically demonstrate that $\theta_{i,j}^{\lhat}$ is a good proxy for $\theta_{i,j}$ in Section~\ref{sec:exp_comp}. Thus, we will focus on the Rocchio-based measure subsequently, removing $\lhat$ (or $\ell$) from the superscripts where it appears, and using $\theta_{i,j}$ and $\theta_{i,j}^{\lhat}$ interchangeably.
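As an illustration, a short Python sketch of the Rocchio-based score for one feature pair, following the centroid and coefficient definitions above, might read as follows (we assume \texttt{Mi} and \texttt{Mj} are the rows of the feature-object matrix for $f_i$ and $f_j$ restricted to the labeled objects, and \texttt{labels} holds $\pm 1$; the names are ours):
\begin{verbatim}
import numpy as np

def rocchio_separability(Mi, Mj, labels):
    """Separability score theta_{i,j} using the Rocchio representative line."""
    pos, neg = labels == 1, labels == -1
    # Class centroids on features f_i and f_j
    mi_pos, mj_pos = Mi[pos].mean(), Mj[pos].mean()
    mi_neg, mj_neg = Mi[neg].mean(), Mj[neg].mean()
    # Perpendicular bisector of the segment joining the two centroids
    w_i = mi_pos - mi_neg
    w_j = mj_pos - mj_neg
    w_0 = -((mi_pos**2 - mi_neg**2) / 2 + (mj_pos**2 - mj_neg**2) / 2)
    # Predicted labels and number of correctly separated objects
    pred = np.sign(w_i * Mi + w_j * Mj + w_0)
    return int(np.sum(pred == labels))
\end{verbatim}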
% \xagp{you have used Rocchio's alg/measure/representative line. Maybe avoid some of them for consistency.}
%==================================================================
\subsection{Proposed Suite of Optimizations}\label{sec:opt}
\begin{figure}[t!]
\centering %%% not \center
\vspace{-2mm}
\subfigure[Brute Force]{\label{fig:brute_force}\includegraphics[width=.2\textwidth]{fig/metric.eps}}
\subfigure[Rocchio-Based]{\label{fig:rocchio}\includegraphics[width=.2\textwidth]{fig/rocchio.eps}}
\vspace{-5mm}
\caption{Calculating Separability Score $\theta_{i,j}$. The scored separating line can be defined using (a) brute force (a few sample lines are shown) or (b) the representative line from a Rocchio-based measure based on the object class centroids (white circles).}
\vspace{-5mm}
\label{fig:metric}
\end{figure}
In this section, we first analyze the time complexity of identifying the \topk feature pairs using the Rocchio-based measure, and then propose several optimization techniques to reduce the complexity.
\stitle{Time Complexity Analysis.} For\eat{In order to calculate the separability of} a given feature pair $(f_i, f_j)$, if we have already calculated the class centroids for each feature, the separating line $\lhat$ can be calculated in $O(1)$\eat{ time using Equation\eat{~\ref{eqn:rep_line}} for $w_i$, $w_j$ and $w_0$}. We can then calculate the number of correctly separated objects $\theta_{i,j}$ via\eat{ by evaluating all objects with respect to the line, i.e.,} $O(n)$ evaluations. Since there are $O(m^2)$ feature pair candidates, the total time complexity is $O(m^2n)$, which can be very large,
since $m$ and $n$ are typically large.
\stitle{Optimizations: Overview.} To reduce the time complexity, we introduce two categories of optimizations:
those that reduce the amount of time
for fully evaluating a given feature pair (Section~\ref{ssec:trans},~\ref{ssec:earlyT}) and
those that reduce the number of feature pairs
that require full evaluation (Section~\ref{ssec:sampling},~\ref{ssec:traversal}).
In the following, we refer to these optimizations as {\em modules} to indicate that they
can be used in any combination---however, in reality, careful engineering is necessary
to ``stitch'' these modules together to multiply the effects of the optimizations.
\eat{The first optimization module within the first category, }\trans module (Section~\ref{ssec:trans}) reduces redundant calculations across feature pairs by mapping the feature-object matrix $\mm$ into a new space that enables faster evaluation of object labeling\eat{, and thus faster calculation of separability score}. \eat{The second optimization module, }\earlyT module (Section~\ref{ssec:earlyT}) takes advantage of the fact that evaluation of a poorly separating feature pair can be terminated early without having to evaluate the separability of all $n$ objects.
\eat{The first optimization module within the second category, }\sampling module
(Section~\ref{ssec:sampling}) first identifies likely \topk feature pair candidates by evaluating their separability on a sampled subset of all objects, and then conducts full evaluations only on these feature pair candidates.
Finally, \traversal module (Section~\ref{ssec:traversal})
reduces the number of feature pairs checked by greedily
choosing feature pairs based on the separability of the
corresponding single features. These optimization modules can be used on their own or combined with each other.
In Section~\ref{sec:exp}, we will show how these optimization modules greatly reduce the running time of finding the \topk separating feature pairs without significantly affecting the accuracy.
%==================================================================
\subsubsection{Pre-Transformation for Faster Feature Pair Evaluation} \label{ssec:trans}
We observe that there is massive redundancy across the computation of $\theta_{i,j}$ for different feature pairs. Motivated by this, we propose the \trans optimization module, which pre-calculates common computational components once per feature and reuses them when evaluating the separability of each feature pair. This \trans module transforms the original matrix $\mm$ into a new matrix $\widehat{\mm}$ using the identified common components and updates the separability score equation accordingly. Specifically, with the transformed feature-object matrix $\widehat{\mm}$,
evaluating whether an object was correctly separated is simplified as: if $sign(\widehat{\mm}_{i,k} + \widehat{\mm}_{j,k}) = 1$, then $\theta_{i,j}^{k}=1$; otherwise, $\theta_{i,j}^{k}=0$. Details and an example can be found in Supplementary Note~\ref{app:trans} and Supplementary Figure~\ref{appF:transform}.
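As a rough sketch, the pre-transformation and the simplified per-object check can be written as follows (a minimal illustration consistent with the transformation summarized above; we assume \texttt{M} is the feature-object matrix restricted to the labeled objects $\widehat{\oo}$ and \texttt{labels} holds $\pm 1$; the names are ours):
\begin{verbatim}
import numpy as np

def pre_transform(M, labels):
    """Fold each feature's class centroids and the object labels into M once,
    so that each pair evaluation reduces to a sign check per object."""
    pos, neg = labels == 1, labels == -1
    mu_pos = M[:, pos].mean(axis=1)   # per-feature centroid of positive objects
    mu_neg = M[:, neg].mean(axis=1)   # per-feature centroid of negative objects
    w = (mu_pos - mu_neg)[:, None]
    w0 = (-(mu_pos**2 - mu_neg**2) / 2)[:, None]
    return (w * M + w0) * labels[None, :]   # Mhat

def pair_score(Mhat, i, j):
    """theta_{i,j}: number of correctly separated objects for (f_i, f_j)."""
    return int(np.sum(np.sign(Mhat[i] + Mhat[j]) == 1))
\end{verbatim}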
\eat{
Let us review the process of computing the separability $\theta_{i,j}$. Given a feature pair $(f_i,f_j)$ and the corresponding positive and negative centroids, {\em (a)} we first compute $w_0$, $w_i$ and $w_j$ for $\lhat$\eat{ based on Equation~\ref{eqn:rep_line}}. Next, for each object $o_k$, {\em (b)} we obtain the predicted label $\eta_{i,j}^{\lhat,k}$ according to Equation~\ref{eqn:est_label}. This step requires two multiplications and three additions. Finally, {\em (c)} we calculate $\theta_{i,j}^{k}$ and the separability $\theta_{i,j}$ based on Equations in Section~\ref{sec:metric}\eat{Equation~\ref{eqn:s_object} and \ref{eqn:s_line} respectively}. This whole process is repeated for every feature pair candidate. However, there is massive redundancy across the processing of different feature pairs. For instance, when calculating the separability for two different feature pairs $(f_i,f_j)$ and $(f_i,f_{j'})$ with a common $f_i$, $w_i$ is in fact shared, and calculation of $w_i\cdot \mm_{i,k}$ in Equation~\ref{eqn:est_label} is repeated for each object $o_k$.
\tr{\sinha{is there another work we can cite that has done this?} \silu{Aditya's another paper~\cite{parameswaran2012crowdscreen} did similar trick, but in different problem. Do we want to cite it?}\agp{I think it's too far off.}}
Given this, we propose the \trans optimization module which will pre-calculate some common computational components once across different features and reuse these components the separability for each feature pair to eliminate the repeated computation. This \trans module transforms the original $\mm_{i,k}$ matrix into another space using our identified common feature pair components and updates the separability score equation accordingly.
For each feature $f_i$, we find the average values of the positive and negative objects for that feature, $\mm_i^+$ and $\mm_i^-$ respectively, and then we pre-transform $\mm_{i,k}$, i.e., the value of object $o_k$ on the feature $i$, to $\widehat{\mm}_{i,k}=$ $\Big( (\mm_i^+-\mm_i^-)\cdot \mm_{i,k}-\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2} \Big) \cdot l_k$.
\eat{
\begin{equation}\label{eqn:matrix_transform}
\widehat{\mm}_{i,k} = \Big( (\mm_i^+-\mm_i^-)\cdot \mm_{i,k}-\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2} \Big) \cdot l_k
\end{equation}
}
The basic idea is to decompose
Equation~\ref{eqn:est_label} into two components,
with each one only related to a single feature.
This transformation incorporates
the class centroids into the matrix values,
obviating their integration later
for every feature pair that involves the given feature.
We\eat{Equation~\ref{eqn:matrix_transform}} also multiplies
in the class label of the object, $l_k$,
rather than repeating this multiplication every time the object is evaluated (see example in \transfig). With this transformation of the feature-object matrix,
evaluating whether an object was correctly separated is simplified as: if $sign(\widehat{\mm}_{i,k} + \widehat{\mm}_{j,k}) = 1$, then $\theta_{i,j}^{k}=1$; otherwise, $\theta_{i,j}^{k}=0$. \eat{from Equation~\ref{eqn:s_object} into Equation~\ref{eqn:s_object_transform}:
% \agp{confusing because $L$ is not there in Equation~\ref{eqn:s_object}}\silu{maybe add "Since our focus is on Rocchio-based measure, we replace the separating line $\ell$ in Equation~\ref{eqn:s_object} with the representing line $L$ in Equation~\ref{eqn:s_object_transform}."}
%, and get rid of checking the real label $l_k$ in Equation~\ref{eqn:s_object}
\begin{equation}\label{eqn:s_object_transform}
\theta_{i,j}^{k}=\left\{
\begin{array}{ll}
1 \textit{\hspace{2mm} if \hspace{2mm}} sign(\widehat{\mm}_{i,k} + \widehat{\mm}_{j,k}) = 1\\
0 \textit{\hspace{2mm} otherwise \hspace{2mm}}
\end{array}
\right.
\end{equation}
%\begin{equation}\label{eqn:est_label_transform}
%\eta_{i,j}^{\lhat,k}=\sign(\widehat{\mm}_{i,k} + \widehat{\mm}_{j,k})
%\end{equation}
After pre-transformation, we are now ready
to compute the separability score $\theta_{i,j}$.
Given a feature pair $(f_i,f_j)$, we can compute $\theta_{i,j}^{k}$
for each object $o_k$ based on Equation~\ref{eqn:s_object_transform}.}
Note that this step only involves one addition and one comparison and is performed only once for each feature.
Next, we can calculate overall separability score $\theta_{i,j} = \sum_{k}{\theta_{i,j}^{k}}$\eat{ like in Equation~\ref{eqn:s_line} without the pre-transformation}. Overall,
we not only eliminate the steps of computing
$w_0$, $w_i$ and $w_j$ for every feature pair,
but also reduce the cost of calculating
$\eta_{i,j}^{k}$ in Equation~\ref{eqn:est_label}.
In the following sections,
we will consider $\widehat{\mm}$ using pre-transformation instead of $\mm$, and calculate $\theta_{i,j}$ accordingly\eat{
and Equation~\ref{eqn:s_object_transform} instead of
Equation~\ref{eqn:est_label} and~\ref{eqn:s_object}}.}
%Hence, by conducting transformation in Equation~\ref{eqn:matrix_transform} and~\ref{eqn:s_object_transform}, we have mapped our problem to a new space.
%\begin{figure}[h]
%\centering %%% not \center
%\vspace{-2mm}
%\subfigure[Pre-Transformation]{\label{fig:transform}\includegraphics[width=.235\textwidth]{fig/transformation.pdf}}
%\subfigure[Early Termination ($\tau = 1$)]{\label{fig:earlyT}\includegraphics[width=.235\textwidth]{fig/earlyT.pdf}}
%\vspace{-5mm}
%\caption{Optimizations for Feature Pair Evaluation. (a) The \trans module is applied once to the original values in the feature-object matrix $\mm$ (above) using Equation~\ref{eqn:matrix_transform} to produce $\widehat{\mm}$ (below). (b) With the \earlyT module, a feature pair can be discarded from consideration for the \topk before all objects are evaluated (above) with even greater speedup potential after re-ordering the objects (below).}
%\vspace{-8mm}
%\label{fig:trans_term}
%\end{figure}
%\begin{example}[Transformation]
%Figure~\ref{fig:transform} illustrates the transformation for two features, $f_i$ and $f_j$. The top half depicts $\mm_{i,k}$ and $\mm_{j,k}$ before transformation, where green color represents a positive label and red color represents a negative label. In this example, the centroids of the positive and negative objects are $\mu_{i,j}^+=(5,7)$ and $\mu_{i,j}^-=(3,5)$ respectively. Hence, we can rewrite Equation~\ref{eqn:matrix_transform} as $\widehat{\mm}_{i,k} = (2\mm_{i,k}-8)\cdot l_k$ and $\widehat{\mm}_{j,k} = (2\mm_{i,k}-12)\cdot l_k$ for features $f_i$ and $f_j$ respectively. After calculation, we can obtain the values for $\widehat{\mm}_{i,k}$ and $\widehat{\mm}_{j,k}$ shown in the bottom half of Figure~\ref{fig:transform}.
%\end{example}
\tr{
\sinha{at this point, write a summary of the time complexity of calculating $\theta_{i,j}^{\lhat}$ for a feature pair i,j using transformation (include preprocessing separately)}
\silu{the big-O time complexity is not changed. More precisely, only with constant improvement. Preprocessing takes O(mn) as we discussed in Results section.}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%NEED TOGGLE%%%%%%%%%%%%%%%%%%%%%%
% \begin{figure}[h]
% \centering
% \begin{subfigure}{0.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/transformation.pdf}
% \vspace{-5mm}
% \caption{Pre-Transformation}%{$\theta_{i,j}^{\ell}$ with Different Lines}
% \label{fig:transform}
% \end{subfigure}
% \begin{subfigure}{.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/earlyT.pdf}
% \vspace{-5mm}
% \caption{Early Termination ($\tau = 1$)}%{$\theta_{i,j}^{\lhat}$ with Representative Line}
% \label{fig:earlyT}
% \end{subfigure}
% \caption{Different Methods to Reduce Running Time}
% \label{fig:metric}
% \end{figure}
\tr{
\stitle{Extension to Single Features.} For a feature pair $(f_i,f_j)$ where $i=j$, we can treat the corresponding $\theta_{i,i}$ as the separability score for a single feature $f_i$. Equation~\ref{eqn:s_object_transform} can be further reduced to Equation~\ref{eqn:s_object_transform_single}.
\begin{equation}\label{eqn:s_object_transform_single}
\theta_{i,i}^{k}=\left\{
\begin{array}{ll}
1 \textit{\hspace{2mm} if \hspace{2mm}} sign(\widehat{\mm}_{i,k}) = 1\\
0 \textit{\hspace{2mm} otherwise \hspace{2mm}}
\end{array}
\right.
\end{equation}
Hence, we can calculate the separability score $\theta_{i,i}$ very efficiently for each single feature, based on the transformed matrix $\widehat{\mm}$.
}
%==================================================================
\subsubsection{Early Termination} \label{ssec:earlyT}
Given a feature pair $(f_i,f_j)$,
we need to scan all the objects
to compute the separability score $\theta_{i,j}$\eat{ according to Equation~\ref{eqn:s_line}}.
However, since we only need to identify feature pairs in the \topk,
we can stop for each feature pair as soon as we can make that determination, without
scanning all objects; we call this the \earlyT module (Figure~\ref{fig:workflow}).
\stitle{High Level Idea.} We maintain an upper bound $\tau$
on the separability error $err_{i,j}$ of the \topk feature pairs.
The corresponding lower bound on the separability
score is then ($n-\tau$)\eat{,
where $n$ is the total number of objects $\hat{\oo}$}.
Given a feature pair $(f_i,f_j)$, we scan the object list and stop as soon as the number of incorrectly classified objects exceeds $\tau$; in that case, we can terminate early and prune this feature pair since it cannot be among the \topk.
Otherwise, $(f_i,f_j)$ is added to the \topk feature pair set and we update $\tau$ accordingly.
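A minimal sketch of this check for a single feature pair could be the following (we assume the pre-transformed matrix \texttt{Mhat} from the \trans module; the names are ours and objects lying exactly on the line are treated as mis-separated):
\begin{verbatim}
import numpy as np

def early_terminate_pair(Mhat, i, j, tau):
    """Scan objects for pair (f_i, f_j); stop once more than tau objects
    are incorrectly separated, since the pair cannot be in the top-k."""
    errors = 0
    n = Mhat.shape[1]
    for k in range(n):
        if np.sign(Mhat[i, k] + Mhat[j, k]) != 1:   # object k mis-separated
            errors += 1
            if errors > tau:
                return None          # pruned: separability error exceeds tau
    return n - errors                # theta_{i,j}; caller updates tau
\end{verbatim}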
\stitle{Enhancement by Object Ordering.}
Although \earlyT always has the potential to reduce the running time, its benefits are sensitive to the order in which objects are evaluated. Since we terminate as soon as we find more than $\tau$ incorrectly classified objects, we can improve the running time if we examine ``problematic'' objects, i.e., those unlikely to be correctly classified, relatively early.
For this\eat{To evaluate ``problematic'' objects earlier},
we order the objects in descending order of
the number of single features $f_i$ that incorrectly
classify the object $o_k$, i.e., $\widehat{\mm}_{i,k}\leq0$.
Thus, the first object evaluated is the one that is incorrectly
classified by the most single features. The benefit of this strategy is illustrated with an example in Figure~\ref{fig:earlyT}.
\eat{We study the impact of this strategy in Section~\ref{sec:exp}.}
\tr{\cb{note, 0's are treated as mis-labels and ties are broken randomly.}\agp{Probably not super important to clarify this.}}
\begin{figure}[t!]
\centering
\vspace{-2mm}
\subfigure[\earlyT Example]{\label{fig:earlyT}\includegraphics[height=.165\textwidth]{fig/earlyT.eps}}
\subfigure[\sampling Example]{\label{fig:sampling}\includegraphics[height=.165\textwidth]{fig/sampling.eps}}
% \includegraphics[width=0.8\linewidth]{fig/sampling.pdf}
\vspace{-5mm}
\caption{Optimization Module Examples. (a) When evaluating a feature pair with \earlyT module, the transformed $\widehat{\mm}$ scores are scanned left to right and each incorrectly classified object is marked in blue. Without object ordering (above), evaluation terminates after five checked objects. When objects are reordered by the most ``problematic'' (below), the feature pair is rejected after checking only the first two objects. (b) To calculate the \topthree feature pairs with \sampling, the confidence interval of $\theta_{i,j}$ is calculated for every feature pair evaluated on the sample set $\sss$ (above). The $3^{rd}$ interval lower bound $\zeta$ is obtained (red dotted line), and all feature pairs with a larger upper bound are designated as candidates for validation (blue intervals). The selected candidates (center box) are evaluated on the whole object set $\hat{\oo}$ to compute the exact $\theta_{i,j}$ and pick the \topthree (right box).}
\vspace{-5mm}
\label{fig:etermANDsampling}
\end{figure}
%==================================================================
\subsubsection{Sampling-based Estimation} \label{ssec:sampling}
One downside of the \earlyT module
is that the improvement in the running time is highly data-dependent.
Here, we propose a stochastic method, called \sampling,
that reduces the number of examined objects.
Instead of calculating $\theta_{i,j}$ over the
whole object set $\hat{\oo}$, \sampling works on a
sample set drawn from $\hat{\oo}$.
\tr{
\sinha{use $\theta_{i,j}^{\lhat}$ instead of $\theta_{i,j}$}
\cb{we introduced this simplification at the end of Section \ref{sec:metric}}\agp{yes, but you used it in 2.3.1.}
\silu{Maybe we can add one sentence here to remind the readers? "Recall that we use $\theta_{i,j}^{\lhat}$ and $\theta_{i,j}$ interchangeably throughout the paper."}
}
\stitle{High Level Idea.} \sampling primarily consists of two phases: {\em candidate generation} and {\em validation}
(Figure~\ref{fig:sampling}).
In phase one, we estimate the confidence interval of $\theta_{i,j}$
for each feature pair using a sampled set of objects and
generate the candidate feature pairs for full evaluation
based on where their confidence intervals lie.
If a feature pair's confidence interval overlaps with the
score range of the current \topk, then it is selected for full evaluation.
In phase two (lower half of Figure~\ref{fig:sampling}),
we evaluate only the feature pairs in the candidate set,
calculating $\theta_{i,j}$ over the whole object set, $\hat{\oo}$,
to obtain the final \topk feature pairs.
Unlike our previous optimizations, \sampling returns
an approximation of the \topk ranking feature pairs.
\eat{In Section~\ref{sec:exp_comp}, we study how the greatly
reduced running time of \sampling affects the accuracy of the \topk pairs.} %We will describe in detail in Section~\ref{sssec:generate} and \ref{sssec:validate}.
\stitle{Candidate Generation.} %\label{sssec:generate}
Let $\sss$ be a sample set drawn uniformly from $\hat{\oo}$. Given a feature pair $(f_i,f_j)$, let $\theta_{i,j}(\sss)$ be the number of correctly separated objects in $\sss$. \eat{Let $\tilde{\theta}_{i,j}$ be the estimated separability score of all objects based on the sample set $\sss$. First, }We can estimate $\tilde{\theta}_{i,j}$ from $\theta_{i,j}(\sss)$ using $\tilde{\theta}_{i,j} = \frac{\theta_{i,j}(\sss)}{|\sss|} \cdot n$\eat{Equation~\ref{eqn:est_score}} by assuming the ratio of correctly separated samples in $\sss$ is the same as that in $\hat{\oo}$.
Using Hoeffding's inequality~\cite{hoeffding1963probability}, by selecting $\Omega(\frac{1}{\epsilon^2})$ samples, $\theta_{i,j}$ lies in the confidence interval $[\tilde{\theta}_{i,j}-\epsilon n, \tilde{\theta}_{i,j}+\epsilon n]$ with high probability (details in Supplementary Note~\ref{app:sampling}). Since the sample size $|\sss|$ is independent of the number of objects, this module helps \genviz scale to datasets with large $n$.
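For intuition, the argument detailed in Supplementary Note~\ref{app:sampling} can be sketched as follows. Hoeffding's inequality bounds the deviation of the sampled fraction of correctly separated objects:
$$Pr\Big(\Big|\frac{\theta_{i,j}(\sss)}{|\sss|}-\frac{\theta_{i,j}}{n}\Big| \geq \epsilon\Big) \leq 2e^{-2|\sss|\epsilon^2},$$
so a sample of size $|\sss| \geq \frac{1}{2\epsilon^2}\log(\frac{2}{\delta})$ ensures $|\tilde{\theta}_{i,j}-\theta_{i,j}|\leq \epsilon n$ with probability at least $1-\delta$.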
\eat{
\begin{equation}\label{eqn:est_score}
\tilde{\theta}_{i,j} = \frac{\theta_{i,j}(\sss)}{|\sss|} \cdot n
\end{equation}}
\eat{
\noindent Next, we show that with a constant sample set size $|\sss|$, $|\theta_{i,j}-\tilde{\theta}_{i,j}|$ is bounded by a small value with high probability, using Hoeffding's inequality~\cite{hoeffding1963probability}. \eat{We formally quantize the sample set size in Theorem~\ref{them:est}.}
\begin{theorem}[Estimation Accuracy]\label{them:est}
By considering $\Omega(\frac{1}{\epsilon^2}\cdot \log(\frac{1}{\delta}))$ samples, with probability at least $1-\delta$, we have $|\frac{\theta_{i,j}(\sss)}{|\sss|}-\frac{\theta_{i,j}}{n}| \leq \epsilon$, i.e., $|\tilde{\theta}_{i,j}-\theta_{i,j}|\leq \epsilon n$.
\end{theorem}
\tr{\begin{proof}
Based on Hoeffding's inequality:
$$Pr(|\frac{\theta_{i,j}(\sss)}{|\sss|}-\frac{\theta_{i,j}}{n}| \geq \epsilon) \leq 2e^{-2|\sss|\epsilon^2}$$
Hence, when $|\sss| = \frac{1}{2\epsilon^2}\cdot \log(\frac{2}{\delta}) = \Omega(\frac{1}{\epsilon^2}\cdot \log(\frac{1}{\delta}))$, we have:
$$
Pr(|\frac{\theta_{i,j}(\sss)}{|\sss|}-\frac{\theta_{i,j}}{n}| \geq \fepsilon) \leq \delta \ \ \ \ \ \ \ \ \ \ \ \square
$$
\end{proof}
}
\noindent We can treat $\log(1/\delta)$ as a constant, e.g., by setting $\delta = 0.05$. Thus, Theorem~\ref{them:est} essentially states that with only $\Omega(\frac{1}{\epsilon^2})$ samples, with probability 95\%, the confidence interval for $\theta_{i,j}$ is $[\tilde{\theta}_{i,j}-\epsilon n, \tilde{\theta}_{i,j}+\epsilon n]$. Note that the sample size $|\sss|$ only depends on $\epsilon$ and is independent of the number of objects, $n$. Hence, the \sampling module helps \genviz scale to datasets with large $n$.
}
%For instance, when setting $\epsilon=0.05$, the sample size is fixed to be $400$.
\eat{With Theorem~\ref{them:est}, }Following the top half of Figure~\ref{fig:sampling}, we can first calculate the confidence interval of $\theta_{i,j}$ for each feature pair $(f_i,f_j)$. Next, we compute the lower bound of $\theta_{i,j}$ for the \topk feature pairs, denoted as $\zeta$ and shown by the red dotted line. Finally, we can prune away feature pairs whose upper bound is smaller than $\zeta$, keeping the candidate set $\cc$ of feature pairs depicted by blue intervals. These feature pairs $\cc$ will be further validated in phase two, i.e., candidate validation. Typically, $|\cc|$ will be orders of magnitude smaller than $m^2$, the size of the original search space over all feature pairs.
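A compact Python sketch of this candidate-generation step might look as follows (we assume a NumPy array \texttt{theta\_S} holding, for every feature pair, the number of correctly separated sampled objects, along with the sample size, $n$, $\epsilon$, and $K$; the names are ours):
\begin{verbatim}
import numpy as np

def generate_candidates(theta_S, S_size, n, eps, K):
    """Phase one: confidence intervals from the sample, then pruning.
    theta_S[p] = correctly separated sampled objects for pair p."""
    est = theta_S / S_size * n                   # estimated theta_{i,j}
    lower, upper = est - eps * n, est + eps * n  # confidence interval per pair
    # zeta: K-th largest lower bound, i.e., lower bound of the current top-K
    zeta = np.sort(lower)[-K]
    # Keep every pair whose interval overlaps the top-K score range
    return np.flatnonzero(upper >= zeta)         # candidate set C
\end{verbatim}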
%We will verify this later in our evaluations (Section~\ref{sec:exp}).
\stitle{Candidate Validation.} %\label{sssec:validate}
\eat{Since we only estimated the separability scores $\theta_{i,j}$ for our feature pair candidates in phase one, }We re-evaluate all of the candidates generated from phase one to produce our final feature pair ranking. This \eat{phase two }evaluation is performed using the whole object set $\hat{\oo}$ and the \topk feature pairs are reported (lower half of Figure~\ref{fig:sampling}).
\tr{\sinha{are we using early termination and object ordering in phase two?} \silu{no. We tried both and found that early termination is not helping most of time since 1) the feature pair candidates, which perform well in phase 1, are indeed the ones that are hard to prune and 2) \earlyT incurs some overhead itself.}
}
\eat{For each feature pair $(f_i,f_j)\in \cc$, we calculate the separability score $\theta_{i,j}$ and find the \topk feature pairs (lower half of Figure~\ref{fig:sampling}).}
\stitle{Enhancement by Candidate Ordering.}
In Section~\ref{ssec:earlyT} we proposed an enhancement
that allows us to terminate computation early by
manipulating the order of the objects;
here we similarly found a way to reduce the running
time by changing the order in which feature pair candidates
are validated in phase two.
Instead of directly validating each feature pair candidate,
we first order the candidates in descending order according to the upper bound of each candidate's confidence interval.
Then, we sequentially calculate the full separability
score $\theta_{i,j}$ for each feature pair, and update $\zeta$ correspondingly.
Recall that $\zeta$ is the current estimate of the lower bound of $\theta_{i,j}$ for the \topk feature pairs. Finally, we terminate our feature
pair validation when the next feature pair's upper bound smaller than the current value of $\zeta$ (Figure~\ref{fig:candidate_ordering}).
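The candidate-validation phase with this ordering can be sketched as follows (we assume \texttt{upper} holds the interval upper bounds from phase one and \texttt{exact\_score(p)} computes $\theta_{i,j}$ over all of $\hat{\oo}$ for candidate pair \texttt{p}; the names are ours):
\begin{verbatim}
import heapq

def validate_candidates(candidates, upper, exact_score, K):
    """Phase two with candidate ordering: validate candidates in decreasing
    order of their upper bounds, stopping once zeta exceeds them."""
    order = sorted(candidates, key=lambda p: upper[p], reverse=True)
    topk = []                                  # min-heap of exact scores
    for p in order:
        if len(topk) == K and upper[p] < topk[0][0]:
            break                              # zeta exceeds this upper bound
        score = exact_score(p)                 # theta_{i,j} over all labeled objects
        heapq.heappush(topk, (score, p))
        if len(topk) > K:
            heapq.heappop(topk)                # keep top-K; zeta = topk[0][0]
    return sorted(topk, reverse=True)
\end{verbatim}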
\begin{figure}[h]
\centering
\vspace{-5mm}
\includegraphics[width=0.9\linewidth]{fig/candidate_ordering.eps}
\vspace{-6mm}
\caption{Candidate Ordering Enhancement. (a) Feature pair candidates are sorted by the upper bounds of their confidence intervals (solid red boundary), and the lower bound of the \topthree feature pairs, i.e., $\zeta$, is set (red dotted line). (b,c,d) For each feature pair, we calculate $\theta_{i,j}$ (filled blue circle) using all objects and update $\zeta$ if necessary. Note that $\zeta$ is increased in (d) after evaluating the third feature pair and since $\zeta$ is larger than the upper bound of the fourth feature pair, candidate validation can terminate and return the top ranking pairs.}
\vspace{-5mm}
\label{fig:candidate_ordering}
\end{figure}
\tr{
\sinha{ some readers will be confused by higher values being on the left}}
%==================================================================
\subsubsection{Search Space Traversal} \label{ssec:traversal}
The optimizations discussed so far check fewer
than $n$ objects for each feature pair and reduce
the number of feature pairs
for full evaluation. Our \traversal module
aims to reduce the number of feature pairs
considered from $m^2$ to a smaller number.
Instead of examining every feature pair, we examine only a limited number of feature pairs, in an optimized traversal order. The number of examined feature pairs, $\chi$, determines a trade-off between efficiency and accuracy: checking fewer feature pairs results in faster running times, though at the cost of accuracy of the \topk.
\eat{The full search space for the feature pairs is the upper triangle in Figure~\ref{fig:traversal}(a), where each row represents $f_i$ and each column represents $f_j$.}
The order of the feature pairs must be determined carefully, and we propose two alternative orderings based on ranking single features by their separability scores $\theta_{i,i}$. The first traversal order, called \emph{horizontal traversal}, prioritizes feature pairs that contain at least one high-ranking single feature. The second, called \emph{vertical traversal}, prioritizes feature pairs in which both features rank highly on their own.
%\squishlist
%\item \emph{Horizontal traversal:} for each feature $f_i$, pair it with each other feature $f_j$, where $j\geq i$, to obtain $(f_i,f_j)$. Repeat for each $f_i$, $1 \leq i\leq m$.
%\item \emph{Vertical traversal:} for each feature $f_j$, pair it with each other feature $f_i$, where $i\leq j$, to obtain $(f_i,f_j)$. Repeat for each $f_j$, where $1 \leq j\leq m$.
%\item \emph{Diagonal traversal (Figure~\ref{fig:traversal}(c)):} for a fixed value $v$, pair $(f_i,f_{v-i})$, where $1\leq i<v$. Repeat it for each $v$, where $2 \leq v\leq 2m$.
%\squishend
%\noindent
See Supplementary Figure~\ref{appF:traversal} for more details and an example.
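A simple Python sketch of generators consistent with these two intuitions is shown below (we assume \texttt{ranked} is the list of feature indices sorted by their single-feature scores $\theta_{i,i}$, best first, and \texttt{chi} is the budget $\chi$; the names are ours, and the exact orders are defined in the Supplementary material):
\begin{verbatim}
def horizontal_traversal(ranked, chi):
    """Pair each high-ranking feature with every later-ranked feature:
    at least one good single feature per examined pair."""
    count = 0
    for a, fi in enumerate(ranked):
        for fj in ranked[a:]:
            if count >= chi:
                return
            yield fi, fj
            count += 1

def vertical_traversal(ranked, chi):
    """Pair features so that both members rank highly on their own."""
    count = 0
    for b in range(len(ranked)):
        for a in range(b + 1):
            if count >= chi:
                return
            yield ranked[a], ranked[b]
            count += 1
\end{verbatim}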
\eat{
\stitle{Enhancement by Feature Ordering.} We propose to first sort features based on single features' separability scores $\theta_{i,i}$. Single features with good separability individually are ranked higher. Let $\{f_1^{'} f_2^{'},\cdots,f_m^{'}\}$ be the sorted single-feature set. Next, we discuss the intuition behind the two different traversal order schemes. Consider partitioning the sorted single-feature list into three categories based on the single-feature separability score $\theta_{i,i}$: {\em good, average,} and {\em poor}.
\squishlist
\item \emph{Horizontal traversal:} the intuition is that there is at least one {\em good} or {\em average} single feature in each \topk feature pair $(f_i,f_j)$.
\item \emph{Vertical traversal:} the intuition is that both features in each \topk feature pair $(f_i,f_j)$ must be categorized as {\em good} or {\em average} individually.
%\item \emph{Diagonal traversal:}
\squishend
\noindent We examine the impact of these traversal schemes in our evaluation.
%The intuition behind horizontal traversal is that there is at least one {\em top} or {\em median} single features in the \topk feature pairs $(f_i,f_j)$; while the intuition behind horizontal traversal is that both features in \topk feature pairs $(f_i,f_j)$ must rank {\em top} or {\em median} by their own.
\begin{figure}[h]
\centering
\vspace{-5mm}
\includegraphics[width=0.7\linewidth]{fig/traversal.eps}
\vspace{-5mm}
\caption{\traversal Orderings. \genviz is able to examine the feature pair search space (a) horizontally or (b) vertically.}
\vspace{-5mm}
\label{fig:traversal}
\end{figure}
}
%Typically, $K$ is not large, e.g., {\sc Top-1000}, while $m$ is large, e.g., {\sc 20000}.
%\begin{example}[\traversal]
%Assume $m=2\times 10^4$. Initially, the number of possible feature pairs is roughly $\frac{m^2}{2}=2\times 10^8$. However, if we limit the number of considered feature pairs to $\chi=10^7$, we reduce our search space to $\frac{1}{20}$ of the total number of feature pairs. We order the single features by their individual separability scores. In horizontal traversal, only feature pairs with at least one individual feature ranked in the top 500 will be considered; while vertical traversal will consider only feature pairs with both individual features ranked better than 2000.
%the "worst" $f_i$ ranks around 500, while $f_j$ can be any feature in $\ff$. On the other hand, in vertical traversal, the "worst" $f_i$ (and $f_j$) ranks around 2000.
%\label{examp:traversal}
%\end{example}
%==================================================================
\section{Results}
\label{sec:exp}
In this section, we illustrate that \genviz
rapidly identifies meaningful, significant, and interesting
separating feature pairs in real biological datasets.
First, we describe the datasets and the algorithms used
in our evaluation.
Each algorithm that we evaluate
represents a combination of optimization modules
for ranking \topk feature pairs using our
Rocchio-based measure---we report the running time
and accuracy
of the algorithms.
Second, we compare the \topk feature pairs returned by \genviz with the corresponding \topk single features,
and examine their significance and support in existing publications.
Last, we present some sample visualizations to illustrate the separability of the object classes.
%Finally, we enumerate and visualize
% the \topk feature pairs returned by \genviz,
% compare them to the corresponding \topk single features,
% and examine their significance and
% support in existing publications.
%==================================================================
\subsection{Evaluation Setup}
\begin{table}[h]
\centering
\vspace{-5mm}
\small
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
& $|\ff|$=$m$ & $|\oo|$=$N$ & $|\sss|$ & $\chi$ & \# of $\widehat{\oo}$ & avg($|\oo_+|$) & avg($|\oo_-|$) \\
\hline
\msig & 19,912 & 22,209 & 400 & $10^7$ & 10 & 295 & 21,914 \\
\hline
\lincs & 22,268 & 98,061 & 400 & $10^7$ & 40 & 165 & 97,897 \\
\hline
\end{tabular}
\caption{\scriptsize Dataset Statistics. For each dataset, the number of features $m$, objects $N$, sample size $|\sss|$ used by \sampling module, feature pairs $\chi$ examined by \traversal module, number of object sets: \# of $\widehat{\oo}$, average positive set size: avg($|\oo_+|$), and average negative set size: avg($|\oo_-|$).}
\label{tbl:dataset}
\vspace{-18pt}
\end{table}
\begin{table*}[t]
\centering
\small
\begin{tabular}{|c|c|c|c|c||c|c|}
%\hline
\hline
& \earlyT & \sampling & Candidate Ordering & \traversal & Approximation & Complexity\\
\hline
\baseline & no & no & no & Any & no &$O(m^2n)$\\
\hline
%\early & yes & yes & no & no & no & Any & no & $O(m^2n)$\\
% \hline
\earlyOrder & yes & no & no & Any & no & $O(m^2n)$\\
\hline
\samp & no & yes & no & Any & yes (guarantee) & $O(mn+m^2|\sss|+|\cc|n)$\\
\hline
\sampOpt & no & yes & yes & Any & yes (guarantee) & $O(mn+m^2|\sss|+|\cc|n)$\\
\hline
\horiz & no & yes & yes & Horizontal & yes (heuristic) & $O(mn+\chi|\sss|+|\cc|n)$\\
\hline
\vertic & no & yes & yes & Vertical & yes (heuristic) & $O(mn+\chi|\sss|+|\cc|n)$ \\
\hline
\end{tabular}
\caption{Selected Algorithms Using Different Optimization Modules. All algorithms, including the \baseline, use \trans. In addition, \earlyT and \traversal are coupled with object ordering and feature ordering by default, respectively. For each algorithm (row), we show which optimization modules are employed, whether the algorithm returns an exact or an approximate answer, and the running time complexity for that combination. The term ``guarantee'' (``heuristic'', resp.) indicates that the returned answer comes with (without, resp.) a stochastic guarantee. In addition, $m$ and $n$ are the number of features and objects, $|\sss|$ is the sampled set size, $\chi$ is the limit on the number of feature pairs considered, and $|\cc|$ is the number of generated feature pair candidates.}
\label{tbl:alg}
\vspace{-18pt}
\end{table*}
\stitle{Datasets.} We consider datasets from two
biological applications (see Table~\ref{tbl:dataset}):
(a) in \msig,
we find gene annotations such as pathways and biological processes
that separate the differentially expressed genes
from the undisturbed genes in specific cancer studies;
(b) in \lincs,
we find genes whose expression levels
can distinguish experiments in which specific drug treatments were administered from others.
In \msig,
we are given a feature-object
matrix with genes as the objects
and gene properties as the features.
Rather than a 0/1 membership indicator, each value of this feature-object matrix indicates the strength of the relationship between the gene and the set of genes that have been annotated with the gene property.
Matrix values are calculated using random walks~\cite{blatti2016characterizing} on a heterogeneous network built from prior knowledge found in gene annotation and protein homology databases (see Supplementary Note~\ref{app:msig_matrix} for more details).
The positive genes for each dataset in \msig
are the set of differentially expressed genes (DEGs)
in a specific cancer study downloaded from the Molecular Signatures Database (MSigDB)~\cite{subramanian2005gene}.
Each of our tests is an application of
\genviz to such a dataset,
reporting pairs of properties
that separate DEGs of that cancer study (the ``positive'' set)
from all other genes (the ``negative'' set).
In \lincs, the feature-object matrix
contains expression values for different genes
(features) across many drug treatment experiments
(objects) conducted on the MCF7 cell line by the LINCS L1000 project~\cite{subramanian2017next}.
The values of the matrix are gene expression values as reported by the ``level-4'' imputed z-scores measured in the L1000 project.
In each dataset, the positive object set includes
multiple experiments that used the same drug, at varying dosages and for varying durations. We applied \genviz on each dataset
so as to find the top pairs of genes (feature pairs)
whose expression values separate the \lincs experiments
relating to a single drug from all other \lincs experiments.
Note that the average number of positive objects
in any dataset is far fewer than the average number
of negative objects. To address this imbalance,
we adjust $\theta_{i,j}^{\ell}$ \eat{Equation~\ref{eqn:s_line}}
to a weighted sum form\eat{ as shown below}: $\theta_{i,j}^{\ell}= \sum_{o_k\in{\oo_-}}{\theta_{i,j}^{\ell,k}}+\frac{|\oo_-|}{|\oo_+|}\cdot\sum_{o_k\in{\oo_+}}{\theta_{i,j}^{\ell,k}}$.
%For each of the two applications, the numbers of features $m$, objects $N$, sample size $|\sss|$, and the examined feature pairs $\chi$ are shown in Table~\ref{tbl:dataset}. We tested on 10 different datasets for \msig and 40 different datasets for \lincs. For each positive object set, we treat all of the remaining objects as the negative set and list the average number of positive and negative objects in Table~\ref{tbl:dataset}.
\eat{
\begin{equation}\label{eqn:s_line_weighted}
\hspace{3mm} \theta_{i,j}^{\ell}= \sum_{o_k\in{\oo_-}}{\theta_{i,j}^{\ell,k}}+\frac{|\oo_-|}{|\oo_+|}\cdot\sum_{o_k\in{\oo_+}}{\theta_{i,j}^{\ell,k}} %{\sf Separability \hspace{1mm} Score\hspace{1mm} with\hspace{1mm} Line\hspace{1mm} \ell:}
\end{equation}
}
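As a minimal sketch of this weighting (assuming the per-object contributions $\theta_{i,j}^{\ell,k}$ have already been computed by the Rocchio-based measure and are simply passed in), the imbalance adjustment amounts to the following:
\begin{verbatim}
import numpy as np

def weighted_separability(theta_per_object, is_positive):
    # theta_per_object[k] holds the contribution of object o_k for one
    # candidate separating line; is_positive[k] marks o_k in O_+.
    theta = np.asarray(theta_per_object, dtype=float)
    pos = np.asarray(is_positive, dtype=bool)
    weight = (~pos).sum() / pos.sum()      # |O_-| / |O_+|
    return theta[~pos].sum() + weight * theta[pos].sum()
\end{verbatim}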
%\stitle{Metrics.} We focus on two major metrics in our test: separability score and time. Specifically, each algorithm outputs $k$ feature pairs with highest separability score, i.e., \topk feature pairs, and we call the smallest separability score among the \topk feature pairs as \topk separability score.
%, where $m$ and $n$ are constants depending on the size of the feature-object matrix, $|\sss|$ and $\chi$ are the predetermined sampled set size and limit on the number of feature pairs considered, and $|\cc|$, the number of feature pair candidates is variable and is generated during the algorithm
\stitle{Algorithms.}
We evaluated six combinations of our optimization modules\tr{\sinha{"modules" is somewhat misleading since it suggests separation during combination, while in reality I believe a combination of modules can weave together two ideas to the point that they are no longer "modular". Maybe add a clarification here that what we call modules really are "strategies" for optimization. Or replace "modules" with "strategies" throughout. This is under the assumption that I am right about modularity being lost during some combinations.}\agp{good point. my suggestion here would be to simply add a caveat early on in the optimization section, which i've done}}
from Section~\ref{sec:opt}, listed in Table~\ref{tbl:alg}.
For our baseline, we use the algorithm
with only the matrix pre-transformation optimization module (\trans).
The rightmost column of Table~\ref{tbl:alg} shows the varying time complexity
of the algorithms. Consider the \horiz as an example.
First, \trans takes $O(mn)$ time.
\tr{\cb{ note, scoring single features takes O(mn) time and uses only transformation optimization.}}
Then, \traversal requires sorting the feature set, which takes $O(m\log m)$\tr{\footnote{Since $m\log m < nm$, we omit the term $m\log m$ from the overall complexity.}} time. Finally, with \sampling over $\chi$ feature pairs,
the running time is reduced from $O(m^2n)$ to $O(\chi|\sss|+|\cc|n)$, where the first and second terms represent the time for candidate generation and candidate validation, respectively. Note that $|\cc|$ is typically orders of magnitude smaller than $\chi$ in \horiz, as discussed in Section~\ref{ssec:sampling}.
Combinations of modules beyond the six reported were always
inferior to one of those shown, in the sense that they returned the same \topk feature pairs with a longer running time.
We implemented the algorithms in C++, and conducted the evaluations on a machine with 16 CPUs and 61.9 GB of RAM.
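As a rough back-of-the-envelope illustration with the \msig values from Table~\ref{tbl:dataset} ($m\approx 2\times 10^4$, $n\approx 2.2\times 10^4$, $|\sss|=400$, $\chi=10^7$), and ignoring constant factors, the \baseline bound is $m^2n\approx 8.8\times 10^{12}$ basic operations, whereas the \horiz bound is $mn+\chi|\sss|+|\cc|n\approx 4.4\times 10^8+4\times 10^9+|\cc|\cdot 2.2\times 10^4$; even with $|\cc|$ in the millions (the candidate counts are data dependent and are discussed below), this is roughly two orders of magnitude smaller, consistent with the measured speedups reported below.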
%==================================================================
\subsection{Comparison of Different Algorithms}
\label{sec:exp_comp}
In this section, we first
justify that the Rocchio-based measure
is a good proxy for the best possible separability score
computed by a brute-force method.
Then we compare the performance of the
algorithms in terms of the
running time and the separability of their \topthousand feature pairs.
%\begin{figure}[h]
%\centering %%% not \center
%\vspace{-5mm}
%\subfigure[Best Feature Pair Comparison]{\label{fig:brute_rocchio_ratio}\includegraphics[width=.235\textwidth]{fig/rocchio_brute_ratio.pdf}}
%\subfigure[Measure Comparison]{\label{fig:brute_rocchio_score}\includegraphics[width=.235\textwidth]{fig/rocchio_brute_score.pdf}}
%\vspace{-5mm}
%\caption{Comparison of Brute Force-based and Rocchio-based separability score. (a) For each of 10 datasets, we display the ratio of the true separability score between the best feature pair chosen by brute force and by the Rocchio-based method. (b) For each dataset, we display the ratio of the true separability score and the Rocchio-based separability score for the best feature pair selected using Rocchio-based method.}
%\vspace{-5mm}
%\label{fig:brute_rocchio}
%\end{figure}
\tr{
\sinha{maybe naming algorithms A0, A1-A5.}
\sinha{describe table as tree?}
}
\stitle{Accuracy of Rocchio-based approximation.} As discussed in Section~\ref{sec:metric}, a brute-force approach must consider $O(n^2)$ lines in order to find the best separating line $\ell^* \leftarrow \arg_\ell \max \{\theta_{i,j}^{\ell}\}$, for a time complexity of $O(n^2m^2)$ over all feature pairs. An alternative is to use the Rocchio-based representative separating line $L$, dramatically reducing the $O(n^2)$ candidate lines to $O(1)$. Since the brute-force method becomes computationally infeasible for datasets with large $n$, we compared the Rocchio-based measure to the brute-force-based measure using
specially defined small object sets, $\widehat{\oo}$, for the 10 datasets in \msig. For this comparison, the up-regulated genes in each \msig test were defined as the set of positive objects and the down-regulated genes as the set of negative objects, resulting in an average of 295 objects per comparison. We call the brute-force-based separability score the {\em true} separability score, since it examines all possible separating lines. We first find the best feature pair using the Rocchio-based measure and the brute-force-based measure separately (potentially different feature pairs) and then calculate the ratio of the true separability scores of the Rocchio versus the brute-force best feature pairs. We observe that the Rocchio-based method picks a best feature pair whose true separability score is close to that of the best pair picked by brute force, with the ratio of the two scores being better than 0.94 in all ten datasets (Supplementary Figure~\ref{appF:exp_sep}(a)). Second, for the best feature pairs identified by the Rocchio-based method for the ten datasets, we calculate the ratio of the Rocchio-based separability score to the brute-force-based separability score, and find this ratio to be greater than 0.96 on average (Supplementary Figure~\ref{appF:exp_sep}(b)).
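For intuition, the following minimal sketch scores a feature pair with a single Rocchio-style representative line, here simplified to the perpendicular bisector of the two class centroids and to an unweighted count of correctly separated objects (the measure used in the paper additionally handles class imbalance and uses the exact line construction of Section~\ref{sec:metric}):
\begin{verbatim}
import numpy as np

def rocchio_pair_score(x_i, x_j, is_positive):
    # x_i, x_j: values of features f_i and f_j across all objects.
    pts = np.column_stack([x_i, x_j]).astype(float)
    pos = np.asarray(is_positive, dtype=bool)
    c_pos, c_neg = pts[pos].mean(axis=0), pts[~pos].mean(axis=0)
    w = c_pos - c_neg                      # normal of the O(1) line
    b = w.dot((c_pos + c_neg) / 2.0)       # line through the midpoint
    predicted = pts.dot(w) > b             # nearest-centroid side
    return (predicted == pos).mean()       # fraction separated correctly
\end{verbatim}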
\stitle{Running Time.} Figure~\ref{fig:time} depicts the running times of our different selected algorithms. Each plotted box corresponds to one algorithm, representing the distribution of running times for finding the \topk feature pairs (by Rocchio score) for all datasets.
%(i.e., 10 gene set collections for \msig and 40 drug treatments for \lincs).
\begin{figure}[h]
\vspace{-5mm}
\centering %%% not \center
\subfigure[\msig]{\label{fig:msig_time}\includegraphics[width=.235\textwidth]{fig/msig_time.eps}}
\subfigure[\lincs]{\label{fig:lincs_time}\includegraphics[width=.235\textwidth]{fig/lincs_time.eps}}
\vspace{-5mm}
\caption{Running Time Comparison. A boxplot for each algorithm is shown, with the median running time printed in matching color above it. For each boxplot, the whiskers are set to $1.5\times$ the interquartile range, outliers are shown as red dots, and the average is marked with a black star.}
\vspace{-5mm}
\label{fig:time}
\end{figure}
First, let us compare the median running times of the different algorithms. For \msig, the \baseline takes more than 2 hours, \earlyOrder takes less than 1 hour, \samp and \sampOpt take around 6 and 5 minutes respectively, while \horiz and \vertic both take only around 1 minute. Overall, the optimizations reduce the running time by over {\bf $180\times$}. We next examine the effect of the different modules on the running time. {\em (a) \earlyT:} the \earlyT module achieves a $2\times$ speed-up, with the average number of checked objects (genes) reduced from $20K$ to $5K$ (Supplementary Table~\ref{appT:time}); {\em (b) \sampling:} the \sampling module reduces the running time dramatically, with a $20\times$ reduction from \baseline to \sampOpt, since on average only $2M$ candidates are generated from all possible $200M$ feature pairs (Supplementary Table~\ref{appT:time});
{\em (c) \traversal:} the \traversal-based algorithms \horiz and \vertic achieve an additional $6\times$ speed-up over \sampOpt by terminating after considering only $\chi=10^7$ feature pairs, approximately $\frac{1}{20}$ of all possible feature pairs. This speed-up of \horiz and \vertic approaches the limit set by the feature ordering overhead (around $6s$) and the time for the \trans module (around $8s$) (Supplementary Table~\ref{appT:time}). The improvement over \sampOpt is not larger because the candidate generation phase of \sampOpt already removes from full evaluation a vast number of the feature pairs that would also be ignored by \horiz and \vertic (Supplementary Table~\ref{appT:time}).
Next, consider the log-scale interquartile range (IQR) of the running times for the different selected algorithms (Figure~\ref{fig:time}). We observe that \earlyOrder has the largest interquartile range, indicating that the \earlyT module, which tries to reduce the number of objects evaluated for each feature pair, is very dependent on the object set and feature values. As we discussed in Section~\ref{ssec:earlyT}, \earlyT has no guarantee on improving the running time. In fact, the algorithm can occasionally be worse than the \baseline as shown in Figure~\ref{fig:lincs_time} because \earlyT incurs additional overhead for checking the criteria for pruning and early termination when scanning the object list for each feature pair.
Similar results for \lincs are shown in Figure~\ref{fig:lincs_time} (see Supplementary Note~\ref{app:lincs}).
\eat{Similar results can be found in Figure~\ref{fig:lincs_time} for \lincs. Here, we observe over $400\times$ average decrease in the running time of finding the \topk feature pairs that separate the \lincs experiments of a single perturbagen from others. The greatest speedup comes with adding the \sampling module, where only $100K$ feature pair candidates, i.e., $|\cc|$, are checked out of all $250M$ feature pairs (\timetbl). For the selected algorithms with best running times, \horiz and \vertic, the pre-transformation and feature ordering overhead account for an average of $45+35=80s$ of the overall 104 and 94 median seconds respectively.}
% (b) the pruning of feature pairs in candidate generation phase is more powerful in \sampOpt, since on average the feature pairs considered in \horiz and \vertic is better than that in \sampOpt intuitively.
\stitle{Separability Quality.} In Supplementary Figure~\ref{appF:exp_sep}(a), we found the accuracy of the baseline method, which computes the Rocchio-based estimate of the \topk feature pairs, to be high. The \earlyT module is deterministic and produces the same \topk feature pairs as the baseline method, only with optimized computation. The \sampling module, on the other hand, is stochastic and can only provide an approximation of the \topk feature pair ranking. Finally, the \traversal module is heuristic and may output \topk feature pairs that are very different from the ranking produced by the \baseline algorithm.
Since \baseline returns the true Rocchio-based separability score of each feature pair, we measure the quality of each selected algorithm by counting the number of feature pairs it has in common with the \tophundred returned by the \baseline. Figure~\ref{fig:separability} shows this separability quality comparison.
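Concretely, this quality metric can be computed as below (a small illustrative helper, not taken from our implementation; feature pairs are treated as unordered):
\begin{verbatim}
def topk_overlap(baseline_pairs, algo_pairs, k=100):
    # Count how many of the algorithm's top-k pairs also appear in the
    # Baseline's top-k pairs (pairs treated as unordered sets).
    base = {frozenset(p) for p in baseline_pairs[:k]}
    algo = {frozenset(p) for p in algo_pairs[:k]}
    return len(base & algo)
\end{verbatim}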
\begin{figure}[h]
\centering %%% not \center
\vspace{-5mm}
\subfigure[\msig]{\label{fig:msig_separability}\includegraphics[width=.235\textwidth]{fig/msig_accuracy.eps}}
\subfigure[\lincs]{\label{fig:lincs_separability}\includegraphics[width=.235\textwidth]{fig/lincs_accuracy.eps}}
\vspace{-5mm}
\caption{Separability Quality Comparison. Boxplots in the style of Figure~\ref{fig:time}, comparing for each algorithm the number of its returned feature pairs that appear among the 100 best feature pairs of the \baseline.}
\vspace{-5mm}
\label{fig:separability}
\end{figure}
Let us first focus on \msig. \earlyOrder, as expected, has exactly the same separability quality as the \baseline. We also observe that the \samp and \sampOpt rankings are nearly identical to the \tophundred feature pairs of the \baseline, owing to the probabilistic guarantee described in Supplementary Note~\ref{app:sampling}. The \horiz and \vertic algorithms output a median of 92 and 48 feature pairs in common with the \baseline, respectively, because of the heuristic \traversal module. In the \msig results, \horiz performs much better than \vertic, with a much higher median and a much narrower interquartile range, as shown in Figure~\ref{fig:msig_separability}. This suggests, as we hypothesized, that interesting separating feature pairs exist outside the combinations of top-ranked single features considered by \vertic. We repeated this quality analysis for \lincs and found that the \sampling-based algorithms returned identical \tophundred feature pairs for all 40 datasets. The quality of the \traversal-based algorithms was again lower, though the gap between \horiz and \vertic was not as large as for \msig.
\stitle{Takeaways.} If accuracy is paramount, we recommend \sampOpt; if running time is paramount, we recommend \horiz.
%The p-value from Fisher's exact test represents the significance of deviation from the null hypothesis, i.e., the predicted label is independent of the real label. This significance test is frequently used to characterize gene sets \needcite{}.
%==================================================================
\subsection{Feature Pair vs. Single Feature}\label{sec:FPvSF}
In this section, we quantify the statistical significance of the top-ranking results of the selected algorithms. We show that we often find separating feature pairs that are more significant than the best single separating feature. To assess the significance of a separating feature or feature pair, we first calculate the p-value using the one-sided Fisher's exact test on a $2\times2$ contingency table. This contingency table is constructed with the rows being the true positive and negative labels, the columns being the predicted positive and negative labels, and the values being the number of objects in each table cell. Using the Fisher's exact test p-value, we show that feature pairs can provide better separability than single features, i.e., (a) feature pairs can have stronger p-values than the corresponding individual features even after appropriate multiple-hypothesis correction, and (b) there exist highly ranked pairs of features that rank poorly on their own as single features.
\stitle{Single Feature.} Finding the \topk single features is a special case of finding feature pairs, obtained by setting $i=j$. For each single feature obtained, we compute the p-value with Fisher's exact test, denoted $pval$. Next, we define the Bonferroni-corrected p-value as $corrected\_pval= pval\times m \times n$, since there are $m \times n$ possible hypotheses, one for each possible single feature and separating line. We say a selected feature is significant if its corrected p-value is no larger than the threshold $10^{-5}$, i.e., $-\log_{10} (corrected\_pval)\geq5$. In Figure~\ref{fig:SF_FP}, we plot the distribution of the corrected p-values of the \tophundred single features reported for each dataset in \msig and \lincs. We observe that 10 out of 10 datasets in \msig and 32 out of 40 datasets in \lincs have at least one significant single feature, and we focus on these datasets for further analysis. We observe very small corrected p-values, $\leq 10^{-50}$, in the left part of Figures~\ref{fig:msig_singleF} and \ref{fig:lincs_singleF}, indicating that for several datasets, single features are sufficient to separate the object classes well.
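As a minimal sketch of this computation (using SciPy's \texttt{fisher\_exact} and assuming boolean vectors of true and predicted labels over all objects):
\begin{verbatim}
import numpy as np
from scipy.stats import fisher_exact

def corrected_single_feature_pval(true_pos, pred_pos, m, n):
    t = np.asarray(true_pos, dtype=bool)
    p = np.asarray(pred_pos, dtype=bool)
    # 2x2 contingency table: rows = true labels, columns = predictions.
    table = [[np.sum(t & p),  np.sum(t & ~p)],
             [np.sum(~t & p), np.sum(~t & ~p)]]
    _, pval = fisher_exact(table, alternative="greater")  # one-sided
    return pval * m * n   # Bonferroni correction over m*n hypotheses
\end{verbatim}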
\begin{figure}[h]
\centering %%% not \center
\vspace{-5mm}
\subfigure[\msig]{\label{fig:msig_singleF}\includegraphics[width=.16\textwidth]{fig/msig_SF_FP.eps}}
\subfigure[\lincs]{\label{fig:lincs_singleF}\includegraphics[width=.31\textwidth]{fig/lincs_SF_FP.eps}}
\vspace{-5mm}
\caption{Corrected P-value Distributions: Single Features vs. Feature Pairs. For each dataset (x-axis), we show the significance ($-\log_{10} (corrected\_pval)$) of the \tophundred best single features (grey dots) and feature pairs (blue dots) for the (a) \msig and (b) \lincs datasets. Datasets are ordered by their best corrected single-feature p-value; datasets where no single feature has a corrected p-value better than $10^{-5}$ are discarded.}
\vspace{-5mm}
\label{fig:SF_FP}
\end{figure}
%
%This shows the effectiveness of our Rocchio-based
\stitle{Feature Pair.} We next build the contingency tables and calculate the p-values for the \topk feature pairs. To correct for the $m^2$ possible feature pairs and the $n^2$ possible ways to choose the separating line for each feature pair, we apply a Bonferroni p-value correction to produce $corrected\_pval = pval\times m^2 \times n^2$. We plot the distribution of the corrected p-values for the \topk feature pairs in Figure~\ref{fig:SF_FP}. Once again, the threshold for defining a significant feature pair is set to $10^{-5}$. We find that 10 out of 10 datasets in \msig and 27 out of the 32 selected datasets in \lincs have at least one significant feature pair by this metric. Visual comparison of the \tophundred single features to the \tophundred feature pairs (Figure~\ref{fig:SF_FP}) per dataset reveals several datasets where the
corrected p-values of the feature pairs are more significant than those of the best single features, even after accounting for the larger search space. Admittedly, this is not always the case: for five \lincs datasets, no feature pair was found to be significant at $corrected\_pval \leq 10^{-5}$, while at least one single feature did meet this threshold. Overall, this analysis suggests that rapid discovery of top feature pairs may identify more significant patterns in a given dataset than a traditional single-feature analysis does.
%, we observe that the corrected p-values of the feature pairs are often more significant and more tightly \sinha{this is not necessarily convincing. Could be fp's that share a feature.} distributed than the single feature corrected p-values. This indicates
%that the best feature pairs have better separability quality than the top ranking single features even after accounting for the larger search space. \sinha{What about the \lincs datasets for which no significant feature pair comes out?}
In the following, we further illustrate that feature pairs can also provide better and newer insights compared to single features.
%\begin{figure}[h]
%\centering %%% not \center
%\vspace{-1mm}
%\subfigure[\msig]{\label{fig:msig_histogram_diff}\includegraphics[width=.235\textwidth]{fig/histogram_msig_diff_pval.pdf}}
%\subfigure[\lincs]{\label{fig:lincs_histogram_diff}\includegraphics[width=.235\textwidth]{fig/histogram_lincs_diff_pval.pdf}}
%\vspace{-5mm}
%\caption{Histogram of $improv\_quot$. For the \toptwenty feature pairs from all runs from the (a) \msig and (b) \lincs datasets, distribution of the improvement of the feature pair significance over the corresponding single feature significance. The red line shows the significance threshold of 5.}
%\vspace{-3mm}
%\label{fig:histogram_diff}
%\end{figure}
%Remember that the correction factors for single feature and feature pair are different: $mn$ vs. $m^2n^2$.
\stitle{Improvement from Single Feature to Feature Pair.} Having computed the corrected p-value for each single feature and feature pair in the \tophundred for our datasets, we now examine the improvement of each feature pair over its two corresponding single features in terms of p-value. For each feature pair $(f_i,f_j)$, we define the improvement quotient as the ratio between the corrected p-value of $(f_i,f_j)$ and the better of the corrected p-values of $f_i$ and $f_j$, i.e., $improv\_quot = \frac{corrected\_pval(f_i,f_j)}{\min(corrected\_pval(f_i),corrected\_pval(f_j))}$. We examined the $improv\_quot$ only for the \toptwenty feature pairs for each of the 10 runs in \msig and 32 runs in \lincs. We found that, on average across these datasets, 9.3 of the \toptwenty feature pairs in \msig and 8 of the \toptwenty feature pairs in \lincs are more significant than their corresponding single features ($-\log_{10} (improv\_quot) > 5$). The distribution of the $improv\_quot$ is plotted in Supplementary Figure~\ref{appF_exp_hist}. Overall, these histograms show that some feature pairs improve on single features in terms of separability significance. Next, we explore these improved feature pairs more carefully, commenting on their redundancy, reliability, and relevance.
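A minimal sketch of this comparison, given the raw Fisher's exact test p-values and the correction factors defined above:
\begin{verbatim}
def improvement_quotient(pair_pval, fi_pval, fj_pval, m, n):
    # Corrected p-values as defined in the text: m^2 * n^2 hypotheses
    # for feature pairs, m * n hypotheses for single features.
    pair_corrected = pair_pval * m**2 * n**2
    best_single_corrected = min(fi_pval * m * n, fj_pval * m * n)
    # -log10 of this quotient being positive means the pair improves
    # on its constituent single features.
    return pair_corrected / best_single_corrected
\end{verbatim}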
\stitle{New Insights from Feature Pairs.} To assess the quality of the top-ranking feature pairs, we focused on the \lincs dataset, where the objects are experimental treatments on the MCF7 breast cancer cell line with the same {\em drug} and the features are expression values for different {\em genes}. For the evaluations above, we used object sets for the 40 drugs with the largest number of LINCS experiments. For the following analysis, we refine this list to commonly used drugs with at least 60 LINCS experiments on the MCF7 cell line. These nine drugs are vorinostat, trichostatin, estradiol, tamoxifen, doxorubicin, gemcitabine, daunorubicin, idarubicin, and pravastatin. For each chosen drug, we ran the \sampOpt algorithm of \genviz to rank the \topthousand feature (gene) pairs for separating the LINCS experiments of the drug from all other MCF7 experiments.
For all drugs except pravastatin, all of the \topthousand ranked feature pairs were found to be significant ($-\log_{10} (corrected\_pval) > 5$). The average log significance of the \topthousand feature pairs correlated with the number of experiments ($r=0.91$) in the drug's positive object set (Table~\ref{tbl:selected_fps})\eat{, as expected}. As described in Section~\ref{sec:FPvSF}, we are especially interested in feature pairs whose corrected p-value is better than the corrected p-values of their corresponding single features ($-\log_{10} (improv\_quot) > 0$). Among the \topthousand pairs of these evaluation drug sets, we found 1070 such ``improved'' feature pairs with greater separability than their constituent single features. One drug, trichostatin, had especially strong single features and showed no feature pairs that significantly improved on them. The remaining seven drugs, however, benefited from the feature-pair analysis, yielding between 9 (tamoxifen) and 369 (doxorubicin) improved feature pairs (Table~\ref{tbl:selected_fps}).
\begin{table}[t!]
\centering
\small
\begin{tabular}{|l|c|c|c|c|}
\hline
Drug & NumExprs & Avg Signif & Top1000 Signif & Top1000 Improved\\
\hline
vorinostat & 904 & 235.5 & 1000 & 287\\
\hline
trichostatin & 689 & 277.1 & 1000 & 0\\
\hline
estradiol & 325 & 166.8 & 1000 & 203\\
\hline
tamoxifen & 122 & 105.8 & 1000 & 9\\
\hline