%!TEX root=separability.tex
%====================================================================
\section{Introduction}
\label{sec:intro}
A common approach to discovery in biology
is to construct experiments or analyses that directly
contrast two specific classes of biological objects.
Examples of this include examining patient samples
contrasting tumor versus normal tissue~\cite{tang2017gepia},
studying the differences in molecular effects
of two competing drug treatments~\cite{lamb2006connectivity},
or characterizing differentially expressed genes
versus genes with unaltered gene expression
in a carefully designed experiment~\cite{liberzon2015molecular}.
As is often the case with genomics, these
biological objects (e.g., tissue samples or drug experiments)
are frequently represented by high dimensional
numeric feature vectors (e.g., tens of thousands of transcript abundance measurements).
To understand the mechanisms that determine
these object classes,
researchers often employ statistical and
machine learning tools to identify a
manageable subset of features that
accentuate or explain the differences between them.
We term this the {\em separability problem}.
Often, researchers address this problem
by ranking each individual feature
using statistical tests~\cite{dudoit2002statistical}
and univariate classifiers~\cite{lai2006comparison}.
These approaches scale linearly in the number of features,
but do not offer insight into the interplay
between important features that can provide a better
characterization of what
distinguishes the two object classes.
Alternatively, larger subsets of separating
features can be identified by more complex
machine learning approaches, such as multivariate
regression with LASSO regularization~\cite{peng2010regularized}
or pattern mining from random forest models~\cite{breiman2001random}.
However, running such methods on datasets with
tens of thousands of features is extremely time-intensive.
Motivated by these observations,
we sought to build a tool that strikes a balance between the
two above-mentioned ``extreme''
strategies, enabling us to retrieve more than just the best {\em singular} features, and to do so in an {\em interactive} manner.
We present \genviz,
an interactive data exploration tool to address the separability problem.
Given a matrix of values for object features
and two labeled object sets, \genviz rapidly identifies
the top ranking pairs of features
that most clearly separate the objects of
the different classes.
We show that feature pairs identified by \genviz often
discriminate between the object classes more significantly than the corresponding
best-ranking individual features, even after accounting for the larger search space.
Thus, by focusing on separating feature pairs, \genviz
offers researchers the ability to gain additional
insight beyond singular features, without the prolonged duration
needed to train a complex machine learning model.
Moreover, identified feature pairs can be effortlessly
visualized in two-dimensional graphical interfaces,
leading to easy interpretation of
the interplay between the features and the object classes.
\begin{figure*}[t]
\centering
\includegraphics[width=0.9\linewidth]{fig/workflow2.eps}
\vspace{-15pt}
\caption{\genviz Workflow.
Given (left) a feature-object matrix and green positive and red negative class labels on the objects, \genviz (center) evaluates all pairs of features using several optimizations to identify (right) the top feature pair and its corresponding visualization that best separates the object classes.}
\label{fig:workflow}
\vspace{-15pt}
\end{figure*}
Quickly identifying top feature pairs
in typical datasets
is a challenging task due to some fundamental limitations:
first, the number of feature pairs can be large;
second, the feature-object matrix must be examined
multiple times to evaluate the separability of
each feature pair;
and third, we cannot compute the result in advance, since
object sets may be provided on-the-fly.
For example, if objects corresponded to experimental
conditions, and features were gene expression measurements,
having about $20K$ gene features
means that we need to evaluate nearly $200M$ feature pairs,
and for each feature pair we need to at least scan all the objects, if using a na\"ive approach.
Therefore, sophisticated methods are required
to make the task of ranking feature pairs interactive.
In developing \genviz,
we identified a measure to score the separability
of a given feature pair for two object classes.
This measure is conducive to
optimization techniques for rapid ranking of feature
pairs, while providing an intuitive way
to interpret
the resulting visualizations.
Satisfying both of these requirements
enables \genviz to support interactive separability analysis
in a manner that promotes the generation and prioritization
of hypotheses for further investigation.
Using this separability measure, we develop a suite of novel
optimizations to make \genviz interactive for realistic data set sizes,
targeting: {\em (a)} elimination of repeated computation;
{\em (b)} pruning feature pairs during early execution;
{\em (c)} sampling to further reduce running time;
and {\em (d)} cleverly traversing the search space
of feature pairs for improved efficiency.
\tr{While similar optimizations have been used in other settings,
e.g., sampling for approximate computation\needcite{},
the techniques need to be adapted for the separability problem.}
We applied \genviz to two genomic datasets.
In one, called \lincs, we find pairs of genes
whose expression discriminates between perturbagen experiments
involving different drug treatments, and in the other, called \msig,
we find pairs of annotations (such as pathway membership)
that separate differentially expressed
cancer genes from others.
Both of these applications are on
objects with high-dimensional feature vectors,
so it is computationally expensive to score the separability
for all possible feature pairs.
With the carefully designed
separability metric of \genviz and its suite of sophisticated optimizations
that accelerates evaluation,
we are able to {\emph{accurately return the
highest ranking separating
feature pairs for both datasets within two minutes}}.
This reflects a \emph{180X} and \emph{400X} speedup
over a competitive baseline for the \msig and \lincs data sets (respectively).
With such speedups, \genviz enables researchers to quickly explore their data,
identify the strongest, most compelling features,
and form hypotheses about the interplay between them and with the object classes.
It also allows researchers to interactively investigate
different definitions of the object classes on the fly,
explore alternative hypotheses,
and set up more in-depth,
longer machine learning-based analyses
that build upon the \genviz results.
We further performed an in-depth analysis
for nine distinct drug treatments in the \lincs dataset and
found 1070 feature (gene) pairs that had significant separability scores and outperformed
their corresponding single features. These gene pairs were enriched
in literature support for known relationships between the genes and the drug,
as well as known interactions between the genes themselves.
%==================================================================
\section{Methods}
\label{sec:method}
We begin by formally defining the {\em separability} problem,
introducing our separability metric,
and finally detailing optimizations that enable
the rapid identification of the best separating feature pairs.
%==================================================================
\subsection{Problem Definition}\label{sec:prob}
Let $\mm$ be a feature-object matrix of size $m\times N$,
where each row is a feature and each column is an
object as shown in Figure~\ref{fig:workflow}.
One example is a feature-object matrix in which
each object corresponds to a tissue sample
from a cancer patient and each feature corresponds to a gene,
so that the $(i,j)^{th}$ entry represents the
expression level of the $i^{th}$ gene in the $j^{th}$
tissue sample.
We denote the $m$ features as $\ff=\{f_1,f_2,\cdots,f_m\}$
and $N$ objects as $\oo=\{o_1,o_2,\cdots,o_N\}$.
Each entry $\mm_{i,j}$ in $\mm$ corresponds
to the value of feature $f_i$ for object $o_j$ as illustrated in Figure~\ref{fig:workflow}.
We are also given two non-overlapping sets of objects,
one with a positive label, $\oo_+$ and the other with a negative label, $\oo_-$.
\eat{Both sets are disjoint subsets of all objects,
i.e., $\oo_+, \oo_-\subset \oo$, and $\oo_+\cap \oo_- = \emptyset$.}
In our example, tumor samples, $\oo_+$,
may be assigned the positive label,
and the healthy tissue samples, $\oo_-$,
the negative label.
The number of labeled objects, $n$, is equal to $|\widehat{\oo}|$ where $\widehat{\oo}=\oo_+\cup \oo_-$. \eat{$\widehat{\oo}$ be the union of positive and negative objects; $n$ is the total number of labeled objects, i.e., } Also, let $l_k$ be the label of object $o_k\in \widehat{\oo}$, i.e., $l_k=1$ if $o_k$ is positive and $l_k=-1$ if $o_k$ is negative.
\eat{Now, given $\mm$ and sets
$\oo_+$ and $\oo_-$ as depicted in Figure~\ref{fig:workflow}(a),}
\genviz aims to find feature pairs that
best separate the objects
in $\oo_+$ from those in $\oo_-$ using only those features,
and then output a visualization that demonstrates the separability
(see Figure~\ref{fig:workflow}).
(We will define the metric for separability subsequently.)
A feature pair that leads to a good ``visual'' separation
between the positive and the negative sets may be able
to explain or characterize their differences via an interesting,
non-trivial relationship among the features.
The overall workflow is depicted in Figure~\ref{fig:workflow}.
% Since we intend \genviz to be an interactive data exploration tool to be employed before more time-consuming machine learning methods, we allow for minor sacrifices to its accuracy in exchange for substantial improvements in its running time.
We now formally define the separability problem.
\begin{formulation}[Separability]\label{prob:separability}
Given a feature-object matrix $\mm$ and two labeled object sets $(\oo_+,\oo_-)$, identify the \topk feature pairs $(f_i,f_j)$ that separate $\oo_+$ from $\oo_-$ based on a given separability metric.
\end{formulation}
\noindent %Here a separability metric is a scoring function for any given feature pair, and \topk feature pairs refer to the $k$ feature pairs with the highest score under the separability metric.
We will describe our separability metric in Section~\ref{sec:metric}, and then discuss optimization techniques in Section~\ref{sec:opt}. The notation used in the description of the method is summarized in Supplementary Table~\ref{appT:notation}.
%==================================================================
\subsection{Separability Metric}\label{sec:metric}
Given a feature pair $(f_i,f_j)$ as axes, we can visualize the object sets $\oo_+$ and $\oo_-$ in a 2-D space, where each object corresponds to a point whose x-value and y-value are the object's values on features $f_i$ and $f_j$ respectively. A desirable (i.e., both interesting and interpretable) visualization would be one in which the objects are {\em linearly separated}, defined as follows. Two sets of objects, i.e., $\oo_+$ and $\oo_-$, are said to be \emph{linearly separable}~\cite{medin1981linear} if there exists at least one straight line such that $\oo_+$ and $\oo_-$ lie on opposite sides of it.
We focus on metrics that capture this linear separation, since it corresponds to an intuitive 2-D visualization.
\eat{Given a feature pair $(f_i,f_j)$, we can represent a line $\ell$ as $w_i\cdot x + w_j\cdot y +w_0 =0$, where $x$ and $y$ represent an object's value on feature $f_i$ and $f_j$ respectively, and $w_0$, $w_i$ and $w_j$ are coefficients, where $w_j>0$.
\eat{
\begin{equation}\label{eqn:line}
\begin{split}
\ell: \hspace{4mm} w_i\cdot x + w_j\cdot y +w_0 =0
\end{split}
\end{equation}
}
}
Given a feature pair $(f_i,f_j)$ and a line $\ell$, we can predict the label of an object $o_k$, denoted as $\eta_{i,j}^{\ell,k}$, using Equation~\ref{eqn:est_label} below, where $w_0$, $w_i$ and $w_j$ are coefficients of $\ell$ and $w_j>0$:
%based on the separating line in Equation~\ref{eqn:line}. Let $\eta_{i,j}^{\ell,k}$ be the predicted label of $o_k$, calculated using Equation~\ref{eqn:est_label} below:
\begin{equation}\label{eqn:est_label}
{\sf Predicted \hspace{1mm} Label:}\hspace{3mm} \eta_{i,j}^{\ell,k}=sign(w_i\cdot \mm_{i,k} + w_j\cdot \mm_{j,k} +w_0)
\end{equation}
\noindent If $o_k$ lies above the line $\ell$, i.e., $o_k$ has a higher y-value than the point on $\ell$ with the same x-value as $o_k$, then $\eta_{i,j}^{\ell,k}=1$; otherwise, $\eta_{i,j}^{\ell,k}=-1$. Let $\theta_{i,j}^{\ell,k}$ be the indicator variable denoting whether \eat{an object $o_k$ is correctly separated (i.e.,} the sign of the predicted label matches the real label $l_k$\eat{)}: if $\eta_{i,j}^{\ell,k}\cdot l_k = 1$, then $\theta_{i,j}^{\ell,k} = 1$; otherwise, $\theta_{i,j}^{\ell,k} = 0$.
\eat{
\begin{equation}\label{eqn:s_object}
\vspace{-5mm}
\small
\theta_{i,j}^{\ell,k}=\left\{
\begin{array}{ll}
1 \textit{\hspace{2mm} if \hspace{2mm}} \eta_{i,j}^{\ell,k}\cdot l_k = 1\\
0 \textit{\hspace{2mm} otherwise \hspace{2mm}}
\end{array}
\right.
\end{equation}
}
\eat{\noindent If there is a line $\ell$ such that for every object $o_k\in \widehat{\oo}$, the object is correctly separated (i.e., $\theta_{i,j}^{\ell,k}$ = 1), then we say $\oo_+$ and $\oo_-$ are linearly separable.}
% \begin{equation}\label{eqn:linear}
% \eta_{i,j}^{\ell,k} = \left\{
% \begin{array}{ll}
% 1 \textit{\hspace{2mm} if \hspace{2mm}} o_k\in \oo_+, \textit{ i.e., } l_k=1 \\
% -1 \textit{\hspace{2mm} if \hspace{2mm}} o_k\in \oo_-, \textit{ i.e., } l_k=-1
% \end{array}
% \right.
% \end{equation}
\genviz's separability metric captures \eat{For \genviz, we rank feature pairs by a separability metric based on }{\em how well the objects in the feature pair's 2-D visualization can be linearly separated}, formally defined next.
%The basic intuition is that the users can easily recognize and interpret linear separability in a 2-D visualization. Thus if a visualization is perfectly linearly separated, it should have high score in our proposed separability metric; otherwise, we would find the best separating line $\ell$ in the visualization and report the largest number of correctly separated objects as the separability score.
Given a feature pair $(f_i, f_j)$ and a line $\ell$, the separability score of the line (denoted $\theta_{i, j}^\ell$) is defined as the sum of the indicators ($\theta_{i,j}^{\ell,k}$) for all objects: $\theta_{i,j}^{\ell}= \sum_{k}{\theta_{i,j}^{\ell,k}}$.
\eat{
as shown in Equation~\ref{eqn:s_line}:
\begin{equation}\label{eqn:s_line}
\hspace{3mm} \theta_{i,j}^{\ell}= \sum_{k}{\theta_{i,j}^{\ell,k}}
\end{equation}
}
\noindent Figure~\ref{fig:brute_force} shows separability scores $\theta_{i, j}^\ell$ for different separating lines. For example, the separating line with $\theta_{i, j}^\ell=12$ correctly separates six green points and six red points. The final separability score for a feature pair $(f_i,f_j)$ (denoted as $\theta_{i, j}$) is defined as the best separability score $\theta_{i, j}^{\ell}$ among all possible lines $\ell$. Accordingly, we define the overall separability error of the feature pair as $err_{i,j}=n-\theta_{i, j}$.
%As shown in Equation~\ref{eqn:s_viz},
%\begin{equation}\label{eqn:s_viz}
%\hspace{3mm} \theta_{i,j}= \max_{\ell}\{\theta_{i,j}^{\ell}\} %{\sf Separability \hspace{1mm} %Score:}
%\end{equation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%NEED TOGGLE%%%%%%%%%%%%%%%%%%%%%%
% \begin{figure}[h]
% \centering
% \begin{subfigure}{0.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/metrics.pdf}
% \vspace{-5mm}
% \caption{Brute Force}%{$\theta_{i,j}^{\ell}$ with Different Lines[e]}
% \label{fig:brute_force}
% \end{subfigure}
% \begin{subfigure}{.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/rocchio.pdf}
% \vspace{-5mm}
% \caption{Rocchio-Based}%{$\theta_{i,j}^{\lhat}$ with Representative Line[f]}
% \label{fig:rocchio}
% \end{subfigure}
% \caption{Different methods to Calculate Separability Score $\theta_{i,j}$}
% \label{fig:metric}
% \end{figure}
\stitle{Brute Force Calculation of $\theta_{i,j}$.} As suggested in Figure~\ref{fig:brute_force}, the simplest way to calculate $\theta_{i,j}$ is to first enumerate all possible separating lines $\ell$ and calculate $\theta_{i,j}^\ell$ for each of them. \eat{This is infeasible as there are an infinite number of possible lines. However, }We can easily trim down the search space to $O(n^2)$ lines by considering only the lines through the points of every pair of objects in the 2-D plane. This is because the results of all other possible lines are covered by these $O(n^2)$ lines~\cite{vapnik1998statistical}. Nevertheless, it is still very time-consuming to consider $O(n^2)$ lines for each feature pair $(f_i,f_j)$.
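For concreteness, a minimal Python sketch of this brute-force enumeration could look as follows (we assume NumPy arrays \texttt{x} and \texttt{y} holding the labeled objects' values on $f_i$ and $f_j$, and \texttt{labels} holding their $\pm 1$ labels; the names are illustrative only, not part of \genviz):
\begin{verbatim}
import numpy as np
from itertools import combinations

def brute_force_separability(x, y, labels):
    """theta_{i,j}: best number of correctly separated objects over the
    O(n^2) candidate lines, one through each pair of object points."""
    n = len(labels)
    best = 0
    for a, b in combinations(range(n), 2):
        # Line through points a and b: w_i*x + w_j*y + w_0 = 0
        w_i = y[b] - y[a]
        w_j = x[a] - x[b]
        w_0 = -(w_i * x[a] + w_j * y[a])
        pred = np.sign(w_i * x + w_j * y + w_0)
        # Count correct objects for both orientations of the line
        best = max(best,
                   int(np.sum(pred == labels)),
                   int(np.sum(-pred == labels)))
    return best
\end{verbatim}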
\eat{
\begin{table}[t!]
\centering
\small
\begin{tabular}{|c|c|c|c|}
\hline
Symb. & Description & Symb. & Description\\
\hline
\hline
$\mm$ & feature-object matrix & $\ff$ & feature set in $\mm$ \\
\hline
$f_i$ & feature $i$ in $\ff$ & $m$ & number of features in $\ff$\\
\hline
$\oo$ & object set in $\mm$ & $N$ & number of objects in $\oo$\\
\hline
$\oo_+$ & positive object set & $\oo_-$ & negative object set\\
\hline
$\widehat{\oo}$ & labelled object set & $n$ & number of labelled objects in $\widehat{\oo}$\\
\hline
$o_k$ & object $k$ in $\widehat{\oo}$ & $l_k$ & label of object $o_k$\\
\hline
$\ell$ & separating line in 2-D & $\lhat$ & representative line in 2-D\\
\hline
$\eta_{i,j}^{\ell,k}$ & predicted label of $o_k$ & $\theta_{i,j}^{\ell,k}$ & $o_k$ is correctly separated? \\
\hline
$\theta_{i,j}^{\ell}$ & \# correctly separated $o_k$ & $\theta_{i,j}$ & separability score\\
\hline
$\widehat{\mm}$ & $\mm$ after transformation & $\tilde{\theta}_{i,j}$ & estimated $\theta_{i,j}$\\
\hline
\end{tabular}
\caption{Notation used in this paper.}
\label{tbl:notation}
\vspace{-18pt}
\end{table}
}
\stitle{Rocchio-based Measure.} \eat{Rather than enumerating the $O(n^2)$ possible lines to find the one with maximum $\theta_{i,j}^{\ell}$, }We can speed up the process by selecting a single {\em representative line} $\lhat$ that provides an estimate of the true separability score $\theta_{i, j}$\eat{estimated separability score, $\theta_{i,j}^{\lhat}$, that approximates the true linear separability score $\theta_{i, j}$}.
%as shown in Equation~\ref{eqn:s_viz_appr}, .
%\begin{equation}\label{eqn:s_viz_appr}
%\theta_{i,j} \leftarrow \theta_{i,j}^{\lhat} %\approx
%\end{equation}
In order to achieve a fast and reliable estimate, we select our representative line based on Rocchio's algorithm~\cite{rocchio1971relevance}.
%As we will show later in Section~\ref{sec:exp}, $\theta_{i,j}^{\lhat}$ is comparable to $\theta_{i,j}$ when using Rocchio-based representative line $\lhat$. Let us describe in detail about the Rocchio-based representative line.
%The intuition behind using Rocchio's algorithm is that each object's predicted label should be the class label of its nearest class centroid.
Let us denote the centroids of positive objects $\oo_+$ and negative objects $\oo_-$ for a given $(f_i,f_j)$ as $\mu_{i,j}^+=(\mm_i^+,\mm_j^+)$ and $\mu_{i,j}^-=(\mm_i^-,\mm_j^-)$ respectively, where $\mm_i^+$ and $\mm_j^+$ are the values of the centroids of the positive objects on feature $f_i$ and $f_j$, and $\mm_i^-$ and $\mm_j^-$ are the values of the centroids of the negative objects on feature $f_i$ and $f_j$. The perpendicular bisector of the line joining the two centroids is selected as the representative separating line $\lhat$ (see Figure~\ref{fig:rocchio}), with its coefficients corresponding to Equation~\ref{eqn:est_label} defined as {\small $w_i = \mm_i^+-\mm_i^-$}, {\small $w_j = \mm_j^+-\mm_j^-$}, and {\small $w_0 = -(\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2}+\frac{(\mm_j^+)^2-(\mm_j^-)^2}{2})$}.
\eat{following the template of Equation~\ref{eqn:line} with coefficients as in Equation~\ref{eqn:rep_line}:
%\lhat: \hspace{4mm} (\mm_i^+-\mm_i^-)\cdot x + (\mm_j^+-\mm_j^-)\cdot y -(\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2}+\frac{(\mm_j^+)^2-(\mm_j^-)^2}{2}) =0
\begin{equation}\label{eqn:rep_line}
\begin{split}
& w_i = \mm_i^+-\mm_i^- \\
& w_j = \mm_j^+-\mm_j^- \\
& w_0 = -(\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2}+\frac{(\mm_j^+)^2-(\mm_j^-)^2}{2})
\end{split}
\end{equation}}
\eat{In Figure~\ref{fig:rocchio}, the ``Rocchio-based'' measure with the representative separating line $\theta_{i,j}^{\lhat}$ equals $13$ with one negative object (red point) mis-predicted as positive, while the true $\theta_{i,j}$ also equals $13$.}
\stitle{Brute-force vs. Rocchio-based.} Compared to the brute force calculation, the Rocchio-based measure is much more lightweight, but at the cost of some accuracy in calculating $\theta_{i,j}$. Intuitively, the representative line is a reasonable proxy for the best separating line since the Rocchio-based measure \eat{computes the centroids of the two classes as their representatives and }assigns each object to its nearest centroid. We further empirically demonstrate that $\theta_{i,j}^{\lhat}$ is a good proxy for $\theta_{i,j}$ in Section~\ref{sec:exp_comp}. Thus, we will focus on the Rocchio-based measure subsequently, removing $\lhat$ (or $\ell$) from the superscripts where it appears, and using $\theta_{i,j}$ and $\theta_{i,j}^{\lhat}$ interchangeably.
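As an illustration, a short Python sketch of the Rocchio-based score for one feature pair, following the centroid and coefficient definitions above, might read as follows (we assume \texttt{Mi} and \texttt{Mj} are the rows of the feature-object matrix for $f_i$ and $f_j$ restricted to the labeled objects, and \texttt{labels} holds $\pm 1$; the names are ours):
\begin{verbatim}
import numpy as np

def rocchio_separability(Mi, Mj, labels):
    """Separability score theta_{i,j} using the Rocchio representative line."""
    pos, neg = labels == 1, labels == -1
    # Class centroids on features f_i and f_j
    mi_pos, mj_pos = Mi[pos].mean(), Mj[pos].mean()
    mi_neg, mj_neg = Mi[neg].mean(), Mj[neg].mean()
    # Perpendicular bisector of the segment joining the two centroids
    w_i = mi_pos - mi_neg
    w_j = mj_pos - mj_neg
    w_0 = -((mi_pos**2 - mi_neg**2) / 2 + (mj_pos**2 - mj_neg**2) / 2)
    # Predicted labels and number of correctly separated objects
    pred = np.sign(w_i * Mi + w_j * Mj + w_0)
    return int(np.sum(pred == labels))
\end{verbatim}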
% \xagp{you have used Rocchio's alg/measure/representative line. Maybe avoid some of them for consistency.}
%==================================================================
\subsection{Proposed Suite of Optimizations}\label{sec:opt}
\begin{figure}[t!]
\centering %%% not \center
\vspace{-2mm}
\subfigure[Brute Force]{\label{fig:brute_force}\includegraphics[width=.2\textwidth]{fig/metric.eps}}
\subfigure[Rocchio-Based]{\label{fig:rocchio}\includegraphics[width=.2\textwidth]{fig/rocchio.eps}}
\vspace{-5mm}
\caption{Calculating Separability Score $\theta_{i,j}$. The scored separating line can be defined using (a) brute force (a few sample lines are shown) or (b) the representative line from a Rocchio-based measure based on the object class centroids (white circles).}
\vspace{-5mm}
\label{fig:metric}
\end{figure}
In this section, we first analyze the time complexity of identifying the \topk feature pairs using the Rocchio-based measure, and then propose several optimization techniques to reduce the complexity.
\stitle{Time Complexity Analysis.} For\eat{In order to calculate the separability of} a given feature pair $(f_i, f_j)$, if we have already calculated the class centroids for each feature, the separating line $\lhat$ can be calculated in $O(1)$\eat{ time using Equation\eat{~\ref{eqn:rep_line}} for $w_i$, $w_j$ and $w_0$}. We can then calculate the number of correctly separated objects $\theta_{i,j}$ via\eat{ by evaluating all objects with respect to the line, i.e.,} $O(n)$ evaluations. Since there are $O(m^2)$ feature pair candidates, the total time complexity is $O(m^2n)$, which can be very large,
since $m$ and $n$ are typically large.
\stitle{Optimizations: Overview.} To reduce the time complexity, we introduce two categories of optimizations:
those that reduce the amount of time
for fully evaluating a given feature pair (Section~\ref{ssec:trans},~\ref{ssec:earlyT}) and
those that reduce the number of feature pairs
that require full evaluation (Section~\ref{ssec:sampling},~\ref{ssec:traversal}).
In the following, we refer to these optimizations as {\em modules} to indicate that they
can be used in any combination---however, in reality, careful engineering is necessary
to ``stitch'' these modules together to multiply the effects of the optimizations.
\eat{The first optimization module within the first category, }\trans module (Section~\ref{ssec:trans}) reduces redundant calculations across feature pairs by mapping the feature-object matrix $\mm$ into a new space that enables faster evaluation of object labeling\eat{, and thus faster calculation of separability score}. \eat{The second optimization module, }\earlyT module (Section~\ref{ssec:earlyT}) takes advantage of the fact that evaluation of a poorly separating feature pair can be terminated early without having to evaluate the separability of all $n$ objects.
\eat{The first optimization module within the second category, }\sampling module
(Section~\ref{ssec:sampling}) first identifies likely \topk feature pair candidates by evaluating their separability on a sampled subset of all objects, and then conducts full evaluations only on these feature pair candidates.
Finally, \traversal module (Section~\ref{ssec:traversal})
reduces the number of feature pairs checked by greedily
choosing feature pairs based on the separability of the
corresponding single features. These optimization modules can be used on their own or combined with each other.
In Section~\ref{sec:exp}, we will show how these optimization modules greatly reduce the running time of finding the \topk separating feature pairs without significantly affecting the accuracy.
%==================================================================
\subsubsection{Pre-Transformation for Faster Feature Pair Evaluation} \label{ssec:trans}
We observe that there is massive redundancy across the computation of $\theta_{i,j}$ for different feature pairs. Motivated by this, we propose the \trans optimization module, which pre-calculates common computational components once per feature and reuses them when evaluating the separability of each feature pair. This \trans module transforms the original matrix $\mm$ into a new matrix $\widehat{\mm}$ using the identified common components and updates the separability score equation accordingly. Specifically, with the transformed feature-object matrix $\widehat{\mm}$,
evaluating whether an object was correctly separated is simplified as: if $sign(\widehat{\mm}_{i,k} + \widehat{\mm}_{j,k}) = 1$, then $\theta_{i,j}^{k}=1$; otherwise, $\theta_{i,j}^{k}=0$. Details and an example can be found in Supplementary Note~\ref{app:trans} and Supplementary Figure~\ref{appF:transform}.
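As a rough sketch, the pre-transformation and the simplified per-object check can be written as follows (a minimal illustration consistent with the transformation summarized above; we assume \texttt{M} is the feature-object matrix restricted to the labeled objects $\widehat{\oo}$ and \texttt{labels} holds $\pm 1$; the names are ours):
\begin{verbatim}
import numpy as np

def pre_transform(M, labels):
    """Fold each feature's class centroids and the object labels into M once,
    so that each pair evaluation reduces to a sign check per object."""
    pos, neg = labels == 1, labels == -1
    mu_pos = M[:, pos].mean(axis=1)   # per-feature centroid of positive objects
    mu_neg = M[:, neg].mean(axis=1)   # per-feature centroid of negative objects
    w = (mu_pos - mu_neg)[:, None]
    w0 = (-(mu_pos**2 - mu_neg**2) / 2)[:, None]
    return (w * M + w0) * labels[None, :]   # Mhat

def pair_score(Mhat, i, j):
    """theta_{i,j}: number of correctly separated objects for (f_i, f_j)."""
    return int(np.sum(np.sign(Mhat[i] + Mhat[j]) == 1))
\end{verbatim}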
\eat{
Let us review the process of computing the separability $\theta_{i,j}$. Given a feature pair $(f_i,f_j)$ and the corresponding positive and negative centroids, {\em (a)} we first compute $w_0$, $w_i$ and $w_j$ for $\lhat$\eat{ based on Equation~\ref{eqn:rep_line}}. Next, for each object $o_k$, {\em (b)} we obtain the predicted label $\eta_{i,j}^{\lhat,k}$ according to Equation~\ref{eqn:est_label}. This step requires two multiplications and three additions. Finally, {\em (c)} we calculate $\theta_{i,j}^{k}$ and the separability $\theta_{i,j}$ based on Equations in Section~\ref{sec:metric}\eat{Equation~\ref{eqn:s_object} and \ref{eqn:s_line} respectively}. This whole process is repeated for every feature pair candidate. However, there is massive redundancy across the processing of different feature pairs. For instance, when calculating the separability for two different feature pairs $(f_i,f_j)$ and $(f_i,f_{j'})$ with a common $f_i$, $w_i$ is in fact shared, and calculation of $w_i\cdot \mm_{i,k}$ in Equation~\ref{eqn:est_label} is repeated for each object $o_k$.
\tr{\sinha{is there another work we can cite that has done this?} \silu{Aditya's another paper~\cite{parameswaran2012crowdscreen} did similar trick, but in different problem. Do we want to cite it?}\agp{I think it's too far off.}}
Given this, we propose the \trans optimization module which will pre-calculate some common computational components once across different features and reuse these components the separability for each feature pair to eliminate the repeated computation. This \trans module transforms the original $\mm_{i,k}$ matrix into another space using our identified common feature pair components and updates the separability score equation accordingly.
For each feature $f_i$, we find the average values of the positive and negative objects for that feature, $\mm_i^+$ and $\mm_i^-$ respectively, and then we pre-transform $\mm_{i,k}$, i.e., the value of object $o_k$ on the feature $i$, to $\widehat{\mm}_{i,k}=$ $\Big( (\mm_i^+-\mm_i^-)\cdot \mm_{i,k}-\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2} \Big) \cdot l_k$.
\eat{
\begin{equation}\label{eqn:matrix_transform}
\widehat{\mm}_{i,k} = \Big( (\mm_i^+-\mm_i^-)\cdot \mm_{i,k}-\frac{(\mm_i^+)^2-(\mm_i^-)^2}{2} \Big) \cdot l_k
\end{equation}
}
The basic idea is to decompose
Equation~\ref{eqn:est_label} into two components,
with each one only related to a single feature.
This transformation incorporates
the class centroids into the matrix values,
obviating their integration later
for every feature pair that involves the given feature.
We\eat{Equation~\ref{eqn:matrix_transform}} also multiplies
in the class label of the object, $l_k$,
rather than repeating this multiplication every time the object is evaluated (see example in \transfig). With this transformation of the feature-object matrix,
evaluating whether an object was correctly separated is simplified as: if $sign(\widehat{\mm}_{i,k} + \widehat{\mm}_{j,k}) = 1$, then $\theta_{i,j}^{k}=1$; otherwise, $\theta_{i,j}^{k}=0$. \eat{from Equation~\ref{eqn:s_object} into Equation~\ref{eqn:s_object_transform}:
% \agp{confusing because $L$ is not there in Equation~\ref{eqn:s_object}}\silu{maybe add "Since our focus is on Rocchio-based measure, we replace the separating line $\ell$ in Equation~\ref{eqn:s_object} with the representing line $L$ in Equation~\ref{eqn:s_object_transform}."}
%, and get rid of checking the real label $l_k$ in Equation~\ref{eqn:s_object}
\begin{equation}\label{eqn:s_object_transform}
\theta_{i,j}^{k}=\left\{
\begin{array}{ll}
1 \textit{\hspace{2mm} if \hspace{2mm}} sign(\widehat{\mm}_{i,k} + \widehat{\mm}_{j,k}) = 1\\
0 \textit{\hspace{2mm} otherwise \hspace{2mm}}
\end{array}
\right.
\end{equation}
%\begin{equation}\label{eqn:est_label_transform}
%\eta_{i,j}^{\lhat,k}=\sign(\widehat{\mm}_{i,k} + \widehat{\mm}_{j,k})
%\end{equation}
After pre-transformation, we are now ready
to compute the separability score $\theta_{i,j}$.
Given a feature pair $(f_i,f_j)$, we can compute $\theta_{i,j}^{k}$
for each object $o_k$ based on Equation~\ref{eqn:s_object_transform}.}
Note that this step only involves one addition and one comparison and is performed only once for each feature.
Next, we can calculate overall separability score $\theta_{i,j} = \sum_{k}{\theta_{i,j}^{k}}$\eat{ like in Equation~\ref{eqn:s_line} without the pre-transformation}. Overall,
we not only eliminate the steps of computing
$w_0$, $w_i$ and $w_j$ for every feature pair,
but also reduce the cost of calculating
$\eta_{i,j}^{k}$ in Equation~\ref{eqn:est_label}.
In the following sections,
we will consider $\widehat{\mm}$ using pre-transformation instead of $\mm$, and calculate $\theta_{i,j}$ accordingly\eat{
and Equation~\ref{eqn:s_object_transform} instead of
Equation~\ref{eqn:est_label} and~\ref{eqn:s_object}}.}
%Hence, by conducting transformation in Equation~\ref{eqn:matrix_transform} and~\ref{eqn:s_object_transform}, we have mapped our problem to a new space.
%\begin{figure}[h]
%\centering %%% not \center
%\vspace{-2mm}
%\subfigure[Pre-Transformation]{\label{fig:transform}\includegraphics[width=.235\textwidth]{fig/transformation.pdf}}
%\subfigure[Early Termination ($\tau = 1$)]{\label{fig:earlyT}\includegraphics[width=.235\textwidth]{fig/earlyT.pdf}}
%\vspace{-5mm}
%\caption{Optimizations for Feature Pair Evaluation. (a) The \trans module is applied once to the original values in the feature-object matrix $\mm$ (above) using Equation~\ref{eqn:matrix_transform} to produce $\widehat{\mm}$ (below). (b) With the \earlyT module, a feature pair can be discarded from consideration for the \topk before all objects are evaluated (above) with even greater speedup potential after re-ordering the objects (below).}
%\vspace{-8mm}
%\label{fig:trans_term}
%\end{figure}
%\begin{example}[Transformation]
%Figure~\ref{fig:transform} illustrates the transformation for two features, $f_i$ and $f_j$. The top half depicts $\mm_{i,k}$ and $\mm_{j,k}$ before transformation, where green color represents a positive label and red color represents a negative label. In this example, the centroids of the positive and negative objects are $\mu_{i,j}^+=(5,7)$ and $\mu_{i,j}^-=(3,5)$ respectively. Hence, we can rewrite Equation~\ref{eqn:matrix_transform} as $\widehat{\mm}_{i,k} = (2\mm_{i,k}-8)\cdot l_k$ and $\widehat{\mm}_{j,k} = (2\mm_{i,k}-12)\cdot l_k$ for features $f_i$ and $f_j$ respectively. After calculation, we can obtain the values for $\widehat{\mm}_{i,k}$ and $\widehat{\mm}_{j,k}$ shown in the bottom half of Figure~\ref{fig:transform}.
%\end{example}
\tr{
\sinha{at this point, write a summary of the time complexity of calculating $\theta_{i,j}^{\lhat}$ for a feature pair i,j using transformation (include preprocessing separately)}
\silu{the big-O time complexity is not changed. More precisely, only with constant improvement. Preprocessing takes O(mn) as we discussed in Results section.}
}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%NEED TOGGLE%%%%%%%%%%%%%%%%%%%%%%
% \begin{figure}[h]
% \centering
% \begin{subfigure}{0.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/transformation.pdf}
% \vspace{-5mm}
% \caption{Pre-Transformation}%{$\theta_{i,j}^{\ell}$ with Different Lines}
% \label{fig:transform}
% \end{subfigure}
% \begin{subfigure}{.235\textwidth}
% \centering
% \includegraphics[width=\linewidth]{fig/earlyT.pdf}
% \vspace{-5mm}
% \caption{Early Termination ($\tau = 1$)}%{$\theta_{i,j}^{\lhat}$ with Representative Line}
% \label{fig:earlyT}
% \end{subfigure}
% \caption{Different Methods to Reduce Running Time}
% \label{fig:metric}
% \end{figure}
\tr{
\stitle{Extension to Single Features.} For a feature pair $(f_i,f_j)$ where $i=j$, we can treat the corresponding $\theta_{i,i}$ as the separability score for a single feature $f_i$. Equation~\ref{eqn:s_object_transform} can be further reduced to Equation~\ref{eqn:s_object_transform_single}.
\begin{equation}\label{eqn:s_object_transform_single}
\theta_{i,i}^{k}=\left\{
\begin{array}{ll}
1 \textit{\hspace{2mm} if \hspace{2mm}} sign(\widehat{\mm}_{i,k}) = 1\\
0 \textit{\hspace{2mm} otherwise \hspace{2mm}}
\end{array}
\right.
\end{equation}
Hence, we can calculate the separability score $\theta_{i,i}$ very efficiently for each single feature, based on the transformed matrix $\widehat{\mm}$.
}
%==================================================================
\subsubsection{Early Termination} \label{ssec:earlyT}
Given a feature pair $(f_i,f_j)$,
we need to scan all the objects
to compute the separability score $\theta_{i,j}$\eat{ according to Equation~\ref{eqn:s_line}}.
However, since we only need to identify feature pairs in the \topk,
we can stop for each feature pair as soon as we can make that determination, without
scanning all objects; we call this the \earlyT module (Figure~\ref{fig:workflow}).
\stitle{High Level Idea.} We maintain an upper bound $\tau$
on the separability error $err_{i,j}$ of the \topk feature pairs.
The corresponding lower bound on the separability
score is then ($n-\tau$)\eat{,
where $n$ is the total number of objects $\hat{\oo}$}.
Given a feature pair $(f_i,f_j)$, we scan the object list and stop as soon as the number of incorrectly classified objects exceeds $\tau$; in that case, we can terminate early and prune this feature pair since it cannot be among the \topk.
Otherwise, $(f_i,f_j)$ is added to the \topk feature pair set and we update $\tau$ accordingly.
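A minimal sketch of this check for a single feature pair could be the following (we assume the pre-transformed matrix \texttt{Mhat} from the \trans module; the names are ours and objects lying exactly on the line are treated as mis-separated):
\begin{verbatim}
import numpy as np

def early_terminate_pair(Mhat, i, j, tau):
    """Scan objects for pair (f_i, f_j); stop once more than tau objects
    are incorrectly separated, since the pair cannot be in the top-k."""
    errors = 0
    n = Mhat.shape[1]
    for k in range(n):
        if np.sign(Mhat[i, k] + Mhat[j, k]) != 1:   # object k mis-separated
            errors += 1
            if errors > tau:
                return None          # pruned: separability error exceeds tau
    return n - errors                # theta_{i,j}; caller updates tau
\end{verbatim}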
\stitle{Enhancement by Object Ordering.}
Although \earlyT always has the potential to reduce the running time, its benefits are sensitive to the order in which objects are evaluated. Since we terminate as soon as we find more than $\tau$ incorrectly classified objects, we can improve the running time if we examine ``problematic'' objects, i.e., those unlikely to be correctly classified, relatively early.
For this\eat{To evaluate ``problematic'' objects earlier},
we order the objects in descending order of
the number of single features $f_i$ that incorrectly
classify the object $o_k$, i.e., $\widehat{\mm}_{i,k}\leq0$.
Thus, the first object evaluated is the one that is incorrectly
classified by the most single features. The benefit of this strategy is illustrated with an example in Figure~\ref{fig:earlyT}.
\eat{We study the impact of this strategy in Section~\ref{sec:exp}.}
\tr{\cb{note, 0's are treated as mis-labels and ties are broken randomly.}\agp{Probably not super important to clarify this.}}
\begin{figure}[t!]
\centering
\vspace{-2mm}
\subfigure[\earlyT Example]{\label{fig:earlyT}\includegraphics[height=.165\textwidth]{fig/earlyT.eps}}
\subfigure[\sampling Example]{\label{fig:sampling}\includegraphics[height=.165\textwidth]{fig/sampling.eps}}
% \includegraphics[width=0.8\linewidth]{fig/sampling.pdf}
\vspace{-5mm}
\caption{Optimization Module Examples. (a) When evaluating a feature pair with \earlyT module, the transformed $\widehat{\mm}$ scores are scanned left to right and each incorrectly classified object is marked in blue. Without object ordering (above), evaluation terminates after five checked objects. When objects are reordered by the most ``problematic'' (below), the feature pair is rejected after checking only the first two objects. (b) To calculate the \topthree feature pairs with \sampling, the confidence interval of $\theta_{i,j}$ is calculated for every feature pair evaluated on the sample set $\sss$ (above). The $3^{rd}$ interval lower bound $\zeta$ is obtained (red dotted line), and all feature pairs with a larger upper bound are designated as candidates for validation (blue intervals). The selected candidates (center box) are evaluated on the whole object set $\hat{\oo}$ to compute the exact $\theta_{i,j}$ and pick the \topthree (right box).}
\vspace{-5mm}
\label{fig:etermANDsampling}
\end{figure}
%==================================================================
\subsubsection{Sampling-based Estimation} \label{ssec:sampling}
One downside of the \earlyT module
is that the improvement in the running time is highly data-dependent.
Here, we propose a stochastic method, called \sampling,
that reduces the number of examined objects.
Instead of calculating $\theta_{i,j}$ over the
whole object set $\hat{\oo}$, \sampling works on a
sample set drawn from $\hat{\oo}$.
\tr{
\sinha{use $\theta_{i,j}^{\lhat}$ instead of $\theta_{i,j}$}
\cb{we introduced this simplification at the end of Section \ref{sec:metric}}\agp{yes, but you used it in 2.3.1.}
\silu{Maybe we can add one sentence here to remind the readers? "Recall that we use $\theta_{i,j}^{\lhat}$ and $\theta_{i,j}$ interchangeably throughout the paper."}
}
\stitle{High Level Idea.} \sampling primarily consists of two phases: {\em candidate generation} and {\em validation}
(Figure~\ref{fig:sampling}).
In phase one, we estimate the confidence interval of $\theta_{i,j}$
for each feature pair using a sampled set of objects and
generate the candidate feature pairs for full evaluation
based on where their confidence intervals lie.
If a feature pair's confidence interval overlaps with the
score range of the current \topk, then it is selected for full evaluation.
In phase two (lower half of Figure~\ref{fig:sampling}),
we evaluate only the feature pairs in the candidate set,
calculating $\theta_{i,j}$ over the whole object set, $\hat{\oo}$,
to obtain the final \topk feature pairs.
Unlike our previous optimizations, \sampling returns
an approximation of the \topk ranking feature pairs.
\eat{In Section~\ref{sec:exp_comp}, we study how the greatly
reduced running time of \sampling affects the accuracy of the \topk pairs.} %We will describe in detail in Section~\ref{sssec:generate} and \ref{sssec:validate}.
\stitle{Candidate Generation.} %\label{sssec:generate}
Let $\sss$ be a sample set drawn uniformly from $\hat{\oo}$. Given a feature pair $(f_i,f_j)$, let $\theta_{i,j}(\sss)$ be the number of correctly separated objects in $\sss$. \eat{Let $\tilde{\theta}_{i,j}$ be the estimated separability score of all objects based on the sample set $\sss$. First, }We can estimate $\tilde{\theta}_{i,j}$ from $\theta_{i,j}(\sss)$ using $\tilde{\theta}_{i,j} = \frac{\theta_{i,j}(\sss)}{|\sss|} \cdot n$\eat{Equation~\ref{eqn:est_score}} by assuming the ratio of correctly separated samples in $\sss$ is the same as that in $\hat{\oo}$.
Using Hoeffding's inequality~\cite{hoeffding1963probability}, by selecting $\Omega(\frac{1}{\epsilon^2})$ samples, $\theta_{i,j}$ lies in the confidence interval $[\tilde{\theta}_{i,j}-\epsilon n, \tilde{\theta}_{i,j}+\epsilon n]$ with high probability (details in Supplementary Note~\ref{app:sampling}). Since the sample size $|\sss|$ is independent of the number of objects, this module helps \genviz scale to datasets with large $n$.
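For intuition, the argument detailed in Supplementary Note~\ref{app:sampling} can be sketched as follows. Hoeffding's inequality bounds the deviation of the sampled fraction of correctly separated objects:
$$Pr\Big(\Big|\frac{\theta_{i,j}(\sss)}{|\sss|}-\frac{\theta_{i,j}}{n}\Big| \geq \epsilon\Big) \leq 2e^{-2|\sss|\epsilon^2},$$
so a sample of size $|\sss| \geq \frac{1}{2\epsilon^2}\log(\frac{2}{\delta})$ ensures $|\tilde{\theta}_{i,j}-\theta_{i,j}|\leq \epsilon n$ with probability at least $1-\delta$.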
\eat{
\begin{equation}\label{eqn:est_score}
\tilde{\theta}_{i,j} = \frac{\theta_{i,j}(\sss)}{|\sss|} \cdot n
\end{equation}}
\eat{
\noindent Next, we show that with a constant sample set size $|\sss|$, $|\theta_{i,j}-\tilde{\theta}_{i,j}|$ is bounded by a small value with high probability, using Hoeffding's inequality~\cite{hoeffding1963probability}. \eat{We formally quantize the sample set size in Theorem~\ref{them:est}.}
\begin{theorem}[Estimation Accuracy]\label{them:est}
By considering $\Omega(\frac{1}{\epsilon^2}\cdot \log(\frac{1}{\delta}))$ samples, with probability at least $1-\delta$, we have $|\frac{\theta_{i,j}(\sss)}{|\sss|}-\frac{\theta_{i,j}}{n}| \leq \epsilon$, i.e., $|\tilde{\theta}_{i,j}-\theta_{i,j}|\leq \epsilon n$.
\end{theorem}
\tr{\begin{proof}
Based on Hoeffding's inequality:
$$Pr(|\frac{\theta_{i,j}(\sss)}{|\sss|}-\frac{\theta_{i,j}}{n}| \geq \epsilon) \leq 2e^{-2|\sss|\epsilon^2}$$
Hence, when $|\sss| = \frac{1}{2\epsilon^2}\cdot \log(\frac{2}{\delta}) = \Omega(\frac{1}{\epsilon^2}\cdot \log(\frac{1}{\delta}))$, we have:
$$
Pr(|\frac{\theta_{i,j}(\sss)}{|\sss|}-\frac{\theta_{i,j}}{n}| \geq \fepsilon) \leq \delta \ \ \ \ \ \ \ \ \ \ \ \square
$$
\end{proof}
}
\noindent We can treat $\log(1/\delta)$ as a constant, e.g., by setting $\delta = 0.05$. Thus, Theorem~\ref{them:est} essentially states that with only $\Omega(\frac{1}{\epsilon^2})$ samples, with probability 95\%, the confidence interval for $\theta_{i,j}$ is $[\tilde{\theta}_{i,j}-\epsilon n, \tilde{\theta}_{i,j}+\epsilon n]$. Note that the sample size $|\sss|$ only depends on $\epsilon$ and is independent of the number of objects, $n$. Hence, the \sampling module helps \genviz scale to datasets with large $n$.
}
%For instance, when setting $\epsilon=0.05$, the sample size is fixed to be $400$.
\eat{With Theorem~\ref{them:est}, }Following the top half of Figure~\ref{fig:sampling}, we can first calculate the confidence interval of $\theta_{i,j}$ for each feature pair $(f_i,f_j)$. Next, we compute the lower bound of $\theta_{i,j}$ for the \topk feature pairs, denoted as $\zeta$ and shown by the red dotted line. Finally, we can prune away feature pairs whose upper bound is smaller than $\zeta$, keeping the candidate set $\cc$ of feature pairs depicted by blue intervals. These feature pairs $\cc$ will be further validated in phase two, i.e., candidate validation. Typically, $|\cc|$ will be orders of magnitude smaller than $m^2$, the size of the original search space over all feature pairs.
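A compact Python sketch of this candidate-generation step might look as follows (we assume a NumPy array \texttt{theta\_S} holding, for every feature pair, the number of correctly separated sampled objects, along with the sample size, $n$, $\epsilon$, and $K$; the names are ours):
\begin{verbatim}
import numpy as np

def generate_candidates(theta_S, S_size, n, eps, K):
    """Phase one: confidence intervals from the sample, then pruning.
    theta_S[p] = correctly separated sampled objects for pair p."""
    est = theta_S / S_size * n                   # estimated theta_{i,j}
    lower, upper = est - eps * n, est + eps * n  # confidence interval per pair
    # zeta: K-th largest lower bound, i.e., lower bound of the current top-K
    zeta = np.sort(lower)[-K]
    # Keep every pair whose interval overlaps the top-K score range
    return np.flatnonzero(upper >= zeta)         # candidate set C
\end{verbatim}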
%We will verify this later in our evaluations (Section~\ref{sec:exp}).
\stitle{Candidate Validation.} %\label{sssec:validate}
\eat{Since we only estimated the separability scores $\theta_{i,j}$ for our feature pair candidates in phase one, }We re-evaluate all of the candidates generated from phase one to produce our final feature pair ranking. This \eat{phase two }evaluation is performed using the whole object set $\hat{\oo}$ and the \topk feature pairs are reported (lower half of Figure~\ref{fig:sampling}).
\tr{\sinha{are we using early termination and object ordering in phase two?} \silu{no. We tried both and found that early termination is not helping most of time since 1) the feature pair candidates, which perform well in phase 1, are indeed the ones that are hard to prune and 2) \earlyT incurs some overhead itself.}
}
\eat{For each feature pair $(f_i,f_j)\in \cc$, we calculate the separability score $\theta_{i,j}$ and find the \topk feature pairs (lower half of Figure~\ref{fig:sampling}).}
\stitle{Enhancement by Candidate Ordering.}
In Section~\ref{ssec:earlyT} we proposed an enhancement
that allows us to terminate computation early by
manipulating the order of the objects;
here we similarly found a way to reduce the running
time by changing the order in which feature pair candidates
are validated in phase two.
Instead of directly validating each feature pair candidate,
we first order the candidates in descending order according to the upper bound of each candidate's confidence interval.
Then, we sequentially calculate the full separability
score $\theta_{i,j}$ for each feature pair, and update $\zeta$ correspondingly.
Recall that $\zeta$ is the current estimate of the lower bound of $\theta_{i,j}$ for the \topk feature pairs. Finally, we terminate our feature
pair validation when the next feature pair's upper bound smaller than the current value of $\zeta$ (Figure~\ref{fig:candidate_ordering}).
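The candidate-validation phase with this ordering can be sketched as follows (we assume \texttt{upper} holds the interval upper bounds from phase one and \texttt{exact\_score(p)} computes $\theta_{i,j}$ over all of $\hat{\oo}$ for candidate pair \texttt{p}; the names are ours):
\begin{verbatim}
import heapq

def validate_candidates(candidates, upper, exact_score, K):
    """Phase two with candidate ordering: validate candidates in decreasing
    order of their upper bounds, stopping once zeta exceeds them."""
    order = sorted(candidates, key=lambda p: upper[p], reverse=True)
    topk = []                                  # min-heap of exact scores
    for p in order:
        if len(topk) == K and upper[p] < topk[0][0]:
            break                              # zeta exceeds this upper bound
        score = exact_score(p)                 # theta_{i,j} over all labeled objects
        heapq.heappush(topk, (score, p))
        if len(topk) > K:
            heapq.heappop(topk)                # keep top-K; zeta = topk[0][0]
    return sorted(topk, reverse=True)
\end{verbatim}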
\begin{figure}[h]
\centering
\vspace{-5mm}
\includegraphics[width=0.9\linewidth]{fig/candidate_ordering.eps}
\vspace{-6mm}
\caption{Candidate Ordering Enhancement. (a) Feature pair candidates are sorted by the upper bounds of their confidence intervals (solid red boundary), and the lower bound of the \topthree feature pairs, i.e., $\zeta$, is set (red dotted line). (b,c,d) For each feature pair, we calculate $\theta_{i,j}$ (filled blue circle) using all objects and update $\zeta$ if necessary. Note that $\zeta$ is increased in (d) after evaluating the third feature pair and since $\zeta$ is larger than the upper bound of the fourth feature pair, candidate validation can terminate and return the top ranking pairs.}
\vspace{-5mm}
\label{fig:candidate_ordering}
\end{figure}
\tr{
\sinha{ some readers will be confused by higher values being on the left}}
%==================================================================
\subsubsection{Search Space Traversal} \label{ssec:traversal}
The optimizations discussed so far check fewer
than $n$ objects for each feature pair and reduce
the number of feature pairs
for full evaluation. Our \traversal module
aims to reduce the number of feature pairs
considered from $m^2$ to a smaller number.
Instead of examining every feature pair, we examine only a limited number of feature pairs, in an optimized traversal order. The number of examined feature pairs, $\chi$, determines a trade-off between efficiency and accuracy: checking fewer feature pairs results in faster running times, though at the cost of accuracy of the \topk.
\eat{The full search space for the feature pairs is the upper triangle in Figure~\ref{fig:traversal}(a), where each row represents $f_i$ and each column represents $f_j$.}
The order of the feature pairs must be determined carefully, and we propose two alternative orderings based on ranking single features by their separability scores $\theta_{i,i}$. The first traversal order, called \emph{horizontal traversal}, prioritizes feature pairs that contain at least one high-ranking single feature. The second, called \emph{vertical traversal}, prioritizes feature pairs in which both features rank highly on their own.
%\squishlist
%\item \emph{Horizontal traversal:} for each feature $f_i$, pair it with each other feature $f_j$, where $j\geq i$, to obtain $(f_i,f_j)$. Repeat for each $f_i$, $1 \leq i\leq m$.
%\item \emph{Vertical traversal:} for each feature $f_j$, pair it with each other feature $f_i$, where $i\leq j$, to obtain $(f_i,f_j)$. Repeat for each $f_j$, where $1 \leq j\leq m$.
%\item \emph{Diagonal traversal (Figure~\ref{fig:traversal}(c)):} for a fixed value $v$, pair $(f_i,f_{v-i})$, where $1\leq i<v$. Repeat it for each $v$, where $2 \leq v\leq 2m$.
%\squishend
%\noindent
See Supplementary Figure~\ref{appF:traversal} for more details and an example.
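A simple Python sketch of generators consistent with these two intuitions is shown below (we assume \texttt{ranked} is the list of feature indices sorted by their single-feature scores $\theta_{i,i}$, best first, and \texttt{chi} is the budget $\chi$; the names are ours, and the exact orders are defined in the Supplementary material):
\begin{verbatim}
def horizontal_traversal(ranked, chi):
    """Pair each high-ranking feature with every later-ranked feature:
    at least one good single feature per examined pair."""
    count = 0
    for a, fi in enumerate(ranked):
        for fj in ranked[a:]:
            if count >= chi:
                return
            yield fi, fj
            count += 1

def vertical_traversal(ranked, chi):
    """Pair features so that both members rank highly on their own."""
    count = 0
    for b in range(len(ranked)):
        for a in range(b + 1):
            if count >= chi:
                return
            yield ranked[a], ranked[b]
            count += 1
\end{verbatim}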
\eat{
\stitle{Enhancement by Feature Ordering.} We propose to first sort features based on single features' separability scores $\theta_{i,i}$. Single features with good separability individually are ranked higher. Let $\{f_1^{'} f_2^{'},\cdots,f_m^{'}\}$ be the sorted single-feature set. Next, we discuss the intuition behind the two different traversal order schemes. Consider partitioning the sorted single-feature list into three categories based on the single-feature separability score $\theta_{i,i}$: {\em good, average,} and {\em poor}.
\squishlist
\item \emph{Horizontal traversal:} the intuition is that there is at least one {\em good} or {\em average} single feature in each \topk feature pair $(f_i,f_j)$.
\item \emph{Vertical traversal:} the intuition is that both features in each \topk feature pair $(f_i,f_j)$ must be categorized as {\em good} or {\em average} individually.
%\item \emph{Diagonal traversal:}
\squishend
\noindent We examine the impact of these traversal schemes in our evaluation.
%The intuition behind horizontal traversal is that there is at least one {\em top} or {\em median} single features in the \topk feature pairs $(f_i,f_j)$; while the intuition behind horizontal traversal is that both features in \topk feature pairs $(f_i,f_j)$ must rank {\em top} or {\em median} by their own.
\begin{figure}[h]
\centering
\vspace{-5mm}
\includegraphics[width=0.7\linewidth]{fig/traversal.eps}
\vspace{-5mm}
\caption{\traversal Orderings. \genviz is able to examine the feature pair search space (a) horizontally or (b) vertically.}
\vspace{-5mm}
\label{fig:traversal}
\end{figure}
}
%Typically, $K$ is not large, e.g., {\sc Top-1000}, while $m$ is large, e.g., {\sc 20000}.
%\begin{example}[\traversal]
%Assume $m=2\times 10^4$. Initially, the number of possible feature pairs is roughly $\frac{m^2}{2}=2\times 10^8$. However, if we limit the number of considered feature pairs to $\chi=10^7$, we reduce our search space to $\frac{1}{20}$ of the total number of feature pairs. We order the single features by their individual separability scores. In horizontal traversal, only feature pairs with at least one individual feature ranked in the top 500 will be considered; while vertical traversal will consider only feature pairs with both individual features ranked better than 2000.
%the "worst" $f_i$ ranks around 500, while $f_j$ can be any feature in $\ff$. On the other hand, in vertical traversal, the "worst" $f_i$ (and $f_j$) ranks around 2000.
%\label{examp:traversal}
%\end{example}
%==================================================================
\section{Results}
\label{sec:exp}
In this section, we illustrate that \genviz
rapidly identifies meaningful, significant, and interesting
separating feature pairs in real biological datasets.
First, we describe the datasets and the algorithms used
in our evaluation.
Each algorithm that we evaluate
represents a combination of optimization modules
for ranking \topk feature pairs using our
Rocchio-based measure---we report the running time
and accuracy
of the algorithms.
Second, we compare the \topk feature pairs returned by \genviz with the corresponding \topk single features,
and examine their significance and support in existing publications.
Last, we present some sample visualizations to illustrate the separability of the object classes.
%Finally, we enumerate and visualize
% the \topk feature pairs returned by \genviz,
% compare them to the corresponding \topk single features,
% and examine their significance and
% support in existing publications.
%==================================================================
\subsection{Evaluation Setup}
\begin{table}[h]
\centering
\vspace{-5mm}
\small
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
& $|\ff|$=$m$ & $|\oo|$=$N$ & $|\sss|$ & $\chi$ & \# of $\widehat{\oo}$ & avg($|\oo_+|$) & avg($|\oo_-|$) \\
\hline
\msig & 19,912 & 22,209 & 400 & $10^7$ & 10 & 295 & 21,914 \\
\hline
\lincs & 22,268 & 98,061 & 400 & $10^7$ & 40 & 165 & 97,897 \\
\hline
\end{tabular}
\caption{\scriptsize Dataset Statistics. For each dataset, the number of features $m$, objects $N$, sample size $|\sss|$ used by \sampling module, feature pairs $\chi$ examined by \traversal module, number of object sets: \# of $\widehat{\oo}$, average positive set size: avg($|\oo_+|$), and average negative set size: avg($|\oo_-|$).}
\label{tbl:dataset}
\vspace{-18pt}
\end{table}
\begin{table*}[t]
\centering
\small
\begin{tabular}{|c|c|c|c|c||c|c|}
%\hline
\hline
& \earlyT & \sampling & Candidate Ordering & \traversal & Approximation & Complexity\\
\hline
\baseline & no & no & no & Any & no &$O(m^2n)$\\
\hline
%\early & yes & yes & no & no & no & Any & no & $O(m^2n)$\\
% \hline
\earlyOrder & yes & no & no & Any & no & $O(m^2n)$\\
\hline
\samp & no & yes & no & Any & yes (guarantee) & $O(mn+m^2|\sss|+|\cc|n)$\\
\hline
\sampOpt & no & yes & yes & Any & yes (guarantee) & $O(mn+m^2|\sss|+|\cc|n)$\\
\hline
\horiz & no & yes & yes & Horizontal & yes (heuristic) & $O(mn+\chi|\sss|+|\cc|n)$\\
\hline
\vertic & no & yes & yes & Vertical & yes (heuristic) & $O(mn+\chi|\sss|+|\cc|n)$ \\
\hline
\end{tabular}
\caption{Selected Algorithms Using Different Optimization Modules. All algorithms, including the \baseline, use \trans. In addition, \earlyT and \traversal are coupled with object ordering and feature ordering by default, respectively. For each algorithm (row), we show which optimization modules are employed, whether the algorithm returns an exact or an approximate answer, and the running time complexity for that combination. The term ``guarantee'' (``heuristic'', resp.) indicates that the returned answer comes with (without, resp.) a stochastic guarantee. In addition, $m$ and $n$ are the number of features and objects, $|\sss|$ is the sampled set size, $\chi$ is the limit on the number of feature pairs considered, and $|\cc|$ is the number of generated feature pair candidates.}
\label{tbl:alg}
\vspace{-18pt}
\end{table*}
\stitle{Datasets.} We consider datasets from two
biological applications (see Table~\ref{tbl:dataset}):
(a) in \msig,
we find gene annotations such as pathways and biological processes
that separate the differentially expressed genes
from the undisturbed genes in specific cancer studies;
(b) in \lincs,
we find genes whose expression levels
can distinguish experiments in which specific drug treatments were administered from others.
In \msig,
we are given a feature-object
matrix with genes as the objects
and gene properties as the features.
Rather than a 0/1 membership indicator, each value of this feature-object matrix indicates the strength of the relationship between the gene and the set of genes that have been annotated with the gene property.
Matrix values are calculated using random walks~\cite{blatti2016characterizing} on a heterogeneous network built from prior knowledge found in gene annotation and protein homology databases (see Supplementary Note~\ref{app:msig_matrix} for more details).
The positive genes for each dataset in \msig
are the set of differentially expressed genes (DEGs)
in a specific cancer study downloaded from the Molecular Signatures Database (MSigDB)~\cite{subramanian2005gene}.
Each of our tests is an application of
\genviz to such a dataset,
reporting pairs of properties
that separate DEGs of that cancer study (the ``positive'' set)
from all other genes (the ``negative'' set).
In \lincs, the feature-object matrix
contains expression values for different genes
(features) across many drug treatment experiments
(objects) conducted on the MCF7 cell line by the LINCS L1000 project~\cite{subramanian2017next}.
The values of the matrix are gene expression values as reported by the ``level-4'' imputed z-scores measured in the L1000 project.
In each dataset, the positive object set includes
multiple experiments that used the same drug, at varying dosages and for varying durations. We applied \genviz on each dataset
so as to find the top pairs of genes (feature pairs)
whose expression values separate the \lincs experiments
relating to a single drug from all other \lincs experiments.
Note that the average number of positive objects
in any dataset is far fewer than the average number
of negative objects. To address this imbalance,
we adjust $\theta_{i,j}^{\ell}$ \eat{Equation~\ref{eqn:s_line}}
to a weighted sum form\eat{ as shown below}: $\theta_{i,j}^{\ell}= \sum_{o_k\in{\oo_-}}{\theta_{i,j}^{\ell,k}}+\frac{|\oo_-|}{|\oo_+|}\cdot\sum_{o_k\in{\oo_+}}{\theta_{i,j}^{\ell,k}}$.
%For each of the two applications, the numbers of features $m$, objects $N$, sample size $|\sss|$, and the examined feature pairs $\chi$ are shown in Table~\ref{tbl:dataset}. We tested on 10 different datasets for \msig and 40 different datasets for \lincs. For each positive object set, we treat all of the remaining objects as the negative set and list the average number of positive and negative objects in Table~\ref{tbl:dataset}.
\eat{
\begin{equation}\label{eqn:s_line_weighted}
\hspace{3mm} \theta_{i,j}^{\ell}= \sum_{o_k\in{\oo_-}}{\theta_{i,j}^{\ell,k}}+\frac{|\oo_-|}{|\oo_+|}\cdot\sum_{o_k\in{\oo_+}}{\theta_{i,j}^{\ell,k}} %{\sf Separability \hspace{1mm} Score\hspace{1mm} with\hspace{1mm} Line\hspace{1mm} \ell:}
\end{equation}
}
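As a minimal sketch of this weighting (assuming the per-object contributions $\theta_{i,j}^{\ell,k}$ have already been computed by the Rocchio-based measure and are simply passed in), the imbalance adjustment amounts to the following:
\begin{verbatim}
import numpy as np

def weighted_separability(theta_per_object, is_positive):
    # theta_per_object[k] holds the contribution of object o_k for one
    # candidate separating line; is_positive[k] marks o_k in O_+.
    theta = np.asarray(theta_per_object, dtype=float)
    pos = np.asarray(is_positive, dtype=bool)
    weight = (~pos).sum() / pos.sum()      # |O_-| / |O_+|
    return theta[~pos].sum() + weight * theta[pos].sum()
\end{verbatim}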
%\stitle{Metrics.} We focus on two major metrics in our test: separability score and time. Specifically, each algorithm outputs $k$ feature pairs with highest separability score, i.e., \topk feature pairs, and we call the smallest separability score among the \topk feature pairs as \topk separability score.
%, where $m$ and $n$ are constants depending on the size of the feature-object matrix, $|\sss|$ and $\chi$ are the predetermined sampled set size and limit on the number of feature pairs considered, and $|\cc|$, the number of feature pair candidates is variable and is generated during the algorithm
\stitle{Algorithms.}
We evaluated six combinations of our optimization modules\tr{\sinha{"modules" is somewhat misleading since it suggests separation during combination, while in reality I believe a combination of modules can weave together two ideas to the point that they are no longer "modular". Maybe add a clarification here that what we call modules really are "strategies" for optimization. Or replace "modules" with "strategies" throughout. This is under the assumption that I am right about modularity being lost during some combinations.}\agp{good point. my suggestion here would be to simply add a caveat early on in the optimization section, which i've done}}
from Section~\ref{sec:opt}, listed in Table~\ref{tbl:alg}.
For our baseline, we use the algorithm
with only the matrix pre-transformation optimization module (\trans).
The rightmost column of Table~\ref{tbl:alg} shows the varying time complexity
of the algorithms. Consider the \horiz as an example.
First, \trans takes $O(mn)$ time.
\tr{\cb{ note, scoring single features takes O(mn) time and uses only transformation optimization.}}
Then, \traversal requires sorting the feature set, which takes $O(m\log m)$\tr{\footnote{Since $m\log m < nm$, we omit the term $m\log m$ from the overall complexity.}} time. Finally, with \sampling over $\chi$ feature pairs,
the running time is reduced from $O(m^2n)$ to $O(\chi|\sss|+|\cc|n)$, where the first and second terms represent the time for candidate generation and candidate validation, respectively. Note that $|\cc|$ is typically orders of magnitude smaller than $\chi$ in \horiz, as discussed in Section~\ref{ssec:sampling}.
Combinations of modules beyond the six reported were always
inferior to one of those shown, in the sense that they returned the same \topk feature pairs with a longer running time.
We implemented the algorithms in C++, and conducted the evaluations on a machine with 16 CPUs and 61.9 GB of RAM.
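As a rough back-of-the-envelope illustration with the \msig values from Table~\ref{tbl:dataset} ($m\approx 2\times 10^4$, $n\approx 2.2\times 10^4$, $|\sss|=400$, $\chi=10^7$), and ignoring constant factors, the \baseline bound is $m^2n\approx 8.8\times 10^{12}$ basic operations, whereas the \horiz bound is $mn+\chi|\sss|+|\cc|n\approx 4.4\times 10^8+4\times 10^9+|\cc|\cdot 2.2\times 10^4$; even with $|\cc|$ in the millions (the candidate counts are data dependent and are discussed below), this is roughly two orders of magnitude smaller, consistent with the measured speedups reported below.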
%==================================================================
\subsection{Comparison of Different Algorithms}
\label{sec:exp_comp}
In this section, we first
justify that the Rocchio-based measure
is a good proxy for the best possible separability score
computed by a brute-force method.
Then we compare the performance of the
algorithms in terms of the
running time and the separability of their \topthousand feature pairs.
%\begin{figure}[h]
%\centering %%% not \center
%\vspace{-5mm}
%\subfigure[Best Feature Pair Comparison]{\label{fig:brute_rocchio_ratio}\includegraphics[width=.235\textwidth]{fig/rocchio_brute_ratio.pdf}}
%\subfigure[Measure Comparison]{\label{fig:brute_rocchio_score}\includegraphics[width=.235\textwidth]{fig/rocchio_brute_score.pdf}}
%\vspace{-5mm}
%\caption{Comparison of Brute Force-based and Rocchio-based separability score. (a) For each of 10 datasets, we display the ratio of the true separability score between the best feature pair chosen by brute force and by the Rocchio-based method. (b) For each dataset, we display the ratio of the true separability score and the Rocchio-based separability score for the best feature pair selected using Rocchio-based method.}
%\vspace{-5mm}
%\label{fig:brute_rocchio}
%\end{figure}
\tr{
\sinha{maybe naming algorithms A0, A1-A5.}
\sinha{describe table as tree?}
}
\stitle{Accuracy of Rocchio-based approximation.} As discussed in Section~\ref{sec:metric}, a brute-force approach must consider $O(n^2)$ lines in order to find the best separating line $\ell^* \leftarrow \arg_\ell \max \{\theta_{i,j}^{\ell}\}$, for a time complexity of $O(n^2m^2)$ over all feature pairs. An alternative is to use the Rocchio-based representative separating line $L$, dramatically reducing the $O(n^2)$ candidate lines to $O(1)$. Since the brute-force method becomes computationally infeasible for datasets with large $n$, we compared the Rocchio-based measure to the brute-force-based measure using
specially defined small object sets, $\widehat{\oo}$, for the 10 datasets in \msig. For this comparison, the up-regulated genes in each \msig test were defined as the set of positive objects and the down-regulated genes as the set of negative objects, resulting in an average of 295 objects per comparison. We call the brute-force-based separability score the {\em true} separability score, since it examines all possible separating lines. We first find the best feature pair using the Rocchio-based measure and the brute-force-based measure separately (potentially different feature pairs) and then calculate the ratio of the true separability scores of the Rocchio versus the brute-force best feature pairs. We observe that the Rocchio-based method picks a best feature pair whose true separability score is close to that of the best pair picked by brute force, with the ratio of the two scores being better than 0.94 in all ten datasets (Supplementary Figure~\ref{appF:exp_sep}(a)). Second, for the best feature pairs identified by the Rocchio-based method for the ten datasets, we calculate the ratio of the Rocchio-based separability score to the brute-force-based separability score, and find this ratio to be greater than 0.96 on average (Supplementary Figure~\ref{appF:exp_sep}(b)).
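For intuition, the following minimal sketch scores a feature pair with a single Rocchio-style representative line, here simplified to the perpendicular bisector of the two class centroids and to an unweighted count of correctly separated objects (the measure used in the paper additionally handles class imbalance and uses the exact line construction of Section~\ref{sec:metric}):
\begin{verbatim}
import numpy as np

def rocchio_pair_score(x_i, x_j, is_positive):
    # x_i, x_j: values of features f_i and f_j across all objects.
    pts = np.column_stack([x_i, x_j]).astype(float)
    pos = np.asarray(is_positive, dtype=bool)
    c_pos, c_neg = pts[pos].mean(axis=0), pts[~pos].mean(axis=0)
    w = c_pos - c_neg                      # normal of the O(1) line
    b = w.dot((c_pos + c_neg) / 2.0)       # line through the midpoint
    predicted = pts.dot(w) > b             # nearest-centroid side
    return (predicted == pos).mean()       # fraction separated correctly
\end{verbatim}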
\stitle{Running Time.} Figure~\ref{fig:time} depicts the running times of our different selected algorithms. Each plotted box corresponds to one algorithm, representing the distribution of running times for finding the \topk feature pairs (by Rocchio score) for all datasets.
%(i.e., 10 gene set collections for \msig and 40 drug treatments for \lincs).
\begin{figure}[h]
\vspace{-5mm}
\centering %%% not \center
\subfigure[\msig]{\label{fig:msig_time}\includegraphics[width=.235\textwidth]{fig/msig_time.eps}}
\subfigure[\lincs]{\label{fig:lincs_time}\includegraphics[width=.235\textwidth]{fig/lincs_time.eps}}
\vspace{-5mm}
\caption{Running Time Comparison. A boxplot for each algorithm is shown, with the median running time printed in matching color above it. For each boxplot, the whiskers are set to $1.5\times$ the interquartile range, outliers are shown as red dots, and the average is marked with a black star.}
\vspace{-5mm}
\label{fig:time}
\end{figure}
First, let us compare the median running times of the different algorithms. For \msig, the \baseline takes more than 2 hours, \earlyOrder takes less than 1 hour, \samp and \sampOpt take around 6 and 5 minutes respectively, while \horiz and \vertic both take only around 1 minute. Overall, the optimizations reduce the running time by over {\bf $180\times$}. We next examine the effect of the different modules on the running time. {\em (a) \earlyT:} the \earlyT module achieves a $2\times$ speed-up, with the average number of checked objects (genes) reduced from $20K$ to $5K$ (Supplementary Table~\ref{appT:time}); {\em (b) \sampling:} the \sampling module reduces the running time dramatically, with a $20\times$ reduction from \baseline to \sampOpt, since on average only $2M$ candidates are generated from all possible $200M$ feature pairs (Supplementary Table~\ref{appT:time});
{\em (c) \traversal:} the \traversal-based algorithms \horiz and \vertic achieve an additional $6\times$ speed-up over \sampOpt by terminating after considering only $\chi=10^7$ feature pairs, approximately $\frac{1}{20}$ of all possible feature pairs. This speed-up of \horiz and \vertic approaches the limit set by the feature ordering overhead (around $6s$) and the time for the \trans module (around $8s$) (Supplementary Table~\ref{appT:time}). The improvement over \sampOpt is not larger because the candidate generation phase of \sampOpt already removes from full evaluation a vast number of the feature pairs that would also be ignored by \horiz and \vertic (Supplementary Table~\ref{appT:time}).
Next, consider the log-scale interquartile range (IQR) of the running times for the different selected algorithms (Figure~\ref{fig:time}). We observe that \earlyOrder has the largest interquartile range, indicating that the \earlyT module, which tries to reduce the number of objects evaluated for each feature pair, is very dependent on the object set and feature values. As we discussed in Section~\ref{ssec:earlyT}, \earlyT has no guarantee on improving the running time. In fact, the algorithm can occasionally be worse than the \baseline as shown in Figure~\ref{fig:lincs_time} because \earlyT incurs additional overhead for checking the criteria for pruning and early termination when scanning the object list for each feature pair.
Similar results for \lincs are shown in Figure~\ref{fig:lincs_time} (see Supplementary Note~\ref{app:lincs}).
\eat{Similar results can be found in Figure~\ref{fig:lincs_time} for \lincs. Here, we observe over $400\times$ average decrease in the running time of finding the \topk feature pairs that separate the \lincs experiments of a single perturbagen from others. The greatest speedup comes with adding the \sampling module, where only $100K$ feature pair candidates, i.e., $|\cc|$, are checked out of all $250M$ feature pairs (\timetbl). For the selected algorithms with best running times, \horiz and \vertic, the pre-transformation and feature ordering overhead account for an average of $45+35=80s$ of the overall 104 and 94 median seconds respectively.}
% (b) the pruning of feature pairs in candidate generation phase is more powerful in \sampOpt, since on average the feature pairs considered in \horiz and \vertic is better than that in \sampOpt intuitively.
\stitle{Separability Quality.} In Supplementary Figure~\ref{appF:exp_sep}(a), we found the accuracy of the baseline method, which computes the Rocchio-based estimate of the \topk feature pairs, to be high. The \earlyT module is deterministic and produces the same \topk feature pairs as the baseline method, only with optimized computation. The \sampling module, on the other hand, is stochastic and can only provide an approximation of the \topk feature pair ranking. Finally, the \traversal module is heuristic and may output \topk feature pairs that are very different from the ranking produced by the \baseline algorithm.
Since \baseline returns the true Rocchio-based separability score of each feature pair, we measure the quality of each selected algorithm by counting the number of feature pairs it has in common with the \tophundred returned by the \baseline. Figure~\ref{fig:separability} shows this separability quality comparison.
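Concretely, this quality metric can be computed as below (a small illustrative helper, not taken from our implementation; feature pairs are treated as unordered):
\begin{verbatim}
def topk_overlap(baseline_pairs, algo_pairs, k=100):
    # Count how many of the algorithm's top-k pairs also appear in the
    # Baseline's top-k pairs (pairs treated as unordered sets).
    base = {frozenset(p) for p in baseline_pairs[:k]}
    algo = {frozenset(p) for p in algo_pairs[:k]}
    return len(base & algo)
\end{verbatim}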
\begin{figure}[h]
\centering %%% not \center
\vspace{-5mm}
\subfigure[\msig]{\label{fig:msig_separability}\includegraphics[width=.235\textwidth]{fig/msig_accuracy.eps}}
\subfigure[\lincs]{\label{fig:lincs_separability}\includegraphics[width=.235\textwidth]{fig/lincs_accuracy.eps}}
\vspace{-5mm}
\caption{Separability Quality Comparison. Boxplots in the style of Figure~\ref{fig:time}, comparing for each algorithm the number of its returned feature pairs that appear among the 100 best feature pairs of the \baseline.}
\vspace{-5mm}
\label{fig:separability}
\end{figure}
Let us first focus on \msig. \earlyOrder, as expected, has exactly the same separability quality as the \baseline. We also observe that the \samp and \sampOpt rankings are nearly identical to the \tophundred feature pairs of the \baseline, owing to the probabilistic guarantee described in Supplementary Note~\ref{app:sampling}. The \horiz and \vertic algorithms output a median of 92 and 48 feature pairs in common with the \baseline, respectively, because of the heuristic \traversal module. In the \msig results, \horiz performs much better than \vertic, with a much higher median and a much narrower interquartile range, as shown in Figure~\ref{fig:msig_separability}. This suggests, as we hypothesized, that interesting separating feature pairs exist outside the combinations of top-ranked single features considered by \vertic. We repeated this quality analysis for \lincs and found that the \sampling-based algorithms returned identical \tophundred feature pairs for all 40 datasets. The quality of the \traversal-based algorithms was again lower, though the gap between \horiz and \vertic was not as large as for \msig.
\stitle{Takeaways.} If accuracy is paramount, we recommend \sampOpt; if running time is paramount, we recommend \horiz.
%The p-value from Fisher's exact test represents the significance of deviation from the null hypothesis, i.e., the predicted label is independent of the real label. This significance test is frequently used to characterize gene sets \needcite{}.
%==================================================================
\subsection{Feature Pair vs. Single Feature}\label{sec:FPvSF}
In this section, we quantify the statistical significance of the top-ranking results of the selected algorithms. We show that we often find separating feature pairs that are more significant than the best single separating feature. To assess the significance of a separating feature or feature pair, we first calculate the p-value using the one-sided Fisher's exact test on a $2\times2$ contingency table. This contingency table is constructed with the rows being the true positive and negative labels, the columns being the predicted positive and negative labels, and the values being the number of objects in each table cell. Using the Fisher's exact test p-value, we show that feature pairs can provide better separability than single features, i.e., (a) feature pairs can have stronger p-values than the corresponding individual features even after appropriate multiple-hypothesis correction, and (b) there exist highly ranked pairs of features that rank poorly on their own as single features.
\stitle{Single Feature.} Finding the \topk single features is a special case of finding feature pairs, obtained by setting $i=j$. For each single feature obtained, we compute the p-value with Fisher's exact test, denoted $pval$. Next, we define the Bonferroni-corrected p-value as $corrected\_pval= pval\times m \times n$, since there are $m \times n$ possible hypotheses, one for each possible single feature and separating line. We say a selected feature is significant if its corrected p-value is no larger than the threshold $10^{-5}$, i.e., $-\log_{10} (corrected\_pval)\geq5$. In Figure~\ref{fig:SF_FP}, we plot the distribution of the corrected p-values of the \tophundred single features reported for each dataset in \msig and \lincs. We observe that 10 out of 10 datasets in \msig and 32 out of 40 datasets in \lincs have at least one significant single feature, and we focus on these datasets for further analysis. We observe very small corrected p-values, $\leq 10^{-50}$, in the left part of Figures~\ref{fig:msig_singleF} and \ref{fig:lincs_singleF}, indicating that for several datasets, single features are sufficient to separate the object classes well.
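As a minimal sketch of this computation (using SciPy's \texttt{fisher\_exact} and assuming boolean vectors of true and predicted labels over all objects):
\begin{verbatim}
import numpy as np
from scipy.stats import fisher_exact

def corrected_single_feature_pval(true_pos, pred_pos, m, n):
    t = np.asarray(true_pos, dtype=bool)
    p = np.asarray(pred_pos, dtype=bool)
    # 2x2 contingency table: rows = true labels, columns = predictions.
    table = [[np.sum(t & p),  np.sum(t & ~p)],
             [np.sum(~t & p), np.sum(~t & ~p)]]
    _, pval = fisher_exact(table, alternative="greater")  # one-sided
    return pval * m * n   # Bonferroni correction over m*n hypotheses
\end{verbatim}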
\begin{figure}[h]
\centering %%% not \center
\vspace{-5mm}
\subfigure[\msig]{\label{fig:msig_singleF}\includegraphics[width=.16\textwidth]{fig/msig_SF_FP.eps}}
\subfigure[\lincs]{\label{fig:lincs_singleF}\includegraphics[width=.31\textwidth]{fig/lincs_SF_FP.eps}}
\vspace{-5mm}
\caption{Corrected P-value Distributions: Single Features vs. Feature Pairs. For each dataset (x-axis), we show the significance ($-\log_{10} (corrected\_pval)$) of the \tophundred best single features (grey dots) and feature pairs (blue dots) for the (a) \msig and (b) \lincs datasets. Datasets are ordered by their best corrected single-feature p-value; datasets where no single feature has a corrected p-value better than $10^{-5}$ are discarded.}
\vspace{-5mm}
\label{fig:SF_FP}
\end{figure}
%
%This shows the effectiveness of our Rocchio-based
\stitle{Feature Pair.} We next build the contingency tables and calculate the p-values for the \topk feature pairs. To correct for the $m^2$ possible feature pairs and the $n^2$ possible ways to choose the separating line for each feature pair, we apply a Bonferroni p-value correction to produce $corrected\_pval = pval\times m^2 \times n^2$. We plot the distribution of the corrected p-values for the \topk feature pairs in Figure~\ref{fig:SF_FP}. Once again, the threshold for defining a significant feature pair is set to $10^{-5}$. We find that 10 out of 10 datasets in \msig and 27 out of the 32 selected datasets in \lincs have at least one significant feature pair by this metric. Visual comparison of the \tophundred single features to the \tophundred feature pairs (Figure~\ref{fig:SF_FP}) per dataset reveals several datasets where the
corrected p-values of the feature pairs are more significant than those of the best single features, even after accounting for the larger search space. Admittedly, this is not always the case: for five \lincs datasets, no feature pair was found to be significant at $corrected\_pval \leq 10^{-5}$, while at least one single feature did meet this threshold. Overall, this analysis suggests that rapid discovery of top feature pairs may identify more significant patterns in a given dataset than a traditional single-feature analysis does.
%, we observe that the corrected p-values of the feature pairs are often more significant and more tightly \sinha{this is not necessarily convincing. Could be fp's that share a feature.} distributed than the single feature corrected p-values. This indicates
%that the best feature pairs have better separability quality than the top ranking single features even after accounting for the larger search space. \sinha{What about the \lincs datasets for which no significant feature pair comes out?}
In the following, we further illustrate that feature pairs can also provide better and newer insights compared to single features.
%\begin{figure}[h]
%\centering %%% not \center
%\vspace{-1mm}
%\subfigure[\msig]{\label{fig:msig_histogram_diff}\includegraphics[width=.235\textwidth]{fig/histogram_msig_diff_pval.pdf}}
%\subfigure[\lincs]{\label{fig:lincs_histogram_diff}\includegraphics[width=.235\textwidth]{fig/histogram_lincs_diff_pval.pdf}}
%\vspace{-5mm}
%\caption{Histogram of $improv\_quot$. For the \toptwenty feature pairs from all runs from the (a) \msig and (b) \lincs datasets, distribution of the improvement of the feature pair significance over the corresponding single feature significance. The red line shows the significance threshold of 5.}
%\vspace{-3mm}
%\label{fig:histogram_diff}
%\end{figure}
%Remember that the correction factors for single feature and feature pair are different: $mn$ vs. $m^2n^2$.
\stitle{Improvement from Single Feature to Feature Pair.} Having computed the corrected p-value for each single feature and feature pair in the \tophundred for our datasets, we now examine the improvement of each feature pair over its two corresponding single features in terms of p-value. For each feature pair $(f_i,f_j)$, we define the improvement quotient as the ratio between the corrected p-value of $(f_i,f_j)$ and the better of the corrected p-values of $f_i$ and $f_j$, i.e., $improv\_quot = \frac{corrected\_pval(f_i,f_j)}{\min(corrected\_pval(f_i),corrected\_pval(f_j))}$. We examined the $improv\_quot$ only for the \toptwenty feature pairs for each of the 10 runs in \msig and 32 runs in \lincs. We found that, on average across these datasets, 9.3 of the \toptwenty feature pairs in \msig and 8 of the \toptwenty feature pairs in \lincs are more significant than their corresponding single features ($-\log_{10} (improv\_quot) > 5$). The distribution of the $improv\_quot$ is plotted in Supplementary Figure~\ref{appF_exp_hist}. Overall, these histograms show that some feature pairs improve on single features in terms of separability significance. Next, we explore these improved feature pairs more carefully, commenting on their redundancy, reliability, and relevance.
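A minimal sketch of this comparison, given the raw Fisher's exact test p-values and the correction factors defined above:
\begin{verbatim}
def improvement_quotient(pair_pval, fi_pval, fj_pval, m, n):
    # Corrected p-values as defined in the text: m^2 * n^2 hypotheses
    # for feature pairs, m * n hypotheses for single features.
    pair_corrected = pair_pval * m**2 * n**2
    best_single_corrected = min(fi_pval * m * n, fj_pval * m * n)
    # -log10 of this quotient being positive means the pair improves
    # on its constituent single features.
    return pair_corrected / best_single_corrected
\end{verbatim}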
\stitle{New Insights from Feature Pairs.} To assess the quality of the top-ranking feature pairs, we focused on the \lincs dataset, where the objects are experimental treatments on the MCF7 breast cancer cell line with the same {\em drug} and the features are expression values for different {\em genes}. For the evaluations above, we used object sets for the 40 drugs with the largest number of LINCS experiments. For the following analysis, we refine this list to commonly used drugs with at least 60 LINCS experiments on the MCF7 cell line. These nine drugs are vorinostat, trichostatin, estradiol, tamoxifen, doxorubicin, gemcitabine, daunorubicin, idarubicin, and pravastatin. For each chosen drug, we ran the \sampOpt algorithm of \genviz to rank the \topthousand feature (gene) pairs for separating the LINCS experiments of the drug from all other MCF7 experiments.
For all drugs except pravastatin, all of the \topthousand ranked feature pairs were found to be significant ($-\log_{10} (corrected\_pval) > 5$). The average log significance of the \topthousand feature pairs correlated with the number of experiments ($r=0.91$) in the drug's positive object set (Table~\ref{tbl:selected_fps})\eat{, as expected}. As described in Section~\ref{sec:FPvSF}, we are especially interested in feature pairs whose corrected p-value is better than the corrected p-values of their corresponding single features ($-\log_{10} (improv\_quot) > 0$). Among the \topthousand pairs of these evaluation drug sets, we found 1070 such ``improved'' feature pairs with greater separability than their constituent single features. One drug, trichostatin, had especially strong single features and showed no feature pairs that significantly improved on them. The remaining seven drugs, however, benefited from the feature-pair analysis, yielding between 9 (tamoxifen) and 369 (doxorubicin) improved feature pairs (Table~\ref{tbl:selected_fps}).
\begin{table}[t!]
\centering
\small
\begin{tabular}{|l|c|c|c|c|}
\hline
Drug & NumExprs & Avg Signif & Top1000 Signif & Top1000 Improved\\
\hline
vorinostat & 904 & 235.5 & 1000 & 287\\
\hline
trichostatin & 689 & 277.1 & 1000 & 0\\
\hline
estradiol & 325 & 166.8 & 1000 & 203\\
\hline
tamoxifen & 122 & 105.8 & 1000 & 9\\
\hline