\section{Discussion}
\subsection{Applications of GANs for health data and innovation}
Overall, the published \gls{gan} algorithms for \gls{ohd} provided equivalent or superior performance versus the statistical modeling-based methods against which they were benchmarked. Importantly, their capabilities are highly relevant to the medical field: domain translation for unlabeled data, conditional sampling of minority classes, data augmentation, learning from partially labeled or unlabeled data, data imputation, and forward simulation of patient profiles. While some of these claims are overoptimistic or lack convincing evidence, they paint an encouraging picture for the value of synthetic \gls{ohd} and the transformative effect it could have on healthcare initiatives and scientific progress.\par
The ongoing Covid-19 pandemic has brought unprecedented levels of cooperation between scientists from around the world. The urgency of obtaining data has highlighted, under difficult circumstances, the need for novel ways of sharing and generating data \cite{bandara_improving_2020, Cosgriff_2020}. Global concerted efforts were highly successful, but also required adaptation, with some proposing exemptions from the GDPR \cite{mclennan_covid-19_2020}. Data sharing was limited to aggregate counts rather than patient-level records, limiting the depth of analyses. \par
At the beginning of an epidemic, the scarcity of data can be compensated for with synthetic data. Building generative statistical methods in such conditions is a difficult task \cite{Latif2020-ol}. As additional data becomes available to fine-tune the model, the number of features and the complexity of the model grow as well. This was attempted by Synthea \cite{Walonoski_2017} in the early months of the pandemic, with modest results; nonetheless, they were used in many online challenges, hackathons, and conferences.
\begin{quote}
The authors state that if one takes ``[...] Field Marshall Moltke’s notion of “no plan survives contact with the enemy” as true and expands the scope to modeling and simulation, then we might say that “no model survives contact with reality.”'' \cite{walonoski_synthea_2020}. We would argue that \glspl{gan} grow stronger in contact with reality.
\end{quote}
Generative models refine their representation as more data is provided and could be combined with current methods of forecasting. When the amount of ground truth data is small, semi-supervised learning simulations can improve the performance of predictors \cite{dahmen_synsys_2019}. Domain translation, as demonstrated in \gls{radialgan}, would be exceptionally useful for combining datasets from disparate localities. In a recent publication, two different data augmentation techniques provided a significant increase in sensitivity and specificity for the detection of COVID-19 infections, one of which produced \gls{sd} with a \gls{gan} \cite{Sedik2020-tx}.
\subsection{Challenges posed by OHD}
The challenges posed by health data for \glspl{gan} are evident: a number of recurrent factors influence the outcome of efforts to develop them. These problems are not limited to generative algorithms, but apply to \gls{ml} in general. For generative models, multi-modality caused the most trouble in achieving a stable training procedure. At the outset, preventing \gls{mode-collapse} attracted the most research effort, along with handling combinations of categorical and real-valued features. A rapid succession of efforts aimed at improving \gls{medgan} by incorporating the latest machine learning techniques showed continued improvements. However, taken as a whole, the efforts were haphazard in their methods and metrics, often yielding unsurprising results, considering the techniques were already known to improve performance across a broad range of applications. This is expected in a new field of application, and more concerted efforts to systematically approach the problems should progressively form.\par
\subsubsection{Feature engineering}
We observed that the majority of methods included in the review made use of heavily transformed representations of patient records. This is in part due to the inconvenient properties of health data, such as missingness. However, it is somewhat apparent that the main motive is to accommodate existing algorithms. Along with demographic variables, \gls{ohd} mostly takes the form of triples composed of (1) a timestamp, (2) a medical concept and (3) the recorded value. The number of triples differs for each patient, the intervals between them are irregular, and the number of possible values in a dimension can be huge. Moreover, there are generally multiple episodes of care, each with a different cause. These properties are not typically considered practical for machine learning. \par
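As a minimal, hypothetical illustration (the codes and layout below are ours, not taken from any reviewed dataset), a single patient's record in this triple form might look as follows:
\begin{verbatim}
# Hypothetical OHD record stored as (timestamp, concept, value) triples.
from datetime import datetime

patient_record = [
    (datetime(2019, 3, 2, 9, 15),  "ICD10:E11.9",  1),         # type 2 diabetes code
    (datetime(2019, 3, 2, 9, 40),  "LOINC:4548-4", 7.8),       # HbA1c result (%)
    (datetime(2019, 7, 21, 14, 5), "ATC:A10BA02",  "500 mg"),  # metformin prescription
]
# Another patient may have a different number of triples, different
# concepts, and irregular gaps between timestamps, which makes a
# fixed-length tabular representation awkward.
\end{verbatim}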
To varying degrees, depending on the transformations, information is lost or bias is introduced. For example, when data are reduced by aggregation to a one-hot encoding, the complex relationships found in medical data are, for the most part, eliminated. Similarly, information is lost when forcing real-valued time-series into a regular representation by truncation, padding, binning or imputation. Moreover, it is highly unlikely that the data is missing at random, introducing the potential for bias when a large part of the real data is rejected on this basis, or when medical codes are truncated to their parent generalizations \cite{Zhang2020, Choi2017-nt}.
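The following sketch (again with illustrative codes) shows how aggregating a longitudinal record into a binary code vector discards ordering, visit boundaries and the recorded values themselves:
\begin{verbatim}
# Sketch: aggregating a longitudinal record into a binary vector over the
# code vocabulary. Codes are illustrative.
vocabulary = ["ICD10:E11.9", "LOINC:4548-4", "ATC:A10BA02", "ICD10:I10"]
record = [("2019-03-02", "ICD10:E11.9", 1),
          ("2019-03-02", "LOINC:4548-4", 7.8),
          ("2019-07-21", "ATC:A10BA02", "500 mg")]

def to_binary_vector(record, vocab):
    present = {concept for _, concept, _ in record}
    return [1 if code in present else 0 for code in vocab]

print(to_binary_vector(record, vocabulary))
# -> [1, 1, 1, 0]: two visits four months apart collapse into one static
#    vector, and the HbA1c value of 7.8 disappears entirely.
\end{verbatim}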
\subsection{From innovation to adoption: Evaluation metrics and benchmarking}
Interesting innovations were demonstrated, and progress has good momentum. Their application and adoption will undoubtedly be more sluggish, as has been the case with predictive \gls{ml}. For good reason, the bar is set high in demonstrating consistent outcomes and ensuring patient safety. While the problem of \gls{mode-collapse} has been alleviated, evidence has yet to be provided that the finer details of the distribution are estimated with sufficient granularity to produce realistic patient profiles. Consistent behavior and reproducible results will be required before any significant adoption can be expected. With regard to evaluation, it is evident that the choice of optimal metrics and indicators is still being explored, and the efforts are far from consistent or systematic. As an example, competing methods are often compared with different metrics or with contradictory results in different datasets \cite{baowaly_2019_IEEE,baowaly_2019_jamia,Camino2018-re,Choi2017-nt,Zhang2020}. Overall, none of the evaluation metrics addressed the concept of realism in synthetic data.\par
\subsubsection{Qualitative realism}
Qualitative evaluation, in its current form, provides little evidence: to medical experts, these representations are meaningless. As such, the results of qualitative evaluation often state that synthetic data is indistinguishable from the real data \cite{Choi2017-nt,Wang_2019}, and it is doubtful that they could in fact be distinguished. \citeauthor{esteban2017real} found that participants avoided the median score and were not confident enough to choose either extreme \cite{esteban2017real}.\par
\input{concepts/vis}%
%
In their evaluation of \gls{medgan}, \cite{yale:hal-02160496} argue that the positive resemblance of plotted feature distributions is due to the fact that the model's architecture tends to favor reproducing the means and probabilities of each diagnosis column. They note that the synthetic data contains samples with an unusually high number of codes, which is not apparent in the plots. Their hypothesis is that the algorithm uses these samples to discharge the rare medical codes with weak correlation, in an effort to balance the distributions. Nonetheless, they stated that comparing \gls{pca} plots of real and synthetic data was insightful to get an impression of their behavior \cite{Yale_2020}. If visual inspection is to be used, it should be done systematically according to established frameworks (see Panel \ref{pan:visualisation}).\par
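As a rough sketch of such an inspection (placeholder arrays stand in for the real and synthetic code matrices; this is not the cited authors' code), both datasets can be projected into a shared \gls{pca} space fitted on the real data:
\begin{verbatim}
# Sketch: project real and synthetic code matrices into a shared PCA space
# for visual comparison. The random arrays below are placeholders.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
real  = rng.integers(0, 2, size=(500, 100))   # stand-in for real one-hot data
synth = rng.integers(0, 2, size=(500, 100))   # stand-in for GAN output

pca = PCA(n_components=2).fit(real)           # fit on the real data only
real_2d, synth_2d = pca.transform(real), pca.transform(synth)

plt.scatter(*real_2d.T, s=5, alpha=0.4, label="real")
plt.scatter(*synth_2d.T, s=5, alpha=0.4, label="synthetic")
plt.legend()
plt.title("PCA projection of real vs. synthetic records")
plt.show()
\end{verbatim}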
\subsubsection{Quantitative fitness}
Reproducing aggregate statistical properties is rather unconvincing evidence that a model has learned to reproduce the complexity of patient health trajectories. In some cases the statistical metrics may even be contradictory, such as when the ranking of medical code frequencies in the data is wrong, yet augmentation leads to improved performance \cite{Che_2017}. \citeauthor{Choi2017-nt} found that although the synthetic sample seemed statistically sound, it contained gross errors such as gender code mismatches, and suggested the use of domain-specific heuristics \cite{Choi2017-nt}. \gls{heterogan} was an encouraging step in this direction, but does not represent a solution. Conditional training methods have also led to improvements, for example when labels corresponding to sub-populations or classes are used to condition the generative process. \citeauthor{Zhang2020} showed that conditioned training with categorical labels, in this case age ranges, improves utility for small datasets \cite{Zhang2020}. \par
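A minimal sketch of the kind of domain-specific heuristic this suggests (the sex-specific codes and record layout are illustrative, not those used by the cited authors) could flag sex-diagnosis mismatches in a synthetic sample:
\begin{verbatim}
# Sketch of a domain-specific consistency heuristic: flag sex-specific
# diagnosis codes that contradict the recorded sex. Codes are illustrative.
MALE_ONLY   = {"ICD10:C61"}   # prostate cancer
FEMALE_ONLY = {"ICD10:O80"}   # uncomplicated delivery

def sex_code_mismatches(patients):
    """Return the indices of synthetic patients violating the heuristic."""
    flagged = []
    for i, p in enumerate(patients):
        codes = set(p["codes"])
        if p["sex"] == "M" and codes & FEMALE_ONLY:
            flagged.append(i)
        if p["sex"] == "F" and codes & MALE_ONLY:
            flagged.append(i)
    return flagged

synthetic = [{"sex": "M", "codes": ["ICD10:O80"]},
             {"sex": "F", "codes": ["ICD10:I10"]}]
print(sex_code_mismatches(synthetic))   # -> [0]
\end{verbatim}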
Utility-based metrics do, overall, provide a more solid evaluation of data quality. However, they only confirm the value of the data within a narrow context. They are indicative of realism only insofar as a patient's state is indicative of a medical outcome. Moreover, they do not provide any insight into the validity of the relations found in a patient record or its overall consistency. While such consideration was found only sparingly in the publications, extensive research is available on the subject of medical information representation. The complexity of health data and its variety make it a considerable, but captivating, challenge.\par
\subsubsection{Constraints}
As described in Section \ref{noauto}, \gls{heterogan} introduces a constraint-based loss. Based on the distribution of individual features and utility-based metrics, the authors argue that the bias intrinsic to their method has not led to undesirable bias or side-effects in other aspects of the learned distribution. However, the constraints were strict and would be hard to scale. The idea of incorporating knowledge-based constraints into the otherwise naive \gls{gan} is in fact gaining attention (see Section \ref{sec:knowledge}). \par
\section{Suggestions of requirements for OHD-GAN development}
\subsection{Models of appropriate scope and equivalent degree of evaluation}\label{sec:basic}
Overall, evaluation methods were superficial or uni-dimensional relative to the scope of the task. As previously discussed, finding convincing and robust evaluation metrics for synthetic health data is an open issue. Weak metrics become a prominent issue when the learning task is broad, loosely defined, constructed for the sole purpose of evaluation, or when the scope of application is too large. The difficulty of explaining or validating the realism of data representing a patient, which is often longitudinal and whose factors contribute differentially to disease characterization, makes the assessment of synthetic data ambiguous and demands stronger evidence for claims.\par
\input{boxes/scope_eval}%
%
\subsection{Data-driven architecture}\label{sec:archi}
Deep architectures are based on the intuition that multiple layers of nonlinear functions are needed to learn complicated high-level abstractions \cite{Bengio_2009}. \glspl{cnn} capture the patterns of an image in a hierarchical fashion, such that each layer in sequence forms a representation of the data at a higher level of abstraction. This type of data-oriented architecture has led to impressive performance for \glspl{cnn} on image data.\par For health data, we then find ourselves with the predicament that, in order to make good synthetic data, one needs extensive knowledge of both the source data and the application. We face again the problem that the creation of \gls{sd} needs to be done by people who already have access to the real data and have extensive data science knowledge, a rarity in hospitals.
Health data presents a different but analogous multi-level structure. As an illustration, a predictive algorithm with a hierarchical structure was shown to form representations of \gls{ehr} data that capture the sequential order of visits and the co-occurrence of codes within a visit. It led to improved predictor performance, and also allowed for meaningful interpretation of the model \cite{choi2016multi}. Similarly, models of time-series found in \gls{ehr} data, such as \glspl{eeg} and \glspl{ecg}, based on a continuous time representation\footnote{Those interested in \gls{gan} for wavelike data will find many examples \cite{Delaney2019,Golany2019,Ye2019,Wang2019d,Singh2020,Aznan2019,Hartmann2018}.}, have shown improved accuracy over discrete time representations \cite{rubanova2019latent,de2019gru}. Creative adaptations of the data for existing architectures have provided surprising results. For example, \gls{ohd} input into a \gls{cnn} was transformed into images (bitmaps) in which the pixels encoded the information \cite{Fukae2020}.
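A toy sketch of such an encoding (our own layout, not the exact transformation used by the cited authors) lays a patient's visits out on a visit-by-code grid that a \gls{cnn} could consume as an image:
\begin{verbatim}
# Sketch: encode a patient's record as a bitmap, rows = visits in
# chronological order, columns = codes in a fixed vocabulary.
import numpy as np

vocab  = ["E11.9", "I10", "4548-4", "A10BA02"]
visits = [["E11.9", "4548-4"],            # visit 1
          ["E11.9", "A10BA02", "I10"]]    # visit 2

bitmap = np.zeros((len(visits), len(vocab)), dtype=np.uint8)
for row, visit in enumerate(visits):
    for code in visit:
        bitmap[row, vocab.index(code)] = 255   # "pixel on"

print(bitmap)   # a 2 x 4 image-like array a CNN could ingest
\end{verbatim}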
\input{boxes/data_leads}%
%
\subsection{Evolving the patients}
As we have seen, \glspl{ohd-gan} are not used exclusively to produce ``fake'' patients, but also to be representative of a particular patient. Common examples are translating between patient states, or producing counterfactuals. It would be interesting to see if combining \glspl{gan} with what is known as evolutionary computing could produce valuable results. We can think of a \gls{gan} transforming the patient data to an alternative state, after which evolutionary algorithms would optimize this new state in a continuous fashion, as new data about the patient becomes available. Indeed, a quick search confirms that the combination can have impressive results, whether in optimizing the evolutionary process \cite{He2020-zm}, exploring the latent space \cite{Schrum2020-vl}, or expanding the information received by the discriminator \cite{Mu2020-id}.
\input{boxes/novel-combinations}%
%
\subsection{Forcing, disciplining or guiding \label{sec:knowledge}}
To build statistical models, we define rules and relations that they are forced to optimize when learning. \glspl{gan}, on the other hand, are given free rein in a space of possibilities and are disciplined for exploring certain areas, but are provided no explanation. \par
We build enormous models and let them fight back and forth in a min-max battle that goes on forever, denying them our valuable knowledge. The idea of introducing human knowledge in the otherwise naive training process has gained some attention.\par
Posterior regularization is usually used to impose constraints on probabilistic models, but \glspl{gan} lack the necessary Bayesian component. In the student-teacher model, a larger model is used to train a smaller one, a process known as knowledge distillation. Such models are developed for many applications, such as compression, improving accuracy and accelerating training \cite{abbasi2019odeling}.
In the field of \gls{rl}, a mathematical correspondence between \gls{ps} and \gls{rl} led to the probabilistic \gls{pr} framework of \gls{irl}, which seeks to learn a reward function from expert demonstrations. This was followed by approaches capable of learning both the reward function and the policy \cite{finn2016guided,fu2018learning}. \citeauthor{Hu2018} then demonstrated a correspondence between \gls{rl} and \glspl{gan}. This allowed them to develop a \gls{gan} with a constraint-based learning objective \cite{Hu2018}.\par
The constraints, seen as a reward function, can be learned by the model through an algorithm involving maximum entropy. This means the known constraints can be input directly or partially, and left to be learned automatically. The algorithm consistently improved the speed and quality of training, and accuracy on a few tasks. The approach is exemplified on an image translation task where images of people are transformed from one pose (e.g., looking forward) to another (e.g., head turned left). The constraint is provided by a pre-trained auxiliary classifier that assigns each pixel to a body part, and is adapted jointly with the \gls{gan}. The \gls{gan} is rewarded for preserving the mapping in the output image. A performance comparison against unconstrained and fixed-constraint models results in similar training loss and evaluation metrics. However, when evaluated by humans, the novel approach surpasses the other models on 77\% of test cases. \par
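As a rough, simplified sketch of the general idea (a fixed rather than jointly adapted constraint, placeholder module names; not the cited algorithm), a constraint network can contribute a penalty term to the usual adversarial generator loss:
\begin{verbatim}
# Simplified PyTorch sketch: add a constraint penalty from a fixed,
# pre-trained auxiliary network to the standard generator loss.
# `generator`, `discriminator` and `constraint` are placeholder modules.
import torch
import torch.nn.functional as F

def generator_step(generator, discriminator, constraint, opt_g,
                   batch_size, latent_dim, lam=1.0):
    z = torch.randn(batch_size, latent_dim)
    fake = generator(z)

    # Usual adversarial loss: try to fool the discriminator.
    adv_loss = F.binary_cross_entropy_with_logits(
        discriminator(fake), torch.ones(batch_size, 1))
    # Penalty: how strongly the sample violates the known constraint.
    constraint_loss = constraint(fake).mean()

    loss = adv_loss + lam * constraint_loss
    opt_g.zero_grad()
    loss.backward()
    opt_g.step()
    return loss.item()
\end{verbatim}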
\input{boxes/knowledge}%
%%
\subsection{Interpretability\label{sec:latent-space}}
Even though a few authors attempted to understand the behavior of their models, overall the subject was left largely unmentioned. It is imperative that future experimentation and publication give equal importance to the interpretation of models and to establishing the means to do so. In the healthcare domain, black-box machine learning models find little adoption, and the validity of synthetic data is most often met with dismissal. The task is not impossible; the same holds for any other opaque system, and indeed for experimental sciences in general. The simplest approach is to provide input, observe the output, reformulate our hypotheses, and modify the input accordingly. If needed, the sequence is iterated to convergence. Fortunately, in this case the internal workings are entirely available, tipping the balance from brute-force towards knowledge-driven exploration of the system. In addition, we believe ``qualitative'' evaluation by visual inspection has much greater potential, still to be defined. What better way to define interpretation than a medical professional visually decoding the hidden relations in the data? \par
In theory, the latent space is a lower-dimensional representation of basic concepts that should be directly interpretable. However, in practice these concepts are entangled over multiple nodes. In a preliminary but encouraging proof-of-concept, \cite{lui2019-latent} explore how perturbations can reveal patterns in a \gls{beta-vae} trained to capture brain structure in mice. By generating a collection of images from a dense interpolation of the latent space, they were able to examine the projective field of latent variables onto the pixels. They found zones of high variance that corresponded to biologically relevant areas. Reversing the experiment, they masked areas of the images and found that many latent factors were not activated by all regions of interest and had localized receptive fields, whereas complex, highly connected regions such as the hippocampus activated almost all latent factors. Curiously, the projective and receptive fields may not be aligned. Numerous other publications have shown that latent spaces capture meaningful properties and structure of the data, reducing complexity to a level that lends itself to interpretation \cite{Way2020, Koumakis2020}. In one instance involving transcription factor micro-array data, a close one-to-one mapping could be obtained from the last hidden layer, in addition to the higher-level layers that related to biological processes in a hierarchical fashion \cite{chen2016-latentyeast}. Pushing the boundaries further, correlating the output features of a \gls{gan} with the latent space dimensions allowed controllable semantic manipulation of the generated data \cite{Wang2020latent,Ding2020latent,Li2020latent}. Finally, an information-theoretic \gls{gan} simplified interpretation greatly by forcing the latent nodes to learn disentangled representations: in addition to the adversarial loss, \gls{info-gan} also maximizes the mutual information between a small number of latent nodes and the generated output. The result is highly interpretable nodes that represent distinct concepts, which can be easily influenced or, in some cases, smoothly interpolated between features \cite{Chen2016c}.\par
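A rough sketch of this kind of probing (a toy linear ``generator'' and placeholder dimensions, not the cited experiments) sweeps one latent dimension while holding the others fixed and inspects which output features respond:
\begin{verbatim}
# Sketch: probe a trained generator by sweeping a single latent dimension
# and observing how the generated features respond.
import numpy as np

def latent_sweep(generator, base_z, dim, values):
    """Generate outputs while latent dimension `dim` varies."""
    outputs = []
    for v in values:
        z = base_z.copy()
        z[dim] = v
        outputs.append(generator(z))
    return np.stack(outputs)

# Toy linear "generator" so the sketch runs end to end.
rng = np.random.default_rng(1)
W = rng.normal(size=(8, 20))                 # 8 latent dims -> 20 features
generator = lambda z: z @ W
sweep = latent_sweep(generator, np.zeros(8), dim=3,
                     values=np.linspace(-3, 3, 7))
print(np.abs(sweep).mean(axis=0).round(2))   # features most sensitive to dim 3
\end{verbatim}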
\subsection{Benchmarking, a priority}
It slowly became obvious, through the succession of experiments, that there is a glaring problem with the standardization of evaluation. New algorithms and applications are being demonstrated at an increasing rate. In contrast, standardized benchmarks, procedures to transform the data, and source code have remained scarce; one can hardly compare the models objectively or nominate the best performers. Commendably, \citeauthor{Camino2020bench} are the first to bring attention to this issue in a position paper that provides quantitative arguments: notably, the myriad ways in which commonly used datasets are preprocessed, metrics that are not comparable, and hyperparameter sweeps for which neither the transformation code nor the optimal values are released. This lack of effort towards reproducibility will only reduce the credibility of the field. On a positive note, we have compiled a list of the repositories which were made open-source in Table \ref{tab:5:sourcecode}, and a list of the common dataset links can be found in Table \ref{tab:5:sourcecode}.\par
In this regard, the replication of medical studies with synthetic data by \citeauthor{Yale_2020} substantiates the value of \gls{sd} for exploratory data analysis, reproducibility on restricted data and, more generally, education in scientific training \cite{Reiner_Benaim2020-lx}. Reproducing medical or clinical studies will be necessary to gain mainstream adoption of \gls{gan}-produced \gls{sd} and dispel the scepticism it is generally met with. The medical domain is known for its slow pace in adopting new technologies, and predictive \gls{ml} is still far from meeting its full implementation potential \cite{Qayyum2020-ir}. Medical professionals care foremost about the well-being of their patients and will only consider results obtained from synthetic data if they have the assurance that they are valid \cite{Rankin2020}. A remarkable resource for benchmarking is the set of clinical prediction benchmarks defined on the \gls{mimic} data by \citeauthor{harutyunyan_multitask_2019}. The tasks are clearly defined, and the source code to process the data and run the algorithms is available \cite{harutyunyan_multitask_2019}. We suggest comparing the accuracy of the predictive algorithms applied to the original data versus the synthetic data to be evaluated. Beyond this, concerted efforts and informal guidelines that can be agreed upon should be revisited on a regular schedule. We fully support the idea of organized challenges and hackathons proposed by \cite{Camino2020bench} and suggest a progressive approach to realizing it.\par
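A minimal sketch of the suggested comparison (with randomly generated placeholder data rather than \gls{mimic} or any benchmark task) trains the same predictor once on real and once on synthetic data and evaluates both on held-out real data:
\begin{verbatim}
# Sketch: "train on synthetic, test on real" versus "train on real, test on
# real". The arrays below are random placeholders, not benchmark data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_real = rng.normal(size=(1000, 20)); y_real = (X_real[:, 0] > 0).astype(int)
X_synth = rng.normal(size=(1000, 20)); y_synth = (X_synth[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_real, y_real, random_state=0)

auc_real = roc_auc_score(
    y_te, LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
auc_synth = roc_auc_score(
    y_te, LogisticRegression().fit(X_synth, y_synth).predict_proba(X_te)[:, 1])
print(f"real-trained AUC: {auc_real:.3f}  synthetic-trained AUC: {auc_synth:.3f}")
\end{verbatim}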
\subsubsection{Ultra-open source, collaborative, publishing communities}
In a successful and educational experiment on collaborative writing and crowd-sourcing, an article was entirely written in an open-source GitHub repository. Anyone willing to add their knowledge to the publication was welcome to do so, reaching 30+ authors in 20 countries. Every change proposal is submitted for inclusion as a Pull Request, for which 2--3 approvals are necessary. Automated deployment procedures (GitHub has since released Actions, requiring minimal coding) took care of verifying compliance with guidelines, citation management, DOI registration, and compilation of LaTeX or Markdown. Within minutes, a revised document is released, making the publication a continuously up-to-date source of knowledge that can be augmented in the web version with interactive code-books and figures.\par
Issues can be discussed in the appropriate channels, but most importantly the nature of GitHub ensures attribution of work done, down to a single character. The authors also implemented immutable backups on the blockchain. Since then, distributed storage and computation blockchains have reached maturity and could store models, training artefacts, and competition data at trivial cost. As an alternative, the \href{http://bit.ly/WandB-ML}{Weights and Biases (WandB)} platform is a fitting environment, worth a look even for individuals. Traditional publishers have long been touting a makeover of the publication system, but the changes are slow and trivial, whereas decentralized, person-to-person systems have been transforming whole sectors faster than ever.\\
\section{Directions for future research}
\subsection{Building a patient model}
The ultimate goal for generative models of \gls{ohd} must be to develop an algorithm capable of learning an all-encompassing patient model. It would then be possible to generate full \gls{ehr} records on demand, integrating genetic, lifestyle, environmental, biochemical, imaging and clinical information into high-resolution patient profiles \cite{Capobianco2020}. This is in fact the intention of the patient simulator Synthea. However, Synthea will eventually face a problem with scalability and with the capacity of semi-independent state-transition models to coordinate in capturing long-range correlations.\par
Once basic models of health data, as described in Section \ref{sec:basic}, have been developed and validated, these can be progressively combined in a modular fashion to obtain increasingly complex patient simulators. Furthermore, having designed the architecture of these basic models around the underlying data in a way that is comprehensible, as described in Section \ref{sec:archi}, will facilitate the composition of more complex models. Inputs, outputs and parts of these models can be conditionally attached to others such that the generative process occurs in a way that reflects the real generative process.
\subsection{Evaluating complex patient models \label{sec:evaluation-cqm}}
Once more complex models are developed, the problem is again finding meaningful evaluation metrics of data realism. Capobianco et al. insist on the necessity for data performance metrics encompassing diagnostic accuracy, early intervention, targeted treatment and drug efficacy \cite{Capobianco2020}. In their publication exploring the validation of the data produced by Synthea, Chen et al. provide an interesting idea to achieve this \cite{Chen_2019}. Noting that the quality of care is the prime objective of a functional healthcare system, they suggest using \glspl{cqm} to evaluate the synthetic data. These measures ``are evidence-based metrics to quantify the processes and outcomes of healthcare'', such as ``the level of effectiveness, safety and timeliness of the services that a healthcare provider or organization offers'' \cite{Chen_2019}. High-level indicators such as \glspl{cqm}, being domain-specific measures of quality, are specifically designed for higher-level or multi-modal representations of healthcare data. The constraints introduced in \gls{heterogan} should be leveraged to evaluate the realism of the synthetic data, rather than to bias the generator training. Composing a comprehensive set of such constraints could possibly serve as a standardized benchmark.
At the individual level, Walsh et al. employ domain-specific indicators of disease progression and worsening, and compare the agreement of the simulated patient trajectories with the factual timelines \cite{walsh2020generating}.\par
In addition to \glspl{cqm}, we propose using the care maps employed by the Synthea model to simulate patient trajectories as an evaluation metric \cite{Walonoski_2017}. Care maps are transition graphs developed from clinician input and Clinical Practice Guidelines, whose transition probabilities are gathered from health incidence statistics. While these allow the Synthea algorithm to simulate patient profiles with realistic structure, they also prevent it from reproducing real-world variability. Conversely, while \glspl{gan} have the ability to reproduce the quirks of real data, they lack the constraints preventing nonsensical outputs. As such, care maps provide an ideal means to check whether synthetic data conforms to medical processes.\par
In fact, this has been done before in a competition where participants were given synthetic data from finite state-transition machines with known probabilities and tasked with building models that would reproduce those of the original, unseen machines. The participants were scored according to the perplexity metric, commonly used in NLP, which quantifies how well a probability distribution or probability model predicts a sample \cite{Verwer_2013}. We postulate that the Synthea models built with real-world probabilities would provide a unique and robust way to evaluate synthetic data according to the metric proposed above, among other ways to utilize the state-transition models in Synthea and their modularity.
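For reference, the standard definition of the perplexity of a sequence $x_{1:N}$ under a model $p$ is the exponentiated average negative log-likelihood,
\[
\mathrm{PP}(x_{1:N}) = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p\left(x_i \mid x_{1:i-1}\right)\right),
\]
so lower values indicate that the reference transition model assigns higher probability to the synthetic sequences.\par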
\subsubsection{Opportunities and application to current events}
Synthetic and external controls in clinical trials are becoming increasingly popular \cite{Thorlund2020}. Synthetic controls refer to cohorts that have been composed from real observational cohorts or \gls{ehr} using statistical methodologies. While the individuals included in the cohorts are usually left unchanged, micro-simulations of disease progression at the patient level are used to explore long-term outcomes and help in the estimation of treatment effects \cite{Thorlund2020, Etzioni2002}. Synthetic data generated by \glspl{gan} could be transformative for the problem of finding control cohorts.\par
With the COVID-19 pandemic, scientists have become increasingly aware of, and vocal about, the need for data sharing across political borders \cite{Cosgriff_2020,Becker_2020,McLennan_2020}. An obvious application is generating additional data in the early stages of a pandemic, potentially creating opportunities earlier. Synthetic data is not only an opportunity to facilitate the exchange of data, but also to adjust for the biases of samples obtained from different localities. Factors such as local hospital practices, different patient populations and equipment introduce feature and distribution mismatches \cite{Ghassemi2020}. These disparities can be mitigated by domain translation with \gls{gan} algorithms, such as \gls{cycle-gan} proposed by Yoon et al.