# Tutorials

```python
a = 1
```

# Frequentist vs. Fully Bayesian Gaussian Process Models

A distinction is sometimes made between "frequentist" and "fully Bayesian" Gaussian process (GP) surrogate models. Because GPs are often placed under the Bayesian model umbrella, this distinction can be confusing for users who don't have a strong background in probabilistic modeling. In this write-up, we hope to clarify the difference and provide some helpful recommendations on when to choose one over the other.

## Gaussian Process Parameters

In the interest of brevity, it is assumed that the reader is generally familiar with Gaussian process models and their properties. If you are new to GPs, consider watching this excellent [lecture](https://www.youtube.com/watch?v=92-98SYOdlY) on the intuition and mathematics behind them before reading onward.

At their simplest, GPs are defined by a mean function and a covariance function. The mean function represents the average behavior of the process across the input space, serving as a baseline trend in the absence of observed data. The covariance function (sometimes called the kernel) defines the relationships between observed data points and is controlled by a set of parameters. Common covariance functions, such as the radial basis function (RBF), are defined by `length scale` and `output scale` parameters. These parameters have a strong impact on the shape and fit of the GP. The effect of the `length scale` parameter on GP behavior is shown in the figure below.

![Effect of the length scale parameter on GP fits](LengthScale.jpg)
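
As a minimal illustration of this effect, the sketch below evaluates an RBF kernel at the same inputs under two different length scales using GPyTorch (the library underlying `botorch`). The inputs and length scale values are arbitrary, chosen only to make the contrast visible.

```python
import torch
from gpytorch.kernels import RBFKernel

# Two RBF kernels that differ only in length scale. The length scale
# controls how quickly correlation decays with distance between inputs.
kernel_short = RBFKernel()
kernel_short.lengthscale = 0.1
kernel_long = RBFKernel()
kernel_long.lengthscale = 2.0

x = torch.linspace(0, 1, 5).unsqueeze(-1)
print(kernel_short(x, x).to_dense())  # correlations fall off quickly -> wiggly fits
print(kernel_long(x, x).to_dense())   # inputs stay correlated -> smooth, slowly varying fits
```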

Fitting a GP to a set of data points involves estimating the values of these parameters such that a fitting error is minimized. It is at this point that the distinction between frequentist and fully Bayesian methods comes into play.

## Frequentist GP Fitting

The "frequentist" approach provides point estimates for the parameters of the covariance function using *maximum a posteriori* (MAP) estimation. In practice, this results in a single estimate of the `length scale` and `output scale` parameters that minimizes model error given the data. This is the most common estimation scheme for GP models and is the default in many GP libraries.

While MAP estimation works well in many scenarios, single point estimates of the covariance function parameters can lead to unintended results when few data points have been observed and the constraints on parameter values are too loose. Consider the figure below, where a GP is initialized with default parameters in the `botorch` library and fit to the data. Although the GP captures the spread of the data points (minimizing model error), it has little predictive strength.

![A default GP fit that captures the data spread but has little predictive strength](Frequentist.jpg)

In the toy example above, this failure is easy to see and remedy. In higher-dimensional spaces, however, fitting errors like this can be much harder to diagnose. A completely flat data representation has consequences for optimization tasks and makes sequential point selection less efficient. Care should therefore be taken when specifying GP parameters and evaluating predictions in limited-data scenarios.

## Fully Bayesian GP Fitting

By contrast, a fully Bayesian GP treats the covariance function parameters as distributions rather than single point values and attempts to infer the distribution from which those parameters are drawn. At an abstract level, the estimation procedure follows the familiar Bayes' rule, where the posterior is proportional to the prior multiplied by the likelihood. In practice, the posterior distributions are challenging to compute directly, so sampling methods such as Markov chain Monte Carlo (MCMC) are used, the details of which are beyond the scope of this article.
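
One concrete implementation in `botorch` is its SAAS-prior fully Bayesian GP, fit with the No-U-Turn Sampler (NUTS), a form of MCMC. A minimal sketch, again on toy data (the sampler settings are illustrative defaults, not tuned values):

```python
import torch
from botorch.models.fully_bayesian import SaasFullyBayesianSingleTaskGP
from botorch.fit import fit_fully_bayesian_model_nuts

train_X = torch.rand(8, 1, dtype=torch.float64)
train_Y = torch.sin(6 * train_X)

# NUTS draws samples from the hyperparameter posterior rather than
# returning a single point estimate as MAP fitting does.
model = SaasFullyBayesianSingleTaskGP(train_X, train_Y)
fit_fully_bayesian_model_nuts(model, warmup_steps=256, num_samples=128, thinning=16)
```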

Given distributions over the covariance function parameters, we can generate candidate GP models by sampling parameter values and fitting GPs to the data with them. Averaging over these candidate GPs then yields a mean posterior prediction. A graphical representation of this process is shown in the figure below: the predictions of GPs with different parameters drawn from the distribution are shown in light blue, with the dark blue line showing their mean.

![Sampled GP fits (light blue) and their posterior mean (dark blue)](FullyBayesian.jpg)
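
Continuing the sketch above, the fitted model's posterior carries one GP prediction per retained hyperparameter sample, and their average gives the mean prediction shown in dark blue:

```python
test_X = torch.linspace(0, 1, 50, dtype=torch.float64).unsqueeze(-1)
posterior = model.posterior(test_X)

print(posterior.mean.shape)          # one prediction per retained MCMC sample
print(posterior.mixture_mean.shape)  # the average over the sampled GPs
```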

Notice that although many of the sampled predictions are similar to the "frequentist" GP fit, the distribution places some weight on parameters that are closer to the true function, which leads to a non-constant GP mean and improved predictive performance.

In this way, fully Bayesian GPs tend to offer more robust estimates of model uncertainty.

## Which GP is Right for Your Problem?

Looking at the above results, it might be tempting to choose the fully Bayesian GP every time. However, the fully Bayesian approach comes with a higher computational cost, both in fitting the model and in computing the next observation point. For online systems, this can result in a significant amount of computing resources being devoted to the optimizer, which may reduce optimization efficiency and increase cost. Additionally, the GPs fit in the above examples made minimal assumptions about the problem and did not impose any hard constraints on the covariance function parameters. If we instead place some reasonable bounds on the `length scale` and the data's `noise` parameters, both fitting methods produce similar-looking mean functions that perform comparably on an optimization task.
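
A sketch of how such bounds might be imposed in `botorch` via GPyTorch constraints. The specific intervals here are placeholders; reasonable values are problem-dependent:

```python
import torch
from botorch.models import SingleTaskGP
from gpytorch.constraints import Interval
from gpytorch.kernels import RBFKernel, ScaleKernel
from gpytorch.likelihoods import GaussianLikelihood

train_X = torch.rand(8, 1, dtype=torch.float64)
train_Y = torch.sin(6 * train_X)

# Hypothetical bounds: length scale confined to [0.1, 2.0] and observation
# noise to [1e-4, 1e-2], keeping the optimizer away from degenerate fits.
covar_module = ScaleKernel(RBFKernel(lengthscale_constraint=Interval(0.1, 2.0)))
likelihood = GaussianLikelihood(noise_constraint=Interval(1e-4, 1e-2))
model = SingleTaskGP(train_X, train_Y, likelihood=likelihood, covar_module=covar_module)
```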

![Frequentist and fully Bayesian fits with bounded parameters](Comparison.jpg)

The key advantage of going fully Bayesian is that it can provide more robust models when knowledge about the domain is extremely limited, data is scarce relative to the number of input dimensions, and observation noise is difficult to measure. For many problems, a standard "frequentist" GP will give equivalent optimization performance and should be the default unless the fully Bayesian approach can be justified.

# Single- vs. Multi-Objective Optimization

Optimization theory is usually presented within the framework of maximizing or minimizing a single objective of interest. While single-objective problems are conceptually easy to understand, real-world problems often feature multiple competing objectives that a researcher may want to account for in an optimization campaign. Although the shift from single- to multi-objective optimization is nominally just an increase in the number of objectives, the methods and the interpretation of the results change dramatically. The sections below provide an overview of the key differences and the considerations in selecting a single- vs. multi-objective Bayesian optimization approach.

## Single-Objective Optimization

As the name implies, single-objective Bayesian optimization aims to maximize or minimize a single property of interest. It's the go-to method when you have both a clear goal and a clear metric of success. For example, you may wish to optimize the ratio of constituents in a material to maximize its strength as measured by a flexural test. Under the single-objective framework, a probabilistic model is trained to predict the property of interest as a function of one or more inputs. At each optimization iteration, the model is used to propose a set of one or more promising inputs. These are then evaluated (computationally or experimentally) and the results are integrated back into the model. As more data is observed, the model becomes more accurate and returns increasingly better input suggestions. The best set of inputs is then simply the one that returned the best observed property value. The figure below shows a single-objective optimization trace where the best value observed at each iteration is tracked by the blue line.

![Single-objective optimization trace](SOBO.jpg)
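
A compact sketch of this loop in `botorch`, using analytic log expected improvement as the acquisition function. The `strength` function here is a hypothetical stand-in for a real (and expensive) measurement:

```python
import torch
from botorch.models import SingleTaskGP
from botorch.fit import fit_gpytorch_mll
from botorch.acquisition import LogExpectedImprovement
from botorch.optim import optimize_acqf
from gpytorch.mlls import ExactMarginalLogLikelihood

def strength(x):  # hypothetical objective over two mixture ratios
    return -(x - 0.6).pow(2).sum(dim=-1, keepdim=True)

bounds = torch.tensor([[0.0, 0.0], [1.0, 1.0]], dtype=torch.float64)
train_X = torch.rand(5, 2, dtype=torch.float64)
train_Y = strength(train_X)

for _ in range(10):  # one pass = one optimization iteration
    model = SingleTaskGP(train_X, train_Y)
    fit_gpytorch_mll(ExactMarginalLogLikelihood(model.likelihood, model))
    acqf = LogExpectedImprovement(model=model, best_f=train_Y.max())
    candidate, _ = optimize_acqf(
        acqf, bounds=bounds, q=1, num_restarts=5, raw_samples=64
    )
    train_X = torch.cat([train_X, candidate])            # evaluate the proposal...
    train_Y = torch.cat([train_Y, strength(candidate)])  # ...and fold it back in

print(train_X[train_Y.argmax()])  # best inputs observed so far
```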

## Multi-Objective Optimization

Imagine now that we are interested in optimizing a set of inputs for two or more properties simultaneously. Building on the previous example, we might now want to maximize the strength of a material while also minimizing its fabrication cost. The material that offers the highest strength might be prohibitively expensive, and conversely, the most cost-effective material might be low strength. Thus, when considering multiple objectives, the question of the "best" set of inputs becomes more complicated, and the answer will likely be a compromise between the objectives based on preferences and priorities.

When optimization preferences are known, a common approach is to combine several objectives into a single objective through a weighted sum, a weighted product, or some other mathematical combination. This approach is called *scalarization* and is a means of encoding preferences into the optimization campaign while reducing problem complexity. However, such approaches risk biasing the campaign toward a poor region of the solution space and limit fuller exploration of the trade-offs between objectives.
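
For example, a weighted-sum scalarization takes only a few lines. The weights below are hypothetical preferences, not values from this article:

```python
import torch

def weighted_sum(Y: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
    """Collapse an (n x m) tensor of objective values into n scalars."""
    return (Y * weights).sum(dim=-1)

# Columns: strength and negated cost, so that both are maximized.
Y = torch.tensor([[3.0, -1.0], [2.5, -0.4]])
print(weighted_sum(Y, weights=torch.tensor([2.0, 1.0])))  # strength weighted 2:1
```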

Objective preferences may not be clearly defined, and researchers are often interested in learning the trade-offs between the objectives so as to make an informed decision. Returning to the material optimization example, a researcher might want to determine the relative cost of an increase in material strength. In other words, we would like to know the boundary of the cost vs. strength compromise. This boundary is referred to as the *Pareto front* and is defined by a set of non-dominated solutions, each of which cannot be improved in one objective without making another worse. Finding this boundary of optimal trade-offs provides a holistic view of the solution space and allows researchers to make informed decisions based on their constraints and priorities. As such, many multi-objective optimization schemes aim to identify a set of Pareto-optimal solutions as quickly as possible. The figure below shows a hypothetical Pareto front for our materials optimization example. The points along the blue dashed line represent the optimal trade-offs between the two objectives.

![Hypothetical Pareto front for the cost vs. strength example](MOBO.jpg)
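
Identifying the non-dominated points in a set of observed outcomes is straightforward with `botorch` utilities. The outcome values below are made up for illustration, with both columns framed for maximization:

```python
import torch
from botorch.utils.multi_objective.pareto import is_non_dominated

# Hypothetical outcomes: column 0 = strength, column 1 = negated cost.
Y = torch.tensor([
    [1.0, -0.2],
    [2.0, -0.5],
    [1.5, -0.6],  # dominated by [2.0, -0.5]: weaker AND more expensive
    [0.5, -0.1],
])
mask = is_non_dominated(Y)
print(Y[mask])  # the current Pareto set of non-dominated outcomes
```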

These Pareto-optimal solutions are identified by optimizing a metric referred to as the *hypervolume*: the volume (or area, in 2D) of the objective space that is dominated by a set of Pareto-optimal solutions relative to a reference point. As the Pareto front expands, the measured hypervolume grows with it, providing a convenient scalar metric of both the diversity and the performance of a set of solutions. The goal of a multi-objective optimization campaign is then to identify points that increase the hypervolume, which necessarily produces points that improve the objective trade-offs. The figure below provides a visualization of a hypervolume in a multi-objective design space. The dashed square represents the computed improvement in the hypervolume if a predicted point were observed; this is the mechanism by which the Pareto front is iteratively expanded.

![Hypervolume improvement from a candidate point](HVI.jpg)
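
Given a Pareto set and a reference point, the hypervolume can be computed directly with `botorch`. The reference point below is a hypothetical worst-case outcome that every Pareto point must dominate:

```python
import torch
from botorch.utils.multi_objective.hypervolume import Hypervolume

# Reference point: must be dominated by every point on the Pareto front.
ref_point = torch.tensor([0.0, -1.0])
pareto_Y = torch.tensor([[0.5, -0.1], [1.0, -0.2], [2.0, -0.5]])

hv = Hypervolume(ref_point=ref_point)
print(hv.compute(pareto_Y))  # grows as the Pareto front expands
```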

## Which Approach is Right for Your Problem?

In deciding between single- and multi-objective optimization, it's important to consider the complexity of your goals and the trade-offs you're willing to navigate. Single-objective optimization is straightforward, which makes the optimization process simpler, more interpretable, and often faster. However, by omitting competing objectives you may oversimplify problems where multiple, often conflicting objectives must be balanced. Multi-objective optimization allows for the simultaneous consideration of several goals and the ability to learn the bounds on the trade-offs between them. This, however, comes at a higher computational cost and requires a more sophisticated decision-making process for selecting the final solution. When another objective, such as cost, can be computed directly, consider representing it as a constraint rather than as a separate objective.

## Additional Resources

P. Frazier, A Tutorial on Bayesian Optimization
- https://arxiv.org/abs/1807.02811

Ax Multi-Objective Optimization Tutorial
- https://ax.dev/tutorials/multiobjective_optimization.html

Emmerich et al., A Tutorial on Multiobjective Optimization: Fundamentals and Evolutionary Methods
- https://link.springer.com/article/10.1007/s11047-018-9685-y