diff --git a/data_services.tex b/data_services.tex index de1e2b4..3e68274 100644 --- a/data_services.tex +++ b/data_services.tex @@ -16,7 +16,7 @@ \subsection{Performance Portability: Kokkos Core}\label{subsec:kokkos} \subsection{Performance Portable Kernels: Kokkos Kernels}\label{subsec:kk} Kokkos Kernels~\cite{rajamanickam2021kokkoskernels} is part of the Kokkos ecosystem~\cite{trott2021kokkos} and provides node-local implementations of mathematical kernels widely used across packages in Trilinos. As a member of the Kokkos ecosystem, Kokkos Kernels is tightly integrated into Kokkos features and aims at delivering performance-portable algorithms across major CPU- and GPU-based HPC systems. Due to its node-local nature, Kokkos Kernels does not rely on MPI or other communication libraries unlike numerous other packages in Trilinos. -The implementations of Kokkos Kernels algorithms leverage the hierarchical parallelism exposed by the Kokkos library~\cite{kim2017designing} and increasingly provide coverage for stream-callable kernels. To ensure flexibility for the distributed libraries that might call its algorithms, Kokkos Kernels provides thread-safe and asynchronous implementations for most of its kernels. Kokkos Kernels also serves as a major point of integration for vendor optimized libraries such as cuBLAS, cuSPARSE, rocBLAS, rocSPARSE, MKL, ARMpl and others. +The implementations of Kokkos Kernels algorithms leverage the hierarchical parallelism exposed by the Kokkos library~\cite{kim2017designing}. To ensure flexibility for the distributed libraries that might call its algorithms, Kokkos Kernels provides thread-safe implementations for most of its kernels, with increasing coverage for asynchronous implementations that allow an execution space instance to be passed when calling a kernel. Kokkos Kernels also serves as a major point of integration for vendor optimized libraries such as cuBLAS, cuSPARSE, rocBLAS, rocSPARSE, MKL, ARMpl and others. 
The capabilities provided by Kokkos Kernels can be divided into four major categories: 1. BLAS algorithms, 2. sparse linear algebra and preconditioners, 3. graph algorithms, and @@ -39,16 +39,16 @@ \subsection{Performance Portable Kernels: Kokkos Kernels}\label{subsec:kk} \subsection{Tools: Teuchos} -Teuchos provides a suite of common tools for many Trilinos packages. These tools include memory management classes~\cite{bartlett2010} such as ``smart'' pointers and arrays, ``parameter lists'' for communicating hierarchical lists of parameters between library or application layers, templated wrappers for the BLAS and LAPACK, XML parsers, and other utilities. They provide a unified ``look and feel'' across Trilinos packages, and help avoid common programming mistakes. +Teuchos provides a suite of common tools for many Trilinos packages. These tools include memory management classes~\cite{bartlett2010} such as smart pointers and arrays, parameter lists for communicating hierarchical lists of parameters between library or application layers, templated wrappers for BLAS and LAPACK, XML parsers, MPI, and other utilities. They provide a unified ``look and feel'' across Trilinos packages, and help avoid common programming mistakes. \subsection{Distributed-Memory Linear Algebra: Tpetra}\label{subsec:tpetra} Tpetra \cite{hoemmen2015tpetra} provides the distributed-memory -infrastructure for sparse linear algebra computations. It implements -distributed-memory linear algebra objects, such as sparse graphs, -sparse matrices, and dense vectors, where Kokkos is employed for local data -storage. The provided objects are templated on scalar type (e.g. \texttt{double}, \texttt{float}, or a Sacado type), local index type, global index type and Kokkos backend and memory space. +infrastructure for sparse linear algebra objects, such as sparse graphs, +sparse matrices, and dense vectors. 
The implementation of these +distributed-memory linear algebra objects uses Kokkos Core and Kokkos Kernels linear algebra objects locally on a compute node. +The provided objects are templated on scalar type (e.g. \texttt{double}, \texttt{float}, or a Sacado or Stokhos type), local index type, global index type and Kokkos device (pair of execution and memory spaces). Distributed-memory sparse linear algebra operations, such as a sparse matrix-vector product, are implemented through on-node calls to Kokkos Kernels and inter-node MPI communication. Tpetra features @@ -59,7 +59,7 @@ \subsection{Distributed-Memory Linear Algebra: Tpetra}\label{subsec:tpetra} vectors (multivectors) and associated BLAS-1 like kernels (e.g., dot products, norms, scaling, vector addition, pointwise vector multiplication) as well as tall skinny QR (TSQR) factorization for multivectors. -\item \emph{Import/export:} moving vector, graph, and matrix data +\item \emph{Import/export:} communicating vector, graph, and matrix data between different distributions (maps). This is key for performing halo/boundary exchanges as well as other kernels such as sparse matrix-matrix multiplication. diff --git a/discretization.tex b/discretization.tex index 19383dd..a3881d2 100644 --- a/discretization.tex +++ b/discretization.tex @@ -1,21 +1,22 @@ % !TEX root = main.tex -This section describes Trilinos packages that provide tools for spatial and temporal discretization of integro-differential equations. Most of Trilinos discretization efforts have been devoted to implement tools for mesh-based discretizations of partial differential equations (PDEs) with a focus on high-order finite elements. A notable exception is constituted by the research package Compadre, which provides tools for meshless approximation of linear operators that can be used for the discretization of differential equations and for data transfer. 
-Trilinos discretization packages have been adopted by many applications addressing a wide range of physics problems, including solid mechanics, earth system modeling, semiconductor devices and electro-magnetics. These applications have taken different approaches in adopting Trilinos mesh-based discretization tools. The less intrusive approach is the adoption of Intrepid2 tools to perform local finite element assembly. The application has to manage the global assembly possibly using the \code{DoFManager} provided by Panzer and FE Crs matrix and vector structures provided by Tpetra. -A more intrusive approach is to additionally use the Phalanx package for managing dependencies of field evaluations in conjunction with the Thyra Model Evaluator and Sacado algorithmic differentiations, and possibly Tempus for time integration. This approach is particularly useful when developing complex multiphysics problems because it allows easy re-use of computational kernels and automates the computation of Jacobians and sensitivities. +This section describes Trilinos packages that provide tools for spatial and temporal discretization of integro-differential equations. Most discretization efforts in Trilinos have been devoted to implementing tools for mesh-based discretizations of partial differential equations (PDEs) with a focus on high-order finite elements. A notable exception is the research package Compadre, which provides tools for meshless approximation of linear operators that can be used for the discretization of differential equations and for data transfer. +Trilinos discretization packages have been adopted by many applications addressing a wide range of physics problems, including solid mechanics, earth system modeling, semiconductor devices modeling and electro-magnetics. These applications have taken different approaches in adopting Trilinos mesh-based discretization tools. 
The least intrusive approach is the adoption of Intrepid2 tools to perform local finite element assembly. The application has to manage the global assembly, possibly using the \code{DOFManager} provided by Panzer and FE Crs matrix and vector structures provided by Tpetra. +A more intrusive approach is to additionally use the Phalanx package for managing dependencies of field evaluations in conjunction with the Thyra Model Evaluator and Sacado automatic differentiation, and possibly Tempus for time integration. This approach is particularly useful when developing solvers for complex multiphysics problems because it allows easy re-use of computational kernels and automates the computation of Jacobians and sensitivities. The most intrusive approach is to build the application around the Panzer package, which provides all of the above, plus the handling of linear and nonlinear solvers and integrated constrained optimization capabilities. -In the following we describe some of the Trilinos discretization packages. For brevity, we do not include the description of these packages: Sierra ToolKit (STK), Krino and Percept, which provide mesh and level-set tools, and Shards, which provides tools for mesh topology. We refer to Trilinos website (\url{https://trilinos.github.io}) for a brief description of these packages. - +In the following, we do not include the description of the Sierra ToolKit (STK), Krino and Percept packages, which provide mesh and level-set tools, nor the description of the Shards package, which provides tools for mesh cell topology. +The STK and Krino packages are snapshotted into Trilinos. +We refer to the Trilinos website (\url{https://trilinos.github.io}) for a brief description of these packages. \subsection{Local Assembly: Intrepid2} Intrepid2 provides interoperable tools for compatible discretizations of PDEs; it is a performance-portable re-implementation and extension of the legacy Intrepid package \cite{bochev2012}.
Intrepid2 mainly focuses on local assembly of continuous and discontinuous finite elements. It also provides limited capabilities for finite volume discretization. Intrepid2 works on batches of elements (cells), and provides tools to efficiently compute discretized linear functionals (e.g., right-hand-side vectors) and differential operators (e.g., stiffness matrices) at the element level. Intrepid2 implements compatible finite element spaces of various polynomial orders for $H({\rm grad})$, $H({\rm curl})$, $H({\rm div})$ and $L^2$ function spaces on triangles, quadrilaterals, tetrahedrons, hexahedrons, wedges and pyramids. It provides both Lagrangian basis functions and hierarchical basis functions \cite{fuentes2015} and it implements performance optimizations (e.g., sum factorizations) exploiting the underlying structure of the problem (e.g., tensor-product elements or other symmetries). The degrees of freedom of $H({\rm div})$ and $H({\rm curl})$ finite elements as well as high-order $H({\rm grad})$ finite elements depend on the global orientation of edges and faces and Intrepid2 provides orientation tools for matching the degrees of freedom on shared edges and faces. It also provides interpolation-based projection tools for projecting functions in $H({\rm grad})$, $H({\rm curl})$, $H({\rm div})$ and $L^2$ to the respective discrete spaces. Intrepid2 implements these capabilities through the following classes: \begin{itemize} -\item \code{CellTools}: This class provides geometric operations on the reference and physical frame. This includes computation of tangents and normals to edges/faces in the physical frame, computation of Jacobian of the reference-to-physical frame maps, and other metric computations. -\item \code{CubatureFactory}: This class provides several quadrature rules (called \emph{cubatures} in Intrepid2) of various degrees of accuracy for approximating integrals over the elements and their boundaries. 
-\item \code{Basis:} This is the base class for a variety of basis functions for compatible finite element spaces. Each class includes a \code{getValues()} method that computes the value of the basis functions or their derivatives (e.g., gradient for $H({\rm grad})$ functions, curl for $H({\rm curl})$ functions) at a set of input points. The implementation of \code{getValues()} can be very different depending on the basis. Specific optimizations are available for tensor-product elements. Additionally, there is a \code{BasisFamily} class with a convenience method, \code{getBasis()}, which constructs a basis depending on a template argument specifying the type of basis (e.g., hierarchical or nodal), the cell topology and function space on which it is defined, and its polynomial degree. +\item \code{CellTools}: This class provides geometric operations in the reference and physical frames. This includes the computation of tangents and normals to edges and faces in the physical frame, the computation of the Jacobian of the reference-to-physical frame map, and other geometric computations. +\item \code{CubatureFactory}: This class provides quadrature rules (called \emph{cubatures} in Intrepid2) of various degrees of accuracy for approximating integrals over elements and their boundaries. +\item \code{Basis}: This is the base class that provides a common interface for functionalities related to finite element bases. Intrepid2 provides derived classes that implement this common interface for a variety of compatible finite element spaces. Each derived class implements the \code{getValues()} method that computes the values taken by the basis functions or their derivatives (e.g., the gradient for $H({\rm grad})$ functions, the curl for $H({\rm curl})$ functions) at a set of input points. The implementation of \code{getValues()} can be very different depending on the basis. Specific optimizations are available for tensor-product elements.
Additionally, there is a \code{BasisFamily} class with a convenience method, \code{getBasis()}, which constructs a basis depending on a template argument specifying the type of basis (e.g., hierarchical or nodal), the cell topology and the function space on which it is defined, and its polynomial degree. \item \code{OrientationTools}: This class provides methods to orient the basis functions based on the global orientation of edges and faces, determined by the global numbering of the cell vertices. This is achieved by building a linear operator (a permutation for tensor-product elements) that encodes the orientation of a particular cell, and applying that operator to the reference basis functions. \item \code{ProjectionTools}: This class provides methods for interpolation-based projections of a given function into a compatible finite element space or between compatible finite element spaces~\cite{demkowicz2007}. The provided projections commute with the corresponding differential operators if the quadrature rules can exactly integrate the functions being projected. As an example, projecting an $H({\rm grad})$ function into the $H({\rm grad})$ finite element space and then taking its gradient gives the same result as taking the gradient of the function first and then projecting the gradient into the $H({\rm curl})$ finite element space. -\item \code{FunctionSpaceTools}: This class provides transformations of fields from reference to physical frame and back, computation of measures on edges, faces and cells, scalar/vector/tensor multiplications and contractions for computing integrals. +\item \code{FunctionSpaceTools}: This class provides transformations of fields from the reference to the physical frame and back, computation of measures on edges, faces and cells, scalar/vector/tensor multiplications and contractions for computing integrals. 
\item \code{IntegrationTools}: This class provides integration methods that can take advantage of tensor product structures in basis values, providing mechanisms for performance-portable, \emph{sum-factorized} assembly across $H({\rm grad})$, $H({\rm curl})$, $H({\rm div})$ and $L^2$ function spaces. In the future, we plan to provide similar interfaces to support matrix-free discretizations. \end{itemize} Intrepid2 makes use of Kokkos containers to enable memory layouts that are adapted to the computational platform. Intrepid2 also uses Kokkos for its core computational kernels, enabling threaded execution across a variety of architectures. The data types used by Intrepid2 are templated; it is therefore possible to propagate Sacado types through Intrepid2 to perform automatic differentiation. Current development of Intrepid2 focuses on providing efficient matrix-free discretizations to enhance efficiency on GPU architectures. @@ -33,7 +34,7 @@ \subsection{Time Integration: Tempus} sensitivity analysis for next-generation code architectures. Tempus provides “out-of-the-box” time-integration capabilities, which allows users to quickly and easily incorporate time-integration -capabilities to their applications and switch between various time +capabilities in their applications and switch between various time integrators depending on the simulation needs. Additionally, Tempus provides “build-your-own” capabilities, which allows applications to incorporate various Tempus components to augment or replace @@ -54,9 +55,9 @@ \subsection{Time Integration: Tempus} individually, depending on the needs of the application. \begin{itemize} \item Integrators are the time-loop structure for time integration - and provide several features, e.g., control the advancement of - the solution, selection of the next timestep size and solution - output. 
+ and provide several features, e.g., controlling the advancement of + the solution, selecting the next timestep size and handling the + solution output. \item Time steppers are individual methods that advance the solution from one step to the next. A variety of time steppers @@ -76,8 +77,8 @@ \subsection{Time Integration: Tempus} \end{itemize} \item Solution history is used to maintain the solution during - time-step failure, solution restart/output, interpolation of - solution between time steps, and to provide the solution for + time-step failure, for solution restart/output, for interpolation of + the solution between time steps, and to provide the solution for transient adjoint sensitivities. \item Timestep control and strategies provide methods to select @@ -97,15 +98,15 @@ \subsection{Finite Element Analysis: Panzer} Panzer can provide low level utilities for application codes to build on, or can be used as a high level application framework. Important capabilities include the following: \begin{itemize} -\item \code{DOFManager}: Panzer provides a stand-alone DOF manager class in the dof-mgr subpackage. Given a list of DOFs and their corresponding basis and element blocks, the DOF manager can provide the mapping from DOFs on the mesh entities to the entries in linear algebra objects such as a residual vector or Jacobian matrix. The DOF manager can return the objects required to build distributed Tpetra +\item \code{DOFManager}: Panzer provides a stand-alone DOF manager class in the dof-mgr subpackage. Given a list of DOFs and their corresponding basis and element blocks, the DOF manager can provide the mapping from DOFs on the mesh entities to the entries in linear algebra objects such as residual vectors and Jacobian matrices. The DOF manager can return the objects required to build distributed Tpetra %and Epetra maps and graphs for both uniquely owned global indices and ghosted indices used during assembly. 
%\todo{Remove ``Epetra''?} It provides the local indexing used during assembly as well. The DOF manager contains a mesh abstraction called a connection manager that provides information about mesh connectivity, global numbering of mesh topological entities and element block groups. It is designed to support any underlying mesh database, allowing applications to use the DOF manager with any finite element application. -\item \code{STK\_Interface}: Panzer contains a concrete implementation of the connection manager API for the STK mesh database package in Trilinos. The \code{STK\_Interface} object wraps a STK mesh database. It provides a simple interface for accessing global indices on the mesh and can be used to read/write associated solution data to the database. It additionally supports SEACAS use for writing mesh data to disk. This \code{STK\_Interface} also includes support for periodic boundary conditions. This capability can match topological entities on periodic parallel distributed faces. Once matched, the DOFs are unified to enforce periodicity for the DOF manager. This capability is in the adapters-stk subpackage. +\item \code{STK\_Interface}: Panzer contains a concrete implementation of the connection manager API for the STK mesh database package in Trilinos. The \code{STK\_Interface} object wraps an STK mesh database. It provides a simple interface for accessing global indices on the mesh and can be used to read/write associated solution data to the database. It additionally supports SEACAS for writing mesh data to disk. The \code{STK\_Interface} also includes support for periodic boundary conditions. This capability can match topological entities on periodic parallel distributed faces. Once matched, the DOFs are unified to enforce periodicity for the DOF manager. This capability is in the adapters-stk subpackage.
\item \emph{Linear Object Factory:} Panzer provides a linear object factory and linear object container designed to support parallel distributed assembly. %It supports both Epetra and Tpetra objects. -The returned containers hold the linear algebra objects for either a uniquely owned DOF map used for solving the linear system or a ghosted version of the linear objects that can be used for assembly. The containers support export and import operations between the unique and ghosted containers. These are used to simplify the assembly process and abstract the underlying Tpetra linear algebra objects. %(i.e., Epetra or Tpetra). +The returned containers hold either the linear algebra objects for a uniquely owned DOF map used for solving the linear system or a ghosted version of the linear objects that can be used for assembly. The containers support export and import operations between the unique and ghosted containers. These are used to simplify the assembly process and abstract the underlying Tpetra linear algebra objects. %(i.e., Epetra or Tpetra). The linear object factory can also create DOF gather and scatter functors for the corresponding (and possibly blocked) matrices. The gather and scatter operations are specialized on the assembly types, including residuals, Jacobians, and Tangents. Support for Hessians will be implemented as needed. The capability is found in the disc-fe subpackage. @@ -115,7 +116,7 @@ \subsection{Finite Element Analysis: Panzer} \item Panzer provides an example miniapp for implicit electro-magnetics that is used for benchmarking linear solver performance and for acceptance testing of new high performance computing systems. It demonstrates an $H(\rm{curl})$-$H(\rm{div})$ formulation for the electric and magnetic fields. This is found in the MiniEM subpackage. \end{itemize} -The Panzer package is intended to provide both low- and high-level tools for implicit finite elements discretizations. 
The high level tools aggregate many Trilinos discretization and solver packages. While the high level tools can be used as a rapid prototyping environment, the front end is fairly complex to set up, as opposed to true rapid prototyping frameworks such as deal.ii \cite{dealII95}, FEniCSx \cite{BarattaEtal2023} or Firedrake \cite{FiredrakeUserManual}. Panzer is not user friendly in this regard, however the examples and miniapps are a good starting point and can be quickly adapted to other physics. A number of performance portable applications are using Panzer tools at different levels of adoption; these include Albany \cite{Salinger2016}, Charon \cite{CharonUsersManual2020}, Drekar \cite{Crockatt2022,Miller2019,Shadid2016mhd} and EMPIRE \cite{BettencourtBrownEtAl2021_EmpirePic}. +The Panzer package is intended to provide both low- and high-level tools for implicit finite element discretizations. The high-level tools aggregate many Trilinos discretization and solver packages. While the high-level tools can be used as a rapid prototyping environment, the front end is fairly complex to set up, as opposed to true rapid prototyping frameworks such as deal.II \cite{dealII95}, FEniCSx \cite{BarattaEtal2023} or Firedrake \cite{FiredrakeUserManual}. Panzer is not user-friendly in this regard; however, the examples and miniapps are a good starting point and can be quickly adapted to other physics. A number of performance-portable applications are using Panzer tools at different levels of adoption; these include Albany \cite{Salinger2016}, Charon \cite{CharonUsersManual2020}, Drekar \cite{Crockatt2022,Miller2019,Shadid2016mhd} and EMPIRE \cite{BettencourtBrownEtAl2021_EmpirePic}.
\subsection{Approximation of Linear Operators: Compadre} diff --git a/framework.tex b/framework.tex index eb65358..82c5047 100644 --- a/framework.tex +++ b/framework.tex @@ -8,12 +8,12 @@ \subsection{Build and Test Infrastructure} The Trilinos Framework infrastructure is built on top of the Tribal Build, Integration, and Test System (TriBITS~\cite{Bartlett2014}) which is built on top of the open-source tools CMake and CTest\footnote{\url{https://cmake.org}}. The TriBITS framework allows building arbitrary subgraphs of dependent (Trilinos) CMake packages in one or more individual aggregated CMake projects (in any arrangement desired). -Each Trilinos/TriBITS package lists its direct (required and optional) dependent upstream packages which forms a package dependency graph. +Each Trilinos/TriBITS package lists its direct (required and optional) dependent upstream packages, thus forming a package dependency graph. The TriBITS framework uses this package dependency graph to automatically determine what indirect dependent internal packages must be enabled and processed (and built) and what external packages must be found. -TriBITS then orchestrates the processing of all of the required CMake code to find the needed external packages and configure, build (and optionally test and install) the selected set of internal packages. +TriBITS then orchestrates the processing of all of the required CMake code to find the needed external packages and configure and build (and optionally test and install) the selected set of internal packages. This allows a large number of (Trilinos) CMake packages to be configured, built, and tested in a flexible and efficient manner. -In addition, TriBITS provides support for a number of advanced features that are not available in raw CMake/CTest including: eliminating a large amount of boiler-plate CMake code and avoiding common mistakes; enabling and testing all downstream packages given a set of enabled (i.e. 
modified) upstream packages; managing the enable and disable of tests based on various criteria; producing build and test results submitted to a CDash site on a package-by-package basis; producing reduced source tarballs for only a desired subset of enabled packages. -As of Trilinos 14.4, TriBITS and Trilinos have been updated to allow integrating packages using raw CMake with just a few well-defined integration requirements. +In addition, TriBITS provides support for a number of advanced features that are not available in raw CMake/CTest including: eliminating a large amount of boiler-plate CMake code and avoiding common mistakes; enabling and testing all downstream packages given a set of enabled (i.e. modified) upstream packages; managing the enabling and disabling of tests based on various criteria; producing build and test results submitted to a CDash site on a package-by-package basis; producing reduced source tarballs for only a desired subset of enabled packages. +As of Trilinos 14.4, TriBITS and Trilinos have been updated to allow packages using raw CMake to be integrated with just a few well-defined integration requirements. The TriBITS framework has allowed Trilinos to scalably grow in the number of packages and the complexity without undue burdening of individual Trilinos developers and users. \subsection{Documentation Infrastructure} @@ -23,6 +23,6 @@ \subsection{Documentation Infrastructure} \subsection{Python Wrappers: PyTrilinos2} -PyTrilinos2 is a set of automatically generated Python wrappers for selected Trilinos packages including Tpetra, Teuchos and Thyra, and exposing solver capabilities from Amesos2, Belos, Ifpack2 and MueLu through Stratimikos. In the future, the list of wrapped packages will be enlarged to provide users with more features and to enable efficient prototyping of new algorithms for developers. 
+PyTrilinos2 is a set of automatically generated Python wrappers for selected Trilinos packages including Tpetra, Teuchos and Thyra, and for exposing solver capabilities from Amesos2, Belos, Ifpack2 and MueLu through Stratimikos. In the future, the list of wrapped packages will be enlarged to provide users with more features and to enable efficient prototyping of new algorithms for developers. % LocalWords: scalably diff --git a/introduction.tex b/introduction.tex index 266f0fe..143ff90 100644 --- a/introduction.tex +++ b/introduction.tex @@ -1,19 +1,20 @@ % !TEX root = main.tex Trilinos is a community-driven, open-source C++ software framework and collection of reusable scientific libraries designed to enable the development of scalable, high-performance algorithms for solving complex, multiscale, and multiphysics engineering and scientific problems on advanced computing architectures. -While Trilinos can run on a variety of hardware ranging from small workstations to large supercomputers, the typical use of Trilinos is on the leadership-class systems with new or emerging hardware architectures. +While Trilinos can run on a variety of hardware platforms ranging from small workstations to large supercomputers, the typical use of Trilinos is on the leadership-class systems with new or emerging hardware architectures. % History -Trilinos was originally conceived as a framework of three packages for distributed memory systems. The original Trilinos publication~\cite{Heroux2005a} describes the motivation, philosophy, and capabilities of Trilinos at that time. A few years later, the second Trilinos overview publication~\cite{Heroux2012} introduced the expanded set of capabilities then included in Trilinos as well as the strategic goals for Trilinos. Trilinos today is similar to the Trilinos that was envisioned two decades ago in some aspects. However, it is also very different in several other aspects. 
These changes were necessitated by the changes in programming models, application needs, hardware architectures, and algorithms. Trilinos has grown from a library of three packages to a library with more than fifty packages with functionality and features supporting a wide range of applications. +Trilinos was originally conceived as a framework of three packages for distributed memory systems. The original Trilinos publication~\cite{Heroux2005a} described the motivation, philosophy, and capabilities of Trilinos at that time. A few years later, the second Trilinos overview publication~\cite{Heroux2012} introduced the expanded set of capabilities then included in Trilinos as well as the strategic goals for Trilinos. Trilinos today is similar to the Trilinos that was envisioned two decades ago in some aspects. However, it is also very different in several other aspects. These changes were necessitated by the changes in programming models, application needs, hardware architectures, and algorithms. Trilinos has grown from a library of three packages to a library with more than fifty packages with functionality and features supporting a wide range of applications. Older packages rely heavily on the linear algebra classes provided by Epetra. Epetra does not support larger problems (2B+ unknowns), nor does it allow for accelerator offload. -Since the packages of the old Epetra stack are scheduled to be archived and removed from the main Trilinos repository on Github by the end of 2025 we limit ourselves to only describe the modern Trilinos software stack. +Since the packages of the old Epetra stack are scheduled to be archived and removed from the main Trilinos repository on GitHub by the end of 2025, we limit ourselves to describing only the modern Trilinos software stack, which builds upon the Tpetra stack and the Kokkos Ecosystem to achieve performance portability across hardware architectures.
+ % Purpose This article is an attempt to capture a snapshot of where Trilinos is today as opposed to twenty and thirteen years ago when the original Trilinos articles were written. Therefore it will focus on the major developments within Trilinos in the last decade as well as new features and functionality that have been added to advance scientific and engineering applications. -It will only give an overview of the features, and we refer to the extensive list of references for the details of these features. -We are also cognizant of the fact that this article describes software that is actively developed and constantly evolving. -Hence, we will focus on the high level features and concepts that we expect to remain stable for several years. +It will give only an overview of the features, and we refer to the extensive list of references for the details of these features. +We are also cognizant of the fact that this article describes software that is being actively developed and constantly evolving. +Hence, we will focus on the high-level features and concepts that we expect to remain stable for several years. @@ -22,11 +23,11 @@ \subsection{Modern Trilinos: Performance Portability through Kokkos} A key goal of modern Trilinos is to offer a performance-portable collection of reusable scientific libraries, allowing users to develop applications that achieve high efficiency across all modern High Performance Computing (HPC) hardware architectures. This goal emerged when the HPC architecture landscape started diversifying with the -introduction of GPU acceleration for scientific software. Today nine of the top ten systems in the Top500\footnote{https://top500.org} use GPUs, with only a single system being CPUs only. A complication in this shift is the diversification of vendor-provided programming models. 
-The aforementioned ten systems actually have four different vendor-preferred programming models: CUDA, HIP and OneAPI SYCL for the GPU based systems and OpenMP for the CPU system. +introduction of GPU acceleration for scientific software. Today nine of the top ten systems in the Top500\footnote{https://top500.org} use GPUs, with a single system using CPUs only. A complication in this shift is the diversification of vendor-provided programming models. +The aforementioned ten systems actually have four different vendor-preferred programming models: CUDA, HIP and OneAPI SYCL for the GPU-based systems and OpenMP for the CPU system. \todo{MMW: We should verify these programming models... I think these are right but I'm not sure for Fugaku } -To avoid the necessity of diverging code paths, Trilinos is leveraging the Kokkos Ecosystem~\cite{trott2021kokkos} to write hardware-agnostic code. The Kokkos Ecosystem started as the native Trilinos packages Kokkos and Kokkos Kernels but split off a decade ago into a standalone project hosted and developed independently from Trilinos\footnote{https://github.com/kokkos}. +To avoid the necessity of diverging code paths, Trilinos is leveraging the Kokkos Ecosystem~\cite{trott2021kokkos} to write performance-portable code. The Kokkos Ecosystem started as the native Trilinos packages Kokkos and Kokkos Kernels but split off a decade ago into a standalone project hosted and developed independently from Trilinos\footnote{https://github.com/kokkos}. Now, Trilinos provides snapshots of the two primary Kokkos subprojects (Core and Kernels) that reflect the latest release of Kokkos. Trilinos can also be built against version-compatible external installations of Kokkos --- a capability required for interoperability with other Kokkos-based libraries. Kokkos enables Trilinos developers to write single source implementations of their packages that perform well on all major HPC hardware architectures.
Some of its design principles are also reflected in Trilinos API designs. In particular, the @@ -39,15 +40,15 @@ \subsection{Trilinos Functionality} %MMW I really don't see the point of emphasizing products. We don't really give much explanation of what it means to be a Trilinos product. I suggest we either explain products better (much better explanation of what this means) or deemphasize them. I am going to deemphasize them for now. The features and capabilities of Trilinos are divided into several software units called \textit{packages}. -Each Trilinos package has a well-defined set of unique capabilities that is important for scientific or engineering applications. Packages are semi-autonomous, often having their own development team, users, and set of requirements and development principles. However, packages also follow a general set of Trilinos expectations such as having a designated point-of-contact and following the software quality expectations (e.g., sufficient documentation, continuous integration testing, clearly defined dependences, and using the Trilinos infrastructure for building and installation). +Each Trilinos package has a well-defined set of unique capabilities that is important for scientific or engineering applications. Packages are semi-autonomous, often having their own development team, users, and set of requirements and development principles. However, packages also follow a general set of Trilinos expectations such as having a designated point-of-contact and following the software quality expectations (e.g., sufficient documentation, continuous integration testing, clearly defined dependencies, and using the Trilinos infrastructure for building and installation). 
-In the following sections of this paper, we group the Trilinos packages that share common objectives (e.g., solving linear systems) together into five \textit{product areas}: Core, Linear Solvers and Preconditioners, Nonlinear Solvers and Analysis Tools, Discretization Tools, and Framework Infrastructure. These product areas we briefly described below. +In the following sections of this paper, we group the Trilinos packages that share common objectives (e.g., solving linear systems) together into five \textit{product areas}: Core, Linear Solvers and Preconditioners, Nonlinear Solvers and Analysis Tools, Discretization Tools, and Framework Infrastructure. We briefly describe these product areas below. -\paragraph{Core} Core packages cover all aspects of creating, distributing or mapping data to processing elements (cores, threads, nodes), load balancing, and redistributing data. They also include Trilinos abstractions for linear algebra data structures and algorithms and concrete implementations such as Tpetra linear algebra data structures. On a modern accelerator-based compute node, the abstractions provided by the Kokkos library become critical for Tpetra. These capabilities are described in detail in Section~\ref{sec:data_services}. +\paragraph{Core} Core packages cover all aspects of creating, distributing, and mapping data to processing elements (cores, threads, nodes), load balancing, and redistributing data. They also include Trilinos abstractions for linear algebra data structures and algorithms and concrete implementations such as Tpetra linear algebra data structures. On a modern accelerator-based compute node, the abstractions provided by the Kokkos library become critical for Tpetra. These capabilities are described in detail in Section~\ref{sec:data_services}. \paragraph{Linear Solvers and Preconditioners} The wide variety of applications that use Trilinos need a diverse set of linear solvers. 
Trilinos has support for both iterative and direct linear solvers, including interfaces to external solver packages. There are a number of preconditioner options from multithreaded or performance portable node-level preconditioners to scalable multilevel domain decomposition or multigrid preconditioners. The preconditioners and solvers use the data abstractions from the core packages. Section \ref{sec:lin_solve} provides a detailed description of these features. -\paragraph{Nonlinear Solvers and Analysis Tools} These packages provide high level algorithms for computational simulation and design. Capabilities include solvers for nonlinear equations, parameter continuation, bifurcation tracking, optimization, and uncertainty quantification. Trilinos also provides lower level utility packages to evaluate quantities of interest required by the analysis algorithms. Capabilities include automatic differentiation technology to evaluate derivatives and embedded ensemble propagation for uncertainty quantification. These packages/capabilities will be discussed further in Section~\ref{sec:nonlin_solve}. +\paragraph{Nonlinear Solvers and Analysis Tools} These packages provide high-level algorithms for computational simulation and design. Capabilities include solvers for nonlinear equations, parameter continuation, bifurcation tracking, optimization, and uncertainty quantification. Trilinos also provides lower level utility packages to evaluate quantities of interest required by the analysis algorithms. Capabilities include automatic differentiation technology to evaluate derivatives and embedded ensemble propagation for uncertainty quantification. These packages/capabilities will be discussed further in Section~\ref{sec:nonlin_solve}. \paragraph{Discretizations} This collection of packages provides functionality for the discretization of differential equations. 
In particular, it supports mesh-free and mesh-based spatial discretizations, with a focus on high-order finite elements, and time integration. Discretization tools also include cross-cutting utilities for algorithmic differentiation and for managing directed acyclic graphs of evaluation kernels. These capabilities are described further in Section~\ref{sec:discretization} in detail. diff --git a/linear_solvers.tex b/linear_solvers.tex index a2c9d47..b29d6e3 100644 --- a/linear_solvers.tex +++ b/linear_solvers.tex @@ -4,7 +4,7 @@ % %\todo{@Siva/Alexander: do this!} -Trilinos offers many linear solver capabilities: dense and sparse direct solvers, iterative solvers, shared-memory preconditioners local to a compute node, and scalable distributed memory domain decomposition and multigrid methods. Furthermore, interfaces to several third-party direct solvers are provided. The capabilities described in this section are focused on using the Tpetra software stack. All of the native solver capabilities are built on top of Kokkos and are GPU capable to varying degrees; any exceptions are noted in the detailed descriptions below. +Trilinos offers many linear solver capabilities: dense and sparse direct solvers, iterative solvers, shared-memory preconditioners local to a compute node, and scalable distributed memory domain decomposition and multigrid methods. Furthermore, interfaces to several third-party direct solvers are provided. All of the native solver capabilities are built on top of Kokkos and are GPU capable to varying degrees; any exceptions are noted in the detailed descriptions below. @@ -27,7 +27,7 @@ \subsection{One-Level Domain Decomposition and Basic Iterative Methods: Ifpack2} packages: Ifpack2 and ShyLU (specifically the ShyLU\_DD subpackage). Ifpack2 implements overlapping additive Schwarz approaches with several options for the local subdomain solves. 
The -local subdomain solvers may either be CPU-only versions of incomplete +local subdomain solvers may be either CPU-only versions of incomplete factorization preconditioners implemented in Ifpack2 itself, such as ILU(k) and ILUt (thresholded ILU), or architecture portable algorithms for incomplete factorizations and triangular solvers implemented in @@ -41,9 +41,9 @@ \subsection{One-Level Domain Decomposition and Basic Iterative Methods: Ifpack2} reduction in the number of iterations, or when the underlying problem is simply not amenable to multilevel methods. -Ifpack2 also supplies classic iterative methods based on matrix-splitting techniques, such as Jacobi iteration, Gauss-Seidel, and an MPI-oriented hybrid of Jacobi and Gauss-Seidel (e.g., Jacobi between ranks and Gauss-Seidel on them). Ifpack also provides preconditioners +Ifpack2 also supplies classic iterative methods based on matrix-splitting techniques, such as Jacobi iteration, Gauss-Seidel, and an MPI-oriented hybrid of Jacobi and Gauss-Seidel (e.g., Jacobi between ranks and Gauss-Seidel on them). Ifpack2 also provides preconditioners based on Chebyshev iterations. The aforementioned preconditioners are available both in point and block -forms and can operate on CSR or BSR matrices. In the block case, +forms and can operate on CSR and BSR matrices. In the block case, line relaxation is also supported, while in the point case, techniques like Vanka relaxation \cite{Vanka1986} are possible. Auxiliary-space smoothing for $H(curl)$ and $H(div)$ discretizations of the style of @@ -55,7 +55,7 @@ \subsection{Multilevel Domain Decomposition Methods: FROSch} \label{ssec:frosch} FROSch (Fast and Robust Overlapping Schwarz) is a framework for the construction of multilevel Schwarz domain decomposition preconditioners. Besides parallel scalability, FROSch emphasizes applicability and robustness across a wide range of challenging problems, while supporting an algebraic construction. 
Specifically, most preconditioners can be built using only the fully assembled system matrix, though some variants can take advantage of additional geometric inputs. The algebraic construction is enabled by the creation of an overlapping domain decomposition on the first level based on the sparsity pattern of the system matrix, similar to Ifpack2, along with the incorporation of extension-based coarse spaces, such as in the classical two-level generalized Dryja--Smith--Widlund (GDSW) preconditioner~\cite{dohrmann_domain_2008} and related variants. -While the initial version of FROSch~\cite{heinlein_parallel_2016} was based on the outdated Epetra linear algebra framework, the current implementation~\cite{heinlein_frosch_2020} leverages Xpetra. Over the years, Xpetra facilitated compatibility with both the Epetra and Tpetra linear algebra stacks via a lightweight interface but now exclusively provides access to the Tpetra stack. Algorithmic variants of Schwarz methods implemented in FROSch include: +While the initial version of FROSch~\cite{heinlein_parallel_2016} was based on the outdated Epetra linear algebra framework, the current implementation~\cite{heinlein_frosch_2020} leverages Xpetra. Originally designed as a lightweight wrapper around Epetra and Tpetra, the Xpetra package has, over the years, facilitated compatibility with both the Epetra and Tpetra stacks. Now that the Epetra stack is phased out, it provides access to only the Tpetra stack. Algorithmic variants of Schwarz methods implemented in FROSch include: \begin{itemize} \item \emph{Extension-based coarse spaces based on a partition of unity on the interface}, such as classical GDSW, reduced dimension GDSW (RGDSW) coarse spaces, and multiscale finite element method (MsFEM) coarse spaces; cf.~\cite{heinlein_parallel_2016,heinlein_improving_2018}; \item \emph{Monolithic Schwarz preconditioners} for block systems; cf.~\cite{heinlein_monolithic_2019}.
@@ -64,21 +64,21 @@ \subsection{Multilevel Domain Decomposition Methods: FROSch} FROSch has been applied to various challenging application problems, including scalar elliptic and elasticity problems~\cite{heinlein_parallel_2016}, possibly with heterogeneities~\cite{alves2024computationalstudyalgebraiccoarse}, computational fluid dynamics problems~\cite{heinlein_monolithic_2019}, time-harmonic Maxwell's and fluid-structure interaction problems~\cite{heinlein2024couplingdealiifroschsustainable}, pharmaco-mechanical interactions in arterial walls~\cite{balzani_computational_nodate}, and coupled multiphysics problems for land ice simulations~\cite{heinlein_frosch_2022}; the latter three have been solved using monolithic preconditioning techniques. To extend robustness for heterogeneous model problems, an implementation of spectral coarse spaces~\cite{heinlein_adaptive_2019} is currently under development. FROSch preconditioners have scaled to more than 200$k$ cores on the Theta Cray XC40 supercomputer at the Argonne Leadership Computing Facility (ALCF); cf.~\cite{heinlein_parallel_2022}. -In its current implementation, FROSch assumes a one-to-one correspondence of subdomains and MPI ranks, however, due to an interface to the other solver packages in Trilinos, inexact subdomain solvers can be employed on subdomains. An extension to multiple subdomains per MPI rank is currently being implemented. Using Kokkos and Kokkos Kernels, FROSch has recently also been ported to GPUs~\cite{yamazaki_experimental_2023} with performance gains for the triangular solve or inexact solves with ILU on GPUs. +In its current implementation, FROSch assumes a one-to-one correspondence of subdomains and MPI ranks. Using an interface to the other solver packages in Trilinos, inexact subdomain solvers can be employed on subdomains. An extension to multiple subdomains per MPI rank is currently being implemented. 
Using Kokkos and Kokkos Kernels, FROSch has recently also been ported to GPUs~\cite{yamazaki_experimental_2023} with performance gains for the triangular solve or inexact solves with ILU on GPUs. A demo/tutorial for FROSch can be found at the GitHub repository~\cite{frosch_demo}. \subsection{Multigrid Methods: MueLu} MueLu is a flexible and scalable high-performance multigrid solver library. -It provides a variety of multigrid algorithms for problems ranging from Poisson-like operators over elasticity, convection-diffusion, and Navier-Stokes, and Maxwell’s equations +It provides a variety of multigrid algorithms for problems ranging from Poisson-like operators, through elasticity, convection-diffusion, Navier-Stokes, and Maxwell’s equations, all the way to multigrid methods for coupled multiphysics systems. Besides its strong focus on aggregation-based algebraic multigrid (AMG) methods, MueLu comes with specialized capabilities for (semi-)structured grids to perform semi-coarsening along grid lines, yet forming the coarse operator via a Galerkin product (in contrast to classical geometric multigrid methods). MueLu is extensible and allows for the research and development of new multigrid preconditioning methods. -Its weak and strong scalability even for vector-valued partial differential equations (PDEs) on unstructured meshes -up to 131,000 cores of a Cray XC40 and one million cores of a Blue Gene/Q system have been shown in~\cite{Lin2017a,Thomas2019a}. +Its weak and strong scalability, even for vector-valued partial differential equations (PDEs) on unstructured meshes +up to 131,000 cores of a Cray XC40 and one million cores of a Blue Gene/Q system, has been shown in~\cite{Lin2017a,Thomas2019a}. MueLu provides several approaches to constructing and solving the multilevel problem: @@ -139,7 +139,7 @@ \subsection{Multigrid Methods: MueLu} on the Amesos2, Ifpack2, and Zoltan2 libraries.
MueLu also supports interfaces to abstraction layer packages such as Stratimikos and Thyra through the MueLu adapters library. These interfaces are required to use MueLu with the Teko block preconditioning package. -For more details on using MueLu with Teko, see the MueLu examples directory in the Trilinos source code repository. +For more details on using MueLu with Teko, see the MueLu examples directory. \subsection{Direct Linear Solvers: Amesos2, ShyLU} @@ -147,9 +147,9 @@ \subsection{Direct Linear Solvers: Amesos2, ShyLU} These include third-party direct solvers such as CHOLMOD, MUMPS, Pardiso\_MKL, SuperLU, SuperLU\_MT, SuperLU\_Dist, and STRUMPACK. Furthermore, Amesos2 also provides the interface to two native Trilinos on-node sparse direct solvers, Basker and Tacho implemented in ShyLU. Basker~\cite{Basker2017} is a sparse direct solver based on LU factorization for the problems that have the block triangular form (BTF) typically seen in circuit simulation applications. Basker uses these structures to factor and solve the diagonal blocks in parallel. The larger diagonal blocks can themselves be factored in parallel by discovering the parallelism available using a nested-dissection reordering. Basker focuses on exploiting thread-parallelism on the multi-core CPU architectures. %However, we have still implemented the solver using Kokkos. -Amesos2 also has a templated implementation of the sequential KLU solver called KLU2, which also exploits the BTF structure. +Amesos2 also has a templated implementation of a sequential KLU solver called KLU2, which also exploits the BTF structure. -Tacho is a sparse direct solver that exploits the supernodal block structures, commonly found in the sparse direct factorization of the matrices from mechanics applications. Tacho exploits this supernodal structure for both factorization and triangular solve phases. It is based on Kokkos, and hence it is portable to different node architectures (including NVIDIA or AMD GPUs). 
Originally, Tacho implemented task-parallel Cholesky of a sparse symmetric positive definite (SPD) matrix~\cite{Tacho2018}. However, to improve its portability, it has been extended to compute the sparse factorization based on level-set scheduling. Moreover, its functionality has been extended to compute LDLt factorization of symmetric indefinite matrix and LU factorization of a general matrix with a symmetric sparsity structure. +Tacho is a sparse direct solver that exploits supernodal block structure commonly found in sparse direct factorizations of matrices from mechanics applications. Tacho exploits this supernodal structure for both factorization and triangular-solve phases. It is based on Kokkos. Originally, Tacho implemented a task-parallel Cholesky factorization of sparse symmetric positive definite (SPD) matrices~\cite{Tacho2018}. However, to improve its portability, it has been extended to compute the sparse factorization based on level-set scheduling. Moreover, its functionality has been extended to the computation of an LDLt factorization of symmetric indefinite matrices, as well as the computation of an LU factorization of general matrices with a symmetric sparsity structure. In addition to their stand-alone use, the aforementioned node-level solvers may be used as the local solvers for domain decomposition preconditioners (Ifpack2 or FROSch) or as the coarse solvers for multilevel preconditioners (MueLu or FROSch). @@ -160,11 +160,11 @@ \subsection{Physics Block Operators and Preconditioners: Teko} \label{sec:teko} The Teko library~\cite{Cyr2016a} provides interfaces for operators and preconditioners that are constructed from large physics-based sub-blocks. -The sub-blocks are Thyra operators which themselves often are implemented using Tpetra matrices. 
-Generic block preconditioning strategies such as Jacobi and Gauss-Seidel and commonly used approximate inverse strategies for the Navier-Stokes equation such as SIMPLEC, LSC and PCD are provided \cite{CyrShadidEtAl2012_StabilizationScalableBlockPreconditioning}. -More complicated multilevel hierarchy of block solvers can be generated via Teuchos ParameterLists. -Block preconditioners for first order formulations of Maxwell's equations and Darcy flow are distributed with the Panzer package. -Teko's solvers can be registered in the Stratimikos interface for entirely ParameterList driven usage. +The sub-blocks are Thyra operators which are themselves often implemented using Tpetra matrices. +Generic block preconditioning strategies, such as the Jacobi and Gauss-Seidel strategies, as well as commonly used approximate inverse strategies for the Navier-Stokes equation such as SIMPLEC, LSC and PCD, are provided \cite{CyrShadidEtAl2012_StabilizationScalableBlockPreconditioning}. +More complicated multilevel hierarchies of block solvers can be generated via \code{Teuchos::ParameterList} objects. +Block preconditioners for first-order formulations of Maxwell's equations and Darcy flow are implemented in the Panzer package. +Teko's solvers can be registered in the Stratimikos interface for usage entirely driven by \code{Teuchos::ParameterList} objects. \subsection{Eigensolvers: Anasazi} @@ -175,15 +175,15 @@ \subsection{Eigensolvers: Anasazi} are provided for Tpetra and Thyra, while users can also implement their own interfaces to leverage any existing investment in their description of matrices and vectors. Any libraries that understand Tpetra and Thyra matrices and vectors, like Belos and Ifpack2, may also be used in conjunction with Anasazi. 
The suite of eigensolvers provided -by Anasazi includes locally-optimal block preconditioned conjugate gradient (LOBPCG), block Davidson, Riemannian Trust-Region +by Anasazi includes locally optimal block preconditioned conjugate gradient (LOBPCG), block Davidson, Riemannian Trust-Region (RTR), and block Krylov-Schur. Recently, there has been a family of trace minimization (TraceMin) methods and a generalized Davidson method added to the suite of eigensolvers in Anasazi. \subsection{Unified Solver Interface: Stratimikos} -The Stratimikos package provides a unified interface to linear solvers and preconditioners in Trilinos (e.g., from Amesos2, Belos, FROSch, Ifpack2, MueLu, Teko). -The matrix as well as right-hand side and solution vectors are required to support the Thyra interface. +The Stratimikos package provides a unified interface to linear solvers and preconditioners in Trilinos (e.g., to Amesos2, Belos, FROSch, Ifpack2, MueLu, and Teko). +The matrix as well as the right-hand side and the solution vectors are required to support the Thyra interface. Wrappers for Tpetra linear algebra are provided by Thyra. Solver and preconditioner parameters are specified via a \code{Teuchos::ParameterList}, which users can easily populate from an xml file. diff --git a/main.tex b/main.tex index 95a1db7..9f5c2fd 100644 --- a/main.tex +++ b/main.tex @@ -342,7 +342,7 @@ \section{Concluding remarks} \section{Acknowledgment} -Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. +Sandia National Laboratories is a multimission laboratory managed and operated by National Technology \& Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. 
Department of Energy’s National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government. diff --git a/nonlinear_solvers.tex b/nonlinear_solvers.tex index 9c0b092..8271367 100644 --- a/nonlinear_solvers.tex +++ b/nonlinear_solvers.tex @@ -1,23 +1,23 @@ % !TEX root = main.tex -The packages included in this section provides the top level algorithms for a computational simulation or design study. +The packages included in this section provide the top-level algorithms for computational simulations and design studies. These include nonlinear solvers, bifurcation tracking, stability analysis, parameter continuation, optimization, and uncertainty quantification. -A common theme of this collection is the philosophy of ``analysis beyond simulation'', which aims to automate many computational tasks that are often performed by application code users by trial-and-error or repeated simulation. Tasks that can be automated include performing parameter studies, sensitivity analysis, calibration, optimization, and locating instabilities. -Additionally, utilities for the nonlinear analysis include automatic differentiation tools that can provide the derivatives critical to the analysis algorithms and the abstraction layers and interfaces for application callbacks. +A common theme of this collection is the philosophy of ``analysis beyond simulation'', which aims to automate many computational tasks that are often performed by application code users by trial-and-error or repeated simulation. Tasks that can be automated include performing parameter studies, sensitivity analysis, calibration, optimization, and the task of locating instabilities. 
+Additionally, utilities for nonlinear analysis include abstraction layers and interfaces for application callbacks, as well as automatic differentiation tools that can provide the derivatives critical to analysis algorithms. \subsection{Nonlinear Solvers: NOX, LOCA} \label{sec:nox} %\paragraph{NOX} %Short for , -NOX (Nonlinear Object-oriented Solutions) provides robust and efficient algorithms for solving sets of nonlinear equations. +NOX (Nonlinear Object-oriented Solutions) provides robust and efficient algorithms for solving systems of nonlinear equations. NOX implements a number of Newton-based globalization techniques including line search \cite{Pawlowski2006}, trust region \cite{Pawlowski2006,Pawlowski2008} and homotopy algorithms \cite{Coffey2003}. Additionally, it provides lower- and higher-order models including Broyden, Anderson acceleration \cite{Walker2011}, and tensor methods \cite{Bader2005}. -The algorithms have been designed for large-scale parallel inexact linear solvers using Krylov methods and supports Jacobian-free Newton-Krylov variants. +The algorithms have been designed for large-scale parallel inexact linear solvers using Krylov methods and support Jacobian-free Newton-Krylov variants. The library interacts with application codes through the Thyra model evaluator interface. NOX provides linear algebra abstractions for applications to use custom implementations with the algorithms. The library additionally contains abstractions for solvers, directions and line searches that allow users to further customize the algorithms. NOX provides various stopping criteria including absolute, relative, and weighted root mean square norms, as well as stagnation and NaN detection. Applications can build custom stopping criteria within a tree-based logical structure and provide additional criteria through the StatusTest abstraction. -Nox also provide capabilities for continuation and stability analysis through the subpackage LOCA. 
+NOX also provides capabilities for continuation and stability analysis through the subpackage LOCA. %Short for the LOCA (Library of Continuation Algorithms)~\cite{Salinger2005} @@ -30,7 +30,7 @@ \subsection{Nonlinear Solvers: NOX, LOCA} \label{sec:nox} \subsection{Numerical Optimization: ROL} Rapid Optimization Library (ROL) \cite{rol,ROL2022ICCOPT} is the -Trilinos library for numerical optimization. ROL brings an extensive +Trilinos package for numerical optimization. ROL brings an extensive collection of state-of-the-art optimization algorithms to virtually any application. Its programming interface supports any computational hardware, including heterogeneous many-core systems with digital and @@ -127,9 +127,9 @@ \subsection{Numerical Optimization: ROL} \subsection{Automatic Differentiation: Sacado} \label{sec:sacado} -Sacado \cite{SacadoURL,phipps2012efficient,phipps2008large} provides forward and reverse-mode operator overloading-based automatic differentiation (AD) tools within Trilinos. +Sacado \cite{SacadoURL,phipps2012efficient,phipps2008large} provides forward and reverse-mode operator-overloading-based automatic differentiation (AD) tools within Trilinos. %The package provides both forward and reverse-mode AD data types. -Sacado's forward AD tools have been integrated into Kokkos and have demonstrated good performance on GPU architectures~\cite{phipps2022automatic}. +Sacado's forward AD tools have been integrated with Kokkos and have demonstrated good performance on GPU architectures~\cite{phipps2022automatic}. 
Sacado, along with its Kokkos integration, provides high-performance derivative capabilities to numerous Office of Science and NNSA extreme scale applications, including Albany for solid mechanics and land ice modeling~\cite{Salinger2016,MPASAlbany2018}, Charon for semiconductor device modeling~\cite{CharonUsersManual2020} and multiphase chemically reacting flows~\cite{Musson2009}, Drekar for computational fluid dynamics (CFD)~\cite{Sondak2021,Shadid2016}, magnetohydrodynamics~\cite{Shadid2016mhd} and plasma physics~\cite{Crockatt2022,Miller2019}, Xyce for electronic circuit simulation~\cite{xyceTrilinos,xycePCE}, and SPARC for hypersonic fluid flows~\cite{SparcValidation}. @@ -137,20 +137,20 @@ \subsection{Automatic Differentiation: Sacado} \label{sec:sacado} \subsection{Uncertainty Quantification: Stokhos} Stokhos~\cite{phipps2015stokhos,Phipps2016,phipps2014exploring} provides implementations of two intrusive uncertainty quantification strategies: -the intrusive stochastic Galerkin uncertainty quantification method~\cite{ghanem1990polynomial,ghanem2003stochastic} and the embedded ensemble propagation~\cite{phipps2017embedded}. +the intrusive stochastic Galerkin uncertainty quantification method~\cite{ghanem1990polynomial,ghanem2003stochastic} and the embedded ensemble propagation method~\cite{phipps2017embedded}. -For the first one, Stokhos provides methods for computing intrusive stochastic Galerkin projections such as Polynomial Chaos and Generalized Polynomial Chaos, -interfaces for forming the resulting nonlinear systems, and linear solver methods for solving block stochastic Galerkin linear systems. -The implementation targets GPU performances using Kokkos and by commuting the layout of the Galerkin operator to be outer-spatial and inner-stochastic~\cite{phipps2014exploring}. 
-The stochastic Galerkin implementation of Stokhos has been used in~\cite{constantine2014efficient} to efficiently propagate uncertainty in multiphysics systems by reducing the full system with a nonlinear elimination method. +Stokhos's implementation of the intrusive stochastic Galerkin uncertainty quantification method allows stochastic projections to be computed, such as Polynomial Chaos and Generalized Polynomial Chaos expansions. +The implementation includes interfaces for forming the linear and/or nonlinear systems that follow from the stochastic Galerkin projection, as well as linear solver methods that can exploit the block structure of these systems. +GPU performance is targeted by using Kokkos and by commuting the layout of the Galerkin operator to be outer-spatial and inner-stochastic~\cite{phipps2014exploring}. +The implementation has been used in~\cite{constantine2014efficient} to efficiently propagate uncertainty in multiphysics systems by reducing the full system with a nonlinear elimination method. The embedded ensemble propagation consists in propagating a subset of samples gathered into a so-called ensemble through the forward simulation at once. -It builds on~\cite{pawlowski2012automating} for automating embedded analysis capabilities; Stokhos defines an ensemble type, a SIMD data type, that is able to store -the values of the input, output, and state variables for every sample of an ensemble. This type can then be used in the Tpetra solver stack as a template argument for the scalar type. +It builds on~\cite{pawlowski2012automating} for automating embedded analysis capabilities. Stokhos defines an ensemble type, a SIMD data type, that is able to store +the values of input, output, and state variables for every sample of an ensemble. This type can then be used in the Tpetra solver stack as a template argument for the scalar type. 
This approach saves computation time in four ways: the sample-independent data and computation can be reused for every sample of an ensemble, the memory access pattern is improved, the operations on the ensemble type can be vectorized efficiently, and the message passing costs are reduced by sending fewer but larger messages. However, the approach requires solvers and BLAS functions to be aware of the extra dimension associated with the ensemble; for example, a GMRES for ensemble types~\cite{liegeois2020gmres} needs to monitor -the convergence of the individual sample in order to decide when to stop based on the union of the information. +the convergence of the individual samples in order to decide when to stop based on the union of the information. \subsection{Nonlinear Analysis Tools: Piro} %Piro~\cite{osti_1231283} is the top-level, unifying package for nonlinear analysis. @@ -162,8 +162,8 @@ \subsection{Nonlinear Analysis Tools: Piro} %The main purpose of the package is to provide driver classes for the common uses of Trilinos nonlinear analysis tools. In particular, Piro implements main driver classes for: \begin{itemize} - \item \emph{Linear/nonlinear solvers and sensitivity analysis} for transient and nontransient problems. As an example, this capability can be used to compute the discrete solution $u$ of a partial differential equation depending on some parameter $p$, and compute the sensitivity of a quantity of interest of the solution with respect to the parameter $p$. Sensitivities can be computed in a forward or adjoint fashion, the latter being preferable in presence of high-dimensional parameters. This capability relies on NOX for the solution of nonlinear problems and Tempus for transient problems. - \item \emph{Constrained optimization problems} with linear/nonlinear equality constraints: Piro provides tools for transient and nontransient (snapshot) optimizations, featuring gradient-based reduced-space and full-space methods.
This capability is used by applications such as Albany to perform large-scale PDE-constrained optimization problems \cite{Perego2022}. Piro interfaces with ROL for algorithms to solve constrained optimization problems. + \item \emph{Linear/nonlinear solvers and sensitivity analysis} for transient and nontransient problems. As an example, this capability can be used to compute the discrete solution of a partial differential equation depending on a parameter and the sensitivity of a quantity of interest with respect to this parameter. Sensitivities can be computed in a forward or adjoint fashion, the latter being preferable in the presence of high-dimensional parameters. This capability relies on NOX for the solution of nonlinear problems and Tempus for transient problems. + \item \emph{Constrained optimization problems} with linear/nonlinear equality constraints: Piro provides tools for transient and nontransient (snapshot) optimizations, featuring gradient-based reduced-space and full-space methods. This capability is used by applications such as Albany to solve large-scale PDE-constrained optimization problems \cite{Perego2022}. Piro interfaces with ROL for algorithms to solve constrained optimization problems. \item \emph{Parameter Continuation and Bifurcation analysis:} this capability is provided through the package LOCA. \end{itemize} These driver classes share a similar interface based on the \code{Thyra::ModelEvaluator} and the \code{Teuchos::ParameterList} classes. Applications define the problems to be targeted (e.g., the equations to be solved, the parameters, the quantities of interest) by providing a concrete implementation of the \code{Thyra::ModelEvaluator}; see \cite{pawlowski2012automating,pawlowski2012automatingpart2}.