Update paper.md
RMeli authored and amkrajewski committed Sep 24, 2024
1 parent 2c70b19 commit 3408070
Showing 1 changed file with 7 additions and 5 deletions: paper/paper.md
@@ -26,10 +26,12 @@ authors:
orcid: 0000-0003-3346-3696
affiliation: 1
affiliations:
- - name: Department of Materials Science and Engineering, The Pennsylvania State University, USA
+ - name: Department of Materials Science and Engineering, The Pennsylvania State University, United States of America
index: 1
- - name: Institute for Computational and Data Sciences, The Pennsylvania State University, USA
+ ror: 04p491231
+ - name: Institute for Computational and Data Sciences, The Pennsylvania State University, United States of America
index: 2
+ ror: 04p491231
date: 14 September 2023
#bibliography: [paper/paper.bib]
bibliography: paper.bib
@@ -40,16 +42,16 @@ bibliography: paper.bib

`nimCSO` is a high-performance tool implementing several methods for selecting components (data dimensions) of compositional datasets so as to optimize data availability and density for applications such as machine learning. Making that choice is a combinatorially hard problem for complex compositions existing in high-dimensional spaces, because the components present are interdependent. Such spaces are encountered across many scientific disciplines (see [Statement of Need](#statement-of-need)), including materials science, where datasets on Compositionally Complex Materials (CCMs) often span 20-45 chemical elements, 5-10 processing types, and several temperature regimes, for up to 60 total data dimensions. The challenge also exists in everyday contexts, such as the study of cooking ingredients [@Ahn2011], which interact in various recipes, giving rise to questions like *"Given 100 spices at the supermarket, which 20, 30, or 40 should I stock in my pantry to maximize the number of unique dishes I can spice according to recipe?"*. Critically, this is not as simple as frequency-based selection because, e.g., removing the less common nutmeg and cinnamon from your shopping list will rule out many recipes that also call for the frequent vanilla, but won't affect those using black pepper [@Edmisten2022].
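
To make the combinatorial structure of that question concrete, the toy sketch below (an editor's illustration in plain Python with made-up data, not `nimCSO` code) exhaustively scores every fixed-size subset of components by how many data points it fully covers:

```python
# Editor's toy illustration of the selection problem (assumed data; not nimCSO code).
# Each data point is the set of components it uses; keeping a subset of
# components retains only the points fully covered by that subset.
from itertools import combinations

data = [{"A", "B"}, {"A", "C"}, {"B", "C", "D"}, {"A", "B"}, {"C", "D"}]

def coverage(kept):
    """Count data points whose components all fall within `kept`."""
    return sum(point <= kept for point in data)

components = sorted(set().union(*data))
# Exhaustive search is fine for 4 components, but the number of k-subsets
# grows as C(n, k), which is why smarter search is needed at high dimensions.
best = max(combinations(components, 3), key=lambda c: coverage(set(c)))
print(best, coverage(set(best)))   # ('A', 'B', 'C') covers 3 of 5 points
```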

- At its core, `nimCSO` leverages the metaprogramming ability of the Nim language [@Rumpf2023] to optimize itself at compile time, both in terms of speed and memory handling, to the specific problem statement and dataset at hand based on a human-readable configuration file. As demonstrated in the [Methods and Performance](#methods-and-performance) section, `nimCSO` reaches the physical limits of the hardware (L1 cache latency) and can outperform an efficient native Python implementation over 100 times in terms of speed and 50 times in terms of memory usage (*not* counting interpreter), while also outperforming NumPy implementation 37 and 17 times, respectively, when checking a candidate solution.
+ At its core, `nimCSO` leverages the metaprogramming ability of the Nim language [@Rumpf2023] to optimize itself at compile time, both in terms of speed and memory handling, to the specific problem statement and dataset at hand based on a human-readable configuration file. As demonstrated in the [Methods and Performance](#methods-and-performance) section, `nimCSO` reaches the physical limits of the hardware (L1 cache latency) and can outperform an efficient native Python implementation over 100 times in terms of speed and 50 times in terms of memory usage (*not* counting the interpreter), while also outperforming the NumPy implementation 37 and 17 times, respectively, when checking a candidate solution.
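
The candidate check itself is simple enough to sketch: if each data point is encoded as a bitmask over a fixed component order, testing a solution reduces to one AND and one comparison per point, which is what lets cache-resident implementations approach hardware limits. Below is a rough, hypothetical Python rendering of that idea (element names and structure are illustrative assumptions; `nimCSO`'s actual internals are compile-time-optimized Nim):

```python
# Hypothetical sketch of a bitmask-based candidate check (editor's illustration;
# nimCSO's actual implementation is compile-time-optimized Nim, not this Python).
ELEMENTS = ["Al", "Cr", "Fe", "Ni", "Ti"]            # fixed component order
INDEX = {el: i for i, el in enumerate(ELEMENTS)}

def to_mask(point):
    """Encode a data point's components as set bits of a single integer."""
    mask = 0
    for el in point:
        mask |= 1 << INDEX[el]
    return mask

masks = [to_mask(p) for p in ({"Al", "Cr"}, {"Fe", "Ni", "Ti"}, {"Al", "Ni"})]

def count_surviving(removed):
    """A point survives removal of `removed` components iff it shares no bit
    with them, so each check is one AND and one comparison per data point."""
    rm = to_mask(removed)
    return sum((m & rm) == 0 for m in masks)

print(count_surviving({"Ti"}))   # 2 -- only the Fe-Ni-Ti point is lost
```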

- `nimCSO` is designed to be both (1) a user-ready tool, implementing two efficient brute-force approaches (for handling up to 25 dimensions), a custom search algorithm (for up to 40 dimensions), and a genetic algorithm (for any dimensionality), and (2) a scaffold for building even more elaborate methods in the future, including heuristics going beyond data availability. All configuration is done with a simple human-readable `YAML` config file and plain text data files, making it easy to modify the search method and its parameters with no knowledge of programming and only basic command line skills.
+ `nimCSO` is designed to be both (1) a user-ready tool, implementing two efficient brute-force approaches (for handling up to 25 dimensions), a custom search algorithm (for up to 40 dimensions), and a genetic algorithm (for any dimensionality), and (2) a scaffold for building even more elaborate methods in the future, including heuristics going beyond data availability. All configuration is done with a simple human-readable `YAML` file and plain text data files, making it easy to modify the search method and its parameters with no knowledge of programming and only basic command line skills.
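
For orientation, such a configuration might look roughly like the sketch below. This is an editor's mock-up: the field names are illustrative assumptions, not `nimCSO`'s documented schema, so consult the project documentation for the real keys.

```yaml
# Illustrative mock-up only -- field names are assumptions, not the real schema.
taskName: rheaElementSelection
taskDescription: "Pick elements maximizing usable RHEA data."
datasetPath: dataset.csv            # plain-text list of compositions
elementOrder: [Al, Cr, Fe, Hf, Mo, Nb, Ni, Ta, Ti, V, W, Zr]
```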


# Statement of Need

`nimCSO` is an interdisciplinary tool applicable to any field where the data is composed of a large number of independent components whose interactions matter to a modeling effort. This ranges from economics, where factor selection affects the performance of analytical [@Fan2013] and ML [@Peng2021] models, through medicine, where drug interactions can have a significant impact on treatment [@Maher2014] (an escalating problem [@Guthrie2015]) and understanding microbial interactions can help fight gastrointestinal problems [@Leeuwen2023; @Berg2022], to materials science, where composition and processing history are critical to the resulting properties. The latter has been the root motivation for developing `nimCSO` within the [ULTERA Project (ultera.org)](https://ultera.org), carried out under the [US DOE ARPA-E ULTIMATE](https://arpa-e.energy.gov/?q=arpa-e-programs/ultimate) program, which aims to develop a new generation of ultra-high temperature materials for aerospace applications through generative machine learning models [@Debnath2021] driving thermodynamic modeling, alloy design, and manufacturing [@Li2024].

- Among the most promising materials for such applications are the aforementioned CCMs and their metal-focused subset of Refractory High Entropy Alloys (RHEAs) [@Senkov2018], a class that has grown rapidly since first proposed by [@Cantor2004] and [@Yeh2004]. Unlike most traditional alloys, they contain many chemical elements (typically 4-9) in similar proportions, in the hope of thermodynamically stabilizing the material by increasing its configurational entropy ($\Delta S_{conf} = -R \sum_i^N x_i \ln{x_i}$ for ideal mixing of $N$ elements with fractions $x_i$), which encourages sampling from a large palette of chemical elements. At the time of writing, the ULTERA Database is the largest collection of HEA data, containing over 7,000 points manually extracted from 560 publications. It covers 37 chemical elements, resulting in extremely large compositional spaces [@Krajewski2024Nimplex]; thus, it becomes critical to answer questions like *"Which combination of how many elements will unlock the most expansive and simultaneously dense dataset?"*, which has $2^{37}-1$, or about 137 billion, possible solutions.
+ Among the most promising materials for such applications are the aforementioned CCMs and their metal-focused subset of Refractory High Entropy Alloys (RHEAs) [@Senkov2018], a class that has grown rapidly since first proposed by @Cantor2004 and @Yeh2004. Unlike most traditional alloys, they contain many chemical elements (typically 4-9) in similar proportions, in the hope of thermodynamically stabilizing the material by increasing its configurational entropy ($\Delta S_{conf} = -R \sum_i^N x_i \ln{x_i}$ for ideal mixing of $N$ elements with fractions $x_i$), which encourages sampling from a large palette of chemical elements. At the time of writing, the ULTERA Database is the largest collection of HEA data, containing over 7,000 points manually extracted from 560 publications. It covers 37 chemical elements, resulting in extremely large compositional spaces [@Krajewski2024Nimplex]; thus, it becomes critical to answer questions like *"Which combination of how many elements will unlock the most expansive and simultaneously dense dataset?"*, which has $2^{37}-1$, or about 137 billion, possible solutions.
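
As a quick worked case of that entropy term (using the standard ideal-mixing form with the gas constant $R$): for an equimolar alloy of $N$ elements, $x_i = 1/N$, so $\Delta S_{conf} = -R \sum_i^N \frac{1}{N} \ln{\frac{1}{N}} = R \ln{N}$, giving $R \ln{5} \approx 13.4\,\mathrm{J\,mol^{-1}\,K^{-1}}$ for a five-element alloy; this is why sampling from a large palette of elements in near-equal fractions raises the entropic stabilization.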

Another significant intended use is performing similar optimizations over large datasets (many millions of points) of quantum mechanics calculations, spanning 93 chemical elements and accessible through the OPTIMADE API [@Evans2024].

