diff --git a/.github/workflows/documentation.yaml b/.github/workflows/documentation.yaml
index 1ada69c..2ba6727 100644
--- a/.github/workflows/documentation.yaml
+++ b/.github/workflows/documentation.yaml
@@ -35,6 +35,8 @@ jobs:
           nim doc --project --index:on --outdir:docs --git.url:https://github.com/amkrajewski/nimCSO --git.commit:main src/nimcso
           sed -i '0,/src\/nimcso/s//nimCSO/;0,/src\/nimcso/s//nimCSO/' docs/nimcso.html
           cp docs/nimcso.html docs/index.html
+          mkdir -p docs/assets
+          cp -r paper/assets/. docs/assets/

       - name: Setup Pages
         uses: actions/configure-pages@v4
diff --git a/docs/docs.nim b/docs/docs.nim
index 95f8a2e..7a557c7 100644
--- a/docs/docs.nim
+++ b/docs/docs.nim
@@ -1,22 +1,34 @@
 ## **Navigation:** [nimCSO](nimcso.html) (core library) | [Changelog](docs/changelog.html) | [nimcso/bitArrayAutoconfigured](nimcso/bitArrayAutoconfigured.html)

-##
-## **nim** **C**omposition **S**pace **O**ptimization is a high-performance, low-level tool for selecting sets of components (dimensions) in compositional spaces, which optimize the data availability
-## given a constraint on the number of components to be selected. Ability to do so is crucial for deploying machine learning (ML) algorithms, so that they can be designed in a way balancing their
-## accuracy and domain of applicability. Howerver, this becomes a combinatorically hard problem for complex compositions existing in highly dimensional spaces due to the interdependency of components
-## being present. For instance, removing datapoints many low-frequency components
-##
-##
-##
-## Such spaces are often encountered in materials science, where datasets on Compositionally Complex Materials (CCMs) often span 20-40 chemical elements, while each data point contains
-## several of them.
-##
-##
-##
-##
-## This tool employs a set of methods, ranging from (1) brute-force search through (2) genetic algorithms to (3) a newly designed search method. They use custom data structures and procedures written in Nim language, which are compile-time optimized for the specific problem statement and dataset pair, which allows nimCSO to run faster and use 1-2 orders of magnitude less memory than general-purpose data structures. All configuration is done with a simple human-readable config file, allowing easy modification of the search method and its parameters.
-##
-##
+
+## **nim** **C**omposition **S**pace **O**ptimization is a high-performance tool implementing several methods for selecting components (data dimensions) in compositional datasets, which
+## optimize the data availability and density for applications such as machine learning (ML) given a constraint on the number of components to be selected. The ability to do so is crucial for
+## deploying ML algorithms, so that they can be designed in a way balancing their accuracy and domain of applicability. Making said choice is a combinatorially hard
+## problem when data is composed of a large number of components, since the components present in each data point are interdependent. Thus, the efficiency of the search becomes critical for any
+## application where interaction between components is of interest in a modeling effort, ranging from market economics, through medicine, where drug interactions can have a significant
+## impact on the treatment, to materials science, where the composition and processing history are critical to the resulting properties.
+##
+## We are particularly interested in the latter case of materials science, where we utilize `nimCSO` to optimize ML deployment over our datasets on Compositionally Complex Materials (CCMs),
+## which are the largest ever collected (from almost 550 publications), span up to 60 dimensions, and were developed within the [ULTERA Project (ultera.org)](https://ultera.org) carried out under the
+## [US DOE ARPA-E ULTIMATE](https://arpa-e.energy.gov/?q=arpa-e-programs/ultimate) program, which aims to develop
+## a new generation of ultra-high temperature materials for aerospace applications, through generative machine learning models [10.20517/jmi.2021.05](https://doi.org/10.20517/jmi.2021.05)
+## driving thermodynamic modeling and experimentation [10.2139/ssrn.4689687](https://dx.doi.org/10.2139/ssrn.4689687).
+##
+## At its core, `nimCSO` leverages the metaprogramming ability of the [Nim language](https://nim-lang.org) to optimize itself at compile time, both in terms of speed and memory handling,
+## to the specific problem statement and dataset at hand based on a human-readable configuration file. As demonstrated later in benchmarks, `nimCSO` reaches the physical limits of the hardware
+## (L1 cache latency) and can outperform an efficient native Python implementation by over 400 times in terms of speed and 50 times in terms of memory usage (*not* counting the interpreter), while
+## also outperforming a NumPy implementation by 35 and 17 times, respectively, when checking a candidate solution.
+##
+## .. figure:: assets/nimCSO_mainFigure.png
+##    :alt: Main nimCSO figure
+##
+## `nimCSO` is designed to be both (1) a user-ready tool (see figure above), implementing efficient brute force approaches (for handling up to 25 dimensions), a custom search algorithm
+## (for up to 40 dimensions), and a genetic algorithm (for any dimensionality), and (2) a scaffold for building even more elaborate methods in the future, including heuristics going beyond
+## data availability. All configuration is done with a simple human-readable `YAML` config file and plain text data files, making it easy to modify the search method and its parameters with
+## no knowledge of programming and only basic command line skills. A single command is used to recompile (`nim c -f`) and run (`-r`) the problem (`-d:configPath=config.yaml`) with `nimCSO`
+## (`src/nimcso`) using one of several methods. Advanced users can also quickly customize the provided methods with brief scripts using `nimCSO` as a data-centric library.
+
+
 ## # Usage
 ##
 ##
diff --git a/src/nimcso.nim b/src/nimcso.nim
index 5af6ecf..3d35730 100644
--- a/src/nimcso.nim
+++ b/src/nimcso.nim
@@ -6,13 +6,15 @@
 {.passL: "-flto".}

 when defined(nimdoc):
-  # Core documentation living in the root of the project
+  # Core documentation living in the root of the project.
   include ../docs/docs

 when defined(nimdoc):
+  # Documentation on benchmarks, living alongside them.
   include ../benchmarks/docs

 when defined(nimdoc):
+  # Documentation on the tests being run, living alongside them.
   include ../tests/docs

 # Standard library imports. One per line for easy change tracking.
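Note on the single-command invocation described in the new `docs/docs.nim` text: the pieces it names (`nim c -f`, `-r`, `-d:configPath=config.yaml`, `src/nimcso`) can be put together roughly as sketched below. This is a minimal sketch, not the project's documented command line. The variable names `CONFIG_PATH` and `NIMCSO_CMD` are hypothetical, and the method-selection arguments are left out because the text does not name them.

```shell
# Hypothetical variable; config.yaml is the file name used in the docs text.
CONFIG_PATH="config.yaml"

# -f forces a full rebuild, -r runs the compiled binary afterwards, and
# -d:configPath=... passes the config location as a compile-time define.
# Method-selection arguments are omitted: the docs text does not list them.
NIMCSO_CMD="nim c -f -r -d:configPath=${CONFIG_PATH} src/nimcso"

# Print the assembled command so it can be inspected before running.
echo "${NIMCSO_CMD}"

# Run it only when the compiler, the config file, and the source are present.
# The unquoted expansion relies on none of the paths containing spaces.
if command -v nim >/dev/null 2>&1 && [ -f "${CONFIG_PATH}" ] && [ -f src/nimcso.nim ]; then
  ${NIMCSO_CMD}
fi
```

Because the configuration is consumed at compile time, pointing `CONFIG_PATH` at a different file means rebuilding, which is presumably why the text pairs the forced rebuild (`-f`) with the run flag.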