Skip to content

Commit

Permalink
(paper) added NumPy citation per suggestion in #6
Browse files Browse the repository at this point in the history
  • Loading branch information
amkrajewski authored Sep 24, 2024
1 parent afd364f commit bcf4bf4
Show file tree
Hide file tree
Showing 2 changed files with 24 additions and 1 deletion.
23 changes: 23 additions & 0 deletions paper/paper.bib
Original file line number Diff line number Diff line change
Expand Up @@ -296,3 +296,26 @@ @misc{Edmisten2022
url = {https://wcedmisten.fyi/post/analyzing-all-recipes/#spices},
year = {2022},
}

@Article{Harris2020array,
title = {Array programming with {NumPy}},
author = {Charles R. Harris and K. Jarrod Millman and St{\'{e}}fan J.
van der Walt and Ralf Gommers and Pauli Virtanen and David
Cournapeau and Eric Wieser and Julian Taylor and Sebastian
Berg and Nathaniel J. Smith and Robert Kern and Matti Picus
and Stephan Hoyer and Marten H. van Kerkwijk and Matthew
Brett and Allan Haldane and Jaime Fern{\'{a}}ndez del
R{\'{i}}o and Mark Wiebe and Pearu Peterson and Pierre
G{\'{e}}rard-Marchant and Kevin Sheppard and Tyler Reddy and
Warren Weckesser and Hameer Abbasi and Christoph Gohlke and
Travis E. Oliphant},
year = {2020},
month = sep,
journal = {Nature},
volume = {585},
number = {7825},
pages = {357--362},
doi = {10.1038/s41586-020-2649-2},
publisher = {Springer Science and Business Media {LLC}},
url = {https://doi.org/10.1038/s41586-020-2649-2}
}
2 changes: 1 addition & 1 deletion paper/paper.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ bibliography: paper.bib

`nimCSO` is a high-performance tool implementing several methods for selecting components (data dimensions) in compositional datasets, which optimize the data availability and density for applications such as machine learning. Making said choice is a combinatorically hard problem for complex compositions existing in high-dimensional spaces due to the interdependency of components being present. Such spaces are encountered across many scientific disciplines (see [Statement of Need](#statement-of-need)), including materials science, where datasets on Compositionally Complex Materials (CCMs) often span 20-45 chemical elements, 5-10 processing types, and several temperature regimes, for up to 60 total data dimensions. This challenge also exists in everyday contexts, such as study of cooking ingredients [@Ahn2011], which interact in various recipes, giving rise to questions like *"Given 100 spices at the supermarket, which 20, 30, or 40 should I stock in my pantry to maximize the number of unique dishes I can spice according to recipe?"*. Critically, this is not as simple as frequency-based selection because, e.g., removing less common nutmeg and cinnamon from your shopping list will prevent many recipes with the frequent vanilla, but won't affect those using black pepper [@Edmisten2022].

At its core, `nimCSO` leverages the metaprogramming ability of the Nim language [@Rumpf2023] to optimize itself at compile time, both in terms of speed and memory handling, to the specific problem statement and dataset at hand based on a human-readable configuration file. As demonstrated in the [Methods and Performance](#methods-and-performance) section, `nimCSO` reaches the physical limits of the hardware (L1 cache latency) and can outperform an efficient native Python implementation over 100 times in terms of speed and 50 times in terms of memory usage (*not* counting the interpreter), while also outperforming the NumPy implementation 37 and 17 times, respectively, when checking a candidate solution.
At its core, `nimCSO` leverages the metaprogramming ability of the Nim language [@Rumpf2023] to optimize itself at compile time, both in terms of speed and memory handling, to the specific problem statement and dataset at hand based on a human-readable configuration file. As demonstrated in the [Methods and Performance](#methods-and-performance) section, `nimCSO` reaches the physical limits of the hardware (L1 cache latency) and can outperform an efficient native Python implementation over 100 times in terms of speed and 50 times in terms of memory usage (*not* counting the interpreter), while also outperforming the NumPy [@Harris2020array] based implementation 37 and 17 times, respectively, when checking a candidate solution.

`nimCSO` is designed to be both (1) a user-ready tool, implementing two efficient brute-force approaches (for handling up to 25 dimensions), a custom search algorithm (for up to 40 dimensions), and a genetic algorithm (for any dimensionality), and (2) a scaffold for building even more elaborate methods in the future, including heuristics going beyond data availability. All configuration is done with a simple human-readable `YAML` file and plain text data files, making it easy to modify the search method and its parameters with no knowledge of programming and only basic command line skills.

Expand Down

0 comments on commit bcf4bf4

Please sign in to comment.