TinyBase: Status and Roadmap #766

Open
1 of 19 tasks
pmrv opened this issue Jul 11, 2023 · 5 comments
Comments

@pmrv
Contributor

pmrv commented Jul 11, 2023

Here I just want to briefly collect my todos.

What works

  • the latest iteration already looks and feels like "usual" pyiron (imo).
  • jobs and tasks can be created and used without imports
  • new database and storage interfaces to serialize and load jobs and their dependent objects
  • the new interfaces are flexible enough to support multiple types of projects, storage backends and databases
  • objects that implement the "old" HasHDF work natively with the new interfaces

What should be done

  • broker access to working directories via the project interface. Jobs should ask the project for a directory and pass it to the task. Tasks should mark themselves whether they may or may not run on remote machines.
  • expand the database interface with respect to @tnecnivkcots's renormalized database structure
  • clarify the precedence between project implementations and database implementations. My current thinking is that there should be one centrally configured database (as outlined below, this may not be the same as the central database that we have now) in which projects of multiple types live (some normal, some archived, some scratch?); i.e. the database is the sole arbiter of truth. Right now, however, each project implementation can bring its own database.
  • more database and storage interfaces:
    • global database
    • file table database
    • the null database/project
    • archived projects, exported projects
    • S3 storage
  • revisit the executor class.
    • In the current usage the underlying state machine is not as useful as initially expected and can probably be substantially simplified
    • this will include setting on tasks information about their internal parallelism
    • rename vis-à-vis the Executor submodule (#765); maybe TaskExecutor?
    • remove Submitters and just use the plain executors plus the ExecutionContext
  • interaction with workflow developments. I see tinybase as mostly adding persistence and searchability to the tools developed there. It should already be quite straightforward to use tiny jobs inside @liamhuber's nodes and vice versa. The preferred way of integrating (nodes < tiny jobs or tiny jobs < nodes) will likely depend on the exact requirements in terms of the number and calculation cost of the workflows and nodes in question. Both ways are imo worthwhile, but this should be formalized a bit.
  • adding more specification, especially for the database and storage interfaces. The respective classes already document expected use and assumptions, but it'll be useful to have it collected in one document.
  • tests. tests. tests.
  • some more cosmetic work on creators.
  • Storable needs an auto-update interface; classes implementing it should provide a dict[version_number, update_function] as a class attribute so that GenericStorage can apply these patches as it loads
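The auto-update idea in the last bullet could be sketched roughly as follows. All names here (Storable, _migrate, MyJob, the renamed field) are illustrative assumptions, not tinybase's actual API:

```python
# Hypothetical sketch of the versioned auto-update idea for Storable.
# Class and attribute names are illustrative, not tinybase's real API.

class Storable:
    _version: int = 1
    # maps stored version -> function patching the raw dict to version + 1
    _updates: dict = {}

    @classmethod
    def _migrate(cls, data: dict) -> dict:
        """Apply update functions until the data matches the current version."""
        version = data.get("_version", 1)
        while version < cls._version:
            data = cls._updates[version](data)
            version += 1
            data["_version"] = version
        return data


def _rename_energy(data):
    """v1 -> v2: pretend the total energy moved from 'energy' to 'total_energy'."""
    data = dict(data)
    data["total_energy"] = data.pop("energy")
    return data


class MyJob(Storable):
    _version = 2
    _updates = {1: _rename_energy}
```

In this sketch, GenericStorage would call `_migrate` on the raw dict on load, before reconstructing the object, so on-disk data written by old versions keeps working.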
@pmrv pmrv added the .tinybase label Jul 11, 2023
@jan-janssen
Member

I have a couple of questions, to clarify the current state:

  • Does tinybase support the submission to an HPC cluster/ queuing system? Can it use the existing pysqa interface?
  • Does tinybase support MPI parallel tasks? I see that in the shell.py class there is support for multiple versions, but I did not see anything about MPI parallel tasks.
  • Does tinybase require a database? My hope was that with the use of executors we no longer need a database for tracking the status of a given job during the execution, so that the database becomes optional to accelerate the indexing of large datasets as well as the sharing of data. Is it possible to make the database optional?
  • Does tinybase implement the map-reduce pattern we have for the pyiron-table? Many of my projects these days just start a large number of DFT calculation and aggregate the resulting data in a pyiron-table for future processing. If it is not yet implemented it should not be too hard. Here is some minimalistic pyiron-table implementation I did a couple of years ago https://github.com/scisweeper/scisweeper
  • Are jobs by default interactive? I guess in many cases we want to execute a series of similar calculations. There are two options to do this: either we submit a series of jobs to the queuing system and aggregate the resulting data in a pyiron-table, or we use one interactive job which receives a list of parameters and executes them. I played around with this concept in https://github.com/pyiron/pyiron_lammps: for example, when a lammps job receives an array of structures, it should return an array of a given property. In the case of VASP it would be great to provide an array of energy cutoffs and then have it iterate over them.

@pmrv
Contributor Author

pmrv commented Jul 11, 2023

I have a couple of questions, to clarify the current state:

  • Does tinybase support the submission to an HPC cluster/ queuing system? Can it use the existing pysqa interface?

It's not aware of queuing systems, but I also hope it won't need to be. You should be able to plug in a parsl or dask executor immediately, but I haven't tried. Wrapping pysqa in an executor interface was on my mind, but I see this is already discussed in #765, which I like!

  • Does tinybase support MPI parallel tasks? I see that in the shell.py class there is support for multiple versions, but I did not see anything about MPI parallel tasks.

Yes and no: the executor/task setup doesn't know about the internal parallelism of tasks. It's not clear to me yet how to make this general, since most executor implementations won't support it. On the other hand, you should be able to just plug in the corresponding pool from pympipool without other code changes. We had already discussed in #718 that that one doesn't support multiple tasks at the same time, though, so nothing really gets this 100% yet. At least by the time pysqa/flux are there, we'll probably want some information for task-level parallelism, but I'd like to leave this open for now.
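The "plug in any pool" argument rests on the standard concurrent.futures Executor protocol: a task runner only needs submit(), so thread pools, dask, or pympipool executors can be swapped without code changes. A minimal sketch (the run_tasks helper and the toy callables are illustrative, not tinybase code):

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def run_tasks(tasks, executor: Executor):
    """Submit each task to whatever Executor we were handed.

    The runner is agnostic to the pool implementation: ThreadPoolExecutor,
    a dask or pympipool executor, etc., as long as it offers submit().
    """
    futures = [executor.submit(task) for task in tasks]
    return [f.result() for f in futures]


# toy "tasks": plain callables standing in for Task.execute
tasks = [lambda i=i: i * i for i in range(5)]

with ThreadPoolExecutor(max_workers=2) as pool:
    results = run_tasks(tasks, pool)
# results == [0, 1, 4, 9, 16]
```

What none of the standard pools express, as discussed above, is per-task internal parallelism; that information has to live on the task side.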

  • Does tinybase require a database? My hope was that with the use of executors we no longer need a database for tracking the status of a given job during the execution, so that the database becomes optional to accelerate the indexing of large datasets as well as the sharing of data. Is it possible to make the database optional?

It requires a database interface, but it doesn't care what it does. The interface takes tuples with the job information and should give them back with a job id; that's all. I added InMemoryProject to show that this idea is fairly flexible. Adding a DevNullProject that just throws everything away should be trivial. More useful in practice is likely an adapter for the file table in base. It's not added yet because I don't have a lot of time.
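The contract described here (job info in, job id out) is small enough to sketch. The class names below are hypothetical stand-ins, not tinybase's actual interfaces:

```python
from abc import ABC, abstractmethod
from itertools import count


class DatabaseInterface(ABC):
    """Hypothetical sketch of the contract described above: take job
    information, hand back a job id. Not tinybase's actual class."""

    @abstractmethod
    def add(self, job_info):
        """Store one job-info tuple and return its job id."""

    @abstractmethod
    def get(self, job_id):
        """Return the stored tuple, or None if unknown."""


class InMemoryDatabase(DatabaseInterface):
    """Keeps everything in a dict, in the spirit of InMemoryProject."""

    def __init__(self):
        self._ids = count(1)
        self._rows = {}

    def add(self, job_info):
        job_id = next(self._ids)
        self._rows[job_id] = tuple(job_info)
        return job_id

    def get(self, job_id):
        return self._rows.get(job_id)


class DevNullDatabase(DatabaseInterface):
    """Throws everything away but still satisfies the interface,
    making the database effectively optional."""

    def add(self, job_info):
        return -1

    def get(self, job_id):
        return None
```

A file-table adapter or a SQL-backed implementation would slot in behind the same two methods.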

  • Does tinybase implement the map-reduce pattern we have for the pyiron-table? Many of my projects these days just start a large number of DFT calculation and aggregate the resulting data in a pyiron-table for future processing. If it is not yet implemented it should not be too hard. Here is some minimalistic pyiron-table implementation I did a couple of years ago https://github.com/scisweeper/scisweeper

Good that you mention this! It's naturally on the list, but I forgot to add it (this is also mostly how I work). Like you said, it shouldn't be hard, and it will definitely benefit a lot from the improved parallelism of tinybase.
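The map-reduce pattern in question is essentially "analyze each finished job in parallel, collect the rows into one table". A minimal sketch with the standard library (the analyze function, the dict-shaped jobs, and the list-of-dicts table are illustrative; in practice the rows would land in a pandas DataFrame):

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(job):
    """The 'map' step: extract a row of interesting quantities from one
    finished job. 'job' is just a dict standing in for a real calculation."""
    return {"name": job["name"], "energy_per_atom": job["energy"] / job["n_atoms"]}


def job_table(jobs, analyze, max_workers=4):
    """The 'reduce' step: run the analysis over all jobs in parallel and
    collect the rows into one table."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, jobs))


jobs = [
    {"name": "dft_0", "energy": -8.0, "n_atoms": 2},
    {"name": "dft_1", "energy": -16.0, "n_atoms": 4},
]
table = job_table(jobs, analyze)
# table == [{'name': 'dft_0', 'energy_per_atom': -4.0},
#           {'name': 'dft_1', 'energy_per_atom': -4.0}]
```

Swapping the pool for a cluster-backed executor is where tinybase's improved parallelism would pay off.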

  • Are jobs by default interactive ? I guess in many cases we want to execute a series of similar calculation, now there are two options to do this, either we submit a series of jobs to the queuing system and aggregate the resulting data in a pyiron-table, or we use one interactive job which receives a list of parameters and executes those. I played around with this concept in https://github.com/pyiron/pyiron_lammps for example when a lammps job receives an array of structures, then it should return an array of a given property. In the case of VASP it would be great to provide an array of energy cutoffs and then have it iterate over them.

Yes, difficult question. Tasks (not jobs) don't save their output, so you can trivially do something like

```python
import numpy as np

for eps in np.linspace(-0.1, 0.1):  # 50 strain values by default
    task.input.structure.apply_strain(eps)
    output = task.execute()  # save this however you want
```

Is that always what we want? For things like lammps, where we have a proper library interface, this is probably a bit wasteful. On the other hand, the current conceptualization of interactive jobs as it exists in atomistics is maybe not something that we want, serial as it is. On the third hand, it won't be difficult to add tasks that wrap the lammps library or the functions from pyiron_lammps. I've toyed a bit with meta nodes (e.g. ListTaskGenerator, but it's not so user-friendly) that could be extended, but in light of the workflow developments, I've decided not to pursue this for now.
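The "array in, array out" pattern being discussed can be formalized without a persistent interactive job by a wrapper task that maps a scalar task over a list of inputs. A toy sketch (ListTask and the quadratic energy model are hypothetical, only loosely inspired by the ListTaskGenerator mentioned above):

```python
class ListTask:
    """Hypothetical wrapper task: applies a scalar function to each input
    and returns the list of outputs, standing in for e.g. a lammps task
    that receives an array of structures and returns an array of a
    given property."""

    def __init__(self, func, inputs):
        self.func = func
        self.inputs = inputs

    def execute(self):
        return [self.func(x) for x in self.inputs]


# toy example: a quadratic "energy vs. strain" model, E = 1/2 * k * eps^2
task = ListTask(lambda strain: 0.5 * 100.0 * strain**2, [0.0, 0.01, 0.02])
energies = task.execute()
```

A library-backed variant would keep one lammps instance alive across the loop instead of calling a scalar function per input.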

@jan-janssen
Member

Yes and no: the executor/task setup doesn't know about the internal parallelism of tasks. It's not clear to me yet how to make this general, since most executor implementations won't support it. On the other hand, you should be able to just plug in the corresponding pool from pympipool without other code changes. We had already discussed in #718 that that one doesn't support multiple tasks at the same time, though, so nothing really gets this 100% yet. At least by the time pysqa/flux are there, we'll probably want some information for task-level parallelism, but I'd like to leave this open for now.

Yes, this is what I currently see as the biggest limitation of the Executor-based approach, and why I do not see how dask can help us address this challenge. From my perspective we need an executor which implements the same Executor interface but adds the option to call external executables and to use more than one core. With pympipool we have this support for serial and parallel Python functions, and with flux we have support for serial and MPI-parallel executables, but flux is currently restricted to Linux, and pympipool requires the user to manage multiple schedulers to support MPI-parallel tasks. So to me this seems like something we have to develop.
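Such an executor would keep the familiar submit()/result() shape but launch external (possibly MPI-parallel) executables instead of Python functions. A rough sketch of the idea using subprocess behind a thread pool (the class, the cores argument, and the mpirun invocation are illustrative assumptions, not an existing implementation):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


class ShellExecutor:
    """Executor-like object whose submit() runs an external command,
    optionally through an MPI launcher when more than one core is asked for.
    Hypothetical sketch, not an existing pyiron/tinybase class."""

    def __init__(self, max_workers=2, mpi_launcher="mpirun"):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._mpi = mpi_launcher

    def submit(self, argv, cores=1, **kwargs):
        # prepend the MPI launcher for parallel executables
        if cores > 1:
            argv = [self._mpi, "-n", str(cores)] + list(argv)
        return self._pool.submit(
            subprocess.run, argv, capture_output=True, text=True, **kwargs
        )

    def shutdown(self):
        self._pool.shutdown()


ex = ShellExecutor()
fut = ex.submit(["echo", "hello"])  # a serial command; cores=4 would use mpirun
out = fut.result().stdout.strip()   # -> "hello"
ex.shutdown()
```

A production version would additionally route the cores/memory/runtime information through to a queuing system, which is the part that still has to be developed.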

@pmrv
Contributor Author

pmrv commented Jul 14, 2023

Agreed, it looks like we'll have to do it. I'd prefer to keep this as orthogonal as possible, though. That is, we define what a task should specify about its internal parallelism, and then any executor can act on this accordingly. I guess a task would need to keep track of at least:

  1. cores
  2. threads
  3. gpus
  4. max memory
  5. max runtime
  6. whether it can run on a remote machine/filesystem (e.g. a pyiron table cannot; it needs access to the file system and job table)
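The six items above map naturally onto a small per-task resource declaration that any executor can inspect and act on (or ignore what it can't honor). A sketch with hypothetical names and units:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceSpec:
    """Hypothetical per-task resource declaration matching the list above.
    Field names and units are illustrative assumptions."""

    cores: int = 1
    threads: int = 1
    gpus: int = 0
    max_memory_mb: Optional[int] = None   # None: no limit requested
    max_runtime_s: Optional[int] = None   # None: no limit requested
    remote_capable: bool = True           # False e.g. for a pyiron table


# e.g. a 4-core, 1-hour task that must stay on the local filesystem
spec = ResourceSpec(cores=4, max_runtime_s=3600, remote_capable=False)
```

Keeping this on the task rather than the executor is what keeps the two concerns orthogonal: a thread pool can ignore most fields, while a pysqa/flux-backed executor would translate them into a queue submission.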

@pmrv
Contributor Author

pmrv commented Sep 25, 2023

The example notebooks are now in a working shape again.
