TinyBase: Status and Roadmap #766

Open
1 of 19 tasks
pmrv opened this issue Jul 11, 2023 · 5 comments
Comments

@pmrv
Contributor

pmrv commented Jul 11, 2023

Here I just want to briefly collect my todos.

What works

  • the latest iteration already looks and feels like "usual" pyiron (imo).
  • jobs and tasks can be created and used without imports
  • new database and storage interfaces to serialize and load jobs and their dependent objects
  • the new interfaces are flexible enough to support multiple types of projects, storage backends and databases
  • objects that implement the "old" HasHDF work natively with the new interfaces

What should be done

  • broker access to working directories via the project interface. Jobs should ask the project for a directory and pass it to the task. Tasks should mark themselves whether they may or may not run on remote machines.
  • expand the database interface with respect to @tnecnivkcots's renormalized database structure
  • clarify the precedence between project implementations and database implementations. My current thinking is that there should be one centrally configured database (as outlined below, this may not be the same as the central database that we have now) in which projects of multiple types live (some normal, some archived, some scratch?); i.e. the database is the sole arbiter of truth. Right now, however, each project implementation can bring its own database.
  • more database and storage interfaces:
    • global database
    • file table database
    • the null database/project
    • archived projects, exported projects
    • S3 storage
  • revisit the executor class.
    • In the current usage the underlying state machine is not as useful as initially expected and can probably be substantially simplified
    • this will include setting on tasks information about their internal parallelism
    • rename vis-à-vis the Executor submodule (#765); maybe TaskExecutor?
    • remove Submitters and just use the plain executors plus the ExecutionContext
  • interaction with workflow developments. I see tinybase as mostly adding persistence and searchability to the tools developed there. It should already be quite straightforward to use tiny jobs inside @liamhuber's nodes and vice versa. The preferred way of integrating (nodes < tiny jobs or tiny jobs < nodes) will likely depend on the exact requirements in terms of the number and calculation cost of the workflows and nodes in question. Both ways are imo worthwhile, but this should be formalized a bit.
  • adding more specification, especially for the database and storage interfaces. The respective classes already document expected use and assumptions, but it'll be useful to have it collected in one document.
  • tests. tests. tests.
  • some more cosmetic work on creators.
  • Storable needs an auto-update interface; classes implementing it should provide a dict[version_number, update_function] as a class attribute so that GenericStorage can apply these patches as it loads
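The auto-update idea in the last bullet could be sketched roughly as follows. All names here (Storable, _migrate, MyJob, the renamed field) are illustrative assumptions, not tinybase's actual API:

```python
# Hypothetical sketch of the versioned auto-update idea for Storable.
# Class and attribute names are illustrative, not tinybase's real API.

class Storable:
    _version: int = 1
    # maps stored version -> function patching the raw dict to version + 1
    _updates: dict = {}

    @classmethod
    def _migrate(cls, data: dict) -> dict:
        """Apply update functions until the data matches the current version."""
        version = data.get("_version", 1)
        while version < cls._version:
            data = cls._updates[version](data)
            version += 1
            data["_version"] = version
        return data


def _rename_energy(data):
    """v1 -> v2: pretend the total energy moved from 'energy' to 'total_energy'."""
    data = dict(data)
    data["total_energy"] = data.pop("energy")
    return data


class MyJob(Storable):
    _version = 2
    _updates = {1: _rename_energy}
```

In this sketch, GenericStorage would call `_migrate` on the raw dict on load, before reconstructing the object, so on-disk data written by old versions keeps working.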
@pmrv pmrv added the .tinybase label Jul 11, 2023
@jan-janssen
Member

I have a couple of questions, to clarify the current state:

  • Does tinybase support the submission to an HPC cluster/ queuing system? Can it use the existing pysqa interface?
  • Does tinybase support MPI parallel tasks? I see that in the shell.py class there is support for multiple versions, but I did not see anything about MPI parallel tasks.
  • Does tinybase require a database? My hope was that with the use of executors we no longer need a database for tracking the status of a given job during the execution, so that the database becomes optional to accelerate the indexing of large datasets as well as the sharing of data. Is it possible to make the database optional?
  • Does tinybase implement the map-reduce pattern we have for the pyiron-table? Many of my projects these days just start a large number of DFT calculation and aggregate the resulting data in a pyiron-table for future processing. If it is not yet implemented it should not be too hard. Here is some minimalistic pyiron-table implementation I did a couple of years ago https://github.com/scisweeper/scisweeper
  • Are jobs by default interactive? I guess in many cases we want to execute a series of similar calculations. There are two options to do this: either we submit a series of jobs to the queuing system and aggregate the resulting data in a pyiron-table, or we use one interactive job which receives a list of parameters and executes them. I played around with this concept in https://github.com/pyiron/pyiron_lammps: for example, when a lammps job receives an array of structures, it should return an array of a given property. In the case of VASP it would be great to provide an array of energy cutoffs and then have it iterate over them.

@pmrv
Contributor Author

pmrv commented Jul 11, 2023

I have a couple of questions, to clarify the current state:

  • Does tinybase support the submission to an HPC cluster/ queuing system? Can it use the existing pysqa interface?

It's not aware of queuing systems, but I also hope it won't need to be. You should be able to plug in a parsl or dask executor immediately, but I haven't tried. Wrapping pysqa in an executor interface was on my mind, but I see this is already discussed in #765, which I like!

  • Does tinybase support MPI parallel tasks? I see that in the shell.py class there is support for multiple versions, but I did not see anything about MPI parallel tasks.

Yes and no: the executor/task setup doesn't know about the internal parallelism of tasks. It's not clear to me yet how to make this general, since most executor implementations won't support it. On the other hand, you should be able to just plug in the corresponding pool from pympipool without other code changes. We had already discussed in #718 that that one doesn't support multiple tasks at the same time, though, so nothing really gets this 100% yet. At least by the time pysqa/flux are there, we'll probably want some information for task-level parallelism, but I'd like to leave this open for now.
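The "plug in any pool" argument rests on the standard concurrent.futures Executor protocol: a task runner only needs submit(), so thread pools, dask, or pympipool executors can be swapped without code changes. A minimal sketch (the run_tasks helper and the toy callables are illustrative, not tinybase code):

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def run_tasks(tasks, executor: Executor):
    """Submit each task to whatever Executor we were handed.

    The runner is agnostic to the pool implementation: ThreadPoolExecutor,
    a dask or pympipool executor, etc., as long as it offers submit().
    """
    futures = [executor.submit(task) for task in tasks]
    return [f.result() for f in futures]


# toy "tasks": plain callables standing in for Task.execute
tasks = [lambda i=i: i * i for i in range(5)]

with ThreadPoolExecutor(max_workers=2) as pool:
    results = run_tasks(tasks, pool)
# results == [0, 1, 4, 9, 16]
```

What none of the standard pools express, as discussed above, is per-task internal parallelism; that information has to live on the task side.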

  • Does tinybase require a database? My hope was that with the use of executors we no longer need a database for tracking the status of a given job during the execution, so that the database becomes optional to accelerate the indexing of large datasets as well as the sharing of data. Is it possible to make the database optional?

It requires a database interface, but it doesn't care what it does. The interface takes tuples with the job information and should give them back with a job id; that's all. I added InMemoryProject to show that this idea is fairly flexible. Adding a DevNullProject that just throws everything away should be trivial. More useful in practice is likely an adapter for the file table in base. It's not added yet because I don't have a lot of time.
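The contract described here (job info in, job id out) is small enough to sketch. The class names below are hypothetical stand-ins, not tinybase's actual interfaces:

```python
from abc import ABC, abstractmethod
from itertools import count


class DatabaseInterface(ABC):
    """Hypothetical sketch of the contract described above: take job
    information, hand back a job id. Not tinybase's actual class."""

    @abstractmethod
    def add(self, job_info):
        """Store one job-info tuple and return its job id."""

    @abstractmethod
    def get(self, job_id):
        """Return the stored tuple, or None if unknown."""


class InMemoryDatabase(DatabaseInterface):
    """Keeps everything in a dict, in the spirit of InMemoryProject."""

    def __init__(self):
        self._ids = count(1)
        self._rows = {}

    def add(self, job_info):
        job_id = next(self._ids)
        self._rows[job_id] = tuple(job_info)
        return job_id

    def get(self, job_id):
        return self._rows.get(job_id)


class DevNullDatabase(DatabaseInterface):
    """Throws everything away but still satisfies the interface,
    making the database effectively optional."""

    def add(self, job_info):
        return -1

    def get(self, job_id):
        return None
```

A file-table adapter or a SQL-backed implementation would slot in behind the same two methods.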

  • Does tinybase implement the map-reduce pattern we have for the pyiron-table? Many of my projects these days just start a large number of DFT calculation and aggregate the resulting data in a pyiron-table for future processing. If it is not yet implemented it should not be too hard. Here is some minimalistic pyiron-table implementation I did a couple of years ago https://github.com/scisweeper/scisweeper

Good that you mention this! It's naturally on the list, but I forgot to add it (this is also mostly how I work). Like you said, it shouldn't be hard, and it will definitely benefit a lot from the improved parallelism of tinybase.
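The map-reduce pattern in question is essentially "analyze each finished job in parallel, collect the rows into one table". A minimal sketch with the standard library (the analyze function, the dict-shaped jobs, and the list-of-dicts table are illustrative; in practice the rows would land in a pandas DataFrame):

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(job):
    """The 'map' step: extract a row of interesting quantities from one
    finished job. 'job' is just a dict standing in for a real calculation."""
    return {"name": job["name"], "energy_per_atom": job["energy"] / job["n_atoms"]}


def job_table(jobs, analyze, max_workers=4):
    """The 'reduce' step: run the analysis over all jobs in parallel and
    collect the rows into one table."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, jobs))


jobs = [
    {"name": "dft_0", "energy": -8.0, "n_atoms": 2},
    {"name": "dft_1", "energy": -16.0, "n_atoms": 4},
]
table = job_table(jobs, analyze)
# table == [{'name': 'dft_0', 'energy_per_atom': -4.0},
#           {'name': 'dft_1', 'energy_per_atom': -4.0}]
```

Swapping the pool for a cluster-backed executor is where tinybase's improved parallelism would pay off.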

  • Are jobs by default interactive ? I guess in many cases we want to execute a series of similar calculation, now there are two options to do this, either we submit a series of jobs to the queuing system and aggregate the resulting data in a pyiron-table, or we use one interactive job which receives a list of parameters and executes those. I played around with this concept in https://github.com/pyiron/pyiron_lammps for example when a lammps job receives an array of structures, then it should return an array of a given property. In the case of VASP it would be great to provide an array of energy cutoffs and then have it iterate over them.

Yes, difficult question. Tasks (not jobs) don't save their output, so you can trivially do something like

```python
import numpy as np

for eps in np.linspace(-0.1, 0.1):  # 50 strain values by default
    task.input.structure.apply_strain(eps)
    output = task.execute()  # save this however you want
```

Is that always what we want? For things like lammps, where we have a proper library interface, this is probably a bit wasteful. On the other hand, the current conceptualization of interactive jobs as it exists in atomistics is maybe not something that we want, serial as it is. On the third hand, it won't be difficult to add tasks that wrap the lammps library or the functions from pyiron_lammps. I've toyed a bit with meta nodes (e.g. ListTaskGenerator, but it's not so user-friendly) that could be extended, but in light of the workflow developments, I've decided not to pursue this for now.
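The "array in, array out" pattern being discussed can be formalized without a persistent interactive job by a wrapper task that maps a scalar task over a list of inputs. A toy sketch (ListTask and the quadratic energy model are hypothetical, only loosely inspired by the ListTaskGenerator mentioned above):

```python
class ListTask:
    """Hypothetical wrapper task: applies a scalar function to each input
    and returns the list of outputs, standing in for e.g. a lammps task
    that receives an array of structures and returns an array of a
    given property."""

    def __init__(self, func, inputs):
        self.func = func
        self.inputs = inputs

    def execute(self):
        return [self.func(x) for x in self.inputs]


# toy example: a quadratic "energy vs. strain" model, E = 1/2 * k * eps^2
task = ListTask(lambda strain: 0.5 * 100.0 * strain**2, [0.0, 0.01, 0.02])
energies = task.execute()
```

A library-backed variant would keep one lammps instance alive across the loop instead of calling a scalar function per input.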

@jan-janssen
Member

Yes and no: the executor/task setup doesn't know about the internal parallelism of tasks. It's not clear to me yet how to make this general, since most executor implementations won't support it. On the other hand, you should be able to just plug in the corresponding pool from pympipool without other code changes. We had already discussed in #718 that that one doesn't support multiple tasks at the same time, though, so nothing really gets this 100% yet. At least by the time pysqa/flux are there, we'll probably want some information for task-level parallelism, but I'd like to leave this open for now.

Yes, this is what I currently see as the biggest limitation of the Executor-based approach, and why I do not see how dask can help us address this challenge. From my perspective we need an executor which implements the same Executor interface but adds the option to call external executables and to use more than one core. With pympipool we have this support for serial and parallel Python functions, and with flux we have support for serial and MPI-parallel executables, but flux is currently restricted to Linux, and pympipool requires the user to manage multiple schedulers to support MPI-parallel tasks. So to me this seems like something we have to develop.
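Such an executor would keep the familiar submit()/result() shape but launch external (possibly MPI-parallel) executables instead of Python functions. A rough sketch of the idea using subprocess behind a thread pool (the class, the cores argument, and the mpirun invocation are illustrative assumptions, not an existing implementation):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


class ShellExecutor:
    """Executor-like object whose submit() runs an external command,
    optionally through an MPI launcher when more than one core is asked for.
    Hypothetical sketch, not an existing pyiron/tinybase class."""

    def __init__(self, max_workers=2, mpi_launcher="mpirun"):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)
        self._mpi = mpi_launcher

    def submit(self, argv, cores=1, **kwargs):
        # prepend the MPI launcher for parallel executables
        if cores > 1:
            argv = [self._mpi, "-n", str(cores)] + list(argv)
        return self._pool.submit(
            subprocess.run, argv, capture_output=True, text=True, **kwargs
        )

    def shutdown(self):
        self._pool.shutdown()


ex = ShellExecutor()
fut = ex.submit(["echo", "hello"])  # a serial command; cores=4 would use mpirun
out = fut.result().stdout.strip()   # -> "hello"
ex.shutdown()
```

A production version would additionally route the cores/memory/runtime information through to a queuing system, which is the part that still has to be developed.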

@pmrv
Contributor Author

pmrv commented Jul 14, 2023

Agreed, it looks like we'll have to do it. I'd prefer to keep this as orthogonal as possible, though. That is, we define what a task should specify about its internal parallelism, and then any executor can act on this accordingly. I guess a task would need to keep track of at least:

  1. cores
  2. threads
  3. gpus
  4. max memory
  5. max runtime
  6. whether it can run on a remote machine/filesystem (e.g. a pyiron table cannot; it needs access to the file system and job table)
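The six items above map naturally onto a small per-task resource declaration that any executor can inspect and act on (or ignore what it can't honor). A sketch with hypothetical names and units:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResourceSpec:
    """Hypothetical per-task resource declaration matching the list above.
    Field names and units are illustrative assumptions."""

    cores: int = 1
    threads: int = 1
    gpus: int = 0
    max_memory_mb: Optional[int] = None   # None: no limit requested
    max_runtime_s: Optional[int] = None   # None: no limit requested
    remote_capable: bool = True           # False e.g. for a pyiron table


# e.g. a 4-core, 1-hour task that must stay on the local filesystem
spec = ResourceSpec(cores=4, max_runtime_s=3600, remote_capable=False)
```

Keeping this on the task rather than the executor is what keeps the two concerns orthogonal: a thread pool can ignore most fields, while a pysqa/flux-backed executor would translate them into a queue submission.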

@pmrv
Contributor Author

pmrv commented Sep 25, 2023

The example notebooks are now in a working shape again.
