Can't start from interactive Slurm session #47

Open
kcgthb opened this issue Sep 22, 2017 · 5 comments

kcgthb commented Sep 22, 2017

Hi!

When trying to start remora from an interactive Slurm session, it immediately exits with the following error:

$ srun --pty bash
[cn] $ remora --help
REMORA Error:  Incorrect syntax: REMORA can't run in parallel
REMORA Howto
remora ./myapp [args]                     (serial applications)
remora ibrun [options] ./myapp [args]     (parallel MPI applications)

Our Slurm setup has MpiDefault=pmi2, and when starting an interactive session with srun --pty bash, the PMI environment is set, so PMI_RANK is defined in the environment, and remora thinks that it's running in a parallel mode (checks are in check_running_parallel() in aux/extra).

Unsetting the PMI_RANK variable allows remora to start:

$ srun --pty bash
[cn] $ unset PMI_RANK
[cn] $ remora --help
 SYNOPSIS
  remora ./myapp [args]                     (serial applications)
  remora ibrun [options] ./myapp [args]     (parallel MPI applications)

 DESCRIPTION
 REMORA: REsource MOnitoring for Remote Applications
[...]
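
As a narrower variant of the workaround above, the variable can be removed for a single command instead of for the whole interactive shell. This is only a sketch: `run_without_pmi_rank` is a hypothetical helper name, and whether the monitored application itself needs `PMI_RANK` is not something this thread establishes.

```shell
# Hypothetical helper (not part of REMORA): run a command with PMI_RANK
# removed from its environment. The unset happens in a subshell, so the
# interactive shell keeps its original PMI_RANK value.
run_without_pmi_rank() {
    (
        unset PMI_RANK
        "$@"
    )
}

# Example use inside the interactive session:
#   run_without_pmi_rank remora --help
```

The subshell keeps the change local, so later `srun` steps started from the same session still see the environment Slurm set up.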
@antoniogi
Contributor

Thanks for reporting this. Can you tell me the values of PMI_RANK and PMI_SIZE when you start your job with "srun --pty bash"?

@kcgthb
Author

kcgthb commented Oct 4, 2017

Sure!

$ srun --pty bash
$ echo $PMI_RANK
0
$ echo $PMI_SIZE
1

@antoniogi
Contributor

Ok, add this to aux/extra, line 61:

if [ -n "${PMI_SIZE+1}" ] && [ "$PMI_SIZE" -gt 1 ]; then
    print_error "Incorrect syntax: REMORA can't run in parallel"
    usage
    exit
fi

After this, leave the rest of the function as it is (starting with "if [ -n "${PMI_ID+1}" ] || [ -n "${PMI_RANK+1}" ] || [ -n "${MPIRUN_RANK+1}" ]; then").

I'll add this to the code, but there are things that I need to figure out before committing new changes.

@kcgthb
Author

kcgthb commented Oct 4, 2017

Thanks! I did what you suggested, but it doesn't seem to change anything.

When PMI_SIZE is 1, i.e. in serial jobs, the rest of that function still sets $my_rank to a non-null value and displays the "REMORA can't run in parallel" error.
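
To illustrate the point, here is a minimal sketch of a check that treats a single-task PMI environment as serial. It is not REMORA's actual check_running_parallel(): the function name `looks_parallel` is hypothetical, and treating "rank variable set but PMI_SIZE unknown" as parallel is an assumption made here to stay conservative.

```shell
# Hypothetical helper, not the code in aux/extra.
# Returns 0 (true) when the environment looks like a multi-task launch,
# 1 (false) when it looks serial.
looks_parallel() {
    # No launcher rank variable at all: plain serial shell.
    if [ -z "${PMI_ID+1}" ] && [ -z "${PMI_RANK+1}" ] && [ -z "${MPIRUN_RANK+1}" ]; then
        return 1
    fi
    # A rank variable is set, but the job reports exactly one task,
    # as with "srun --pty bash" under MpiDefault=pmi2.
    if [ "${PMI_SIZE:-}" = "1" ]; then
        return 1
    fi
    # Rank variable set and size is either greater than 1 or unknown
    # (assumption: unknown size is treated as parallel).
    return 0
}
```

With the values reported earlier in this thread (PMI_RANK=0, PMI_SIZE=1), this function returns false, so the early exit would have to happen before any later code derives a rank from PMI_RANK.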

@antoniogi
Contributor

Can you send the output of the env command to my email ([email protected])? Let's see what we can do :)
