Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement dora run command #703

Merged
merged 7 commits into from
Nov 14, 2024
Merged

Implement dora run command #703

merged 7 commits into from
Nov 14, 2024

Conversation

phil-opp
Copy link
Collaborator

@phil-opp phil-opp commented Nov 6, 2024

Runs a dataflow locally, without requiring any any daemon or coordinator processes. Multi-machine dataflows are not supported. The default log level is set to INFO, overwriting it is possible by setting the RUST_LOG environment variable.

This exposes the internal dora daemon --run-dataflow command that we use for testing.

This addition was proposed in https://github.com/orgs/dora-rs/discussions/698#discussioncomment-11125465 .

The second commit adds a ctrl-c handler. On first ctrl-c, we send a stop command to all nodes. On second ctrl-c, we exit immediately and kill all spawned nodes. On third ctrl-c, we abort the process directly without waiting (child processes keep running).

@phil-opp phil-opp requested a review from haixuanTao November 6, 2024 12:35
@haixuanTao
Copy link
Collaborator

This makes sense to me.

But could be good to:

  • Not print error on SIGINT ( Ctrl-C ) as it is what is expected
  • Make the Ctrl-C return result faster. I'm slightly confuse why ctrl-c a process takes couple of seconds to clear as it should almost be instant.

@haixuanTao
Copy link
Collaborator

haixuanTao commented Nov 7, 2024

Screencast.from.2024-11-07.05-28-42.webm

In this example, I hit ctrl-C at t=10
and the dataflow finish 10s later.

Runs a dataflow locally, without requiring any any daemon or coordinator processes.

This exposes the internal `dora daemon --run-dataflow` command that we use for testing.

This addition was proposed in https://github.com/orgs/dora-rs/discussions/698#discussioncomment-11125465 .
On first ctrl-c, send a stop command to all nodes. On second ctrl-c, exit immediately and kill all spawned nodes. On third ctrl-c, abort the process directly without waiting (child processes keep running).

This change affects both `dora run` and `dora daemon` commands.
@phil-opp
Copy link
Collaborator Author

Rebased original commits on top of main.

@phil-opp
Copy link
Collaborator Author

But could be good to:

* Not print error on SIGINT ( Ctrl-C ) as it is what is expected

* Make the Ctrl-C return result faster. I'm slightly confuse why ctrl-c a process takes couple of seconds to clear as it should almost be instant.

See the description of the second commit:

On first ctrl-c, send a stop command to all nodes. On second ctrl-c, exit immediately and kill all spawned nodes. On third ctrl-c, abort the process directly without waiting (child processes keep running).

The reason that the ctrl-c takes multiple seconds in your case is that your dataflow doesn't react to the stop command properly for some reason (the warning message indicates a connection with #625). The nodes are then killed when the grace period runs out. This is considered an abnormal exit, so we print an error.

Could you try pressing ctrl-c twice? This should lead to an immediate kill and no additional error.

@haixuanTao
Copy link
Collaborator

So, when I press Ctrl-C, all process receive a SIGINT on the first ctrl-c. And therefore they return an error.

Isn't it what your experiencing as well?

@phil-opp
Copy link
Collaborator Author

Usually only the foreground process receives the ctrl-c. The spawned child processes should continue running. It's also how it behaves on my machine.

Could you paste the complete terminal output here? In your video I only see the end of it.

For me it looks like this when I do a single ctrl-c with the python-dataflow example:

2024-11-13T14:07:36.656457Z  INFO run_inner: dora_daemon: received ctrlc signal -> stopping all dataflows self.machine_id=
2024-11-13T14:07:51.792383Z  WARN dora_daemon: object-detection was killed due to not stopping within the 15s grace period
2024-11-13T14:07:51.810161Z  WARN dora_daemon: plot was killed due to not stopping within the 15s grace period
[...]
[ERROR]
Dataflow failed:

Node `camera` failed: exited because of signal SIGINT with stderr output:
[...]
Node `plot` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)
Node `object-detection` failed: node was killed by dora because it didn't react to a stop message in time (SIGKILL)

@haixuanTao
Copy link
Collaborator

Ah i see!

So, it might be an OS thing. This is what I have:

(lebai) ~/D/w/d/e/rust-dataflow ❯❯❯ dora run dataflow.yml                                                                                    (lebai) dora-run ✭ ◼
2024-11-14T01:04:43.443567Z  INFO run_inner: dora_daemon::spawn: spawning: /home/peter/Documents/work/dora/target/debug/rust-dataflow-example-node self.machine_id=
2024-11-14T01:04:43.444772Z  INFO run_inner: dora_daemon::spawn: spawning: /home/peter/Documents/work/dora/target/debug/rust-dataflow-example-status-node self.machine_id=
2024-11-14T01:04:43.445278Z  INFO run_inner: dora_daemon::spawn: spawning: /home/peter/Documents/work/dora/target/debug/rust-dataflow-example-sink self.machine_id=
2024-11-14T01:04:43.445739Z  INFO run_inner: dora_daemon::spawn: spawning: /home/peter/Documents/work/dora/target/debug/dora-record self.machine_id=
2024-11-14T01:04:43.468757Z  INFO run_inner: dora_daemon: all nodes are ready, starting dataflow `01932833-7db0-7b0b-bcfe-dc0c3016a200` self.machine_id=
^C2024-11-14T01:04:44.148733Z  INFO run_inner: dora_daemon: received ctrlc signal -> stopping all dataflows self.machine_id=
2024-11-14T01:04:44.154888Z  WARN dora_daemon::node_communication: failed to send NextFinishedDropTokens reply: NextDropEvents([])

Caused by:
   0: failed to send reply to node `rust-status-node`
   1: failed to send DaemonReply
   2: Broken pipe (os error 32)

Location:
    binaries/daemon/src/node_communication/tcp.rs:91:14
2024-11-14T01:04:44.156431Z  WARN dora_daemon::node_communication: failed to send NextFinishedDropTokens reply: NextDropEvents([])

Caused by:
   0: failed to send reply to node `dora-record`
   1: failed to send DaemonReply
   2: Broken pipe (os error 32)

Location:
    binaries/daemon/src/node_communication/tcp.rs:91:14
2024-11-14T01:04:44.296294Z  WARN dora_daemon::node_communication: failed to send NextFinishedDropTokens reply: NextDropEvents([])

Caused by:
   0: failed to send reply to node `rust-node`
   1: failed to send DaemonReply
   2: Broken pipe (os error 32)

Location:
    binaries/daemon/src/node_communication/tcp.rs:91:14
2024-11-14T01:04:44.297052Z  INFO run_inner: dora_daemon: Dataflow `01932833-7db0-7b0b-bcfe-dc0c3016a200` finished on machine `` self.machine_id=
2024-11-14T01:04:44.297065Z  INFO run_inner: dora_daemon: exiting daemon because all required dataflows are finished self.machine_id=
2024-11-14T01:04:44.297066Z  WARN dora_daemon::node_communication: failed to send NextFinishedDropTokens reply: NextDropEvents([])

Caused by:
   0: failed to send reply to node `rust-sink`
   1: failed to send DaemonReply
   2: Broken pipe (os error 32)

Location:
    binaries/daemon/src/node_communication/tcp.rs:91:14


[ERROR]
Dataflow failed:

Node `rust-status-node` failed: exited because of signal SIGINT
Node `dora-record` failed: exited because of signal SIGINT
Node `rust-node` failed: exited because of signal SIGINT
Node `rust-sink` failed: exited because of signal SIGINT
Screencast.from.2024-11-14.02-04-27.webm

I'm using rust node as it might be easier to debug on.

@phil-opp
Copy link
Collaborator Author

So, it might be an OS thing.

I'm on an ubuntu-based Linux distro. You?

@haixuanTao
Copy link
Collaborator

I have fixed my issue with #712

@phil-opp phil-opp enabled auto-merge November 14, 2024 16:09
@phil-opp phil-opp merged commit 70ca30d into main Nov 14, 2024
51 of 52 checks passed
@phil-opp phil-opp deleted the dora-run branch November 14, 2024 16:25
@haixuanTao
Copy link
Collaborator

So FYI, there was a race condition that was happening because of the node process stopping before the daemon could send the stop message and so the daemon was stuck.

By splitting process group, this race condition does not happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants