Optimize work-stealing deque #124

Merged
merged 6 commits into from
Nov 23, 2024

Contributor

@polytypic polytypic commented Jan 26, 2024

This PR optimizes the work-stealing deque.

The graphs on current-bench show the progress of optimizing the work-stealing deque quite nicely:

[image: current-bench graphs for the work-stealing deque benchmarks]

Here is a run of the benchmarks on my M3 Max before the optimizations:

➜  saturn git:(rewrite-bench) ✗ dune exec --release -- ./bench/main.exe -budget 1 'Work' | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[                                    
  {
    "name": "spawns over time/1 worker",
    "value": 49.86378542204304
  },
  {
    "name": "spawns over time/2 workers",
    "value": 44.30827252087946
  },
  {
    "name": "spawns over time/4 workers",
    "value": 84.66239655969726
  },
  {
    "name": "spawns over time/8 workers",
    "value": 177.69250680679536
  }
]

Here is a run after the optimizations:

➜  saturn git:(optimize-ws-deque) ✗ dune exec --release -- ./bench/main.exe -budget 1 'Work' | jq '[.results.[].metrics.[] | select(.name | test("over")) | {name, value}]'
[                                     
  {
    "name": "spawns over time/1 worker",
    "value": 61.20210013480126
  },
  {
    "name": "spawns over time/2 workers",
    "value": 121.42420051856216
  },
  {
    "name": "spawns over time/4 workers",
    "value": 235.72974008156237
  },
  {
    "name": "spawns over time/8 workers",
    "value": 428.2572470382373
  }
]

General approach:

  1. Add benchmark(s). In this case the benchmark simulates a scheduler running parallel Fibonacci.
  2. Avoid false sharing. With false sharing there will be too much noise to get any other useful optimizations done. In this case the top and bottom indices and the root record had to be padded to avoid false sharing. This already gave a big performance improvement.
  3. Avoid other forms of contention. Contention tends to mask any benefits from further optimizations. In this case there wasn't much of that except for avoiding some unnecessary loads and stores. With parallel data structures, any reads and writes of shared locations (atomic or non-atomic) should be viewed with suspicion.
  4. Apply the usual micro-optimizations, such as avoiding float array pessimization, indirections, costly operations, dependency chains, unnecessary work, and unnecessary fences, and making the common case fast (at the expense of the less common case). Unless the main contention issues are addressed first, micro-optimizations are difficult to make, because noise and stalls from contention mask the improvements.
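
As a rough illustration of step 2, the sketch below keeps two hot counters on separate cache lines. It assumes OCaml 5.2 or later, where the standard library provides Atomic.make_contended; the code in this PR uses Multicore_magic.copy_as_padded for the same purpose. The names top, bottom, bump, and iterations are only stand-ins, and this is not the deque's actual code.

```ocaml
(* Sketch only: assumes OCaml 5.2+ for Atomic.make_contended. *)

(* Each counter gets its own cache line, so a domain hammering one of them
   does not invalidate the line holding the other. *)
let top = Atomic.make_contended 0
let bottom = Atomic.make_contended 0

let iterations = 100_000

(* Increment [counter] [iterations] times. *)
let bump counter =
  for _ = 1 to iterations do
    Atomic.incr counter
  done

let () =
  (* One spawned domain plays the thief side, the main domain the owner. *)
  let thief = Domain.spawn (fun () -> bump top) in
  bump bottom;
  Domain.join thief;
  Printf.printf "top=%d bottom=%d\n" (Atomic.get top) (Atomic.get bottom)
```

Swapping Atomic.make_contended for Atomic.make gives the unpadded variant to compare against in a benchmark.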

@polytypic polytypic force-pushed the optimize-ws-deque branch 19 times, most recently from ff22c9f to 1c98c25 Compare January 27, 2024 08:02
@polytypic polytypic marked this pull request as ready for review January 27, 2024 08:02
@polytypic polytypic requested a review from a team January 27, 2024 08:04
@polytypic polytypic force-pushed the optimize-ws-deque branch 4 times, most recently from 0d2e6ea to 1cfc137 Compare February 17, 2024 17:20
Collaborator

@lyrm lyrm left a comment

Thanks for the PR.

I haven't noticed any issues with these changes. I will push a PR soon with more DSCheck tests for this queue, though. The ones that we currently have were written for a way slower version of DSCheck, and we should be able to do way more now :)

Comment on lines 86 to 88
(*
WSDT_dom.neg_agree_test_par ~count
~name:"STM Saturn_lockfree.Ws_deque test parallel, negative"; *)
Collaborator

I would say that this should either be removed, or a note should be added explaining why running such a test is a bad idea.

Contributor Author

Yes, the test was causing problems at one point, which is why I commented it out. I was planning to take another look to understand the issue better. The work-stealing deque data structure is specifically designed to take advantage of the asymmetry between the (single) owner and the (multiple) thieves. That negative test specifically uses the work-stealing deque in a manner in which it is not intended to work. So, I'm inclined to simply remove the test.

Contributor Author

@polytypic polytypic Feb 28, 2024

So, with this change to the code in this PR

modified   src_lockfree/ws_deque.ml
@@ -54,7 +54,7 @@ module M : S = struct
   let create () =
     let top = Atomic.make 0 |> Multicore_magic.copy_as_padded in
     let bottom = Atomic.make 0 |> Multicore_magic.copy_as_padded in
-    let tab = Array.make min_capacity (Obj.magic ()) in
+    let tab = Array.make min_capacity (ref (Obj.magic ())) in
     { top; bottom; tab } |> Multicore_magic.copy_as_padded
 
   let realloc a t b sz new_sz =

the negative test passes.

The original code (before this PR) also has the same behavior:

tab = Atomic.make (CArray.create min_size (Obj.magic ()));

In other words, the circular array is initialized with Obj.magic () values rather than ref (Obj.magic ()) values. I haven't analyzed why exactly the original code doesn't crash, but I suspect it is purely out of luck.

Even with the ref (Obj.magic ()) things are not guaranteed to work. If I change the test to use floats, i.e.

     QCheck.make ~print:show_cmd
       (Gen.oneof
          [
-           Gen.map (fun i -> Push i) int_gen;
+           Gen.map (fun i -> Push i) Gen.float;
            Gen.return Pop;
            (*Gen.return Steal;*)

(with corresponding changes to use float instead of int) the negative test will cause a segmentation fault even with ref (Obj.magic ()). I would expect the original code to also exhibit a segmentation fault with this change.

The way I see it, the whole point of this particular data structure is to exploit the asymmetry of having one owner and multiple thieves. If we instead chose to make this data structure safe to use with multiple owners, we would be implementing a fundamentally different data structure (i.e. a multi consumer, single producer deque). Implementing a data structure that doesn't cause a segmentation fault, but instead returns wrong results in case of misuse, is not something I would recommend. So, personally, I would just embrace the fact that calling pop in parallel from multiple domains is not safe and remove the negative test.

Collaborator

I guess we could also have a "safe" implementation that raises an error in case of wrong use (a thief calling push or pop), but I am guessing this could have quite a cost in terms of performance.

Contributor Author

Yes, I would expect the cost (of detecting misuse) to be relatively high if it needs to be reliable. It would be great if misuse could be detected for free and an error produced reliably. In all likelihood, detecting misuse reliably would be expensive, and detecting misuse only partially while avoiding segmentation faults would likely also be expensive and could lead to the program silently giving incorrect results, which I consider worse than crashing.
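
For concreteness, one hypothetical shape such a check could take is to record the creating domain and compare it against Domain.self () on every owner-only operation. Everything here (the checked type, check_owner, and the plain list standing in for the deque) is made up for illustration. The sketch catches owner operations called from another domain, but it says nothing about how much the extra check would cost on the real deque.

```ocaml
(* Hypothetical owner-checking wrapper. The list is a stand-in for the
   deque's storage, not Saturn's implementation. Requires OCaml 5. *)
type 'a checked = {
  owner : int;              (* id of the domain that created the value *)
  mutable items : 'a list;  (* owner-only storage *)
}

let create () = { owner = (Domain.self () :> int); items = [] }

(* Raise if the calling domain is not the one that created [t]. *)
let check_owner t =
  if (Domain.self () :> int) <> t.owner then
    invalid_arg "owner-only operation called from another domain"

let push t x =
  check_owner t;
  t.items <- x :: t.items

let pop_opt t =
  check_owner t;
  match t.items with
  | [] -> None
  | x :: rest ->
      t.items <- rest;
      Some x
```

A call such as push from a spawned domain then fails with Invalid_argument instead of corrupting state.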

@polytypic polytypic force-pushed the optimize-ws-deque branch 3 times, most recently from e409e1a to 5bd372d Compare March 3, 2024 12:00
@polytypic polytypic force-pushed the optimize-ws-deque branch 12 times, most recently from 204ae86 to 6052cd0 Compare March 3, 2024 19:24
Contributor Author

polytypic commented Mar 3, 2024

I created a PR adding new benchmarks for the work-stealing deque #130 and I added a commit to this PR to optimize the work-stealing deque to avoid performance pitfalls on those benchmarks.

Here is a run from before adding the commit:

➜  saturn git:(optimize-ws-deque) ✗ dune exec --release -- ./bench/main.exe -budget 1 -diff bench-high.json deque
Saturn_lockfree Work_stealing_deque: 
  time per spawn/1 worker:
    16.60 ns = 0.85 x 19.52 ns
  spawns over time/1 worker:
    60.23 M/s = 1.18 x 51.23 M/s
  time per spawn/2 workers:
    16.73 ns = 0.37 x 45.64 ns
  spawns over time/2 workers:
    119.58 M/s = 2.73 x 43.82 M/s
  time per spawn/4 workers:
    17.38 ns = 0.51 x 34.40 ns
  spawns over time/4 workers:
    230.12 M/s = 1.98 x 116.27 M/s
  time per spawn/8 workers:
    18.65 ns = 0.28 x 65.65 ns
  spawns over time/8 workers:
    429.00 M/s = 3.52 x 121.86 M/s
  time per message/1 adder, 1 taker:
    110.91 ns = 1.07 x 103.71 ns
  messages over time/1 adder, 1 taker:
    18.03 M/s = 0.94 x 19.28 M/s
  time per message/1 adder, 2 takers:
    167.84 ns = 0.90 x 187.33 ns
  messages over time/1 adder, 2 takers:
    17.87 M/s = 1.12 x 16.01 M/s
  time per message/1 adder, 4 takers:
    261.86 ns = 0.94 x 278.85 ns
  messages over time/1 adder, 4 takers:
    19.09 M/s = 1.06 x 17.93 M/s
  time per message/one domain (FIFO):
    16.34 ns = 0.91 x 17.86 ns
  messages over time/one domain (FIFO):
    61.18 M/s = 1.09 x 55.98 M/s
  time per message/one domain (LIFO):
    18.95 ns = 1.00 x 18.89 ns
  messages over time/one domain (LIFO):
    52.78 M/s = 1.00 x 52.93 M/s

Here is a run after the commit:

➜  saturn git:(optimize-ws-deque) ✗ dune exec --release -- ./bench/main.exe -budget 1 -diff bench-high.json deque
Saturn_lockfree Work_stealing_deque: 
  time per spawn/1 worker:
    16.60 ns = 0.85 x 19.52 ns
  spawns over time/1 worker:
    60.26 M/s = 1.18 x 51.23 M/s
  time per spawn/2 workers:
    16.66 ns = 0.37 x 45.64 ns
  spawns over time/2 workers:
    120.04 M/s = 2.74 x 43.82 M/s
  time per spawn/4 workers:
    17.29 ns = 0.50 x 34.40 ns
  spawns over time/4 workers:
    231.33 M/s = 1.99 x 116.27 M/s
  time per spawn/8 workers:
    18.60 ns = 0.28 x 65.65 ns
  spawns over time/8 workers:
    430.18 M/s = 3.53 x 121.86 M/s
  time per message/1 adder, 1 taker:
    35.15 ns = 0.34 x 103.71 ns
  messages over time/1 adder, 1 taker:
    56.89 M/s = 2.95 x 19.28 M/s
  time per message/1 adder, 2 takers:
    47.34 ns = 0.25 x 187.33 ns
  messages over time/1 adder, 2 takers:
    63.37 M/s = 3.96 x 16.01 M/s
  time per message/1 adder, 4 takers:
    85.12 ns = 0.31 x 278.85 ns
  messages over time/1 adder, 4 takers:
    58.74 M/s = 3.28 x 17.93 M/s
  time per message/one domain (FIFO):
    16.61 ns = 0.93 x 17.86 ns
  messages over time/one domain (FIFO):
    60.21 M/s = 1.08 x 55.98 M/s
  time per message/one domain (LIFO):
    15.84 ns = 0.84 x 18.89 ns
  messages over time/one domain (LIFO):
    63.12 M/s = 1.19 x 52.93 M/s

Performance on the SPMC style benchmarks improved significantly.

The optimization in the last commit (caching the index used by thieves) is mentioned in the original paper that introduces the work-stealing deque (section 2.3 Avoid top accesses in pushBottom).
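
A minimal sketch of the idea, under simplifying assumptions: the capacity is a fixed power of two instead of growing, elements are stored as options, there is no owner-side pop, and the field name top_cache is invented here. Because top only ever increases, a stale copy of it gives a conservative upper bound on the size, so push only needs to read the shared top when the deque looks full.

```ocaml
(* Simplified sketch, not the code from this PR. *)
type 'a t = {
  top : int Atomic.t;       (* advanced by thieves *)
  bottom : int Atomic.t;    (* written only by the owner *)
  mutable top_cache : int;  (* owner-private, possibly stale copy of [top] *)
  tab : 'a option array;    (* capacity must be a power of two *)
}

let create capacity =
  { top = Atomic.make 0; bottom = Atomic.make 0; top_cache = 0;
    tab = Array.make capacity None }

(* Owner only. The shared [top] is read just when the stale bound says
   the array might be full. *)
let push q x =
  let b = Atomic.get q.bottom in
  let cap = Array.length q.tab in
  if b - q.top_cache >= cap then begin
    q.top_cache <- Atomic.get q.top;
    if b - q.top_cache >= cap then
      failwith "full (a real deque would grow here)"
  end;
  q.tab.(b land (cap - 1)) <- Some x;
  (* The atomic store publishes the slot written above. *)
  Atomic.set q.bottom (b + 1)

(* Thieves: take the oldest element, retrying if another thief wins. *)
let rec steal_opt q =
  let t = Atomic.get q.top in
  let b = Atomic.get q.bottom in
  if b - t <= 0 then None
  else
    let x = q.tab.(t land (Array.length q.tab - 1)) in
    if Atomic.compare_and_set q.top t (t + 1) then x else steal_opt q
```

In the common case push touches only bottom, the owner-private top_cache, and one array slot, which is the access pattern the SPMC benchmarks exercise.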

@polytypic polytypic force-pushed the optimize-ws-deque branch 2 times, most recently from d36ac07 to 8eb58ee Compare March 13, 2024 07:46
@polytypic polytypic force-pushed the optimize-ws-deque branch from 8eb58ee to fbcbd98 Compare April 2, 2024 16:19
@polytypic polytypic force-pushed the optimize-ws-deque branch 4 times, most recently from ca1f35e to cd21775 Compare August 11, 2024 08:52
@lyrm lyrm force-pushed the optimize-ws-deque branch from 98f99bd to 5f0534d Compare November 17, 2024 18:53
polytypic and others added 4 commits November 23, 2024 17:29
- Add padding to avoid false sharing
- Use a GADT to express desired result type
- Use various tweaks to improve performance
- Remove negative test that uses the WS deque in an invalid unsafe way
- Implement caching of the thief side index
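
The GADT item in the list above could look roughly like the following sketch, in which a single implementation serves both an option-returning and a raising variant. The names result_kind and pop_as are hypothetical, a Stdlib Stack stands in for the deque, and Exit stands in for whatever exception the library actually raises.

```ocaml
(* Hypothetical illustration; a Stdlib [Stack] stands in for the deque. *)
type ('a, _) result_kind =
  | Option : ('a, 'a option) result_kind  (* report emptiness as [None] *)
  | Value : ('a, 'a) result_kind          (* raise [Exit] when empty *)

(* One implementation; the GADT argument selects the result type [r]. *)
let pop_as : type a r. a Stack.t -> (a, r) result_kind -> r =
 fun s kind ->
  match Stack.pop_opt s with
  | Some x -> (match kind with Option -> Some x | Value -> x)
  | None -> (match kind with Option -> None | Value -> raise Exit)

let pop_exn s = pop_as s Value
let pop_opt s = pop_as s Option
```

The two public functions share one code path, and only the final step of wrapping or raising differs.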
@lyrm lyrm force-pushed the optimize-ws-deque branch from 5f0534d to 29a52e6 Compare November 23, 2024 16:30
@lyrm lyrm merged commit 2035010 into main Nov 23, 2024
6 checks passed