Run on Google Cloud nodes
Let's test trees of different depths (varying the kary fanout). We first want to understand the tree structures that Flux generates depending on the topology spec and the number of nodes (a quick way to inspect the generated tree is shown after this list). Then, for each depth we will test:
- Distribution from the root to all leaves (lowest level)
- Distribution from the root to all nodes (regardless of level)
- Distribution from the root to the middle level, and then to the leaves
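For reference, Flux exposes the tree fanout as the tbon.topo broker attribute (values like kary:2), so once inside a running instance we should be able to confirm the spec and eyeball the resulting tree with something like:

flux getattr tbon.topo     # e.g. kary:2
flux overlay status        # prints the overlay (TBON) tree with per-rank status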
We want to see whether one strategy is more efficient, and then we want to be able to combine Flux, a snapshotter, and possibly a CSI driver to distribute large files in Kubernetes. The idea is that:
- The snapshotter (rank 0) would retrieve from the registry
- Rank 0 would distribute to workers (see the sketch after this list)
- The other ranks would have a CSI to bind to the node.
We could also JUST use a snapshotter OR the CSI.
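As a rough sketch of the "rank 0 distributes to workers" step, Flux's archive facility can map a file into the content store on rank 0 and let every other rank pull it through the tree. This assumes a flux-core new enough to ship flux archive (older releases used flux filemap), and the paths here are just placeholders:

# Sketch only - placeholder paths, check flux-archive(1) for exact options on your flux-core version
flux archive create -C /tmp big-dataset.tar            # on rank 0: map the file into the content store
flux exec -r all -x 0 flux archive extract -C /tmp     # remaining ranks fetch it through the TBON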
This advice comes from garlick:
The performance will be sensitive to the tree fanout because each level of the tree will fetch data once from its parent, then provide it once to each child that requests it. That would assume perfect caching, but the LRU cache tries to maintain itself below 16MB, so for large amounts of data the cache may thrash a bit. If you want to play with that limit, you could do something like:
flux module reload content purge-target-size=104857600 # 100 MiB, on the local broker rank
flux exec -r all flux module reload content purge-target-size=104857600 # 100 MiB, on all broker ranks
Not sure what effect that would have, since it kind of depends on how the timing works out. You can peek at the cache size with:
flux module stats content | jq
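Since each level of the tree keeps its own cache, it may also be worth peeking at the stats on every rank rather than just the one you are connected to (assuming jq is available on each node):

flux exec -r all sh -c 'flux module stats content | jq .'
# To go back to the default cache limit later, reload the module without the option
flux exec -r all flux module reload content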
Note that we likely want to update the size 30 experiment by:
- Adding a step to unset (reset) the content cache limit
- Having the data file generated programmatically instead of building it into the container, which would get large (see the sketch after this list)
- Testing a smaller number of kary sizes (1, 2, and then possibly evens up to the largest size)
- Also going up by even increments for the GB sizes - it takes too long to do every single one!
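As a sketch of the programmatic data generation (sizes assumed to be in GB; the output path is a placeholder):

# Generate an N GB file at runtime instead of baking it into the image
SIZE_GB=2
dd if=/dev/urandom of=/tmp/data-${SIZE_GB}gb.bin bs=1M count=$((SIZE_GB * 1024))
# or, if the content does not matter, fallocate is much faster
fallocate -l ${SIZE_GB}G /tmp/data-${SIZE_GB}gb.bin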
Note that the flux-design* files were generated in the kind-experiment and you will need them here.
Run the experiments!
First we will do a max size of 2 on 6 nodes.
time gcloud container clusters create test-cluster \
--threads-per-core=1 \
--num-nodes=6 \
--machine-type=c2d-standard-32 \
--enable-gvnic \
--network=mtu9k \
--placement-type=COMPACT \
--region=us-central1-a \
--project=${GOOGLE_PROJECT}
kubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/refs/heads/main/examples/dist/flux-operator.yaml
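Before launching the experiment, it is worth confirming the operator pod is running (I believe the operator installs into the operator-system namespace):

kubectl get pods -n operator-system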
python run-experiment.py --data ./kary-designs.json --max-nodes=6 --max-size=2 --data-dir ./data/raw-max-6 --template ./templates/minicluster-test.yaml
time gcloud container clusters delete test-cluster --region=us-central1-a
python run-analysis.py --out ./data/parsed-max-6 --data ./data/raw-max-6
Experiments are done!
total time to run is 5793.205878019333 seconds
Next, let's just test a large cluster size (30 nodes).
time gcloud container clusters create test-cluster \
--threads-per-core=1 \
--num-nodes=30 \
--machine-type=c2d-standard-32 \
--enable-gvnic \
--network=mtu9k \
--placement-type=COMPACT \
--region=us-central1-a \
--project=${GOOGLE_PROJECT}
kubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/refs/heads/main/examples/dist/flux-operator.yaml
python run-experiment.py --data ./kary-designs.json --max-size=2 --exact-nodes=30 --data-dir ./data/raw-exact-30 --template ./templates/minicluster-test.yaml
time gcloud container clusters delete test-cluster --region=us-central1-a
python run-analysis.py --out ./data/parsed-exact-30 --data ./data/raw-exact-30
Experiments are done!
total time to run is 9767.250812530518 seconds
Google Cloud was issuing an error, so I switched to AWS and it went away.
eksctl create cluster --config-file ./eks-config-6.yaml
aws eks update-kubeconfig --region us-east-2 --name topology-study
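A quick sanity check that all 6 nodes registered before installing the operator:

kubectl get nodes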
kubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/refs/heads/main/examples/dist/flux-operator.yaml
# Don't bother with smaller node counts, just 6
python run-experiment.py --data ./kary-designs.json --exact-nodes=6 --min-size=1 --max-size=10 --data-dir ./data/raw-exact-6-aws --template ./templates/minicluster.yaml --iters 3
eksctl delete cluster --config-file ./eks-config-6.yaml --wait
python run-analysis.py --out ./data/parsed-exact-6-aws --data ./data/raw-exact-6-aws
Experiments (N=12) are done!
total time to run is 9715.973370552063 seconds
For this updated setup without a view, here is how to connect to the broker's socket:
flux proxy local:///mnt/flux/view/run/flux/local bash
flux dmesg
flux module stats content | jq
For cost, these nodes are 0.6160/hour each, so 0.6160 * 30 == 18.48/hour for the cluster. For the size 30 test, we previously did 16 runs with 5 iterations each. The 6 node test did 12 runs and took 161 minutes; assuming the same per-run time (which we can't really, since that experiment is about half the size), the same design (16 runs with 5 iterations each) would take ~17 hours, which is too long. If we instead assume 16 kary designs * 2 iterations each * 2.02375 minutes per iteration, that's 64.76 minutes. Since we just want a sample, let's start with just one iteration for 30 nodes (and time it); we can always do another one.
For actual timings:
- the cluster is 18.48/hour
- the first run takes 20 minutes because of the image pull
- subsequent runs take 13 minutes
- that means for 1 iteration and 16 topologies, the experiment should take 208 minutes (call it 215 to account for the slower first run), and thus (215 / 60) * 18.48 == $66.22, which we can round up to $70 (quick arithmetic check below). I was aiming for under $100, so that is within budget.
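The arithmetic above, as a quick check (all rates and timings are the estimates from the list):

echo "0.6160 * 30" | bc                    # 18.48 dollars/hour for the cluster
echo "20 + 15 * 13" | bc                   # 215 minutes: one slow first run plus 15 more at ~13 minutes
echo "scale=4; (215 / 60) * 18.48" | bc    # ~66.22 dollars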
time eksctl create cluster --config-file ./eks-config-30.yaml
2024-12-01 08:04:14 [ℹ] cluster should be functional despite missing (or misconfigured) client binaries
2024-12-01 08:04:14 [✔] EKS cluster "topology-study" in "us-east-2" region is ready
real 15m46.129s
user 0m0.420s
sys 0m0.189s
aws eks update-kubeconfig --region us-east-2 --name topology-study
kubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/refs/heads/main/examples/dist/flux-operator.yaml
python run-experiment.py --data ./kary-designs.json --exact-nodes=30 --min-size=1 --max-size=10 --data-dir ./data/raw-exact-30-aws --template ./templates/minicluster.yaml --iters 1
eksctl delete cluster --config-file ./eks-config-30.yaml --wait
python run-analysis.py --out ./data/parsed-exact-30-aws --data ./data/raw-exact-30-aws
Experiments (N=16) are done!
total time to run is 12480.895931243896 seconds
For cost, these are 0.6160 * 256 == $157.70/hour. I am just going to do one iteration here, the idea being that we want to see the result for this larger test. Let's limit the kary designs to two (kary:1 and kary:16) and start with one iteration each; we can increase that if the runs are quicker than we expect.
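Quick check of the hourly rate at this cluster size:

echo "0.6160 * 256" | bc    # 157.6960, i.e. about $157.70/hour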
# Going up at 11:00
time eksctl create cluster --config-file ./eks-config-256.yaml
aws eks update-kubeconfig --region us-east-2 --name topology-study
# Note that topology does not work for these instances
aws ec2 describe-instances --filters "Name=instance-type,Values=c5a.4xlarge" --region us-east-2 > aws-instances-256.json
kubectl apply -f https://raw.githubusercontent.com/flux-framework/flux-operator/refs/heads/main/examples/dist/flux-operator.yaml
time python run-experiment.py --data ./kary-designs.json --exact-nodes=256 --min-size=1 --max-size=10 --data-dir ./data/raw-exact-256-aws --template ./templates/minicluster.yaml --topo kary:1 --topo kary:16 --iters 1
[{'nodes': 256, 'topo': 'kary:1'}, {'nodes': 256, 'topo': 'kary:16'}]
🧪️ Experiments:
🪴️ Planning to run:
Output Data : ./data/raw-exact-256-aws
Experiments : 2
Exact Nodes : 256
Min Size : 1
Max Size : 10
Iters : 1
Would you like to continue? (yes/no)? yes
== Running experiment {'nodes': 256, 'topo': 'kary:1'}: 0 of 2
🍔 Running topology experiment size 256
minicluster.flux-framework.org/flux-sample created
job.batch/flux-sample condition met
Writing topology log and recordings to ./data/raw-exact-256-aws/256/kary-1/0/topology-experiment.out
minicluster.flux-framework.org "flux-sample" deleted
== Running experiment {'nodes': 256, 'topo': 'kary:16'}: 1 of 2
🍔 Running topology experiment size 256
minicluster.flux-framework.org/flux-sample created
job.batch/flux-sample condition met
Writing topology log and recordings to ./data/raw-exact-256-aws/256/kary-16/0/topology-experiment.out
minicluster.flux-framework.org "flux-sample" deleted
Experiments (N=2) are done!
total time to run is 4008.657091140747 seconds
real 66m58.074s
user 0m6.175s
sys 0m1.354s
And then:
eksctl delete cluster --config-file ./eks-config-256.yaml --wait
python run-analysis.py --out ./data/parsed-exact-256-aws --data ./data/raw-exact-256-aws
Let's look at 6 "nodes". This will allow for more "kary" designs. This run completed without any bugs, and I removed the middle distribution layer to reduce experiment running time.
Based on the time, that did come out to cost what I estimated! We likely should discuss what we see and decide what to test next. I'm not seeing any differences with respect to distribution, but maybe for creation? We would also want to compare this strategy against each node just downloading an archive of the same size.
This took about 66 minutes for just 1 iteration of 2 topologies. The result is flipped from what I would expect (and I checked the data; kary-16 was indeed slower).