custom: add support for custom container (#84)
* custom: add support for custom container

We should be able to support custom containers, and
configuration of addons to them. I am not liking the
design to have addons defined in parallel, and want to
refactor so they are part of the metric. I am also
wondering if the metrics themselves are more akin to
apps. I have not looked at this project in a bit and
need to think about it.

Signed-off-by: vsoch <[email protected]>
vsoch authored Sep 24, 2024
1 parent 8835f15 commit 8f15755
Showing 23 changed files with 340 additions and 76 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/build-deploy.yaml
@@ -18,7 +18,7 @@ jobs:
uses: actions/checkout@v3
- uses: actions/setup-go@v3
with:
go-version: ^1.18.1
go-version: ^1.22
- name: GHCR Login
if: (github.event_name != 'pull_request')
uses: docker/login-action@v2
@@ -48,7 +48,7 @@ jobs:
uses: actions/checkout@v3
- uses: actions/setup-go@v3
with:
go-version: ^1.18.1
go-version: ^1.22
- name: GHCR Login
if: (github.event_name != 'pull_request')
uses: docker/login-action@v2
2 changes: 1 addition & 1 deletion .github/workflows/helm.yaml
@@ -16,7 +16,7 @@ jobs:
uses: actions/checkout@v3
- uses: actions/setup-go@v3
with:
go-version: ^1.18.1
go-version: ^1.22
- name: GHCR Login
if: (github.event_name != 'pull_request')
uses: docker/login-action@v2
4 changes: 2 additions & 2 deletions .github/workflows/main.yaml
@@ -34,7 +34,7 @@ jobs:
- name: Setup Go
uses: actions/setup-go@v3
with:
go-version: ^1.20
go-version: ^1.22
- name: fmt check
run: make fmt

@@ -88,7 +88,7 @@ jobs:
- name: Setup Go
uses: actions/setup-go@v3
with:
go-version: ^1.20
go-version: ^1.22

- name: Start minikube
uses: medyagh/setup-minikube@697f2b7aaed5f70bf2a94ee21a4ec3dde7b12f92 # v0.0.9
2 changes: 1 addition & 1 deletion .github/workflows/python.yaml
@@ -34,7 +34,7 @@ jobs:
- name: Setup Go
uses: actions/setup-go@v3
with:
go-version: ^1.20
go-version: ^1.22

- name: Start minikube
uses: medyagh/setup-minikube@697f2b7aaed5f70bf2a94ee21a4ec3dde7b12f92 # v0.0.9
6 changes: 3 additions & 3 deletions .github/workflows/release.yaml
@@ -21,7 +21,7 @@ jobs:
echo "tag=${{ inputs.release_tag }}" >> ${GITHUB_ENV}
- uses: actions/setup-go@v3
with:
go-version: ^1.20
go-version: ^1.22
- name: GHCR Login
uses: docker/login-action@v2
with:
@@ -51,7 +51,7 @@ jobs:
uses: actions/checkout@v3
- uses: actions/setup-go@v3
with:
go-version: ^1.20
go-version: ^1.22
- name: Set tag
run: |
echo "Tag for release is ${{ inputs.release_tag }}"
@@ -86,7 +86,7 @@ jobs:
uses: actions/checkout@v3
- uses: actions/setup-go@v3
with:
go-version: ^1.20
go-version: ^1.22
- name: Set tag
run: |
echo "Tag for release is ${{ inputs.release_tag }}"
2 changes: 1 addition & 1 deletion Dockerfile
@@ -1,5 +1,5 @@
# Build the manager binary
FROM golang:1.20 as builder
FROM golang:1.22 as builder
ARG TARGETOS
ARG TARGETARCH

16 changes: 15 additions & 1 deletion Makefile
@@ -173,6 +173,19 @@ deploy: manifests kustomize ## Deploy controller to the K8s cluster specified in
undeploy: ## Undeploy controller from the K8s cluster specified in ~/.kube/config. Call with ignore-not-found=true to ignore resource not found errors during deletion.
$(KUSTOMIZE) build config/default | kubectl delete --ignore-not-found=$(ignore-not-found) -f -


.PHONY: test-deploy
test-deploy: manifests kustomize
docker build --no-cache -t ${DEVIMG} .
docker push ${DEVIMG}
cd config/manager && $(KUSTOMIZE) edit set image controller=${DEVIMG}
$(KUSTOMIZE) build config/default > examples/dist/metrics-operator-dev.yaml

.PHONY: test-deploy-recreate
test-deploy-recreate: test-deploy
kubectl delete -f ./examples/dist/metrics-operator-dev.yaml || echo "Already deleted"
kubectl apply -f ./examples/dist/metrics-operator-dev.yaml

##@ Build Dependencies

## Location to install dependencies to
@@ -187,7 +200,7 @@ ENVTEST ?= $(LOCALBIN)/setup-envtest

## Tool Versions
KUSTOMIZE_VERSION ?= v3.8.7
CONTROLLER_TOOLS_VERSION ?= v0.11.1
CONTROLLER_TOOLS_VERSION ?= v0.14.0

KUSTOMIZE_INSTALL_SCRIPT ?= "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh"
.PHONY: kustomize
@@ -205,6 +218,7 @@ $(CONTROLLER_GEN): $(LOCALBIN)
test -s $(LOCALBIN)/controller-gen && $(LOCALBIN)/controller-gen --version | grep -q $(CONTROLLER_TOOLS_VERSION) || \
GOBIN=$(LOCALBIN) go install sigs.k8s.io/controller-tools/cmd/controller-gen@$(CONTROLLER_TOOLS_VERSION)


.PHONY: envtest
envtest: $(ENVTEST) ## Download envtest-setup locally if necessary.
$(ENVTEST): $(LOCALBIN)
13 changes: 8 additions & 5 deletions api/v1alpha2/zz_generated.deepcopy.go

Some generated files are not rendered by default.

58 changes: 34 additions & 24 deletions config/crd/bases/flux-framework.org_metricsets.yaml
@@ -3,8 +3,7 @@ apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
annotations:
controller-gen.kubebuilder.io/version: v0.11.1
creationTimestamp: null
controller-gen.kubebuilder.io/version: v0.14.0
name: metricsets.flux-framework.org
spec:
group: flux-framework.org
@@ -21,14 +20,19 @@ spec:
description: MetricSet is the Schema for the metrics API
properties:
apiVersion:
description: 'APIVersion defines the versioned schema of this representation
of an object. Servers should convert recognized schemas to the latest
internal value, and may reject unrecognized values. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources'
description: |-
APIVersion defines the versioned schema of this representation of an object.
Servers should convert recognized schemas to the latest internal value, and
may reject unrecognized values.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#resources
type: string
kind:
description: 'Kind is a string value representing the REST resource this
object represents. Servers may infer this from the endpoint the client
submits requests to. Cannot be updated. In CamelCase. More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds'
description: |-
Kind is a string value representing the REST resource this object represents.
Servers may infer this from the endpoint the client submits requests to.
Cannot be updated.
In CamelCase.
More info: https://git.k8s.io/community/contributors/devel/sig-architecture/api-conventions.md#types-kinds
type: string
metadata:
type: object
@@ -37,21 +41,23 @@ spec:
properties:
deadlineSeconds:
default: 31500000
description: Should the job be limited to a particular number of seconds?
description: |-
Should the job be limited to a particular number of seconds?
Approximately one year. This cannot be zero or job won't start
format: int64
type: integer
dontSetFQDN:
description: Don't set JobSet FQDN
type: boolean
logging:
description: Logging spec, preparing for other kinds of logging Right
now we just include an interactive option
description: |-
Logging spec, preparing for other kinds of logging
Right now we just include an interactive option
properties:
interactive:
description: Don't allow the application, metric, or storage test
to finish This adds sleep infinity at the end to allow for interactive
mode.
description: |-
Don't allow the application, metric, or storage test to finish
This adds sleep infinity at the end to allow for interactive mode.
type: boolean
type: object
metrics:
@@ -60,15 +66,15 @@ spec:
items:
properties:
addons:
description: A Metric addon can be storage (volume) or an application,
It's an additional entity that can customize a replicated
job, either adding assets / features or entire containers
to the pod
description: |-
A Metric addon can be storage (volume) or an application,
It's an additional entity that can customize a replicated job,
either adding assets / features or entire containers to the pod
items:
description: 'A Metric addon is an interface that exposes
extra volumes for a metric. Examples include: A storage
volume to be mounted on one or more of the replicated jobs
A single application container.'
description: |-
A Metric addon is an interface that exposes extra volumes for a metric. Examples include:
A storage volume to be mounted on one or more of the replicated jobs
A single application container.
properties:
listOptions:
additionalProperties:
@@ -129,7 +135,9 @@ spec:
- type: string
x-kubernetes-int-or-string: true
type: array
description: Metric List Options Metric specific options
description: |-
Metric List Options
Metric specific options
type: object
mapOptions:
additionalProperties:
@@ -149,7 +157,9 @@ spec:
- type: integer
- type: string
x-kubernetes-int-or-string: true
description: Metric Options Metric specific options
description: |-
Metric Options
Metric specific options
type: object
resources:
description: Resources include limits and requests for the metric
1 change: 0 additions & 1 deletion config/rbac/role.yaml
@@ -2,7 +2,6 @@
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
creationTimestamp: null
name: manager-role
rules:
- apiGroups:
2 changes: 1 addition & 1 deletion controllers/metric/metric_controller.go
@@ -129,7 +129,7 @@ func (r *MetricSetReconciler) Reconcile(ctx context.Context, req ctrl.Request) (

// Ensure the metricset is mapped to a JobSet. For design:
// 1. If an application is provided, we pair the application at some scale with each metric as a contaienr
// 2. If storage is provided, we create the volumes for the metric containers
// 2. If storage or other addons are provided, we create the volumes for the metric containers
result, err := r.ensureMetricSet(ctx, &spec, &set)
if err != nil {
r.Log.Error(err, "🟥️ Issue ensuring metric set")
7 changes: 7 additions & 0 deletions docs/_static/data/metrics.json
@@ -20,6 +20,13 @@
"image": "ghcr.io/converged-computing/metric-cabanapic:latest",
"url": "https://github.com/ECP-copa/CabanaPIC"
},
{
"name": "app-custom",
"description": "Provide a custom application for MPI trace",
"family": "proxyapp",
"image": "",
"url": "https://converged-computing.github.io/metrics-operator"
},
{
"name": "app-hpl",
"description": "High-Performance Linpack (HPL)",
2 changes: 1 addition & 1 deletion docs/getting_started/addons.md
@@ -270,7 +270,7 @@ wrapper to the actual executable.

### perf-mpitrace

- *[perf-mpitrace](https://github.com/converged-computing/metrics-operator/tree/main/examples/addons/perf-mpitrace)*
- *[perf-mpitrace](https://github.com/converged-computing/metrics-operator/tree/main/examples/addons/mpitrace-lammps)*

This metric provides [mpitrace](https://github.com/IBM/mpitrace) to wrap an MPI application. The setup is the same as hpctoolkit, and we
currently only provide a rocky base (please let us know if you need another). It works by way of wrapping the mpirun command with `LD_PRELOAD`.
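
To make the `LD_PRELOAD` wrapping concrete, here is a minimal Python sketch of the idea. It is not the addon's actual implementation: the helper name `wrap_with_preload` and the library path `/opt/mnt/lib/libmpitrace.so` are assumptions chosen for illustration, and how the variable reaches every MPI rank depends on the MPI launcher in use.

```python
import shlex


def wrap_with_preload(command: str, library: str) -> str:
    """Return a shell command line that runs `command` with `library` preloaded."""
    # shlex.quote protects library paths that contain spaces or shell metacharacters
    return f"LD_PRELOAD={shlex.quote(library)} {command}"


# Hypothetical library location and application command (assumptions)
wrapped = wrap_with_preload(
    "mpirun -np 4 lmp -in in.reaxc.hns",
    "/opt/mnt/lib/libmpitrace.so",
)
print(wrapped)
```

The application command itself is left untouched; only the environment it starts with changes, which is why the same approach can wrap an arbitrary MPI application.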
43 changes: 43 additions & 0 deletions docs/getting_started/metrics.md
@@ -251,6 +251,49 @@ Here are some useful resources for the benchmarks:
- [HPC Council](https://hpcadvisorycouncil.atlassian.net/wiki/spaces/HPCWORKS/pages/1284538459/OSU+Benchmark+Tuning+for+2nd+Gen+AMD+EPYC+using+HDR+InfiniBand+over+HPC-X+MPI)
- [AWS Tutorials](https://www.hpcworkshops.com/08-efa/04-complie-run-osu.html)

### app-custom

A custom application can support any application to be used as a metric app. For the following parameters, "command" and "container" are required.

| Name | Description | Option Key | Type | Default |
|-----|-------------|------------|------|---------|
| command | The full mpirun command | options->command |string | unset |
| workdir | The working directory for the command | options->workdir | string | unset |
| soleTenancy | require each pod to have sole tenancy | command->soleTenancy | string | "false" |

As an example, here is running mpitrace (an addon) with a custom container.

```yaml
apiVersion: flux-framework.org/v1alpha2
kind: MetricSet
metadata:
labels:
app.kubernetes.io/name: metricset
app.kubernetes.io/instance: metricset-sample
name: metricset-sample
spec:
# Number of pods for lammps (one launcher, the rest workers)
pods: 4
metrics:
- name: app-custom
image: ghcr.io/converged-computing/<your-container>
options:
command: mpirun --hostfile ./hostlist.txt -mca orte_keep_fqdn_hostnames t -np 4 --map-by socket <app> <options>
workdir: <workdir>

# Add on hpctoolkit, will mount a volume and wrap lammps
addons:
- name: perf-mpitrace
options:
mount: /opt/mnt
image: ghcr.io/converged-computing/metric-mpitrace:ubuntu-jammy
workdir: <workdir>
# this is the target of the replicated job "l" means launcher
target: l
# This is the target container, with full name "launcher"
containerTarget: launcher
```
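
As a rough illustration of the option rules in the table above, the following Python sketch checks a metric entry before it is written into a MetricSet. It is not the operator's own validation logic: the function name `resolve_custom_metric` is hypothetical, and it assumes that the required "container" corresponds to the metric's `image` field.

```python
def resolve_custom_metric(metric: dict) -> dict:
    """Check required fields of an app-custom metric and fill documented defaults."""
    # Assumption: the required "container" is given through the `image` field
    if not metric.get("image"):
        raise ValueError("app-custom requires a container image")

    options = dict(metric.get("options") or {})
    if not options.get("command"):
        raise ValueError("app-custom requires options.command")

    # soleTenancy is documented as a string that defaults to "false";
    # workdir has no default and stays unset when it is not provided
    options.setdefault("soleTenancy", "false")
    return {**metric, "options": options}


metric = {
    "name": "app-custom",
    "image": "ghcr.io/converged-computing/<your-container>",
    "options": {"command": "mpirun -np 4 <app> <options>"},
}
resolved = resolve_custom_metric(metric)
print(resolved["options"]["soleTenancy"])
```

The placeholders `<your-container>`, `<app>`, and `<options>` are kept as in the example above and need to be replaced with real values.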
### app-lammps
- *[app-lammps](https://github.com/converged-computing/metrics-operator/tree/main/examples/tests/app-lammps)*
1 change: 0 additions & 1 deletion examples/addons/mpitrace-lammps/metrics-rocky.yaml
@@ -22,7 +22,6 @@ spec:
command: /opt/intel/mpi/2021.8.0/bin/mpirun --hostfile ./hostlist.txt -np 4 --map-by socket lmp -v x 2 -v y 2 -v z 2 -in in.reaxc.hns -nocite
workdir: /opt/lammps/examples/reaxff/HNS

# Add on hpctoolkit, will mount a volume and wrap lammps
addons:
- name: perf-mpitrace
options: