Describe the bug
When I execute a task through this Kubernetes glue agent, I always receive the error below. If I use the pip agent on the same machine, it works fine:
[notice] To update, run: python3.10 -m pip install --upgrade pip
/root/start_agent.sh: line 14: 684 Segmentation fault (core dumped) $LOCAL_PYTHON -m clearml_agent execute --full-monitoring --require-queue --id 5c06200ada0f4aa2bfe9a16e0ebee84b
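In case it helps, the segfault might be reproducible by running the same invocation manually inside the task image (a hypothetical repro sketch, not something from the agent docs; it assumes clearml-agent is installed in the image and that the usual CLEARML_API_* environment variables/clearml.conf are in place, and it reuses the image and task ID from this report):

kubectl run clearml-debug --rm -it --image=clearml/fractional-gpu:u22-cu11.7-12gb -- /bin/bash
# inside the container, the exact invocation from start_agent.sh:
python3 -m clearml_agent execute --full-monitoring --require-queue --id 5c06200ada0f4aa2bfe9a16e0ebee84b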
What's your helm version?
v3.16.3
What's your kubectl version?
v1.27.16+rke2r2
What's the chart version?
5.2.2
Enter the changed values of values.yaml?
# -- Global parameters section
global:
  # -- Images registry
  imageRegistry: "docker.io"

# -- Private image registry configuration
imageCredentials:
  # -- Use private authentication mode
  enabled: false
  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingSecret: ""
  # -- Registry name
  registry: docker.io
  # -- Registry username
  username: someone
  # -- Registry password
  password: pwd
  # -- Email
  email: "[email protected]"

# -- ClearML generic configurations
clearml:
  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingAgentk8sglueSecret: ""
  # -- Agent k8s Glue basic auth key
  agentk8sglueKey: "xxx"
  # -- Agent k8s Glue basic auth secret
  agentk8sglueSecret: "xxx"

  # -- If this is set, chart will not generate a secret but will use what is defined here
  existingClearmlConfigSecret: ""
  # The secret should be defined as in the following example:
  #
  # apiVersion: v1
  # kind: Secret
  # metadata:
  #   name: secret-name
  # stringData:
  #   clearml.conf: |-
  #     sdk {
  #     }

  # -- ClearML configuration file
  clearmlConfig: |-
    sdk {
    }

# -- This agent will spawn queued experiments in new pods; a good use case is to combine this with
# GPU autoscaling nodes.
# https://github.com/allegroai/clearml-agent/tree/master/docker/k8s-glue
agentk8sglue:
  # -- Glue Agent image configuration
  image:
    registry: ""
    repository: "allegroai/clearml-agent-k8s-base"
    tag: "1.24-21"

  # -- Glue Agent number of pods
  replicaCount: 1

  # -- Glue Agent pod resources
  resources: {}

  # -- Glue Agent pod initContainers configs
  initContainers:
    # -- Glue Agent initContainers pod resources
    resources: {}

  # -- if set, don't create a serviceAccountName but use the defined existing one
  serviceExistingAccountName: ""

  # -- Check certificate validity for every UrlReference below.
  clearmlcheckCertificate: true

  # -- Reference to Api server url
  apiServerUrlReference: "https://api-clearml.xxx.com"
  # -- Reference to File server url
  fileServerUrlReference: "https://files-clearml.xxx.com"
  # -- Reference to Web server url
  webServerUrlReference: "https://app-clearml.xxx.com"

  # -- default container image for ClearML Task pod
  defaultContainerImage: clearml/fractional-gpu:u22-cu11.7-12gb

  # -- ClearML queue this agent will consume. Multiple queues can be specified with the following format: queue1,queue2,queue3
  queue: kollama
  # -- if the ClearML queue does not exist, it will be created if this value is set to true
  createQueueIfNotExists: false

  # -- labels setup for Agent pod (example in values.yaml comments)
  labels: {}
  # schedulerName: scheduler

  # -- annotations setup for Agent pod (example in values.yaml comments)
  annotations: {}
  # key1: value1

  # -- Extra Environment variables for Glue Agent
  extraEnvs: []
  # - name: PYTHONPATH
  #   value: "somepath"

  # -- container securityContext setup for Agent pod (example in values.yaml comments)
  podSecurityContext: {}
  # runAsUser: 1001
  # fsGroup: 1001

  # -- container securityContext setup for Agent pod (example in values.yaml comments)
  containerSecurityContext: {}
  # runAsUser: 1001
  # fsGroup: 1001

  # -- additional existing ClusterRoleBindings
  additionalClusterRoleBindings: []
  # - privileged

  # -- additional existing RoleBindings
  additionalRoleBindings: []
  # - privileged

  # -- nodeSelector setup for Agent pod (example in values.yaml comments)
  nodeSelector: {}
  # fleet: agent-nodes

  # -- tolerations setup for Agent pod (example in values.yaml comments)
  tolerations: []

  # -- affinity setup for Agent pod (example in values.yaml comments)
  affinity: {}

  # -- volumes definition for Glue Agent (example in values.yaml comments)
  volumes: []
  # - name: "yourvolume"
  #   nfs:
  #     server: 192.168.0.1
  #     path: /var/nfs/mount

  # -- volume mounts definition for Glue Agent (example in values.yaml comments)
  volumeMounts: []
  # - name: yourvolume
  #   mountPath: /yourpath
  #   subPath: userfolder

  # -- file definition for Glue Agent (example in values.yaml comments)
  fileMounts: []
  # - name: "integration.py"
  #   folderPath: "/mnt/python"
  #   fileContent: |-
  #     def get_template(*args, **kwargs):
  #         print("args: {}".format(args))
  #         print("kwargs: {}".format(kwargs))
  #         return {
  #             "template": {
  #             }
  #         }

  # -- base template for pods spawned to consume ClearML Task
  basePodTemplate:
    # -- labels setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    labels: {}
    # schedulerName: scheduler

    # -- annotations setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    annotations: {}
    # key1: value1

    # -- initContainers definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    hostPID: true
    # initContainers:
    # - name: train-container
    #   image: clearml/fractional-gpu:u22-cu11.7-12gb
    #   command: ['python3', '-c', 'print(f"Free GPU Memory: (free, global) {torch.cuda.mem_get_info()}")']
    # initContainers: []
    # - name: volume-dirs-init-cntr
    #   image: busybox:1.35
    #   command:
    #     - /bin/bash
    #     - -c
    #     - >
    #       /bin/echo "this is an init";

    # -- schedulerName setup for pods spawned to consume ClearML Task
    schedulerName: ""

    # -- volumes definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    volumes: []
    # - name: "yourvolume"
    #   nfs:
    #     server: 192.168.0.1
    #     path: /var/nfs/mount

    # -- volume mounts definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    volumeMounts: []
    # - name: yourvolume
    #   mountPath: /yourpath
    #   subPath: userfolder

    # -- file definition for pods spawned to consume ClearML Task (example in values.yaml comments)
    fileMounts: []
    # - name: "mounted-file.txt"
    #   folderPath: "/mnt/"
    #   fileContent: |-
    #     this is a test file
    #     with test content

    # -- environment variables for pods spawned to consume ClearML Task (example in values.yaml comments)
    env: []
    # # to set up access to a private repo, set up a secret with git credentials:
    # - name: CLEARML_AGENT_GIT_USER
    #   value: mygitusername
    # - name: CLEARML_AGENT_GIT_PASS
    #   valueFrom:
    #     secretKeyRef:
    #       name: git-password
    #       key: git-password
    # - name: CURL_CA_BUNDLE
    #   value: ""
    # - name: PYTHONWARNINGS
    #   value: "ignore:Unverified HTTPS request"

    # -- resources declaration for pods spawned to consume ClearML Task (example in values.yaml comments)
    resources:
      limits:
        nvidia.com/gpu: 1

    # -- priorityClassName setup for pods spawned to consume ClearML Task
    priorityClassName: ""

    # -- nodeSelector setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    nodeSelector:
      nvidia.com/gpu.product: NVIDIA-L40S-SHARED
    # fleet: gpu-nodes

    # -- tolerations setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    tolerations:
      - key: "nvidia.com/gpu"
        operator: Exists
        effect: "NoSchedule"

    # -- affinity setup for pods spawned to consume ClearML Task
    affinity: {}

    # -- securityContext setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    podSecurityContext: {}
    # runAsUser: 1001
    # fsGroup: 1001

    # -- securityContext setup for containers spawned to consume ClearML Task (example in values.yaml comments)
    containerSecurityContext: {}
    # runAsUser: 1001
    # fsGroup: 1001

    # -- hostAliases setup for pods spawned to consume ClearML Task (example in values.yaml comments)
    hostAliases: []
    # - ip: "127.0.0.1"
    #   hostnames:
    #     - "foo.local"
    #     - "bar.local"

  # -- Sessions internal service configuration
  sessions:
    # -- Enable/Disable sessions portmode. WARNING: only one Agent deployment can have this set to true
    portModeEnabled: false
    # -- specific annotations for session services
    svcAnnotations: {}
    # -- service type ("NodePort" or "ClusterIP" or "LoadBalancer")
    svcType: "NodePort"
    # -- External IP sessions clients can connect to
    externalIP: 0.0.0.0
    # -- starting range of exposed NodePorts
    startingPort: 30000
    # -- maximum number of NodePorts exposed
    maxServices: 20
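For reference, a minimal sketch of how the chart would typically be installed with these overrides (the release name, namespace, and helm repo URL below are assumptions, not taken from this report; the chart version matches the one above):

helm repo add clearml https://clearml.github.io/clearml-helm-charts
helm repo update
helm upgrade --install clearml-agent clearml/clearml-agent --version 5.2.2 -f values.yaml --namespace clearml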