Drones may be "deleted" if service is restarted #171

Open
olifre opened this issue Mar 18, 2021 · 0 comments

olifre commented Mar 18, 2021

I don't have a full understanding of what actually happens in this case, but I found this strange behaviour:

2021-03-18 23:20:23 [INFO][cobald.runtime.tardis.resources.dronestates]: Drone {'site_name': 'uni-bonn', 'machine_type': 'atlas_singlecore', 'machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1024, 'Disk': 1048576}, 'remote_resource_uuid': '502898', 'created': datetime.datetime(2021, 3, 18, 22, 38, 16, 486107), 'updated': datetime.datetime(2021, 3, 18, 23, 19, 23, 671984), 'drone_uuid': 'uni-bonn-3ca24d036b', 'resource_status': <ResourceStatus.Running: 2>} in DrainingState
2021-03-18 23:21:23 [INFO][cobald.runtime.tardis.resources.dronestates]: Drone {'site_name': 'uni-bonn', 'machine_type': 'atlas_singlecore', 'machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1024, 'Disk': 1048576}, 'remote_resource_uuid': '502898', 'created': datetime.datetime(2021, 3, 18, 22, 38, 16, 486107), 'updated': datetime.datetime(2021, 3, 18, 23, 19, 23, 671984), 'drone_uuid': 'uni-bonn-3ca24d036b', 'resource_status': <ResourceStatus.Running: 2>} in DrainingState
2021-03-18 23:22:23 [INFO][cobald.runtime.tardis.resources.dronestates]: Drone {'site_name': 'uni-bonn', 'machine_type': 'atlas_singlecore', 'machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1024, 'Disk': 1048576}, 'remote_resource_uuid': '502898', 'created': datetime.datetime(2021, 3, 18, 22, 38, 16, 486107), 'updated': datetime.datetime(2021, 3, 18, 23, 19, 23, 671984), 'drone_uuid': 'uni-bonn-3ca24d036b', 'resource_status': <ResourceStatus.Running: 2>} in DrainingState
2021-03-18 23:44:13 [INFO][cobald.runtime.tardis.resources.dronestates]: Drone {'site_name': 'uni-bonn', 'machine_type': 'atlas_singlecore', 'machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1024, 'Disk': 1048576}, 'remote_resource_uuid': '502898', 'created': datetime.datetime(2021, 3, 18, 22, 38, 16, 486107), 'updated': datetime.datetime(2021, 3, 18, 23, 19, 23, 671984), 'drone_uuid': 'uni-bonn-3ca24d036b'} in DrainingState
2021-03-18 23:45:18 [INFO][cobald.runtime.tardis.resources.dronestates]: Drone {'site_name': 'uni-bonn', 'machine_type': 'atlas_singlecore', 'machine_meta_data_translation_mapping': {'Cores': 1, 'Memory': 1024, 'Disk': 1048576}, 'remote_resource_uuid': '502898', 'created': datetime.datetime(2021, 3, 18, 22, 38, 16, 486107), 'updated': datetime.datetime(2021, 3, 18, 23, 44, 16, 687803), 'drone_uuid': 'uni-bonn-3ca24d036b', 'resource_status': <ResourceStatus.Deleted: 4>} in DownState

The interesting part is that the drone job has been actively removed:

RemoveReason = "via condor_rm (by user cobald)"

The time gap in the logs is explained by a crash due to #168 and subsequent service restart by Puppet.
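
To spot this pattern in a longer log, a small script can flag both the time gap and the entries without `resource_status`. The following is a minimal sketch, not part of COBalD/TARDIS: the function name `scan`, the five-minute threshold, and the regular expression are assumptions based only on the line format shown above.

```python
import re
from datetime import datetime, timedelta

# Matches lines of the form shown above:
# "<timestamp> [LEVEL][logger]: Drone {...} in <State>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[\w+\]\[[^\]]+\]: "
    r"Drone (\{.*\}) in (\w+)$"
)


def scan(lines, max_gap=timedelta(minutes=5)):
    """Return (kind, timestamp) findings for gaps and missing statuses.

    max_gap is an assumed threshold; the log above shows one entry per minute.
    """
    findings = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a drone state line
        stamp, drone_repr, _state = match.groups()
        current = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        if previous is not None and current - previous > max_gap:
            findings.append(("gap", stamp))
        if "'resource_status'" not in drone_repr:
            findings.append(("no_status", stamp))
        previous = current
    return findings
```

Applied to the excerpt above, this would report the entry at 23:44:13 twice: once for the gap and once for the missing status.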

So it seems that:

  1. A drone is in DrainingState first, ResourceStatus.Running.
  2. The COBalD/TARDIS service crashes (or is otherwise restarted).
  3. Afterwards, the drone is restored to DrainingState, but the ResourceStatus is not yet defined or re-checked (note that it is missing from the 23:44:13 log line).
  4. At that point the resource appears to be deleted (i.e. via condor_rm) even though it is not fully drained yet.
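
One way to picture a safer handling of step 3 is the sketch below. It is purely illustrative and does not reflect the actual TARDIS state machine: the function `draining_step` and its return values are hypothetical, and the enum lists only the two status values visible in the log above.

```python
from enum import Enum
from typing import Optional


class ResourceStatus(Enum):
    # Only the two values that appear in the log above; others are omitted.
    Running = 2
    Deleted = 4


def draining_step(resource_status: Optional[ResourceStatus]) -> str:
    """Decide what to do with a drone in DrainingState (hypothetical logic)."""
    if resource_status is None:
        # Status unknown, e.g. right after a restart: query the batch
        # system first instead of treating the drone as finished.
        return "refresh_status"
    if resource_status is ResourceStatus.Running:
        return "keep_draining"
    # Resource is gone according to the batch system.
    return "go_down"
```

The point of the sketch is only that an undefined status is treated as "unknown" rather than as "not running", so no removal is triggered before the status has been re-checked.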

I can't tell from the code: how are drone state changes handled while the ResourceStatus is not yet defined after startup?
Do you understand what is happening?

(pinging also @wiene )
