
dest snapshot wait for readyToUse=true #1520

Open
wants to merge 2 commits into main
Conversation

tesshuflower
Contributor

Describe what this PR does

When reconciling the snapshot on the destination side, wait for status.ReadyToUse to get set to true.
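For context, readyToUse is the field in the standard snapshot.storage.k8s.io/v1 VolumeSnapshot status that signals the snapshot can be restored from. A destination snapshot that has reached that state would look roughly like the sketch below (names, namespace, and sizes are placeholders; only the status field matters here):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: volsync-dst-example          # placeholder name
  namespace: apps                    # placeholder namespace
spec:
  source:
    persistentVolumeClaimName: volsync-dst-example-pvc   # placeholder source PVC
status:
  boundVolumeSnapshotContentName: snapcontent-example    # placeholder
  readyToUse: true                   # the condition the reconcile now waits for
  restoreSize: 1Gi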

Is there anything that requires special attention?

Related issues:
#1504

Contributor

openshift-ci bot commented Jan 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tesshuflower

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


codecov bot commented Jan 14, 2025

Codecov Report

Attention: Patch coverage is 44.44444% with 5 lines in your changes missing coverage. Please review.

Project coverage is 66.2%. Comparing base (55aef2f) to head (1c28f25).
Report is 4 commits behind head on main.

Files with missing lines                   | Patch % | Lines
controllers/volumehandler/volumehandler.go | 44.4%   | 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #1520     +/-   ##
=======================================
- Coverage   66.2%   66.2%   -0.1%     
=======================================
  Files         57      57             
  Lines       7474    7484     +10     
=======================================
+ Hits        4950    4955      +5     
- Misses      2236    2240      +4     
- Partials     288     289      +1     
Files with missing lines                   | Coverage Δ
controllers/volumehandler/volumehandler.go | 57.3% <44.4%> (-0.2%) ⬇️

@bornav

bornav commented Jan 15, 2025

Manually built the image and overrode the helm chart values to use it:

 values:
    image:
      repository: harbor.icylair.com/library/volsync_1504
      tag: v0.0.1
    manageCRDs: true
    metrics:
      disableAuth: true
    replicaCount: 1
    restic:
      repository: harbor.icylair.com/library/volsync_1504
      tag: v0.0.1
    serviceAccount:
      create: true

and the issue still persists for me

Of note: I did not use the Makefile to build it, but did the docker-buildx steps from the Makefile manually, removing --platform from the first line and removing the -tags "${tags_arg}" from the go build, since with those included docker buildx did not compile.
Hope this does not affect the result.

@tesshuflower
Contributor Author

tesshuflower commented Jan 16, 2025

Manually built the image and overrode the helm chart values to use it:

 values:
    image:
      repository: harbor.icylair.com/library/volsync_1504
      tag: v0.0.1
    manageCRDs: true
    metrics:
      disableAuth: true
    replicaCount: 1
    restic:
      repository: harbor.icylair.com/library/volsync_1504
      tag: v0.0.1
    serviceAccount:
      create: true

and the issue still persists for me

Of note: I did not use the Makefile to build it, but did the docker-buildx steps from the Makefile manually, removing --platform from the first line and removing the -tags "${tags_arg}" from the go build, since with those included docker buildx did not compile. Hope this does not affect the result.

That's unfortunate - and might possibly point to some issue with the storage provider? I've been assuming the issue is around deleting the PVC the volumesnapshot was taken from, and then trying to use the volumesnapshot (create a PVC from this snapshot) and somehow it doesn't work.

However, there are still a couple of things:

  1. I haven't actually seen the errors - am I correct that the error itself is that a PVC cannot be created from the volumesnapshot after the PVC the volumesnapshot was taken from is deleted? If this is really the case, it should be possible to recreate without Volsync.
  2. I'm not sure what to say about how you built the image - normally I would use make docker-build to build a local image... I'm not sure what you could have done wrong though - if you do use the Makefile to build, it injects the BUILD_VERSION, and then you can see the commit it was compiled with (which volsync will print out in the logs at startup, something like Operator Version: vX.Y.Z-<shortcommitid>). Right now on my fork for this PR the commit should be: 18f35de3. Let me know if you really aren't sure and I could add some debug logging or something (assuming you're willing to try again) to help confirm you're using the modified code.

@bornav

bornav commented Jan 16, 2025

@tesshuflower
Redeployed and checked the logs of volsync, and there seem to be errors in there. Also, for any additional changes it's no problem to rebuild it and test again.

As for why I did not use make docker-build: I need to cross-compile it for both amd and arm, and make docker-buildx was not running the actual command but was complaining about a missing go binary, so I manually did the steps specified in the Makefile.

temp.json

Signed-off-by: Tesshu Flower <[email protected]>
@tesshuflower
Contributor Author

@tesshuflower Redeployed and checked the logs of volsync, and there seem to be errors in there. Also, for any additional changes it's no problem to rebuild it and test again.

As for why I did not use make docker-build: I need to cross-compile it for both amd and arm, and make docker-buildx was not running the actual command but was complaining about a missing go binary, so I manually did the steps specified in the Makefile.

temp.json

Thanks for that - I don't necessarily see anything wrong in the logs. I pushed another commit to my branch with some debug logging - would it be possible to try again with the latest in my branch?

Additionally - when you do hit the error - I've been assuming the issue is with the PVC from the snapshot itself? (i.e. the volumepopulator pvc). This PVC is really just created from the snapshot in the replicationdestination latestImage. Are you able to create PVCs from that snapshot yourself to see if they have the same issue? The assumption so far is that we're trying to use the volumesnapshot before it's really ready (and that it doesn't like us removing the PVC the volumesnapshot was based on). If the volumesnapshot really is ready, you should be able to create a PVC from the snapshot even if the original PVC no longer exists.
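If it helps, the manual check could be a PVC like the sketch below - the snapshot name is just a placeholder (use whatever is reported in the ReplicationDestination's status.latestImage), and the namespace and storage class should match your setup:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: manual-restore-check              # placeholder name for the test PVC
  namespace: apps                         # placeholder; use your namespace
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn-storage-im   # placeholder; use your storage class
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: volsync-dest-snapshot-example   # placeholder: the snapshot named in status.latestImage
  resources:
    requests:
      storage: 1Gi

If that PVC binds and has the expected data, the snapshot itself is likely fine and the problem is more likely in how/when volsync consumes it.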

@bornav

bornav commented Jan 16, 2025

I'll test it in a bit, but first, what I forgot to mention in my last comment is that the behavior has changed a bit now: the vs-prime no longer exists and it releases the PVC; however, when looking at the PVC data it is 0B, so nothing was populated, and the PVC is still stuck in maintenance mode.

@bornav

bornav commented Jan 16, 2025

I think I got it - I can't replicate the issue anymore, but it seems I found a second bug.
So with it properly configured I tried syncing 20 times, and all passed. Now, for the previous comment:

if the snapshot size is 0, it seems it is not unlocking the PVC - is this intentional? (happens once every 10 runs)

@tajinder400 could you try it with your setup and post the result?
The x86 image is now here: harbor.icylair.com/library/volsync_1504:v0.0.2

@tajinder400

@bornav Sorry to say I don't get consistent results with this. I've been testing by destroying and then deploying 3 different services at the same time. I've found that 2/3 times it'll create volsync-src, which has data in it, but the PVC which attaches directly to the container is completely empty.

This is me just using your image exactly how I might normally use it: bulk redeployments.

@bornav

bornav commented Jan 16, 2025

@bornav Sorry to say I don't get consistent results with this. I've been testing by destroying and then deploying 3 different services at the same time. I've found that 2/3 times it'll create volsync-src, which has data in it, but the PVC which attaches directly to the container is completely empty.

This is me just using your image exactly how I might normally use it: bulk redeployments.

But is this the same behavior as before, or is this different from what happens on the official release version?

Here is the manifest that, with a while-true loop for over 1k iterations, did not fail once:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: test
  name: test
  namespace: apps
  annotations:
    numberOfReplicas: '1'
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: longhorn-storage-im
  dataSourceRef:
    kind: ReplicationDestination
    apiGroup: volsync.backube
    name: test-restore  # Name of the ReplicationDestination
  resources:
    requests:
      storage: 1Gi
status: {}
---
# yaml-language-server: $schema=https://kubernetes-schemas.pages.dev/volsync.backube/replicationdestination_v1alpha1.json
apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: test-restore
  namespace: apps
spec:
  trigger:
    manual: restore-once
  restic:
    repository: restic-test
    storageClassName: longhorn-storage-im
    accessModes: ["ReadWriteOnce"]
    copyMethod: Snapshot
    capacity: 1Gi
    enableFileDeletion: false
    cleanupCachePVC: true
    cleanupTempPVC: true

@tajinder400

tajinder400 commented Jan 16, 2025

@bornav Sorry to say I don't get consistent results with this. I've been testing by destroying and then deploying 3 different services at the same time. I've found that 2/3 times it'll create volsync-src, which has data in it, but the PVC which attaches directly to the container is completely empty.
This is me just using your image exactly how I might normally use it: bulk redeployments.

But is this the same behavior as before, or is this different from what happens on the official release version?

Here is the manifest that, with a while-true loop for over 1k iterations, did not fail once:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: test
  name: test
  namespace: apps
  annotations:
    numberOfReplicas: '1'
spec:
  accessModes: ["ReadWriteMany"]
  storageClassName: longhorn-storage-im
  dataSourceRef:
    kind: ReplicationDestination
    apiGroup: volsync.backube
    name: test-restore  # Name of the ReplicationDestination
  resources:
    requests:
      storage: 1Gi
status: {}
---
# yaml-language-server: $schema=https://kubernetes-schemas.pages.dev/volsync.backube/replicationdestination_v1alpha1.json
apiVersion: volsync.backube/v1alpha1
kind: ReplicationDestination
metadata:
  name: test-restore
  namespace: apps
spec:
  trigger:
    manual: restore-once
  restic:
    repository: restic-test
    storageClassName: longhorn-storage-im
    accessModes: ["ReadWriteOnce"]
    copyMethod: Snapshot
    capacity: 1Gi
    enableFileDeletion: false
    cleanupCachePVC: true
    cleanupTempPVC: true

Yes, it's the same behavior as with the normal release. The only difference between your options and mine is that the enableFileDeletion option is enabled in mine.

If there's specific log output that would be helpful here, let me know.

@bornav

bornav commented Jan 17, 2025

One thing I'm interested in: you mention they are completely empty, so the PVCs are created and released, and pods spin up with them attached?
For me it was usually that everything got cleaned up, but the PVC can't attach; in the UI, under the PVC, there's a big red circle saying "in maintenance mode".

@tajinder400

One thing I'm interested in: you mention they are completely empty, so the PVCs are created and released, and pods spin up with them attached? For me it was usually that everything got cleaned up, but the PVC can't attach; in the UI, under the PVC, there's a big red circle saying "in maintenance mode".

Firstly, I'll ask which UI you're talking about, since I think we're looking at different things, which makes it a bit difficult to align.

Secondly, you're mostly right: from my perspective volsync-src is empty, and the services which are trying to start, waiting for their PVC, are permanently in a ContainerCreating or Pending state.

I've been monitoring this directly in the Longhorn web UI. If you're looking elsewhere, I can also re-test and check there to give results in the same format you're getting.

@bornav

bornav commented Jan 17, 2025

By UI I meant the Longhorn UI, and yes, this is what I meant: they are stuck in Pending/ContainerCreating, which correlates with the second "bug" that started happening:

if the snapshot size is 0, it seems it is not unlocking the PVC - is this intentional? (happens once every 10 runs)

But TL;DR: it seems it is still broken.
