Synchronization of audio and video dimensions before calculating the contrastive loss #2
Hi,

During contrastive pre-training (stage I), the time dimension is average-pooled (and similarly for audio): Synchformer/configs/segment_avclip.yaml, Line 40 in 814f3ff.

So you end up with a sequence of per-segment features. At this stage, the input clips are in sync, so you can take a window of consecutive features, e.g. 8, and compute the dot product against every window of 8 in the other modality. This is shown in the GIFs in the README. If the highest dot-product response of the two windows starts at the same timestamp, the prediction is correct (in the zero-shot setting). Note that the granularity is 1 feature = segment size = 0.64 seconds.
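The window-matching procedure described above can be sketched as follows. This is an illustrative NumPy implementation, not the repository's actual code: the function name, shapes, and the assumption that segment features are L2-normalized are all mine.

```python
import numpy as np

def best_offset(vis, aud, win=8):
    """Zero-shot sync sketch: slide a window of `win` consecutive
    segment features from one modality over the other and return the
    pair of start indices with the highest dot-product response.

    vis, aud: (num_segments, dim) arrays of per-segment features
    (assumed L2-normalized). Prediction is "in sync" when the two
    returned start indices map to the same timestamp."""
    n_v = vis.shape[0] - win + 1
    n_a = aud.shape[0] - win + 1
    best_score, best_pair = -np.inf, (0, 0)
    for i in range(n_v):
        v = vis[i:i + win].ravel()          # flatten window to one vector
        for j in range(n_a):
            a = aud[j:j + win].ravel()
            score = float(v @ a)            # dot product between windows
            if score > best_score:
                best_score, best_pair = score, (i, j)
    return best_pair
```

With 0.64 s segments, a difference of `i - j` start indices corresponds to a temporal offset of `0.64 * (i - j)` seconds between the two streams.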
During stage II, we remove the
Thank you for the response! :) If I understand correctly, in the example.py file the classification model is used, so it flattens the segment dimension instead of averaging over time, right? Also, I just want to confirm: are the weights used to generate features at the end of the first training phase kept frozen during the second phase of training? Thanks!
Yes, the model is defined in the

Yes, see Line 7 in 814f3ff and Line 19 in 814f3ff.
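The two ways of collapsing the per-segment time axis discussed in this exchange can be illustrated with a short shape-only sketch. This is my own NumPy illustration (the shapes come from the question below, the variable names are made up), not code from the repository:

```python
import numpy as np

# Hypothetical visual features: (batch, segments, time, dim),
# matching the (1, 14, 8, 768) shape mentioned in the question.
feats = np.zeros((1, 14, 8, 768))

# Stage I (contrastive): average-pool the per-segment time axis,
# leaving one 768-d feature per 0.64 s segment.
stage1 = feats.mean(axis=2)             # shape (1, 14, 768)

# Stage II (classification head): instead of averaging, keep the
# tokens and flatten segments x time into one long sequence.
stage2 = feats.reshape(1, 14 * 8, 768)  # shape (1, 112, 768)
```

After the stage-I pooling, audio and video both become `(batch, segments, 768)`, so windows of segment features from the two modalities can be compared directly.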
Thank you!
Hi,
Could you please clarify how synchronization happens in terms of dimensionality between the video and audio representations? Specifically, I’m curious about how this works during training with the contrastive loss and during zero-shot evaluation.
From what I observed, the feature extractors output tensors of shape (1, 14, 8, 768) for video and (1, 14, 6, 768) for audio, and the projection layer seems to operate on the 768 dimension, not the time dimension. I couldn't fully understand how this is handled in the code.
Thank you!