
Synchronization of audio and video dimensions before calculating the contrastive loss #2

Open
roibenita opened this issue Oct 28, 2024 · 5 comments

@roibenita

Hi,

Could you please clarify how the video and audio representations are synchronized in terms of their dimensions? Specifically, I’m curious about how this works during training with the contrastive loss and during zero-shot evaluation.

From what I observed, the feature extractors output tensors of shape (1, 14, 8, 768) and (1, 14, 6, 768), and the projection layer seems to operate on the 768 (feature) dimension, not the time dimension. I couldn’t fully understand how this is handled in the code.

Thank you!

@v-iashin (Owner)

Hi,

During contrastive pre-training (stage I), the time dimension is average-pooled (similarly for audio):

agg_time_module: 'AveragePooling' # 'AveragePooling' or 'TransformerEncoderLayer'

So you end up with features (1, 14, 768) for audio and RGB.
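
For intuition, here is a minimal PyTorch sketch of what this pooling does to the shapes (tensor names are illustrative, not the repo's API; the shapes are the ones from this thread):

```python
import torch

# Dummy stage-I features: (batch, segments, per-segment time, dim)
rgb_feats = torch.randn(1, 14, 8, 768)    # visual stream
audio_feats = torch.randn(1, 14, 6, 768)  # audio stream

# 'AveragePooling' aggregates over the per-segment time axis (dim=2)
rgb_pooled = rgb_feats.mean(dim=2)        # -> (1, 14, 768)
audio_pooled = audio_feats.mean(dim=2)    # -> (1, 14, 768)
```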

At this stage, the input clips are in sync, so you could take a window of consecutive features (e.g. 8) in one modality and compute the dot product with every window of 8 in the other modality. This is shown in the GIFs in the README.

If the pair of windows with the highest dot-product response starts at the same timestamp, the prediction is correct (in the zero-shot setting). Note that the granularity is 1 feature = segment size = 0.64 seconds.
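
A rough sketch of that zero-shot matching, assuming time-pooled per-segment features of shape (num_segments, 768) and a window of 8 segments (the function and variable names are made up for illustration; this is not the repo's evaluation code):

```python
import torch
import torch.nn.functional as F

def best_window_pair(rgb, audio, win=8):
    """rgb, audio: (num_segments, dim) time-pooled features of one clip.
    Returns (rgb_start, audio_start) indices of the most similar windows."""
    rgb = F.normalize(rgb, dim=-1)
    audio = F.normalize(audio, dim=-1)
    # All windows of `win` consecutive segments, flattened: (num_windows, win*dim)
    rgb_w = rgb.unfold(0, win, 1).permute(0, 2, 1).reshape(rgb.shape[0] - win + 1, -1)
    aud_w = audio.unfold(0, win, 1).permute(0, 2, 1).reshape(audio.shape[0] - win + 1, -1)
    sims = rgb_w @ aud_w.T                 # dot products between all window pairs
    flat = sims.argmax().item()
    return divmod(flat, sims.shape[1])     # (rgb_start, audio_start)

rgb_start, audio_start = best_window_pair(torch.randn(14, 768), torch.randn(14, 768))
# The zero-shot prediction counts as correct when rgb_start == audio_start,
# at a granularity of one segment = 0.64 s.
```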

@v-iashin (Owner) commented Oct 28, 2024

During stage II, we remove the agg_time_module to keep the per-segment time dimensions (8 for RGB and 6 for audio). We can then flatten the segment dimension to get features with finer time granularity: (1, 14*8, 768) and (1, 14*6, 768). These are used to train the synchronization module as a classifier over offset classes (21, for the 0.2 s grid as in the paper, or more classes if you want higher sync precision).
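
A hedged sketch of that reshaping plus a placeholder classification head (the actual synchronization module in the repo is a transformer; the linear head below only illustrates the 21-class offset target):

```python
import torch
import torch.nn as nn

B, S, D = 1, 14, 768
rgb = torch.randn(B, S, 8, D)     # stage II: no agg_time_module, time dims kept
audio = torch.randn(B, S, 6, D)

# Flatten segments and per-segment time into one finer-grained time axis
rgb_flat = rgb.reshape(B, S * 8, D)      # (1, 112, 768)
audio_flat = audio.reshape(B, S * 6, D)  # (1, 84, 768)

# Placeholder sync head: classify the audio-visual offset into 21 classes
# (the 0.2 s grid from the paper); more classes would give finer precision.
num_offset_classes = 21
sync_head = nn.Linear(2 * D, num_offset_classes)
pooled = torch.cat([rgb_flat.mean(dim=1), audio_flat.mean(dim=1)], dim=-1)
offset_logits = sync_head(pooled)        # (1, 21)
```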

@roibenita (Author)

Thank you for the response! :)

If I understand correctly, in the example.py file, the classification model is used, so it flattens the segment dimension instead of averaging over time, right?

Also, I just want to confirm—are the weights used to generate features at the end of the first training phase kept constant during the second phase of training?

Thanks!

@v-iashin (Owner)

> If I understand correctly, in the example.py file, the classification model is used, so it flattens the segment dimension instead of averaging over time, right?

Yes, the model is defined by cfg.model in the config of the experiment you are providing; see agg_time_module in cfg-24-01-04T16-39-21.yaml.

> are the weights used to generate features at the end of the first training phase kept constant during the second phase of training?

Yes, see

is_trainable: False

in the config (it is set in two places, one for each feature extractor).
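
In plain PyTorch terms, is_trainable: False amounts to freezing the pre-trained feature extractors during stage II, roughly like this (a generic sketch, not the repo's exact code):

```python
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    # Stop gradient updates for the pre-trained extractor; only the
    # synchronization module is optimized in stage II.
    for p in module.parameters():
        p.requires_grad = False
    return module
```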

@roibenita (Author)

Thank you!
