
Synchronization of audio and video dimensions before calculating the contrastive loss #2

Open
roibenita opened this issue Oct 28, 2024 · 5 comments

@roibenita

Hi,

Could you please clarify how the video and audio representations are synchronized in terms of their dimensions? Specifically, I’m curious about how this works during training with the contrastive loss and during zero-shot evaluation.

From what I observed, the feature extractors output tensors of shape (1, 14, 8, 768) and (1, 14, 6, 768), and the projection layer seems to operate on the 768 (feature) dimension, not the time dimension. I couldn’t fully understand how this is handled in the code.

Thank you!

@v-iashin (Owner)

Hi,

During contrastive pre-training (stage I), the time dimension is average-pooled (similarly for audio):

agg_time_module: 'AveragePooling' # 'AveragePooling' or 'TransformerEncoderLayer'

So you end up with features (1, 14, 768) for audio and RGB.
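
For intuition, here is a minimal PyTorch sketch of what this pooling does to the shapes (tensor names are illustrative, not the repo's API; the shapes are the ones from this thread):

```python
import torch

# Dummy stage-I features: (batch, segments, per-segment time, dim)
rgb_feats = torch.randn(1, 14, 8, 768)    # visual stream
audio_feats = torch.randn(1, 14, 6, 768)  # audio stream

# 'AveragePooling' aggregates over the per-segment time axis (dim=2)
rgb_pooled = rgb_feats.mean(dim=2)        # -> (1, 14, 768)
audio_pooled = audio_feats.mean(dim=2)    # -> (1, 14, 768)
```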

At this stage, the input clips are in sync, so you could take a window of consecutive features (e.g. 8) in one modality and compute the dot product with every window of 8 in the other modality. This is shown in the GIFs in the README.

If the pair of windows with the highest dot-product response starts at the same timestamp, the prediction is correct (in the zero-shot setting). Note that the granularity is 1 feature = segment size = 0.64 seconds.
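
A rough sketch of that zero-shot matching, assuming time-pooled per-segment features of shape (num_segments, 768) and a window of 8 segments (the function and variable names are made up for illustration; this is not the repo's evaluation code):

```python
import torch
import torch.nn.functional as F

def best_window_pair(rgb, audio, win=8):
    """rgb, audio: (num_segments, dim) time-pooled features of one clip.
    Returns (rgb_start, audio_start) indices of the most similar windows."""
    rgb = F.normalize(rgb, dim=-1)
    audio = F.normalize(audio, dim=-1)
    # All windows of `win` consecutive segments, flattened: (num_windows, win*dim)
    rgb_w = rgb.unfold(0, win, 1).permute(0, 2, 1).reshape(rgb.shape[0] - win + 1, -1)
    aud_w = audio.unfold(0, win, 1).permute(0, 2, 1).reshape(audio.shape[0] - win + 1, -1)
    sims = rgb_w @ aud_w.T                 # dot products between all window pairs
    flat = sims.argmax().item()
    return divmod(flat, sims.shape[1])     # (rgb_start, audio_start)

rgb_start, audio_start = best_window_pair(torch.randn(14, 768), torch.randn(14, 768))
# The zero-shot prediction counts as correct when rgb_start == audio_start,
# at a granularity of one segment = 0.64 s.
```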

@v-iashin (Owner) commented Oct 28, 2024

During stage II, we remove the agg_time_module to keep the per-segment time dimensions (8 for RGB and 6 for audio). We can then flatten the segment dimension to get features with finer time granularity: (1, 14*8, 768) and (1, 14*6, 768). These are used to train the synchronization module as a classifier over offset classes (21, for the 0.2 s grid as in the paper, or more classes if you want higher sync precision).
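
A hedged sketch of that reshaping plus a placeholder classification head (the actual synchronization module in the repo is a transformer; the linear head below only illustrates the 21-class offset target):

```python
import torch
import torch.nn as nn

B, S, D = 1, 14, 768
rgb = torch.randn(B, S, 8, D)     # stage II: no agg_time_module, time dims kept
audio = torch.randn(B, S, 6, D)

# Flatten segments and per-segment time into one finer-grained time axis
rgb_flat = rgb.reshape(B, S * 8, D)      # (1, 112, 768)
audio_flat = audio.reshape(B, S * 6, D)  # (1, 84, 768)

# Placeholder sync head: classify the audio-visual offset into 21 classes
# (the 0.2 s grid from the paper); more classes would give finer precision.
num_offset_classes = 21
sync_head = nn.Linear(2 * D, num_offset_classes)
pooled = torch.cat([rgb_flat.mean(dim=1), audio_flat.mean(dim=1)], dim=-1)
offset_logits = sync_head(pooled)        # (1, 21)
```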

@roibenita (Author)

Thank you for the response! :)

If I understand correctly, in the example.py file, the classification model is used, so it flattens the segment dimension instead of averaging over time, right?

Also, I just want to confirm—are the weights used to generate features at the end of the first training phase kept constant during the second phase of training?

Thanks!

@v-iashin (Owner)

> If I understand correctly, in the example.py file, the classification model is used, so it flattens the segment dimension instead of averaging over time, right?

Yes, the model is defined by cfg.model in the config of the experiment you are providing; see agg_time_module in cfg-24-01-04T16-39-21.yaml.

> are the weights used to generate features at the end of the first training phase kept constant during the second phase of training?

Yes, see

is_trainable: False

in the config (it is set in two places, one for each feature extractor).
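
In plain PyTorch terms, is_trainable: False amounts to freezing the pre-trained feature extractors during stage II, roughly like this (a generic sketch, not the repo's exact code):

```python
import torch.nn as nn

def freeze(module: nn.Module) -> nn.Module:
    # Stop gradient updates for the pre-trained extractor; only the
    # synchronization module is optimized in stage II.
    for p in module.parameters():
        p.requires_grad = False
    return module
```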

@roibenita (Author)

Thank you!
