Improve block weighting with uniform and hat functions #147

lsorber · 2025-01-02T14:30:42Z

This PR makes the current uniform weighting scheme explicit, and adds an improved hat weighting scheme.

The rationale behind hat weighting is that predictions for tokens near the beginning or end of the block will be less accurate than predictions for tokens near the middle of the block, where the model has maximal context.

For instance, let's say we use stride=128 and block_size=256 and compare the predictions for the token with index 128:

With uniform weighting, its prediction will be 0.5 * first_block[128] + 0.5 * second_block[0].
With hat weighting, its prediction will (approximately) be 1 * first_block[128] + 1/256 * second_block[0].

In this example, hat weighting is preferable because the first token of the second block is likely to be much less accurate than the middle token of the first block.
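To make the comparison concrete, here is a minimal pure-Python sketch of the two schemes. It is not the code from this PR: the function names (uniform_weights, hat_weights, combine), the exact hat formula, and the normalization by the sum of weights are all assumptions, chosen only so that the weights come out near 1 in the middle of a block and 1/256 at its edges, as in the example above.

```python
def uniform_weights(block_size):
    """Every position in a block gets the same weight."""
    return [1.0] * block_size


def hat_weights(block_size):
    """Triangular weights: about 1 at the block centre, 1/block_size at the edges.

    Hypothetical formula, chosen to match the numbers in the example above.
    """
    if block_size == 1:
        return [1.0]
    weights = []
    for i in range(block_size):
        x = -1.0 + 2.0 * i / (block_size - 1)  # position mapped to [-1, 1]
        weights.append(1.0 - (1.0 - 1.0 / block_size) * abs(x))
    return weights


def combine(block_preds, starts, weights, length):
    """Weighted average of overlapping per-block predictions.

    Assumption: each token's result is normalized by the sum of the
    weights of all blocks that cover it.
    """
    total = [0.0] * length
    norm = [0.0] * length
    for preds, start in zip(block_preds, starts):
        for offset, (p, w) in enumerate(zip(preds, weights)):
            total[start + offset] += w * p
            norm[start + offset] += w
    return [t / n if n > 0 else 0.0 for t, n in zip(total, norm)]


block_size, stride = 256, 128
# Toy predictions: the first block says 1.0 everywhere, the second says 0.0,
# so the combined value at a token shows how much the first block dominates.
first_block = [1.0] * block_size
second_block = [0.0] * block_size
starts = [0, stride]
length = stride + block_size

uniform = combine([first_block, second_block], starts, uniform_weights(block_size), length)
hat = combine([first_block, second_block], starts, hat_weights(block_size), length)

print(uniform[128])  # both blocks contribute equally
print(hat[128])      # dominated by the first block
```

With these toy inputs, token 128 is the midpoint between the two blocks' predictions under uniform weighting, and lands very close to the first block's prediction under hat weighting.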

Anecdotally, I've also observed that hat weighting improves output quality on test data.

markus583 · 2025-01-12T15:38:06Z

Hi! Thanks a lot for implementing this. Interesting idea, cool stuff! It intuitively makes sense, but I'm unsure if it makes a practical difference. It would be interesting to test it on some benchmarks. For the time being, I'd be happy to add it as a feature and leave the default to uniform. Would you agree @bminixhofer?

improve block weighting

16fba36

markus583 requested review from markus583 and bminixhofer January 12, 2025 15:38
