PyTorch implementation of the Transformer architecture presented in the paper "Attention Is All You Need"
Original Paper : https://arxiv.org/abs/1706.03762
My Implementation : see Notebook
PyTorch implementation of the Vision Transformer (ViT) presented in the paper "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale"
Original Paper : https://arxiv.org/abs/2010.11929
My Implementation : see Notebook
Note: Both implementations use PyTorch's high-performance Scaled Dot Product Attention (SDPA). I tried to keep the code as clean as possible without sacrificing performance.
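As a minimal sketch of how SDPA is typically called inside an attention block (illustrative only, with made-up tensor shapes; not the exact code from the notebooks):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: (batch, num_heads, seq_len, head_dim)
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)

# SDPA computes softmax(Q K^T / sqrt(d)) V in one fused call,
# dispatching to optimized backends (e.g. FlashAttention) when available.
# is_causal=True applies the causal mask used in decoder self-attention;
# ViT-style encoder attention would omit it.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```

Requires PyTorch >= 2.0, where `scaled_dot_product_attention` was introduced.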