BERT (Bidirectional Encoder Representations from Transformers) is a new language representation model. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
A left-to-right architecture, where every token can only attend to previous tokens, can be very harmful when applying fine-tuning to token-level tasks, where it is crucial to incorporate context from both directions.
- Pre-Training - During pre-training, the model is trained on unlabeled data over different pre-training tasks.
- Fine-Tuning - The BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.
The model architecture is a multi-layer bidirectional Transformer encoder.
The encoder is composed of a stack of identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a position-wise fully connected feed-forward network. A residual connection is employed around each of the two sub-layers.
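As a rough illustration of this layer structure, here is a minimal PyTorch sketch of one encoder layer. The class name, feed-forward size, and the layer normalization placed after each residual connection follow the original Transformer convention and are assumptions for illustration, not a reproduction of BERT's actual code.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, hidden_size=768, num_heads=12, ff_size=3072, dropout=0.1):
        super().__init__()
        # Sub-layer 1: multi-head self-attention
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads,
                                               dropout=dropout, batch_first=True)
        # Sub-layer 2: position-wise fully connected feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, ff_size),
            nn.GELU(),
            nn.Linear(ff_size, hidden_size),
        )
        self.norm1 = nn.LayerNorm(hidden_size)
        self.norm2 = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Residual connection around each of the two sub-layers
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x

# Example: a batch of 2 sequences, 128 tokens each, hidden size 768
hidden = torch.randn(2, 128, 768)
print(EncoderLayer()(hidden).shape)  # torch.Size([2, 128, 768])
```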
An attention function maps a query and a set of key-value pairs to an output. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key.
The input consists of queries and keys of dimension d_k and values of dimension d_v. The dot products of the query with all the keys are computed and divided by √d_k, a softmax function is applied to obtain the weights, and these weights are then multiplied with the values to get the outputs.
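In symbols, Attention(Q, K, V) = softmax(QK^T / √d_k) V. The sketch below implements this scaled dot-product attention in NumPy; the sequence length and dimensions are assumptions chosen for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    # Compatibility of each query with each key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the keys gives the weights on the values
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Output is the weighted sum of the values
    return weights @ V

Q = np.random.randn(10, 64)   # 10 tokens, d_k = 64
K = np.random.randn(10, 64)
V = np.random.randn(10, 64)   # d_v = 64
print(scaled_dot_product_attention(Q, K, V).shape)  # (10, 64)
```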
The queries, keys and values are linearly projected h times with different learned linear projections to d_k, d_k and d_v dimensions respectively. The attention function is performed on each of these projected versions of the queries, keys and values in parallel. The resulting outputs are then concatenated and projected once again, giving the final values.
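Continuing the NumPy sketch above, multi-head attention runs the attention function on h separately projected versions of the input in parallel, concatenates the h outputs, and applies one final projection. The projection matrices below are random placeholders standing in for learned parameters.

```python
def multi_head_attention(X, W_q, W_k, W_v, W_o, h=8):
    """X: (seq_len, d_model); W_q/W_k/W_v: lists of h projection matrices; W_o: (h*d_v, d_model)."""
    heads = []
    for i in range(h):
        # Each head attends over its own projections of the queries, keys and values
        heads.append(scaled_dot_product_attention(X @ W_q[i], X @ W_k[i], X @ W_v[i]))
    # Concatenate the heads and apply the final output projection
    return np.concatenate(heads, axis=-1) @ W_o

d_model, d_k, h, seq_len = 512, 64, 8, 10
rng = np.random.default_rng(0)
W_q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_k = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_v = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_o = rng.standard_normal((h * d_k, d_model))
X = rng.standard_normal((seq_len, d_model))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, h).shape)  # (10, 512)
```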
- Number of Layers (i.e., Transformer Blocks) - L
- Hidden Size - H
- Attention Heads - A
BERT_base uses (L=12, H=768 and A=12) while BERT_large uses (L=24, H=1024 and A=16).
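For reference, the two configurations can be written out as plain dictionaries. The feed-forward size of 4*H is the convention BERT uses; it is not stated in this section.

```python
# The two model sizes described above (feed-forward size assumed to be 4*H).
BERT_BASE  = {"num_layers": 12, "hidden_size": 768,  "num_heads": 12, "ff_size": 4 * 768}
BERT_LARGE = {"num_layers": 24, "hidden_size": 1024, "num_heads": 16, "ff_size": 4 * 1024}
```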
The input representation is able to unambiguously represent both a single sentence and a pair of sentences in one token sequence using WordPiece embeddings.
The first token of every sequence is always a special classification token, [CLS]. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
Sentence pairs are separated by a special token, [SEP], and a learned embedding is added to every token indicating whether it belongs to sentence A or sentence B.
For a given token, its input representation is constructed by summing the corresponding token, segment and position embeddings.
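A minimal sketch of how a sentence pair might be packed and embedded, using randomly initialized toy embedding tables; the vocabulary size, token ids and shapes are assumptions for illustration.

```python
import numpy as np

vocab_size, max_len, hidden = 30522, 512, 768   # sizes assumed for illustration
rng = np.random.default_rng(0)
token_emb    = rng.standard_normal((vocab_size, hidden))
segment_emb  = rng.standard_normal((2, hidden))        # sentence A vs. sentence B
position_emb = rng.standard_normal((max_len, hidden))

# Pack the pair as: [CLS] tokens of sentence A [SEP] tokens of sentence B [SEP]
token_ids   = [101, 2023, 2003, 1037, 102, 2742, 102]  # illustrative WordPiece ids
segment_ids = [0,   0,    0,    0,    0,   1,    1]    # which sentence each token belongs to
positions   = list(range(len(token_ids)))

# Input representation = token embedding + segment embedding + position embedding
inputs = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(inputs.shape)  # (7, 768)
```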
Standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow the model to trivially predict the target word. Therefore, in order to train a deep bidirectional representation, some percentage of the input tokens are masked at random and those masked tokens are then predicted. In the experiments, 15% of the WordPiece tokens are masked at random.
Masking has a downside: it creates a mismatch between pre-training and fine-tuning, since the [MASK] token never appears during fine-tuning. To mitigate this, the chosen tokens are not always replaced with the actual [MASK] token: 80% of the time they are replaced with [MASK], 10% of the time with a random token, and 10% of the time they are left unchanged.
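A sketch of this masking procedure, assuming the input is already a list of WordPiece tokens; the 15% / 80% / 10% / 10% split follows the description above, and the helper name and vocabulary are illustrative.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mask_prob=0.15):
    """Return (masked_tokens, labels); labels hold the original token at masked positions, else None."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:        # choose ~15% of tokens as prediction targets
            labels.append(tok)
            r = random.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                masked.append(mask_token)
            elif r < 0.9:                      # 10%: replace with a random token
                masked.append(random.choice(vocab))
            else:                              # 10%: keep the original token
                masked.append(tok)
        else:
            labels.append(None)
            masked.append(tok)
    return masked, labels

tokens = ["my", "dog", "is", "hairy"]
print(mask_tokens(tokens, vocab=["the", "cat", "apple", "run"]))
```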
To capture relationships between sentences, BERT is also pre-trained on a binarized next sentence prediction task that can be trivially generated from any monolingual corpus. When choosing sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled IsNext), and 50% of the time it is a random sentence from the corpus (labeled NotNext).
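A sketch of how such IsNext / NotNext pairs could be generated, assuming the corpus is a list of documents and each document is a list of sentences; the function and example data are illustrative.

```python
import random

def make_nsp_example(document, corpus):
    """document: list of sentences; corpus: list of documents."""
    i = random.randrange(len(document) - 1)
    sentence_a = document[i]
    if random.random() < 0.5:
        # 50%: B is the actual next sentence
        return sentence_a, document[i + 1], "IsNext"
    # 50%: B is a random sentence drawn from a different document
    other = random.choice([d for d in corpus if d is not document])
    return sentence_a, random.choice(other), "NotNext"

doc1 = ["the man went to the store", "he bought a gallon of milk", "then he went home"]
doc2 = ["penguins are flightless birds", "they live in the southern hemisphere"]
print(make_nsp_example(doc1, [doc1, doc2]))
```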
The pre-training procedure largely follows the existing literature on language model pre-training.
It is critical that a document-level corpus is used rather than a shuffled sentence-level corpus, so that long contiguous sequences can be extracted.
Fine-Tuning is straightforward since the self-attention mechanism in the Transformer allows BERT to model many downstream tasks by swapping out the appropriate inputs and outputs.
For each task, the task specific inputs and outputs are plugged into BERT and the parameters are fine-tuned end-to-end.
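As an illustration of plugging in task-specific inputs and outputs, the sketch below adds a classification layer on top of the final hidden state of the [CLS] token. The encoder here is a stand-in (for example, the EncoderLayer sketched earlier), not the actual pre-trained BERT weights, and the class and argument names are assumptions.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = encoder                                # pre-trained BERT body (stand-in here)
        self.classifier = nn.Linear(hidden_size, num_labels)  # task-specific output layer

    def forward(self, embeddings):
        hidden_states = self.encoder(embeddings)   # (batch, seq_len, hidden)
        cls_state = hidden_states[:, 0]            # final hidden state of the [CLS] token
        return self.classifier(cls_state)          # logits for the downstream labels

# Example: reuse the EncoderLayer sketch above as a one-layer stand-in encoder
model = SequenceClassifier(EncoderLayer(), num_labels=2)
logits = model(torch.randn(2, 128, 768))
print(logits.shape)  # torch.Size([2, 2])
# All parameters (encoder + classifier) would then be fine-tuned end-to-end on labeled data.
```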