A Commit History of BERT and its Forks

I recently came across an interesting thread on Twitter discussing a hypothetical scenario where research papers are published on GitHub and subsequent papers are diffs over the original paper. Information overload has been a real problem in ML with so many new papers coming every month.

If you could represent a paper as a code diff, many papers could be compressed down to <50 lines :) The diff would also be more intuitive to read and eval standardized.

Some ideas are so different that this wouldn’t apply, but I think it would work well for the majority. https://t.co/JoAcIK9Cm7
— Denny Britz (@dennybritz) April 25, 2020

This post is a fun experiment showcasing how the commit history could look like for the BERT paper and some of its subsequent variants.

commit arXiv:1810.04805

Author: Devlin et al.

Date: Thu Oct 11 00:50:01 2018 +0000

Initial Commit: BERT

-Transformer Decoder

+Masked Language Modeling

+Next Sentence Prediction

+WordPiece 30K

commit arXiv:1901.07291

Author: Lample et al.

Date: Tue Jan 22 13:22:34 2019 +0000

Cross-lingual Language Model Pretraining

+Translation Language Modeling(TLM)

+Causal Language Modeling(CLM)

commit arXiv:1901.08746

Author: Lee et al.

Date: Fri Jan 25 05:57:24 2019 +0000

BioBERT: a pre-trained biomedical language representation model for biomedical text mining

+PubMed Abstracts data

+PubMed Central Full Texts data

commit arXiv:1901.11504

Author: Liu et al.

Date: Thu Jan 31 18:07:25 2019 +0000

Multi-Task Deep Neural Networks for Natural Language Understanding

+Multi-task Learning

commit arXiv:1903.10676

Author: Beltagy et al.

Date: Tue Mar 26 05:11:46 2019 +0000

SciBERT: A Pretrained Language Model for Scientific Text

-BERT WordPiece Vocabulary

-English Wikipedia

-BookCorpus

+1.14M Semantic Scholar Papers(Biomedial + Computer Science)

+ScispaCy segmentation

+SciVOCAB WordPiece Vocabulary

commit arXiv:1906.08237

Author: Yang et al.

Date: Wed Jun 19 17:35:48 2019 +0000

XLNet: Generalized Autoregressive Pretraining for Language Understanding

-Masked Language Modeling

-BERT Transformer

+Permutation Language Modeling

+Transformer-XL

+Two-stream self-attention

+SentencePiece Tokenizer

commit arXiv:1907.10529

Author: Joshi et al.

Date: Wed Jul 24 15:43:40 2019 +0000

SpanBERT: Improving Pre-training by Representing and Predicting Spans

-Random Token Masking

-Next Sentence Prediction

-Bi-sequence Training

+Continuous Span Masking

+Span-Boundary Objective(SBO)

+Single-Sequence Training

commit arXiv:1907.11692

Author: Liu et al.

Date: Fri Jul 26 17:48:29 2019 +0000

RoBERTa: A Robustly Optimized BERT Pretraining Approach

-Next Sentence Prediction

-Static Masking of Tokens

+Dynamic Masking of Tokens

+Byte Pair Encoding(BPE) 50K

+Large batch size

+CC-NEWS(76G) dataset

+OpenWebText(38G) dataset

+Stories(31G) dataset

commit arXiv:1908.10084

Author: Reimers et al.

Date: Tue Aug 27 08:50:17 2019 +0000

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

+Siamese Network Structure

+Finetuning on SNLI and MNLI

commit arXiv:1909.11942

Author: Lan et al.

Date: Thu Sep 26 07:06:13 2019 +0000

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

-Next Sentence Prediction

+Sentence Order Prediction

+Cross-layer Parameter Sharing

+Factorized Embeddings

+SentencePiece Tokenizer

commit arXiv:1910.01108

Author: Sanh et al.

Date: Wed Oct 2 17:56:28 2019 +0000

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

-Next Sentence Prediction

-Token-Type Embeddings

-[CLS] pooling

+Knowledge Distillation

+Cosine Embedding Loss

+Dynamic Masking

commit arXiv:1911.03894

Author: Martin et al.

Date: Sun Nov 10 10:46:37 2019 +0000

CamemBERT: a Tasty French Language Model

-BERT

-English

+ROBERTA

+French OSCAR dataset(138GB)

+Whole-word Masking(WWM)

+SentencePiece Tokenizer

commit arXiv:1912.05372

Author: Le et al.

Date: Wed Dec 11 14:59:32 2019 +0000

FlauBERT: Unsupervised Language Model Pre-training for French

-BERT

-English

+ROBERTA

+fastBPE

+Stochastic Depth

+French dataset(71GB)

+FLUE(French Language Understanding Evaluation) benchmark