Overview

Developed by Facebook AI (now Meta AI), RoBERTa (Robustly Optimized BERT Pretraining Approach) showed that BERT was significantly under-trained. By training for longer, on roughly ten times more data (~160GB of text versus BERT's ~16GB), and with much larger batches, RoBERTa achieved substantially better results on downstream benchmarks without changing BERT's architecture.

Key Improvements

  • Removed Next Sentence Prediction: Dropping the NSP objective matched or improved downstream performance, showing it was unnecessary.
  • Dynamic Masking: Generates a fresh masking pattern each time a sequence is fed to the model, rather than fixing the masks once during preprocessing as BERT's static masking does.
  • Larger Vocabulary: Uses a byte-level BPE vocabulary of about 50K tokens, compared with BERT's 30K WordPiece vocabulary.
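The dynamic-masking idea can be illustrated with a minimal sketch (the function name and token list here are illustrative, not from the RoBERTa codebase): instead of masking tokens once at preprocessing time, a new random subset is masked on every pass over the data.

```python
import random

MASK = "<mask>"

def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Return a copy of `tokens` with a fresh random ~15% masked.

    Called each time a sequence is seen (per epoch or per batch),
    so every pass uses a different mask pattern -- unlike static
    masking, where the pattern is fixed during preprocessing.
    """
    rng = rng or random.Random()
    out = list(tokens)
    n_mask = max(1, int(len(tokens) * mask_prob))
    for i in rng.sample(range(len(tokens)), n_mask):
        out[i] = MASK
    return out

tokens = ["the", "quick", "brown", "fox", "jumps",
          "over", "the", "lazy", "dog"]
# Two epochs: each call draws its own mask positions.
epoch1 = dynamic_mask(tokens, rng=random.Random(1))
epoch2 = dynamic_mask(tokens, rng=random.Random(2))
```

Because the model rarely sees the same sequence with the same masks twice, dynamic masking acts as a form of data augmentation over long training runs.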

Significance

RoBERTa demonstrated that training recipe and data scale can matter as much as architecture: careful choices about objectives, batch size, and corpus size yielded large gains over BERT with the same underlying model.

Related Terms