Overview
Developed by Meta (Facebook) AI, RoBERTa showed that BERT was significantly under-trained. By training for longer, on more data, and with much larger batches, RoBERTa achieved state-of-the-art results at the time of its release on benchmarks such as GLUE and SQuAD, without changing BERT's architecture.
Key Improvements
- Removed Next Sentence Prediction: Dropping the NSP objective matched or slightly improved downstream performance, showing it wasn't necessary.
- Dynamic Masking: Generates a fresh masking pattern each time a sequence is fed to the model, rather than fixing the masks once during preprocessing as in BERT's static masking.
- Larger Vocabulary: Uses a byte-level BPE vocabulary of roughly 50K subword units, larger than BERT's 30K WordPiece vocabulary.
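To make the dynamic-masking idea concrete, here is a minimal sketch. The token strings, toy vocabulary, and function name are illustrative, not taken from the RoBERTa codebase; the 80/10/10 replacement split follows the standard BERT masking recipe.

```python
import random

MASK = "<mask>"
TOY_VOCAB = ["the", "a", "cat", "dog", "sat", "ran"]  # hypothetical toy vocabulary

def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Sample a fresh masking pattern on every call.

    Static masking would compute this once during preprocessing and reuse
    it every epoch; calling this per batch gives each epoch a new pattern.
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        if rng.random() < mask_prob:
            r = rng.random()
            if r < 0.8:
                out.append(MASK)                  # 80%: replace with mask token
            elif r < 0.9:
                out.append(rng.choice(TOY_VOCAB)) # 10%: replace with random token
            else:
                out.append(tok)                   # 10%: keep the original token
        else:
            out.append(tok)
    return out

tokens = ["the", "cat", "sat", "the", "dog", "ran"] * 2
# Each epoch re-masks the same sequence, so the model sees varied targets:
epoch1 = dynamic_mask(tokens, rng=random.Random(1))
epoch2 = dynamic_mask(tokens, rng=random.Random(2))
```

In the actual RoBERTa training setup this re-sampling happens in the data loader, so no masked copies of the corpus need to be stored.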
Significance
RoBERTa demonstrated that training procedure (data scale, batch size, and training duration) can matter as much as architectural changes when building foundation models.