Comparison of the architecture, training data, and training approach for the protein language model (LM) ESM-2 (Lin et al. 2023), the antibody-specific LMs AntiBERTy (Ruffolo et al. 2021) and AbLang-1 (Olsen et al. 2022b), and our new selection of antibody-specific LMs.
| Model | Architecture | Training data | Paired | Loss function | Training objective | Training steps | Batch size |
|---|---|---|---|---|---|---|---|
| ESM-2 | | | N | CE | MLM | 500K | 2M tokens |
| AntiBERTy | | 558M VH/VL | N | CE | MLM | 8 epochs | N/A |
| AbLang-1 | | | N | CE | MLM | | |
| Ab-Unpaired | | | N | CE | MLM | 10K | 1M tokens |
| Ab-Paired | | 1.26M paired | Y | CE | MLM | 10K | 1–2M tokens |
| Ab-FL | | 1.26M paired | Y | FL | MLM | 10K | 1–2M tokens |
| Ab-ModMask | | 1.26M paired | Y | FL | Modified MLM | 10K | 1–2M tokens |
| Ab-FT | | | Y | FL | Modified MLM | 10K + 1K | 1–2M tokens |
| AbLang-2 | | | Y | FL | Modified MLM | 200K + 10K | 1–2M tokens |
The architecture column shows the most similar architecture and the model's size, given as the number of layers (L) and embedding size (ES). While the exact number of training steps for AntiBERTy is unknown, it was trained for eight epochs (Ruffolo et al. 2021). AbLang-1 and the new antibody-specific LMs were trained with batches of 8192 sequences (4096 for AbLang-1 Light), each sequence comprising approximately 120 amino acids. Each batch thus contained about 1M tokens for unpaired sequences and 2M tokens for paired antibody VH-VL sequences. CE, cross-entropy loss; FL, focal loss; MLM, masked language modeling.
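The loss-function column distinguishes standard cross-entropy (CE) from focal loss (FL). As a point of reference only, the sketch below shows the standard focal-loss formulation, FL(p_t) = -(1 - p_t)^γ log(p_t), applied to masked-token prediction; the focusing parameter gamma, the function name, and the masking details are illustrative assumptions, not values taken from the table.

```python
import torch
import torch.nn.functional as F

def masked_focal_loss(logits: torch.Tensor,
                      targets: torch.Tensor,
                      mask: torch.Tensor,
                      gamma: float = 2.0) -> torch.Tensor:
    """Focal loss over masked positions only, for MLM-style training.

    logits:  (batch, seq_len, vocab) raw model outputs
    targets: (batch, seq_len) true token ids
    mask:    (batch, seq_len) boolean, True at positions that were masked
    """
    # Per-token cross-entropy at the masked positions: -log(p_t)
    ce = F.cross_entropy(logits[mask], targets[mask], reduction="none")
    p_t = torch.exp(-ce)               # model's probability of the true token
    focal = (1.0 - p_t) ** gamma * ce  # down-weight easy, high-confidence tokens
    return focal.mean()

# gamma = 0 recovers plain cross-entropy, i.e. the CE rows of the table
# (ESM-2, AntiBERTy, AbLang-1, Ab-Unpaired, Ab-Paired).
```

Because antibody framework positions are highly conserved and therefore easy to predict, a focal loss of this form shifts the training signal toward harder, more variable positions; the exact focusing parameter used for the Ab-FL, Ab-ModMask, Ab-FT, and AbLang-2 models is not given in this table.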