Sequence Parallelism
Step 1 of 5
Shards activations along the sequence dimension for LayerNorm and dropout, paired with tensor parallelism (TP). Pure PyTorch.
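
Why this works: LayerNorm normalizes each token over the hidden dimension only, and dropout is element-wise, so neither needs to see other tokens. The single-process sketch below illustrates this by splitting a tensor along the sequence dimension, running LayerNorm on each shard, and checking that the concatenated result matches the unsharded one. The names tp_size, shards, and gathered are hypothetical, and torch.cat stands in for the all-gather that a real multi-GPU setup would issue before entering a tensor-parallel region. This is not a distributed implementation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

tp_size = 4                          # hypothetical tensor-parallel group size
seq_len, batch, hidden = 8, 2, 16    # layout assumed here: [sequence, batch, hidden]

x = torch.randn(seq_len, batch, hidden)
norm = nn.LayerNorm(hidden)          # normalizes over the last (hidden) dim only
drop = nn.Dropout(p=0.1)

# Reference: LayerNorm applied to the full sequence on one device.
full_out = norm(x)

# Sequence-parallel view: each "rank" holds seq_len / tp_size tokens.
shards = torch.chunk(x, tp_size, dim=0)
shard_outs = [norm(s) for s in shards]

# Dropout is element-wise, so each rank can apply it to its own shard.
# (Assumption: in a real run each rank would use a distinct RNG seed here.)
dropped = [drop(o) for o in shard_outs]

# Stand-in for an all-gather along the sequence dimension.
gathered = torch.cat(shard_outs, dim=0)

matches = torch.allclose(full_out, gathered, atol=1e-6)
shard_shape = tuple(shards[0].shape)
print("sharded LayerNorm matches full:", matches)
print("per-rank activation shape:", shard_shape)
```

Each simulated rank stores only seq_len / tp_size of the LayerNorm and dropout activations, which is where the memory saving comes from.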