PyTorch Module
PyTorch Generator
Generate a ready-to-run PyTorch project in seconds.
Pick a blueprint and let the config snap it into place.
Distillation — PyTorch
Distill GPT-4/Claude/Gemini into a GPT-2 student built from scratch. Full control over every tensor.
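API teachers usually return text rather than logits, so one plausible reading of this blueprint is sequence-level distillation: the student is trained with next-token cross-entropy on teacher-written tokens. The sketch below is framework-free Python under that assumption; `student_logits` and `teacher_tokens` are hypothetical stand-ins, not names from the generated project.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def distill_loss(student_logits, teacher_tokens):
    """Mean negative log-likelihood of the teacher's tokens under the student."""
    nll = [-log_softmax(row)[tok] for row, tok in zip(student_logits, teacher_tokens)]
    return sum(nll) / len(nll)

vocab_size = 5
teacher_tokens = [3, 1, 4]  # token ids taken from a teacher completion (made up here)

# A student with uniform logits versus one that already favors the teacher's tokens.
untrained = [[0.0] * vocab_size for _ in teacher_tokens]
confident = [[6.0 if v == tok else 0.0 for v in range(vocab_size)] for tok in teacher_tokens]

loss_untrained = distill_loss(untrained, teacher_tokens)
loss_confident = distill_loss(confident, teacher_tokens)
print(loss_untrained, loss_confident)
```

The uniform student scores log(vocab_size), and the loss falls as the student puts more mass on the teacher's tokens.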
Distillation — HuggingFace
Distill API teachers into a pretrained HF student (DistilGPT2, GPT-2, etc.) using HF Trainer.
Fine-tuning — HuggingFace
Fine-tune any HF model with SFTTrainer. LoRA/QLoRA, 4-bit/8-bit quantization, packing, chat templates, push to Hub.
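To make the LoRA option concrete, here is a minimal stdlib-only sketch of the low-rank update itself, W + (alpha / r) * B @ A, with B initialized to zero so training starts from the base model. The dimensions and values are illustrative assumptions; a real project would set these through the adapter library's config rather than by hand.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha, r):
    """Return the merged weight W + (alpha / r) * B @ A."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / r
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

d_out, d_in, r, alpha = 8, 8, 2, 4
w = [[float(i + j) for j in range(d_in)] for i in range(d_out)]  # frozen base weight
a = [[0.1 * (i + 1)] * d_in for i in range(r)]                   # trainable, r x d_in
b_zero = [[0.0] * r for _ in range(d_out)]                       # trainable, zero init

merged_at_init = lora_weight(w, a, b_zero, alpha, r)

# After "training", B is nonzero and the merged weight moves away from W.
b_trained = [[0.5, -0.5] for _ in range(d_out)]
merged_trained = lora_weight(w, a, b_trained, alpha, r)

full_params = d_out * d_in        # what full fine-tuning would update
lora_params = r * (d_in + d_out)  # what LoRA updates instead
print(full_params, lora_params)
```

The saving grows with layer size: the adapter scales with d_in + d_out, while the full weight scales with their product.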
Fine-tuning — Unsloth
Fine-tune 4x faster with Unsloth. 4-bit LoRA, RSLoRA, gradient checkpointing, GGUF export.
RL — HuggingFace
Preference tuning with DPO, GRPO, PPO, ORPO, or KTO. Full TRL trainer configs, LoRA/QLoRA support.
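Of the listed methods, DPO has the most compact objective, so it makes a useful reference point. The sketch below computes the per-pair DPO loss from summed sequence log-probabilities; the numbers are made up, and `beta=0.1` is only an illustrative default, not a value taken from the generated configs.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    The margin measures how much more the policy prefers the chosen answer
    than the frozen reference model does.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # -log(sigmoid(x)) written as log(1 + exp(-x))
    return math.log1p(math.exp(-beta * margin))

# Policy identical to the reference: no learned preference yet.
loss_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Policy has shifted probability toward the chosen response.
loss_better = dpo_loss(-4.0, -8.0, -5.0, -7.0)

print(loss_start, loss_better)
```

At initialization the loss is log(2), and it decreases as the policy widens the gap between chosen and rejected responses relative to the reference.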
RL — Unsloth
DPO, GRPO, or ORPO at 4x speed with Unsloth. 4-bit LoRA, merged/GGUF export, push to Hub.
Data Parallelism
Pure PyTorch distributed data parallelism: replicate model, shard batches, all-reduce gradients across GPUs.
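The replicate, shard, all-reduce loop can be simulated on one machine. This is a single-process toy with a scalar model, not the blueprint's code: `all_reduce_mean` is a hypothetical stand-in for the collective a real job would call across GPUs.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to the scalar weight w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def all_reduce_mean(values):
    """Simulated all-reduce: every rank ends up holding the mean of all inputs."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x + 1.0 for x in xs]
world_size = 4
w = 0.5  # every rank holds the same replica of the model

# Shard the batch: rank r takes every world_size-th example starting at r.
shards = [(xs[r::world_size], ys[r::world_size]) for r in range(world_size)]

# Each rank computes a gradient on its own shard only.
local_grads = [grad_mse(w, sx, sy) for sx, sy in shards]

# After the all-reduce, all ranks agree on one gradient.
synced_grads = all_reduce_mean(local_grads)

# Reference: gradient over the whole batch on a single device.
full_grad = grad_mse(w, xs, ys)
print(synced_grads[0], full_grad)
```

With equal-sized shards the averaged gradient matches the full-batch gradient, which is why the replicas stay in sync after each optimizer step.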
ZeRO Optimizer
ZeRO sharding in pure PyTorch: stage 1 shards optimizer state, stage 2 adds gradients, stage 3 adds parameters. Select stage 1, 2, or 3 to trade communication for memory.
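A rough per-GPU memory estimate shows what each stage buys. The byte counts below assume mixed-precision Adam (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for fp32 optimizer state); the generated project may use different precisions, and activations are ignored.

```python
def zero_bytes_per_gpu(num_params, world_size, stage):
    """Approximate model-state bytes per GPU for ZeRO stage 0 (none) through 3."""
    params = 2 * num_params   # fp16 weights (assumed)
    grads = 2 * num_params    # fp16 gradients (assumed)
    optim = 12 * num_params   # fp32 master weights + Adam moments (assumed)
    if stage >= 1:
        optim /= world_size   # stage 1: shard optimizer state
    if stage >= 2:
        grads /= world_size   # stage 2: also shard gradients
    if stage >= 3:
        params /= world_size  # stage 3: also shard parameters
    return params + grads + optim

num_params = 7_500_000_000
world_size = 64
estimates = {stage: zero_bytes_per_gpu(num_params, world_size, stage) / 1e9
             for stage in range(4)}
for stage, gigabytes in estimates.items():
    print(f"stage {stage}: {gigabytes:.1f} GB per GPU")
```

Each stage shards one more class of state, so the per-GPU footprint drops monotonically from stage 0 to stage 3.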
Tensor Parallelism
Split weight matrices across GPUs with column/row parallel linears and attention head splitting. Pure PyTorch.
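The two linear-layer splits can be checked against a plain matmul. This toy uses the convention y = x @ W with W of shape (d_in, d_out) and simulates two ranks in one process; the concatenation and the elementwise sum stand in for the all-gather and all-reduce a real run would perform.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_cols(m, parts):
    """Split a matrix into equal column blocks."""
    width = len(m[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in m] for p in range(parts)]

def split_rows(m, parts):
    """Split a matrix into equal row blocks."""
    height = len(m) // parts
    return [m[p * height:(p + 1) * height] for p in range(parts)]

x = [[1, 2, 3, 4], [5, 6, 7, 8]]                              # batch of 2, d_in = 4
w = [[1, 0, 2, 1], [0, 1, 1, 2], [3, 1, 0, 1], [1, 2, 1, 0]]  # d_in x d_out
reference = matmul(x, w)
ranks = 2

# Column parallel: each rank owns a slice of the output features.
col_outputs = [matmul(x, w_shard) for w_shard in split_cols(w, ranks)]
col_result = [[v for out in col_outputs for v in out[i]] for i in range(len(x))]

# Row parallel: each rank owns a slice of the input features and produces a
# partial sum over its slice; summing the partials completes the matmul.
partials = [matmul(x_shard, w_shard)
            for x_shard, w_shard in zip(split_cols(x, ranks), split_rows(w, ranks))]
row_result = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

print(col_result == reference, row_result == reference)
```

Chaining a column-parallel layer into a row-parallel one is convenient because the first layer's output is already split the way the second layer expects its input.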
Pipeline Parallelism
Split model layers across GPUs with micro-batch pipelining. GPipe (fill-drain) and 1F1B schedules. Pure PyTorch.
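The fill-drain pattern is easier to see as a timetable. The sketch below builds only the forward half of a GPipe-style schedule and measures the idle fraction (the pipeline bubble); the function names are hypothetical, and 1F1B is not modeled.

```python
def gpipe_forward_schedule(num_stages, num_microbatches):
    """Row t lists, per stage, the micro-batch it processes at step t (or None)."""
    steps = num_stages + num_microbatches - 1
    schedule = []
    for t in range(steps):
        row = []
        for stage in range(num_stages):
            micro = t - stage  # micro-batch m reaches stage s at step m + s
            row.append(micro if 0 <= micro < num_microbatches else None)
        schedule.append(row)
    return schedule

def bubble_fraction(num_stages, num_microbatches):
    """Idle share of the timetable: (S - 1) / (M + S - 1)."""
    return (num_stages - 1) / (num_microbatches + num_stages - 1)

schedule = gpipe_forward_schedule(num_stages=3, num_microbatches=4)
for t, row in enumerate(schedule):
    print(t, row)

# Count idle slots directly and compare with the closed form.
slots = [cell for row in schedule for cell in row]
measured_idle = slots.count(None) / len(slots)
print(measured_idle, bubble_fraction(3, 4))
```

Raising the number of micro-batches relative to the number of stages shrinks the bubble, which is the main tuning knob for this kind of schedule.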
Sequence Parallelism
Shard sequence dimension for LayerNorm/dropout, paired with tensor parallelism. Reduces activation memory. Pure PyTorch.
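Sharding along the sequence works for LayerNorm because it normalizes each token independently over the hidden dimension. The toy below splits a sequence across two simulated ranks and checks that gathering the shards reproduces the unsharded result; the learnable scale and bias are omitted for brevity.

```python
import math

def layer_norm(token, eps=1e-5):
    """Normalize one token's hidden vector to zero mean and unit variance."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]

seq_len, hidden, ranks = 8, 4, 2
sequence = [[float(t * hidden + h) ** 0.5 for h in range(hidden)] for t in range(seq_len)]

# Unsharded reference: one device normalizes every token.
full = [layer_norm(tok) for tok in sequence]

# Sequence parallel: each rank holds and normalizes a contiguous slice of tokens.
chunk = seq_len // ranks
shards = [sequence[r * chunk:(r + 1) * chunk] for r in range(ranks)]
local_outputs = [[layer_norm(tok) for tok in shard] for shard in shards]

# Simulated all-gather along the sequence dimension.
gathered = [tok for out in local_outputs for tok in out]

tokens_per_rank = len(shards[0])
print(gathered == full, tokens_per_rank)
```

Each rank stores activations for only seq_len / ranks tokens in these layers, which is where the activation-memory saving comes from.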
Expert Parallelism
Mixture-of-Experts with all-to-all token dispatch, top-k routing, load balancing loss, and shared experts. Pure PyTorch.
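Routing and balancing can be sketched without any framework. The blueprint does not say which load balancing loss it uses; the version below assumes a Switch-Transformer-style auxiliary loss (number of experts times the sum over experts of dispatch fraction times mean router probability), and `dispatch` only groups token indices the way an all-to-all would send them. Shared experts are left out.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, k):
    """Per token: pick the top-k experts and renormalize their gate weights."""
    assignments, probs = [], []
    for logits in router_logits:
        p = softmax(logits)
        top = sorted(range(len(p)), key=lambda e: p[e], reverse=True)[:k]
        norm = sum(p[e] for e in top)
        assignments.append((top, [p[e] / norm for e in top]))
        probs.append(p)
    return assignments, probs

def dispatch(assignments, num_experts):
    """Group token indices by destination expert (what all-to-all would ship)."""
    buckets = [[] for _ in range(num_experts)]
    for token, (experts, _) in enumerate(assignments):
        for e in experts:
            buckets[e].append(token)
    return buckets

def load_balance_loss(assignments, probs, num_experts):
    """Assumed Switch-style auxiliary loss; about 1.0 when perfectly balanced."""
    slots = sum(len(experts) for experts, _ in assignments)
    frac = [0.0] * num_experts
    for experts, _ in assignments:
        for e in experts:
            frac[e] += 1 / slots
    mean_prob = [sum(p[e] for p in probs) / len(probs) for e in range(num_experts)]
    return num_experts * sum(f * m for f, m in zip(frac, mean_prob))

num_experts = 4
balanced_logits = [[2.0 if e == t else 0.0 for e in range(num_experts)] for t in range(4)]
skewed_logits = [[2.0, 0.0, 0.0, 0.0] for _ in range(4)]

balanced_assign, balanced_probs = route(balanced_logits, k=1)
skewed_assign, skewed_probs = route(skewed_logits, k=1)
balanced_loss = load_balance_loss(balanced_assign, balanced_probs, num_experts)
skewed_loss = load_balance_loss(skewed_assign, skewed_probs, num_experts)
print(dispatch(balanced_assign, num_experts), balanced_loss, skewed_loss)
```

When every token picks the same expert the loss rises well above the balanced value, which pushes the router back toward an even spread.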
HuggingFace Pipeline
Generate a HuggingFace pipeline project with model loading, inference, and configuration.
GPU Kernel Blueprint
Generate custom CUDA kernels with automatic benchmark harness, Makefile, and PyTorch integration.