autograd provides ready-to-use model types and optimizers so you can focus on the training loop rather than layer bookkeeping.
Multi-Layer Perceptron (MLP)
model.NewMLP constructs a fully-connected network. Pass the output sizes for each layer as a slice, and optionally configure the activation function and random source.
Each element of the outSize slice sets the output size of one layer, so the last element is the size of the output layer. All preceding layers use the configured activation function; the final layer is linear (no activation).
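A minimal construction sketch is shown below. The forward-pass signature and the use of variable.New to build the input are assumptions based on the descriptions above; the activation and random-source options are omitted.

```go
package main

import (
	"fmt"

	"github.com/itsubaki/autograd/model"
	"github.com/itsubaki/autograd/variable"
)

func main() {
	// Two hidden layers with 100 units each and a 10-unit linear output layer.
	m := model.NewMLP([]int{100, 100, 10})

	// Assumption: variable.New builds a row vector from float64 values and
	// MLP.Forward returns a single *variable.Variable.
	x := variable.New(0.1, 0.2, 0.3)
	y := m.Forward(x)
	fmt.Println(y)
}
```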
Training loop
m.Cleargrads() calls Cleargrad() on every parameter in the model. Call it before loss.Backward() each iteration so gradients reflect only the current batch.
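The sketch below shows one possible iteration order: forward pass, loss, Cleargrads, Backward, Update. The optimizer.SGD construction (and its LearningRate field name) and the toy scalar data are assumptions; the Cleargrads/Backward/Update calls follow the descriptions in this page.

```go
package main

import (
	F "github.com/itsubaki/autograd/function"
	"github.com/itsubaki/autograd/model"
	"github.com/itsubaki/autograd/optimizer"
	"github.com/itsubaki/autograd/variable"
)

func main() {
	m := model.NewMLP([]int{10, 1})
	o := optimizer.SGD{LearningRate: 0.01} // field name is an assumption

	// Toy scalar data; replace with real batches.
	x, t := variable.New(0.5), variable.New(1.0)

	for i := 0; i < 100; i++ {
		y := m.Forward(x)
		loss := F.MeanSquaredError(y, t)

		m.Cleargrads() // clear old gradients before Backward
		loss.Backward()
		o.Update(m) // the same call works for any optimizer
	}
}
```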
Optimizers
All optimizers implement the same Update(model) interface. Swap them without changing the rest of the training loop.
SGD
param = param - lr*grad.
Momentum
v = momentum*v - lr*grad, then param = param + v.
Adam
Adam adapts the step size per parameter using exponentially decaying estimates of the first and second moments of the gradient, with bias correction.
AdamW
AdamW applies the same update as Adam but decouples weight decay, subtracting lr * WeightDecay * param as a separate decay term.
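The sketch below constructs each optimizer and applies it to the same model to show that they are interchangeable behind Update(model). The hyperparameter field names (LearningRate, Momentum, Alpha, Beta1, Beta2, WeightDecay) are assumptions and may differ in the optimizer package.

```go
package main

import (
	F "github.com/itsubaki/autograd/function"
	"github.com/itsubaki/autograd/model"
	"github.com/itsubaki/autograd/optimizer"
	"github.com/itsubaki/autograd/variable"
)

func main() {
	m := model.NewMLP([]int{10, 1})

	// One forward/backward pass so every parameter has a gradient to apply.
	x, t := variable.New(0.5), variable.New(1.0)
	loss := F.MeanSquaredError(m.Forward(x), t)
	m.Cleargrads()
	loss.Backward()

	// Hyperparameter field names below are assumptions.
	sgd := optimizer.SGD{LearningRate: 0.01}
	momentum := optimizer.Momentum{LearningRate: 0.01, Momentum: 0.9}
	adam := optimizer.Adam{Alpha: 0.001, Beta1: 0.9, Beta2: 0.999}
	adamw := optimizer.AdamW{Alpha: 0.001, Beta1: 0.9, Beta2: 0.999, WeightDecay: 0.01}

	// Applying each in turn only demonstrates that they share Update(model);
	// in practice you pick one and keep the rest of the loop unchanged.
	sgd.Update(m)
	momentum.Update(m)
	adam.Update(m)
	adamw.Update(m)
}
```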
Layer-by-layer composition
You can compose models from individual layers directly using layer.Linear and layer.LSTM:
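A sketch of chaining the two layers by hand. The constructor signatures (taking an output/hidden size) and the availability of First on the LSTM layer are assumptions; First is documented below for LinearT.

```go
package main

import (
	"fmt"

	"github.com/itsubaki/autograd/layer"
	"github.com/itsubaki/autograd/variable"
)

func main() {
	// Assumption: layer.LSTM(hiddenSize) and layer.Linear(outSize) are the
	// constructors; weights are created lazily on the first forward pass.
	lstm := layer.LSTM(32)
	fc := layer.Linear(1)

	x := variable.New(0.1, 0.2, 0.3)

	// First returns the single output of Forward, so the layers chain directly.
	h := lstm.First(x)
	y := fc.First(h)
	fmt.Println(y)
}
```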
LinearT.First is a convenience wrapper that returns the first (and only) output of Forward. The weight matrix is initialized lazily on the first call using Xavier initialization; the bias is initialized to zeros.
LSTM model
model.NewLSTM creates an LSTM layer followed by a linear output layer.
The model maintains the hidden state h and cell state c across time steps. Call m.ResetState() at the beginning of each sequence (epoch) to clear them.
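A sketch of per-sequence state handling. The NewLSTM argument list (hidden size, then output size) and the single-return Forward are assumptions; the ResetState placement follows the description above.

```go
package main

import (
	"fmt"

	"github.com/itsubaki/autograd/model"
	"github.com/itsubaki/autograd/variable"
)

func main() {
	// Assumption: NewLSTM takes the LSTM hidden size and the linear output size.
	m := model.NewLSTM(100, 1)

	for epoch := 0; epoch < 10; epoch++ {
		m.ResetState() // clear h and c before each pass over the sequence

		for _, v := range []float64{0.0, 0.1, 0.2, 0.3} {
			y := m.Forward(variable.New(v)) // state carries over between time steps
			fmt.Println(epoch, y)
		}
	}
}
```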
Truncated BPTT
Training recurrent models on long sequences requires Truncated Backpropagation Through Time (TBPTT). Instead of accumulating loss for the entire sequence, update every bpttLength steps and detach the computation graph with UnchainBackward().
loss.UnchainBackward() severs the links between loss and the preceding computation graph nodes, so the next segment’s backward pass starts from the current loss node without re-traversing earlier time steps. The hidden state h and cell state c are preserved across the cut, keeping temporal context intact.
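A TBPTT sketch following that description. The NewLSTM signature, the SGD field name, and the use of F.Add to accumulate the running loss are assumptions; the Cleargrads/Backward/UnchainBackward/Update ordering follows the text above.

```go
package main

import (
	F "github.com/itsubaki/autograd/function"
	"github.com/itsubaki/autograd/model"
	"github.com/itsubaki/autograd/optimizer"
	"github.com/itsubaki/autograd/variable"
)

func main() {
	m := model.NewLSTM(100, 1)             // sizes are assumptions
	o := optimizer.SGD{LearningRate: 0.01} // field name is an assumption
	bpttLength := 30

	xs := make([]float64, 1000) // stand-in for the real input sequence

	m.ResetState()
	loss, count := variable.New(0.0), 0

	for i := 0; i < len(xs)-1; i++ {
		x, t := variable.New(xs[i]), variable.New(xs[i+1])
		y := m.Forward(x)
		loss = F.Add(loss, F.MeanSquaredError(y, t)) // accumulate segment loss
		count++

		// Every bpttLength steps (and at the end), backprop through the
		// current segment only, then cut the graph behind the loss node.
		if count%bpttLength == 0 || i == len(xs)-2 {
			m.Cleargrads()
			loss.Backward()
			loss.UnchainBackward() // h and c keep their values across the cut
			o.Update(m)
		}
	}
}
```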
Running the LSTM example
The cmd/lstm program trains on a noisy sine wave and outputs its predictions as CSV.
Use variable.Nograd() to skip graph construction during inference:
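A minimal inference sketch. The exact behavior of variable.Nograd() (whether it disables graph construction globally or returns a restore function) is an assumption to verify against the variable package; the model sizes are also placeholders.

```go
package main

import (
	"fmt"

	"github.com/itsubaki/autograd/model"
	"github.com/itsubaki/autograd/variable"
)

func main() {
	m := model.NewLSTM(100, 1) // sizes are assumptions

	// Assumption: calling Nograd() disables graph construction for the
	// forward passes that follow.
	variable.Nograd()

	m.ResetState()
	for _, v := range []float64{0.0, 0.1, 0.2} {
		y := m.Forward(variable.New(v))
		fmt.Println(y) // predictions only; no graph is kept for Backward
	}
}
```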
Loss functions
| Function | Use case |
|---|---|
| F.MeanSquaredError(y, t) | Regression |
| F.SoftmaxCrossEntropy(x, t) | Multi-class classification (expects logits shaped (N, C) and integer labels shaped (N,)) |
Next steps
Gradient Descent
Understand the manual update loop before using higher-level optimizers.
Higher-Order Gradients
Use CreateGraph to compute second derivatives for Newton’s method.
Model API
Full reference for MLP and LSTM model types.
Optimizer API
Full reference for SGD, Adam, AdamW, and Momentum.