
Commit 702cd5e

updated readme and docs

1 parent 999271c commit 702cd5e

File tree

4 files changed: 66 additions & 58 deletions


README.md

Lines changed: 50 additions & 40 deletions
@@ -2,65 +2,61 @@

# torchzero

-This is a work-in-progress optimizers library for pytorch with composable zeroth, first, second order and quasi newton methods, gradient approximation, line searches and a whole lot of other stuff.
-
-Most optimizers are modular, meaning you can chain them like this:
+`torchzero` implements a large number of optimization modules that can be chained together to create custom optimizers:

```py
-optimizer = torchzero.optim.Modular(model.parameters(), [*list of modules*])`
+import torchzero as tz
+
+optimizer = tz.Modular(
+    model.parameters(),
+    tz.m.Adam(),
+    tz.m.Cautious(),
+    tz.m.LR(1e-3),
+    tz.m.WeightDecay(1e-4)
+)
+
+# standard training loop
+for batch in dataset:
+    preds = model(batch)
+    loss = criterion(preds)
+    optimizer.zero_grad()
+    loss.backward()
+    optimizer.step()
```

-For example you might use `[ClipNorm(4), LR(1e-3), NesterovMomentum(0.9)]` for standard SGD with gradient clipping and nesterov momentum. Move `ClipNorm` to the end to clip the update instead of the gradients. If you don't have access to gradients, add a `RandomizedFDM()` at the beginning to approximate them via randomized finite differences. Add `Cautious()` to make the optimizer cautious.
+Each module takes the output of the previous module and applies a further transformation. This modular design avoids redundant code, such as reimplementing cautioning, orthogonalization, laplacian smoothing, etc. for every optimizer. It is also easy to experiment with grafting, interpolation between different optimizers, and perhaps some weirder combinations like nested momentum.

-Each new module takes previous module update and works on it. That way there is no need to reimplement stuff like laplacian smoothing for all optimizers, and it is easy to experiment with grafting, interpolation between different optimizers, and perhaps some weirder combinations like nested momentum.
+Modules are not limited to gradient transformations. They can perform other operations like line searches, exponential moving average (EMA) and stochastic weight averaging (SWA), gradient accumulation, gradient approximation, and more.

-# How to use
+There are over 100 modules, all accessible within the `tz.m` namespace. For example, the Adam update rule is available as `tz.m.Adam`. A complete list of modules is available in the [documentation](https://torchzero.readthedocs.io/en/latest/autoapi/torchzero/modules/index.html).

-All modules are defined in `torchzero.modules`. You can generally mix and match them however you want. Some pre-made optimizers are available in `torchzero.optim`.
+## Closure

-Some optimizers require closure, which should look like this:
+Some modules and optimizers in torchzero, particularly line-search methods and gradient approximation modules, require a closure function. This is similar to how `torch.optim.LBFGS` works in PyTorch. In torchzero, the closure needs to accept a boolean `backward` argument (though the argument can have any name). When `backward=True`, the closure should zero out old gradients using `opt.zero_grad()` and compute new gradients using `loss.backward()`.

```py
def closure(backward = True):
-preds = model(inputs)
-loss = loss_fn(preds, targets)
+    preds = model(inputs)
+    loss = loss_fn(preds, targets)

-# if you can't call loss.backward(), and instead use gradient-free methods,
-# they always call closure with backward=False.
-# so you can remove the part below, but keep the unused backward argument.
-if backward:
-optimizer.zero_grad()
-loss.backward()
-return loss
+    if backward:
+        optimizer.zero_grad()
+        loss.backward()
+    return loss

optimizer.step(closure)
```

-This closure will also work with all built in pytorch optimizers, including LBFGS, all optimizers in this library, as well as most custom ones.
+If you intend to use gradient-free methods, the `backward` argument is still required in the closure. Simply leave it unused. Gradient-free and gradient approximation methods always call closure with `backward=False`.

-# Contents
+All built-in pytorch optimizers, as well as most custom ones, support closures too. So the code above will work with all other optimizers out of the box, and you can switch between different optimizers without rewriting your training loop.

-Docs are available at [torchzero.readthedocs.io](https://torchzero.readthedocs.io/en/latest/). A preliminary list of all modules is available here <https://torchzero.readthedocs.io/en/latest/autoapi/torchzero/modules/index.html#classes>. Some of the implemented algorithms:
+# Documentation

-- SGD/Rprop/RMSProp/AdaGrad/Adam as composable modules. They are also tested to exactly match built in pytorch versions.
-- Cautious Optimizers (<https://huggingface.co/papers/2411.16085>)
-- Optimizer grafting (<https://openreview.net/forum?id=FpKgG31Z_i9>)
-- Laplacian smoothing (<https://arxiv.org/abs/1806.06317>)
-- Polyak momentum, nesterov momentum
-- Gradient norm and value clipping, gradient normalization
-- Gradient centralization (<https://arxiv.org/abs/2004.01461>)
-- Learning rate droput (<https://pubmed.ncbi.nlm.nih.gov/35286266/>).
-- Forward gradient (<https://arxiv.org/abs/2202.08587>)
-- Gradient approximation via finite difference or randomized finite difference, which includes SPSA, RDSA, FDSA and Gaussian smoothing (<https://arxiv.org/abs/2211.13566v3>)
-- Various line searches
-- Exact Newton's method (with Levenberg-Marquardt regularization), newton with hessian approximation via finite difference, subspace finite differences newton.
-- Directional newton via one additional forward pass
+For more information on how to create, use and extend torchzero modules, please refer to the documentation at [torchzero.readthedocs.io](https://torchzero.readthedocs.io/en/latest/index.html).

-All modules should be quite fast, especially on models with many different parameters, due to `_foreach` operations.
+# Extra

-I am getting to the point where I can start focusing on good docs and tests. As of now, the code should be considered experimental, untested and subject to change, so feel free but be careful if using this for actual project.
-
-# Wrappers
+Some other optimization-related things in torchzero:

### scipy.optimize.minimize wrapper

@@ -71,12 +67,26 @@ from torchzero.optim.wrappers.scipy import ScipyMinimize
opt = ScipyMinimize(model.parameters(), method = 'trust-krylov')
```

-Use as any other optimizer (make sure closure accepts `backward` argument like one from **How to use**). Note that it performs full minimization on each step.
+Use as any other closure-based optimizer, but make sure the closure accepts a `backward` argument. Note that it performs full minimization on each step.

### Nevergrad wrapper

+[Nevergrad](https://github.com/facebookresearch/nevergrad) is an optimization library by Facebook with an insane number of gradient-free methods.
+
```py
+from torchzero.optim.wrappers.nevergrad import NevergradOptimizer
opt = NevergradOptimizer(bench.parameters(), ng.optimizers.NGOptBase, budget = 1000)
```

-Use as any other optimizer (make sure closure accepts `backward` argument like one from **How to use**).
+Use as any other closure-based optimizer, but make sure the closure accepts a `backward` argument.
+
+### NLopt wrapper
+
+[NLopt](https://nlopt.readthedocs.io/en/latest/NLopt_Algorithms/) is another optimization library similar to scipy.optimize.minimize, with a large number of both gradient-based and gradient-free methods.
+
+```py
+from torchzero.optim.wrappers.nlopt import NLOptOptimizer
+opt = NLOptOptimizer(bench.parameters(), 'LD_TNEWTON_PRECOND_RESTART', maxeval = 1000)
+```
+
+Use as any other closure-based optimizer, but make sure the closure accepts a `backward` argument. Note that it performs full minimization on each step.
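
The closure pattern above also covers gradient-free use. Below is a minimal sketch of such a setup; it assumes `tz.m.RandomizedFDM` (the randomized finite-differences module referenced in the README) works with its default settings, so treat the module arguments as indicative rather than exact:

```py
import torch
import torchzero as tz

# toy problem, only to make the sketch self-contained
model = torch.nn.Linear(10, 1)
inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
loss_fn = torch.nn.MSELoss()

# gradient approximation goes first, so later modules receive the estimated gradient
optimizer = tz.Modular(
    model.parameters(),
    tz.m.RandomizedFDM(),  # assumed default settings
    tz.m.LR(1e-2),
)

def closure(backward=True):
    preds = model(inputs)
    loss = loss_fn(preds, targets)
    if backward:  # gradient-free modules call the closure with backward=False, so this branch is skipped
        optimizer.zero_grad()
        loss.backward()
    return loss

for _ in range(100):
    optimizer.step(closure)
```

Because the closure keeps its `backward` branch, the same loop also works unchanged with gradient-based modules, with the wrappers above, or with built-in closure-based optimizers such as `torch.optim.LBFGS`.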

docs/source/FAQ.rst

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ Using torchzero optimizers is generally similar to using built-in PyTorch optimi
opt.zero_grad()


-Some modules and optimizers in torchzero, particularly line-search methods and gradient approximation modules, require a closure function. This is similar to how :code:`torch.optim.LBFGS` works in PyTorch. In torchzero, the closure function for these optimizers needs to accept an argument (we'll call it backward, though the argument can have any name). When :code:`backward=True`, the closure should zero out gradients using :code:`opt.zero_grad()`, and compute gradients using :code:`loss.backward()`.
+Some modules and optimizers in torchzero, particularly line-search methods and gradient approximation modules, require a closure function. This is similar to how :code:`torch.optim.LBFGS` works in PyTorch. In torchzero, the closure needs to accept a boolean :code:`backward` argument (though the argument can have any name). When :code:`backward=True`, the closure should zero out gradients using :code:`opt.zero_grad()` and compute gradients using :code:`loss.backward()`.

Here's how a training loop with a closure looks:

docs/source/implementing.rst

Lines changed: 10 additions & 12 deletions
@@ -19,12 +19,10 @@ Like in pytorch, putting all settings into :code:`defaults` dictionary allows to
super().__init__(defaults)


-Note:, please don't add :code:`lr` setting to your modules. When learning rate is part of the update rule, like in Adam, I rename it to :code:`alpha` and set to 1 by default. Learning rate should be controlled by a separate :py:class:`tz.m.LR<torchzero.modules.LR>` module, this avoids unintended compounding of learning rate modifications when using learning rate schedulers and per-parameter lr settings (see :ref:`How do we handle learning rates?`).
+Note: please don't use an :code:`lr` setting in your modules. When learning rate is part of the update rule, like in Adam, I rename it to :code:`alpha` and set it to 1 by default. Learning rate should be controlled by a separate :py:class:`tz.m.LR<torchzero.modules.LR>` module; this avoids unintended compounding of learning rate modifications when using learning rate schedulers and per-parameter lr settings (see :ref:`How do we handle learning rates?`).

-Implementing update rule
+Implementing the update rule
=============================
-Now we can implement the update rule.
-
Update logic in :code:`OptimizerModule` is defined in the :code:`step` method. By default it calls :code:`_update`, which in turn calls :code:`_single_tensor_update`. You can overwrite one of those three methods depending on how much control you need.

Method 1. Overwriting _single_tensor_update
@@ -35,12 +33,12 @@ For most update rules overwriting `_single_tensor_update` is the most convenient
:code:`_single_tensor_update` accepts the following arguments:

* :code:`vars`: :py:mod:`tz.core.OptimizationVars<torchzero.core.OptimizationVars>` object with various useful attributes, such as closure, list of current update tensors, loss, list of gradient tensors. For now we don't need this.
-* :code:`ascent`: torch.Tensor of the ascent - gradient or another modules update. This is what the update rule modifies.
-* :code:`param`: torch.Tensor of the parameter, useful for implementing weight decay and accessing per-parameter states.
-* :code:`grad`: torch.Tensor of the initial gradient, not transformed by previous modules. Sometimes gradient is never evaluated, like in gradient free methods, so this may be None.
+* :code:`ascent`: torch.Tensor with the ascent direction (update), which is the gradient if this module is first, or the update generated by the previous module. This is what the update rule should modify and return.
+* :code:`param`: torch.Tensor with the parameter, useful for implementing weight decay and accessing per-parameter states.
+* :code:`grad`: torch.Tensor with the initial gradient, not transformed by previous modules. Useful for things like cautious optimizers that compare the update sign with the gradient sign. Sometimes the gradient is never evaluated, like in gradient-free methods, so this may be None.
* per-parameter settings in any order, in the Adam example below :code:`beta1, beta2, eps, alpha`. Everything passed to :code:`defaults` will be accessible there.

-The method should return the updated ascent tensor. Please do not update :code:`param` directly.
+The method should return the updated ascent direction tensor. Please do not update :code:`param` directly.

Here is a ready to use Adam implementation through overwriting :code:`_single_tensor_update`:

@@ -94,20 +92,20 @@ Here is a ready to use Adam implementation through overwriting :code:`_single_te

Method 2. Overwriting _update
+++++++++++++++++++++++++++++++++++++++++++++
-:code:`_update` is similar to :code:`_single_tensor_update`, however you get access to all ascent tensors in a single list, as opposed to looping through each element. That way you can use pytorch `_foreach_xxx <https://pytorch.org/docs/stable/torch.html#foreach-operations>`_ operations for better performance. Most modules in torchzero are implemented through overwriting `_update` and with _foreach operations.
+:code:`_update` is similar to :code:`_single_tensor_update`, however you get access to all ascent tensors in a single list, as opposed to looping through each element. That way you can use pytorch `_foreach_xxx <https://pytorch.org/docs/stable/torch.html#foreach-operations>`_ operations for better performance. Most modules in torchzero are implemented by overwriting :code:`_update` and using :code:`_foreach` operations.

:code:`_update` accepts the following arguments:

* :code:`vars`: :py:mod:`tz.core.OptimizationVars<torchzero.core.OptimizationVars>` object with various useful attributes, such as closure, list of current update tensors, loss, list of gradient tensors. For now we don't need this.
-* :code:`ascent`: :py:mod:`tz.TensorList<torchzero.TensorList>` - list of tensors of the ascent direction (gradient or update) for each parameter with :code:`requires_grad = True`. :code:`TensorList` is a subclass of python list with some additional methods, but we won't use those methods for now.
+* :code:`ascent`: :py:mod:`tz.TensorList<torchzero.TensorList>` - list of tensors of the ascent direction (gradient or update) for each parameter with :code:`requires_grad = True`. :code:`TensorList` is a subclass of python list with some additional methods, but we won't use those methods for now. As it is a subclass of list, you can pass it directly to :code:`torch._foreach_xxx` methods.

The method should return the updated ascent :code:`TensorList`.

To make working with lists of tensors more convenient, :code:`OptimizerModule` also has some helper methods.

* :code:`self.get_params()`: returns a list of tensors of all params with :code:`requires_grad = True`.
-* :code:`self.get_group_key(key)`, :code:`self.get_group_keys(keys)`: return list of values of a per-parameter setting (such as beta1, beta2, eps) for each parameter with :code:`requires_grad = True`.
-* :code:`self.get_state_key(key)`, :code:`self.get_state_keys(keys)`: return a list of tensors of a state (e.g. exponential average) of each parameter with :code:`requires_grad = True`, initializes the state to zeroes if it doesn't exist.
+* :code:`self.get_group_key(key)`, :code:`self.get_group_keys(*keys)`: return a list of values of a per-parameter setting (such as beta1, beta2, eps) for each parameter with :code:`requires_grad = True`.
+* :code:`self.get_state_key(key)`, :code:`self.get_state_keys(*keys)`: return a list of tensors of a state (e.g. exponential average) for each parameter with :code:`requires_grad = True`; the state is initialized to zeroes if it doesn't exist.

Here is a ready to use Adam implementation through overwriting :code:`_update` using :code:`_foreach` methods. Using a lot of :code:`_foreach_xxx` methods is not very readable, but it is fast.
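
To make the bullets above concrete, here is a minimal sketch of a heavy-ball momentum module written against the API described in this file (`OptimizerModule`, `defaults`, `_update`, `get_state_key`, `get_group_key`). The import path and exact signatures are assumptions, so treat it as an illustration rather than a drop-in implementation; the Adam examples referenced above remain the canonical reference.

```py
# A hypothetical heavy-ball momentum module, sketched from the API described above.
# The import path of OptimizerModule is assumed and may differ in the actual library.
import torch
from torchzero.core import OptimizerModule

class HeavyBall(OptimizerModule):
    def __init__(self, momentum: float = 0.9):
        defaults = dict(momentum=momentum)  # per-parameter setting; deliberately no `lr` here
        super().__init__(defaults)

    def _update(self, vars, ascent):
        # per-parameter state tensors, zero-initialized on first use
        velocity = self.get_state_key('velocity')
        # per-parameter values of the 'momentum' setting
        momentum = self.get_group_key('momentum')

        # velocity <- momentum * velocity + ascent, using _foreach operations
        torch._foreach_mul_(velocity, momentum)
        torch._foreach_add_(velocity, ascent)

        # the new ascent direction is the velocity; copy it into `ascent`
        # so that later modules do not mutate the stored state
        for a, v in zip(ascent, velocity):
            a.copy_(v)
        return ascent
```

There is no learning rate parameter because, per the note at the top of this file, the step size would come from a separate `tz.m.LR` module placed after this one, e.g. `tz.Modular(model.parameters(), HeavyBall(0.9), tz.m.LR(1e-2))`.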

docs/source/introduction.rst

Lines changed: 5 additions & 5 deletions
@@ -1,13 +1,13 @@
Introduction
==================

-`torchzero` is a library for pytorch that offers a flexible and modular way to build optimizers for various tasks. By combining smaller, reusable modules, you can easily customize and experiment with different optimization strategies.
+torchzero is a library for pytorch that offers a flexible and modular way to build optimizers for various tasks. By combining smaller, reusable modules, you can easily customize and experiment with different optimization strategies.

-Each module in `torchzero` takes the output of the previous module and applies a further transformation. This modular design avoids redundant code, such as reimplementing Laplacian smoothing, cautioning, orthogonalization, etc for every optimizer. It also simplifies experimenting with advanced techniques like optimizer grafting, interpolation, and complex combinations like nested momentum.
+Each module takes the output of the previous module and applies a further transformation. This modular design avoids redundant code, such as reimplementing Laplacian smoothing, cautioning, orthogonalization, etc. for every optimizer. It also simplifies experimenting with advanced techniques like optimizer grafting, interpolation, and complex combinations like nested momentum.

-Many modules in `torchzero` perform gradient transformations. They receive an "ascent direction," which is initially the gradient, modify it, and pass it to the next module in the chain. Typically, the first module uses the raw gradient as the starting ascent direction. However, modules are not limited to gradient transformations. They can perform other operations like line searches, exponential moving average (EMA) and stochastic weight averaging (SWA), gradient accumulation, gradient approximation, and more.
+Many modules perform gradient transformations. They receive an "ascent direction," which is initially the gradient, modify it, and pass it to the next module in the chain. Typically, the first module uses the raw gradient as the starting ascent direction. However, modules are not limited to gradient transformations. They can perform other operations like line searches, exponential moving average (EMA) and stochastic weight averaging (SWA), gradient accumulation, gradient approximation, and more.

-`torchzero` provides over 100 modules, all accessible within the :py:mod:`tz.m<torchzero.modular>` namespace. For example, the Adam module is available as :py:class:`tz.m.Adam<torchzero.modules.Adam>`. You can find a complete list of modules in the `torchzero` documentation: https://torchzero.readthedocs.io/en/latest/autoapi/torchzero/modules/index.html.
+torchzero provides over 100 modules, all accessible within the :py:mod:`tz.m<torchzero.modular>` namespace. For example, the Adam module is available as :py:class:`tz.m.Adam<torchzero.modules.Adam>`. You can find a complete list of modules in the torchzero documentation: https://torchzero.readthedocs.io/en/latest/autoapi/torchzero/modules/index.html.

To combine these modules and create a custom optimizer, use tz.Modular, and then use it as any other pytorch optimizer. Here’s an example of how to define a Cautious Adam optimizer with gradient clipping and decoupled weight decay:

@@ -44,4 +44,4 @@ To combine these modules and create a custom optimizer, use tz.Modular, and then
print(epoch, loss.item(), end = ' \r')


-Please head over to :ref:FAQ for more examples and information.
+Please head over to :ref:`FAQ` for more examples and information.
