functorch



Why functorch? | Install guide | Transformations | Future Plans

functorch is a prototype of JAX-like composable FUNCtion transforms for PyTorch.

It aims to provide composable vmap and grad transforms that work with PyTorch modules and PyTorch autograd with good eager-mode performance. Because this project requires some investment, we'd love to hear from and work with early adopters to shape the design. Please reach out on the issue tracker if you're interested in using this for your project.

Why composable function transforms?

There are a number of use cases that are tricky to do in PyTorch today:

  • computing per-sample-gradients (or other per-sample quantities)
  • running ensembles of models on a single machine
  • efficiently batching together tasks in the inner-loop of MAML
  • efficiently computing Jacobians and Hessians
  • efficiently computing batched Jacobians and Hessians

Composing vmap, grad, and vjp transforms allows us to express the above without designing a separate subsystem for each. This idea of composable function transforms comes from the JAX framework.
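
For instance, the model-ensembling case can be written as a single vmap over a stacked weight tensor. A minimal sketch (an added illustration, not from the original README), assuming each ensemble member is a tiny linear model:

>>> import torch
>>> from functorch import vmap
>>> num_models, batch_size, feature_size = 4, 3, 5
>>> weights = torch.randn(num_models, feature_size)  # one weight vector per ensemble member
>>> examples = torch.randn(batch_size, feature_size)
>>> def model(w, x):
>>>     # a very simple linear model with activation
>>>     return (x @ w).relu()
>>> # map over dim 0 of weights; reuse the same examples for every member
>>> out = vmap(model, in_dims=(0, None))(weights, examples)
>>> assert out.shape == (num_models, batch_size)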



Install

Colab

Follow the instructions in this Colab notebook.


Binaries

First, set up an environment. We will be installing a nightly PyTorch binary as well as functorch. If you're using conda, create a conda environment:

conda create --name functorch
conda activate functorch

If you wish to use venv instead:

python -m venv functorch-env
source functorch-env/bin/activate

Next, install one of the following PyTorch nightly binaries. functorch works with any of these; a more recent nightly should work as well.

# For CUDA 10.2
pip install --pre torch==1.9.0.dev20210517 -f
# For CUDA 11.1
pip install --pre torch==1.9.0.dev20210517 -f
# For CPU-only build
pip install --pre torch==1.9.0.dev20210517 -f

Install functorch:

pip install --user "git+"

Run a quick sanity check in Python:

>>> import torch
>>> from functorch import vmap
>>> x = torch.randn(3)
>>> y = vmap(torch.sin)(x)
>>> assert torch.allclose(y, x.sin())

From Source

functorch is a PyTorch C++ Extension module. To install,

  • Install PyTorch from source. functorch usually runs on the latest development version of PyTorch.
  • Run python install. You can use DEBUG=1 to compile in debug mode.

Then, try to run some tests to make sure all is OK:

pytest test/ -v

What are the transforms?

Right now, we support the following transforms:

  • grad, vjp, jacrev
  • vmap

Furthermore, we have some utilities for working with PyTorch modules.

  • make_functional(model) takes a model and returns its weights and a function version of the model that has no state.
  • make_functional_with_buffers(model) takes a model and returns its weights and buffers and a function version of the model that has no state.
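
For example, a minimal sketch of make_functional (an added illustration, not from the original README; note that in the functorch versions used in the issues further down, make_functional returns the function first and the parameters second):

>>> import torch
>>> import torch.nn as nn
>>> from functorch import make_functional
>>> model = nn.Linear(3, 3)
>>> func_model, params = make_functional(model)  # stateless function plus a tuple of parameters
>>> x = torch.randn(4, 3)
>>> assert torch.allclose(func_model(params, x), model(x))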


vmap

Note: vmap imposes restrictions on the code that it can be used on. For more details, please read its docstring.

vmap(func)(*inputs) is a transform that adds a dimension to all Tensor operations in func. vmap(func) returns a new function that maps func over some dimension (default: 0) of each Tensor in inputs.

vmap is useful for hiding batch dimensions: one can write a function func that runs on examples and then lift it to a function that can take batches of examples with vmap(func), leading to a simpler modeling experience:

>>> from functorch import vmap
>>> batch_size, feature_size = 3, 5
>>> weights = torch.randn(feature_size, requires_grad=True)
>>> def model(feature_vec):
>>>     # Very simple linear model with activation
>>>     assert feature_vec.dim() == 1
>>>     return, weights).relu()
>>> examples = torch.randn(batch_size, feature_size)
>>> result = vmap(model)(examples)
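
As the note above says, vmap restricts what func may do. A small hedged sketch (added here, not from the original README) of one common restriction: in-place writes from inside the vmapped function into a tensor captured from outside are rejected.

>>> out = torch.zeros(3)
>>> def accumulate(x):
>>>     out.add_(x)  # in-place write of a vmapped value into a non-vmapped tensor
>>>     return out
>>> # vmap(accumulate)(torch.randn(3, 3))  # raises a RuntimeError about in-place arithmetic under vmap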


grad

grad(func)(*inputs) assumes func returns a single-element Tensor. It computes the gradients of the output of func with respect to inputs[0].

>>> from functorch import grad
>>> x = torch.randn([])
>>> cos_x = grad(lambda x: torch.sin(x))(x)
>>> assert torch.allclose(cos_x, x.cos())
>>> # Second-order gradients
>>> neg_sin_x = grad(grad(lambda x: torch.sin(x)))(x)
>>> assert torch.allclose(neg_sin_x, -x.sin())

When composed with vmap, grad can be used to compute per-sample-gradients:

>>> from functorch import vmap
>>> batch_size, feature_size = 3, 5
>>> def model(weights, feature_vec):
>>>     # Very simple linear model with activation
>>>     assert feature_vec.dim() == 1
>>>     return, weights).relu()
>>> def compute_loss(weights, example, target):
>>>     y = model(weights, example)
>>>     return ((y - target) ** 2).mean()  # MSELoss
>>> weights = torch.randn(feature_size, requires_grad=True)
>>> examples = torch.randn(batch_size, feature_size)
>>> targets = torch.randn(batch_size)
>>> inputs = (weights, examples, targets)
>>> grad_weight_per_example = vmap(grad(compute_loss), in_dims=(None, 0, 0))(*inputs)

vjp and jacrev

>>> from functorch import vjp
>>> outputs, vjp_fn = vjp(func, inputs); vjps = vjp_fn(*cotangents)

The vjp transform applies func to inputs and returns a new function that computes the vjps of func given some cotangent Tensors.
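
A small concrete example (added here, not from the original README): for torch.sin, the vjp with a cotangent of ones is just the elementwise cosine.

>>> import torch
>>> from functorch import vjp
>>> x = torch.randn(5)
>>> out, vjp_fn = vjp(torch.sin, x)
>>> (grad_x,) = vjp_fn(torch.ones_like(out))  # one vjp per primal input
>>> assert torch.allclose(grad_x, x.cos())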

>>> from functorch import jacrev
>>> x = torch.randn(5)
>>> jacobian = jacrev(torch.sin)(x)
>>> expected = torch.diag(x.cos())
>>> assert torch.allclose(jacobian, expected)

Use jacrev to compute the jacobian. This can be composed with vmap to produce batched jacobians:

>>> x = torch.randn(64, 5)
>>> jacobian = vmap(jacrev(torch.sin))(x)
>>> assert jacobian.shape == (64, 5, 5)

jacrev can be composed with itself to produce hessians:

>>> def f(x):
>>>   return x.sin().sum()
>>> x = torch.randn(5)
>>> hessian = jacrev(jacrev(f))(x)
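
Since f sums sin elementwise, its Hessian is the diagonal matrix diag(-sin(x)). A quick sanity check (added here, not part of the original example):

>>> expected = torch.diag(-x.sin())
>>> assert torch.allclose(hessian, expected)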


Debugging

  • functorch._C.dump_tensor: dumps the dispatch keys on the stack.
  • functorch._C._set_vmap_fallback_warning_enabled(False): call this if the vmap fallback warning spam bothers you.

Future Plans

In the end state, we'd like to upstream this into PyTorch once we iron out the design details. To figure out the details, we need your help -- please send us your use cases by starting a conversation in the issue tracker or try out the prototype.


License

functorch has a BSD-style license, as found in the LICENSE file.

  • ImportError: ~/.local/lib/python3.9/site-packages/functorch/ undefined symbol: _ZNK3c1010TensorImpl16sym_sizes_customEv


    Hi All,

    I was running an older version of PyTorch ( - built from source) with FuncTorch ( - built from source), and somehow I've broken the older version of functorch. When I import functorch I get the following error,

    import functorch
    #returns ImportError: ~/.local/lib/python3.9/site-packages/functorch/ undefined symbol: _ZNK3c1010TensorImpl16sym_sizes_customEv

    The version I had of functorch was 0.2.0a0+9d6ee76, is there a way to perhaps re-install to fix this ImportError? I do have the latest version of PyTorch/FuncTorch in a separate conda environment, but I wanted to check how it compares to the older version in this 'older' conda environment, where PyTorch and FuncTorch were versions 1.12.0a0+git7c2103a and 0.2.0a0+9d6ee76 respectively.

    Is there a way to download a specific version of functorch with ? Or another way to fix this issue?

    opened by AlphaBetaGamma96 24
  • Hessian (w.r.t inputs) calculation in PyTorch differs from FuncTorch


    Hi All,

    I've been trying to calculate the Hessian of the output of my network with respect to its inputs within FuncTorch. I had a version within PyTorch that supports batches; however, the two seem to disagree with each other and I have no idea why they don't give the same results. Something is clearly wrong: I know my PyTorch version is right, so either there's an issue in my version of FuncTorch or I've implemented it wrong in FuncTorch.

    Also, how can I use the has_aux flag in jacrev to return the jacobian from the first jacrev so I don't have to repeat the jacobian calculation?
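
    (A hedged sketch, added here rather than taken from the issue, of one way has_aux can return the inner jacobian alongside the hessian so it isn't computed twice: the function passed to the outer jacrev returns the inner jacobian both as the value to differentiate and as the auxiliary output.)

    import torch
    from functorch import jacrev

    def f(x):
        return x.sin().sum()

    def jac_with_aux(x):
        jac = jacrev(f)(x)
        return jac, jac  # differentiate the first output, pass the second through untouched as aux

    x = torch.randn(5)
    hess, jac = jacrev(jac_with_aux, has_aux=True)(x)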

    The only problem with my example is that it uses torch.linalg.slogdet, and from what I remember FuncTorch can't vmap over .item(). I do have my own fork of pytorch where I edited the backward to remove the .item() call so it works with vmap. It's not the greatest implementation, though, as I just set it to the default nonsingular_case_backward, like so:

    Tensor slogdet_backward(const Tensor& grad_logabsdet,
                            const Tensor& self,
                            const Tensor& signdet, const Tensor& logabsdet) {
      auto singular_case_backward = [&](const Tensor& grad_logabsdet, const Tensor& self) -> Tensor {
        Tensor u, sigma, vh;
        std::tie(u, sigma, vh) = at::linalg_svd(self, false);
        Tensor v = vh.mH();
        // sigma has all non-negative entries (also with at least one zero entry)
        // so logabsdet = \sum log(abs(sigma))
        // but det = 0, so backward logabsdet = \sum log(sigma)
        auto gsigma = grad_logabsdet.unsqueeze(-1).div(sigma);
        return svd_backward({}, gsigma, {}, u, sigma, vh);
      };
      auto nonsingular_case_backward = [&](const Tensor& grad_logabsdet, const Tensor& self) -> Tensor {
        // TODO: replace self.inverse with linalg_inverse
        return unsqueeze_multiple(grad_logabsdet, {-1, -2}, self.dim()) * self.inverse().mH();
      };
      auto nonsingular = nonsingular_case_backward(grad_logabsdet, self);
      return nonsingular;
    }

    My 'minimal' reproducible script is below with the output shown below that. It computes the Laplacian via a PyTorch method and via FuncTorch for a single sample of size [A,1] where A is the number of input nodes to the network.

    import torch
    import torch.nn as nn
    from torch import Tensor
    import functorch
    from functorch import jacrev, jacfwd, hessian, make_functional, vmap
    import time 
    _ = torch.manual_seed(0)
    print("PyTorch version:   ", torch.__version__)
    print("CUDA version:      ", torch.version.cuda)
    print("FuncTorch version: ", functorch.__version__)
    def sync_time() -> float:
      return time.perf_counter()
    B=1 #batch
    A=3 #input nodes
    class model(nn.Module):
      def __init__(self, num_inputs, num_hidden):
        super(model, self).__init__()
        self.func = nn.Tanh()
        self.fc1 = nn.Linear(2, num_hidden)
        self.fc2 = nn.Linear(num_hidden, num_inputs)
      def forward(self, x):
        """Takes x in [B,A,1] and maps it to sign/logabsdet value in Tuple([B,], [B,])"""
        rep=[1 for _ in range(idx)]
        rep[-2] = self.num_inputs
        g = x.mean(dim=(idx-2), keepdim=True).repeat(*rep)
        f =,g), dim=-1)
        h = self.func(self.fc1(f))
        mat = self.fc2(h)
        sgn, logabs = torch.linalg.slogdet(mat)
        return sgn, logabs
    net = model(A, 64)
    net =
    fnet, params = make_functional(net)
    def logabs(params, x):
      _, logabs = fnet(params, x)
      #print("functorch logabs: ",logabs)
      return logabs
    def kinetic_pytorch(xs: Tensor) -> Tensor:
      """Method to calculate the local kinetic energy values of a netork function, f, for samples, x.
      The values calculated here are 1/f d2f/dx2 which is equivalent to d2log(|f|)/dx2 + (dlog(|f|)/dx)^2
      within the log-domain (rather than the linear-domain).
      :param xs: The input positions of the many-body particles
      :type xs: class: `torch.Tensor`
      """
      xis = [xi.requires_grad_() for xi in xs.flatten(start_dim=1).t()]
      xs_flat = torch.stack(xis, dim=1)
      _, ys = net(xs_flat.view_as(xs))
      #print("pytorch logabs: ",ys)
      ones = torch.ones_like(ys)
      #df_dx calculation
      (dy_dxs, ) = torch.autograd.grad(ys, xs_flat, ones, retain_graph=True, create_graph=True)
      #d2f_dx2 calculation (diagonal only)
      lay_ys = sum(torch.autograd.grad(dy_dxi, xi, ones, retain_graph=True, create_graph=False)[0] \
                    for xi, dy_dxi in zip(xis, (dy_dxs[..., i] for i in range(len(xis)))))
      #print("(PyTorch): ",lay_ys, dy_dxs)
      ek_local_per_walker = -0.5 * (lay_ys + dy_dxs.pow(2).sum(-1)) #move const out of loop?
      return ek_local_per_walker
    jacjaclogabs = jacrev(jacrev(logabs, argnums=1), argnums=1)
    jaclogabs = jacrev(logabs, argnums=1)
    def kinetic_functorch(params, x):
      d2f_dx2 = vmap(jacjaclogabs, in_dims=(None, 0))(params, x)
      df_dx = vmap(jaclogabs, in_dims=(None, 0))(params, x)
      #print("(FuncTorch): ", d2f_dx2.squeeze(-3).squeeze(-1).diagonal(-2,-1).sum(-1), df_dx)
      #remove the trailing 1's so it's an A by A matrix 
      return -0.5 * d2f_dx2.squeeze(-3).squeeze(-1).diagonal(-2,-1).sum(-1) + df_dx.squeeze(-1).pow(2).sum(-1)
    x = torch.randn(B,A,1,device=device) #input Tensor 
    print("\nd2f/dx2, df/dx: ")
    kin_pt = kinetic_pytorch(x)
    kin_ft = kinetic_functorch(params, x)
    print("\nWalltime: ")
    print("PyTorch:   ",t2-t1)
    print("FuncTorch: ",t4-t3, "\n")
    print("Results: ")
    print("PyTorch: ",kin_pt)
    print("FuncTorch: ",kin_ft)

    This script returns

    PyTorch version:    1.12.0a0+git7c2103a
    CUDA version:       11.6
    FuncTorch version:  0.2.0a0+9d6ee76
    d2f/dx2, df/dx: 
    Walltime:
    PyTorch:    0.4822753759999614
    FuncTorch:  0.004898710998531897

    Results:
    PyTorch:  tensor([1.3737], device='cuda:0', grad_fn=<MulBackward0>)    # should be the same values
    FuncTorch:  tensor([7.8411], device='cuda:0', grad_fn=<AddBackward0>) # the jacobian matches, but hessian does not

    Thanks for the help in advance! :)

    opened by AlphaBetaGamma96 18
  • Semantic discrepancy on requires_grad after compiling Tensor.detach



    import torch
    from functorch.compile import aot_function
    def fn(x):
        return x.detach()
    aot_fn = aot_function(fn, fw_compiler=lambda fx_module, _: fx_module)
    x = torch.randn(1, requires_grad=True)
    ref = fn(x)
    res = aot_fn(x)
    assert(ref.requires_grad == res.requires_grad)

    PyTorch version: 1.13.0.dev20220929+cu116

    Not sure if this is related to #376.

    opened by sangongs 14
  • add batching rule for block_diag, kill DECOMPOSE_FUNCTIONAL


    Companion core PR:

    The above PR makes block_diag composite compliant, and this PR adds a batching rule for it.

    Those two changes together should let us fully remove the DECOMPOSE_FUNCTIONAL macro, which was preventing me from moving the Functionalize dispatch key below FuncTorchBatched (which I want to do as part of XX, in order to properly get functionalization working with LTC/XLA).

    cla signed 
    opened by bdhirsh 13
  • svd-related op regression in functorch

    A recent upstream change landed and caused svd-related tests in functorch to fail:


    The main problem seems to be that the backward pass uses in-place operations that are incompatible with vmap (aka Composite Compliance problems). There are some other failures that seem to be because some other operations are not Composite Compliant but somehow these weren't a problem previously.

    opened by zou3519 12
  • Installing functorch breaks torchaudio


    I'm following along with this colab from the functorch installation docs.

    After installing and restarting, when I try to import torchaudio, the runtime crashes. At first, I got this error:

    OSError: /usr/local/lib/python3.7/dist-packages/torchaudio/lib/ undefined symbol: _ZN2at4_ops7resize_4callERKNS_6TensorEN3c108ArrayRefIlEENS5_8optionalINS5_12MemoryFormatEEE

    Now, I'm just getting the runtime crashing with no visible error.

    I know functorch was merged into pytorch proper, but I don't see any instructions about how to use it from there. Would that fix the issue? If so, should the main docs be updated?

    opened by dellis23 11
  • functorch doesn't work in debug mode


    It's that autograd assert that we run into often:

    import torch
    from functorch import make_fx
    from functorch.compile import nnc_jit
    def f(x, y):
        return torch.broadcast_tensors(x, y)
    inp1 = torch.rand(())
    inp2 = torch.rand(3)
    print(f(inp1, inp2))  # without nnc compile everything works fine
    print(make_fx(f)(inp1, inp2))  # fails
    print(nnc_jit(f)(inp1, inp2))
    # RuntimeError: self__storage_saved.value().is_alias_of( ASSERT FAILED at "autograd/generated/VariableType_3.cpp":3899, please report a bug to PyTorch.

    cc @albanD @soulitzer what's the chance we can add an option to turn these off? They've been more harmful (e.g. prevent debugging in debug mode) than useful for us.

    opened by zou3519 11
  • Index put vmap internal assert


    import torch
    from functorch import vmap
    self = torch.randn(4, 1, 1).cuda()
    idx = (torch.tensor([0]).cuda(),)
    value = torch.randn(1, 1).cuda()
    def foo(x):
        return x.index_put_(idx, value, accumulate=True)
    vmap(foo)(self)
    # RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == value.numel() INTERNAL ASSERT FAILED at "/raid/rzou/pt/debug-cuda/aten/src/ATen/native/cuda/":249, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor
    opened by zou3519 11
  • Add flake8 pre commit hook script


    PyTorch has pre-commit hooks; the scripts that they call are in tools.

    Really I selfishly just want the flake8 ones so I don't have to remember to run it against my changes each time. We could also get the clang tidy info while we're in there

    opened by samdow 10
  • Batching rule not implemented for aten::item.


    Hey, I would like to use functorch.vmap in a custom PyTorch activation function (the gradients are not needed, because the backward-pass is calculated differently). During the computation of the activation function, I do a lookup in a tensor X using a tensor Y.item() call, similar to the small dummy code below.

    Unfortunately I get the error message: RuntimeError: Batching rule not implemented for aten::item. We could not generate a fallback.

    Is it not possible to do an item() call in a vmap function or is something else wrong? Thanks a lot!

    import torch
    from functorch import vmap
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    sum = torch.zeros([100, 10], dtype=torch.int32).to(device)
    lookup = torch.randint(100, (20, 1000, 10)).to(device)
    input_tensor = torch.randint(1000, (100, 20)).to(device)
    def test_fun(sum, input_tensor):
      for j in range(20):
        for i in range(10):
          sum[i] += lookup[j, input_tensor[j].item(), i]
      return sum
    # non-vectorized version
    for i in range(100):
      test_fun(sum[i], input_tensor[i])
    # vectorized version throws error
    test_fun_vec = vmap(test_fun)
    test_fun_vec(sum, input_tensor)
    opened by hallojs 10
  • torch.atleast_1d batching rule implementation


    Hi functorch devs! I'm filing this issue because my code prints the following warning:

    UserWarning: There is a performance drop because we have not yet implemented the batching rule for aten::atleast_1d. Please file us an issue on GitHub so that we can prioritize its implementation. (Triggered internally at  /tmp/pip-req-build-ytawxmfk/functorch/csrc/BatchedFallback.cpp:106.)

    Why Am I Using atleast_1d ?

    I'm subclassing torch.Tensor because my code needs to add some extra data, named _block_variable, to that class (I'm integrating PyTorch's AD system with another AD system to be able to call torch functions from inside a PDE solve, which is why I also inherit from a class called OverloadedType); e.g. the subclass looks like

    class MyTensor(torch.Tensor, OverloadedType):
        _block_variable = None
        def __new__(cls, x, *args, **kwargs):
            return super().__new__(cls, x, *args, **kwargs)
        def __init__(self, x, block_var=None):
            super(OverloadedType, self).__init__()
            self._block_variable = block_var or BlockVariable(self)
        def to(self, *args, **kwargs):
            new = Tensor([])
            tmp = super(torch.Tensor, self).to(*args, **kwargs)
            new.requires_grad = tmp.requires_grad
            new._block_variable = self._block_variable
            return new
         ... #some subclass-specific methods etc

    This causes problems when I have code that does stuff like torch.tensor([torch.trace(x), torch.trace(x @ x)]) where x is a square MyTensor; the torch.tensor() call raises an exception related to taking the __len__ of a 0-dimensional tensor (the scalar traces). So instead, I do[torch.atleast_1d(torch.trace(x)), torch.atleast_1d(torch.trace(x @ x))]), which works. However, this function is functorch.vmap-ed, which triggers the performance warning. It would be great if I could either get the naive implementation (using torch.tensor instead of to work, or if a batch rule for atleast_1d() were to be implemented.
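
    (A hedged aside, added here rather than taken from the issue: torch.stack accepts 0-dimensional tensors directly, so stacking the scalar traces avoids both the torch.tensor call and atleast_1d.)

    import torch

    x = torch.randn(3, 3)
    # torch.stack accepts 0-d tensors and returns a 1-d result, so atleast_1d is not needed
    traces = torch.stack([torch.trace(x), torch.trace(x @ x)])
    assert traces.shape == (2,)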

    Thank you for any help you can provide!

    opened by DiffeoInvariant 10
  • batching over model parameters


    I have a use-case for functorch. I would like to check possible iterations of model parameters in a very efficient way (I want to eliminate the loop). Here's example code for a simplified case that I got working:

    linear = torch.nn.Linear(10,2)
    default_weight =
    sample_input = torch.rand(3,10)
    sample_add = torch.rand_like(default_weight)
    def interpolate_weights(alpha):
        with torch.no_grad():
            res_weight = torch.nn.Parameter(default_weight + alpha*sample_add)
            linear.weight = res_weight
            return linear(sample_input)

    Now I could do for alpha in torch.linspace(0.0, 1.0, 100) but I want to vectorise this loop since my code is prohibitively slow. Is functorch applicable here? Executing:

    alphas = torch.linspace(0.0, 1.0, 100)

    works, but how to do something similar for a simple resnet does not work. I've tried using load_state_dict but that's not working:

    from torchvision import models
    model_resnet = models.resnet18(pretrained=True)
    named_params = list(model_resnet.named_parameters())
    named_params_data = [(n, for (n,p) in named_params]
    sample_data = torch.rand(10,3,224,244)
    def test_resnet(new_params):
        def interpolate(alpha):
            with torch.no_grad():
                p_dict = {name:(old + alpha*new_params[i]) for i,(name, old) in enumerate(named_params_data)}
                model_resnet.load_state_dict(p_dict, strict=False)
                out = model_resnet(sample_data)
                return out
        return interpolate
    rand_tensor = [torch.rand_like(p) for n,p in named_params_data]
    to_vmap_resnet = test_resnet(rand_tensor)

    results in:

    While copying the parameter named "fc.bias", whose dimensions in the model are torch.Size([1000]) and whose dimensions in the checkpoint are torch.Size([1000]), an exception occurred : ('vmap: inplace arithmetic(self, *extra_args) is not possible because there exists a Tensor `other` in extra_args that has more elements than `self`. This happened due to `other` being vmapped over but `self` not being vmapped over in a vmap. Please try to use out-of-place operators instead of inplace arithmetic. If said operator is being called inside the PyTorch framework, please file a bug report instead.',).
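
    (A hedged sketch, added here rather than taken from the thread, of how the simple linear example above could be vectorised with make_functional, so that vmap maps over alpha without mutating the module in place; a ResNet would additionally need make_functional_with_buffers to handle its buffers.)

    import torch
    from functorch import make_functional, vmap

    linear = torch.nn.Linear(10, 2)
    func_linear, params = make_functional(linear)  # stateless function plus a tuple of parameters
    directions = [torch.rand_like(p) for p in params]
    sample_input = torch.rand(3, 10)

    def interpolate(alpha):
        new_params = [p + alpha * d for p, d in zip(params, directions)]
        return func_linear(new_params, sample_input)

    alphas = torch.linspace(0.0, 1.0, 100)
    outputs = vmap(interpolate)(alphas)  # shape: (100, 3, 2)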

    opened by LeanderK 1
  • Make vmap tests use dtype `any_one`


    In #1069, @kshitij12345 smartly pointed out that it's disturbing that these batch rules aren't caught by test_op_has_batch_rule. From looking at it, the bitwise ops in particular aren't being tested because the only allowed_dtype is torch.float


    1. First, please update both test_vmap and test_op_has_batch_rule to have their allowed_dtypes (in the @ops decorator) be OpDTypes.any_one instead of torch.float32
    2. We expect this to lead to new failures. Please update the corresponding xfail list for the test.
       i. In the case of test_op_has_batch_rule, if the failure looks to occur on an in-place function, please try first to only add it to the inplace_failures list. If this does not work, you can xfail it.
    opened by samdow 0
  • [testing] Insufficient coverage in test suite


    In the functorch test suite, we use sample_inputs to get samples from an OpInfo. The problem is that sample_inputs may or may not cover all the cases/overloads for an operator. I think we should use reference_inputs, which is a superset of sample_inputs and more comprehensive. (Though this will increase the test times.)

    Switching sample_inputs to reference_inputs leads to a bunch of failures for test_op_has_batch_rule, including the ones mentioned in

    Refer to for failures.

    cc: @zou3519

    opened by kshitij12345 3
  • vmap + GRU


    Hi everyone, I was trying to retrieve per-sample gradients following the functorch documentation for a GRU-like model, but I get the following error:

    Traceback (most recent call last):
      File "c:/Users/Ospite/Desktop/temp/funct/examples/", line 51, in <module>
        ft_sample_grads = ft_compute_sample_grad(params, buffers, x, t, hx)
      File "C:\Users\Ospite\Desktop\temp\funct\.venv\lib\site-packages\functorch\_src\", line 362, in wrapped
        return _flat_vmap(
      File "C:\Users\Ospite\Desktop\temp\funct\.venv\lib\site-packages\functorch\_src\", line 35, in fn
        return f(*args, **kwargs)
      File "C:\Users\Ospite\Desktop\temp\funct\.venv\lib\site-packages\functorch\_src\", line 489, in _flat_vmap
        batched_outputs = func(*batched_inputs, **kwargs)
      File "C:\Users\Ospite\Desktop\temp\funct\.venv\lib\site-packages\functorch\_src\", line 1241, in wrapper
        results = grad_and_value(func, argnums, has_aux=has_aux)(*args, **kwargs)
      File "C:\Users\Ospite\Desktop\temp\funct\.venv\lib\site-packages\functorch\_src\", line 35, in fn
        return f(*args, **kwargs)
      File "C:\Users\Ospite\Desktop\temp\funct\.venv\lib\site-packages\functorch\_src\", line 1111, in wrapper
        output = func(*args, **kwargs)
      File "c:/Users/Ospite/Desktop/temp/funct/examples/", line 26, in compute_loss_stateless_model
        prediction = fmodel(params, buffers, sample.unsqueeze(1), state.unsqueeze(1))
      File "C:\Users\Ospite\Desktop\temp\funct\.venv\lib\site-packages\torch\nn\modules\", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
      File "C:\Users\Ospite\Desktop\temp\funct\.venv\lib\site-packages\functorch\_src\", line 282, in forward
        return self.stateless_model(*args, **kwargs)
      File "C:\Users\Ospite\Desktop\temp\funct\.venv\lib\site-packages\torch\nn\modules\", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
      File "c:/Users/Ospite/Desktop/temp/funct/examples/", line 20, in forward
        x, _ = self.recurrent(x, hx)
      File "C:\Users\Ospite\Desktop\temp\funct\.venv\lib\site-packages\torch\nn\modules\", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
      File "C:\Users\Ospite\Desktop\temp\funct\.venv\lib\site-packages\torch\nn\modules\", line 955, in forward
        result = _VF.gru(input, hx, self._flat_weights, self.bias, self.num_layers,
    RuntimeError: Batching rule not implemented for aten::unsafe_split.Tensor. We could not generate a fallback.

    Vanilla RNN works correctly. The code I've used is the following:

    from functools import partial
    from typing import Type, Union
    import torch
    from functorch import grad, make_functional_with_buffers, vmap
    class Recurrent(torch.nn.Module):
        def __init__(
            self,
            recurrent_layer: Union[Type[torch.nn.GRU], Type[torch.nn.RNN]],
            input_size: int,
            hidden_size: int,
            output_size: int,
        ) -> None:
            super().__init__()
            self.recurrent = recurrent_layer(input_size=input_size, hidden_size=hidden_size, batch_first=False)
            self.fc = torch.nn.Linear(hidden_size, output_size)
        def forward(self, x: torch.Tensor, hx: torch.Tensor) -> torch.Tensor:
            x, _ = self.recurrent(x, hx)
            x = self.fc(torch.relu(x))
            return x
    def compute_loss_stateless_model(fmodel, params, buffers, sample, target, state):
        prediction = fmodel(params, buffers, sample.unsqueeze(1), state.unsqueeze(1))
        loss = torch.nn.functional.mse_loss(prediction, target.unsqueeze(1))
        return loss
    if __name__ == "__main__":
        T, B, D, H, O = 128, 64, 64, 256, 1
        x = torch.rand(T, B, D)
        t = torch.ones(T, B, O)
        hx = torch.zeros(1, B, H)
        gru = Recurrent(torch.nn.GRU, D, H, O)
        rnn = Recurrent(torch.nn.RNN, D, H, O)
        # functional RNN + vmap
        frnn, params, buffers = make_functional_with_buffers(rnn)
        ft_compute_grad = grad(partial(compute_loss_stateless_model, frnn))
        ft_compute_sample_grad = vmap(ft_compute_grad, in_dims=(None, None, 1, 1, 1))
        ft_sample_grads = ft_compute_sample_grad(params, buffers, x, t, hx)
        for g in ft_sample_grads:
        # functional GRU + vmap
        fgru, params, buffers = make_functional_with_buffers(gru)
        ft_compute_grad = grad(partial(compute_loss_stateless_model, fgru))
        ft_compute_sample_grad = vmap(ft_compute_grad, in_dims=(None, None, 1, 1, 1))
        ft_sample_grads = ft_compute_sample_grad(params, buffers, x, t, hx)
        for g in ft_sample_grads:

    The collected environment is the following:

    PyTorch version: 1.13.0+cpu
    Is debug build: False
    CUDA used to build PyTorch: Could not collect
    ROCM used to build PyTorch: N/A
    OS: Microsoft Windows 10 Pro
    GCC version: Could not collect
    Clang version: Could not collect
    CMake version: Could not collect
    Libc version: N/A
    Python version: 3.8.10 (tags/v3.8.10:3d8993a, May  3 2021, 11:48:03) [MSC v.1928 64 bit (AMD64)] (64-bit runtime)
    Python platform: Windows-10-10.0.19044-SP0
    Is CUDA available: False
    CUDA runtime version: Could not collect
    GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2070 SUPER
    Nvidia driver version: 516.94
    cuDNN version: Could not collect
    HIP runtime version: N/A
    MIOpen runtime version: N/A
    Is XNNPACK available: True
    Versions of relevant libraries:
    [pip3] functorch==1.13.0
    [pip3] mypy==0.931
    [pip3] mypy-extensions==0.4.3
    [pip3] numpy==1.23.5
    [pip3] pytorch-lightning==1.8.3.post1
    [pip3] torch==1.13.0
    [pip3] torchmetrics==0.11.0
    [conda] Could not collect

    Thank you, Federico

    opened by belerico 0
  • Add vmap support for PyTorch operators


    We're looking for more motivated open-source developers to help build out functorch (and PyTorch, since functorch is now just a part of PyTorch). Below is a selection of good first issues.

    • [x]
    • [ ]
    • [ ]
    • [ ]
    • [ ]
    • [ ]
    • [ ]
    • [ ]
    • [ ]

    In general there's a high barrier to developing PyTorch and/or functorch. We've collected topics and information over at the PyTorch Developer Wiki

    good first issue 
    opened by zou3519 2
  • Audit CompositeImplicitAutograd ops that do not have a batching rule, add them to BatchRulesDecomposition


    This is low hanging fruit: there are a number of CompositeImplicitAutograd ops that do not have a batching rule. We should just add all of them to BatchRulesDecomposition. Should be easy to detect using testing similar to what @srossross did in

    Here are a couple of things to be careful of:

    • If the op has an OpInfo, then we're good (because we have test coverage)
    • If there is no test coverage, then ideally we would add an OpInfo. This is because not all CompositeImplicitAutograd operations are "Composite Compliant"
    • If there is no test coverage, an alternative to adding an OpInfo is to just read the code and eyeball if it is composite compliant or not. We would prefer having an OpInfo to this option.
    actionable high priority 
    opened by zou3519 0