Triton - a language and compiler for writing highly efficient custom Deep-Learning primitives.

Overview

Triton

This is the development repository of Triton, a language and compiler for writing highly efficient custom Deep-Learning primitives. The aim of Triton is to provide an open-source environment to write fast code at higher productivity than CUDA, but also with higher flexibility than other existing DSLs.

The foundations of this project are described in the following MAPL 2019 publication: Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations. Please consider citing this work if you use Triton!

The official documentation contains installation instructions and tutorials.
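
To give a flavor of the programming model, here is a minimal vector-addition kernel, sketched against the v1.x Python API that appears in the issues below (tl.program_id, tl.arange, and masked tl.load/tl.store); treat it as an illustration rather than the canonical tutorial code:

    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def add_kernel(x_ptr, y_ptr, z_ptr, n, BLOCK: tl.constexpr):
        # Each program instance handles one BLOCK-sized chunk of the vectors.
        offsets = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n  # guard the ragged last block
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(z_ptr + offsets, x + y, mask=mask)

    x = torch.randn(4096, device='cuda')
    y = torch.randn(4096, device='cuda')
    z = torch.empty_like(x)
    add_kernel[(triton.cdiv(x.numel(), 1024),)](x, y, z, x.numel(), BLOCK=1024)
    assert torch.allclose(z, x + y)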

Compatibility

Supported Platforms:

  • Linux

Supported Hardware:

  • NVIDIA GPUs (Compute Capability 7.0+)
  • Under development: AMD GPUs, CPUs

Contributing

Community contributions are more than welcome, whether it be to fix bugs or to add new features. Feel free to open GitHub issues about your contribution ideas, and we will review them. A contributor's guide containing general guidelines is coming soon!

Disclaimer

Triton is a fairly recent project, and it is under active development. We expect it to be pretty useful in a wide variety of cases, but don't be surprised if it's a bit rough around the edges :)

Issues
  • Support for Sparse-Dense Matrix Multiplication

    Hi all,

    Is there any support for using GPU tensor cores in Sparse-Dense Matrix Multiplication (SpMM) or Sampled Dense-Dense Matrix Multiplication (SDDMM)?

    Thanks!
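
    (For reference: the block-sparse ops that ship with Triton, exercised in the test code further down this page, are relevant to both patterns; mode='sdd' is effectively an SDDMM, and on fp16 inputs the underlying tl.dot should map to tensor cores. A minimal sketch, assuming the triton.ops.blocksparse API used elsewhere on this page:)

    import torch
    from triton.ops.blocksparse.matmul import matmul as blocksparse_matmul

    block = 64
    # 0/1 mask over (heads, M//block, N//block) tiles: which output blocks to compute
    layout = torch.tril(torch.ones(16, 16, dtype=torch.long))[None]

    a = torch.randn(1, 1, 16 * block, 64, device='cuda', dtype=torch.float16)
    b = torch.randn(1, 1, 16 * block, 64, device='cuda', dtype=torch.float16)

    # 'sdd' materializes only the layout-selected blocks of a @ b^T (SDDMM-like)
    dot = blocksparse_matmul(layout, block, mode='sdd', trans_a=False, trans_b=True)
    c = dot(a, b)  # (batch, num_selected_blocks, block, block)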

    opened by YukeWang96 18
  • triton==1.0.0.dev20210329 no longer installable via pip

    Hi - would it be possible to reinstate triton==1.0.0.dev20210329 on pip, or make it clear how to update to the latest dev branch? The API seems to have changed significantly in the latest nightlies, and some functions in https://github.com/microsoft/DeepSpeed rely on that particular interface.
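
    (Two hedged workarounds, not official guidance: pin the exact dev build if it is still on PyPI, or install a specific commit straight from the repository with pip's VCS syntax; the #subdirectory=python part is an assumption based on where setup.py lives in the source tree.)

    pip install triton==1.0.0.dev20210329

    pip install "git+https://github.com/openai/triton.git@<commit>#subdirectory=python"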

    opened by sdtblck 15
  • Regression for caffe opencl branch.

    The latest isaac code triggers many test failures with Caffe's opencl branch. The last good commit is 6ac5e1f55b1cae5 ("Templates/Reduce1D: now properly loading 2D scalars").

    Since that commit, both "General: Internal code generator overhaul" and "JIT: No longer using fallbacks for stride[0] > 1" introduce some regressions.

    It's easy to build Caffe's opencl branch as below:

    mkdir build
    cd build
    cmake -DUSE_GREENTEA=ON -DUSE_ISAAC=ON ..
    make -j8
    make runtest

    Then you will see many new failures with the above two commits.

    By the way, it's better to use the latest beignet driver as the OpenCL compiler. The good commit works great with beignet.

    @ptillet Could you look at this issue? Thanks.

    bug 
    opened by gongzg 15
  • EXTREMELY SERIOUS BUGS THAT MAKE BLOCKSPARSE COMPLETELY USELESS IN TRAINING

    I found that the blocksparse ops' backward gradients are totally wrong, which makes training meaningless. Take matmul (mode='sdd') as a simple example; the following is my test code. As you can see, the forward result is the same as PyTorch's implementation, while the backward is wrong, with big errors, even when the layout is a dense one.

    I tested an older version, and the problem exists there too. Other operators and other modes also seem to produce wrong gradients. My environment is:

    - python: 3.7
    - pytorch: 1.10.1
    - triton: 1.1.1
    - CUDA: 11.3.1
    - GPU: Nvidia Tesla V100
    
    import torch
    from triton.ops.blocksparse.matmul import matmul as TtMatmul
    from triton.ops.blocksparse.softmax import softmax as TtSoftmax
    
    # ===== Settings =====
    batch = 1
    head = 1
    len1 = 1024
    len2 = 768
    block = 64
    dim = 64
    
    device = 'cuda:0'
    requires_grad = True
    dtype = torch.float32
    same_heads = False  # all heads share the same attention pattern
    dense_layout = False  # use a dense layout
    
    repeat = 1000  # number of test iterations
    raise_size_error = True  # raise an error if the output size does not match
    do_not_use_full_zero_line_layout = True  # skip layouts that have a row or column of all zeros
    
    # --- check input ---
    assert block in (16, 32, 64, 128)
    chunk1 = len1 // block
    chunk2 = len2 // block
    assert chunk1 * block == len1
    assert chunk2 * block == len2
    
    # ===== Basic Functions =====
    def get_triton_matmul(*args, **kwargs):
        return TtMatmul(*args, **kwargs)
    
    def get_triton_softmax(*args, **kwargs):
        return TtSoftmax(*args, **kwargs)
    
    get_matmul = get_triton_matmul
    get_softmax = get_triton_softmax
    
    def layout_full_zero_check(layout):
        row_check = layout.sum(dim=2).eq(0).any()  # (H, L // block)
        col_check = layout.sum(dim=1).eq(0).any()
        return row_check or col_check
    
    def generate_layout(num_heads, num_chunks_1, num_chunks_2, dtype=torch.long, device='cpu'):
        if dense_layout:
            layout = torch.ones(
                1 if same_heads else num_heads, num_chunks_1, num_chunks_2, 
                dtype=dtype, 
                device=device
            )
            num_selected_blocks = layout.numel()
        else:
            while True:
                layout = torch.randint(
                    0, 2, (1 if same_heads else num_heads, num_chunks_1, num_chunks_2), 
                    dtype=dtype, 
                    device=device
                )
                num_selected_blocks = layout.sum().item()
                if num_selected_blocks > 1:
                    if do_not_use_full_zero_line_layout and layout_full_zero_check(layout):
                        continue
                    break
        if same_heads and num_heads > 1:
            layout = layout.expand(num_heads, -1, -1)
        
        return layout, num_selected_blocks
    
    
    # ===== Test SDD Matmul =====
    
    ka = 0
    kb = 0
    for _ in range(repeat):
        layout, num_selected_blocks = generate_layout(head, chunk1, chunk2)
    
        a = torch.rand((batch, head, len1, dim), dtype=dtype, device=device)
        b = torch.rand((batch, head, len2, dim), dtype=dtype, device=device)
    
        a_pytorch = a.clone()
        b_pytorch = b.clone()
    
        if requires_grad:
            for item in (a, b, a_pytorch, b_pytorch):
                item.requires_grad_()
        
        dot = get_matmul(layout, block, mode='sdd', trans_a=False, trans_b=True)
    
        c = dot(a, b)  # (batch, num_selected_blocks, block, block)
    
        try:
            assert c.shape[1] == num_selected_blocks
        except AssertionError:
            print('SIZE ERROR: %d\t%d' % (c.shape[1], num_selected_blocks))
            if raise_size_error:
                raise
    
        c_pytorch = torch.bmm(a_pytorch.view(-1, len1, dim), b_pytorch.view(-1, len2, dim).transpose(-2, -1))  # (batch * head, len1, len2)
        c_pytorch = c_pytorch.view(batch, head, chunk1, block, chunk2, block)
        c_pytorch = c_pytorch.permute(0, 1, 2, 4, 3, 5)
        c_pytorch = c_pytorch.masked_select(layout.bool().to(device)[None, :, :, :, None, None]).view(batch, -1, block, block)  # (batch, num_selected_blocks, block, block)
    
        assert torch.allclose(c, c_pytorch)
    
        sum_c, sum_c_pytorch = c.sum(), c_pytorch.sum()
    
        assert torch.allclose(sum_c, sum_c_pytorch)
    
        if requires_grad:
            sum_c.backward()
            sum_c_pytorch.backward()
    
            try:
                assert torch.allclose(a.grad, a_pytorch.grad)
            except AssertionError:
                ka += 1
                not_same = a.grad != a_pytorch.grad
                print('a.grad ERROR', (a.grad[not_same] - a_pytorch.grad[not_same]).abs().max(), float(not_same.sum() / a.grad.numel()))
            try:
                assert torch.allclose(b.grad, b_pytorch.grad)
            except AssertionError:
                kb += 1
                not_same = b.grad != b_pytorch.grad
                print('b.grad ERROR', (b.grad[not_same] - b_pytorch.grad[not_same]).abs().max(), float(not_same.sum() / b.grad.numel()))
    
    if requires_grad:
        print('mismatch for a.grad:', ka, ka / repeat)
        print('mismatch for b.grad:', kb, kb / repeat)
    
    

    Output from one of my runs:

    a.grad ERROR tensor(1.3109e+09, device='cuda:0') 1.0
    b.grad ERROR tensor(2.5181e+10, device='cuda:0') 1.0
    a.grad ERROR tensor(1.4002e+12, device='cuda:0') 1.0
    b.grad ERROR tensor(2.7797e+13, device='cuda:0') 1.0
    a.grad ERROR tensor(3715.7754, device='cuda:0') 1.0
    b.grad ERROR tensor(5357.4009, device='cuda:0') 1.0
    a.grad ERROR tensor(2650638.7500, device='cuda:0') 1.0
    b.grad ERROR tensor(5492.8169, device='cuda:0') 1.0
    a.grad ERROR tensor(4457.6938, device='cuda:0') 1.0
    b.grad ERROR tensor(5327.3140, device='cuda:0') 1.0
    a.grad ERROR tensor(3303.7292, device='cuda:0') 1.0
    b.grad ERROR tensor(4878.7446, device='cuda:0') 1.0
    a.grad ERROR tensor(4601.9453, device='cuda:0') 1.0
    b.grad ERROR tensor(6078.3545, device='cuda:0') 1.0
    a.grad ERROR tensor(3987.0869, device='cuda:0') 1.0
    b.grad ERROR tensor(5731.9072, device='cuda:0') 1.0
    a.grad ERROR tensor(3546.3162, device='cuda:0') 1.0
    b.grad ERROR tensor(6374.2583, device='cuda:0') 1.0
    a.grad ERROR tensor(2690220.7500, device='cuda:0') 1.0
    b.grad ERROR tensor(5759.3281, device='cuda:0') 1.0
    a.grad ERROR tensor(142.5978, device='cuda:0') 1.0
    b.grad ERROR tensor(11595.9834, device='cuda:0') 1.0
    a.grad ERROR tensor(4336.1611, device='cuda:0') 1.0
    b.grad ERROR tensor(5793.8066, device='cuda:0') 1.0
    a.grad ERROR tensor(131.8495, device='cuda:0') 1.0
    b.grad ERROR tensor(15427.1631, device='cuda:0') 1.0
    a.grad ERROR tensor(152.3687, device='cuda:0') 1.0
    b.grad ERROR tensor(700640.1250, device='cuda:0') 1.0
    a.grad ERROR tensor(3544.4004, device='cuda:0') 1.0
    b.grad ERROR tensor(5818.4619, device='cuda:0') 1.0
    a.grad ERROR tensor(4417.7246, device='cuda:0') 1.0
    b.grad ERROR tensor(5654.2842, device='cuda:0') 1.0
    a.grad ERROR tensor(2918971.7500, device='cuda:0') 1.0
    b.grad ERROR tensor(5756.5093, device='cuda:0') 1.0
    a.grad ERROR tensor(2656234.7500, device='cuda:0') 1.0
    b.grad ERROR tensor(4773.9717, device='cuda:0') 1.0
    a.grad ERROR tensor(2660450.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(6892.2666, device='cuda:0') 1.0
    a.grad ERROR tensor(4277.8979, device='cuda:0') 1.0
    b.grad ERROR tensor(5337.3726, device='cuda:0') 1.0
    a.grad ERROR tensor(2650022.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(1.1324e+08, device='cuda:0') 1.0
    a.grad ERROR tensor(359.3810, device='cuda:0') 1.0
    b.grad ERROR tensor(5316.9575, device='cuda:0') 1.0
    a.grad ERROR tensor(356.5109, device='cuda:0') 1.0
    b.grad ERROR tensor(4588.6304, device='cuda:0') 1.0
    a.grad ERROR tensor(2628183.7500, device='cuda:0') 1.0
    b.grad ERROR tensor(4068.6719, device='cuda:0') 1.0
    a.grad ERROR tensor(289.8997, device='cuda:0') 1.0
    b.grad ERROR tensor(4290.9443, device='cuda:0') 1.0
    a.grad ERROR tensor(256.7503, device='cuda:0') 1.0
    b.grad ERROR tensor(4042.9343, device='cuda:0') 1.0
    a.grad ERROR tensor(2634063.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(351.9893, device='cuda:0') 1.0
    a.grad ERROR tensor(287.7516, device='cuda:0') 1.0
    b.grad ERROR tensor(353.8267, device='cuda:0') 1.0
    a.grad ERROR tensor(2627630.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(386.0250, device='cuda:0') 1.0
    a.grad ERROR tensor(286.2173, device='cuda:0') 1.0
    b.grad ERROR tensor(322.8841, device='cuda:0') 1.0
    a.grad ERROR tensor(2692211.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(321.6712, device='cuda:0') 1.0
    a.grad ERROR tensor(286.0741, device='cuda:0') 1.0
    b.grad ERROR tensor(325.7232, device='cuda:0') 1.0
    a.grad ERROR tensor(2567459.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(354.9262, device='cuda:0') 1.0
    a.grad ERROR tensor(286.2664, device='cuda:0') 1.0
    b.grad ERROR tensor(419.1502, device='cuda:0') 1.0
    a.grad ERROR tensor(2686237.7500, device='cuda:0') 1.0
    b.grad ERROR tensor(389.2686, device='cuda:0') 1.0
    a.grad ERROR tensor(255.2152, device='cuda:0') 1.0
    b.grad ERROR tensor(286.1020, device='cuda:0') 1.0
    a.grad ERROR tensor(2576712.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(358.9259, device='cuda:0') 1.0
    a.grad ERROR tensor(289.7795, device='cuda:0') 1.0
    b.grad ERROR tensor(322.9184, device='cuda:0') 1.0
    a.grad ERROR tensor(2665924.7500, device='cuda:0') 1.0
    b.grad ERROR tensor(384.9941, device='cuda:0') 1.0
    a.grad ERROR tensor(289.0149, device='cuda:0') 1.0
    b.grad ERROR tensor(392.9610, device='cuda:0') 1.0
    a.grad ERROR tensor(2662453.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(357.0535, device='cuda:0') 1.0
    a.grad ERROR tensor(287.9712, device='cuda:0') 1.0
    b.grad ERROR tensor(357.6203, device='cuda:0') 1.0
    a.grad ERROR tensor(2590091., device='cuda:0') 1.0
    b.grad ERROR tensor(355.0422, device='cuda:0') 1.0
    a.grad ERROR tensor(292.4987, device='cuda:0') 1.0
    b.grad ERROR tensor(390.0421, device='cuda:0') 1.0
    a.grad ERROR tensor(2567506.7500, device='cuda:0') 1.0
    b.grad ERROR tensor(360.4383, device='cuda:0') 1.0
    a.grad ERROR tensor(255.9906, device='cuda:0') 1.0
    b.grad ERROR tensor(353.5637, device='cuda:0') 1.0
    a.grad ERROR tensor(2691391.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(354.3253, device='cuda:0') 1.0
    a.grad ERROR tensor(286.5569, device='cuda:0') 1.0
    b.grad ERROR tensor(351.1349, device='cuda:0') 1.0
    a.grad ERROR tensor(2614949.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(353.5500, device='cuda:0') 1.0
    a.grad ERROR tensor(328.6048, device='cuda:0') 1.0
    b.grad ERROR tensor(423.7781, device='cuda:0') 1.0
    a.grad ERROR tensor(313.4121, device='cuda:0') 1.0
    b.grad ERROR tensor(392.1626, device='cuda:0') 1.0
    a.grad ERROR tensor(256.8100, device='cuda:0') 1.0
    b.grad ERROR tensor(320.8966, device='cuda:0') 1.0
    a.grad ERROR tensor(2681082.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(320.8078, device='cuda:0') 1.0
    a.grad ERROR tensor(257.9711, device='cuda:0') 1.0
    b.grad ERROR tensor(355.4615, device='cuda:0') 1.0
    a.grad ERROR tensor(289.8659, device='cuda:0') 1.0
    b.grad ERROR tensor(353.8013, device='cuda:0') 1.0
    a.grad ERROR tensor(284.3614, device='cuda:0') 1.0
    b.grad ERROR tensor(353.3466, device='cuda:0') 1.0
    a.grad ERROR tensor(296.3285, device='cuda:0') 1.0
    b.grad ERROR tensor(359.3662, device='cuda:0') 1.0
    a.grad ERROR tensor(319.2778, device='cuda:0') 1.0
    b.grad ERROR tensor(327.0521, device='cuda:0') 1.0
    a.grad ERROR tensor(291.4463, device='cuda:0') 1.0
    b.grad ERROR tensor(450.9517, device='cuda:0') 1.0
    a.grad ERROR tensor(352.7974, device='cuda:0') 1.0
    b.grad ERROR tensor(388.5782, device='cuda:0') 1.0
    a.grad ERROR tensor(254.4195, device='cuda:0') 1.0
    b.grad ERROR tensor(390.5658, device='cuda:0') 1.0
    a.grad ERROR tensor(256.4656, device='cuda:0') 1.0
    b.grad ERROR tensor(1.1066e+08, device='cuda:0') 1.0
    a.grad ERROR tensor(255.1293, device='cuda:0') 1.0
    b.grad ERROR tensor(353.7872, device='cuda:0') 1.0
    a.grad ERROR tensor(288.5960, device='cuda:0') 1.0
    b.grad ERROR tensor(393.4873, device='cuda:0') 1.0
    a.grad ERROR tensor(256.1300, device='cuda:0') 1.0
    b.grad ERROR tensor(355.0295, device='cuda:0') 1.0
    a.grad ERROR tensor(323.6937, device='cuda:0') 1.0
    b.grad ERROR tensor(365.0252, device='cuda:0') 1.0
    a.grad ERROR tensor(287.0813, device='cuda:0') 1.0
    b.grad ERROR tensor(387.5172, device='cuda:0') 1.0
    a.grad ERROR tensor(253.4359, device='cuda:0') 1.0
    b.grad ERROR tensor(324.5351, device='cuda:0') 1.0
    a.grad ERROR tensor(290.9200, device='cuda:0') 1.0
    b.grad ERROR tensor(321.8120, device='cuda:0') 1.0
    a.grad ERROR tensor(323.4057, device='cuda:0') 1.0
    b.grad ERROR tensor(321.2264, device='cuda:0') 1.0
    a.grad ERROR tensor(326.8586, device='cuda:0') 1.0
    b.grad ERROR tensor(356.5294, device='cuda:0') 1.0
    a.grad ERROR tensor(291.3673, device='cuda:0') 1.0
    b.grad ERROR tensor(402.2349, device='cuda:0') 1.0
    a.grad ERROR tensor(322.2409, device='cuda:0') 1.0
    b.grad ERROR tensor(390.6744, device='cuda:0') 1.0
    a.grad ERROR tensor(317.9997, device='cuda:0') 1.0
    b.grad ERROR tensor(358.8085, device='cuda:0') 1.0
    a.grad ERROR tensor(287.7614, device='cuda:0') 1.0
    b.grad ERROR tensor(423.3942, device='cuda:0') 1.0
    a.grad ERROR tensor(286.4266, device='cuda:0') 1.0
    b.grad ERROR tensor(355.7695, device='cuda:0') 1.0
    a.grad ERROR tensor(322.1042, device='cuda:0') 1.0
    b.grad ERROR tensor(417.8162, device='cuda:0') 1.0
    a.grad ERROR tensor(285.9888, device='cuda:0') 1.0
    b.grad ERROR tensor(324.0033, device='cuda:0') 1.0
    a.grad ERROR tensor(286.2515, device='cuda:0') 1.0
    b.grad ERROR tensor(421.8573, device='cuda:0') 1.0
    a.grad ERROR tensor(318.7636, device='cuda:0') 1.0
    b.grad ERROR tensor(324.4319, device='cuda:0') 1.0
    a.grad ERROR tensor(319.1130, device='cuda:0') 1.0
    b.grad ERROR tensor(358.5825, device='cuda:0') 1.0
    a.grad ERROR tensor(318.4194, device='cuda:0') 1.0
    b.grad ERROR tensor(292.2460, device='cuda:0') 1.0
    a.grad ERROR tensor(288.0099, device='cuda:0') 1.0
    b.grad ERROR tensor(350.2174, device='cuda:0') 1.0
    a.grad ERROR tensor(2751064.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(488.0098, device='cuda:0') 1.0
    a.grad ERROR tensor(224.5474, device='cuda:0') 1.0
    b.grad ERROR tensor(324.3686, device='cuda:0') 1.0
    a.grad ERROR tensor(350.1700, device='cuda:0') 1.0
    b.grad ERROR tensor(425.4680, device='cuda:0') 1.0
    a.grad ERROR tensor(287.2039, device='cuda:0') 1.0
    b.grad ERROR tensor(451.3908, device='cuda:0') 1.0
    a.grad ERROR tensor(254.7051, device='cuda:0') 1.0
    b.grad ERROR tensor(1.0482e+08, device='cuda:0') 1.0
    a.grad ERROR tensor(284.7216, device='cuda:0') 1.0
    b.grad ERROR tensor(326.0213, device='cuda:0') 1.0
    a.grad ERROR tensor(1.2124e+08, device='cuda:0') 1.0
    b.grad ERROR tensor(352.3882, device='cuda:0') 1.0
    a.grad ERROR tensor(253.9305, device='cuda:0') 1.0
    b.grad ERROR tensor(324.9531, device='cuda:0') 1.0
    a.grad ERROR tensor(2662769.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(360.7639, device='cuda:0') 1.0
    a.grad ERROR tensor(289.1925, device='cuda:0') 1.0
    b.grad ERROR tensor(385.9752, device='cuda:0') 1.0
    a.grad ERROR tensor(2861392.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(320.4403, device='cuda:0') 1.0
    a.grad ERROR tensor(2671811.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(388.5952, device='cuda:0') 1.0
    a.grad ERROR tensor(2979740., device='cuda:0') 1.0
    b.grad ERROR tensor(22014.4922, device='cuda:0') 1.0
    a.grad ERROR tensor(2979724.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(2.0401e+08, device='cuda:0') 1.0
    a.grad ERROR tensor(1.4005e+08, device='cuda:0') 1.0
    b.grad ERROR tensor(1.0918e+10, device='cuda:0') 1.0
    a.grad ERROR tensor(111.5738, device='cuda:0') 1.0
    b.grad ERROR tensor(8.8034e+09, device='cuda:0') 1.0
    a.grad ERROR tensor(116.8540, device='cuda:0') 1.0
    b.grad ERROR tensor(13286.1270, device='cuda:0') 1.0
    a.grad ERROR tensor(2574.5845, device='cuda:0') 1.0
    b.grad ERROR tensor(59198.0352, device='cuda:0') 1.0
    a.grad ERROR tensor(2796279.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(290.7633, device='cuda:0') 1.0
    a.grad ERROR tensor(257.6186, device='cuda:0') 1.0
    b.grad ERROR tensor(351.0964, device='cuda:0') 1.0
    a.grad ERROR tensor(254.5203, device='cuda:0') 1.0
    b.grad ERROR tensor(297.9558, device='cuda:0') 1.0
    a.grad ERROR tensor(259.2481, device='cuda:0') 1.0
    b.grad ERROR tensor(270.5669, device='cuda:0') 1.0
    a.grad ERROR tensor(288.9724, device='cuda:0') 1.0
    b.grad ERROR tensor(352.0467, device='cuda:0') 1.0
    a.grad ERROR tensor(257.4395, device='cuda:0') 1.0
    b.grad ERROR tensor(352.4917, device='cuda:0') 1.0
    a.grad ERROR tensor(2630416.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(348.8304, device='cuda:0') 1.0
    a.grad ERROR tensor(2611716.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(351.9631, device='cuda:0') 1.0
    a.grad ERROR tensor(2.4866e+08, device='cuda:0') 1.0
    b.grad ERROR tensor(40511900., device='cuda:0') 1.0
    a.grad ERROR tensor(133.4997, device='cuda:0') 1.0
    b.grad ERROR tensor(322.5428, device='cuda:0') 1.0
    a.grad ERROR tensor(2725685.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(420.1107, device='cuda:0') 1.0
    a.grad ERROR tensor(2781390.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(388.4075, device='cuda:0') 1.0
    a.grad ERROR tensor(257.0772, device='cuda:0') 1.0
    b.grad ERROR tensor(425.4004, device='cuda:0') 1.0
    a.grad ERROR tensor(286.1646, device='cuda:0') 1.0
    b.grad ERROR tensor(354.5847, device='cuda:0') 1.0
    a.grad ERROR tensor(2775809., device='cuda:0') 1.0
    b.grad ERROR tensor(384.3141, device='cuda:0') 1.0
    a.grad ERROR tensor(2784289.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(390.9427, device='cuda:0') 1.0
    a.grad ERROR tensor(1491.5585, device='cuda:0') 1.0
    b.grad ERROR tensor(1.8443e+09, device='cuda:0') 1.0
    a.grad ERROR tensor(2666701.7500, device='cuda:0') 1.0
    b.grad ERROR tensor(1.1275e+08, device='cuda:0') 1.0
    a.grad ERROR tensor(2653713., device='cuda:0') 1.0
    b.grad ERROR tensor(353.5638, device='cuda:0') 1.0
    a.grad ERROR tensor(322.0308, device='cuda:0') 1.0
    b.grad ERROR tensor(357.3578, device='cuda:0') 1.0
    a.grad ERROR tensor(254.9564, device='cuda:0') 1.0
    b.grad ERROR tensor(323.5333, device='cuda:0') 1.0
    a.grad ERROR tensor(333.6547, device='cuda:0') 1.0
    b.grad ERROR tensor(359.9502, device='cuda:0') 1.0
    a.grad ERROR tensor(286.8031, device='cuda:0') 1.0
    b.grad ERROR tensor(389.5343, device='cuda:0') 1.0
    a.grad ERROR tensor(2726983.5000, device='cuda:0') 1.0
    b.grad ERROR tensor(357.1115, device='cuda:0') 1.0
    a.grad ERROR tensor(288.9051, device='cuda:0') 1.0
    b.grad ERROR tensor(357.7245, device='cuda:0') 1.0
    a.grad ERROR tensor(876.8928, device='cuda:0') 1.0
    b.grad ERROR tensor(19977906., device='cuda:0') 1.0
    a.grad ERROR tensor(2686767.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(357.6304, device='cuda:0') 1.0
    a.grad ERROR tensor(2765077.7500, device='cuda:0') 1.0
    b.grad ERROR tensor(360.3122, device='cuda:0') 1.0
    a.grad ERROR tensor(283.9131, device='cuda:0') 1.0
    b.grad ERROR tensor(2.1615e+08, device='cuda:0') 1.0
    a.grad ERROR tensor(2798058.7500, device='cuda:0') 1.0
    b.grad ERROR tensor(19864.1621, device='cuda:0') 1.0
    a.grad ERROR tensor(2597423.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(751808.8125, device='cuda:0') 1.0
    a.grad ERROR tensor(69647880., device='cuda:0') 1.0
    b.grad ERROR tensor(2.6077e+09, device='cuda:0') 1.0
    a.grad ERROR tensor(2.6070e+09, device='cuda:0') 1.0
    b.grad ERROR tensor(1.5997e+08, device='cuda:0') 1.0
    a.grad ERROR tensor(59987.2266, device='cuda:0') 1.0
    b.grad ERROR tensor(6796.1133, device='cuda:0') 1.0
    a.grad ERROR tensor(3446316.2500, device='cuda:0') 1.0
    b.grad ERROR tensor(5.5290e+09, device='cuda:0') 1.0
    a.grad ERROR tensor(254.4641, device='cuda:0') 1.0
    b.grad ERROR tensor(2.0320e+11, device='cuda:0') 1.0
    a.grad ERROR tensor(1.5091e+09, device='cuda:0') 1.0
    b.grad ERROR tensor(4.8775e+10, device='cuda:0') 1.0
    a.grad ERROR tensor(4.9482e+10, device='cuda:0') 1.0
    b.grad ERROR tensor(4.9815e+10, device='cuda:0') 1.0
    a.grad ERROR tensor(288.4737, device='cuda:0') 1.0
    b.grad ERROR tensor(22817.1172, device='cuda:0') 1.0
    Traceback (most recent call last):
      File "test_matmul_sdd.py", line 112, in <module>
        sum_c_pytorch.backward()
      File "/opt/miniconda/lib/python3.7/site-packages/torch/_tensor.py", line 307, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
      File "/opt/miniconda/lib/python3.7/site-packages/torch/autograd/__init__.py", line 156, in backward
        allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
    

    Sometimes a RuntimeError appears, as you can see above; sometimes it doesn't.

    Could you please fix it as soon as possible? It's really serious, and many people use this module unaware of the problem.

    Thank you very much.

    opened by btyu 13
  • IndexError: map::at when calling `_triton.code_gen.add_passes_to_emit_bin`

    (Congratulations on the release! This looks like a very interesting package :smiley:)

    Both when building from source and when installing the wheel on my machine, attempting to run anything always results in IndexError: map::at. For example, when running the vector addition example:

    $ python vector_addition.py 
    /home/gmarkall/miniconda3/envs/triton-wheel/lib/python3.9/site-packages/torch/package/_mock_zipreader.py:17: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at  /pytorch/torch/csrc/utils/tensor_numpy.cpp:67.)
      _dtype_to_storage = {data_type(0).dtype: data_type for data_type in _storages}
    Traceback (most recent call last):
      File "/home/gmarkall/numbadev/triton/tutorials/vector_addition.py", line 50, in <module>
        zb = add(x, y)
      File "/home/gmarkall/numbadev/triton/tutorials/vector_addition.py", line 39, in add
        _add[grid](x, y, z, N, BLOCK=1024)
      File "/home/gmarkall/miniconda3/envs/triton-wheel/lib/python3.9/site-packages/triton/code_gen.py", line 599, in __call__
        return self.kernel(*wargs, **kwargs, grid=self.grid)
      File "/home/gmarkall/miniconda3/envs/triton-wheel/lib/python3.9/site-packages/triton/code_gen.py", line 576, in __call__
        cache[key] = self._compile(
      File "/home/gmarkall/miniconda3/envs/triton-wheel/lib/python3.9/site-packages/triton/code_gen.py", line 550, in _compile
        mod, ker, shared_mem, ir_asm = _triton.code_gen.add_passes_to_emit_bin(generator.module, tt_device, num_warps, num_stages, force_nc_cache)
    IndexError: map::at
    

    Packages installed in the environment are:

    $ conda list
    # packages in environment at /home/gmarkall/miniconda3/envs/triton-wheel:
    #
    # Name                    Version                   Build  Channel
    _libgcc_mutex             0.1                        main  
    _openmp_mutex             4.5                       1_gnu  
    ca-certificates           2021.7.5             h06a4308_1  
    certifi                   2021.5.30        py39h06a4308_0  
    ld_impl_linux-64          2.35.1               h7274673_9  
    libffi                    3.3                  he6710b0_2  
    libgcc-ng                 9.3.0               h5101ec6_17  
    libgomp                   9.3.0               h5101ec6_17  
    libstdcxx-ng              9.3.0               hd4cf53a_17  
    ncurses                   6.2                  he6710b0_1  
    openssl                   1.1.1k               h27cfd23_0  
    pip                       21.1.3           py39h06a4308_0  
    python                    3.9.5                h12debd9_4  
    readline                  8.1                  h27cfd23_0  
    setuptools                52.0.0           py39h06a4308_0  
    sqlite                    3.36.0               hc218d9a_0  
    tk                        8.6.10               hbc83047_0  
    torch                     1.9.0                    pypi_0    pypi
    triton                    1.0.0                    pypi_0    pypi
    typing-extensions         3.10.0.0                 pypi_0    pypi
    tzdata                    2021a                h52ac0ba_0  
    wheel                     0.36.2             pyhd3eb1b0_0  
    xz                        5.2.5                h7b6447c_0  
    zlib                      1.2.11               h7b6447c_3  
    

    I'm guessing that perhaps there's something missing / a wrong version on my system? I couldn't find an exact list of requirements - have I missed something somewhere?

    Many thanks in advance for any help!

    bug 
    opened by gmarkall 13
  • Integrate optimized image based GEMM kernels for Intel

    This PR implements an image-based blocked matrix multiplication kernel for GEMM on Intel platforms. It also uses build-time platform selection for now, as the OpenCL driver doesn't provide detailed device information.

    opened by gongzg 13
  • Fail case when running caffe opencl branch with isaac

    Hi, I am running the caffe opencl branch (https://github.com/BVLC/caffe/tree/opencl) with the isaac master branch on an Intel Broadwell platform with the command ./build/test/test.testbin --gtest_filter=NetTest/2.TestLossWeight, and the test fails. If I comment out lines 94 to 141 of https://github.com/ptillet/isaac/blob/master/lib/runtime/profiles.cpp, the test passes.

    Can you reproduce the failure? There seems to be some problem with the copy operation in the predict_ logic.

    opened by listenlink 13
  • Install triton failure

    Today I wanted to give Triton a try, but got stuck at the first step: installation fails.

    System: Ubuntu 18.04, GPU: RTX 2080 Ti, Python: 3.8

    1. Installed via pip install triton as shown in https://triton-lang.org/getting-started/installation.html#binary-distributions, and it looked like a success. But when running the test code, it raises:

    Traceback (most recent call last):
      File "test.py", line 117, in <module>
        y_triton = softmax(x)
      File "test.py", line 102, in softmax
        BLOCK_SIZE=BLOCK_SIZE,
      File "/home/**/triton/python/triton/code_gen.py", line 676, in __call__
        return self.kernel(*wargs, **kwargs, grid=self.grid)
      File "/home/**/triton/python/triton/code_gen.py", line 647, in __call__
        constants=constants, **meta
      File "/home/**/triton/python/triton/code_gen.py", line 563, in _compile
        name, asm, shared_mem = _triton.code_gen.compile_ttir(backend, generator.module, device, num_warps, num_stages)
    RuntimeError: CUDA: Error- unknown

    2. Install from source as shown in https://triton-lang.org/getting-started/installation.html#from-source, and it fails:

    Requirement already satisfied: torch in /opt/conda/lib/python3.7/site-packages (from triton==1.1.1) (1.9.0)
    Requirement already satisfied: filelock in /opt/conda/lib/python3.7/site-packages (from triton==1.1.1) (3.0.12)
    Requirement already satisfied: typing_extensions in /opt/conda/lib/python3.7/site-packages (from torch->triton==1.1.1) (3.7.4.3)
    Building wheels for collected packages: triton
      Building wheel for triton (setup.py) ... error
      ERROR: Command errored out with exit status 1:
       command: /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-7jocwigo/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-7jocwigo/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-_kj3or_s
       cwd: /tmp/pip-req-build-7jocwigo/
      Complete output (48 lines):
      /opt/conda/lib/python3.7/site-packages/setuptools/dist.py:694: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
        % (opt, underscore_opt))
      running bdist_wheel
      running build
      running build_py
      package init file 'triton/_C/__init__.py' not found (or not a regular file)
      running egg_info
      writing triton.egg-info/PKG-INFO
      writing dependency_links to triton.egg-info/dependency_links.txt
      writing requirements to triton.egg-info/requires.txt
      writing top-level names to triton.egg-info/top_level.txt
      reading manifest file 'triton.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      writing manifest file 'triton.egg-info/SOURCES.txt'
      running build_ext
      CMake Error: The source directory "/tmp" does not appear to contain CMakeLists.txt.
      Specify --help for usage, or press the help button on the CMake GUI.
      Traceback (most recent call last):
        File "<string>", line 1, in <module>
        File "/tmp/pip-req-build-7jocwigo/setup.py", line 144, in <module>
          "Programming Language :: Python :: 3.6",
        File "/opt/conda/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/opt/conda/lib/python3.7/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/opt/conda/lib/python3.7/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.7/site-packages/wheel/bdist_wheel.py", line 299, in run
          self.run_command('build')
        File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.7/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/tmp/pip-req-build-7jocwigo/setup.py", line 77, in run
          self.build_extension(ext)
        File "/tmp/pip-req-build-7jocwigo/setup.py", line 118, in build_extension
          subprocess.check_call(["cmake", self.base_dir] + cmake_args, cwd=self.build_temp, env=env)
        File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['cmake', '/tmp', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/tmp/pip-req-build-7jocwigo/build/lib.linux-x86_64-3.7/triton/_C', '-DBUILD_TUTORIALS=OFF', '-DBUILD_PYTHON_MODULE=ON', '-DLLVM_INCLUDE_DIRS=/tmp/clang+llvm-11.0.1-x86_64-linux-gnu-ubuntu-16.04/include', '-DLLVM_LIBRARY_DIR=/tmp/clang+llvm-11.0.1-x86_64-linux-gnu-ubuntu-16.04/lib', '-DTRITON_LLVM_BUILD_DIR=/tmp/llvm-release', '-DPYTHON_INCLUDE_DIRS=/opt/conda/include/python3.7m;/usr/local/cuda/include', '-DCMAKE_BUILD_TYPE=Release']' returned non-zero exit status 1.
    ERROR: Failed building wheel for triton
    Running setup.py clean for triton
    Failed to build triton
    Installing collected packages: triton
      Running setup.py install for triton ... error
      ERROR: Command errored out with exit status 1:
       command: /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-7jocwigo/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-7jocwigo/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-c7q59hfb/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/include/python3.7m/triton
       cwd: /tmp/pip-req-build-7jocwigo/
      Complete output (71 lines):
      /opt/conda/lib/python3.7/site-packages/setuptools/dist.py:694: UserWarning: Usage of dash-separated 'description-file' will not be supported in future versions. Please use the underscore name 'description_file' instead
        % (opt, underscore_opt))
      running install
      running build
      running build_py
      creating build
      creating build/lib.linux-x86_64-3.7
      creating build/lib.linux-x86_64-3.7/triton
      copying triton/code_gen.py -> build/lib.linux-x86_64-3.7/triton
      copying triton/__init__.py -> build/lib.linux-x86_64-3.7/triton
      copying triton/testing.py -> build/lib.linux-x86_64-3.7/triton
      package init file 'triton/_C/__init__.py' not found (or not a regular file)
      creating build/lib.linux-x86_64-3.7/triton/language
      copying triton/language/__init__.py -> build/lib.linux-x86_64-3.7/triton/language
      copying triton/language/core.py -> build/lib.linux-x86_64-3.7/triton/language
      copying triton/language/random.py -> build/lib.linux-x86_64-3.7/triton/language
      creating build/lib.linux-x86_64-3.7/triton/tools
      copying triton/tools/disasm.py -> build/lib.linux-x86_64-3.7/triton/tools
      copying triton/tools/__init__.py -> build/lib.linux-x86_64-3.7/triton/tools
      creating build/lib.linux-x86_64-3.7/triton/ops
      copying triton/ops/matmul.py -> build/lib.linux-x86_64-3.7/triton/ops
      copying triton/ops/__init__.py -> build/lib.linux-x86_64-3.7/triton/ops
      copying triton/ops/cross_entropy.py -> build/lib.linux-x86_64-3.7/triton/ops
      creating build/lib.linux-x86_64-3.7/triton/ops/blocksparse
      copying triton/ops/blocksparse/matmul.py -> build/lib.linux-x86_64-3.7/triton/ops/blocksparse
      copying triton/ops/blocksparse/__init__.py -> build/lib.linux-x86_64-3.7/triton/ops/blocksparse
      copying triton/ops/blocksparse/softmax.py -> build/lib.linux-x86_64-3.7/triton/ops/blocksparse
      running egg_info
      writing triton.egg-info/PKG-INFO
      writing dependency_links to triton.egg-info/dependency_links.txt
      writing requirements to triton.egg-info/requires.txt
      writing top-level names to triton.egg-info/top_level.txt
      reading manifest file 'triton.egg-info/SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      writing manifest file 'triton.egg-info/SOURCES.txt'
      running build_ext
      CMake Error: The source directory "/tmp" does not appear to contain CMakeLists.txt.
      Specify --help for usage, or press the help button on the CMake GUI.
      Traceback (most recent call last):
        File "<string>", line 1, in <module>
        File "/tmp/pip-req-build-7jocwigo/setup.py", line 144, in <module>
          "Programming Language :: Python :: 3.6",
        File "/opt/conda/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
          return distutils.core.setup(**attrs)
        File "/opt/conda/lib/python3.7/distutils/core.py", line 148, in setup
          dist.run_commands()
        File "/opt/conda/lib/python3.7/distutils/dist.py", line 966, in run_commands
          self.run_command(cmd)
        File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.7/site-packages/setuptools/command/install.py", line 61, in run
          return orig.install.run(self)
        File "/opt/conda/lib/python3.7/distutils/command/install.py", line 545, in run
          self.run_command('build')
        File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/opt/conda/lib/python3.7/distutils/command/build.py", line 135, in run
          self.run_command(cmd_name)
        File "/opt/conda/lib/python3.7/distutils/cmd.py", line 313, in run_command
          self.distribution.run_command(command)
        File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
          cmd_obj.run()
        File "/tmp/pip-req-build-7jocwigo/setup.py", line 77, in run
          self.build_extension(ext)
        File "/tmp/pip-req-build-7jocwigo/setup.py", line 118, in build_extension
          subprocess.check_call(["cmake", self.base_dir] + cmake_args, cwd=self.build_temp, env=env)
        File "/opt/conda/lib/python3.7/subprocess.py", line 363, in check_call
          raise CalledProcessError(retcode, cmd)
      subprocess.CalledProcessError: Command '['cmake', '/tmp', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/tmp/pip-req-build-7jocwigo/build/lib.linux-x86_64-3.7/triton/_C', '-DBUILD_TUTORIALS=OFF', '-DBUILD_PYTHON_MODULE=ON', '-DLLVM_INCLUDE_DIRS=/tmp/clang+llvm-11.0.1-x86_64-linux-gnu-ubuntu-16.04/include', '-DLLVM_LIBRARY_DIR=/tmp/clang+llvm-11.0.1-x86_64-linux-gnu-ubuntu-16.04/lib', '-DTRITON_LLVM_BUILD_DIR=/tmp/llvm-release', '-DPYTHON_INCLUDE_DIRS=/opt/conda/include/python3.7m;/usr/local/cuda/include', '-DCMAKE_BUILD_TYPE=Release']' returned non-zero exit status 1.
      ----------------------------------------
    ERROR: Command errored out with exit status 1: /opt/conda/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-7jocwigo/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-7jocwigo/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-c7q59hfb/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/include/python3.7m/triton Check the logs for full command output.

    Does anyone have any idea how to solve this? Thanks a lot!
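
    (One observation, not a confirmed diagnosis: in the log above, cmake is being invoked on /tmp rather than on a Triton checkout. A from-source install that sidesteps pip's temp directory, sketched from the repository layout:)

    git clone https://github.com/openai/triton.git
    cd triton/python
    pip install cmake   # build-time dependency
    pip install -e .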

    opened by Tengxu-Sun 10
  • question on blocksparse.

    Hi, thanks for the open-sourced contribution. I see there is a block-sparse implementation in the tests. Is this the block-sparse attention mechanism specific to transformers, or is it general SpMM?

    Thanks

    opened by Young768 10
  • Unable to install via pip

    Hi all,

    I tried to install versions 0.3.0 and 0.2.0 and got some build errors. Any ideas what might be going on? The error points me to a CMake log, but that log doesn't exist on my computer.

    For what it's worth, I have CUDA 11.1 and Python 3.9.1.

    Thanks!

         
          -- Configuring incomplete, errors occurred!
          See also "/tmp/pip-req-build-ll5zoboi/build/temp.linux-x86_64-3.9/CMakeFiles/CMakeOutput.log".
          Traceback (most recent call last):
            File "<string>", line 1, in <module>
            File "/tmp/pip-req-build-ll5zoboi/setup.py", line 106, in <module>
              setup(
            File "/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/setuptools/__init__.py", line 163, in setup
              return distutils.core.setup(**attrs)
            File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/core.py", line 148, in setup
              dist.run_commands()
            File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 966, in run_commands
              self.run_command(cmd)
            File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 985, in run_command
              cmd_obj.run()
            File "/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/setuptools/command/install.py", line 61, in run
              return orig.install.run(self)
            File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/command/install.py", line 546, in run
              self.run_command('build')
            File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/cmd.py", line 313, in run_command
              self.distribution.run_command(command)
            File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 985, in run_command
              cmd_obj.run()
            File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/command/build.py", line 135, in run
              self.run_command(cmd_name)
            File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/cmd.py", line 313, in run_command
              self.distribution.run_command(command)
            File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/distutils/dist.py", line 985, in run_command
              cmd_obj.run()
            File "/tmp/pip-req-build-ll5zoboi/setup.py", line 55, in run
              self.build_extension(ext)
            File "/tmp/pip-req-build-ll5zoboi/setup.py", line 93, in build_extension
              subprocess.check_call(['cmake', sourcedir] + cmake_args, cwd=self.build_temp, env=env)
            File "/home/fishy/.pyenv/versions/3.9.1/lib/python3.9/subprocess.py", line 373, in check_call
              raise CalledProcessError(retcode, cmd)
          subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-req-build-ll5zoboi/src', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY=/tmp/pip-req-build-ll5zoboi/build/lib.linux-x86_64-3.9/triton/_C', '-DBUILD_TESTS=OFF', '-DBUILD_PYTHON_MODULE=ON', '-DPYTHON_INCLUDE_DIRS=/home/fishy/.pyenv/versions/3.9.1/include/python3.9;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/include;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/include/torch/csrc/api/include;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/include/TH;/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/include/THC;/usr/local/cuda-11.1/include', '-DPYTHON_LINK_DIRS=/home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/lib/python3.9/site-packages/torch/lib;/usr/local/cuda-11.1/lib64', '-DTORCH_LIBRARIES=c10;c10_cuda;torch;torch_cuda;torch_cpu;torch_python;triton', '-DLLVM_CONFIG=/usr/bin/llvm-config', '-DCMAKE_BUILD_TYPE=Release']' returned non-zero exit status 1.
          ----------------------------------------
      ERROR: Command errored out with exit status 1: /home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-ll5zoboi/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-ll5zoboi/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-8fxaw_pr/install-record.txt --single-version-externally-managed --compile --install-headers /home/fishy/.cache/pypoetry/virtualenvs/limp-MNsrVWB5-py3.9/include/site/python3.9/triton Check the logs for full command output.
    
    opened by fishbotics 9
  • Optimized GEMM & GEMV for Intel platforms

    This PR implements optimized GEMV and GEMM kernels for Intel Gen graphics. For the GEMM function, we force the profile to always choose the image-based GEMM implementation, as we found that for real workloads the image-based kernels always get better performance. If we use the default tuning mechanism, isaac may choose different implementations which are much slower.

    opened by gongzg 9
  • adjust heuristics for dwdb kernel in layer_norm tutorial

    This improves layer-norm backward performance in the tutorial for smaller (more common) sizes, from

    layer-norm:
              N      Triton       Torch
    0    1024.0  140.434291  323.368435
    1    1536.0  165.309420  368.640009
    2    2048.0  207.392405  402.885254
    3    2560.0  244.780886  435.744691
    4    3072.0  284.664100  460.800012
    5    3584.0  322.157308  407.658756
    6    4096.0  357.469093  406.214885
    7    4608.0  380.041232  408.088565
    8    5120.0  410.969891  412.348995
    9    5632.0  438.857160  423.724120
    10   6144.0  463.698105  427.408686
    11   6656.0  488.513781  432.910585
    12   7168.0  512.000004  406.695045
    13   7680.0  534.260858  406.887430
    14   8192.0  556.963156  429.275114
    15   8704.0  566.113842  419.469872
    16   9216.0  583.598937  432.000001
    17   9728.0  600.185100  435.582084
    18  10240.0  617.487431  423.724143
    19  10752.0  629.385354  425.821771
    20  11264.0  640.606617  428.424741
    21  11776.0  655.740150  430.173518
    22  12288.0  659.758391  435.615969
    23  12800.0  670.742365  438.231109
    24  13312.0  672.606304  442.504139
    25  13824.0  667.557343  442.367996
    26  14336.0  646.736871  437.740464
    27  14848.0  643.234625  441.576222
    28  15360.0  638.890784  436.777239
    29  15872.0  638.070346  442.425098
    

    to

    layer-norm:
              N      Triton       Torch
    0    1024.0  176.805760  323.368435
    1    1536.0  215.578939  368.640009
    2    2048.0  267.130429  406.214885
    3    2560.0  310.303039  429.650357
    4    3072.0  356.173905  460.800012
    5    3584.0  398.222222  405.735851
    6    4096.0  427.408686  402.885254
    7    4608.0  451.395913  405.098883
    8    5120.0  481.882369  410.969891
    9    5632.0  502.483251  423.724120
    10   6144.0  526.628557  429.900884
    11   6656.0  548.948446  432.910585
    12   7168.0  567.762361  407.658756
    13   7680.0  581.451107  403.326051
    14   8192.0  604.947691  429.275114
    15   8704.0  600.275860  419.469872
    16   9216.0  616.111424  432.000001
    17   9728.0  627.612887  435.582084
    18  10240.0  643.350788  430.402815
    19  10752.0  630.924205  427.940303
    20  11264.0  643.657148  427.746848
    21  11776.0  657.265138  429.519754
    22  12288.0  665.715580  437.554890
    23  12800.0  672.210084  436.984354
    24  13312.0  674.025316  444.350485
    25  13824.0  677.093870  444.144588
    26  14336.0  630.153818  437.184246
    27  14848.0  629.597170  440.484554
    28  15360.0  631.232852  436.260369
    29  15872.0  621.415976  442.425098
    

    but it's still much worse (roughly 3x) than the PyTorch and nvFuser kernels for dwdb.
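
    (For readers unfamiliar with how such kernel heuristics are expressed, here is a generic sketch using triton.autotune; it is not the actual diff of this PR, and the _dwdb_like kernel and the particular configs are illustrative only.)

    import torch
    import triton
    import triton.language as tl

    @triton.autotune(
        configs=[
            # candidate tilings; the autotuner times each one per problem size
            triton.Config({'BLOCK_N': 128}, num_warps=4),
            triton.Config({'BLOCK_N': 1024}, num_warps=8),
        ],
        key=['N'],  # re-tune whenever N changes
    )
    @triton.jit
    def _dwdb_like(out_ptr, N, BLOCK_N: tl.constexpr):
        cols = tl.program_id(0) * BLOCK_N + tl.arange(0, BLOCK_N)
        mask = cols < N
        # stand-in body: a real dwdb kernel would accumulate partial dw/db here
        tl.store(out_ptr + cols, tl.zeros([BLOCK_N], dtype=tl.float32), mask=mask)

    N = 4000
    out = torch.empty(N, device='cuda')
    grid = lambda meta: (triton.cdiv(N, meta['BLOCK_N']),)
    _dwdb_like[grid](out, N)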

    opened by ngimel 0
  • Correctness issue with `tl.atomic_add`

    I mentioned this on Slack, and https://github.com/openai/triton/pull/556 is a partial fix, but I wanted to create an issue here so I have something to link to from workarounds.

    import torch
    import triton
    import triton.language as tl
    
    
    @triton.jit
    def kernel1(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK: tl.constexpr):
        xoffset = tl.program_id(0) * XBLOCK
        xindex = xoffset + tl.reshape(tl.arange(0, XBLOCK), [XBLOCK])
        xmask = xindex < xnumel
        tmp0 = tl.load(in_ptr0 + xindex, xmask)
        tmp1 = tl.load(in_ptr1 + xindex, xmask)
    
        tl.atomic_add(out_ptr0 + tmp0, tmp1, xmask)
    
        # This works only for the randperm() case, but does not handle duplicates:
        # tl.store(out_ptr0 + tmp0, tl.load(out_ptr0 + tmp0, xmask)+tmp1, xmask)
    
    
    def call(dtype, n, bs, duplicates):
        values = torch.randn(n, device="cuda", dtype=dtype)
    
        if duplicates:
            # harder case (possible duplicates)
            indices = torch.randint(0, n, size=[n], device="cuda", dtype=torch.int64)
        else:
            # simple case (no duplicates)
            indices = torch.randperm(n, device="cuda", dtype=torch.int64)
    
        self = torch.randn(n, device="cuda", dtype=dtype)
        self2 = self.clone()
    
        kernel1[(triton.cdiv(n, bs),)](indices, values, self, n, bs)
        torch.index_put_(self2, [indices], values, accumulate=True)
    
        if torch.allclose(self, self2):
            print(f"{dtype} n={n} block={bs:4} dups={str(duplicates):5} CORRECT")
        else:
            print(f"{dtype} n={n} block={bs:4} dups={str(duplicates):5} INCORRECT")
    
    
    call(torch.float32, 1024, 1024, False)
    call(torch.float32, 1024, 1024, True)
    call(torch.float32, 1024, 64, False)
    call(torch.float32, 1024, 64, True)
    call(torch.float16, 1024, 1024, False)
    call(torch.float16, 1024, 1024, True)
    call(torch.float16, 1024, 64, False)
    call(torch.float16, 1024, 64, True)
    

    Output:

    torch.float32 n=1024 block=1024 dups=False CORRECT
    torch.float32 n=1024 block=1024 dups=True  CORRECT
    torch.float32 n=1024 block=  64 dups=False INCORRECT
    torch.float32 n=1024 block=  64 dups=True  INCORRECT
    torch.float16 n=1024 block=1024 dups=False INCORRECT
    torch.float16 n=1024 block=1024 dups=True  INCORRECT
    torch.float16 n=1024 block=  64 dups=False INCORRECT
    torch.float16 n=1024 block=  64 dups=True  INCORRECT
    
    opened by jansel 0
  • Consider changing the name of the default branch

    Many communities, both on GitHub and outside it, are considering changing the default branch name from master. The most common alternative is main, currently used by default for all new projects on GitHub. Moreover, the git project itself is encouraging the change of default branch names.

    opened by z-a-f 0
  • Adding explicit version check in the github actions

    According to the documentation, the supported Python (PyPy) versions range from 3.6 to 3.9. However, some recent changes use features that were introduced in later versions of Python. Specifically, f-strings for self-documenting expressions were introduced in 3.8.

    This PR only removes the new f-strings and does not check for all compatibility issues; there might be other Python-version problems that it misses.
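
    (Concretely, the 3.8-only feature in question, shown as a generic illustration rather than the exact line from code_gen.py:)

    value = 42
    print(f"{value=}")       # self-documenting f-string, Python 3.8+ only -> value=42
    print(f"value={value}")  # equivalent spelling that also works on 3.6 and 3.7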

    Current changes:

    • [X] Add python-version to the github actions
    • [X] Change the f-string in the code_gen.py to support python <3.8
    opened by z-a-f 1
Owner
OpenAI