a software library containing BLAS functions written in OpenCL

Related tags

Numerical clBLAS
Overview

Build Status

Build branch master develop
GCC/Clang x64 Build Status Build Status
Visual Studio x64 Build status Build status

clBLAS

This repository houses the code for the OpenCL™ BLAS portion of clMath. The complete set of BLAS level 1, 2 & 3 routines is implemented. Please see Netlib BLAS for the list of supported routines. In addition to GPU devices, the library also supports running on CPU devices to facilitate debugging and multicore programming. APPML 1.12 is the most current generally available pre-packaged binary version of the library available for download for both Linux and Windows platforms.

The primary goal of clBLAS is to make it easier for developers to utilize the inherent performance and power efficiency benefits of heterogeneous computing. clBLAS interfaces do not hide nor wrap OpenCL interfaces, but rather leaves OpenCL state management to the control of the user to allow for maximum performance and flexibility. The clBLAS library does generate and enqueue optimized OpenCL kernels, relieving the user from the task of writing, optimizing and maintaining kernel code themselves.

clBLAS update notes 01/2017

  • v2.12 is a bugfix release as a rollup of all fixes in /develop branch
    • Thanks to @pavanky, @iotamudelta, @shahsan10, @psyhtest, @haahh, @hughperkins, @tfauck @abhiShandy, @IvanVergiliev, @zougloub, @mgates3 for contributions to clBLAS v2.12
  • Summary of fixes available to read on the releases tab

clBLAS library user documentation

Library and API documentation for developers is available online as a GitHub Pages website

Google Groups

Two mailing lists have been created for the clMath projects:

  • [email protected] - group whose focus is to answer questions on using the library or reporting issues

  • [email protected] - group whose focus is for developers interested in contributing to the library code itself

clBLAS Wiki

The project wiki contains helpful documentation, including a build primer

Contributing code

Please refer to and read the Contributing document for guidelines on how to contribute code to this open source project. The code in the /master branch is considered to be stable, and all pull-requests should be made against the /develop branch.

License

The source for clBLAS is licensed under the Apache License, Version 2.0

Example

The simple example below shows how to use clBLAS to compute an OpenCL accelerated SGEMM

    #include <sys/types.h>
    #include <stdio.h>

    /* Include the clBLAS header. It includes the appropriate OpenCL headers */
    #include <clBLAS.h>

    /* This example uses predefined matrices and their characteristics for
     * simplicity purpose.
    */

    #define M  4
    #define N  3
    #define K  5

    static const cl_float alpha = 10;

    static const cl_float A[M*K] = {
    11, 12, 13, 14, 15,
    21, 22, 23, 24, 25,
    31, 32, 33, 34, 35,
    41, 42, 43, 44, 45,
    };
    static const size_t lda = K;        /* i.e. lda = K */

    static const cl_float B[K*N] = {
    11, 12, 13,
    21, 22, 23,
    31, 32, 33,
    41, 42, 43,
    51, 52, 53,
    };
    static const size_t ldb = N;        /* i.e. ldb = N */

    static const cl_float beta = 20;

    static cl_float C[M*N] = {
        11, 12, 13,
        21, 22, 23,
        31, 32, 33,
        41, 42, 43,
    };
    static const size_t ldc = N;        /* i.e. ldc = N */

    static cl_float result[M*N];

    int main( void )
    {
    cl_int err;
    cl_platform_id platform = 0;
    cl_device_id device = 0;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx = 0;
    cl_command_queue queue = 0;
    cl_mem bufA, bufB, bufC;
    cl_event event = NULL;
    int ret = 0;

    /* Setup OpenCL environment. */
    err = clGetPlatformIDs( 1, &platform, NULL );
    err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL );

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
    queue = clCreateCommandQueue( ctx, device, 0, &err );

    /* Setup clBLAS */
    err = clblasSetup( );

    /* Prepare OpenCL memory objects and place matrices inside them. */
    bufA = clCreateBuffer( ctx, CL_MEM_READ_ONLY, M * K * sizeof(*A),
                          NULL, &err );
    bufB = clCreateBuffer( ctx, CL_MEM_READ_ONLY, K * N * sizeof(*B),
                          NULL, &err );
    bufC = clCreateBuffer( ctx, CL_MEM_READ_WRITE, M * N * sizeof(*C),
                          NULL, &err );

    err = clEnqueueWriteBuffer( queue, bufA, CL_TRUE, 0,
        M * K * sizeof( *A ), A, 0, NULL, NULL );
    err = clEnqueueWriteBuffer( queue, bufB, CL_TRUE, 0,
        K * N * sizeof( *B ), B, 0, NULL, NULL );
    err = clEnqueueWriteBuffer( queue, bufC, CL_TRUE, 0,
        M * N * sizeof( *C ), C, 0, NULL, NULL );

        /* Call clBLAS extended function. Perform gemm for the lower right sub-matrices */
        err = clblasSgemm( clblasRowMajor, clblasNoTrans, clblasNoTrans,
                                M, N, K,
                                alpha, bufA, 0, lda,
                                bufB, 0, ldb, beta,
                                bufC, 0, ldc,
                                1, &queue, 0, NULL, &event );

    /* Wait for calculations to be finished. */
    err = clWaitForEvents( 1, &event );

    /* Fetch results of calculations from GPU memory. */
    err = clEnqueueReadBuffer( queue, bufC, CL_TRUE, 0,
                                M * N * sizeof(*result),
                                result, 0, NULL, NULL );

    /* Release OpenCL memory objects. */
    clReleaseMemObject( bufC );
    clReleaseMemObject( bufB );
    clReleaseMemObject( bufA );

    /* Finalize work with clBLAS */
    clblasTeardown( );

    /* Release OpenCL working objects. */
    clReleaseCommandQueue( queue );
    clReleaseContext( ctx );

    return ret;
    }

Build dependencies

Library for Windows

  • Windows® 7/8
  • Visual Studio 2010 SP1, 2012
  • An OpenCL SDK, such as APP SDK 2.8
  • Latest CMake

Library for Linux

  • GCC 4.6 and onwards
  • An OpenCL SDK, such as APP SDK 2.9
  • Latest CMake

Library for Mac OSX

  • Recommended to generate Unix makefiles with cmake

Test infrastructure

  • Googletest v1.6
  • Latest Boost
  • CPU BLAS
  • Netlib CBLAS (recommended) Ubuntu: install by "apt-get install libblas-dev" Windows: download & install lapack-3.6.0 which comes with CBLAS
  • or ACML on windows/linux; Accelerate on Mac OSX

Performance infrastructure

  • Python
Comments
  • Reopeinig: thread-safety fail: clblasDgemm assertion fail within openmp-environment using multiple GPUs

    Reopeinig: thread-safety fail: clblasDgemm assertion fail within openmp-environment using multiple GPUs

    The code follows a multithreaded program calling 2 gemm in parallel. The code has been made thread safe

    An example using clBLAS-2.11.0-Linux-x64 (package from master) ./sgemm2 2 10000 10000 10000 10 0 ----------> get time 4.755824e+01 sec<------ 2 GFLOPS 420.537027

    the code on the release: using directly the Fiji Release clBLAS-2.10.0-Fiji-Linux-x64-CL2.0 OpenCL error -45 on line 281 of /home/fpadmin/Timmy/clBLAS2-10/clBLAS/src/library/blas/xgemm.cc Segmentation fault (core dumped)

    but considering the previous performance I do not think is a fix With the old /opt/clAmdBlas-1.10.321 ----------> get time 8.035512e+00 sec<------ 2 GFLOPS 2,488.951544

    opened by paolodalberto 33
  • Sgemm crashes on Intel Platform when transA is false and transB is true

    Sgemm crashes on Intel Platform when transA is false and transB is true

    Using Sgemm of clBlas version 2.3. with transA = false and transB = true results in clblasBuildProgramFailure on my machine when using Intel platform 4.5.0.8. On NVidia/AMD platform I do not get this error. Can anyone confirm this?

    bug Windows OSX Intel Apple CPU GPU 
    opened by TillAlex 28
  • test-functional fails xgemm.cc

    test-functional fails xgemm.cc

    I compiled clBLAS cloned from the git (c2.8) with no bigger problems, but any kind of tests fail, for example

    $ /opt/clBLAS/bin/test-functional 
    
    
    Initialize OpenCL and clblas...
    ---- Advanced Micro Devices, Inc.
    SetUp: about to create command queues
    [==========] Running 715 tests from 5 test cases.
    [----------] Global test environment set-up.
    [----------] 203 tests from ERROR
    [ RUN      ] ERROR.InvalidCommandQueue
    OpenCL error -36 on line 350 of /home/marcin/Downloads/clBLAS/src/library/blas/xgemm.cc
    test-functional: /home/marcin/Downloads/clBLAS/src/library/blas/xgemm.cc:350: clblasStatus clblasGemm(clblasOrder, clblasTranspose, clblasTranspose, size_t, size_t, size_t, Precision, cl_mem, size_t, size_t, cl_mem, size_t, size_t, Precision, cl_mem, size_t, size_t, cl_uint, _cl_command_queue**, cl_uint, _cl_event* const*, _cl_event**) [with Precision = float; clblasStatus = clblasStatus_; clblasOrder = clblasOrder_; clblasTranspose = clblasTranspose_; size_t = long unsigned int; cl_mem = _cl_mem*; cl_uint = unsigned int; cl_command_queue = _cl_command_queue*; cl_event = _cl_event*]: Assertion `false' failed.
    Aborted (core dumped)
    

    I run it on lubuntu (upgraded straight after installation to kernel 4.2.0-22) , with Radeon R9 290. First I installed divers from AMD's website (15.12), then AMDSDK-3.0, and then ACML 5.3.1 with an update 6.1.0.31. At the moment clRNG and clFFT compile and run without problems. I also tried compiling with gcc/g++ 4.7 and 4.8, or using fglrx-update, but with the same result.

    I used gcc 5.2.1 g++ 5.2.1

    uname -am

    Linux thesun 4.2.0-22-generic #27-Ubuntu SMP Thu Dec 17 22:57:08 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
    

    clinfo

    Number of platforms:                 1
      Platform Profile:              FULL_PROFILE
      Platform Version:              OpenCL 2.0 AMD-APP (1912.5)
      Platform Name:                 AMD Accelerated Parallel Processing
      Platform Vendor:               Advanced Micro Devices, Inc.
      Platform Extensions:               cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 
    

    dpkg -l fglrx fglrx-core fglrx-dev fglrx-amdcccle

    ||/ Name           Version      Architecture Description
    +++-==============-============-============-=================================
    ii  fglrx          2:15.302-0ub amd64        Video driver for the AMD graphics
    ii  fglrx-amdcccle 2:15.302-0ub amd64        Catalyst Control Center for the A
    ii  fglrx-core     2:15.302-0ub amd64        Minimal video driver for the AMD 
    

    fglrxinfo

    display: :0  screen: 0
    OpenGL vendor string: Advanced Micro Devices, Inc.
    OpenGL renderer string: AMD Radeon R9 200 Series
    OpenGL version string: 4.5.13416 Compatibility Profile Context 15.302
    

    cmake - mCMakeCache.txt

    cmake ../src -DOPENCL_VERSION:STRING=2.0 -DACML_INCLUDE_DIRS:PATH=/opt/acml5.3.1/gfortran64_mp/include -DACML_LIBRARIES:FILEPATH=/opt/acml5.3.1/gfortran64_mp/lib/libacml_mp.so -DBLAS_DEBUG_TOOLS=ON -DOPENCL_OFFLINE_BUILD_HAWAII_KERNEL=ON -DBUILD_PERFORMANCE=ON -DCMAKE_INSTALL_PREFIX=/opt/clBLAS -DBUILD_SHARED_LIBS=ON -DUSE_SYSTEM_GTEST=ON -DOPENCL_LIBRARIES=/usr/lib/libOpenCL.so.1
    

    UPDATE 1: I tried downgrading the driver to from 1912.5 to 1800.8, but it did not help (clinfo itself crashed)

    clinfo

    Number of platforms:                 1
      Platform Profile:              FULL_PROFILE
      Platform Version:              OpenCL 2.0 AMD-APP (1800.8)
      Platform Name:                 AMD Accelerated Parallel Processing
      Platform Vendor:               Advanced Micro Devices, Inc.
      Platform Extensions:               cl_khr_icd cl_amd_event_callback cl_amd_offline_devices 
    

    UPDATE 2: The error is caused by a code around line 100 in src/tests/functional/func-error.cpp

    TEST(ERROR, InvalidEventWaitList) {
        ErrorClass<GemmMetod<float> > ec;
        ec.error(CL_INVALID_EVENT_WAIT_LIST);
    }
    

    I tried to put some other tests in front of this one and their did pass.

    opened by mpekalski 21
  • clBLAS fails when using multiple contexts

    clBLAS fails when using multiple contexts

    The code to reproduce the problem is here:

    #include <stdio.h>
    #include <iostream>
    #include <vector>
    
    /* Include the clBLAS header. It includes the appropriate OpenCL headers */
    #include <clBLAS.h>
    
    /* This example uses predefined matrices and their characteristics for
     * simplicity purpose.
     */
    
    #define M  4
    #define N  3
    #define K  5
    
    static const cl_float alpha = 1;
    
    static const cl_float A[M*K] = {
        11, 12, 13, 14, 15,
        21, 22, 23, 24, 25,
        31, 32, 33, 34, 35,
        41, 42, 43, 44, 45,
    };
    static const size_t lda = K;        /* i.e. lda = K */
    
    static const cl_float B[K*N] = {
        11, 12, 13,
        21, 22, 23,
        31, 32, 33,
        41, 42, 43,
        51, 52, 53,
    };
    static const size_t ldb = N;        /* i.e. ldb = N */
    
    static const cl_float beta = 0;
    
    static cl_float C[M*N] = {
        11, 12, 13,
        21, 22, 23,
        31, 32, 33,
        41, 42, 43,
    };
    static const size_t ldc = N;        /* i.e. ldc = N */
    
    static cl_float result[M*N];
    
    void print(const char *msg, const cl_float *ptr, int m, int n)
    {
        std::cout << msg << std::endl;
        for(int i = 0; i < n; i++) {
            for(int j = 0; j < m; j++) {
                std::cout << ptr[i * m + j] << "\t";
            }
            std::cout << std::endl;
        }
    }
    
    #define ERR() do {      \
        if(err != 0) {      \
            printf("%d Error = %d\n", __LINE__, err);   \
            exit(err);      \
        }                   \
    } while(0);
    
    int func(cl_device_id &device, cl_context_properties *props)
    {
        cl_int err;
        cl_context ctx = 0;
        cl_command_queue queue = 0;
        cl_mem bufA, bufB, bufC;
        cl_event event = NULL;
        int ret = 0;
    
        ctx = clCreateContext( props, 1, &device, NULL, NULL, &err );
        queue = clCreateCommandQueue( ctx, device, 0, &err );
    
        /* Prepare OpenCL memory objects and place matrices inside them. */
        bufA = clCreateBuffer( ctx, CL_MEM_READ_ONLY, M * K * sizeof(*A),
                NULL, &err );
        ERR();
        bufB = clCreateBuffer( ctx, CL_MEM_READ_ONLY, K * N * sizeof(*B),
                NULL, &err );
        ERR();
        bufC = clCreateBuffer( ctx, CL_MEM_READ_WRITE, M * N * sizeof(*C),
                NULL, &err );
        ERR();
    
        err = clEnqueueWriteBuffer( queue, bufA, CL_TRUE, 0,
                M * K * sizeof( *A ), A, 0, NULL, NULL );
        ERR();
        err = clEnqueueWriteBuffer( queue, bufB, CL_TRUE, 0,
                K * N * sizeof( *B ), B, 0, NULL, NULL );
        ERR();
        err = clEnqueueWriteBuffer( queue, bufC, CL_TRUE, 0,
                M * N * sizeof( *C ), C, 0, NULL, NULL );
        ERR();
    
        print("A", A, K, M);
        print("B", B, N, K);
    
        /* Call clBLAS extended function. Perform gemm for the lower right sub-matrices */
        err = clblasSgemm( clblasRowMajor, clblasNoTrans, clblasNoTrans,
                M, N, K,
                alpha, bufA, 0, lda,
                bufB, 0, ldb, beta,
                bufC, 0, ldc,
                1, &queue, 0, NULL, &event );
        ERR();
    
        /* Wait for calculations to be finished. */
        err = clWaitForEvents( 1, &event );
        ERR();
    
        /* Fetch results of calculations from GPU memory. */
        err = clEnqueueReadBuffer( queue, bufC, CL_TRUE, 0,
                M * N * sizeof(*result),
                result, 0, NULL, NULL );
        ERR();
    
        print("R", result, N, M);
    
        /* Release OpenCL memory objects. */
        clReleaseMemObject( bufC );
        clReleaseMemObject( bufB );
        clReleaseMemObject( bufA );
    
        /* Release OpenCL working objects. */
        clReleaseCommandQueue( queue );
        clReleaseContext( ctx );
    
        return ret;
    }
    
    int main()
    {
        cl_int err;
        cl_platform_id platform = 0;
        cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
        cl_uint ndevices = 0;
    
        /* Setup OpenCL environment. */
        err = clGetPlatformIDs( 1, &platform, NULL );
        ERR();
        err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, 0, NULL, &ndevices );
        ERR();
    
        std::vector<cl_device_id> devices(ndevices);
    
        err = clGetDeviceIDs( platform, CL_DEVICE_TYPE_GPU, ndevices, &devices.front(), NULL );
        ERR();
    
        props[1] = (cl_context_properties)platform;
    
        err = clblasSetup( );
        ERR();
    
        // This bug happens when atleast 2 contexts are created.
        // We can use either multiple gpus if available
        // Or create 2 contexts for the same device
        int runs = ndevices > 1 ? ndevices : 2;
        for(int idx = 0; idx < runs; idx++) {
            int id = idx % ndevices;
            cl_device_id device = devices[id];
            char name[256];
            clGetDeviceInfo(device, CL_DEVICE_NAME, 256, name, NULL);
            printf("Device %d = %s\n", id, name);
    
            func(device, props);
        }
    
        /* Finalize work with clBLAS */
        clblasTeardown( );
    
        return 0;
    }
    
    bug 
    opened by pavanky 20
  • Build fails on OSX with XCode 7.1 because

    Build fails on OSX with XCode 7.1 because "embedding a #pragma directive within macro arguments is not supported"

    I have no trouble building using XCode 6, but the latest version is more stringent and fails on a few files because "embedding a #pragma directive within macro arguments is not supported."

    In file included from /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/xtrsm.cc:29:
    In file included from /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/trtri/TrtriKernelSourceIncludes.h:5:
    In file included from /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/trtri/TrtriKernelSourceIncludes.cpp:22:
    /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/trtri/diag_dtrtri_upper_128_16.cpp:62:4: error: embedding a #pragma directive within macro arguments is not supported
      #pragma unroll \n
       ^
    /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/trtri/diag_dtrtri_upper_128_16.cpp:114:6: error: embedding a #pragma directive within macro arguments is not supported
        #pragma unroll\n
         ^
    /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/trtri/diag_dtrtri_upper_128_16.cpp:142:2: error: embedding a #pragma directive within macro arguments is not supported
    #pragma unroll\n
     ^
    In file included from /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/xtrsm.cc:29:
    In file included from /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/trtri/TrtriKernelSourceIncludes.h:5:
    In file included from /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/trtri/TrtriKernelSourceIncludes.cpp:33:
    /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/trtri/diag_dtrtri_lower_128_16.cpp:65:2: error: embedding a #pragma directive within macro arguments is not supported
    #pragma unroll\n
     ^
    /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/trtri/diag_dtrtri_lower_128_16.cpp:141:2: error: embedding a #pragma directive within macro arguments is not supported
    #pragma unroll\n
     ^
    /tmp/clblas20151115-84503-1lfrpf0/clBLAS-2.8/src/library/blas/trtri/diag_dtrtri_lower_128_16.cpp:166:2: error: embedding a #pragma directive within macro arguments is not supported
    #pragma unroll\n
     ^
    9 errors generated.
    make[2]: *** [library/CMakeFiles/clBLAS.dir/blas/xtrsm.cc.o] Error 1
    make[1]: *** [library/CMakeFiles/clBLAS.dir/all] Error 2
    

    The macro in question is STRINGIFY(...).

    opened by GOFAI 19
  • clBLASgemm:

    clBLASgemm: "mixed vector-scalar operation not allowed unless up-convertable" on Hawaii

    Related to this Caffe PR, when I run the Caffe tests on an R9-290X, I see a test failure apparently trying to compile clBLAS kernel for clBLASgemm. This is the error. If instead I run the same tests on the same machine but specify them to use the 7950/Tahiti device, all the tests pass. So is clBLASgemm broken on Hawaii? Also, if I run the tests choosing the CPU (FX6350) as the OpenCL device, I also get the same compile error that I get as when I specify the Hawaii device. This is what clinfo shows on the machine, an Ubuntu all-AMD and OpenCL machine with no Cuda.

    To run these tests, I'm building what's in this branch. I expect the same problem would manifest with @lunochod 's branch, but it doesn't have the ability to specify which OpenCL device the tests should be run on, whereas mine does. So I can't verify on his branch, since device defaults to the first GPU with his branch, which is the Tahiti device. Namely, when you run test.testbin you add one command line argument to specify which device. In my case 0 (the default) is the CPU, 1 is the Tahiti device and 2 is the Hawaii card. If someone wants to reproduce this by building and running the tests, you will need to install all the caffe prerequisites first (except CUDA), and build using cmake, not the Makefile that comes in the caffe directory. I can help if you have questions about that.

    opened by jyegerlehner 18
  • Documentation for the callbacks on the CBLAS-like API

    Documentation for the callbacks on the CBLAS-like API

    I just got clBLAS compiling on OS X (see #7) and did a very simple DGEMM test, see https://github.com/fommil/netlib-java/

    The results are, frankly, a little unbelievable... so I'm going to have to check that the DGEMM is actually being performed.

    However, as part of the setup I found it very hard to understand the clblasDgemm API. It looks like you've added offsets to the arrays (which doesn't make much sense in C, since this can be done by just moving the pointer) and also added the following

        cl_uint numCommandQueues,
        cl_command_queue *commandQueues,
        cl_uint numEventsInWaitList,
        const cl_event *eventWaitList,
        cl_event *events
    

    I just set these to NULL or 0 as appropriate: https://github.com/fommil/netlib-java/blob/master/perf/src/main/c/clwrapper.c

    What are these for and what are sensible defaults just to get me up and running?

    enhancement 
    opened by fommil 17
  • thread-safety fail: clblasDgemm assertion fail within openmp-environment using multiple GPUs

    thread-safety fail: clblasDgemm assertion fail within openmp-environment using multiple GPUs

    I've tried to run run a single matrix-multiplication C = A x B on multiple GPUs by splitting the columns of matrix B into multiple batches.

    Therefore, I've set up different contexts/com-queues for each GPU-device and executed the dgemm-batches within an openmp-parallelized loop (see code-snippet below).

    Using a test-system with 250x250 matrices and 4 GPUs (B-col batches: 62,62,62,64), the following error pops up:

    OpenCL error -34 on line 281 of <path>/clBLAS-master/src/library/blas/xgemm.cc
    <binary>: <path>/clBLAS-master/src/library/blas/xgemm.cc:281: void enqueueGemmKernel(cl_command_queue, cl_kernel, void**, size_t*, unsigned int, const size_t*, const size_t*, cl_uint, _cl_event* const*, _cl_event**): Assertion `false' failed.
    

    However, no errors pop up if the call to clblasDgemm is within a omp critical section.

      // skipped error-checks...
      std::vector<cl_mem> theAs(ocl_devices.size());
      std::vector<cl_mem> theBs(ocl_devices.size());
      std::vector<cl_mem> theCs(ocl_devices.size());
      std::vector<cl_context> cxGPUContext(ocl_devices.size());
      std::vector<cl_command_queue> commandQueue(ocl_devices.size());
    
      // setup ctx/com and allocated mem on devs
      #pragma omp parallel for default(shared) schedule(dynamic)
      for(size_t ii=0;ii<ocl_devices.size();ii++){
        cl_device_id theID = ocl_devices[ii];
        cl_int ciErrNum2 = CL_SUCCESS;
        cxGPUContext[ii] = clCreateContext(0, 1, &theID, NULL, NULL, &ciErrNum2);
        #ifdef CL_VERSION_2_0
          commandQueue[ii] = clCreateCommandQueueWithProperties(cxGPUContext[ii], theID, NULL, &ciErrNum2);
        #else
          commandQueue[ii] = clCreateCommandQueue(cxGPUContext[ii], theID, 0, &ciErrNum2);
        #endif
    
        theAs[ii] = clCreateBuffer(cxGPUContext[ii], CL_MEM_READ_ONLY, ( nra*nca*sizeof(double)), NULL, &ciErrNum2);
        theBs[ii] = clCreateBuffer(cxGPUContext[ii], CL_MEM_READ_ONLY, ( nrb*(ncol_per_gpu+nrest)*sizeof(double)), NULL, &ciErrNum2);
        theCs[ii] = clCreateBuffer(cxGPUContext[ii], CL_MEM_READ_WRITE, ( nrc*(ncol_per_gpu+nrest)*sizeof(double)), NULL, &ciErrNum2);
    
        clEnqueueWriteBuffer(commandQueue[ii], theAs[ii], CL_TRUE, 0, ( nra*nca*sizeof(double)), Amat, 0, NULL, NULL);
      }
    
     // execute batches
      for(size_t isub=0;isub<nsub;isub++){
        #pragma omp parallel for default(shared) schedule(dynamic)
        for(size_t ii=0;ii<ocl_devices.size();ii++){
          size_t N = ncols[isub*ocl_devices.size()+ii];
          clEnqueueWriteBuffer(commandQueue[ii], theBs[ii], CL_TRUE, 0, ( nrb*N*sizeof(double)),
                                           Bmat+nrb*ncol_per_gpu*(isub*ocl_devices.size()+ii), 0, NULL, NULL);
          cl_event event = NULL;
         // works if the following com is within omp-critical section
         clblasDgemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans, M, N, K, one, theAs[ii],
                             0,lda, theBs[ii], 0,ldb, zero, theCs[ii], 0,ldc, 1, &commandQueue[ii], 0, NULL, &event);
    
          ciErrNum2 = clWaitForEvents(1, &event);
          ciErrNum2 = clEnqueueReadBuffer(commandQueue[ii], theCs[ii], CL_TRUE, 0, nrc*N* sizeof(double),
                                            Cmat+nrc*ncol_per_gpu*(isub*ocl_devices.size()+ii), 0, NULL, NULL);
    
    
        }
      }
    
    
    opened by jkn93 15
  • sger fails on nvidia

    sger fails on nvidia

    following code falis, with error -36 'invalid queue'. This usually menas an out of bounds array access occurred.

    /* ************************************************************************
     * Copyright 2013 Advanced Micro Devices, Inc.
     *
     * Licensed under the Apache License, Version 2.0 (the "License");
     * you may not use this file except in compliance with the License.
     * You may obtain a copy of the License at
     *
     * http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     * ************************************************************************/
    #include <sys/types.h>
    #include <stdio.h>
    #include <string.h>
    #include <math.h>
    /* Include CLBLAS header. It automatically includes needed OpenCL header,
     * so we can drop out explicit inclusion of cl.h header.
     */
    #include <clBLAS.h>
    /* This example uses predefined matrices and their characteristics for
     * simplicity purpose.
     */
    static const clblasOrder order = clblasColumnMajor;
    static const size_t M = 5;
    static const size_t N = 5;
    static const cl_float alpha = 1;
    static const size_t lda = 5;
    static const int incx = 1;
    static const int incy = 1;
    /*static void*/
    /*printResult(void)*/
    /*{*/
    /*    size_t i, j;*/
    /*    printf("\nResult:\n");*/
    /*    for (i = 0; i < M; i++) {*/
    /*        for(j = 0; j < N; j++)*/
    /*            printf("\t%f", A[ i*N + j ]);*/
    /*        printf("\n");*/
    /*    }*/
    /*}*/
    int
    main(void)
    {
        cl_int err;
        cl_platform_id platform = 0;
        cl_device_id device = 0;
        cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
        cl_context ctx = 0;
        cl_command_queue queue = 0;
        cl_mem bufA, bufX, bufY;
        cl_event event = NULL;
        int ret = 0;
        /* Setup OpenCL environment. */
        err = clGetPlatformIDs(1, &platform, NULL);
        if (err != CL_SUCCESS) {
            printf( "clGetPlatformIDs() failed with %d\n", err );
            return 1;
        }
        err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        if (err != CL_SUCCESS) {
            printf( "clGetDeviceIDs() failed with %d\n", err );
            return 1;
        }
        props[1] = (cl_context_properties)platform;
        ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
        if (err != CL_SUCCESS) {
            printf( "clCreateContext() failed with %d\n", err );
            return 1;
        }
        queue = clCreateCommandQueue(ctx, device, 0, &err);
        if (err != CL_SUCCESS) {
            printf( "clCreateCommandQueue() failed with %d\n", err );
            clReleaseContext(ctx);
            return 1;
        }
        /* Setup clblas. */
        err = clblasSetup();
        if (err != CL_SUCCESS) {
            printf("clblasSetup() failed with %d\n", err);
            clReleaseCommandQueue(queue);
            clReleaseContext(ctx);
            return 1;
        }
        cl_float *pX = (cl_float *)malloc(M * sizeof(cl_float));
        cl_float *pY = (cl_float *)malloc(N * sizeof(cl_float));
        cl_float *pA = (cl_float *)malloc(M*N* sizeof(cl_float));
        memset(pX, 0, M*sizeof(cl_float));
        memset(pY, 0, N*sizeof(cl_float));
        memset(pA, 0, M*N*sizeof(cl_float));
        /* Prepare OpenCL memory objects and place matrices inside them. */
        bufA = clCreateBuffer(ctx, CL_MEM_READ_WRITE, M * lda * sizeof(cl_float),
                              NULL, &err);
        bufX = clCreateBuffer(ctx, CL_MEM_READ_ONLY, ( 1 + ( M - 1 )*abs( incx ) ) * sizeof(cl_float),
                              NULL, &err);
        bufY = clCreateBuffer(ctx, CL_MEM_READ_ONLY, ( 1 + ( N - 1 )*abs( incy ) ) * sizeof(cl_float),
                              NULL, &err);
        err = clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0,
            M * lda * sizeof(cl_float), pA, 0, NULL, NULL);
        err = clEnqueueWriteBuffer(queue, bufX, CL_TRUE, 0,
            ( 1 + ( M - 1 )*abs( incx ) ) * sizeof(cl_float), pX, 0, NULL, NULL);
        err = clEnqueueWriteBuffer(queue, bufY, CL_TRUE, 0,
            ( 1 + ( N - 1 )*abs( incy ) ) * sizeof(cl_float), pY, 0, NULL, NULL);
        /* Call clblas function. */
        err = clFinish(queue);
        printf("err %i\n", err);
        err = clblasSger(order, M, N, alpha, bufX, 0, incx, bufY, 0, incy,
            bufA, 0, lda, 1, &queue, 0, NULL, &event);
        if (err != CL_SUCCESS) {
            printf("clblasSger() failed with %d\n", err);
            ret = 1;
        }
        else {
            /* Wait for calculations to be finished. */
            err = clWaitForEvents(1, &event);
            /* Fetch results of calculations from GPU memory. */
            err = clEnqueueReadBuffer(queue, bufA, CL_TRUE, 0, (M * lda * sizeof(cl_float)),
                                      pA, 0, NULL, NULL);
            /* At this point you will get the result of SGER placed in A array. */
    /*        printResult();*/
        }
        err = clFinish(queue);
        printf("err %i\n", err);
        /* Release OpenCL memory objects. */
        clReleaseMemObject(bufY);
        clReleaseMemObject(bufX);
        clReleaseMemObject(bufA);
        free(pX);
        free(pY);
        free(pA);
        /* Finalize work with clblas. */
        clblasTeardown();
        /* Release OpenCL working objects. */
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        return ret;
    }
    
    opened by hughperkins 15
  • Scasum shows undefined behavior after running Dzasum.

    Scasum shows undefined behavior after running Dzasum.

    I'm using clBLAS 2.10 with an AMD R9 390 GPU on Windows 7 x64. The issue occurs when I call the single precision complex absolute sum function (scasum) after double precision complex absolute sum function (dzasum). The following code is modified from example_sasum.c.

    The following code gives me:
    (123.00, 0.00) (123.00, 1.88).

    Instead of the result I expected: (123.00, 0.00) (123.00, 0.00)

    /* ************************************************************************
     * check complex
     * ************************************************************************/
    
    #include <sys/types.h>
    #include <stdio.h>
    #include <string.h>
    #include <math.h>
    
    /* Include CLBLAS header. It automatically includes needed OpenCL header,
     * so we can drop out explicit inclusion of cl.h header.
     */
    #include "clBLAS.h"
    
    /* This example uses predefined matrices and their characteristics for
     * simplicity purpose.
     */
    static const size_t N = 7;
    static cl_double2 X[] = {
        {{1,0}},
        {{2,0}},
        {{-11,0}},
        {{17,0}},
        {{5,0}},
        {{6,0}},
        {{81,0}}
    };
    static const int incx = 1;
    static cl_double2 asum;
    static cl_float2 X2[] = {
        {{1,0}},
        {{2,0}},
        {{-11,0}},
        {{17,0}},
        {{5,0}},
        {{6,0}},
        {{81,0}}
    };
    static cl_float2 asum2;
    
    int
    main(void)
    {
        cl_int err;
        cl_platform_id platform = 0;
        cl_device_id device = 0;
        cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
        cl_context ctx = 0;
        cl_command_queue queue = 0;
        cl_mem bufX, bufAsum, scratchBuff;
        cl_event event = NULL;
        int ret = 0;
        int lenX = 1 + (N-1)*abs(incx);
    
        /* Setup OpenCL environment. */
        err = clGetPlatformIDs(1, &platform, NULL);
        if (err != CL_SUCCESS) {
            printf( "clGetPlatformIDs() failed with %d\n", err );
            return 1;
        }
    
        err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        if (err != CL_SUCCESS) {
            printf( "clGetDeviceIDs() failed with %d\n", err );
            return 1;
        }
    
        props[1] = (cl_context_properties)platform;
        ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
        if (err != CL_SUCCESS) {
            printf( "clCreateContext() failed with %d\n", err );
            return 1;
        }
    
        queue = clCreateCommandQueue(ctx, device, 0, &err);
        if (err != CL_SUCCESS) {
            printf( "clCreateCommandQueue() failed with %d\n", err );
            clReleaseContext(ctx);
            return 1;
        }
    
        /* Setup clblas. */
        err = clblasSetup();
        if (err != CL_SUCCESS) {
            printf("clblasSetup() failed with %d\n", err);
            clReleaseCommandQueue(queue);
            clReleaseContext(ctx);
            return 1;
        }
    
        /* Prepare OpenCL memory objects and place matrices inside them. */
        bufX = clCreateBuffer(ctx, CL_MEM_READ_ONLY, (lenX*sizeof(cl_double2)), NULL, &err);
        bufAsum = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, (sizeof(cl_double2)), NULL, &err);
        // Allocate minimum of N elements
        scratchBuff = clCreateBuffer(ctx, CL_MEM_READ_WRITE, (N*sizeof(cl_double2)), NULL, &err);
    
        err = clEnqueueWriteBuffer(queue, bufX, CL_TRUE, 0, (lenX*sizeof(cl_double2)), X, 0, NULL, NULL);
    
        /* Call clblas function. */
        err = clblasDzasum( N, bufAsum, 0, bufX, 0, incx, scratchBuff,
                                        1, &queue, 0, NULL, &event);
        if (err != CL_SUCCESS) {
            printf("clblasSasum() failed with %d\n", err);
            ret = 1;
        }
        else {
            /* Wait for calculations to be finished. */
            err = clWaitForEvents(1, &event);
    
            /* Fetch results of calculations from GPU memory. */
            err = clEnqueueReadBuffer(queue, bufAsum, CL_TRUE, 0, sizeof(cl_double2),
                                        &asum, 0, NULL, NULL);
            //printf("Result : %f\n", asum);
            printf("(%9.2lf, %-9.2lf)\n", CREAL(asum), CIMAG(asum));
        }
    
        /* Release OpenCL events. */
        clReleaseEvent(event);
    
        /* Release OpenCL memory objects. */
        clReleaseMemObject(bufX);
        clReleaseMemObject(bufAsum);
        clReleaseMemObject(scratchBuff);
    
        /* Finalize work with clblas. */
        clblasTeardown();
    
        /* Release OpenCL working objects. */
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
    ///////////////////////////////////////////////now do single precision complex numbers
        //cl_int err;
        platform = 0;
        device = 0;
        //props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
        ctx = 0;
        queue = 0;
        cl_mem bufX2, bufAsum2, scratchBuff2;
        event = NULL;
        //int ret = 0;
        //int lenX = 1 + (N-1)*abs(incx);
    
        /* Setup OpenCL environment. */
        err = clGetPlatformIDs(1, &platform, NULL);
        if (err != CL_SUCCESS) {
            printf( "clGetPlatformIDs() failed with %d\n", err );
            return 1;
        }
    
        err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        if (err != CL_SUCCESS) {
            printf( "clGetDeviceIDs() failed with %d\n", err );
            return 1;
        }
    
        props[1] = (cl_context_properties)platform;
        ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
        if (err != CL_SUCCESS) {
            printf( "clCreateContext() failed with %d\n", err );
            return 1;
        }
    
        queue = clCreateCommandQueue(ctx, device, 0, &err);
        if (err != CL_SUCCESS) {
            printf( "clCreateCommandQueue() failed with %d\n", err );
            clReleaseContext(ctx);
            return 1;
        }
    
        /* Setup clblas. */
        err = clblasSetup();
        if (err != CL_SUCCESS) {
            printf("clblasSetup() failed with %d\n", err);
            clReleaseCommandQueue(queue);
            clReleaseContext(ctx);
            return 1;
        }
    
        /* Prepare OpenCL memory objects and place matrices inside them. */
        bufX2 = clCreateBuffer(ctx, CL_MEM_READ_ONLY, (lenX*sizeof(cl_float2)), NULL, &err);
        bufAsum2 = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, (sizeof(cl_float2)), NULL, &err);
        // Allocate minimum of N elements
        scratchBuff2 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, (N*sizeof(cl_float2)), NULL, &err);
    
        err = clEnqueueWriteBuffer(queue, bufX2, CL_TRUE, 0, (lenX*sizeof(cl_float2)), X2, 0, NULL, NULL);
    
        /* Call clblas function. */
        err = clblasScasum( N, bufAsum2, 0, bufX2, 0, incx, scratchBuff2,
                                        1, &queue, 0, NULL, &event);
        if (err != CL_SUCCESS) {
            printf("clblasSasum() failed with %d\n", err);
            ret = 1;
        }
        else {
            /* Wait for calculations to be finished. */
            err = clWaitForEvents(1, &event);
    
            /* Fetch results of calculations from GPU memory. */
            err = clEnqueueReadBuffer(queue, bufAsum2, CL_TRUE, 0, sizeof(cl_float2),
                                        &asum2, 0, NULL, NULL);
            //printf("Result : %f\n", asum);
            printf("(%9.2lf, %-9.2lf)\n", CREAL(asum2), CIMAG(asum2));
        }
    
        /* Release OpenCL events. */
        clReleaseEvent(event);
    
        /* Release OpenCL memory objects. */
        clReleaseMemObject(bufX2);
        clReleaseMemObject(bufAsum2);
        clReleaseMemObject(scratchBuff2);
    
        /* Finalize work with clblas. */
        clblasTeardown();
    
        /* Release OpenCL working objects. */
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
    
        return ret;
    }
    
    opened by mikhail-j 14
  • Fixing build warnings.

    Fixing build warnings.

    I did the mechanical things to fix build warnings, and the changes significantly reduce the warnings on linux. I don't know how that changes will play with windows, but C++ is new-line tolerant so there shouldn't be any issues anyways. I also fixed places where tabs and spaces got mixed up, defaulting to tabs, and converting the equivalent number of spaces into tabs for many files.

    As a note, it looks like many of the kernels in src/library/blas/trtri/ have uncanny similarity, so those files should be able to be shrunk down. I would also implore you to purge STRINGIFY.

    opened by anadon 13
  • add error checker when creating cmd queue in client: especially when OoO queue is not supported on many devices

    add error checker when creating cmd queue in client: especially when OoO queue is not supported on many devices

    By default , the client created a out-of-orer queue. However, most devices only support in-order queue. The main purpose of client is to test clBLAS maths functions not test queues. It is better to change into in-order queue by default.

    https://github.com/clMathLibraries/clBLAS/blob/master/src/client/clfunc_common.hpp#L306

    opened by tingxingdong 1
  • Installation procedure went wrong?

    Installation procedure went wrong?

    Hi,

    I cloned your repo, created a build and install folder inside my clBLAS folder. I run then the lines:

    > cmake ../src/ -DCMAKE_INSTALL_PREFIX=/home/PersonalRepository/clBLAS/install
    > make -j8
    > make install
    

    Everything went fine, so I went in the /install/bin folder (I thought these were some examples to run?) but none of them seem to work.

    I either get segmentation fault (core dumped) or The environment variable 'CLBLAS_STORAGE_PATH' is not defined

    I guess I'm either looking in the wrong place for example, or I did something wrong with the installation.

    Can you help?

    Update

    I figured for the examples I need to enable -DBUILD_SAMPLE=ON. However when running make -j8 I get this error while building:

    error: \u2018FloatComplex {aka union <anonymous>}\u2019 has no member named \u2018x\u2019
                 printf("%.5f ", result[i * ldb + j].x);
    
    opened by lukkio88 1
  • CMake compilation with clBLAS fails on hard-coded AMDADDPSDK path

    CMake compilation with clBLAS fails on hard-coded AMDADDPSDK path

    When I compile a program using CMake, the target that links directly to clBLAS builds, but anything that then links to that target fails because:

    make[2]: *** No rule to make target '/opt/AMDAPPSDK-3.0/lib/x86_64/libOpenCL.so', needed by [target].  Stop
    

    The offending line appears in clBLASTargets-release.cmake when the concrete path to the AMD APP SDK is set. Removing that path allows the compilation to succeed.

    We could remove that path or substitute in ${OpenCL_LIBRARY}, but I think the real goal should be to move towards a Modern CMake approach where targets, not variables, are defined as this is best-practice. Something like clBLAS::clBLAS rather than the variables.

    refer to: https://pabloariasal.github.io/2018/02/19/its-time-to-do-cmake-right/#model-dependencies-with-target_link_libraries

    opened by tsjordan-eng 1
  • test-short failure on gfx1010 (RX5700 XT)

    test-short failure on gfx1010 (RX5700 XT)

    ========================================================
    
    AN INTERNAL KERNEL BUILD ERROR OCCURRED!
    device name = gfx1010
    error = -11
    memory pattern = Cached global memory based trmm, computing kernel generator
    Subproblem dimensions: dims[0].itemY = 32, dims[0].itemX = 8, dims[0].y = 32, dims[0].x = 8, dims[0].bwidth = 32; ; dims[1].itemY = 4, dims[1].itemX = 4, dims[1].y = 4, dims[1].x = 4, dims[1].bwidth = 8; ; 
    Parallelism granularity: pgran->wgDim = 1, pgran->wgSize[0] = 64, pgran->wgSize[1] = 1, pgran->wfSize = 64
    Kernel extra flags: 942700465
    Source:
    
    typedef union GPtr {
        __global float *f;
        __global float2 *f2v;
        __global float4 *f4v;
        __global float8 *f8v;
        __global float16 *f16v;
    } GPtr;
    
    typedef union LPtr {
        __local float *f;
        __local float2 *f2v;
        __local float4 *f4v;
        __local float8 *f8v;
        __local float16 *f16v;
    } LPtr;
    
    typedef union PPtr {
        float *f;
        float2 *f2v;
        float4 *f4v;
        float8 *f8v;
        float16 *f16v;
    } PPtr;
    
    __attribute__((reqd_work_group_size(64, 1, 1)))
    void __kernel
    strmmSubgroup(
        uint M,
        uint N,
        float alpha,
        const __global float *restrict A,
        uint lda,
        const __global float *restrict B,
        __global float *C,
        uint ldb)
    {
        float8 a0, a1, a2, a3;
        float8 b0, b1, b2, b3;
        float4 c0, c1, c2, c3;
        uint currM, currN;
        uint4 coord = 0; /* contains coordB, coordA, k */
        const int lid = get_local_id(0);
        const int gid = get_global_id(0) / 64;
    
    
        int2 itemId;
        int2 subgCoord;
        itemId.x = get_local_id(0)%4;
        itemId.y = get_local_id(0)/4;
        subgCoord.x = itemId.y/8;
        subgCoord.y = itemId.y%8;
    
        currN = gid * 8;
        currM = (M - 1) / 32 * 32;
    
        GPtr Ag = {A};
        GPtr Bg = {B};
        coord.x = currN + subgCoord.x*4;
        for (uint m0 = 0; m0 < M; m0 += 32) {
            uint kBegin = 0;
            coord.z = kBegin;
            coord.y = currM + subgCoord.y*4;
            c0 = 0;
            c1 = 0;
            c2 = 0;
            c3 = 0;
    
            if ((coord.x < N) && (coord.y < M)) {
                if (coord.y + 4 > M) {
                    coord.y -= 4 - M % 4;
                }
                if (coord.x + 4 > N) {
                    coord.x -= 4 - N % 4;
                }
    
                uint k0;
                uint kMax;
                kMax = currM - currM%8;
                for( k0 = 0; k0 < kMax; k0 += 32 ) {
                    coord.z=(k0+itemId.x*8+64*gid)%kMax;
                    /* -- Tiles multiplier -- */
                    const uint4 bx = {mad24(coord.x % N, ldb, 0u), mad24((coord.x + 1) % N, ldb, 0u), mad24((coord.x + 2) % N, ldb, 0u),
                                    mad24((coord.x + 3) % N, ldb, 0u)};
                    const uint8 bk = ((uint8)(0, 1, 2, 3, 4, 5, 6, 7) + coord.z) % M;
    
                    b0.s0 = (Bg).f[bx.s0 + bk.s0];
                    b0.s1 = (Bg).f[bx.s0 + bk.s1];
                    b0.s2 = (Bg).f[bx.s0 + bk.s2];
                    b0.s3 = (Bg).f[bx.s0 + bk.s3];
                    b0.s4 = (Bg).f[bx.s0 + bk.s4];
                    b0.s5 = (Bg).f[bx.s0 + bk.s5];
                    b0.s6 = (Bg).f[bx.s0 + bk.s6];
                    b0.s7 = (Bg).f[bx.s0 + bk.s7];
                    b1.s0 = (Bg).f[bx.s1 + bk.s0];
                    b1.s1 = (Bg).f[bx.s1 + bk.s1];
                    b1.s2 = (Bg).f[bx.s1 + bk.s2];
                    b1.s3 = (Bg).f[bx.s1 + bk.s3];
                    b1.s4 = (Bg).f[bx.s1 + bk.s4];
                    b1.s5 = (Bg).f[bx.s1 + bk.s5];
                    b1.s6 = (Bg).f[bx.s1 + bk.s6];
                    b1.s7 = (Bg).f[bx.s1 + bk.s7];
                    b2.s0 = (Bg).f[bx.s2 + bk.s0];
                    b2.s1 = (Bg).f[bx.s2 + bk.s1];
                    b2.s2 = (Bg).f[bx.s2 + bk.s2];
                    b2.s3 = (Bg).f[bx.s2 + bk.s3];
                    b2.s4 = (Bg).f[bx.s2 + bk.s4];
                    b2.s5 = (Bg).f[bx.s2 + bk.s5];
                    b2.s6 = (Bg).f[bx.s2 + bk.s6];
                    b2.s7 = (Bg).f[bx.s2 + bk.s7];
                    b3.s0 = (Bg).f[bx.s3 + bk.s0];
                    b3.s1 = (Bg).f[bx.s3 + bk.s1];
                    b3.s2 = (Bg).f[bx.s3 + bk.s2];
                    b3.s3 = (Bg).f[bx.s3 + bk.s3];
                    b3.s4 = (Bg).f[bx.s3 + bk.s4];
                    b3.s5 = (Bg).f[bx.s3 + bk.s5];
                    b3.s6 = (Bg).f[bx.s3 + bk.s6];
                    b3.s7 = (Bg).f[bx.s3 + bk.s7];
    
                    const uint4 ay = {mad24(coord.y % M, lda, 0u), mad24((coord.y + 1) % M, lda, 0u), mad24((coord.y + 2) % M, lda, 0u),
                                    mad24((coord.y + 3) % M, lda, 0u)};
                    const uint8 ak = ((uint8)(0, 1, 2, 3, 4, 5, 6, 7) + coord.z) % M;
    
                    a0.s0 = (Ag).f[ay.s0 + ak.s0];
                    a0.s1 = (Ag).f[ay.s0 + ak.s1];
                    a0.s2 = (Ag).f[ay.s0 + ak.s2];
                    a0.s3 = (Ag).f[ay.s0 + ak.s3];
                    a0.s4 = (Ag).f[ay.s0 + ak.s4];
                    a0.s5 = (Ag).f[ay.s0 + ak.s5];
                    a0.s6 = (Ag).f[ay.s0 + ak.s6];
                    a0.s7 = (Ag).f[ay.s0 + ak.s7];
                    a1.s0 = (Ag).f[ay.s1 + ak.s0];
                    a1.s1 = (Ag).f[ay.s1 + ak.s1];
                    a1.s2 = (Ag).f[ay.s1 + ak.s2];
                    a1.s3 = (Ag).f[ay.s1 + ak.s3];
                    a1.s4 = (Ag).f[ay.s1 + ak.s4];
                    a1.s5 = (Ag).f[ay.s1 + ak.s5];
                    a1.s6 = (Ag).f[ay.s1 + ak.s6];
                    a1.s7 = (Ag).f[ay.s1 + ak.s7];
                    a2.s0 = (Ag).f[ay.s2 + ak.s0];
                    a2.s1 = (Ag).f[ay.s2 + ak.s1];
                    a2.s2 = (Ag).f[ay.s2 + ak.s2];
                    a2.s3 = (Ag).f[ay.s2 + ak.s3];
                    a2.s4 = (Ag).f[ay.s2 + ak.s4];
                    a2.s5 = (Ag).f[ay.s2 + ak.s5];
                    a2.s6 = (Ag).f[ay.s2 + ak.s6];
                    a2.s7 = (Ag).f[ay.s2 + ak.s7];
                    a3.s0 = (Ag).f[ay.s3 + ak.s0];
                    a3.s1 = (Ag).f[ay.s3 + ak.s1];
                    a3.s2 = (Ag).f[ay.s3 + ak.s2];
                    a3.s3 = (Ag).f[ay.s3 + ak.s3];
                    a3.s4 = (Ag).f[ay.s3 + ak.s4];
                    a3.s5 = (Ag).f[ay.s3 + ak.s5];
                    a3.s6 = (Ag).f[ay.s3 + ak.s6];
                    a3.s7 = (Ag).f[ay.s3 + ak.s7];
    
                    c0.s0 = mad(a0.s0, b0.s0, c0.s0);
                    c0.s0 = mad(a0.s1, b0.s1, c0.s0);
                    c0.s0 = mad(a0.s2, b0.s2, c0.s0);
                    c0.s0 = mad(a0.s3, b0.s3, c0.s0);
                    c0.s0 = mad(a0.s4, b0.s4, c0.s0);
                    c0.s0 = mad(a0.s5, b0.s5, c0.s0);
                    c0.s0 = mad(a0.s6, b0.s6, c0.s0);
                    c0.s0 = mad(a0.s7, b0.s7, c0.s0);
                    c1.s0 = mad(a0.s0, b1.s0, c1.s0);
                    c1.s0 = mad(a0.s1, b1.s1, c1.s0);
                    c1.s0 = mad(a0.s2, b1.s2, c1.s0);
                    c1.s0 = mad(a0.s3, b1.s3, c1.s0);
                    c1.s0 = mad(a0.s4, b1.s4, c1.s0);
                    c1.s0 = mad(a0.s5, b1.s5, c1.s0);
                    c1.s0 = mad(a0.s6, b1.s6, c1.s0);
                    c1.s0 = mad(a0.s7, b1.s7, c1.s0);
                    c2.s0 = mad(a0.s0, b2.s0, c2.s0);
                    c2.s0 = mad(a0.s1, b2.s1, c2.s0);
                    c2.s0 = mad(a0.s2, b2.s2, c2.s0);
                    c2.s0 = mad(a0.s3, b2.s3, c2.s0);
                    c2.s0 = mad(a0.s4, b2.s4, c2.s0);
                    c2.s0 = mad(a0.s5, b2.s5, c2.s0);
                    c2.s0 = mad(a0.s6, b2.s6, c2.s0);
                    c2.s0 = mad(a0.s7, b2.s7, c2.s0);
                    c3.s0 = mad(a0.s0, b3.s0, c3.s0);
                    c3.s0 = mad(a0.s1, b3.s1, c3.s0);
                    c3.s0 = mad(a0.s2, b3.s2, c3.s0);
                    c3.s0 = mad(a0.s3, b3.s3, c3.s0);
                    c3.s0 = mad(a0.s4, b3.s4, c3.s0);
                    c3.s0 = mad(a0.s5, b3.s5, c3.s0);
                    c3.s0 = mad(a0.s6, b3.s6, c3.s0);
                    c3.s0 = mad(a0.s7, b3.s7, c3.s0);
    
                    c0.s1 = mad(a1.s0, b0.s0, c0.s1);
                    c0.s1 = mad(a1.s1, b0.s1, c0.s1);
                    c0.s1 = mad(a1.s2, b0.s2, c0.s1);
                    c0.s1 = mad(a1.s3, b0.s3, c0.s1);
                    c0.s1 = mad(a1.s4, b0.s4, c0.s1);
                    c0.s1 = mad(a1.s5, b0.s5, c0.s1);
                    c0.s1 = mad(a1.s6, b0.s6, c0.s1);
                    c0.s1 = mad(a1.s7, b0.s7, c0.s1);
                    c1.s1 = mad(a1.s0, b1.s0, c1.s1);
                    c1.s1 = mad(a1.s1, b1.s1, c1.s1);
                    c1.s1 = mad(a1.s2, b1.s2, c1.s1);
                    c1.s1 = mad(a1.s3, b1.s3, c1.s1);
                    c1.s1 = mad(a1.s4, b1.s4, c1.s1);
                    c1.s1 = mad(a1.s5, b1.s5, c1.s1);
                    c1.s1 = mad(a1.s6, b1.s6, c1.s1);
                    c1.s1 = mad(a1.s7, b1.s7, c1.s1);
                    c2.s1 = mad(a1.s0, b2.s0, c2.s1);
                    c2.s1 = mad(a1.s1, b2.s1, c2.s1);
                    c2.s1 = mad(a1.s2, b2.s2, c2.s1);
                    c2.s1 = mad(a1.s3, b2.s3, c2.s1);
                    c2.s1 = mad(a1.s4, b2.s4, c2.s1);
                    c2.s1 = mad(a1.s5, b2.s5, c2.s1);
                    c2.s1 = mad(a1.s6, b2.s6, c2.s1);
                    c2.s1 = mad(a1.s7, b2.s7, c2.s1);
                    c3.s1 = mad(a1.s0, b3.s0, c3.s1);
                    c3.s1 = mad(a1.s1, b3.s1, c3.s1);
                    c3.s1 = mad(a1.s2, b3.s2, c3.s1);
                    c3.s1 = mad(a1.s3, b3.s3, c3.s1);
                    c3.s1 = mad(a1.s4, b3.s4, c3.s1);
                    c3.s1 = mad(a1.s5, b3.s5, c3.s1);
                    c3.s1 = mad(a1.s6, b3.s6, c3.s1);
                    c3.s1 = mad(a1.s7, b3.s7, c3.s1);
    
                    c0.s2 = mad(a2.s0, b0.s0, c0.s2);
                    c0.s2 = mad(a2.s1, b0.s1, c0.s2);
                    c0.s2 = mad(a2.s2, b0.s2, c0.s2);
                    c0.s2 = mad(a2.s3, b0.s3, c0.s2);
                    c0.s2 = mad(a2.s4, b0.s4, c0.s2);
                    c0.s2 = mad(a2.s5, b0.s5, c0.s2);
                    c0.s2 = mad(a2.s6, b0.s6, c0.s2);
                    c0.s2 = mad(a2.s7, b0.s7, c0.s2);
                    c1.s2 = mad(a2.s0, b1.s0, c1.s2);
                    c1.s2 = mad(a2.s1, b1.s1, c1.s2);
                    c1.s2 = mad(a2.s2, b1.s2, c1.s2);
                    c1.s2 = mad(a2.s3, b1.s3, c1.s2);
                    c1.s2 = mad(a2.s4, b1.s4, c1.s2);
                    c1.s2 = mad(a2.s5, b1.s5, c1.s2);
                    c1.s2 = mad(a2.s6, b1.s6, c1.s2);
                    c1.s2 = mad(a2.s7, b1.s7, c1.s2);
                    c2.s2 = mad(a2.s0, b2.s0, c2.s2);
                    c2.s2 = mad(a2.s1, b2.s1, c2.s2);
                    c2.s2 = mad(a2.s2, b2.s2, c2.s2);
                    c2.s2 = mad(a2.s3, b2.s3, c2.s2);
                    c2.s2 = mad(a2.s4, b2.s4, c2.s2);
                    c2.s2 = mad(a2.s5, b2.s5, c2.s2);
                    c2.s2 = mad(a2.s6, b2.s6, c2.s2);
                    c2.s2 = mad(a2.s7, b2.s7, c2.s2);
                    c3.s2 = mad(a2.s0, b3.s0, c3.s2);
                    c3.s2 = mad(a2.s1, b3.s1, c3.s2);
                    c3.s2 = mad(a2.s2, b3.s2, c3.s2);
                    c3.s2 = mad(a2.s3, b3.s3, c3.s2);
                    c3.s2 = mad(a2.s4, b3.s4, c3.s2);
                    c3.s2 = mad(a2.s5, b3.s5, c3.s2);
                    c3.s2 = mad(a2.s6, b3.s6, c3.s2);
                    c3.s2 = mad(a2.s7, b3.s7, c3.s2);
    
                    c0.s3 = mad(a3.s0, b0.s0, c0.s3);
                    c0.s3 = mad(a3.s1, b0.s1, c0.s3);
                    c0.s3 = mad(a3.s2, b0.s2, c0.s3);
                    c0.s3 = mad(a3.s3, b0.s3, c0.s3);
                    c0.s3 = mad(a3.s4, b0.s4, c0.s3);
                    c0.s3 = mad(a3.s5, b0.s5, c0.s3);
                    c0.s3 = mad(a3.s6, b0.s6, c0.s3);
                    c0.s3 = mad(a3.s7, b0.s7, c0.s3);
                    c1.s3 = mad(a3.s0, b1.s0, c1.s3);
                    c1.s3 = mad(a3.s1, b1.s1, c1.s3);
                    c1.s3 = mad(a3.s2, b1.s2, c1.s3);
                    c1.s3 = mad(a3.s3, b1.s3, c1.s3);
                    c1.s3 = mad(a3.s4, b1.s4, c1.s3);
                    c1.s3 = mad(a3.s5, b1.s5, c1.s3);
                    c1.s3 = mad(a3.s6, b1.s6, c1.s3);
                    c1.s3 = mad(a3.s7, b1.s7, c1.s3);
                    c2.s3 = mad(a3.s0, b2.s0, c2.s3);
                    c2.s3 = mad(a3.s1, b2.s1, c2.s3);
                    c2.s3 = mad(a3.s2, b2.s2, c2.s3);
                    c2.s3 = mad(a3.s3, b2.s3, c2.s3);
                    c2.s3 = mad(a3.s4, b2.s4, c2.s3);
                    c2.s3 = mad(a3.s5, b2.s5, c2.s3);
                    c2.s3 = mad(a3.s6, b2.s6, c2.s3);
                    c2.s3 = mad(a3.s7, b2.s7, c2.s3);
                    c3.s3 = mad(a3.s0, b3.s0, c3.s3);
                    c3.s3 = mad(a3.s1, b3.s1, c3.s3);
                    c3.s3 = mad(a3.s2, b3.s2, c3.s3);
                    c3.s3 = mad(a3.s3, b3.s3, c3.s3);
                    c3.s3 = mad(a3.s4, b3.s4, c3.s3);
                    c3.s3 = mad(a3.s5, b3.s5, c3.s3);
                    c3.s3 = mad(a3.s6, b3.s6, c3.s3);
                    c3.s3 = mad(a3.s7, b3.s7, c3.s3);
                    /* ---------------------- */
                }
                if( itemId.x == 0 ) {
                    for( k0 = kMax; (k0 < currM+32)&&(k0 < M); k0 += 1 ) {
                        coord.z=k0;
                        /* -- Tiles multiplier -- */
                        const uint bk = coord.z % M;
    
                        b0.s0 = (Bg).f[mad24(coord.x % N, ldb, bk)];
                        b1.s0 = (Bg).f[mad24((coord.x + 1) % N, ldb, bk)];
                        b2.s0 = (Bg).f[mad24((coord.x + 2) % N, ldb, bk)];
                        b3.s0 = (Bg).f[mad24((coord.x + 3) % N, ldb, bk)];
    
                        b0.s0 = (coord.z < M) ? b0.s0 : 0;
                        b1.s0 = (coord.z < M) ? b1.s0 : 0;
                        b2.s0 = (coord.z < M) ? b2.s0 : 0;
                        b3.s0 = (coord.z < M) ? b3.s0 : 0;
    
                        const uint ak = coord.z % M;
    
                        a0.s0 = (Ag).f[mad24(coord.y % M, lda, ak)];
                        a1.s0 = (Ag).f[mad24((coord.y + 1) % M, lda, ak)];
                        a2.s0 = (Ag).f[mad24((coord.y + 2) % M, lda, ak)];
                        a3.s0 = (Ag).f[mad24((coord.y + 3) % M, lda, ak)];
    
                        a0.s0 = (coord.z < M) ? a0.s0 : 0;
                        a1.s0 = (coord.z < M) ? a1.s0 : 0;
                        a2.s0 = (coord.z < M) ? a2.s0 : 0;
                        a3.s0 = (coord.z < M) ? a3.s0 : 0;
                        // post fetch A
                        {
                            uint zy = coord.y;
                            a0.s0 = zy < coord.z ? 0 : a0.s0;
                            a0.s0 = zy == coord.z ? 1 : a0.s0;
                            zy++;
                            a1.s0 = zy < coord.z ? 0 : a1.s0;
                            a1.s0 = zy == coord.z ? 1 : a1.s0;
                            zy++;
                            a2.s0 = zy < coord.z ? 0 : a2.s0;
                            a2.s0 = zy == coord.z ? 1 : a2.s0;
                            zy++;
                            a3.s0 = zy < coord.z ? 0 : a3.s0;
                            a3.s0 = zy == coord.z ? 1 : a3.s0;
                        }
    
                        c0.s0 = mad(a0.s0, b0.s0, c0.s0);
                        c1.s0 = mad(a0.s0, b1.s0, c1.s0);
                        c2.s0 = mad(a0.s0, b2.s0, c2.s0);
                        c3.s0 = mad(a0.s0, b3.s0, c3.s0);
    
                        c0.s1 = mad(a1.s0, b0.s0, c0.s1);
                        c1.s1 = mad(a1.s0, b1.s0, c1.s1);
                        c2.s1 = mad(a1.s0, b2.s0, c2.s1);
                        c3.s1 = mad(a1.s0, b3.s0, c3.s1);
    
                        c0.s2 = mad(a2.s0, b0.s0, c0.s2);
                        c1.s2 = mad(a2.s0, b1.s0, c1.s2);
                        c2.s2 = mad(a2.s0, b2.s0, c2.s2);
                        c3.s2 = mad(a2.s0, b3.s0, c3.s2);
    
                        c0.s3 = mad(a3.s0, b0.s0, c0.s3);
                        c1.s3 = mad(a3.s0, b1.s0, c1.s3);
                        c2.s3 = mad(a3.s0, b2.s0, c2.s3);
                        c3.s3 = mad(a3.s0, b3.s0, c3.s3);
                        /* ---------------------- */
                    }
                }
            }
            barrier(CLK_GLOBAL_MEM_FENCE);
            if ((coord.y + 4 == M) && (M % 4)) {
                coord.y += 4 - M % 4;
            }
            if ((coord.x + 4 == N) && (N % 4)) {
                coord.x += 4 - N % 4;
            }
    
            //-----MergeUpdateResult
    
            // veclenC scratch[SUBG_ITEMS*MSTEP_SUBG*vecNumC]
            __local float4 ascratch[4*16*4];
            __local float4 *scratch = ascratch;
    
            //LDS block has the same vectorization as C matrix block
            //VNUM_C*((get_local_id(1)%MSTEP_SUBG)*SUBG_ITEMS +get_local_id(0) );
            scratch += 4*((itemId.y%16)*4 +itemId.x );
    
            for( uint mstep = 0; mstep < 16; mstep += 16 ) {
    
                if( (itemId.y >= mstep)&&(itemId.y < (mstep+16)) ) {
    
                    scratch[0] = c0;
                    scratch[1] = c1;
                    scratch[2] = c2;
                    scratch[3] = c3;
    
                    c0 = 0;
                    c1 = 0;
                    c2 = 0;
                    c3 = 0;
    
                }
    
                barrier(CLK_LOCAL_MEM_FENCE);
    
                if( (itemId.y >= mstep)&&(itemId.y < (mstep+16)) ) {
                    if ( 0 == itemId.x ) {
    
                        for(uint k = 0; k < 4 * 4; k += 4) {
    
                            c0 += scratch[0];
                            c1 += scratch[1];
                            c2 += scratch[2];
                            c3 += scratch[3];
    
                            //Adding the LDS block size in vectors
                            scratch += 4;
                        }
    
                        if ((coord.y < M) && (coord.x < N)) {
                            uint y = min(4u, M - (uint)coord.y);
                            uint x = min(4u, N - (uint)coord.x);
                            if ((y == 4) && (x == 4)) {
                                GPtr uC;
    
                                uC.f = C + coord.x * ldb + coord.y;
    
                                __global float *pC = uC.f;
    
                                float4 tempC0, tempC1, tempC2, tempC3;
    
                                tempC0 = mad(c0, alpha, 0);
                                tempC1 = mad(c1, alpha, 0);
                                tempC2 = mad(c2, alpha, 0);
                                tempC3 = mad(c3, alpha, 0);
                                pC[0] = tempC0.s0;
                                pC[1] = tempC0.s1;
                                pC[2] = tempC0.s2;
                                pC[3] = tempC0.s3;
                                pC[ldb] = tempC1.s0;
                                pC[ldb + 1] = tempC1.s1;
                                pC[ldb + 2] = tempC1.s2;
                                pC[ldb + 3] = tempC1.s3;
                                pC[(ldb << 1)] = tempC2.s0;
                                pC[mad24(2u, ldb, 1u)] = tempC2.s1;
                                pC[mad24(2u, ldb, 2u)] = tempC2.s2;
                                pC[mad24(2u, ldb, 3u)] = tempC2.s3;
                                pC[mad24(3u, ldb, 0u)] = tempC3.s0;
                                pC[mad24(3u, ldb, 1u)] = tempC3.s1;
                                pC[mad24(3u, ldb, 2u)] = tempC3.s2;
                                pC[mad24(3u, ldb, 3u)] = tempC3.s3;
                            }
                            else  {
                                GPtr uC;
                                int i, j;
                                PPtr res;
    
                                uC.f = C + coord.x * ldb + coord.y;
    
                                uC.f += (x-1) * ldb;
    
                                if (x)  {
                                    switch (y) {
                                        case 4:
                                        uC.f[(y+0) % 4] = c3.s0 * alpha;
                                        case 3:
                                        uC.f[(y+1) % 4] = c3.s1 * alpha;
                                        case 2:
                                        uC.f[(y+2) % 4] = c3.s2 * alpha;
                                        case 1:
                                        uC.f[(y+3) % 4] = c3.s3 * alpha;
                                    }
                                    uC.f -= ldb;
                                    x--;
                                }
                                if (x)  {
                                    switch (y) {
                                        case 4:
                                        uC.f[(y+0) % 4] = c2.s0 * alpha;
                                        case 3:
                                        uC.f[(y+1) % 4] = c2.s1 * alpha;
                                        case 2:
                                        uC.f[(y+2) % 4] = c2.s2 * alpha;
                                        case 1:
                                        uC.f[(y+3) % 4] = c2.s3 * alpha;
                                    }
                                    uC.f -= ldb;
                                    x--;
                                }
                                if (x)  {
                                    switch (y) {
                                        case 4:
                                        uC.f[(y+0) % 4] = c1.s0 * alpha;
                                        case 3:
                                        uC.f[(y+1) % 4] = c1.s1 * alpha;
                                        case 2:
                                        uC.f[(y+2) % 4] = c1.s2 * alpha;
                                        case 1:
                                        uC.f[(y+3) % 4] = c1.s3 * alpha;
                                    }
                                    uC.f -= ldb;
                                    x--;
                                }
                                if (x)  {
                                    switch (y) {
                                        case 4:
                                        uC.f[(y+0) % 4] = c0.s0 * alpha;
                                        case 3:
                                        uC.f[(y+1) % 4] = c0.s1 * alpha;
                                        case 2:
                                        uC.f[(y+2) % 4] = c0.s2 * alpha;
                                        case 1:
                                        uC.f[(y+3) % 4] = c0.s3 * alpha;
                                    }
                                    uC.f -= ldb;
                                    x--;
                                }
                            }
                        }
    
                    }
                }
                barrier(CLK_LOCAL_MEM_FENCE);
            }
            currM -= 32;
        }
    }
    
    
    
    --------------------------------------------------------
    
    Build log:
    
    /tmp/comgr-a1a18b/input/CompileCLSource:56:16: warning: initializing '__global float *' with an expression of type 'const __global float *restrict' discards qualifiers
        GPtr Ag = {A};
                   ^
    /tmp/comgr-a1a18b/input/CompileCLSource:57:16: warning: initializing '__global float *' with an expression of type 'const __global float *restrict' discards qualifiers
        GPtr Bg = {B};
                   ^
    /tmp/comgr-a1a18b/input/CompileCLSource:366:24: error: variables in the local address space can only be declared in the outermost scope of a kernel function
            __local float4 ascratch[4*16*4];
                           ^
    2 warnings and 1 error generated.
    Error: Failed to compile opencl source (from CL to LLVM IR).
    
    ========================================================
    
    Segmentation fault (core dumped)
    
    

    Additionally, several tests fail, e.g.

    [ RUN      ] ColumnMajor_SmallRange_BigLDA_OffSet/GEMM.zgemm/39
    m : 6    n: 63
    /home/robin/dev/clBLAS/src/tests/include/matrix.h:472: Failure
    The difference between ((ref).s[0]) and ((clresult).s[0]) is 58230133, which exceeds delta, where
    ((ref).s[0]) evaluates to -270497230451976,
    ((clresult).s[0]) evaluates to -270497288682109, and
    delta evaluates to 0.
    clblasColumnMajor, clblasTrans, clblasTrans, M = 128, N = 128, K = 128, offA = 1, offB = 0, offC = 0, lda = 500, ldb = 501, ldc = 502
                 seed = 12345, queues = 1, [  FAILED  ] ColumnMajor_SmallRange_BigLDA_OffSet/GEMM.zgemm/39, where GetParam() = (1, 1, 1, 128, 128, 128, 48-byte object <F4-01 00-00 00-00 00-00 F5-01 00-00 00-00 00-00 F6-01 00-00 00-00 00-00 01-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00>, 1) (9 ms)
    
    opened by robinchrist 0
Releases(v2.12)
  • v2.12(Jan 18, 2017)

    • Fixes to AutoGemm with beta == 0 and removing trtri hard coded options (@pavanky,@shehzan10)
    • test-functional passes all tests
    • Fixes to compile with clang 3.7 (@iotamudelta)
    • Fixes to dtrsm (@shehzan10)
    • gcc compiler flags refactored for non-x86 machines (@psyhtest)
    • Travis and Appveyor fixes (@haahh)
    • Added TLS memory and removed global cl_kernel objects to fix threading issues
    • improvements to the unit tests to increase reliability comparing floating point values
    • Add opencl device selection logic for test correctness programs
    • Removed the -pedantic flag from gcc builds to reduce the amount of warnings
    • Support for altivec on powerpc64 P8 systems (@tfauck)
    • Fixes to syr2 (@mgates3)
    Source code(tar.gz)
    Source code(zip)
    clBLAS-2.12.0-Linux-x64.tar.gz(8.60 MB)
    clBLAS-2.12.0-Windows-x64.zip(4.65 MB)
  • v2.10(Jan 6, 2016)

    This clBLAS release is tagged as v2.10 and is part of AMD Compute Libraries (ACL) 1.0 GA. This release is based on merge from develop branch to master branch.

    • AutoGemm now contains optimized parameters for Fiji GPUs with HBM (High-Bandwidth Memory) as well as optimized parameters for non-HBM devices, such as Hawaii, from release 2.8. The selection of logic can be done in CMake.
    • Many bug fixes, including:
      • Restore ability to use multiple different devices (not concurrently) via different contexts.
      • AutoGemm works with Python 2 and 3.
      • Better memory cleanup during teardown.

    Thank you to the following contributors for this release: @pavanky , @shehzan10 , @hughperkins , @ghisvail , @notorca

    • The release binaries are online compiled only, assuming OpenCL 2.0 compiler. The ASIC name (Hawaii or Fiji) in the binary titles indicates the kernel selection logic used to generate the binary; use the Fiji version for Fiji only (due to HBM) and use the Hawaii version for all other (non-HBM) GPUs.
    Source code(tar.gz)
    Source code(zip)
    clBLAS-2.10.0-Fiji-Linux-x64-CL2.0.tar.gz(13.98 MB)
    clBLAS-2.10.0-Fiji-Windows-x64.zip(5.45 MB)
    clBLAS-2.10.0-Hawaii-Linux-x64-CL2.0.tar.gz(14.05 MB)
    clBLAS-2.10.0-Hawaii-Windows-x64.zip(5.47 MB)
  • v2.8(Oct 19, 2015)

    This clBLAS release is tagged as v2.8 is part of AMD Compute Libraries (ACL) 1.0 beta 2. This release is based on merge from develop branch to master branch.

    The highlights of the release:

    • Introduced AutoGemm, the new high-performing GEneric Matrix Matrix multiplication (GEMM) backend for clBLAS, is a suite of Python scripts which:
      • generates thousands of optimized GEMM OpenCL kernels
      • benchmarks these kernels for a particular GPU and different matrix sizes to determine which are the fastest
      • automatically chooses the optimal kernel within clBLAS for peak performance
      • allows applications with unique GEMM requirements (such as very small or very skinny matrices) to generate customized application-specific GEMM kernels for additional performance.
    • Incorporated new faster DTRSM algorithm that:
      • enable the use of more hardware friendly algorithm for both online and offline compilation
      • leverages the DGEMM performance improvement from AutoGemm
    • MISC
      • fixes SGEMM performance drop at big multiples of 1024
      • fixes DGEMM performance drop at big sizes (ranging from 18000 by 18000 to 36000 by 36000)
      • supports Visual Studio 2015
      • adds CI support of Windows and Mac OS
    Source code(tar.gz)
    Source code(zip)
    clBLAS-2.8.0-Linux-x64.tar.gz(11.92 MB)
    clBLAS-2.8.0-Windows-x64.zip(5.28 MB)
  • v2.6(Aug 6, 2015)

    This clBLAS release is tagged as v2.6 is part of AMD Compute Libraries (ACL) 1.0 beta 1. This release is based on merge from develop branch to master branch.

    The highlights of the release:

    • Introduced offline kernel compilation
    • Improved performance (with offline kernel compilation) of
      • sgemm small matrices NN, TN, NT
      • sgemm large matrices NN, TN, NT
      • zgemm large matrices NT for m,n,k multiples of 32,64,8 respectively
      • dtrsm large matrices for m,n multiples of 192
    • Incorporated some CMake configuration changes
    • Released binaries now includes offline compiled library for certain device and driver.
      • "clBLAS-2.6.0-Windows-x64-Hawaii-14502.zip" is a binary built for Hawaii device with 14.502 driver on Windows platform
      • Binary built for Hawaii device with 14.502 driver on Linux platform will be released once the driver is released
      • (update 08/06/2015) "clBLAS-2.6.0-Linux-x64-Hawaii-14502.tar.gz" is a binary build for Hawaii device with 14.502 driver on Linux platform
    Source code(tar.gz)
    Source code(zip)
    clBLAS-2.6.0-Linux-x64-Hawaii-14502.tar.gz(57.20 MB)
    clBLAS-2.6.0-Linux-x64.tar.gz(38.85 MB)
    clBLAS-2.6.0-Windows-x64-Hawaii-14502.zip(9.34 MB)
    clBLAS-2.6.0-Windows-x64.zip(4.63 MB)
  • v2.4(Apr 13, 2015)

    Release based on merge from develop branch to master branch. The highlights of the merge:

    • fix correctness bug of c/zsyr2k; fix correctness bug in ktest's reference code
    • improve tuning tool coverage
    • allow another parent CMake project to call clBLAS as subdirectory (thanks to contributions from @robertsdionne )
    • bug fix related to Intel CPU (thanks for contributions from @pavanky )
    • bug fix related to Intel OpenCL driver (thanks for contributions from @pavanky )
    • bug fix related to Intel SDKs on Windows, Apple SDKs on OSX (thanks for contributions from @pavanky )
    • enable build static library (thanks for contributions from @glehmann )
    • some installation and prefix fixes (thanks for contributions from @glehmann )
    • allow user to build gtest from source (thanks for contributions from @glehmann )
    Source code(tar.gz)
    Source code(zip)
    clBLAS-2.4.0-Linux-x32.tar.gz(10.27 MB)
    clBLAS-2.4.0-Linux-x64.tar.gz(10.34 MB)
    clBLAS-2.4.0-Windows-x32.zip(3.47 MB)
    clBLAS-2.4.0-Windows-x64.zip(4.59 MB)
  • v2.2(Jun 20, 2014)

    The /develop branch has seen improvements and bug fixes since the source posted on github, and it was time to merge that activity into /master. The highlights of the merge:

    • The code can compile and run tests on the MacOSX platform (thanks to contributions from @gicmo & @abergeron)
    • Hang fixed in hemm & symm when tuning
    • Client program extended with an option to use *copyBufferRect() API to copy data
    • Support for vs2013 added
    • Proof-of-concept wrapper for clBLAS added with sgemm
    • Cmake improvements to detect and copy dependencies of targets
    • A staging directory was added to the build process to ease debugging during development

    and many other bug fixes. In addition, this release of clBLAS will provide binary packages for those who do not want to go through the steps of compiling the source on supported platforms. However, test dependencies are not packaged, and will need to be downloaded by the user. The clBLAS test programs depend on ACML.

    Source code(tar.gz)
    Source code(zip)
    clBLAS-2.2.0-Linux-x32.tar.gz(10.26 MB)
    clBLAS-2.2.0-Linux-x64.tar.gz(10.33 MB)
    clBLAS-2.2.0-Windows-x32.zip(3.46 MB)
    clBLAS-2.2.0-Windows-x64.zip(4.58 MB)
  • v2.0(Aug 13, 2013)

    This release is the open-sourcing of the APPML clAmdBlas project. It provides the complete set of BLAS level 1, 2 and 3 routines, written with an API that exposes OpenCL objects to allow the library developer to maximize performance by controlling the OpenCL state.

    The version number of the clBLAS project is starting at v2.0, to distinguish it from the closed source clAmdBlas project. All the API's have been changed to provide a vendor neutral naming scheme, and a new clBLAS.h header file has been introduced.

    The original clAmdBlas.h header file has been heavily modified to provide backwards compatibility for clAmdBlas users transitioning to clBLAS. It is a 'wrapper' header around clBLAS.h and users should convert to the new header file at earliest convenience.

    Source code(tar.gz)
    Source code(zip)
Owner
null
Lightweight OpenCL-Wrapper to greatly simplify OpenCL software development with C++ while keeping functionality and performance

OpenCL-Wrapper OpenCL is the most powerful programming language ever created. Yet the OpenCL C++ bindings are very cumbersome and the code overhead pr

Moritz Lehmann 52 Sep 29, 2022
Experimental OpenCL SPIR-V to OpenCL C translator

spirv2clc spirv2clc is an experimental OpenCL SPIR-V to OpenCL C translator currently targeting OpenCL 1.2 support. It can generate OpenCL C code equi

Kévin Petit 19 Oct 1, 2022
BLAS-like Library Instantiation Software Framework

Contents Introduction Education and Learning What's New What People Are Saying About BLIS Key Features How to Download BLIS Getting Started Example Co

null 1.4k Sep 25, 2022
An open source standard C library that includes useful functions && (Reimplementation of libc functions + own functions).

?? LIBFT-42 : Artistic view of LIBC: ?? HOW DOES IT FEEL HAVING YOUR OWN LIB: SUBJECT : ENGLISH PDF ℹ️ What is LIBFT : This project aims to code a C l

Abdessamad Laamimi 11 Sep 21, 2022
OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version.

OpenBLAS Travis CI: AppVeyor: Drone CI: Introduction OpenBLAS is an optimized BLAS (Basic Linear Algebra Subprograms) library based on GotoBLAS2 1.13

Zhang Xianyi 4.8k Oct 2, 2022
Using PLT trampolines to provide a BLAS and LAPACK demuxing library.

libblastrampoline All problems in computer science can be solved by another level of indirection Using PLT trampolines to provide a BLAS and LAPACK de

Elliot Saba 50 Sep 30, 2022
C library containing useful base64 related functions.

b64 C library containing useful fast base64 related functions. Usage Get the library: It's very simple to use it, run the "build_lib.sh" shell script

null 5 May 8, 2022
(for Casio ClassPad II fx-CP400, with hollyhock) tech demo thing containing useful functions

here's ponggers2, mainly a tech demo to show off the trig functions, fps display, texture and font conversion, texture and text (with the custom fonts

InterChan 3 Jan 15, 2022
Libft is an individual project at 42 that requires us to re-create some standard C library functions including some additional ones that can be used later to build a library of useful functions for the rest of the program.

?? Index What is Libft? List of Functions Technologies ✨ What is Libft? Libft is an individual project at 42 that requires us to re-create some standa

Paulo Rafael Ramalho 7 Jan 17, 2022
Libft is an individual project at 42 that requires us to re-create some standard C library functions including some additional ones that can be used later to build a library of useful functions for the rest of the program.

Libft is an individual project at 42 that requires us to re-create some standard C library functions including some additional ones that can be used later to build a library of useful functions for the rest of the program.

Paulo Rafael Ramalho 0 Apr 5, 2022
A C++ GPU Computing Library for OpenCL

Boost.Compute Boost.Compute is a GPU/parallel-computing library for C++ based on OpenCL. The core library is a thin C++ wrapper over the OpenCL API an

Boost.org 1.3k Sep 29, 2022
OpenCL based GPU accelerated SPH fluid simulation library

libclsph An OpenCL based GPU accelerated SPH fluid simulation library Can I see it in action? Demo #1 Demo #2 Why? Libclsph was created to explore the

null 47 Jul 27, 2022
VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP

VexCL VexCL is a vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++. VexCL strives to redu

Denis Demidov 678 Sep 2, 2022
A C++ GPU Computing Library for OpenCL

Boost.Compute Boost.Compute is a GPU/parallel-computing library for C++ based on OpenCL. The core library is a thin C++ wrapper over the OpenCL API an

Boost.org 1.3k Oct 2, 2022
A C++ GPGPU OpenCL library for Android and Unix systems.

CLImage A Modern Approach to using OpenCL with C++ on Android and Unix. Fabio Riccardi Glass Imaging, Inc. [email protected] Introduction CLImag

Glass Imaging 8 Sep 19, 2022
A small C OpenCL wrapper

oclkit, plain and stupid OpenCL helper oclkit is a small set of C functions, to avoid writing the same OpenCL boiler plate over and over again, yet ke

Matthias Vogelgesang 15 Jul 22, 2022
Ethereum miner with OpenCL, CUDA and stratum support

Ethminer is an Ethash GPU mining worker: with ethminer you can mine every coin which relies on an Ethash Proof of Work thus including Ethereum, Ethereum Classic, Metaverse, Musicoin, Ellaism, Pirl, Expanse and others. This is the actively maintained version of ethminer. It originates from cpp-ethereum project (where GPU mining has been discontinued) and builds on the improvements made in Genoil's fork. See FAQ for more details.

null 5.9k Oct 4, 2022
DLPrimitives/OpenCL out of tree backend for pytorch

Pytorch OpenCL backend based on dlprimitives DLPrimitives-OpenCL out of tree backend for pytorch It is only beginning, but you can train some vision n

Artyom Beilis 68 Sep 29, 2022
A heterogeneous OpenCL implementation of AutoDock Vina

Vina-GPU A heterogeneous OpenCL implementation of AutoDock Vina Compiling and Running Note: at least one GPU card is required and make sure the versio

Nanjing University of Posts and Telecommunications 37 Sep 26, 2022