OpenBLAS is an optimized BLAS (Basic Linear Algebra Subprograms) library based on GotoBLAS2 1.13 BSD version.

Please read the documentation on the OpenBLAS wiki pages.

For a general introduction to the BLAS routines, please refer to the extensive documentation of their reference implementation hosted at netlib. On that site you will likewise find documentation for the reference implementation of the higher-level library LAPACK, the Linear Algebra Package that comes included with OpenBLAS. If you are looking for a general primer or refresher on linear algebra, the set of six 20-minute lecture videos by Prof. Gilbert Strang on either MIT OpenCourseWare or YouTube may be helpful.

Binary Packages

We provide official binary packages for the following platform:

  • Windows x86/x86_64

You can download them from the project's file hosting.

Installation from Source

Download from the project homepage, or check out the code using Git. (If you want the most up to date version, be sure to use the develop branch; master is several years out of date due to a change of maintainership.) Build-time parameters can be chosen in Makefile.rule; see there for a short description of each option. Most can also be given directly on the make or cmake command line.


Building OpenBLAS requires the following to be installed:

  • GNU Make
  • A C compiler, e.g. GCC or Clang
  • A Fortran compiler (optional, for LAPACK)
  • IBM MASS (optional, see below)

Normal compile

Simply invoking make (or gmake on BSD) will detect the CPU automatically. To set a specific target CPU, use make TARGET=xxx, e.g. make TARGET=NEHALEM. The full target list is in the file TargetList.txt. For building with cmake, the usual conventions apply, i.e. create a build directory either underneath the toplevel OpenBLAS source directory or separate from it, and invoke cmake there with the path to the source tree and any build options you plan to set.
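As a rough sketch of the cmake workflow just described, the helper below creates a build directory underneath the source tree, then configures and builds there. The function name build_openblas_cmake and the directory name build are illustrative choices, not part of OpenBLAS. It assumes cmake is installed and that the function is called from the top-level OpenBLAS source directory.

```shell
# Hypothetical helper (not shipped with OpenBLAS): out-of-source cmake build.
# Call it from the top-level OpenBLAS source directory.
# $1: target CPU name from TargetList.txt (NEHALEM is only an example default).
# The body runs in a subshell, so the caller's working directory is unchanged.
build_openblas_cmake() (
    target="${1:-NEHALEM}"
    mkdir -p build &&
    cd build &&
    cmake -DTARGET="$target" .. &&
    cmake --build .
)
```

Calling build_openblas_cmake HASWELL, for example, would configure the build for the Haswell target. Further build options can be added to the first cmake line as additional -D settings.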

Cross compile

Set CC and FC to point to the cross toolchains, and set HOSTCC to your host C compiler. The target must be specified explicitly when cross compiling.


  • On an x86 box, compile this library for a Loongson 3A CPU:

    make BINARY=64 CC=mips64el-unknown-linux-gnu-gcc FC=mips64el-unknown-linux-gnu-gfortran HOSTCC=gcc TARGET=LOONGSON3A

    or the same with the newer MIPS cross-compiler put out by Loongson, which defaults to the 32-bit ABI:

    make HOSTCC=gcc CC='/opt/mips-loongson-gcc7.3-linux-gnu/2019.06-29/bin/mips-linux-gnu-gcc -mabi=64' FC='/opt/mips-loongson-gcc7.3-linux-gnu/2019.06-29/bin/mips-linux-gnu-gfortran -mabi=64' TARGET=LOONGSON3A
  • On an x86 box, compile this library for a Loongson 3A CPU with the loongcc (based on Open64) compiler:

    make CC=loongcc FC=loongf95 HOSTCC=gcc TARGET=LOONGSON3A CROSS=1 CROSS_SUFFIX=mips64el-st-linux-gnu- NO_LAPACKE=1 NO_SHARED=1 BINARY=32

Debug version

A debug version can be built using make DEBUG=1.

Compile with MASS support on Power CPU (optional)

The IBM MASS library consists of a set of mathematical functions for C, C++, and Fortran applications that are tuned for optimum performance on POWER architectures. OpenBLAS with MASS requires a 64-bit, little-endian OS on POWER. The library can be installed as shown:

  • On Ubuntu:

    wget -q -O- | sudo apt-key add -
    echo "deb trusty main" | sudo tee /etc/apt/sources.list.d/ibm-xl-compiler-eval.list
    sudo apt-get update
    sudo apt-get install libxlmass-devel.8.1.5
  • On RHEL/CentOS:

    sudo rpm --import repomd.xml.key
    sudo cp ibm-xl-compiler-eval.repo /etc/yum.repos.d/
    sudo yum install libxlmass-devel.8.1.5

After installing the MASS library, compile OpenBLAS with USE_MASS=1. For example, to compile on Power8 with MASS support: make USE_MASS=1 TARGET=POWER8.

Install to a specific directory (optional)

Use PREFIX= when invoking make, for example

make install PREFIX=your_installation_directory

The default installation directory is /opt/OpenBLAS.

Supported CPUs and Operating Systems

Please read GotoBLAS_01Readme.txt for older CPU models already supported by the 2010 GotoBLAS.

Additional supported CPUs


x86/x86-64

  • Intel Xeon 56xx (Westmere): Uses GotoBLAS2 Nehalem codes.
  • Intel Sandy Bridge: Optimized Level-3 and Level-2 BLAS with AVX on x86-64.
  • Intel Haswell: Optimized Level-3 and Level-2 BLAS with AVX2 and FMA on x86-64.
  • Intel Skylake-X: Optimized Level-3 and Level-2 BLAS with AVX512 and FMA on x86-64.
  • AMD Bobcat: Uses GotoBLAS2 Barcelona codes.
  • AMD Bulldozer: x86-64 ?GEMM FMA4 kernels. (Thanks to Werner Saar)
  • AMD PILEDRIVER: Uses Bulldozer codes with some optimizations.
  • AMD STEAMROLLER: Uses Bulldozer codes with some optimizations.
  • AMD ZEN: Uses Haswell codes with some optimizations.


MIPS32

  • MIPS 1004K: uses P5600 codes
  • MIPS 24K: uses P5600 codes


MIPS64

  • ICT Loongson 3A: Optimized Level-3 BLAS and part of Level-1,2.
  • ICT Loongson 3B: Experimental


ARM

  • ARMv6: Optimized BLAS for vfpv2 and vfpv3-d16 (e.g. BCM2835, Cortex M0+)
  • ARMv7: Optimized BLAS for vfpv3-d32 (e.g. Cortex A8, A9 and A15)


ARM64

  • ARMv8: Basic ARMV8 with small caches, optimized Level-3 and Level-2 BLAS
  • Cortex A53: same as ARMV8 (different cpu specifications)
  • Cortex A57: Optimized Level-3 and Level-2 functions
  • Cortex A72: same as A57 (different cpu specifications)
  • Cortex A73: same as A57 (different cpu specifications)
  • Falkor: same as A57 (different cpu specifications)
  • ThunderX: Optimized some Level-1 functions
  • ThunderX2T99: Optimized Level-3 BLAS and parts of Levels 1 and 2
  • ThunderX3T110
  • TSV110: Optimized some Level-3 helper functions
  • EMAG 8180: preliminary support based on A57
  • Neoverse N1: (AWS Graviton2) preliminary support
  • Apple Vortex: preliminary support based on ARMV8


PPC/PPC64

  • POWER8: Optimized BLAS, only for PPC64LE (Little Endian), only with USE_OPENMP=1
  • POWER9: Optimized Level-3 BLAS (real) and some Level-1,2. PPC64LE with OpenMP only.
  • POWER10:

IBM zEnterprise System

  • Z13: Optimized Level-3 BLAS and Level-1,2
  • Z14: Optimized Level-3 BLAS and (single precision) Level-1,2


RISCV64

  • C910V: Optimized Level-3 BLAS (real) and Level-1,2 by RISC-V Vector extension 0.7.1.
    make HOSTCC=gcc TARGET=C910V CC=riscv64-unknown-linux-gnu-gcc FC=riscv64-unknown-linux-gnu-gfortran

Support for multiple targets in a single library

OpenBLAS can be built for multiple targets with runtime detection of the target cpu by specifying DYNAMIC_ARCH=1 in Makefile.rule, on the gmake command line or as -DDYNAMIC_ARCH=TRUE in cmake.

For x86_64, the list of targets this activates contains Prescott, Core2, Nehalem, Barcelona, Sandybridge, Bulldozer, Piledriver, Steamroller, Excavator, Haswell, Zen, SkylakeX. For cpu generations not included in this list, the corresponding older model is used. If you also specify DYNAMIC_OLDER=1, specific support for Penryn, Dunnington, Opteron, Opteron/SSE3, Bobcat, Atom and Nano is added. Finally, there is an option DYNAMIC_LIST that allows you to specify an individual list of targets to include instead of the default.

DYNAMIC_ARCH is also supported on x86, where it translates to Katmai, Coppermine, Northwood, Prescott, Banias, Core2, Penryn, Dunnington, Nehalem, Athlon, Opteron, Opteron_SSE3, Barcelona, Bobcat, Atom and Nano.

On ARMV8, it enables support for CortexA53, CortexA57, CortexA72, CortexA73, Falkor, ThunderX, ThunderX2T99, TSV110 as well as generic ARMV8 cpus.

For POWER, the list encompasses POWER6, POWER8 and POWER9; on ZARCH it comprises Z13 and Z14.

The TARGET option can be used in conjunction with DYNAMIC_ARCH=1 to specify which cpu model should be assumed for all the common code in the library. Usually you will want to set this to the oldest model you expect to encounter. Please note that it is not possible to combine support for different architectures, so no combined 32 and 64 bit or x86_64 and arm64 in the same library.
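A minimal sketch of such a multi-target build with make might look as follows. The function name build_openblas_dynamic is a made-up helper, and NEHALEM as the default baseline is only an example; pick the oldest model you actually need to support. It assumes the function is called from the top-level OpenBLAS source directory.

```shell
# Hypothetical helper (not shipped with OpenBLAS): build a single library
# that picks its kernels at runtime.
# Call it from the top-level OpenBLAS source directory.
# $1: baseline CPU model assumed for the common code, typically the
#     oldest model you expect to run on (NEHALEM is only an example).
build_openblas_dynamic() {
    baseline="${1:-NEHALEM}"
    make DYNAMIC_ARCH=1 TARGET="$baseline"
}
```

Appending DYNAMIC_OLDER=1 to the make line would add the older x86_64 models listed above.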

Supported OS


Usage

Statically link with libopenblas.a or dynamically link with -lopenblas if OpenBLAS was compiled as a shared library.
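The two linking styles could be wrapped as in the sketch below. The function names and the OPENBLAS_PREFIX variable are illustrative, not part of OpenBLAS. The prefix defaults to /opt/OpenBLAS, and the include and lib subdirectories are assumed to sit directly underneath it. The -lpthread on the static line is an assumption for a threaded build; depending on how the library was built, further system libraries (for example the Fortran runtime) may be required.

```shell
# Hypothetical helpers (not shipped with OpenBLAS) for linking one C file.
# OPENBLAS_PREFIX defaults to the default installation directory.
OPENBLAS_PREFIX="${OPENBLAS_PREFIX:-/opt/OpenBLAS}"

# Dynamic linking: requires OpenBLAS built as a shared library.
# $1: C source file; the output name drops the .c suffix.
# The library is named after the source file so the linker can resolve
# the symbols that file references.
link_openblas_shared() {
    gcc "$1" -o "${1%.c}" \
        -I"$OPENBLAS_PREFIX/include" -L"$OPENBLAS_PREFIX/lib" -lopenblas
}

# Static linking: name the archive directly.
# -lpthread is an assumption for a threaded build.
link_openblas_static() {
    gcc "$1" -o "${1%.c}" \
        -I"$OPENBLAS_PREFIX/include" "$OPENBLAS_PREFIX/lib/libopenblas.a" -lpthread
}
```

With the shared variant, the dynamic loader must also be able to find the library at run time, for instance through LD_LIBRARY_PATH on Linux if the prefix is not in a standard location.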

Setting the number of threads using environment variables

Environment variables are used to specify a maximum number of threads, for example OPENBLAS_NUM_THREADS, GOTO_NUM_THREADS or OMP_NUM_THREADS.

If you compile this library with USE_OPENMP=1, you should set the OMP_NUM_THREADS environment variable; OpenBLAS ignores OPENBLAS_NUM_THREADS and GOTO_NUM_THREADS when compiled with USE_OPENMP=1.
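In a POSIX shell, the thread limit could be set as sketched here before starting the program that uses OpenBLAS. The value 4 is an arbitrary example.

```shell
# Cap OpenBLAS at 4 threads for programs started from this shell.
# The value 4 is an arbitrary example.
export OPENBLAS_NUM_THREADS=4

# For a library built with USE_OPENMP=1, set this variable instead,
# since OPENBLAS_NUM_THREADS is ignored in that configuration.
export OMP_NUM_THREADS=4
```

Exported variables only affect processes launched afterwards from the same shell session.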

Setting the number of threads at runtime

We provide the following functions to control the number of threads at runtime:

void goto_set_num_threads(int num_threads);
void openblas_set_num_threads(int num_threads);

Note that these are only used once at library initialization, and are not available for fine-tuning thread numbers in individual BLAS calls. If you compile this library with USE_OPENMP=1, you should use the above functions too.

Reporting bugs

Please submit an issue in the project's issue tracker.


Change log

Please see Changelog.txt to view the differences between OpenBLAS and GotoBLAS2 1.13 BSD version.


Troubleshooting

  • Please read the FAQ first.
  • Please use GCC version 4.6 and above to compile Sandy Bridge AVX kernels on Linux/MinGW/BSD.
  • Please use Clang version 3.1 and above to compile the library on Sandy Bridge microarchitecture. Clang 3.0 will generate the wrong AVX binary code.
  • Please use GCC version 6 or LLVM version 6 and above to compile Skylake AVX512 kernels.
  • The number of CPUs/cores should be less than or equal to 256. On Linux x86_64 (amd64), there is experimental support for up to 1024 CPUs/cores and 128 numa nodes if you build the library with BIGNUMA=1.
  • OpenBLAS does not set processor affinity by default. On Linux, you can enable processor affinity by commenting out the line NO_AFFINITY=1 in Makefile.rule. However, note that this may cause a conflict with R parallel.
  • On Loongson 3A, make test may fail with a pthread_create error (EAGAIN). However, it will be okay when you run the same test case on the shell.


Contributing

  1. Check for open issues or open a fresh issue to start a discussion around a feature idea or a bug.
  2. Fork the OpenBLAS repository to start making your changes.
  3. Write a test which shows that the bug was fixed or that the feature works as expected.
  4. Send a pull request. Make sure to add yourself to


Please read this wiki page.

  • Compiler Failure on POWER8 with /kernel/power/i{c,d,s,z}a{min,max}.c

    I updated my installations on my POWER8 machine and since version 0.3.6 up to the current development branch, the build process fails with:

    $ make  MAKE_NB_JOBS=1
    cc -c -Ofast -mcpu=power8 -mtune=power8 -mvsx -malign-power -DUSE_OPENMP -fno-fast-math -fopenmp -DMAX_STACK_ALLOC=2048 -fopenmp -Wall -m64 -DF_INTERFACE_GFORT -fPIC -DSMP_SERVER -DUSE_OPENMP -DNO_WARMUP -DMAX_CPU_NUMBER=160 -DMAX_PARALLEL_NUMBER=1 -DVERSION=\"\" -DASMNAME=isamax_k -DASMFNAME=isamax_k_ -DNAME=isamax_k_ -DCNAME=isamax_k -DCHAR_NAME=\"isamax_k_\" -DCHAR_CNAME=\"isamax_k\" -DNO_AFFINITY -I.. -UDOUBLE  -UCOMPLEX -UCOMPLEX -UDOUBLE -DUSE_ABS  -UUSE_MIN ../kernel/power/isamax.c -o isamax_k.o
    ../kernel/power/isamax.c: In function ‘siamax_kernel_64’:
    ../kernel/power/isamax.c:288:1: internal compiler error: in build_int_cst_wide, at tree.c:1210
    Please submit a full bug report,
    with preprocessed source if appropriate.
    See <> for instructions.
    Preprocessed source stored into /tmp/ccriHyPV.out file, please attach this to your bugreport.
    make[1]: *** [isamax_k.o] Error 1
    make[1]: Leaving directory `/root/OpenBLAS/kernel'
    make: *** [libs] Error 1

    OS Details: CentOS 7.6 ppc64el, gcc 4.8.5, gfortran 4.8.5, IBM POWER8 LC822

    opened by grisuthedragon 132
  • OpenBLAS hangs when calling DGEMM after a Unix fork

    Here is a sample program that illustrates a problem when I want to use OpenBLAS with Unix fork (without exec):

    $ wget
    $ gcc -lopenblas -L/opt/OpenBLAS/lib -I/opt/OpenBLAS/include \
               openblas_fork.c -o openblas_fork

    Then calling GEMM with a 7x7 square matrix in the parent and child processes works without issue:

    $ ./openblas_fork 7
    computing for size 7
    c[0][0] is 7.000000
    parent d[0][0] is 7.000000
    child d[0][0] is 7.000000

    Doing the same for an 8x8 matrix causes the child process to never complete (while burning 100% of a CPU) until I kill it.

    Might this be caused by some static initialization step in the OpenBLAS runtime? If so, would it be possible during init to store the pid of the process and then later on check that the current process pid is still the same, and if not, reinit the runtime?

    Atlas, Apple vecLib and Intel MKL have no trouble with Unix forks. Hence this limitation (bug?) prevents OpenBLAS from being used as a drop-in replacement for those libraries. This is especially annoying when used with Python, which uses Unix forks in its standard library via the multiprocessing module. As Python threads suffer from a locking issue, many projects use multiprocessing as a workaround.

    opened by ogrisel 88
  • DGEMM regression on SkylakeX

    It looks like has created a regression in Julia's pinv() calculations on SkylakeX. In particular, creating a Hilbert matrix of size 1000 x 100 and asking for the pseudo-inverse now calculates the wrong thing:

    using LinearAlgebra
    function hilb(T::Type, m::Integer, n::Integer)
        a = Matrix{T}(undef, m, n)
        for i=1:n
            for j=1:m
                # Hilbert matrix entry: 1/(i+j-1)
                a[j,i] = one(T)/(i+j-one(T))
            end
        end
        return a
    end
    hilb(m::Integer, n::Integer) = hilb(Float64,m,n)
    a = hilb(1000, 100)
    apinv = pinv(a)

    Including the SkylakeX kernel gives the following answer:

    100×1000 Array{Float64,2}:
      2.57526e6   -2.33247e6    2.21848e6    2.19307e6   -4.13046e6   …  -4.71439e5   -6.80621e5   -6.56864e5   -8.6676e5    -3.86363e5
     -1.22338e11  -2.20992e11   2.36372e11  -1.14835e11  -9.1049e10       1.13475e10   8.51702e9    3.51379e9    1.54455e10   2.8167e8
      2.45922e11   3.06366e11  -3.45368e11   6.99788e10   1.34305e11     -2.0333e10   -1.72032e10  -6.35715e9   -3.15537e10   3.42079e9
     -1.98151e10  -5.04668e10   6.22131e9   -4.37235e10  -3.29137e9       2.4302e9     3.26304e9    2.4001e9     5.31362e9    1.42072e8
     -3.96966e11   1.59586e10  -1.05208e11  -3.27214e11   3.74498e11      5.9171e10    8.17795e10   7.30775e10   1.11105e11   3.58937e10
     -1.15417e11  -1.8089e10    3.02927e10  -7.71434e10   6.99771e10  …   1.55784e10   1.9823e10    1.69518e10   2.74274e10   8.50587e9
     -7.91383e11   3.19284e11  -5.46209e11  -6.27492e11   1.00493e12      1.26433e11   1.8369e11    1.69215e11   2.44706e11   8.48811e10
     -1.12133e11  -3.60367e10  -1.90203e8   -9.61073e10   7.40389e10      1.55288e10   2.07537e10   1.779e10     2.90644e10   7.91371e9
      4.19346e11  -2.37896e11   3.70455e11   3.44512e11  -6.0224e11      -6.9822e10   -1.03364e11  -9.66593e10  -1.36339e11  -4.94829e10
     -9.51913e9   -1.29776e11   1.82371e11   1.45922e10  -1.32419e11     -3.82507e9   -1.03485e10  -1.22032e10  -1.15361e10  -6.94914e9
      2.81647e12  -3.70604e11   9.33114e11   2.35652e12  -2.88098e12  …  -4.29963e11  -5.98574e11  -5.4097e11   -8.06547e11  -2.73127e11
      2.23042e11  -1.50478e11   1.87991e11   1.81127e11  -3.31279e11     -3.78415e10  -5.5537e10   -5.24234e10  -7.25081e10  -2.8155e10
      8.06723e11  -3.62461e11   6.16052e11   6.57412e11  -1.0719e12      -1.30794e11  -1.91569e11  -1.77425e11  -2.54474e11  -8.93792e10
      5.53103e10  -1.38396e10  -1.52335e10   3.90026e10  -4.70444e10     -8.49163e9   -1.06612e10  -9.72492e9   -1.3894e10   -6.12903e9
      9.43701e11  -5.29597e11   7.95135e11   7.22863e11  -1.31683e12     -1.54868e11  -2.28251e11  -2.12445e11  -3.0164e11   -1.08025e11
     -4.21561e11   1.74302e10  -1.05198e11  -3.6234e11    4.02288e11  …   6.38291e10   8.7825e10    7.88592e10   1.18726e11   3.97235e10
      ⋮                                                               ⋱   ⋮
      6.68906e10  -6.53051e9    3.21447e10   4.33472e10  -6.42927e10     -9.61282e9   -1.3713e10   -1.20454e10  -1.88794e10  -5.21562e9
     -4.3416e10    9.89228e9   -1.06994e10  -3.73798e10   4.63849e10  …   6.77796e9    9.31398e9    8.54109e9    1.23867e10   4.64291e9
     -3.36704e10  -2.19847e10   3.36634e10  -2.54336e10   4.56403e9       4.11065e9    4.50039e9    3.55012e9    6.46187e9    1.91442e9
      1.00752e11  -9.78859e10   1.5892e11    7.5809e10   -1.8612e11      -1.78414e10  -2.82659e10  -2.68784e10  -3.70109e10  -1.31011e10
     -6.24935e11   1.72017e11  -3.18826e11  -5.14607e11   7.21608e11      9.79871e10   1.39412e11   1.27482e11   1.86508e11   6.45552e10
     -3.56138e11   4.75922e10  -1.07088e11  -2.96353e11   3.60894e11      5.43011e10   7.52645e10   6.80322e10   1.01325e11   3.46671e10
      1.41477e11  -1.55157e11   2.42801e11   1.08982e11  -2.78805e11  …  -2.57232e10  -4.11459e10  -3.94458e10  -5.35713e10  -1.95093e10
     -1.71703e11   6.41864e9   -2.29926e10  -1.45472e11   1.56983e11      2.57486e10   3.4882e10    3.13021e10   4.71032e10   1.62219e10
     -1.57418e11  -2.50531e10   1.42196e10  -1.16675e11   1.07792e11      2.18507e10   2.86411e10   2.47383e10   3.9572e10    1.19948e10
     -5.09849e11   1.45783e11  -2.73511e11  -3.97073e11   5.85682e11      7.9147e10    1.13045e11   1.02973e11   1.5164e11    5.11554e10
     -2.01401e11   9.45476e10  -1.55585e11  -1.48556e11   2.6349e11       3.21347e10   4.71454e10   4.34308e10   6.28048e10   2.14634e10
      6.55703e11  -1.91764e11   3.65181e11   5.46763e11  -7.75768e11  …  -1.03493e11  -1.481e11    -1.35717e11  -1.97988e11  -6.84949e10
      4.39019e11  -7.20111e10   1.66746e11   3.76386e11  -4.6781e11      -6.78667e10  -9.50602e10  -8.63599e10  -1.27719e11  -4.38774e10
     -1.35314e11   1.43088e11  -2.17464e11  -8.70376e10   2.51535e11      2.36837e10   3.76675e10   3.57591e10   4.93057e10   1.72898e10
      2.95696e10   5.86712e10  -7.47343e10   2.73326e10   3.03537e10     -2.52465e9   -1.2501e9    -4.13464e7   -2.64329e9    3.99575e7
     -7.70318e10  -3.93945e10   4.41432e10  -8.97948e10   4.01424e10      1.12653e10   1.37315e10   1.20607e10   1.87392e10   7.00317e9

    Excluding the SkylakeX kernel (e.g. reverting to 544b069e85254d41699afde16e2e81c123cb5f28) gives the result:

    100×1000 Array{Float64,2}:
         112.527      -6192.3         1.06925e5   -8.28373e5    3.21394e6   -6.01292e6   …  -2.99287e5   -3.02032e5   -3.04795e5   -3.07576e5
       -6305.8            4.64899e5  -9.07773e6    7.54681e7   -3.07426e8    5.99356e8       3.28027e7    3.31027e7    3.34047e7    3.37085e7
           1.1309e5      -9.42656e6   1.9735e8    -1.71896e9    7.25526e9   -1.46068e10     -8.71604e8   -8.79551e8   -8.8755e8    -8.95596e8
          -9.32272e5      8.33527e7  -1.82785e9    1.64741e10  -7.1497e10    1.47819e11      9.57181e9    9.65882e9    9.74639e9    9.83447e9
           3.98657e6     -3.73868e8   8.48896e9   -7.86389e10   3.49436e11  -7.39605e11     -5.19571e10  -5.24279e10  -5.29016e10  -5.33781e10
          -8.8007e6       8.57783e8  -2.00715e10   1.90643e11  -8.66324e11   1.8764e12   …   1.44167e11   1.45468e11   1.46778e11   1.48094e11
           7.90418e6     -8.06081e8   1.95704e10  -1.91875e11   8.97794e11  -2.00535e12     -1.75621e11  -1.77197e11  -1.78783e11  -1.80377e11
           2.40961e6     -2.157e8     4.61896e9   -3.98835e10   1.62513e11  -3.0448e11      -1.76662e9   -1.79326e9   -1.82037e9   -1.84804e9
          -5.54279e6      5.75778e8  -1.41936e10   1.40992e11  -6.67618e11   1.50955e12      1.37796e11   1.39031e11   1.40274e11   1.41523e11
          -4.00904e6      3.9166e8   -9.13369e9    8.60798e10  -3.86441e11   8.21383e11      5.04267e10   5.08898e10   5.13561e10   5.18251e10
           1.63339e6     -1.8959e8    5.10176e9   -5.45604e10   2.76223e11  -6.69193e11  …  -8.18735e10  -8.25983e10  -8.33273e10  -8.40596e10
           4.57405e6     -4.75446e8   1.17311e10  -1.16639e11   5.52734e11  -1.25049e12     -1.12594e11  -1.13605e11  -1.14622e11  -1.15644e11
           3.29825e6     -3.27272e8   7.73331e9   -7.3721e10    3.34416e11  -7.18287e11     -4.43866e10  -4.47954e10  -4.52068e10  -4.56208e10
      -37289.1            2.26601e7  -9.77279e8    1.3636e10   -8.32615e10   2.3651e11       4.70926e10   4.75026e10   4.79148e10   4.83286e10
          -2.81123e6      3.03789e8  -7.75788e9    7.96192e10  -3.89208e11   9.11377e11      9.80715e10   9.89445e10   9.98226e10   1.00705e11
          -3.65969e6      3.80616e8  -9.39927e9    9.35385e10  -4.43642e11   1.00441e12  …   8.90641e10   8.98655e10   9.06717e10   9.14823e10
           ⋮                                                                 ⋮           ⋱
          -5.52909e5      6.12481e7  -1.61047e9    1.70918e10  -8.69055e10   2.14183e11      3.86899e10   3.90212e10   3.93541e10   3.96884e10
     -992608.0            1.08145e8  -2.79998e9    2.92745e10  -1.46579e11   3.54876e11  …   5.63476e10   5.68342e10   5.73232e10   5.78142e10
          -1.35946e6      1.47079e8  -3.78272e9    3.92898e10  -1.95373e11   4.69163e11      6.97768e10   7.03821e10   7.09905e10   7.16015e10
          -1.59876e6      1.72258e8  -4.41282e9    4.56535e10  -2.26067e11   5.40144e11      7.69398e10   7.76094e10   7.82824e10   7.89583e10
          -1.68423e6      1.80867e8  -4.61853e9    4.76281e10  -2.35036e11   5.59247e11      7.6759e10    7.74288e10   7.81023e10   7.87786e10
          -1.5861e6       1.69768e8  -4.32122e9    4.44178e10  -2.18431e11   5.17537e11      6.82601e10   6.88577e10   6.94585e10   7.0062e10
          -1.28093e6      1.36538e8  -3.46127e9    3.54304e10  -1.73446e11   4.08663e11  …   5.09162e10   5.13641e10   5.18145e10   5.2267e10
          -7.85551e5      8.29992e7  -2.08563e9    2.1155e10   -1.02525e11   2.38529e11      2.56914e10   2.59206e10   2.61511e10   2.63827e10
          -1.22319e5      1.1641e7   -2.59973e8    2.29043e9   -9.22435e9    1.59028e10     -5.87688e9   -5.92236e9   -5.96796e9   -6.01363e9
           6.34345e5     -6.95395e7   1.8113e9    -1.90533e10   9.60279e10  -2.3435e11      -4.02598e10  -4.06052e10  -4.09522e10  -4.13007e10
           1.37734e6     -1.49026e8   3.83373e9   -3.98355e10   1.98204e11  -4.76404e11     -7.24162e10  -7.30427e10  -7.36725e10  -7.43049e10
           1.94231e6     -2.09213e8   5.35875e9   -5.54398e10   2.74569e11  -6.56287e11  …  -9.49885e10  -9.58134e10  -9.66426e10  -9.74753e10
           2.11244e6     -2.26877e8   5.79482e9   -5.97807e10   2.95167e11  -7.02918e11     -9.83543e10  -9.92107e10  -1.00072e11  -1.00936e11
           1.59804e6     -1.71043e8   4.3541e9    -4.47652e10   2.20217e11  -5.22078e11     -6.99398e10  -7.0551e10   -7.11654e10  -7.17825e10
       23549.4           -1.60268e6   1.77704e7    5.98452e7   -1.59272e9    7.57578e9       6.25708e9    6.3078e9     6.35871e9    6.40975e9
          -3.07648e6      3.31117e8  -8.47524e9    8.76252e10  -4.33698e11   1.03595e12      1.49876e11   1.51177e11   1.52485e11   1.53798e11

    Note that the pinv() definition is using SVD internally, so this is turning into an LAPACK.gesdd() call, which is itself giving very different answers, so this should be easy to reproduce locally by passing a Hilbert matrix of the above dimensions in through whichever interface you wish to dgesdd.

    opened by staticfloat 85
  • Errors when using OpenBLAS through NumPy on Windows 10 2004

    This issue only appears on Windows 10 2004 (19041). It does not appear on Windows 10 1909 (18363).

    On a fresh install of NumPy from pip on a 2004 machine, e.g.,

    pip install numpy ipython

    open ipython and enter

    import numpy as np
    a = np.arange(13 * 13, dtype=np.float64)
    a.shape = (13, 13)
    a = a % 17
    va, ve = np.linalg.eig(a)

    This produces an error:

     ** On entry to DGEBAL parameter number  3 had an illegal value
     ** On entry to DGEHRD  parameter number  2 had an illegal value
     ** On entry to DORGHR DORGQR parameter number  2 had an illegal value
     ** On entry to DHSEQR parameter number  4 had an illegal value
    LinAlgError                               Traceback (most recent call last)
    <ipython-input-1-bad305f0dfc7> in <module>
          3 a.shape = (13, 13)
          4 a = a % 17
    ----> 5 va, ve = np.linalg.eig(a)
    <__array_function__ internals> in eig(*args, **kwargs)
    c:\anaconda\envs\py-pip\lib\site-packages\numpy\linalg\ in eig(a)
       1322         _raise_linalgerror_eigenvalues_nonconvergence)
       1323     signature = 'D->DD' if isComplexType(t) else 'd->DD'
    -> 1324     w, vt = _umath_linalg.eig(a, signature=signature, extobj=extobj)
       1326     if not isComplexType(t) and all(w.imag == 0.0):
    c:\anaconda\envs\py-pip\lib\site-packages\numpy\linalg\ in _raise_linalgerror_eigenvalues_nonconvergence(err, flag)
         93 def _raise_linalgerror_eigenvalues_nonconvergence(err, flag):
    ---> 94     raise LinAlgError("Eigenvalues did not converge")
         96 def _raise_linalgerror_svd_nonconvergence(err, flag):
    LinAlgError: Eigenvalues did not converge

    What makes me suspect that it is OpenBLAS related is:

    1. It does not happen when using MKL.
    2. It does not happen when using LAPACK-lite that ships with NumPy and is used as a fallback.

    On the other hand, it may likely be a Windows bug since:

    1. It does not occur before 2004.
    2. It does not occur when using 32-bit Python on the same system.
    3. It does not occur when running Python in WSL on the same machine, whether using WSL1, which runs Linux binaries directly on Windows, or WSL2, which runs them in a hypervisor.

    Any help is appreciated.

    The related NumPy issue is numpy/numpy#16744.

    Bug in other software 
    opened by bashtage 81
  • Patches for OpenBLAS 0.3.12 to support NVIDIA HPC SDK Compilers release 20.11 and later

    Good news! I have been working with our developers to clean up a number of the residual issues involved with compiling OpenBLAS with the NVIDIA HPC SDK Compilers, and as of the 20.11 release (due out shortly) there are only a small number of changes needed against the OpenBLAS 0.3.12 source tree to allow it to compile out of the box with our compilers.

    Summary of these changes:

    1. -Mnollvm is no longer needed to compile OpenBLAS with our compilers. In fact, the old NoLLVM compilers do not even exist anymore. This flag no longer has any effect and should be removed from the OpenBLAS build with modern releases of our compilers.

    2. The -D__MMX__ flag used to work around a previous issue is also no longer needed. That issue has been fixed.

    3. f_check needs a small tweak to correctly detect nvfortran as a PGI variant, and thus switch in the appropriate calling conventions for complex numbers.

    4. In order to compile kernels that use _mm256 and _mm512 intrinsics with our compilers, you have to pass the -mavx512f flag. Otherwise, these intrinsics are not enabled in our compilers by default. I have modified the SKYLAKEX and COOPERLAKE build rules in kernel/Makefile to pass this flag. I also tested this flag with GCC 10.2.0, and confirmed that it does not adversely affect compiling OpenBLAS with GCC.

    Note I built OpenBLAS 0.3.12 with these patches with our 20.11 compilers as follows:

    make CC=nvc FC=nvfortran HOSTCC=gcc DYNAMIC_ARCH=1 USE_OPENMP=1 NUM_THREADS=512 all

    Please give these patches a test with our 20.11 compilers once they are released, and if you agree they work well for you, consider integrating them into the OpenBLAS source tree ahead of the next OpenBLAS release. I think they will greatly help our customers who also use OpenBLAS. Cheers!


    opened by cparrott73 80
  • openblas build fails using Cray compiler cce

    openblas@0.3.20 build fails using Cray compiler on OLCF Crusher.

    I think this issue extends beyond Crusher, wherever Cray compilers are used to build openblas.

    Spack compiler definition for cce:

      - compiler:
          spec: cce@14.0.0
          paths:
            cc: /opt/cray/pe/craype/2.7.13/bin/cc
            cxx: /opt/cray/pe/craype/2.7.13/bin/CC
            f77: /opt/cray/pe/craype/2.7.13/bin/ftn
            fc: /opt/cray/pe/craype/2.7.13/bin/ftn
          flags: {}
          operating_system: sles15
          target: any
          modules:
          - PrgEnv-cray/8.3.3
          - cce/14.0.0
          - craype-x86-trento
          - libfabric
          - cray-pmi/6.1.2

    Build error:

    $> spack install openblas %cce@14
    ==> Installing openblas-0.3.20-wcx5q4myxg5skj63wzuf24bn2rhyfac4
    ==> No binary for openblas-0.3.20-wcx5q4myxg5skj63wzuf24bn2rhyfac4 found: installing from source
    ==> Fetching
    ==> No patches needed for openblas
    ==> openblas: Executing phase: 'edit'
    ==> openblas: Executing phase: 'build'
    ==> Error: ProcessError: Command exited with status 2:
        'make' '-j16' 'CC=/gpfs/alpine/csc439/world-shared/E4S/ParaTools/22.05/PrgEnv-cray/spack/lib/spack/env/cce/cc' 'FC=/gpfs/alpine/csc439/world-shared/E4S/ParaTools/22.05/PrgEnv-cray/spack/lib/spack/env/cce/ftn' 'MAKE_NB_JOBS=0' 'ARCH=x86_64' 'TARGET=ZEN' 'USE_LOCKING=1' 'USE_OPENMP=1' 'USE_THREAD=1' 'RANLIB=ranlib' 'libs' 'netlib' 'shared'
    18 errors found in build log:
         2879    ftn-2103 ftn: WARNING in command line
         2880      The -W all option is not supported or invalid and will be ignored.
         2881    ftn-2307 ftn: ERROR in command line
         2882      The "-m" option must be followed by 0, 1, 2, 3 or 4.
         2883    ftn-2307 ftn: ERROR in command line
         2884      The "-m" option must be followed by 0, 1, 2, 3 or 4.
      >> 2885    make[2]: *** [<builtin>: spotrf2.o] Error 1
         2886    make[2]: *** Waiting for unfinished jobs....
      >> 2887    make[2]: *** [<builtin>: sgecon.o] Error 1
      >> 2888    make[2]: *** [<builtin>: sgebak.o] Error 1
      >> 2889    make[2]: *** [<builtin>: sgbsvx.o] Error 1
      >> 2890    make[2]: *** [<builtin>: sgetrf2.o] Error 1
      >> 2891    make[2]: *** [<builtin>: sgbbrd.o] Error 1
      >> 2892    make[2]: *** [<builtin>: sgebal.o] Error 1
      >> 2893    make[2]: *** [<builtin>: sgbcon.o] Error 1
      >> 2894    make[2]: *** [<builtin>: sgbtrs.o] Error 1
      >> 2895    make[2]: *** [<builtin>: sgbrfs.o] Error 1
      >> 2896    make[2]: *** [<builtin>: sgebrd.o] Error 1
      >> 2897    make[2]: *** [<builtin>: sgebd2.o] Error 1
      >> 2898    make[2]: *** [<builtin>: sgbtrf.o] Error 1
      >> 2899    make[2]: *** [<builtin>: sgbequ.o] Error 1
      >> 2900    make[2]: *** [<builtin>: sgbtf2.o] Error 1
      >> 2901    make[2]: *** [<builtin>: sgbsv.o] Error 1
         2902    make[2]: Leaving directory '/tmp/eugenewalker/spack-stage/spack-stage-openblas-0.3.20-wcx5q4myxg5skj63wzuf24bn2rhyfac4/spack-src/lapack-netlib/SRC'
      >> 2903    make[1]: *** [Makefile:27: lapacklib] Error 2
         2904    make[1]: Leaving directory '/tmp/eugenewalker/spack-stage/spack-stage-openblas-0.3.20-wcx5q4myxg5skj63wzuf24bn2rhyfac4/spack-src/lapack-netlib'
      >> 2905    make: *** [Makefile:250: netlib] Error 2


    opened by eugeneswalker 79
  • make error: vfork: resource temporarily unavailable

    I'm trying to install OpenBLAS 0.2.20 on a node in a local directory over which I have permissions (I don't have root access). I seem to be encountering an error when attempting the build process. Trying make with or without any flags is resulting in a long string of errors that look like: make[1]: vfork: Resource temporarily unavailable

    I've posted the whole output in a text file. Could anyone give any suggestions on how to troubleshoot the problem? make_error.txt

    Many thanks. Tim

    EDIT: In case this is useful, here is also the output of a few server parameters.

    uname - or GNU/Linux

    lsb_release -a
    LSB Version: core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64:desktop-4.0-amd64:desktop-4.0-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch
    Distributor ID: SUSE LINUX
    Description: SUSE Linux Enterprise Server 11 (x86_64)
    Release: 11
    Codename: n/a

    cat /etc/*-release
    LSB_VERSION="core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64"
    SGI Accelerate 1.3, Build 705r10.sles11-1110192111
    SGI Foundation Software 2.5, Build 705r10.sles11-1110192111
    SGI MPI 1.3, Build 705r10.sles11-1110192111
    SGI Performance Suite 1.3, Build 705r10.sles11-1110192111
    SGI UPC 1.3, Build 705r10.sles11-1110192111
    SUSE Linux Enterprise Server 11 (x86_64)
    VERSION = 11
    PATCHLEVEL = 1

    lscpu:
    Architecture: x86_64
    CPU(s): 64
    Thread(s) per core: 1
    Core(s) per socket: 8
    CPU socket(s): 8
    NUMA node(s): 8
    Vendor ID: GenuineIntel
    CPU family: 6
    Model: 46
    Stepping: 6
    CPU MHz: 2266.424
    Virtualization: VT-x
    L1d cache: 32K
    L1i cache: 32K
    L2 cache: 256K
    L3 cache: 24576K
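
    One thing that may be worth checking, as a hedged suggestion rather than a diagnosis: fork and vfork can fail with "Resource temporarily unavailable" (EAGAIN) when the per-user process limit is reached, which is plausible on a shared node with many cores. The sketch below only reads that limit through Python's standard resource module. The helper names process_limits and describe are made up for illustration, and whether this limit is the cause here is an assumption.

```python
import resource

def process_limits():
    """Return the (soft, hard) per-user process limits (RLIMIT_NPROC)."""
    return resource.getrlimit(resource.RLIMIT_NPROC)

def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)

if __name__ == "__main__":
    soft, hard = process_limits()
    print(f"max user processes: soft={describe(soft)} hard={describe(hard)}")
```

    If the soft limit turns out to be low compared with the number of jobs the build starts, reducing the build parallelism would be the natural next experiment.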

    opened by timjim333 78
  • Build OpenBLAS 0.3.6 for iOS


    Hello! I'm trying to build OpenBLAS 0.3.6 for iOS. Here's the shell command I'm using for the build:

    make TARGET=ARMV8 BINARY=64 HOSTCC=clang CC="$TOOLCHAIN_PATH/clang -isysroot $SYSROOT_PATH -arch arm64 -miphoneos-version-min=10.0 -O2" NOFORTRAN=1 libs

    When I execute it, I get this error:

    clang: error: unknown argument: '-ru'
    clang: error: no such file or directory: '../libopenblas_armv8p-r0.3.6.a'
    clang: warning: no such sysroot directory: '/Applications/' [-Wmissing-sysroot]

    How can this be fixed? Thanks!

    opened by L1onKing 71
  • Optimizing for POWER8 on big-endian


    We're preparing an update of the OpenBLAS port in FreeBSD (to 0.3.7) and I'm running tests on POWER. Previously, we optimized for PPC970, with an option to optimize for POWER6. Now I have also tested newer CPUs (they didn't work before). POWER9 still fails, but OpenBLAS built for POWER8 appears to pass all tests. This is all done on big-endian.

    Can I conclude that it's safe to optimize for POWER8 on the big-endian variant?

    opened by pkubaj 70
  • amax:samax utest failure


    System configuration:

    • Compile host is ODROID-C2, but I'm running the compile inside of a docker image that was created from a NI RoboRIO disk image (ARMv7, Cortex A9). I'm also using setarch linux32.
    • Compilers are gcc 6.3.0, gfortran 6.3.0
    • OpenBLAS 0.3.4 (but the error occurs on 0.2.20 also)

    Compile was via:

    make TARGET=CORTEXA9 PREFIX=/usr/local

    Here's the output:

    TEST 1/23 amax:samax [FAIL]
      ERR: test_amax.c:44  expected 3.300e+00, got 4.204e-45 (diff 3.300e+00, tol 1.000e-04)
    RESULTS: 23 tests (22 ok, 1 failed, 0 skipped) ran in 1860 ms

    What steps should I take next to diagnose this issue? Thanks!
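
    A hedged observation on the "got" value: 4.204e-45 is what the 32-bit integer 3 looks like when its bit pattern is read as a single-precision float. The sketch below shows this with Python's standard struct module (the helper name bits_as_float32 is illustrative). It suggests, but does not prove, that an integer-like value or the contents of the wrong register are being read as the float result, for example through a soft-float/hard-float calling convention mismatch.

```python
import struct

def bits_as_float32(value):
    """Reinterpret the 32-bit pattern of an unsigned integer as an IEEE-754 float."""
    return struct.unpack("<f", struct.pack("<I", value))[0]

if __name__ == "__main__":
    # The integer 3 viewed as a float32 is a tiny subnormal number.
    print(f"{bits_as_float32(3):.3e}")  # prints 4.204e-45
```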

    opened by virtuald 68
  • Wrong results for small DTRMV on x86-64


    Usually I use the POWER8 platform but today I found a bad error in the x86-64 code for the DTRMV routine.

    I want to compute x <- A*x, where

    • A is nonunit upper triangular,
    • N = 1...64,
    • LDA = 64,
    • and INCX=1

    and get wrong results on a Haswell (16 cores, Xeon CPU E5-2640 v3) based system using gcc (Ubuntu 5.4.0-6ubuntu1~16.04.5) if OpenBLAS is compiled via

    make USE_OPENMP=1 

    I compared the results with the Netlib implementation and obtained:

    Correct RESULT: DTRMV(U,N,N,    1, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,    2, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,    3, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,    4, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,    5, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,    6, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,    7, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,    8, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,    9, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,   10, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,   11, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,   12, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,   13, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,   14, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,   15, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Correct RESULT: DTRMV(U,N,N,   16, A,   64, X,     1)  MAXERR = .000000000000000D+00
    Wrong RESULT  : DTRMV(U,N,N,   17, A,   64, X,     1)  MAXERR = .568434188608080D-13
    Wrong RESULT  : DTRMV(U,N,N,   18, A,   64, X,     1)  MAXERR = .113686837721616D-12
    Wrong RESULT  : DTRMV(U,N,N,   19, A,   64, X,     1)  MAXERR = .113686837721616D-12
    Wrong RESULT  : DTRMV(U,N,N,   20, A,   64, X,     1)  MAXERR = .227373675443232D-12
    Wrong RESULT  : DTRMV(U,N,N,   21, A,   64, X,     1)  MAXERR = .227373675443232D-12
    Wrong RESULT  : DTRMV(U,N,N,   22, A,   64, X,     1)  MAXERR = .454747350886464D-12
    Wrong RESULT  : DTRMV(U,N,N,   23, A,   64, X,     1)  MAXERR = .909494701772928D-12
    Wrong RESULT  : DTRMV(U,N,N,   24, A,   64, X,     1)  MAXERR = .909494701772928D-12
    Wrong RESULT  : DTRMV(U,N,N,   25, A,   64, X,     1)  MAXERR = .181898940354586D-11
    Wrong RESULT  : DTRMV(U,N,N,   26, A,   64, X,     1)  MAXERR = .363797880709171D-11
    Wrong RESULT  : DTRMV(U,N,N,   27, A,   64, X,     1)  MAXERR = .727595761418343D-11
    Wrong RESULT  : DTRMV(U,N,N,   28, A,   64, X,     1)  MAXERR = .363797880709171D-11
    Wrong RESULT  : DTRMV(U,N,N,   29, A,   64, X,     1)  MAXERR = .291038304567337D-10
    Wrong RESULT  : DTRMV(U,N,N,   30, A,   64, X,     1)  MAXERR = .582076609134674D-10
    Wrong RESULT  : DTRMV(U,N,N,   31, A,   64, X,     1)  MAXERR = .145519152283669D-10
    Wrong RESULT  : DTRMV(U,N,N,   32, A,   64, X,     1)  MAXERR = .582076609134674D-10
    Wrong RESULT  : DTRMV(U,N,N,   33, A,   64, X,     1)  MAXERR = .596598620663450D+06
    Wrong RESULT  : DTRMV(U,N,N,   34, A,   64, X,     1)  MAXERR = .605368209769108D+06
    Wrong RESULT  : DTRMV(U,N,N,   35, A,   64, X,     1)  MAXERR = .114251427897636D+07
    Wrong RESULT  : DTRMV(U,N,N,   36, A,   64, X,     1)  MAXERR = .176011411015678D+07
    Wrong RESULT  : DTRMV(U,N,N,   37, A,   64, X,     1)  MAXERR = .273520848158376D+07
    Wrong RESULT  : DTRMV(U,N,N,   38, A,   64, X,     1)  MAXERR = .400535392194670D+07
    Wrong RESULT  : DTRMV(U,N,N,   39, A,   64, X,     1)  MAXERR = .583437738276019D+07
    Wrong RESULT  : DTRMV(U,N,N,   40, A,   64, X,     1)  MAXERR = .850833665131698D+07
    Wrong RESULT  : DTRMV(U,N,N,   41, A,   64, X,     1)  MAXERR = .123493561359394D+08
    Wrong RESULT  : DTRMV(U,N,N,   42, A,   64, X,     1)  MAXERR = .173391397036582D+08
    Wrong RESULT  : DTRMV(U,N,N,   43, A,   64, X,     1)  MAXERR = .260484477387416D+08
    Wrong RESULT  : DTRMV(U,N,N,   44, A,   64, X,     1)  MAXERR = .376086864422978D+08
    Wrong RESULT  : DTRMV(U,N,N,   45, A,   64, X,     1)  MAXERR = .562860046618746D+08
    Wrong RESULT  : DTRMV(U,N,N,   46, A,   64, X,     1)  MAXERR = .722429815166366D+08
    Wrong RESULT  : DTRMV(U,N,N,   47, A,   64, X,     1)  MAXERR = .129301313570969D+09
    Wrong RESULT  : DTRMV(U,N,N,   48, A,   64, X,     1)  MAXERR = .193137989170661D+09
    Wrong RESULT  : DTRMV(U,N,N,   49, A,   64, X,     1)  MAXERR = .279222204411076D+09
    Wrong RESULT  : DTRMV(U,N,N,   50, A,   64, X,     1)  MAXERR = .678662007793433D+09
    Wrong RESULT  : DTRMV(U,N,N,   51, A,   64, X,     1)  MAXERR = .121269930403095D+10
    Wrong RESULT  : DTRMV(U,N,N,   52, A,   64, X,     1)  MAXERR = .106155723183043D+10
    Wrong RESULT  : DTRMV(U,N,N,   53, A,   64, X,     1)  MAXERR = .306408230139373D+10
    Wrong RESULT  : DTRMV(U,N,N,   54, A,   64, X,     1)  MAXERR = .470622143023516D+10
    Wrong RESULT  : DTRMV(U,N,N,   55, A,   64, X,     1)  MAXERR = .716298126556327D+10
    Wrong RESULT  : DTRMV(U,N,N,   56, A,   64, X,     1)  MAXERR = .103315607770989D+11
    Wrong RESULT  : DTRMV(U,N,N,   57, A,   64, X,     1)  MAXERR = .837779499609675D+10
    Wrong RESULT  : DTRMV(U,N,N,   58, A,   64, X,     1)  MAXERR = .255092826195343D+11
    Wrong RESULT  : DTRMV(U,N,N,   59, A,   64, X,     1)  MAXERR = .388799116737581D+11
    Wrong RESULT  : DTRMV(U,N,N,   60, A,   64, X,     1)  MAXERR = .612787305842372D+11
    Wrong RESULT  : DTRMV(U,N,N,   61, A,   64, X,     1)  MAXERR = .421813369205623D+11
    Wrong RESULT  : DTRMV(U,N,N,   62, A,   64, X,     1)  MAXERR = .531064994597937D+11
    Wrong RESULT  : DTRMV(U,N,N,   63, A,   64, X,     1)  MAXERR = .218143907859646D+12
    Wrong RESULT  : DTRMV(U,N,N,   64, A,   64, X,     1)  MAXERR = .160600130355114D+12

    The demonstration code is here:

    opened by grisuthedragon 68
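
    For readers who want to reproduce the comparison, here is a minimal pure-Python reference for the operation described in the report: x <- A*x with A upper triangular, non-unit diagonal, column-major storage with a leading dimension, and INCX = 1. It is a sketch for checking results, not the reporter's demonstration code and not the Netlib source; the function name dtrmv_upper_nonunit is made up.

```python
def dtrmv_upper_nonunit(n, a, lda, x):
    """Compute x <- A*x in place for an upper triangular, non-unit A.

    `a` is a flat column-major list: element (i, j) lives at a[i + j*lda].
    Only the upper triangle (i <= j) is read; INCX is assumed to be 1.
    """
    for i in range(n):
        # Row i only needs x[i..n-1], which have not been overwritten yet.
        x[i] = sum(a[i + j * lda] * x[j] for j in range(i, n))
    return x

if __name__ == "__main__":
    lda = 4
    # 3x3 upper triangular matrix stored with leading dimension 4.
    a = [0.0] * (lda * 3)
    a[0 + 0 * lda] = 1.0
    a[0 + 1 * lda] = 2.0
    a[1 + 1 * lda] = 3.0
    a[0 + 2 * lda] = 4.0
    a[1 + 2 * lda] = 5.0
    a[2 + 2 * lda] = 6.0
    print(dtrmv_upper_nonunit(3, a, lda, [1.0, 1.0, 1.0]))  # [7.0, 8.0, 6.0]
```

    Comparing the library result against such a reference for N = 1...64 would reproduce the kind of MAXERR table shown above.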
  • Performance variability


    I am observing a lot of performance variability for matrix multiplication in sizes ranging from ~100 to ~1000. I have been investigating this without much success. The timing can be up to twice as large, depending on the order in which I run the benchmarks. However, for any chosen order, the accuracy of the timing is high.

    I am a bit at a loss here. Because the order only influences the position in memory of the input data to the benchmarks, I am inclined to think that this may be a memory alignment issue. The position of the matrices in the heap is rounded only to 16 bytes (double floats, alignment imposed by C++), but not to any other size and perhaps there is some kind of SIMD code making some kind of ugly magic that disturbs my benchmarks?

    I have tested this with serial, openmp and pthreads versions, on Linux and on Windows, with similar outcomes. This is an AMD Ryzen processor, but I have witnessed even greater variability on Intel.
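
    One way to test the alignment hypothesis is to force the matrices onto a larger boundary and see whether the variability disappears. The sketch below illustrates the usual over-allocate-and-shift technique in Python with the standard ctypes module; the 64-byte default and the helper names aligned_view and address_of are assumptions chosen for illustration, and the same idea would have to be carried over to the C++ benchmark.

```python
import ctypes

def address_of(buf, offset=0):
    """Return the memory address of buf[offset]."""
    return ctypes.addressof(ctypes.c_char.from_buffer(buf, offset))

def aligned_view(nbytes, alignment=64):
    """Return (buffer, offset) such that buffer[offset] sits on an aligned address.

    The buffer is over-allocated by `alignment` bytes to leave room for the shift.
    """
    buf = bytearray(nbytes + alignment)
    offset = (-address_of(buf)) % alignment
    return buf, offset

if __name__ == "__main__":
    buf, offset = aligned_view(8 * 100 * 100)  # room for a 100x100 matrix of doubles
    print("offset:", offset, "aligned:", address_of(buf, offset) % 64 == 0)
```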

    opened by juanjosegarciaripoll 1
  • Don't gate reading of Makefile.conf


    This was gated behind GOTOBLAS_MAKEFILE, which looks like it's intended to stop the much more expensive run of Makefile.prebuild rather than prevent reading the configuration. Moving this endif means that things such as dynamic_<X>.c correctly get the architecture flags.

    You can then end up in a situation where the HAVE_x/NO_x flags are used for compiling param.h but not Makefile.conf which leads to further inconsistencies. This uses the existing logic to parse ARCHCONFIG and ensures it ends up in the eventual Makefile.conf whenever FORCE is enabled.

    opened by Mousius 0
  • Large matrix mul can't be run on Xuantie C906


    Matrix A is m*m and matrix B is also m*m. I use the function cblas_dgemm() to calculate A*B on a Xuantie C906, a RISC-V CPU. However, when m>=8 it reports the error "Illegal instruction"; when m<8 it works and gives the correct answer. On my laptop it runs well even with m==3200.

    opened by yamato720 2
  • Use SVE kernel for SGEMM/DGEMM on Arm(R) Neoverse(TM) V1


    After #3868, the SVE kernels represent a pretty good boost.

    This re-uses ARMV8SVE as a base and I'm going to incrementally move everything to use ARMV8SVE in additional patches (as well as fix up anything that's not already in ARMV8SVE).

    opened by Mousius 0
  • ZSYTRF yields wrong result when OpenBLAS is built using CMake



    For versions 0.3.16 - 0.3.21, if I build OpenBLAS with cmake -DCMAKE_BUILD_TYPE=Release -DBUILD_SHARED_LIBS=ON, ZSYTRF can yield a wrong result.

    The result is correct for version 0.3.15, or if I simply build OpenBLAS with make.

    How to Reproduce

    It can be tested with the following code

    #include <complex>
    #include <iostream>
    #include <lapacke.h>

    int main()
    {
        // 3x3 complex symmetric matrix, stored in column-major order
        std::complex<double> A[9]{
            {2, 2}, {0, 0}, {1, 1},
            {0, 0}, {2, 2}, {0, 0},
            {1, 1}, {0, 0}, {2, 2}
        };
        lapack_int ipiv[3]{ 0 };
        // Factorize in place, referencing the lower triangle ('L')
        auto info = LAPACKE_zsytrf(LAPACK_COL_MAJOR, 'L', 3, (lapack_complex_double*)A, 3, ipiv);
        std::cout << "info = " << info << '\n';
        std::cout << "ipiv = " << ipiv[0] << ", " << ipiv[1] << ", " << ipiv[2] << '\n';
        // Print the lower triangle of the factorized matrix, row by row
        std::cout << "A =\n";
        std::cout << A[0] << '\n';
        std::cout << A[1] << ", " << A[4] << '\n';
        std::cout << A[2] << ", " << A[5] << ", " << A[8] << '\n';
        return 0;
    }

    It prints

    info = 0
    ipiv = 1, 2, 3
    A =
    (0,0), (2,2)
    (0.5,0), (0,0), (2,2)

    Whereas the correct A after factorization should be

    (0,0), (2,2)
    (0.5,0), (0,0), (1.5,1.5)
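
    The expected values can be checked independently of any BLAS build. The sketch below factors the same matrix as A = L * D * L^T in pure Python, assuming no pivoting takes place (consistent with ipiv = 1, 2, 3 in the output above) and using the plain transpose rather than the conjugate transpose, since the matrix is complex symmetric and not Hermitian. The function name ldlt_no_pivot is made up, and this is not how ZSYTRF itself is implemented.

```python
def ldlt_no_pivot(a):
    """Factor a complex symmetric matrix as A = L * D * L^T without pivoting.

    `a` is a list of rows. Returns (L, d) with unit lower triangular L and
    the diagonal of D. Note the plain transpose: no complex conjugation.
    """
    n = len(a)
    low = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    d = [0] * n
    for j in range(n):
        d[j] = a[j][j] - sum(low[j][k] * low[j][k] * d[k] for k in range(j))
        for i in range(j + 1, n):
            s = sum(low[i][k] * low[j][k] * d[k] for k in range(j))
            low[i][j] = (a[i][j] - s) / d[j]
    return low, d

if __name__ == "__main__":
    A = [[2 + 2j, 0, 1 + 1j],
         [0, 2 + 2j, 0],
         [1 + 1j, 0, 2 + 2j]]
    L, d = ldlt_no_pivot(A)
    print("L[2][0] =", L[2][0])
    print("d =", d)
```

    Under these assumptions the multiplier L[2][0] is 0.5 and the last diagonal entry of D is 1.5+1.5i, which agrees with the last row of the expected result.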

    This is not the only failing case; I just picked a simple one for demonstration. I can reproduce it on various platforms, including Linux+GCC, Windows+MinGW64, Intel 8700, Intel E5 2620 v3 and AMD Ryzen 5900HX.

    My Question

    Is my way of building OpenBLAS with CMake wrong, or are there bugs in the project's CMake scripts?

    opened by A205B064 13
  • SWITCH_RATIO for Arm(R) Neoverse(TM) architecture


    This seems like a good balance of values for reasonably sized matrices. With SWITCH_RATIO=16, DGEMM scales better to bigger sizes, but the better solution would be some kind of thread throttling, so I've gone with SWITCH_RATIO=8.

    opened by Mousius 1
  • v0.3.21(Aug 7, 2022)


    • updated the included LAPACK to Reference-LAPACK 3.10.1
    • when no Fortran compiler is available, OpenBLAS builds will now automatically build LAPACK from an f2c-converted copy of LAPACK 3.9.0 unless the NO_LAPACK option is specified (more recent releases make too heavy use of Fortran90+ features to be easily convertible to C)
    • similarly added C versions of the BLAS and CBLAS tests
    • enabled building of the ReLAPACK GEMMT kernels when ReLAPACK is built
    • function LAPACKE_lsame is now annotated with the GCC attribute "const" to aid static analyzers
    • added USE_TLS to the list of options reported by the openblas_get_config() function
    • added openblas_getaffinity() as a Linux-only convenience function wrapping pthread_getaffinity_np()
    • CMAKE builds now support the BUILD_TESTING keyword (to disable the LAPACK testsuite) of Reference-LAPACK
    • fixed CMAKE builds of the laswp_ncopy and neg_tcopy kernels
    • removed the build system requirements for PERL (while keeping the original perl scripts as backup)
    • handle building and running OpenBLAS on systems that report zero available cpu cores
    • added SYMBOLPREFIX/SYMBOLSUFFIX handling for LAPACK 3.10.0 functions added in 0.3.20
    • fixed linking of the utests on QNX
    • Added support for compilation with the Intel ifx compiler
    • Added support for compilation with the Fujitsu FCC compiler for Fugaku
    • Added support for compilation with the Cray C and Fortran compilers
    • reverted OpenMP threadpool behaviour in the exec_blas call to its state before 0.3.11, that is the threadpool will no longer grow or shrink on demand as the overhead for this is too big at least with GNU OpenMP. The adaptive behaviour introduced in 0.3.11 can still be requested at runtime by setting the environment variable OMP_ADAPTIVE
    • worked around spurious STFSM/CTFSM errors reported by the LAPACK testsuite


    • fixed determination of compiler support for AVX512 and removed the 0.3.19 workaround for building SKYLAKEX kernels on Sandybridge hardware
    • fixed compilation for the SKYLAKEX target with gcc 6
    • fixed compilation of the CooperLake SBGEMM kernel with LLVM
    • fixed compilation of the SkyLakeX small matrix GEMM kernels with LLVM or ICC
    • fixed compilation of some BFLOAT16 kernels with CMAKE
    • added support for the Zhaoxin/Centaur KH40000 cpu
    • fixed a potential crash in the ZSYMV kernel used for all targets except generic
    • fixed gmake compilation for DYNAMIC_ARCH with a DYNAMIC_LIST including ATOM
    • fixed compilation of LAPACKE with the INTEGER64 option on Windows
    • added support for cross-compiling to individual Intel or AMD targets using CMAKE (previously only CORE2 supported, added targets are ATOM, PRESCOTT, NEHALEM, SANDYBRIDGE, HASWELL, SKYLAKEX, COOPERLAKE, SAPPHIRERAPIDS, OPTERON, BARCELONA, BULLDOZER, PILEDRIVER, STEAMROLLER, EXCAVATOR, ZEN)


    • worked around an overflow error in the DNRM2 kernel


    • worked around an overflow error in the POWER6 DNRM2 kernel
    • fixed compilation on PPC440
    • fixed a performance regression in the level1 BLAS on POWER10
    • fixed the POWER10 ZGEMM kernel
    • fixed singlethreaded builds for POWER10
    • fixed compilation of the POWER10 DGEMV kernel with older gcc versions
    • enabled compilation of the BFLOAT16 kernels by default
    • enabled the small matrix kernels by default for DYNAMIC_ARCH builds
    • added a workaround for a miscompilation of the CDOT and ZDOT kernels by GCC 12


    • fixed cpu autodetection logic


    • added an SBGEMM kernel for Neoverse N2
    • worked around an overflow error in the DNRM2 kernel used on M1, NeoverseN1, ThunderX2T99
    • added support for ARM64 systems running MS Windows
    • added support for cross-compiling to the GENERIC ARMV8 target under CMAKE (Windows/MSVC)
    • fixed a performance regression in the generic ARMV8 DGEMM kernel introduced in 0.3.19
    • added initial support for the Apple M1 cpu under Linux
    • added initial support for the Phytium FT2000 cpu
    • added initial support for the Cortex A510, A710, X1 and X2 cpus
    • fixed an accidental mixup of cpu identifiers in the autodetection code introduced in 0.3.20
    • fixed linking of Apple M1 builds on macOS 12 and later with recent XCode
    • made NeoverseN2 available in DYNAMIC_ARCH builds


    • worked around an overflow error in the DNRM2 kernel


    • worked around an overflow error in the DNRM2 kernel
    • added preliminary support for the LOONGSON2K1000 cpu
    • added DYNAMIC_ARCH support

    md5sum ffb6120e2309a2280471716301824805 OpenBLAS-0.3.21.tar.gz 4f013627138be6ecbd2c8d1435f2ec40 c605e9e4ef227605ebcafa6466f14e25 16e2cc782e893df47fef97be09896ae1

    Download OpenBLAS

    Source code (tar.gz)
    Source code (zip)
    OpenBLAS-0.3.21.tar.gz (22.63 MB)
  • v0.3.20(Feb 20, 2022)


    • some code cleanup, with added casts etc.
    • fixed obtaining the cpu count with OpenMP and OMP_PROC_BIND unset
    • fixed pivot index calculation by ?LASWP for negative increments other than one
    • fixed input argument check in LAPACK ?GEQRT2
    • improved the check for a Fortran compiler in CMAKE builds
    • disabled building OpenBLAS' optimized versions of LAPACK complex SPMV, SPR, SYMV, SYR with NO_LAPACK=1
    • fixed building of LAPACK on certain distributed filesystems with parallel gmake
    • fixed building the shared library on MacOS with classic flang


    • fixed cross-compilation with CMAKE for CORE2 target
    • fixed miscompilation of AVX512 code in DYNAMIC_ARCH builds
    • added support for the "incidental" AVX512 hardware in Alder Lake when enabled in BIOS


    • add new architecture (Russian Elbrus E2000 family)


    • fix IMIN/IMAX


    • added SVE-enabled CGEMM and ZGEMM kernels for ARMV8SVE and A64FX
    • added support for Neoverse N2 and V1 cpus


    • fixed autodetection of MSA capability


    • added an optimized DGEMM kernel

    abfaa43d995046ca4c56ccf14165c93c OpenBLAS-0.3.20.tar.gz 33526b15e15971edb657edc15de0c67f 3d9daef71592665261c032888bd810d6 5bfe847082510e44cdc59755cd49b941

    Download OpenBLAS

    Source code (tar.gz)
    Source code (zip)
    OpenBLAS-0.3.20.tar.gz (12.15 MB)
  • v0.3.19(Dec 19, 2021)


    • reverted unsafe TRSV/ZRSV optimizations introduced in 0.3.16
    • fixed a potential thread race in the thread buffer reallocation routines that were introduced in 0.3.18
    • fixed miscounting of thread pool size on Linux with OMP_PROC_BIND=TRUE
    • fixed CBLAS interfaces for CSROT/ZSROT and CROTG/ZROTG
    • made automatic library suffix for CMAKE builds with INTERFACE64 available to CBLAS-only builds


    • DYNAMIC_ARCH builds now fall back to the cpu with most similar capabilities when an unknown CPUID is encountered, instead of defaulting to Prescott
    • added cpu detection for Intel Alder Lake
    • added cpu detection for Intel Sapphire Rapids
    • added an optimized SBGEMM kernel for Sapphire Rapids
    • fixed DYNAMIC_ARCH builds on OSX with CMAKE
    • worked around DYNAMIC_ARCH builds made on Sandybridge failing on SkylakeX
    • fixed missing thread initialization for static builds on Windows/MSVC
    • fixed an excessive read in ZSYMV


    • added support for POWER10 in big-endian mode
    • added support for building with CMAKE
    • added optimized SGEMM and DGEMM kernels for small matrix sizes


    • added basic support and cputype detection for Fujitsu A64FX
    • added a generic ARMV8SVE target
    • added SVE-enabled SGEMM and DGEMM kernels for ARMV8SVE and A64FX
    • added optimized CGEMM and ZGEMM kernels for Cortex A53 and A55 cpus
    • fixed cpuid detection for Apple M1 and improved performance
    • improved compiler flag setting in CMAKE builds


    • fixed improper initialization in CSCAL/ZSCAL for strided access patterns


    • added a GENERIC target for MIPS32
    • added support for cross-compiling to MIPS32 on x86_64 using CMAKE


    • fixed misdetection of MSA capability

    9721d04d72a7d601c81eafb54520ba2c OpenBLAS-0.3.19.tar.gz bd74be5bafbc748266b4e9578bba955b 507a02d501944bd7586caeee4944d409 0cff635aeda36435813caeac391ca39e

    Download OpenBLAS

    Source code (tar.gz)
    Source code (zip)
    OpenBLAS-0.3.19.tar.gz (12.11 MB)
  • v0.3.18(Oct 2, 2021)


    • when the build-time number of preconfigured threads is exceeded at runtime (by an external program calling BLAS functions from a larger number of threads), OpenBLAS will now allocate an auxiliary control structure for up to 512 additional threads instead of aborting
    • added support for Loongson's LoongArch64 cpu architecture
    • fixed building OpenBLAS with CMAKE and -DBUILD_BFLOAT16=ON
    • added support for building OpenBLAS as a CMAKE subproject
    • added support for building for Windows/ARM64 targets with clang
    • improved support for building with the IBM xlf compiler
    • imported Reference-LAPACK PR 625 (out-of-bounds access in ?LARRV)
    • imported Reference-LAPACK PR 597 for testsuite compatibility with LLVM's libomp


    • added SkylakeX S/DGEMM kernels for small problem sizes (MNK<=1000000)
    • added optimized SBGEMM for Intel Cooper Lake
    • reinstated the performance patch for AVX512 SGEMV_T with a proper fix
    • added a workaround for a gcc11 tree-vectorizer bug that caused spurious failures in the test programs for complex BLAS3 when compiling at -O3 (the default for cmake "release" builds)
    • added support for runtime cpu count detection under Haiku OS
    • worked around a long-standing miscompilation issue of the Haswell DGEMV_T kernel with gcc that could produce NaN output in some corner cases


    • improved performance of DASUM on POWER10


    • fixed crashes (use of reserved register x18) on Apple M1 under OSX
    • fixed building with gcc releases earlier than 5.1


    • fixed building under BSD


    • fixed building under BSD

    5cd5df5a1541ad414f5874aaae17730f OpenBLAS-0.3.18.tar.gz 0ebf2e1ddc491f37be26bea4e0d1239a b76692df00d0b655d4f14058f6c2e10f b421f7c47223c5f228c1fe1c66f3f0e1

    Download OpenBLAS

    Source code (tar.gz)
    Source code (zip)
    OpenBLAS-0.3.18.tar.gz (12.03 MB)
  • v0.3.17(Jul 15, 2021)


    • reverted the optimization of SGEMV_N/DGEMV_N for small input sizes and consecutive arguments as it led to stack overflows on x86_64 with some operating systems (notably OSX and Windows)


    • reverted the performance patch for SGEMV_T on AVX512 as it caused wrong results in some applications


    • fixed compilation with compilers other than gcc

    5429954163bcbaccaa13e11fe30ca5b6 OpenBLAS-0.3.17.tar.gz d7e52f5f9ed8e6fc5a5269634ecface3 1ee19f55bfd46120689cd260e16e7ce6 40f6c0ac1b33729cc94f7f9af177e3a6

    Download OpenBLAS

    Source code (tar.gz)
    Source code (zip)
    OpenBLAS-0.3.17.tar.gz (11.93 MB)
  • v0.3.16(Jul 11, 2021)


    • drastically reduced the stack size requirements for running the LAPACK testsuite (Reference-LAPACK PR 553)
    • fixed spurious test failures in the LAPACK testsuite (Reference-LAPACK PR 564)
    • expressly setting DYNAMIC_ARCH=0 no longer enables dynamic_arch mode
    • improved performance of xGER, xSPR, xSPR2, xSYR, xSYR2, xTRSV, SGEMV_N and DGEMV_N, for small input sizes and consecutive arguments
    • improved performance of xGETRF, xPOTRF and xPOTRI for small input sizes by disabling multithreading
    • fixed installing with BSD versions of the "install" utility


    • fixed the implementation of xIMIN
    • improved the performance of DSDOT
    • fixed linking of the tests on C910V with current vendor gcc


    • fixed SBGEMM computation for some odd value inputs
    • fixed compilation for PPCG4, PPC970, POWER3, POWER4 and POWER5


    • improved performance of SGEMV_N and SGEMV_T for small N on AVX512-capable cpus
    • worked around a miscompilation of ZGEMM/ZTRMM on Sandybridge with old gcc versions
    • fixed compilation with MS Visual Studio versions older than 2017
    • fixed macro name collision with winnt.h from the latest Win10 SDK
    • added cpu type autodetection for Intel Ice Lake SP
    • fixed cpu type autodetection for Intel Tiger Lake
    • added cpu type autodetection for recent Centaur/Zhaoxin models
    • fixed compilation with musl libc


    • fixed compilation with gcc/gfortran on the Apple M1
    • fixed linking of the tests on FreeBSD
    • fixed missing restore of a register in the recently rewritten DNRM2 kernel for ThunderX2 and Neoverse N1 that could cause spurious failures in e.g. DGEEV
    • added compiler optimization flags for the EMAG8180
    • added initial support for Cortex A55


    • fixed linking of the tests on FreeBSD

    md5sum: 78cc2d682cfcd64edf982173420c06c0 OpenBLAS-0.3.16.tar.gz 1a35c22920aca83828eb2478043847b9 2f4404a4da21b319447b3ce7fe351426 2eb01f0b2eb31c7938d7a3a536638aaa

    Download OpenBLAS

    Source code (tar.gz)
    Source code (zip)
    OpenBLAS-0.3.16.tar.gz (11.93 MB)
  • v0.3.15(May 2, 2021)


    • imported improvements and bugfixes from Reference-LAPACK 3.9.1
    • imported LAPACKE interface fixes from Reference-LAPACK PRs 534 + 537
    • fixed a problem in the cpu detection of 0.3.14 that prevented cross-compilation
    • fixed a sequence problem in the generation of softlinks to the library in GMAKE

    RISC V:

    • fixed a potential division by zero in CROTG and ZROTG


    • fixed LAPACK testsuite failures seen with the NVIDIA HPC compiler
    • improved CGEMM, DGEMM and ZGEMM performance on POWER10
    • added an optimized ZGEMV kernel for POWER10
    • fixed a potential division by zero in CROTG and ZROTG


    • added support for Intel Control-flow Enforcement Technology (CET)
    • reverted the DOMATCOPY_RT code to the generic C version
    • fixed a bug in the AVX512 SGEMM kernel introduced in 0.3.14
    • fixed misapplication of -msse flag to non-SSE cpus in DYNAMIC_ARCH
    • added support for compilation of the benchmarks on older OSX versions
    • fixed propagation of the NO_AVX512 option in CMAKE builds
    • fixed compilation of the AVX512 SGEMM kernel with clang-cl on Windows
    • fixed compilation of the CTESTs with INTERFACE64=1 (random faults on OSX)
    • corrected the Haswell DROT kernel to require AVX2/FMA3 rather than AVX512


    • fixed a potential division by zero in CROTG and ZROTG
    • fixed a potential overflow in IMATCOPY/ZIMATCOPY and the CTESTs


    • fixed spurious reads outside the array in the SGEMM tcopy macro
    • fixed a potential division by zero in CROTG and ZROTG
    • fixed a segmentation fault in DYNAMIC_ARCH builds (reappeared in 0.3.14)


    • fixed a potential division by zero in CROTG and ZROTG
    • fixed a potential overflow in IMATCOPY/ZIMATCOPY and the CTESTs


    • fixed a potential division by zero in CROTG and ZROTG


    • fixed a potential division by zero in CROTG and ZROTG

    183dbd71895f2018d297be271cb31128 OpenBLAS-0.3.15.tar.gz 7bc1ea337884df348ddc87ac27a801d6 acffcbe7be0bb22d28320e39a4439d1f ef587a916b7b44c328ad553dbc630646

    Download OpenBLAS

    Source code (tar.gz)
    Source code (zip)
    OpenBLAS-0.3.15.tar.gz (11.91 MB)
  • v0.3.14(Mar 17, 2021)

    NOTE: this has a (now) known regression in AVX512 SGEMM


    • Fixed a race condition on thread shutdown in non-OpenMP builds
    • Fixed custom BUFFERSIZE option getting ignored in gmake builds
    • Fixed CMAKE compilation of the TRMM kernels for GENERIC platforms
    • Added CBLAS interfaces for CROTG, ZROTG, CSROT and ZDROT
    • improved performance of OMATCOPY_RT across all platforms
    • Changed perl scripts to use env instead of a hardcoded /usr/bin/perl
    • Fixed potential misreading of the GCC compiler version in the build scripts
    • Fixed convergence problems in LAPACK complex GGEV/GGES (Reference-LAPACK #477)
    • Reduced the stacksize requirements for running the LAPACK testsuite (Reference-LAPACK #335)

    RISC V:

    • Fixed compilation on RISCV (missing entry in getarch)


    • Fixed compilation for DYNAMIC_ARCH with clang and with older gcc versions
    • Added support for compilation on FreeBSD/ppc64le
    • Added optimized POWER10 kernels for SSCAL, DSCAL, CSCAL, ZSCAL
    • Added optimized POWER10 kernels for SROT, DROT, CDOT, SASUM, DASUM
    • improved SSWAP, DSWAP, CSWAP, ZSWAP performance on POWER10
    • improved SCOPY and CCOPY performance on POWER10
    • improved SGEMM and DGEMM performance on POWER10
    • Added support for compilation with the NVIDIA HPC compiler


    • Added an optimized bfloat16 GEMM kernel for Cooperlake
    • Added CPUID autodetection for Intel Rocket Lake and Tiger Lake cpus
    • improved the performance of SASUM,DASUM,SROT,DROT on AMD Ryzen cpus
    • Added support for compilation with the NAG Fortran compiler
    • Fixed recognition of the AMD AOCC compiler
    • Fixed compilation for DYNAMIC_ARCH with clang on Windows
    • Added support for running the BLAS/CBLAS tests on Windows
    • Fixed signatures of the tls callback functions for Windows x64
    • Fixed various issues with fma intrinsics support handling


    • Support compilation for embedded Cortex M4 targets via a new option EMBEDDED


    • Fixed the THUNDERX2T99 and NEOVERSEN1 DNRM2/ZNRM2 kernels for inputs with Inf
    • Added support for the DYNAMIC_LIST option
    • Added support for compilation with the NVIDIA HPC compiler
    • Added support for compiling with the NAG Fortran compiler

    md5sum: a5aa1d61d4b27f471dc60c40c11e61fe OpenBLAS-0.3.14.tar.gz f8fe13f5ebf9c4c487784f4e6a7b1a56

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
    OpenBLAS-0.3.14.tar.gz(11.88 MB)
  • v0.3.13(Dec 12, 2020)


    • Added a generic bfloat16 SBGEMV kernel
    • Fixed a potentially severe memory leak after fork in OpenMP builds that was introduced in 0.3.12
    • Added detection of the Fujitsu Fortran compiler
    • Added detection of the (e)gfortran compiler on OpenBSD
    • Added support for overriding the default name of the library independently from symbol suffixing in the gmake builds (already supported in cmake)

    RISC V:

    • Added a RISC V port optimized for C910V


    • Added optimized POWER10 kernels for SAXPY, CAXPY, SDOT, DDOT and DGEMV_N
    • Improved DGEMM performance on POWER10
    • Improved STRSM and DTRSM performance on POWER9 and POWER10
    • Fixed segmentation faults in DYNAMIC_ARCH builds
    • Fixed compilation with the PGI compiler


    • Fixed compilation of kernels that require SSE2 intrinsics since 0.3.12


    • Added an optimized bfloat16 SBGEMV kernel for SkylakeX and Cooperlake
    • Improved the performance of SASUM and DASUM kernels through parallelization
    • Improved the performance of SROT and DROT kernels
    • Improved the performance of multithreaded xSYRK
    • Fixed OpenMP builds that use the LLVM Clang compiler together with GNU gfortran (where linking of both the LLVM libomp and GNU libgomp could lead to lockups or wrong results)
    • Fixed miscompilations by old gcc 4.6
    • Fixed misdetection of AVX2 capability in some Sandybridge cpus
    • Fixed lockups in builds combining DYNAMIC_ARCH with TARGET=GENERIC on OpenBSD


    • Fixed segmentation faults in DYNAMIC_ARCH builds


    • Improved kernels for Loongson 3R3 ("3A") and 3R4 ("3B") models, including MSA
    • Fixed bugs in the MSA kernels for CGEMM, CTRMM, CGEMV and ZGEMV
    • Added handling of zero increments in the MSA kernels for SSWAP and DSWAP
    • Added DYNAMIC_ARCH support for MIPS64 (currently Loongson3R3/3R4 only)


    • Fixed building 32 and 64 bit SPARC kernels with the SolarisStudio compilers

    md5sum: 2ca05b9cee97f0d1a8ab15bd6ea2b747 OpenBLAS-0.3.13.tar.gz ab433ae7ed37ad282a67c2cfcc7c4301 855469f768c6e32cf68f9cdb6f5fa69e 467463847f57f54b94242fb6393a0bf9

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
    OpenBLAS-0.3.13.tar.gz(11.86 MB)
  • v0.3.12(Oct 24, 2020)


    • Fixed missing BLAS/LAPACK functions (inadvertently dropped during the build system restructuring to support selective compilation)
    • Fixed argument conversion macro in LAPACKE_zgesvdq (LAPACK #458)


    • Added optimized SCOPY/CCOPY kernels for POWER10
    • Increased and unified the default size of the GEMM buffer
    • Fixed building for POWER10 in DYNAMIC_ARCH mode
    • POWER10 compatibility test now checks binutils version as well
    • Cleaned up compiler warnings


    • corrected compiler version checks for AVX2 compatibility
    • added compiler option -mavx2 for building with flang
    • fixed direct SGEMM pathway for small matrix sizes (broken by the code refactoring in 0.3.11)
    • fixed unhandled partial register clobbers in several kernels for AXPY, DOT, GEMV_N and GEMV_T flagged by gcc10 tree-vectorizer


    • improved Apple Vortex support to include cross-compiling

    Download OpenBLAS

    md5sums: 03bff4558fc701b7d0e689814055ecb2 baf8c58c0ef6ebe0f9eb74a5c4acd662 OpenBLAS-0.3.12.tar.gz 4df4ebb7b5c4f1b5ec8fa58f48be6a51

    Source code(tar.gz)
    Source code(zip)
    OpenBLAS-0.3.12.tar.gz(11.75 MB)
  • v0.3.11(Oct 17, 2020)

    NOTE: unfortunately there appear to be several defects in this version. It should not be redistributed or used in a production environment.


    • API change:

          the newly added BFLOAT16 functions were renamed to use the
          letter "B" instead of "H" to avoid potential confusion with
          the IEEE "half precision float" type, i.e. the 0.3.10
          SHGEMM is now SBGEMM and the corresponding build option
          was changed from "BUILD_HALF" to "BUILD_BFLOAT16".
    • Reduced the default BLAS3_MEM_ALLOC_THRESHOLD (used as an upper limit for placing temporary arrays on the stack) to be compatible with a stack size of 1mb (as imposed by the JAVA runtime library)
    • Added mixed-precision dot function SBDOT and utility functions sbstobf16, sbdtobf16, sbf16tos and dbf16tod to convert between single or double precision float arrays and bfloat16 arrays
    • Fixed prototypes of LAPACK_?ggsvp and LAPACK_?ggsvd functions in lapack.h
    • Fixed underflow and rounding errors in LAPACK SLANV2 and DLANV2 (causing miscalculations in e.g. SHSEQR/DHSEQR, LAPACK issue #263)
    • Fixed workspace calculation in LAPACK ?GELQ (LAPACK issue #415)
    • Fixed several bugs in the LAPACK testsuite
    • Improved performance of TRMM and TRSM for certain problem sizes
    • Fixed infinite recursions and workspace miscalculations in ReLAPACK
    • CMAKE builds no longer require pkg-config for creating the .pc file
    • Makefile builds no longer misread NO_CBLAS=0 or NO_LAPACK=0 as enabling these options
    • Fixed detection of gfortran when invoked through an mpi wrapper
    • Improve thread reinitialization performance with OpenMP after a fork
    • Added support for building only the subset of the library required for a particular precision by specifying BUILD_SINGLE, BUILD_DOUBLE
    • Optional function name prefixes and suffixes are now correctly reflected in the generated cblas.h
    • Added CMAKE build support for the LAPACK and multithreading tests
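
The bfloat16 entries in the list above distinguish the format from IEEE half precision: a bfloat16 value is simply the upper 16 bits of a single-precision float (same 8-bit exponent, 7-bit mantissa). The following is a minimal sketch of such a conversion in plain C. The names `bf16_bits`, `float_to_bf16`, `bf16_to_float` and `float_array_to_bf16` are hypothetical and are not part of the OpenBLAS API, and the round-to-nearest-even policy is an assumption; the library's own conversion routines may round differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical storage type for this sketch: the upper 16 bits of a float32. */
typedef uint16_t bf16_bits;

/* float32 -> bfloat16, rounding to nearest with ties to even (assumed policy). */
static bf16_bits float_to_bf16(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);        /* well-defined type punning */

    if ((bits & 0x7fffffffu) > 0x7f800000u) {  /* NaN: return a quiet NaN */
        return (bf16_bits)((bits >> 16) | 0x0040u);
    }
    /* Add 0x7fff, plus 1 when the lowest kept bit is set, then truncate. */
    uint32_t rounding = 0x7fffu + ((bits >> 16) & 1u);
    return (bf16_bits)((bits + rounding) >> 16);
}

/* bfloat16 -> float32 is exact: the low 16 mantissa bits become zero. */
static float bf16_to_float(bf16_bits h)
{
    uint32_t bits = (uint32_t)h << 16;
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Convert an array of n floats; in and out must each hold n elements. */
static void float_array_to_bf16(size_t n, const float *in, bf16_bits *out)
{
    assert(n == 0 || (in != NULL && out != NULL));
    for (size_t i = 0; i < n; ++i) {
        out[i] = float_to_bf16(in[i]);
    }
}
```

Because the exponent field is unchanged, the conversion keeps the full float32 range and only gives up mantissa precision, which is the practical difference from the IEEE half type that the renaming from "H" to "B" was meant to make clear.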


    • Added optimized support for POWER10
    • Added support for compiling for POWER8 in 32bit mode
    • Added support for compilation with LLVM/clang
    • Added support for compilation with NVIDIA/PGI compilers
    • Fixed building on big-endian POWER8
    • Fixed miscompilation of ZDOTC by gcc10
    • Fixed alignment errors in the POWER8 SAXPY kernel
    • Improved CPU detection on AIX
    • Supported building with older compilers on POWER9


    • Added support for Intel Cooperlake
    • Added autodetection of AMD Renoir/Matisse/Zen3 cpus
    • Added autodetection of Intel Comet Lake cpus
    • Reimplemented ?sum, ?dot and daxpy using universal intrinsics
    • Reset the fpu state before using the fpu on Windows as a workaround for a problem introduced in Windows 10 build 19041 (a.k.a. SDK 2004)
    • Fixed potentially undefined behaviour in the dot and gemv_t kernels
    • Fixed a potential segmentation fault in DYNAMIC_ARCH builds
    • Fixed building for ZEN with PGI/NVIDIA and AMD AOCC compilers


    • Fixed cpu detection on BSD-like systems


    • Added preliminary support for Apple Vortex cpus
    • Added support for the Cavium ThunderX3T110 cpu
    • Fixed cpu detection on BSD-like systems
    • Fixed compilation in -std=C18 mode

    IBM Z:

    • Added support for compiling with the clang compiler
    • Improved GEMM performance on Z14

    Download OpenBLAS

    md5sums: dd211b73398383a44ebd75fffabd937a OpenBLAS-0.3.11.tar.gz a76bfee7c125071bce6b24eae5b07468 bad36be9fe4fe40372b06d326cfc5a2f

    Source code(tar.gz)
    Source code(zip)
    OpenBLAS-0.3.11.tar.gz(11.75 MB)
  • v0.3.10(Jun 14, 2020)


    • Improved thread locking behaviour in blas_server and parallel getrf
    • Imported bugfix 394 from LAPACK (spurious reference to "XERBL" due to overlong lines)
    • Imported bugfix 403 from LAPACK (compile option "recursive" required for correctness with Intel and PGI)
    • Imported bugfix 408 from LAPACK (wrong scaling in ZHEEQUB)
    • Imported bugfix 411 from LAPACK (infinite loop in LARGV/LARTG/LARTGP)
    • Fixed mismatches between BUFFERSIZE and GEMM_UNROLL parameters that could lead to crashes at large matrix sizes
    • Restored internal soname in dynamic libraries on FreeBSD and Dragonfly
    • Added API (openblas_setaffinity) to set thread affinity programmatically on Linux
    • Added initial infrastructure for half-precision floating point (bfloat16) support with a generic implementation of SHGEMM
    • Added CMAKE build system support for building the cblas_Xgemm3m functions
    • Fixed CMAKE support for building in a path with embedded spaces
    • Fixed CMAKE (non)handling of NO_EXPRECISION and MAX_STACK_ALLOC
    • Fixed GCC version detection in the Makefiles
    • Allowed overriding the names of AR, AS and LD in Makefile builds


    • fixed big-endian POWER8 ELFv2 builds on FreeBSD
    • Fixed GCC version checks and DYNAMIC_ARCH builds on POWER9
    • Fixed CMAKE build support for POWER9
    • fixed a potential race condition in the thread buffer allocation
    • Worked around LAPACK test failures on PPC G4


    • fixed a potential race condition in the thread buffer allocation
    • Added support for MIPS 24K/24KE family based on P5600 kernels


    • fixed a potential race condition in the thread buffer allocation


    • fixed a race condition in the thread buffer allocation


    • Fixed a race condition in the thread buffer allocation
    • Fixed zero initialisation in the assembly for SGEMM and DGEMM BETA
    • Improved performance of the ThunderX2 DAXPY kernel
    • Added an optimized SGEMM kernel for Cortex A53
    • Fixed Makefile support for INTERFACE64 (8-byte integer)


    • Fixed a syntax error in the CMAKE setup for SkylakeX
    • Improved performance of STRSM on Haswell, SkylakeX and Ryzen
    • Improved SGEMM performance for workloads with ldc a multiple of 1024
    • Improved DGEMM performance on Skylake X
    • Fixed unwanted AVX512-dependency of SGEMM in DYNAMIC_ARCH builds created on SkylakeX
    • Removed data alignment requirement in the SSE2 copy kernels that could cause spurious crashes
    • Added a workaround for an optimizer bug in AppleClang 11.0.3
    • Fixed LAPACK-TEST failures with Intel Fortran
    • Fixed compilation and LAPACK test results with recent Flang and AMD AOCC
    • Fixed DYNAMIC_ARCH builds with CMAKE on OS X
    • Fixed missing exports of cblas_i?amin, cblas_i?min, cblas_i?max, cblas_?sum, cblas_?gemm3m in the shared library on OS X
    • Fixed reporting of cpu name in DYNAMIC_ARCH builds (would sometimes show the name of an older generation chip supported by the same kernels)

    IBM Z:

    • Improved performance of SGEMM/STRMM and DGEMM/DTRMM on Z14

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
  • v0.3.9(Mar 1, 2020)


    • Fixed a miscompilation of the GETRF functions with CMAKE
    • The size of the memory buffer used for splitting GEMM tasks across multiple threads can now be configured in the build system.
    • Imported bugfix 390 from LAPACK (missing NaN propagation in xCOMBSSQ)


    • fixed several compilation problems related to endianness and ELF version support on POWER8 and POWER9.
    • fixed misuse of the absolute value IAMIN/IAMAX in place of IMIN/IMAX
    • fixed a race condition in the level3 blas code


    • fixed misuse of the absolute value IAMIN/IAMAX in place of IMIN/IMAX
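
The IAMIN/IMIN mix-up fixed above comes down to whether the comparison is made on absolute values or on signed values. A minimal plain-C sketch of the two semantics follows, assuming the usual BLAS convention of 1-based result indices and a return value of 0 for empty or invalid input. The functions `ref_idamin` and `ref_idmin` are hypothetical reference loops written for illustration, not the OpenBLAS kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Local absolute-value helper so the sketch needs no math library. */
static double abs_value(double v)
{
    return v < 0.0 ? -v : v;
}

/* IxAMIN-style: 1-based index of the element with the smallest |x[i]|. */
static int ref_idamin(int n, const double *x, int incx)
{
    if (n < 1 || incx < 1) return 0;
    assert(x != NULL);
    int best = 1;
    double best_val = abs_value(x[0]);
    for (int i = 1; i < n; ++i) {
        double v = abs_value(x[(size_t)i * (size_t)incx]);
        if (v < best_val) {
            best_val = v;
            best = i + 1;
        }
    }
    return best;
}

/* IxMIN-style: 1-based index of the smallest signed element. */
static int ref_idmin(int n, const double *x, int incx)
{
    if (n < 1 || incx < 1) return 0;
    assert(x != NULL);
    int best = 1;
    double best_val = x[0];
    for (int i = 1; i < n; ++i) {
        double v = x[(size_t)i * (size_t)incx];
        if (v < best_val) {
            best_val = v;
            best = i + 1;
        }
    }
    return best;
}
```

For the vector {-3.0, 1.0, -0.5, 2.0} the two functions disagree: the absolute-value variant picks the third element and the signed variant picks the first, so substituting one for the other silently changes results whenever negative entries are present.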


    • fixed a race condition in the level3 blas code
    • fixed a compilation problem on Android


    • Added support for Ampere EMAG8180
    • Added support for Neoverse N1
    • improved performance of the blas_lock function
    • fixed a race condition in the level3 blas code
    • Fixed a performance regression on TSV110 servers


    • Fixed a long-standing error with undeclared register clobbers in the DSCAL microkernel for Haswell, SkylakeX and Zen exposed by gcc9.2
    • Fixed a long-standing bug in the SSE implementation of the IAMAX functions
    • Fixed a cmake build failure with DYNAMIC_ARCH on x86_64
    • Fixed an oversight in the cpu detection code for Intel Goldmont+, Cannon Lake and Ice Lake
    • Fixed compile failure on OSX when the compiler name contains a dash (e.g. gcc-9)
    • Fixed compilation with MinGW on SkylakeX
    • Improved speed of the AVX512 GEMM3M code, added an AVX512 kernel for STRMM and improved performance of the AVX2 GEMM kernels

    IBM Z:

    • fixed compilation of the DYNAMIC_ARCH code

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
  • v0.3.8(Feb 9, 2020)


    - LAPACK has been updated to 3.9.0 (plus patches up to January 2nd, 2020)
    - CMAKE support has been improved in several areas including cross-compilation
    - a thread race condition in the GEMM3M kernels was resolved
    - the "generic" (plain C) gemm beta kernel used by many targets has been sped up
    - an optimized version of the LAPACK trtrs functions has been added
    - an incompatibility between the LAPACK tests and the OpenBLAS implementation of XERBLA
      was resolved, removing the numerous warnings about wrong error exits in the former 
    - support for NetBSD has been added
    - support for compilation with g95 and non-GNU versions of ld has been improved
    - compilation with (upcoming) gcc 10 is now supported


    - worked around miscompilation of several POWER8 and POWER9 kernels by
      older versions of gcc
    - added support for big-endian POWER8 and for compilation on AIX
    - corrected bugs in the big-endian support for PPC440 and PPC970
    - DYNAMIC_ARCH support is now available in CMAKE builds as well


    - performance of DGEMM_BETA and SGEMM_NCOPY has been improved
    - compilation for 32bit works again 
    - performance of the RPCC function has been improved
    - improved performance on small systems
    - DYNAMIC_ARCH support is now available in CMAKE builds as well
    - cross-compilation from OSX to IOS was simplified


    - a new AVX512 DGEMM kernel was added and the AVX512 SGEMM kernel was
      significantly improved
    - optimized AVX512 kernels for CGEMM and ZGEMM have been added
    - AVX2 kernels for STRMM, SGEMM, and CGEMM have been significantly
      sped up and optimized CGEMM3M and ZGEMM3M kernels have been added 
    - added support for QEMU virtual cpus
    - a compilation problem with PGI and SUN compilers was fixed
    - Intel "Goldmont plus" is now autodetected
    - a potential crash on program exit on MS Windows has been fixed 


    - an unwanted case sensitivity in the implementation of LSAME
      on older 32bit AMD cpus was fixed

    IBM Z:

    - Z15 is now supported as Z14
    - DYNAMIC_ARCH is now available on ZARCH as well

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
  • v0.3.7(Aug 11, 2019)


    • having the gmake special variables TARGET_ARCH or TARGET_MACH defined no longer causes build failures in ctest or utest
    • defining NO_AFFINITY or USE_TLS to zero in gmake builds no longer has the same effect as setting them to one
    • a new test program was added to allow checking the library for thread safety
    • a new option USE_LOCKING was added to ensure thread safety when OpenBLAS itself is built without multithreading but will be called from multiple threads.
    • a build failure on Linux with glibc versions earlier than 2.5 was fixed
    • a runtime error with CPU enumeration (and NO_AFFINITY not set) on glibc 2.6 was fixed
    • NO_AFFINITY was added to the CMAKE options (and defaults to being active on Linux, as in the gmake builds)


    • the build-time logic for detection of AVX512 availability in the processor and compiler was fixed
    • gmake builds on OSX now set the internal name of the library to libopenblas.0.dylib (consistent with CMAKE)
    • the Haswell DGEMM kernel received a significant speedup through improved prefetch and load instructions
    • performance of DGEMM, DTRMM, DTRSM and ZDOT on Zen/Zen2 was markedly increased by avoiding vpermpd instructions
    • the SKYLAKEX (AVX512) DGEMM helper functions have now been disabled to fix remaining errors in DGEMM, DSYMM and DTRMM


    • added support for building on FreeBSD/powerpc64 and FreeBSD/ppc970
    • added optimized kernels for POWER9 single and double precision complex BLAS3
    • added optimized kernels for POWER9 SGEMM and STRMM


    • fixed the softfp implementations of xAMAX and IxAMAX
    • removed the predefined -march= flags on both ARMV5 and ARMV6 as they were appropriate for only a subset of platforms

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
  • v0.3.6(Apr 29, 2019)


    - the build tools now check that a given cpu TARGET is actually valid
    - the build-time check of system features (c_check) has been made
      less dependent on particular perl features (this should mainly
      benefit building on Windows)
    - several problems with ReLAPACK and its integration were fixed,
      including INTERFACE64 support and building a shared library
    - building with CMAKE on BSD systems was improved
    - a non-absolute SUM function was added based on the
      existing optimized code for ASUM
    - CBLAS interfaces to the IxMIN and IxMAX functions were added
    - a name clash between LAPACKE and BOOST headers was resolved
    - CMAKE builds with OpenMP failed to include the appropriate getrf_parallel
    - a crash on thread (key) deletion with the USE_TLS=1 memory management
      option was fixed
    - restored several earlier fixes, in particular for OpenMP performance,
      building on BSD, and calling fork on CYGWIN, which had inadvertently
      been dropped in the 0.3.3 rewrite of the memory management code.


    - single precision BLAS1/2 functions have received optimized POWER8 kernels
    - POWER9 is now a separate target, with an optimized DGEMM/DTRMM kernel
    - building on PPC970 systems under OSX Leopard or Tiger is now supported
    - out-of-bounds memory accesses in the gemm_beta microkernels were fixed
    - building a shared library on AIX is now supported for POWER6
    - DYNAMIC_ARCH support has been added for POWER6 and newer


    - corrected xDOT behaviour with zero INC_X or INC_Y 
    - a bug in the IMIN implementation made it return the result of IMAX


    - added support for HiSilicon TSV110 cpus
    - the CMAKE build system now recognizes 32bit userspace on 64bit hardware 
    - cross-compilation with CMAKE now works again
    - a bug in the IMIN implementation made it return the result of IMAX
    - ARMV8 builds with the BINARY=32 option are now automatically handled as ARMV7


    - the AVX512 DGEMM kernel has been disabled again due to unsolved problems
    - building with old versions of MSVC was fixed
    - it is now possible to build a static library on Windows with CMAKE
    - accessing environment variables on CYGWIN at run time was fixed
    - the CMAKE build system now recognizes 32bit userspace on 64bit hardware
    - Intel "Denverton" atom and Hygon "Dhyana" zen CPUs are now autodetected
    - building for DYNAMIC_ARCH with a DYNAMIC_LIST of targets is now supported
      with CMAKE as well
    - building for DYNAMIC_ARCH with GENERIC as the default target is now supported
    - a buffer overflow in the SSE GEMM kernel for Intel Nano targets was fixed
    - assembly bugs involving undeclared modification of input operands were fixed
      in the AXPY, DOT, GEMV, GER, SCAL, SYMV and TRSM microkernels for Nehalem, 
      Sandybridge, Haswell, Bulldozer and Piledriver. These would typically cause
      test failures or segfaults when compiled with recent versions of gcc from 8 onward.
    - a similar bug was fixed in the blas_quickdivide code used to split workloads
      in most functions
    - a bug in the IxMIN implementation for the GENERIC target made it return the result of IxMAX
    - fixed building on SkylakeX systems when either the compiler or the (emulated) operating 
      environment does not support AVX512
    - improved GEMM performance on ZEN targets


    - build failures caused by the recently added checks for AVX512 were fixed
    - an inline assembly bug involving undeclared modification of an input argument was
      fixed in the blas_quickdivide code used to split workloads in most functions
    - a bug in the IMIN implementation for the GENERIC target made it return the result of IMAX


    - a bug in the IMIN implementation made it return the result of IMAX

    IBM Z:

    - optimized microkernels for single precision BLAS1/2 functions have been added for Z13 and Z14

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
  • v0.3.5(Dec 31, 2018)


    • loop unrolling in TRMV has been enabled again.
    • A domain error in the thread workload distribution for SYRK has been fixed.
    • gmake builds will now automatically add -fPIC to the build options if the platform requires it.
    • a pthreads key leakage (and associated crash on dlclose) in the USE_TLS codepath was fixed.
    • building of the utest cases on systems that do not provide an implementation of complex.h was fixed.


    • the SkylakeX code was changed to compile on OSX.
    • unwanted application of the -march=skylake-avx512 option to the common code parts of a DYNAMIC_ARCH build was fixed.
    • improved performance of SGEMM for small workloads on Skylake X.
    • performance of SGEMM and DGEMM was improved on Haswell.


    • a configuration error that broke the CNRM2 kernel was corrected.
    • compilation of the GEMM kernels with CMAKE was fixed.
    • DYNAMIC_ARCH builds are now available with CMAKE as well.
    • using CMAKE for cross-compilation to the new cpu TARGETs introduced in 0.3.4 now works.


    • a problem in cpu autodetection for AIX has been corrected.

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
  • v0.3.4(Dec 2, 2018)


    • the new, experimental thread-local memory allocation had inadvertently been left enabled for gmake builds in 0.3.3 despite the announcement. It is now disabled by default, and single-threaded builds will keep using the old allocator even if the USE_TLS option is turned on.
    • OpenBLAS will now provide enough buffer space for at least 50 threads by default.
    • The output of openblas_get_config() now contains the version number.
    • A serious thread safety bug in GEMV operation with small M and large N size has been fixed.
    • The code will now automatically call blas_thread_init after a fork if needed before handling a call to openblas_set_num_threads
    • Accesses to parallelized level3 functions from multiple callers are now serialized to avoid thread races (unless using OpenMP). This should provide better performance than the known-threadsafe (but non-default) USE_SIMPLE_THREADED_LEVEL3 option.
    • When building LAPACK with gfortran, -frecursive is now (again) enabled by default to ensure correct behaviour.
    • The OpenBLAS version of cblas.h now supports both CBLAS_ORDER and CBLAS_LAYOUT as the name of the matrix row/column order option.
    • Externally set LDFLAGS are now passed through to the final compile/link steps to facilitate setting platform-specific linker flags.
    • A potential race condition during the build of LAPACK (that would usually manifest itself as a failure to build TESTING/MATGEN) has been fixed.
    • xHEMV has been changed to stay single-threaded for small input sizes where the overhead of multithreading exceeds any possible gains
    • CSWAP and ZSWAP have been limited to a single thread except on ARMV8 or ThunderX hardware with sizable input.
    • Linker flags for the PGI compiler have been updated
    • Behaviour of AXPY with zero increments is now handled in the C interface, correcting the result on at least Intel Atom.
    • The result matrix from calling SGELSS with an all-zero input matrix is now zeroed completely.
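
The AXPY entry in the list above concerns zero increments, where every step reads from or writes to the same element. A minimal sketch of a strided AXPY loop in plain C illustrates what that means, assuming the indexing convention of the netlib reference BLAS (negative increments start from the far end of the vector). `ref_daxpy` is a hypothetical illustration, not the OpenBLAS interface code, and the assumption that the fix matches this reference behaviour is not confirmed by the changelog.

```c
#include <assert.h>
#include <stddef.h>

/* y := alpha * x + y over n logical elements with strides incx and incy. */
static void ref_daxpy(int n, double alpha,
                      const double *x, int incx,
                      double *y, int incy)
{
    if (n <= 0 || alpha == 0.0) return;
    assert(x != NULL && y != NULL);

    /* Reference convention: a negative stride walks the vector backwards. */
    long ix = incx < 0 ? (long)(1 - n) * incx : 0;
    long iy = incy < 0 ? (long)(1 - n) * incy : 0;

    for (int i = 0; i < n; ++i) {
        /* With incy == 0 all n updates accumulate into the same y element;
           with incx == 0 the same x element is added n times. */
        y[iy] += alpha * x[ix];
        ix += incx;
        iy += incy;
    }
}
```

An optimized kernel that processes several elements per iteration can easily lose the repeated accumulation when a stride is zero, which is one plausible reason for handling that case before dispatching to a kernel.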


    • Autodetection of AMD Ryzen2 has been fixed (again).
    • CMAKE builds now support labeling of an INTERFACE64=1 build of the library with the _64 suffix.
    • AVX512 version of DGEMM has been added and the AVX512 SGEMM kernel has been sped up by rewriting with C intrinsics
    • Fixed compilation on RHEL5/CENTOS5 (issue with typename __WAIT_STATUS)


    • added support for building on AIX (with gcc and GNU tools from AIX Toolbox).
    • CPU type detection has been implemented for AIX.
    • CPU type detection has been fixed for NETBSD.


    • AXPY on LOONGSON3A has been corrected to pass "zero increment" utest.
    • DSDOT on LOONGSON3A has been fixed.
    • the SGEMM microkernel has been hardened against potential data loss.


    • DYNAMIC_ARCH support is now available for 64bit ARM
    • cross-compiling for ARMV8 under iOS now works.
    • cpu-specific code has been rearranged to make better use of both hardware commonalities and model-specific compiler optimizations.
    • XGENE1 has been removed as a TARGET, superseded by the improved generic ARMV8 support.


    • Older assembly mnemonics have been converted to UAL form to allow building with clang 7.0
    • Cross compiling LAPACKE for Android has been fixed again (broken by update to LAPACK 3.7.0 some while ago).

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
  • v0.3.3(Aug 30, 2018)


    • thread memory allocation has been switched back to the method used before version 0.3.1 due to unexpected problems caused by the new code under some circumstances. A new compile-time option USE_TLS has been added to allow enabling the new code instead, and it is hoped that this can become the default again in the next version.
    • LAPACK PR272 has been integrated, which fixes spurious errors in DSYEVR and related functions caused by missing conversion from ILAENV to ILAENV_2STAGE in several _2stage routines.
    • the cmake-generated OpenBLASConfig.cmake now uses correct case for the name of the library
    • added support for Haiku OS


    • added AVX512 implementations of SDOT, DDOT, SAXPY, DAXPY, DSCAL, DGEMVN and DSYMVL
    • added a workaround for a cygwin issue that prevented compilation of AVX512 code

    IBM Z:

    • added autodetection of Z14
    • fixed TRMM errors in the generic target

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
  • v0.3.2(Jul 30, 2018)


    • fixes for regressions caused by the rewrite of the thread initialization code in 0.3.1


    • added autodetection of AMD Ryzen 2
    • fixed build with older versions of MSVC


    • fixed cpu autodetection for the BSDs


    • fixed utest errors in AXPY, DSDOT, ROT and SWAP

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Jul 1, 2018)


    • rewritten thread initialization code with significantly reduced overhead
    • added CBLAS interfaces to the IxAMIN BLAS extension functions
    • fixed the lapack-test target
    • CMAKE builds now create an OpenBLASConfig.cmake file
    • ZAXPY now uses a single thread for small input sizes
    • the LAPACK code was updated from


    • corrected CROT and ZROT behaviour with zero INC_X


    • corrected xDOT behaviour with zero INC_X or INC_Y


    • retired some older targets of DYNAMIC_ARCH builds to a new option DYNAMIC_OLDER, this affects PENRYN,DUNNINGTON,OPTERON,OPTERON_SSE3,BOBCAT,ATOM and NANO (which will still be supported via the slower PRESCOTT kernels when this option is not set)
    • added an option DYNAMIC_LIST that (used in conjunction with DYNAMIC_ARCH) allows specifying the list of x86_64 targets to include. Any target not on the list will be supported by the Sandybridge or Nehalem kernels if available, or by Prescott.
    • improved SWITCH_RATIO on Haswell for increased GEMM throughput
    • added initial support for Intel Skylake X, including an AVX512 SGEMM kernel
    • added autodetection of Intel Cannon Lake series as Skylake X
    • added a default L2 cache size for hypervisors that return zero here (Chromebook)
    • fixed a name clash with recent Windows10 headers that broke the build with (at least) recent mingw from MSYS2
    • fixed a link error in mixed clang/gfortran builds with OpenMP
    • updated the OSX deployment target to 10.8
    • switched on parallel make for builds on MS Windows by default


    • fixed SSWAP and DSWAP behaviour with zero INC_X and INC_Y

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
  • v0.3.0(May 23, 2018)


    * fixed some more thread race and locking bugs
    * added preliminary support for calling an OpenMP build of the library from multiple threads
    * removed performance impact of thread locks added in 0.2.20 on OpenMP code
    * general code cleanup 
    * optimized DSDOT implementation
    * improved thread distribution for GEMM
    * corrected IMATCOPY/OMATCOPY implementation
    * fixed out-of-bounds accesses in the multithreaded xBMV/xPMV and SYMV implementations
    * cmake build improvements
    * pkgconfig file now contains build options
    * openblas_get_config() now reports USE_OPENMP and NUM_THREADS settings used for the build
    * corrections and improvements for systems with more than 64 cpus
    * LAPACK code updated to 3.8.0 including later fixes
    * added ReLAPACK, a recursive implementation of several LAPACK functions
    * Rewrote ROTMG to handle cases that the netlib code failed to address
    * Disabled (broken) multithreading code for xTRMV
    * corrected prototypes of complex CBLAS functions to make our cblas.h match the generally accepted standard
    * shared memory access failures on startup are now handled more gracefully
    * restored utests from earlier releases (and made them pass on all affected systems)


    * several fixes for cpu autodetection


    * corrected vector register overwriting in several Power8 kernels
    * optimized additional BLAS functions


    * added support for CortexA53 and A72 
    * added autodetection for ThunderX2T99
    * made most optimized kernels the default for generic ARMv8 targets 


    * parallelized DDOT kernel for Haswell
    * changed alignment directives in assembly kernels to boost performance on OSX
    * fixed register handling in the GEMV microkernels (bug exposed by gcc7)
    * added support for building on OpenBSD and Dragonfly 
    * updated compiler options to work with Intel release 2018
    * support fully optimized build with clang/flang on Microsoft Windows
    * fixed building on AIX

    IBM Z:

    * added optimized BLAS 1/2 functions


    * fixed cpu autodetection helper code
    * added mips32 1004K cpu (Mediatek MT7621 and similar SoC)
    * added mips64 I6500 cpu

    Download OpenBLAS

    Source code(tar.gz)
    Source code(zip)
  • v0.2.20(Jul 24, 2017)

    Version 0.2.20 24-Jul-2017


        * Improved CMake support
        * Fixed several thread race and locking bugs
        * Fixed default LAPACK optimization level
        * Updated LAPACK to 3.7.0
        * Added ReLAPACK (make BUILD_RELAPACK=1)


        * Optimizations for Power9
        * Fixed several Power8 assembly bugs


        * New optimized Vulcan and ThunderX2T99 targets
        * Support for ARMV7 SOFT_FP ABI  (make ARM_SOFTFP_ABI=1)
        * Detect all cpu cores including offline ones
        * Fix compilation with CLANG
        * Support building a shared library for Android


        * Fixed several threading issues
        * Fix compilation with CLANG


        * Detect Intel Bay Trail and Apollo Lake
        * Detect Intel Sky Lake and Kaby Lake
        * Detect Intel Knights Landing
        * Detect AMD A8, A10, A12 and Ryzen
        * Support 64bit builds with Visual Studio
        * Fix building with Intel and PGI compilers
        * Fix building with MINGW and TDM-GCC
        * Fix cmake builds for Haswell and related cpus
        * Fix building for Sandybridge with CLANG 3.9
        * Add support for the FLANG compiler

    IBM Z:

        * New target z13 with BLAS3 optimizations

    Download OpenBLAS 0.2.20

    Source code(tar.gz)
    Source code(zip)
  • v0.2.19(Sep 1, 2016)

    Version 0.2.19 1-Sep-2016


        * Improved cross compiling.
        * Fix the bug on musl libc.


        * Optimize BLAS on Power8
        * Fixed Julia+OpenBLAS bugs on Power8


        * Optimize BLAS on MIPS P5600 and I6400 (Thanks, Shivraj Patil, Kaustubh Raste)


        * Improved on ARM Cortex-A57. (Thanks, Ashwin Sekhar T K)

    Download OpenBLAS 0.2.19

    Source code(tar.gz)
    Source code(zip)
  • v0.2.18(Apr 12, 2016)

    Version 0.2.18 12-Apr-2016


    • If the MAKE_NB_JOBS flag is set to a value less than or equal to zero, make runs without -j.


    • Support building Visual Studio static library. (#813, Thanks, theoractice)
    • Fix bugs to pass buildbot CI tests.


    • Provide DGEMM 8x4 kernel for Cortex-A57 (Thanks, Ashwin Sekhar T K)


    • Optimize S and C BLAS3 on Power8
    • Optimize BLAS2/1 on Power8

  • v0.2.17(Mar 21, 2016)

    Version 0.2.17 20-Mar-2016


    • Enable BUILD_LAPACK_DEPRECATED=1 by default.

  • v0.2.16(Mar 15, 2016)

    Version 0.2.16 15-Mar-2016


    • Upgrade LAPACK to 3.6.0 version. Add BUILD_LAPACK_DEPRECATED option in Makefile.rule to build LAPACK deprecated functions.
    • Add MAKE_NB_JOBS option in Makefile to force the number of make jobs. This is particularly useful when using distcc. (#735. Thanks, Jerome Robert.)
    • Redesign unit test. Run unit/regression test at every build (Travis-CI and Appveyor).
    • Disable multi-threading for small size swap and ger. (#744. Thanks, Jerome Robert)
    • Improve small zger, zgemv, ztrmv using stack allocation (#727. Thanks, Jerome Robert)
    • Let openblas_get_num_threads return the number of active threads. (#760. Thanks, Jerome Robert)
    • Support illumos(OmniOS). (#749. Thanks, Lauri Tirkkonen)
    • Fix LAPACK Dormbr, Dormlq bug. (#711, #713. Thanks, Brendan Tracey)
    • Update scipy benchmark script. (#745. Thanks, John Kirkham)
    • Avoid potential getenv segfault. (#716)
    • Import LAPACK svn bugfix #142-#147,#150-#155


    • Optimize trsm kernels for AMD Bulldozer, Piledriver, Steamroller.
    • Detect Intel Avoton.
    • Detect AMD Trinity, Richland, E2-3200.
    • Fix gemv performance bug on Mac OSX Intel Haswell.
    • Fix some bugs with CMake and Visual Studio
    • Optimize c/zgemv for AMD Bulldozer, Piledriver, Steamroller
    • Fix bug with scipy linalg test.


    • Support and optimize Cortex-A57 AArch64. (#686. Thanks, Ashwin Sekhar TK)
    • Fix Android build on ARMV7 (#778. Thanks, Paul Mustiere)
    • Update ARMV6 kernels.
    • Improve DGEMM for ARM Cortex-A57. (Thanks, Ashwin Sekhar T K)


    • Fix detection of POWER architecture (#684. Thanks, Sebastien Villemot)
    • Optimize D and Z BLAS3 functions for Power8.

  • v0.2.15(Oct 27, 2015)

    Version 0.2.15 27-Oct-2015


    • Support cmake on x86/x86-64, including native compilation with MS Visual Studio. (Experimental. Thanks to Hank Anderson for the initial cmake porting work.)

        On Linux and Mac OSX, OpenBLAS cmake supports assembly kernels.
        e.g. cmake .
             make test (Optional)
        On Windows with MS Visual Studio, OpenBLAS cmake only supports C kernels.
        (OpenBLAS uses AT&T style assembly, which is not supported by MSVC.)
        e.g. cmake -G "Visual Studio 12 Win64" .
             Open OpenBLAS.sln and build.
    • Enable MAX_STACK_ALLOC flags by default. Improve ger and gemv for small matrices.

    • Improve parallel gemv for the case of small m and large n.

    • Improve ?imatcopy when lda==ldb (#633. Thanks, Martin Koehler)

    • Add vecLib benchmarks (#565. Thanks, Andreas Noack.)

    • Fix LAPACK lantr for row major matrices (#634. Thanks, Dan Kortschak)

    • Fix LAPACKE lansy (#640. Thanks, Dan Kortschak)

    • Import bug fixes for LAPACKE s/dormlq, c/zunmlq

    • Raise the signal when pthread_create fails (#668. Thanks, James K. Lowden)

    • Remove g77 from compiler list.

    • Enable AppVeyor Windows CI.


    • Support pure C generic kernels for x86/x86-64.
    • Support Intel Broadwell and Skylake by Haswell kernels.
    • Support AMD Excavator by Steamroller kernels.
    • Optimize s/d/c/zdot for Intel SandyBridge and Haswell.
    • Optimize s/d/c/zdot for AMD Piledriver and Steamroller.
    • Optimize s/d/c/zaxpy for Intel SandyBridge and Haswell.
    • Optimize s/d/c/zaxpy for AMD Piledriver and Steamroller.
    • Optimize d/c/zscal for Intel Haswell, dscal for Intel SandyBridge.
    • Optimize d/c/zscal for AMD Bulldozer, Piledriver and Steamroller.
    • Optimize s/dger for Intel SandyBridge.
    • Optimize s/dsymv for Intel SandyBridge.
    • Optimize ssymv for Intel Haswell.
    • Optimize dgemv for Intel Nehalem and Haswell.
    • Optimize dtrmm for Intel Haswell.


    • Support Android NDK armeabi-v7a-hard ABI (-mfloat-abi=hard)

        e.g. make HOSTCC=gcc CC=arm-linux-androideabi-gcc NO_LAPACK=1 TARGET=ARMV7
    • Fix lock, rpcc bugs (#616, #617. Thanks, Grazvydas Ignotas)


    • Support ppc64le platform (ELF ABI v2. #612. Thanks, Matthew Brandyberry.)
    • Support POWER7/8 by POWER6 kernels. (#612. Thanks, Fábio Perez.)

  • v0.2.14(Mar 24, 2015)

    Version 0.2.14 24-Mar-2015


    • Improve OpenBLASConfig.cmake. (#474, #475. Thanks, xantares.)
    • Improve ger and gemv for small matrices by stack allocation. e.g. make -DMAX_STACK_ALLOC=2048 (#482. Thanks, Jerome Robert.)
    • Introduce openblas_get_num_threads and openblas_get_num_procs. (#497. Thanks, Erik Schnetter.)
    • Add ATLAS-style ?geadd function. (#509. Thanks, Martin Köhler.)
    • Fix c/zsyr bug with negative incx. (#492.)
    • Fix race condition during shutdown causing a crash in gotoblas_set_affinity(). (#508. Thanks, Ton van den Heuvel.)
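
To illustrate what the ATLAS-style ?geadd entry in the list above refers to: such a routine is generally understood to compute the scaled matrix sum C := alpha*A + beta*C. The sketch below is a plain Python reference of that arithmetic only. The name `geadd_reference` is hypothetical and not part of OpenBLAS, and unlike the library routine, which updates C in place, it returns a new matrix.

```python
def geadd_reference(alpha, a, beta, c):
    """Return alpha*A + beta*C for two equally shaped matrices.

    Hypothetical helper for illustration only; it is not an OpenBLAS API.
    Matrices are given as lists of rows.
    """
    # Both operands must have the same number of rows and columns.
    if len(a) != len(c) or any(len(ra) != len(rc) for ra, rc in zip(a, c)):
        raise ValueError("A and C must have the same shape")
    # Element-wise scaled sum: out[i][j] = alpha*a[i][j] + beta*c[i][j].
    return [
        [alpha * x + beta * y for x, y in zip(row_a, row_c)]
        for row_a, row_c in zip(a, c)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
c = [[10.0, 20.0], [30.0, 40.0]]
result = geadd_reference(2.0, a, 0.5, c)
print(result)  # [[7.0, 14.0], [21.0, 28.0]]
```

With beta = 0 the result is a scaled copy of A, and with alpha = 0 it is a scaled C, which is why this form is convenient as a general matrix add.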


    • Support AMD Steamroller.


    • Add Cortex-A9 and Cortex-A15 targets.
  • v0.2.13(Dec 3, 2014)

    Version 0.2.13 3-Dec-2014


    • Add SYMBOLPREFIX and SYMBOLSUFFIX makefile options for adding a prefix or suffix to all exported symbol names in the shared library. (#459, Thanks Tony Kelman)
    • Provide OpenBLASConfig.cmake at installation.
    • Fix Fortran compiler detection on FreeBSD. (#470, Thanks Mike Nolta)


    • Add generic kernel files for x86-64. make TARGET=GENERIC
    • Fix a bug of sgemm kernel on Intel Sandy Bridge.
    • Fix c_check bug on some amd64 systems. (#471, Thanks Mike Nolta)


    • Support APM's X-Gene 1 AArch64 processors. Optimize trmm and sgemm. (#465, Thanks Dave Nuechterlein)