BLAS-like Library Instantiation Software Framework

Overview

Introduction

BLIS is a portable software framework for instantiating high-performance BLAS-like dense linear algebra libraries. The framework was designed to isolate essential kernels of computation that, when optimized, immediately enable optimized implementations of most of its commonly used and computationally intensive operations. BLIS is written in ISO C99 and available under a new/modified/3-clause BSD license. While BLIS exports a new BLAS-like API, it also includes a BLAS compatibility layer which gives application developers access to BLIS implementations via traditional BLAS routine calls. An object-based API unique to BLIS is also available.

For a thorough presentation of our framework, please read our ACM Transactions on Mathematical Software (TOMS) journal article, "BLIS: A Framework for Rapidly Instantiating BLAS Functionality". For those who just want an executive summary, please see the Key Features section below.

In a follow-up article (also in ACM TOMS), "The BLIS Framework: Experiments in Portability", we investigate using BLIS to instantiate level-3 BLAS implementations on a variety of general-purpose, low-power, and multicore architectures.

An IPDPS'14 conference paper titled "Anatomy of High-Performance Many-Threaded Matrix Multiplication" systematically explores the opportunities for parallelism within the five loops that BLIS exposes in its matrix multiplication algorithm.

For other papers related to BLIS, please see the Citations section below.

It is our belief that BLIS offers substantial benefits in productivity when compared to conventional approaches to developing BLAS libraries, as well as a much-needed refinement of the BLAS interface, and thus constitutes a major advance in dense linear algebra computation. While BLIS remains a work-in-progress, we are excited to continue its development and further cultivate its use within the community.

The BLIS framework is primarily developed and maintained by individuals in the Science of High-Performance Computing (SHPC) group in the Oden Institute for Computational Engineering and Sciences at The University of Texas at Austin. Please visit the SHPC website for more information about our research group, such as a list of people and collaborators, funding sources, publications, and other educational projects (such as MOOCs).

Education and Learning

Want to understand what's under the hood? Many of the same concepts and principles employed when developing BLIS are introduced and taught in a basic pedagogical setting as part of LAFF-On Programming for High Performance (LAFF-On-PfHP), one of several massive open online courses (MOOCs) in the Linear Algebra: Foundations to Frontiers series, all of which are available for free via the edX platform.

What's New

  • Addons feature now available! Have you ever wanted to quickly extend BLIS's operation support or define new custom BLIS APIs for your application, but were unsure of how to add your source code to BLIS? Do you want to isolate your custom code so that it only gets enabled when the user requests it? Do you like sandboxes, but wish you didn't have to provide an implementation of gemm? If so, you should check out our new addons feature. Addons act like optional extensions that can be created, enabled, and combined to suit your application's needs, all without formally integrating your code into the core BLIS framework.

  • Multithreaded small/skinny matrix support for sgemm now available! Thanks to funding and hardware support from Oracle, we have now accelerated gemm for single-precision real matrix problems where one or two dimensions are exceedingly small. This work is similar to the gemm optimization announced last year. For now, we have only gathered performance results on an AMD Epyc Zen2 system, but we hope to publish additional graphs for other architectures in the future. You may find these Zen2 graphs via the PerformanceSmall document.

  • BLIS awarded SIAM Activity Group on Supercomputing Best Paper Prize for 2020! We are thrilled to announce that the paper that we internally refer to as the second BLIS paper,

    "The BLIS Framework: Experiments in Portability." Field G. Van Zee, Tyler Smith, Bryan Marker, Tze Meng Low, Robert A. van de Geijn, Francisco Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael Kistler, Vernon Austel, John A. Gunnels, Lee Killough. ACM Transactions on Mathematical Software (TOMS), 42(2):12:1--12:19, 2016.

    was selected for the SIAM Activity Group on Supercomputing Best Paper Prize for 2020. The prize is awarded once every two years to a paper judged to be the most outstanding paper in the field of parallel scientific and engineering computing, and has only been awarded once before (in 2016) since its inception in 2015 (the committee did not award the prize in 2018). The prize was awarded at the 2020 SIAM Conference on Parallel Processing for Scientific Computing in Seattle. Robert was present at the conference to give a talk on BLIS and accept the prize alongside other coauthors. The selection committee sought to recognize the paper, "which validates BLIS, a framework relying on the notion of microkernels that enables both productivity and high performance." Their statement continues, "The framework will continue having an important influence on the design and the instantiation of dense linear algebra libraries."

  • Multithreaded small/skinny matrix support for dgemm now available! Thanks to contributions made possible by our partnership with AMD, we have dramatically accelerated gemm for double-precision real matrix problems where one or two dimensions are exceedingly small. A natural byproduct of this optimization is that the traditional case of small m = n = k (i.e. square matrices) is also accelerated, even though it was not targeted specifically. And though only dgemm was optimized for now, support for other datatypes and/or other operations may be implemented in the future. We've also added new graphs to the PerformanceSmall document to showcase multithreaded performance when one or more matrix dimensions are small.

  • Performance comparisons now available! We recently measured the performance of various level-3 operations on a variety of hardware architectures, as implemented within BLIS and other BLAS libraries for all four of the standard floating-point datatypes. The results speak for themselves! Check out our extensive performance graphs and background info in our new Performance document.

  • BLIS is now in Debian Unstable! Thanks to Debian developer-maintainers M. Zhou and Nico Schlömer for sponsoring our package in Debian. Their participation, contributions, and advocacy were key to getting BLIS into the second-most popular Linux distribution (behind Ubuntu, which Debian packages feed into). The Debian tracker page may be found here.

  • BLIS now supports mixed-datatype gemm! The gemm operation may now be executed on operands of mixed domains and/or mixed precisions. Any combination of storage datatype for A, B, and C is now supported, along with a separate computation precision that can differ from the storage precision of A and B. And even the 1m method now supports mixed-precision computation. For more details, please see our ACM TOMS journal article submission (current draft).

  • BLIS now implements the 1m method. Let's face it: writing complex assembly gemm microkernels for a new architecture is never a priority--and now, it almost never needs to be. The 1m method leverages existing real domain gemm microkernels to implement all complex domain level-3 operations. For more details, please see our ACM TOMS journal article submission (current draft).

What People Are Saying About BLIS

"I noticed a substantial increase in multithreaded performance on my own machine, which was extremely satisfying." ... "[I was] happy it worked so well!" (Justin Shea)

"This is an awesome library." ... "I want to thank you and the blis team for your efforts." (@Lephar)

"Any time somebody outside Intel beats MKL by a nontrivial amount, I report it to the MKL team. It is fantastic for any open-source project to get within 10% of MKL... [T]his is why Intel funds BLIS development." (@jeffhammond)

"So BLIS is now a part of Elk." ... "We have found that zgemm applied to a 15000x15000 matrix with multi-threaded BLIS on a 32-core Ryzen 2990WX processor is about twice as fast as MKL" ... "I'm starting to like this a lot." (@jdk2016)

"I [found] BLIS because I was looking for BLAS operations on C-ordered arrays for NumPy. BLIS has that, but even better is the fact that it's developed in the open using a more modern language than Fortran." (@nschloe)

"The specific reason to have BLIS included [in Linux distributions] is the KNL and SKX [AVX-512] BLAS support, which OpenBLAS doesn't have." (@loveshack)

"All tests pass without errors on OpenBSD. Thanks!" (@ararslan)

"Thank you very much for your great help!... Looking forward to benchmarking." (@mrader1248)

"Thanks for the beautiful work." (@mmrmo)

"[M]y software currently uses BLIS for its BLAS interface..." (@ShadenSmith)

"[T]hanks so much for your work on this! Excited to test." ... "[On AMD Excavator], BLIS is competitive to / slightly faster than OpenBLAS for dgemms in my tests." (@iotamudelta)

"BLIS provided the only viable option on KNL, whose ecosystem is at present dominated by blackbox toolchains. Thanks again. Keep on this great work." (@heroxbd)

"I want to definitely try this out..." (@ViralBShah)

Key Features

BLIS offers several advantages over traditional BLAS libraries:

  • Portability that doesn't impede high performance. Portability was a top priority of ours when creating BLIS. With virtually no additional effort on the part of the developer, BLIS is configurable as a fully-functional reference implementation. But more importantly, the framework identifies and isolates a key set of computational kernels which, when optimized, immediately and automatically optimize performance across virtually all level-2 and level-3 BLIS operations. In this way, the framework acts as a productivity multiplier. And since the optimized (non-portable) code is compartmentalized within these few kernels, instantiating a high-performance BLIS library on a new architecture is a relatively straightforward endeavor.

  • Generalized matrix storage. The BLIS framework exports interfaces that allow one to specify both the row stride and column stride of a matrix. This allows one to compute with matrices stored in column-major order, row-major order, or by general stride. (This latter storage format is important for those seeking to implement tensor contractions on multidimensional arrays.) Furthermore, since BLIS tracks stride information for each matrix, operands of different storage formats can be used within the same operation invocation. By contrast, BLAS requires column-major storage. And while the CBLAS interface supports row-major storage, it does not allow mixing storage formats.

  • Rich support for the complex domain. BLIS operations are developed and expressed in their most general form, which is typically in the complex domain. These formulations then simplify elegantly down to the real domain, with conjugations becoming no-ops. Unlike the BLAS, all input operands in BLIS that allow transposition and conjugate-transposition also support conjugation (without transposition), which obviates the need for thread-unsafe workarounds. Also, where applicable, both complex symmetric and complex Hermitian forms are supported. (BLAS omits some complex symmetric operations, such as symv, syr, and syr2.) Another great example of BLIS serving as a portability lever is its implementation of the 1m method for complex matrix multiplication, a novel mechanism of providing high-performance complex level-3 operations using only real domain microkernels. This innovation guarantees automatic level-3 support in the complex domain even when the kernel developers entirely forgo writing complex kernels.

  • Advanced multithreading support. BLIS allows multiple levels of symmetric multithreading for nearly all level-3 operations. (Currently, users may choose to obtain parallelism via either OpenMP or POSIX threads). This means that matrices may be partitioned in multiple dimensions simultaneously to attain scalable, high-performance parallelism on multicore and many-core architectures. The key to this innovation is a thread-specific control tree infrastructure which encodes information about the logical thread topology and allows threads to query and communicate data amongst one another. BLIS also employs so-called "quadratic partitioning" when computing dimension sub-ranges for each thread, so that arbitrary diagonal offsets of structured matrices with unreferenced regions are taken into account to achieve proper load balance. More recently, BLIS introduced a runtime abstraction to specify parallelism on a per-call basis, which is useful for applications that want to handle most of the parallelism.

  • Ease of use. The BLIS framework, and the library of routines it generates, are easy to use for end users, experts, and vendors alike. An optional BLAS compatibility layer provides application developers with backwards compatibility to existing BLAS-dependent codes. Or, one may adjust or write their application to take advantage of new BLIS functionality (such as generalized storage formats or additional complex operations) by calling one of BLIS's native APIs directly. BLIS's typed API will feel familiar to many veterans of BLAS since these interfaces use BLAS-like calling sequences. And many will find BLIS's object-based APIs a delight to use when customizing or writing their own BLIS operations. (Objects are relatively lightweight structs and passed by address, which helps tame function calling overhead.)

  • Multilayered API and exposed kernels. The BLIS framework exposes its implementations in various layers, allowing expert developers to access exactly the functionality desired. This layered interface includes that of the lowest-level kernels, for those who wish to bypass the bulk of the framework. Optimizations can occur at various levels, in part thanks to exposed packing and unpacking facilities, which by default are highly parameterized and flexible.

  • Functionality that grows with the community's needs. As its name suggests, the BLIS framework is not a single library or static API, but rather a nearly-complete template for instantiating high-performance BLAS-like libraries. Furthermore, the framework is extensible, allowing developers to leverage existing components to support new operations as they are identified. If such operations require new kernels for optimal efficiency, the framework and its APIs will be adjusted and extended accordingly. Community developers who wish to experiment with creating new operations or APIs in BLIS can quickly and easily do so via the Addons feature.

  • Code re-use. Auto-generation approaches to achieving the aforementioned goals tend to quickly lead to code bloat due to the multiple dimensions of variation supported: operation (e.g. gemm, herk, trmm, etc.); parameter case (i.e. side, [conjugate-]transposition, upper/lower storage, unit/non-unit diagonal); datatype (i.e. single-/double-precision real/complex); matrix storage (i.e. row-major, column-major, generalized); and algorithm (i.e. partitioning path and kernel shape). These "brute force" approaches often consider and optimize each operation or case combination in isolation, which is less than ideal when the goal is to provide entire libraries. BLIS was designed to be a complete framework for implementing basic linear algebra operations, but supporting this vast amount of functionality in a manageable way required a holistic design that employed careful abstractions, layering, and recycling of generic (highly parameterized) codes, subject to the constraint that high performance remain attainable.

  • A foundation for mixed domain and/or mixed precision operations. BLIS was designed with the hope of one day allowing computation on real and complex operands within the same operation. Similarly, we wanted to allow mixing operands' numerical domains, floating-point precisions, or both domain and precision, and to optionally compute in a precision different from one or both operands' storage precisions. This feature has been implemented for the general matrix multiplication (gemm) operation, providing 128 different possible type combinations, which, when combined with existing transposition, conjugation, and storage parameters, enables 55,296 different gemm use cases. For more details, please see the documentation on mixed datatype support and/or our ACM TOMS journal paper on mixed-domain/mixed-precision gemm (linked below).

How to Download BLIS

There are a few ways to download BLIS. We list the four most common ways below. We highly recommend using either Option 1 or 2. Otherwise, we recommend Option 3 (over Option 4) so your compiler can perform optimizations specific to your hardware.

  1. Download a source repository with git clone. Generally speaking, we prefer using git clone to clone a git repository. Having a repository allows the user to periodically pull in the latest changes and quickly rebuild BLIS whenever they wish. Also, implicit in cloning a repository is that the repository defaults to using the master branch, which contains the latest "stable" commits since the most recent release. (This is in contrast to Option 3 in which the user is opting for code that may be slightly out of date.)

    In order to clone a git repository of BLIS, please obtain a repository URL by clicking on the green button above the file/directory listing near the top of this page (as rendered by GitHub). Generally speaking, it will amount to executing the following command in your terminal shell:

    git clone https://github.com/flame/blis.git
    
  2. Download a source repository via a zip file. If you are uncomfortable with using git but would still like the latest stable commits, we recommend that you download BLIS as a zip file.

    In order to download a zip file of the BLIS source distribution, please click on the green button above the file listing near the top of this page. This should reveal a link for downloading the zip file.

  3. Download a source release via a tarball/zip file. Alternatively, if you would like to stick to the code that is included in official releases, you may download either a tarball or zip file of any of BLIS's previous tagged releases. We consider this option to be less than ideal for most people since it will likely mean you miss out on the latest bugfix or feature commits (in contrast to Options 1 or 2), and you also will not be able to update your code with a simple git pull command (in contrast to Option 1).

  4. Download a binary package specific to your OS. While we don't recommend this as the first choice for most users, we provide links to community members who generously maintain BLIS packages for various Linux distributions such as Debian Unstable and EPEL/Fedora. Please see the External Packages section below for more information.

Getting Started

NOTE: This section assumes you've either cloned a BLIS source code repository via git, downloaded the latest source code via a zip file, or downloaded the source code for a tagged version release---Options 1, 2, or 3, respectively, as discussed in the previous section.

If you just want to build a sequential (not parallelized) version of BLIS in a hurry and come back and explore other topics later, you can configure and build BLIS as follows:

$ ./configure auto
$ make [-j]

You can then verify your build by running BLAS- and BLIS-specific test drivers via make check:

$ make check [-j]

And if you would like to install BLIS to the directory specified to configure via the --prefix option, run the install target:

$ make install

Please read the output of ./configure --help for a full list of configure-time options. If/when you have time, we strongly encourage you to read the detailed walkthrough of the build system found in our Build System guide.

Example Code

The BLIS source distribution provides example code in the examples directory. Example code focuses on using BLIS APIs (not BLAS or CBLAS), and resides in two subdirectories: examples/oapi (which demonstrates the object API) and examples/tapi (which demonstrates the typed API).

Each directory contains several files, each containing various pieces of code that exercise core functionality of the BLIS API in question (object or typed). These example files should be thought of collectively as a tutorial, and therefore it is recommended to start from the beginning (the file whose name starts with 00).

You can build all of the examples by simply running make from either example subdirectory (examples/oapi or examples/tapi). (You can also run make clean.) The local Makefile assumes that you've already configured and built (but not necessarily installed) BLIS two directories up (that is, in ../..). If you have already installed BLIS to some permanent directory, you may refer to that installation by setting the environment variable BLIS_INSTALL_PATH prior to running make:

export BLIS_INSTALL_PATH=/usr/local; make

or by setting the same variable as part of the make command:

make BLIS_INSTALL_PATH=/usr/local

Once the executable files have been built, we recommend reading the code and the corresponding executable output side by side. This will help you see the effects of each section of code.

This tutorial is not exhaustive or complete; several object API functions were omitted (mostly for brevity's sake) and thus more examples could be written.

Documentation

We provide extensive documentation on the BLIS build system, APIs, test infrastructure, and other important topics. All documentation is formatted in markdown and included in the BLIS source distribution (usually in the docs directory). Slightly longer descriptions of each document may be found in the project's wiki section.

Documents for everyone:

  • Build System. This document covers the basics of configuring and building BLIS libraries, as well as related topics.

  • Testsuite. This document describes how to run BLIS's highly parameterized and configurable test suite, as well as the included BLAS test drivers.

  • BLIS Typed API Reference. Here we document the so-called "typed" (or BLAS-like) API. This is the API that many users who are already familiar with the BLAS will likely want to use.

  • BLIS Object API Reference. Here we document the object API. This API abstracts away properties of vectors and matrices within obj_t structs that can be queried with accessor functions. Many developers and experts prefer this API over the typed API.

  • Hardware Support. This document maintains a table of supported microarchitectures.

  • Multithreading. This document describes how to use the multithreading features of BLIS.

  • Mixed-Datatypes. This document provides an overview of BLIS's mixed-datatype functionality and provides a brief example of how to take advantage of this new code.

  • Performance. This document reports empirically measured performance of a representative set of level-3 operations on a variety of hardware architectures, as implemented within BLIS and other BLAS libraries for all four of the standard floating-point datatypes.

  • PerformanceSmall. This document reports empirically measured performance of gemm on select hardware architectures within BLIS and other BLAS libraries when performing matrix problems where one or two dimensions are exceedingly small.

  • Release Notes. This document tracks a summary of changes included with each new version of BLIS, along with contributor credits for key features.

  • Frequently Asked Questions. If you have general questions about BLIS, please read this FAQ. If you can't find the answer to your question, please feel free to join the blis-devel mailing list and post a question. We also have a blis-discuss mailing list that anyone can post to (even without joining).

Documents for GitHub contributors:

  • Contributing bug reports, feature requests, PRs, etc. Interested in contributing to BLIS? Please read this document before getting started. It provides a general overview of how best to report bugs, propose new features, and offer code patches.

  • Coding Conventions. If you are interested in or planning on contributing code to BLIS, please read this document so that you can format your code in accordance with BLIS's standards.

Documents for BLIS developers:

  • Kernels Guide. If you would like to learn more about the types of kernels that BLIS exposes, their semantics, the operations that each kernel accelerates, and various implementation issues, please read this guide.

  • Configuration Guide. If you would like to learn how to add new sub-configurations or configuration families, or are simply interested in learning how BLIS organizes its configurations and kernel sets, please read this thorough walkthrough of the configuration system.

  • Addon Guide. If you are interested in learning about using BLIS addons--that is, enabling existing (or creating new) bundles of operation or API code that are built into a BLIS library--please read this document.

  • Sandbox Guide. If you are interested in learning about using sandboxes in BLIS--that is, providing alternative implementations of the gemm operation--please read this document.

Performance

We provide graphs that report performance of several implementations across a range of hardware types, multithreading configurations, problem sizes, operations, and datatypes. These pages also document most of the details needed to reproduce these experiments.

  • Performance. This document reports empirically measured performance of a representative set of level-3 operations on a variety of hardware architectures, as implemented within BLIS and other BLAS libraries for all four of the standard floating-point datatypes.

  • PerformanceSmall. This document reports empirically measured performance of gemm on select hardware architectures within BLIS and other BLAS libraries when performing matrix problems where one or two dimensions are exceedingly small.

External Packages

Generally speaking, we highly recommend building from source whenever possible using the latest git clone. (Tarballs of each tagged release are also available, but we consider them to be less ideal since they are not as easy to upgrade as git clones.)

That said, some users may prefer binary and/or source packages through their Linux distribution. Thanks to generous involvement/contributions from our community members, the following BLIS packages are now available:

  • Debian. M. Zhou has volunteered to sponsor and maintain BLIS packages within the Debian Linux distribution. The Debian package tracker can be found here. (Also, thanks to Nico Schlömer for previously volunteering his time to set up a standalone PPA.)

  • Gentoo. M. Zhou also maintains the BLIS package entry for Gentoo, a Linux distribution known for its source-based portage package manager and distribution system.

  • EPEL/Fedora. There are official BLIS packages in Fedora and EPEL (for RHEL7+ and compatible distributions) with versions for 64-bit integers, OpenMP, and pthreads, and shims which can be dynamically linked instead of reference BLAS. (NOTE: For architectures other than intel64, amd64, and maybe arm64, the performance of packaged BLIS will be low because it uses unoptimized generic kernels; for those architectures, OpenBLAS may be a better solution.) Dave Love provides additional packages for EPEL6 in a Fedora Copr, and possibly versions more recent than the official repo for other EPEL/Fedora releases. The source packages may build on other rpm-based distributions.

  • OpenSuSE. The copr referred to above has rpms for some OpenSuSE releases; the source rpms may build for others.

  • GNU Guix. Guix has BLIS packages, but provides builds only for the generic target and some specific x86_64 micro-architectures.

  • Conda. The conda-forge channel has Linux, OSX, and Windows binary packages for x86_64.

Discussion

You can keep in touch with developers and other users of the project by joining one of the following mailing lists:

  • blis-devel: Please join and post to this mailing list if you are a BLIS developer, or if you are trying to use BLIS beyond simply linking to it as a BLAS library. Note: Most of the interesting discussions happen here; don't be afraid to join! If you would like to submit a bug report, or discuss a possible bug, please consider opening a new issue on GitHub.

  • blis-discuss: Please join and post to this mailing list if you have general questions or feedback regarding BLIS. Application developers (end users) may wish to post here, unless they have bug reports, in which case they should open a new issue on GitHub.

Contributing

For information on how to contribute to our project, including preferred coding conventions, please refer to the CONTRIBUTING file at the top level of the BLIS source distribution.

Citations

For those of you looking for the appropriate article to cite regarding BLIS, we recommend citing our first ACM TOMS journal paper (unofficial backup link):

@article{BLIS1,
   author      = {Field G. {V}an~{Z}ee and Robert A. {v}an~{d}e~{G}eijn},
   title       = {{BLIS}: A Framework for Rapidly Instantiating {BLAS} Functionality},
   journal     = {ACM Transactions on Mathematical Software},
   volume      = {41},
   number      = {3},
   pages       = {14:1--14:33},
   month       = {June},
   year        = {2015},
   issue_date  = {June 2015},
   url         = {https://doi.acm.org/10.1145/2764454},
}

You may also cite the second ACM TOMS journal paper (unofficial backup link):

@article{BLIS2,
   author      = {Field G. {V}an~{Z}ee and Tyler Smith and Francisco D. Igual and
                  Mikhail Smelyanskiy and Xianyi Zhang and Michael Kistler and Vernon Austel and
                  John Gunnels and Tze Meng Low and Bryan Marker and Lee Killough and
                  Robert A. {v}an~{d}e~{G}eijn},
   title       = {The {BLIS} Framework: Experiments in Portability},
   journal     = {ACM Transactions on Mathematical Software},
   volume      = {42},
   number      = {2},
   pages       = {12:1--12:19},
   month       = {June},
   year        = {2016},
   issue_date  = {June 2016},
   url         = {https://doi.acm.org/10.1145/2755561},
}

We also have a third paper, submitted to IPDPS 2014, on achieving multithreaded parallelism in BLIS (unofficial backup link):

@inproceedings{BLIS3,
   author      = {Tyler M. Smith and Robert A. {v}an~{d}e~{G}eijn and Mikhail Smelyanskiy and
                  Jeff R. Hammond and Field G. {V}an~{Z}ee},
   title       = {Anatomy of High-Performance Many-Threaded Matrix Multiplication},
   booktitle   = {28th IEEE International Parallel \& Distributed Processing Symposium
                  (IPDPS 2014)},
   year        = {2014},
   url         = {https://doi.org/10.1109/IPDPS.2014.110},
}

A fourth paper, submitted to ACM TOMS, also exists, which proposes an analytical model for determining blocksize parameters in BLIS (unofficial backup link):

@article{BLIS4,
   author      = {Tze Meng Low and Francisco D. Igual and Tyler M. Smith and
                  Enrique S. Quintana-Ort\'{\i}},
   title       = {Analytical Modeling Is Enough for High-Performance {BLIS}},
   journal     = {ACM Transactions on Mathematical Software},
   volume      = {43},
   number      = {2},
   pages       = {12:1--12:18},
   month       = {August},
   year        = {2016},
   issue_date  = {August 2016},
   url         = {https://doi.acm.org/10.1145/2925987},
}

A fifth paper, submitted to ACM TOMS, begins the study of so-called induced methods for complex matrix multiplication (unofficial backup link):

@article{BLIS5,
   author      = {Field G. {V}an~{Z}ee and Tyler Smith},
   title       = {Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods},
   journal     = {ACM Transactions on Mathematical Software},
   volume      = {44},
   number      = {1},
   pages       = {7:1--7:36},
   month       = {July},
   year        = {2017},
   issue_date  = {July 2017},
   url         = {https://doi.acm.org/10.1145/3086466},
}

A sixth paper, submitted to ACM TOMS, revisits the topic of the previous article and derives a superior induced method (unofficial backup link):

@article{BLIS6,
   author      = {Field G. {V}an~{Z}ee},
   title       = {Implementing High-Performance Complex Matrix Multiplication via the 1m Method},
   journal     = {SIAM Journal on Scientific Computing},
   volume      = {42},
   number      = {5},
   pages       = {C221--C244},
   month       = {September},
   year        = {2020},
   issue_date  = {September 2020},
   url         = {https://doi.org/10.1137/19M1282040}
}

A seventh paper, submitted to ACM TOMS, explores the implementation of gemm for mixed-domain and/or mixed-precision operands (unofficial backup link):

@article{BLIS7,
   author      = {Field G. {V}an~{Z}ee and Devangi N. Parikh and Robert A. {v}an~{d}e~{G}eijn},
   title       = {Supporting Mixed-domain Mixed-precision Matrix Multiplication
                  within the {BLIS} Framework},
   journal     = {ACM Transactions on Mathematical Software},
   volume      = {47},
   number      = {2},
   pages       = {12:1--12:26},
   month       = {April},
   year        = {2021},
   issue_date  = {April 2021},
   url         = {https://doi.org/10.1145/3402225},
}

Funding

This project and its associated research were partially sponsored by grants from Microsoft, Intel, Texas Instruments, AMD, HPE, Oracle, Huawei, Facebook, and ARM, as well as grants from the National Science Foundation (Awards CCF-0917167, ACI-1148125/1340293, CCF-1320112, and ACI-1550493).

Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

Comments
  • Failures with AVX512 in numpy test suite when using blis on windows (through conda-forge)

    Conda-forge is the community driven channel for the packaging & environment tool conda, which is used heavily in many scientific & corporate environments, especially (but not only) in terms of python.

    conda-forge implements a blas meta-package that allows selecting between different blas/lapack implementations. Currently available are: netlib, mkl, openblas & blis (where blis uses the netlib binaries for lapack).

    For the last couple of releases, I've been running the numpy (& scipy) test-suites against all those blas variants to root out any potential errors. Blis on windows (at least as packaged through conda-forge) was always a problem child, but has gotten a bit better recently. Still, I keep getting test failures in both the numpy & scipy test suites, depending on the absence/presence of AVX512.

    Those errors look pretty scary numerically - e.g. all NaNs (where non-NaN would be expected), or a matrix that's completely zero instead of the identity - more details below.

    For some example CI runs see here and here (coming from https://github.com/conda-forge/numpy-feedstock/pull/227, but also occurring in https://github.com/conda-forge/numpy-feedstock/pull/237).

    Reproducer (on a windows system with AVX512):

    # install miniconda, e.g. from https://github.com/conda-forge/miniforge
    conda config --add channels conda-forge
    # set up environment
    conda create -n test_env numpy "blas =*=blis" pytest hypothesis setuptools
    # activate environment
    conda activate test_env
    # confirm that instruction sets '... AVX512F* AVX512CD* AVX512_SKX* ...' are detected
    python -c "import numpy; numpy._pytesttester._show_numpy_info()"
    # run (relevant subset of) the test suite
    pytest --pyargs numpy.linalg.tests.test_linalg -v
    
    Short log of failures
    =========================== short test summary info ===========================
    FAILED core/tests/test_multiarray.py::TestMatmul::test_dot_equivalent[args4]
    FAILED linalg/tests/test_linalg.py::TestSolve::test_sq_cases - AssertionError...
    FAILED linalg/tests/test_linalg.py::TestSolve::test_generalized_sq_cases - As...
    FAILED linalg/tests/test_linalg.py::TestInv::test_sq_cases - AssertionError: ...
    FAILED linalg/tests/test_linalg.py::TestInv::test_generalized_sq_cases - Asse...
    FAILED linalg/tests/test_linalg.py::TestPinv::test_generalized_sq_cases - Ass...
    FAILED linalg/tests/test_linalg.py::TestPinv::test_generalized_nonsq_cases - ...
    FAILED linalg/tests/test_linalg.py::TestDet::test_sq_cases - AssertionError: ...
    FAILED linalg/tests/test_linalg.py::TestDet::test_generalized_sq_cases - Asse...
    FAILED linalg/tests/test_linalg.py::TestMatrixPower::test_power_is_minus_one[dt13]
    FAILED linalg/tests/test_linalg.py::TestCholesky::test_basic_property - Asser...
    = 11 failed, 13581 passed, 714 skipped, 1 deselected, 20 xfailed, 1 xpassed, 229 warnings in 447.85s (0:07:27) =
    
    Failure of TestMatmul.test_dot_equivalent[args4]
    ____________________ TestMatmul.test_dot_equivalent[args4] ____________________
    
    self = <numpy.core.tests.test_multiarray.TestMatmul object at 0x000002A1BE3DFFD0>
    args = (array([[ 0.,  1.,  2.],
           [ 3.,  4.,  5.],
           [ 6.,  7.,  8.],
           [ 9., 10., 11.],
           [12., 13., 14.]]), array([[ 0.,  3.,  6.,  9., 12.],
           [ 1.,  4.,  7., 10., 13.],
           [ 2.,  5.,  8., 11., 14.]]))
    
        @pytest.mark.parametrize('args', (
                [...]
            ))
        def test_dot_equivalent(self, args):
            r1 = np.matmul(*args)
            r2 = np.dot(*args)
    >       assert_equal(r1, r2)
    E       AssertionError:
    E       Arrays are not equal
    E
    E       x and y nan location mismatch:
    E        x: array([[nan, nan, nan, nan, nan],
    E              [nan, nan, nan, nan, nan],
    E              [nan, nan, nan, nan, nan],...
    E        y: array([[  5.,  14.,  23.,  32.,  41.],
    E              [ 14.,  50.,  86., 122., 158.],
    E              [ 23.,  86., 149., 212., 275.],...
    
    Failure of TestSolve.test_sq_cases
    ___________________________ TestSolve.test_sq_cases ___________________________
    
    actual = array([2.+1.j, 1.+2.j], dtype=complex64)
    desired = array([0.+0.j, 0.+0.j], dtype=complex64), decimal = 6, err_msg = ''
    verbose = True
    
    Failure of TestSolve.test_generalized_sq_cases
    _____________________ TestSolve.test_generalized_sq_cases _____________________
    
    actual = array([[ 2. +1.j,  1. +2.j],
           [14. +7.j,  7.+14.j],
           [12. +6.j,  6.+12.j]], dtype=complex64)
    desired = array([[0.+0.j, 0.+0.j],
           [0.+0.j, 0.+0.j],
           [0.+0.j, 0.+0.j]], dtype=complex64)
    decimal = 6, err_msg = '', verbose = True
    
    Failure of TestInv.test_sq_cases
    ____________________________ TestInv.test_sq_cases ____________________________
    
    actual = array([[0.+0.j, 0.+0.j],
           [0.+0.j, 0.+0.j]], dtype=complex64)
    desired = array([[1., 0.],
           [0., 1.]]), decimal = 6, err_msg = ''
    verbose = True
    
    Failure of TestInv.test_generalized_sq_cases
    ______________________ TestInv.test_generalized_sq_cases ______________________
    
    actual = array([[[0.+0.j, 0.+0.j],
            [0.+0.j, 0.+0.j]],
    
           [[0.+0.j, 0.+0.j],
            [0.+0.j, 0.+0.j]],
    
           [[0.+0.j, 0.+0.j],
            [0.+0.j, 0.+0.j]]], dtype=complex64)
    desired = array([[[1.+0.j, 0.+0.j],
            [0.+0.j, 1.+0.j]],
    
           [[1.+0.j, 0.+0.j],
            [0.+0.j, 1.+0.j]],
    
           [[1.+0.j, 0.+0.j],
            [0.+0.j, 1.+0.j]]], dtype=complex64)
    decimal = 6, err_msg = '', verbose = True
    
    Failure of TestPinv.test_generalized_sq_cases
    _____________________ TestPinv.test_generalized_sq_cases ______________________
    
    [...]
    
    self = <numpy.linalg.tests.test_linalg.TestPinv object at 0x000002A1C2382E50>
    a = array([[[0.19151945, 0.62210877, 0.43772774, 0.78535858, 0.77997581,
             0.27259261, 0.27646426, 0.80187218],
       ...],
            [2.3715724 , 2.9762444 , 2.87640529, 2.37589241, 0.85575288,
             1.87475012, 1.43428139, 0.58702554]]])
    b = array([[0.38231745, 0.05387369, 0.45164841, 0.98200474, 0.1239427 ,
            0.1193809 , 0.73852306, 0.58730363],
         ...543],
           [2.29390471, 0.32324211, 2.70989045, 5.89202845, 0.7436562 ,
            0.71628539, 4.43113834, 3.5238218 ]])
    tags = frozenset({'generalized', 'square', 'strided'})
    
        def do(self, a, b, tags):
            a_ginv = linalg.pinv(a)
            # `a @ a_ginv == I` does not hold if a is singular
            dot = dot_generalized
    >       assert_almost_equal(dot(dot(a, a_ginv), a), a, single_decimal=5, double_decimal=11)
    
    [...]
    
    actual = array([[[nan, nan, nan, nan, nan, nan, nan, nan],
            [nan, nan, nan, nan, nan, nan, nan, nan],
            [nan, nan,..., nan, nan, nan],
            [nan, nan, nan, nan, nan, nan, nan, nan],
            [nan, nan, nan, nan, nan, nan, nan, nan]]])
    desired = array([[[0.19151945, 0.62210877, 0.43772774, 0.78535858, 0.77997581,
             0.27259261, 0.27646426, 0.80187218],
       ...],
            [2.3715724 , 2.9762444 , 2.87640529, 2.37589241, 0.85575288,
             1.87475012, 1.43428139, 0.58702554]]])
    decimal = 11, err_msg = '', verbose = True
    
    Failure of TestPinv.test_generalized_nonsq_cases
    ____________________ TestPinv.test_generalized_nonsq_cases ____________________
    
    [...]
    
    self = <numpy.linalg.tests.test_linalg.TestPinv object at 0x000002A1C2395E80>
    a = array([[[0.22921857, 0.89996519, 0.41675354, 0.53585166, 0.00620852,
             0.30064171, 0.43689317, 0.612149  , 0.91...8, 2.32854217, 2.34743398,
             2.28481174, 2.74320934, 1.97586835, 1.70510274, 0.60526708,
             2.09488913]]])
    b = array([[0.95219541, 0.88996329, 0.99356736, 0.81870351, 0.54512217,
            0.45125405, 0.89055719, 0.97326479],
         ...354],
           [5.71317246, 5.33977972, 5.96140418, 4.91222106, 3.270733  ,
            2.70752433, 5.34334313, 5.83958875]])
    tags = frozenset({'generalized', 'nonsquare', 'strided'})
    
        def do(self, a, b, tags):
            a_ginv = linalg.pinv(a)
            # `a @ a_ginv == I` does not hold if a is singular
            dot = dot_generalized
    >       assert_almost_equal(dot(dot(a, a_ginv), a), a, single_decimal=5, double_decimal=11)
    
    [...]
    
    actual = array([[[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
            [nan, nan, nan, nan, nan, nan, nan, nan, nan,..., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
            [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]]])
    desired = array([[[0.22921857, 0.89996519, 0.41675354, 0.53585166, 0.00620852,
             0.30064171, 0.43689317, 0.612149  , 0.91...8, 2.32854217, 2.34743398,
             2.28481174, 2.74320934, 1.97586835, 1.70510274, 0.60526708,
             2.09488913]]])
    decimal = 11, err_msg = '', verbose = True
    
    Failure of TestDet.test_sq_cases
    ____________________________ TestDet.test_sq_cases ____________________________
    
    actual = (6-17j), desired = (3.9968028886505635e-15-4.000000000000001j)
    decimal = 6, err_msg = '', verbose = True
    
    Failure of TestDet.test_generalized_sq_cases
    ______________________ TestDet.test_generalized_sq_cases ______________________
    
    actual = array([ 6. -17.j, 24. -68.j, 54.-153.j], dtype=complex64)
    desired = array([3.99680289e-15 -4.j, 1.59872116e-14-16.j, 2.13162821e-14-36.j])
    decimal = 6, err_msg = '', verbose = True
    
    Failure of TestMatrixPower.test_power_is_minus_one[dt13]
    ________________ TestMatrixPower.test_power_is_minus_one[dt13] ________________
    
    actual = array([[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
           [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
           [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
           [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]], dtype=complex64)
    desired = array([[1., 0., 0., 0.],
           [0., 1., 0., 0.],
           [0., 0., 1., 0.],
           [0., 0., 0., 1.]])
    decimal = 6, err_msg = '', verbose = True
    
    Failure of TestCholesky::test_basic_property
    ______________________ TestCholesky.test_basic_property _______________________
    
    self = <numpy.linalg.tests.test_linalg.TestCholesky object at 0x000002A1C0E0A940>
    
        def test_basic_property(self):
            # Check A = L L^H
            shapes = [(1, 1), (2, 2), (3, 3), (50, 50), (3, 10, 10)]
            dtypes = (np.float32, np.float64, np.complex64, np.complex128)
    
            for shape, dtype in itertools.product(shapes, dtypes):
                np.random.seed(1)
                a = np.random.randn(*shape)
                if np.issubdtype(dtype, np.complexfloating):
                    a = a + 1j*np.random.randn(*shape)
    
                t = list(range(len(shape)))
                t[-2:] = -1, -2
    
                a = np.matmul(a.transpose(t).conj(), a)
                a = np.asarray(a, dtype=dtype)
    
                c = np.linalg.cholesky(a)
    
                b = np.matmul(c, c.transpose(t).conj())
    >           assert_allclose(b, a,
                                err_msg=f'{shape} {dtype}\n{a}\n{c}',
                                atol=500 * a.shape[0] * np.finfo(dtype).eps)
    E           AssertionError:
    E           Not equal to tolerance rtol=1e-07, atol=0.000119209
    E           (2, 2) <class 'numpy.complex64'>
    E           [[ 6.7107615-6.181477e-17j -3.746924 -9.348988e-01j]
    E            [-3.746924 +9.348988e-01j  7.402024 -6.601837e-17j]]
    E           [[2.5905137+0.j 0.       +0.j]
    E            [0.       -0.j 2.7206662+0.j]]
    E           Mismatched elements: 2 / 4 (50%)
    E           Max absolute difference: 3.8617969
    E           Max relative difference: 1.
    E            x: array([[6.710761+0.j, 0.      +0.j],
    E                  [0.      +0.j, 7.402024+0.j]], dtype=complex64)
    E            y: array([[ 6.710762-6.181477e-17j, -3.746924-9.348988e-01j],
    E                  [-3.746924+9.348988e-01j,  7.402024-6.601837e-17j]],
    E                 dtype=complex64)
    
    bug 
    opened by h-vetinari 93
  • New GEMM Assembly & Configuration Set for Arm SVE

    NOTICE: Branch xrq-phys:armsve-cfg-venture has been rebased/reworked several times. Comments below may not match the code commits they originally referred to.

    I'm reopening #422 with a few updates:

    • A few improvements on the ARM SVE-512 kernel;
    • A new dgemm kernel specialized for A64fx chip (reason below);
    • A new configuration called a64fx;

    Reason for a different dgemm kernel for A64fx

    The dgemm_armsve512_asm kernel under kernels/armsve/3 is mainly composed of SVE indexed FMLA instructions (opcode=0x64). This is the same strategy as the dgemm_armsve256_asm kernel located in the same directory. It increases the interval between a vector's load and its use by FMLA. However, profiling that kernel gives the following (test size: GEMM (2000,1400,500)):

    [Figure: Bmk_DGEMM]

    The left part of the combo histogram shows that most of the time the processor is committing only 1 or 2 instructions, while A64fx has 2 FP pipelines and 2 integer pipelines, 4 in total. This drastically lowers the final GFlOps yielded (cf. the spread at the end). However, the FP stall rate and the memory/cache access wait are quite low, indicating no impediment to the FP pipelines.

    According to the A64fx uarch manual (https://github.com/fujitsu/A64FX), the FP pipeline in A64fx does not have an element duplicator for indexed SVE operations, so a single 0x64 FMLA is executed with both FP pipelines, each half-occupied. As a workaround, another kernel is created for A64fx with 0x64 FMLA replaced by 0x65 FMLA, and it does yield higher GFlOps:

    [Figure: Bmk_DGEMM 2]

    |                      | BLIS + 0x64 Kernel | BLIS + 0x65 Kernel | Fujitsu BLAS |
    | -------------------- | ------------------ | ------------------ | ------------ |
    | DGEMM(2000,1400,500) | 24 GFlOps          | 33 GFlOps          | 42 GFlOps    |

    (Table has typo corrected.)

    To reiterate, this pull request contains 2 components:

    • Arm SVE-512 dgemm kernels with one more specialized for A64fx;
    • Configurations for general Arm SVE and A64fx.

    It can be separated into 4 dependent but self-contained themes, so if you feel this pull request is too big, please feel free to close it and let me know. I'll relaunch with separated code changes.

    opened by xrq-phys 55
  • Tests fail on POWER machines

    • testsuite-run-fast fails on POWER9 and POWER10 with error message: 4440 Segmentation fault ./test_libblis.x -g ./testsuite/input.general.fast -o ./testsuite/input.operations.fast > output.testsuite
    • POWER7 dgemm microkernel includes altivec.h, which defines type bool. Because of that, edge-case handling macros here and here use a conflicting type bool from stdbool. A possible solution is to check for POWER architecture in edge_case_macros and define _bool as int for POWER and bool for other architectures.
    • POWER10 build fails because POWER10 microkernels use the old microkernel API without edge handling (addressed in #620). Also, type nibble is defined in the sandbox but used for i4 microkernel so any POWER10 user has to configure BLIS with -s power10. Is it intentional?
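
    The workaround proposed in the second bullet could look roughly like the sketch below. The names edge_bool_t and is_edge_case are hypothetical and are not taken from the BLIS sources; the predefined macros tested here are the ones GCC and Clang normally set on POWER targets, which is an assumption about the toolchain.

```c
#include <assert.h>

/* Hypothetical flag type for edge-case handling code. On POWER, altivec.h may
   bring in its own `bool`, so a plain int is used there instead of the one
   from stdbool.h. */
#if defined(__powerpc__) || defined(__powerpc64__) || defined(__PPC__) || defined(__PPC64__)
typedef int edge_bool_t;
#else
#include <stdbool.h>
typedef bool edge_bool_t;
#endif

/* Returns nonzero when the requested m x n tile is smaller than the full
   mr x nr microtile, i.e. when edge-case handling would be needed. */
edge_bool_t is_edge_case(int m, int n, int mr, int nr)
{
    assert(mr > 0 && nr > 0);
    edge_bool_t partial = (m < mr) || (n < nr);
    return partial;
}
```

    Calling is_edge_case(3, 4, 4, 4) reports an edge case, while is_edge_case(4, 4, 4, 4) does not; the result is the same whichever underlying type is selected.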

    POWER9 uses slow reference implementations for sgemm, cgemm, zgemm by default. Are there plans to support a fast sgemm microkernel for POWER9?

    opened by ivan23kor 48
  • Added support for AMD's Zen3 architecture

    Added support for Zen3 Configuration.

    -- Added new zen3 configuration and auto detection of zen3 architecture
    -- Added configuration family amd64 for all zen architectures
    -- Moved older AMD architecture in amd64_legacy family
    -- Updated zen2 makefiles to pick znver2 for clang (if supported by the detected clang version)


    AMD BLIS Upstream:

    This PR includes the following commits for AMD BLIS version 3.0.1:

    pick 9c7814da Added support for zen3 configuration
    squash 536edc40 Added support for zen3 configuration
    squash 23a2073c Added support for zen3 configuration
    squash 449ee370 Added support for zen3 configuration
    pick 25d23cdd Zen3 support, disabled IR, JR loop parallelization
    squash 9d7978ee Zen3 support, disabled IR, JR loop parallelization
    pick f8ab9f63 Enabled znver3 flag for zen3 architecture
    squash 38a8008c Enabled znver3 flag for zen3 architecture
    pick b1144a85 Added -fomit-frame-pointer option to CKOPTFLAGS.
    pick ce99b1ec Added dynamic block size selection logic for DGEMM.
    squash f9d06c74 Added dynamic block size selection logic for DGEMM.
    pick 14e21603 Update amd64 bundle configuration
    squash c2f63fcc Update amd64 bundle configuration

    opened by dzambare 47
  • knl build fails

    After configuring 0.3.0 for knl with GCC 7 on CentOS7, the build fails:

    $ make -k V=1
    gcc -O3 -mavx512f -mavx512pf -mfpmath=sse -march=knl -Wall -Wno-unused-function -Wfatal-errors -fPIC -std=c99 -D_POSIX_C_SOURCE=200112L -I./include/knl -I./frame/3/ -I./frame/ind/ukernels/ -I./frame/1m/ -I./frame/1f/ -I./frame/1/ -I./frame/include -DBLIS_VERSION_STRING=\"0.3.0\" -c kernels/knl/1m/bli_packm_knl_asm_24x8.c -o obj/knl/kernels/knl/1m/bli_packm_knl_asm_24x8.o
    kernels/knl/1m/bli_packm_knl_asm_24x8.c:105:6: error: conflicting types for ‘bli_dpackm_knl_asm_8xk’
     void bli_dpackm_knl_asm_8xk
          ^~~~~~~~~~~~~~~~~~~~~~
    compilation terminated due to -Wfatal-errors.
    make: *** [obj/knl/kernels/knl/1m/bli_packm_knl_asm_24x8.o] Error 1
    gcc -O3 -mavx512f -mavx512pf -mfpmath=sse -march=knl -Wall -Wno-unused-function -Wfatal-errors -fPIC -std=c99 -D_POSIX_C_SOURCE=200112L -I./include/knl -I./frame/3/ -I./frame/ind/ukernels/ -I./frame/1m/ -I./frame/1f/ -I./frame/1/ -I./frame/include -DBLIS_VERSION_STRING=\"0.3.0\" -c kernels/knl/1m/bli_packm_knl_asm_30x8.c -o obj/knl/kernels/knl/1m/bli_packm_knl_asm_30x8.o
    kernels/knl/1m/bli_packm_knl_asm_30x8.c:133:6: error: conflicting types for ‘bli_dpackm_knl_asm_30xk’
     void bli_dpackm_knl_asm_30xk
          ^~~~~~~~~~~~~~~~~~~~~~~
    compilation terminated due to -Wfatal-errors.
    make: *** [obj/knl/kernels/knl/1m/bli_packm_knl_asm_30x8.o] Error 1
    

    -Wfatal-errors seems unhelpful as it stops the trace of macro definitions being shown.

    opened by loveshack 47
  • threading=pthreads is broken

    I'm trying to use pthreads-win32 as the threading library and there are linker errors now.

    See https://ci.appveyor.com/project/isuruf/blis/builds/19668917/job/fhjfy9wwtl8ynocj?fullLog=true

    OpenMP and no threading work fine. https://ci.appveyor.com/project/isuruf/blis/builds/19668917

    opened by isuruf 46
  • gcc minimum version needs increasing for x86

    The gcc 4.7+ requirement in configure needs to be increased. I don't know when they were introduced, but -march=sandybridge and haswell aren't supported in gcc 4.8 (EL7).

    opened by loveshack 44
  • multithread by default?

    Widely used BLAS implementations such as MKL and OpenBLAS enable multithreading by default. Since BLIS is a drop-in BLAS replacement, I don't expect every user to be aware of the environment variable that sets the number of threads...

    Why is BLIS running in single thread mode by default, unlike MKL/OpenBLAS?
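
    For context, the environment variable in question is BLIS_NUM_THREADS. The sketch below shows one way a single-threaded default with an environment override can be expressed. Both parse_num_threads and default_num_threads are hypothetical helpers written for illustration; they are not BLIS's actual implementation.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: turn the text of a thread-count variable into a count.
   Unset, empty, malformed, or non-positive values fall back to 1 thread. */
int parse_num_threads(const char *text)
{
    if (text == NULL || *text == '\0')
        return 1;

    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);

    /* Reject overflow, strings with no digits, and trailing garbage. */
    if (errno != 0 || end == text || *end != '\0')
        return 1;
    if (value < 1 || value > INT_MAX)
        return 1;
    return (int)value;
}

/* Hypothetical helper: read BLIS_NUM_THREADS from the environment. */
int default_num_threads(void)
{
    int n = parse_num_threads(getenv("BLIS_NUM_THREADS"));
    assert(n >= 1);
    return n;
}
```

    With this scheme a user who never sets the variable gets one thread, which is the behaviour the question above is about; a multithreaded default would instead fall back to something like the detected core count.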

    opened by cdluminate 42
  • [aarch64] possible issue with atomic barrier and generic implementation (lack of good atomic support on generic kernel ?)

    I've been observing strange behavior when launching some jobs on an Ampere Altra processor (ARM neoverse-n1 based design) with multiple threads. The symptoms remain the same with both pthread and openmp threading: when launching the computation with multiple threads, the computation is likely to deadlock when a dgemm_ request is made.

    Looking at the stack trace, when the dgemm_ call is made, I can see an awful lot of threads there (see extract below, with the thread number). In this example, I was running with 64 threads in BLIS (with an 80-core processor).

    Thread 1103 (Thread 0xffe2b1fcc5f0 (LWP 3186538) "actranpy"):
    #0  0x0000ffffb3f9a018 in bli_thrcomm_barrier_atomic () from /bin/../lib/libblis.so.4
    #1  0x0000ffffb3f99f8c in bli_thrcomm_bcast () from /bin/../lib/libblis.so.4
    #2  0x0000ffffb3eee56c in bli_packm_alloc () from /bin/../lib/libblis.so.4
    #3  0x0000ffffb3eefe2c in bli_packm_init () from /bin/../lib/libblis.so.4
    #4  0x0000ffffb3eee638 in bli_packm_blk_var1 () from /bin/../lib/libblis.so.4
    #5  0x0000ffffb3eefecc in bli_packm_int () from /bin/../lib/libblis.so.4
    #6  0x0000ffffb3f1b784 in bli_l3_packb () from /bin/../lib/libblis.so.4
    #7  0x0000ffffb3f1a398 in bli_l3_int () from /bin/../lib/libblis.so.4
    #8  0x0000ffffb3f2d850 in bli_gemm_blk_var3 () from /bin/../lib/libblis.so.4
    #9  0x0000ffffb3f1a398 in bli_l3_int () from /bin/../lib/libblis.so.4
    #10 0x0000ffffb3f2d6fc in bli_gemm_blk_var2 () from /bin/../lib/libblis.so.4
    #11 0x0000ffffb3f1a398 in bli_l3_int () from /bin/../lib/libblis.so.4
    #12 0x0000ffffb3f99a30 in bli_l3_thread_entry () from /bin/../lib/libblis.so.4
    #13 0x0000ffffb3627898 in start_thread () from /lib64/libpthread.so.0
    #14 0x0000ffffb3571ddc in thread_start () from /lib64/libc.so.6
    [...]
    Thread 667 (Thread 0xffe44e2ec5f0 (LWP 3220206) "actranpy"):
    #0  0x0000ffffb3f9a018 in bli_thrcomm_barrier_atomic () from /bin/../lib/libblis.so.4
    #1  0x0000ffffb3f99f8c in bli_thrcomm_bcast () from /bin/../lib/libblis.so.4
    #2  0x0000ffffb3f9b94c in bli_thrinfo_create_for_cntl () from /bin/../lib/libblis.so.4
    #3  0x0000ffffb3f9ba5c in bli_thrinfo_rgrow () from /bin/../lib/libblis.so.4
    #4  0x0000ffffb3f9bcb0 in bli_thrinfo_grow () from /bin/../lib/libblis.so.4
    #5  0x0000ffffb3f1a374 in bli_l3_int () from /bin/../lib/libblis.so.4
    #6  0x0000ffffb3f2d6fc in bli_gemm_blk_var2 () from /bin/../lib/libblis.so.4
    #7  0x0000ffffb3f1a398 in bli_l3_int () from /bin/../lib/libblis.so.4
    #8  0x0000ffffb3f99a30 in bli_l3_thread_entry () from /bin/../lib/libblis.so.4
    #9  0x0000ffffb3627898 in start_thread () from /lib64/libpthread.so.0
    #10 0x0000ffffb3571ddc in thread_start () from /lib64/libc.so.6
    
    

    The original call is:

    Thread 1 (Thread 0xffffb1fa3850 (LWP 3137984) "actranpy"):
    #0  0x0000ffffb3f9a018 in bli_thrcomm_barrier_atomic () from /bin/../lib/libblis.so.4
    #1  0x0000ffffb3f9b998 in bli_thrinfo_create_for_cntl () from /bin/../lib/libblis.so.4
    #2  0x0000ffffb3f9bcb0 in bli_thrinfo_grow () from /bin/../lib/libblis.so.4
    #3  0x0000ffffb3f1a374 in bli_l3_int () from /bin/../lib/libblis.so.4
    #4  0x0000ffffb3f99a30 in bli_l3_thread_entry () from /bin/../lib/libblis.so.4
    #5  0x0000ffffb3f99bf0 in bli_l3_thread_decorator () from /bin/../lib/libblis.so.4
    #6  0x0000ffffb3f2e0b4 in bli_gemm_front () from /bin/../lib/libblis.so.4
    #7  0x0000ffffb3f1a6d8 in bli_gemm_ex () from /bin/../lib/libblis.so.4
    #8  0x0000ffffb3f62834 in dgemm_ () from /bin/../lib/libblis.so.4
    
    

    How can I help identify the underlying problem?

    bug 
    opened by egaudry 41
  • Implement missing pthreads function on Windows

    Implement pthreads mutexes and one-time initialization using Windows primitives to avoid pthreads dependency. Fixes #247. In the current state, pthreads-dependence of the testsuite driver is not addressed. Also, it has not been tested.

    opened by devinamatthews 39
  • Conflict type lsame_ when include blis.h and lapacke.h in a same file

    I don't know if this is expected, but blis.h redefines the lsame_ auxiliary routine of Lapacke:

    $ cat test_blis_lapacke.c
    #include <stdio.h>
    #include <lapacke/lapacke.h>
    #include <blis.h>

    int main() { printf("Hello world\n"); return 0; }

    $ gcc -std=c99 -I<BLIS_DIR_HERE>/include/blis test_blis_lapacke.c -o test_blis_lapacke
    In file included from /work1/mqho/blis/build/x86/include/blis/blis.h:68:0,
                     from test_blis_lapacke.c:3:
    /work1/mqho/blis/build/x86/include/blis/bla_lsame.h:37:23: error: conflicting types for ‘lsame_’
     bla_logical PASTEF770(lsame)(const bla_character ca, const bla_character cb, ftnlen ca_len, ftnlen cb_len);
                           ^
    /work1/mqho/blis/build/x86/include/blis/bli_macro_defs.h:111:52: note: in definition of macro ‘PASTEF770’
     #define PASTEF770(name) name ## _
                             ^
    In file included from /usr/include/lapacke/lapacke.h:143:0,
                     from test_blis_lapacke.c:2:
    /usr/include/lapacke/lapacke.h:146:16: note: previous declaration of ‘lsame_’ was here
     lapack_logical LAPACK_lsame( char ca, char cb,
                    ^

    opened by hominhquan 39
  • Rework bli_packm_blk_var1.

    Separate the dense code (general/hermitian/symmetric/triangular fully-stored part) from the triangular, diagonal-intersecting code. This allows a more consistent usage of round-robin thread scheduling (even for the dense micro-panels) for the latter.

    opened by devinamatthews 0
  • Fix k = 0 edge case in power10 micro kernels

    When sgemm and dgemm microkernels of power10 are called with k = 0, they run into infinite loops and segfault. This is fixed now by early exit in the case of k = 0.
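
    To illustrate the kind of guard described here, the following is a toy scalar microkernel, not the power10 code itself. The name toy_dgemm_ukr, the 4x4 tile size, and the packed layouts are assumptions made for this sketch, and whether the real fix also scales C by beta before returning is not stated above.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_MR = 4, TOY_NR = 4 };

/* Hypothetical microkernel computing C := beta*C + alpha*A*B on one
   TOY_MR x TOY_NR tile. A holds k packed columns of TOY_MR values, B holds
   k packed rows of TOY_NR values, and C is row-major with leading
   dimension ldc. */
void toy_dgemm_ukr(size_t k, double alpha, const double *a, const double *b,
                   double beta, double *c, size_t ldc)
{
    assert(c != NULL);

    /* Early exit: with k == 0 there is no A*B product, only the scaling of C.
       Without this guard the do/while below would read from empty panels,
       and the unsigned counter would wrap around when decremented. */
    if (k == 0) {
        for (size_t i = 0; i < TOY_MR; ++i)
            for (size_t j = 0; j < TOY_NR; ++j)
                c[i * ldc + j] *= beta;
        return;
    }

    double ab[TOY_MR * TOY_NR] = { 0.0 };
    size_t iter = k;
    do {
        /* Rank-1 update of the accumulator with one column of A, one row of B. */
        for (size_t i = 0; i < TOY_MR; ++i)
            for (size_t j = 0; j < TOY_NR; ++j)
                ab[i * TOY_NR + j] += a[i] * b[j];
        a += TOY_MR;
        b += TOY_NR;
    } while (--iter != 0);

    for (size_t i = 0; i < TOY_MR; ++i)
        for (size_t j = 0; j < TOY_NR; ++j)
            c[i * ldc + j] = beta * c[i * ldc + j] + alpha * ab[i * TOY_NR + j];
}
```

    The do/while loop mirrors a common pattern in hand-written kernels where the trip count is assumed to be at least one, which is why a zero k needs to be handled before the loop is entered.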

    opened by nisanthmp 0
  • Disable power10 kernels other than sgemm and dgemm

    There is a power10 sandbox which uses micro-kernels for (lower precision) datatypes other than float and double. In the generic power10-configured build (non-sandbox), there were compile errors for some microkernels other than sgemm and dgemm. This change therefore enables those kernels only for the power10 sandbox and disables them in the normal power10-configured build.

    opened by nisanthmp 0
  • Add new constants.

    • BLIS_ONE_I: the imaginary unit
    • BLIS_MINUS_ONE_I: the negative imaginary unit
    • BLIS_NAN: a not-a-number value. Both real and imaginary parts are set to NaN for complex datatypes.
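
    As a rough picture of the values these constants stand for, here is a sketch using a plain struct. The type toy_dcomplex and the TOY_* names are hypothetical stand-ins; the real constants are library objects and are not defined this way.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical stand-in for a double-precision complex value. */
typedef struct { double real; double imag; } toy_dcomplex;

/* Illustrative values only: the imaginary unit and its negation. */
static const toy_dcomplex TOY_ONE_I       = { 0.0,  1.0 };
static const toy_dcomplex TOY_MINUS_ONE_I = { 0.0, -1.0 };

/* A not-a-number value; both parts are NaN in the complex case. */
toy_dcomplex toy_nan(void)
{
    toy_dcomplex z = { NAN, NAN };
    assert(isnan(z.real) && isnan(z.imag));
    return z;
}

/* Plain complex multiplication, used to check the constants' behaviour. */
toy_dcomplex toy_mul(toy_dcomplex x, toy_dcomplex y)
{
    toy_dcomplex z;
    z.real = x.real * y.real - x.imag * y.imag;
    z.imag = x.real * y.imag + x.imag * y.real;
    return z;
}
```

    Multiplying TOY_ONE_I by itself gives a real part of -1, and multiplying it by TOY_MINUS_ONE_I gives a real part of 1, which is the defining property of the two units.
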
    opened by devinamatthews 0
  • Fixing type-mismatch errors in power10 sandbox

    There was a mismatch between the definition and declaration of bli_gemm_ex() function in the power10 sandbox, which was throwing compilation errors. Now, the definition matches the declaration in blis.h.

    There was also a mismatch between the function definition and the function call for bli_gemm_front() in power10 sandbox. This is also fixed by typecasting the corresponding argument passed in the function.

    opened by nisanthmp 0
  • Compatibility with (reference) BLAS 3.10.1?

    Hey all

    I just wanted to check whether blis considers itself compatible with the BLAS APIs from upstream 3.10.1. We'd like to lift our baseline across blas-flavours in conda-forge, and for that I'd like to make sure that all symbols are supported everywhere. Perhaps this is a wrong question to ask, given blis' architecture, but I cannot really tell the state as an outsider. 😅

    For reference, running

    git diff tags/v3.9.0..tags/v3.10.1 --stat | grep -i "blas"
    

    on the upstream repo I get the following stats:

     BLAS/CMakeLists.txt                           |     4 +-
     BLAS/SRC/CMakeLists.txt                       |    40 +-
     BLAS/SRC/Makefile                             |     6 +
     BLAS/SRC/caxpy.f                              |     8 +-
     BLAS/SRC/ccopy.f                              |     8 +-
     BLAS/SRC/cdotc.f                              |    12 +-
     BLAS/SRC/cdotu.f                              |    12 +-
     BLAS/SRC/cgbmv.f                              |     7 +-
     BLAS/SRC/cgemm.f                              |    14 +-
     BLAS/SRC/cgemv.f                              |     7 +-
     BLAS/SRC/cgerc.f                              |     7 +-
     BLAS/SRC/cgeru.f                              |     7 +-
     BLAS/SRC/chbmv.f                              |     7 +-
     BLAS/SRC/chemm.f                              |     7 +-
     BLAS/SRC/chemv.f                              |     7 +-
     BLAS/SRC/cher.f                               |     7 +-
     BLAS/SRC/cher2.f                              |     7 +-
     BLAS/SRC/cher2k.f                             |     7 +-
     BLAS/SRC/cherk.f                              |     7 +-
     BLAS/SRC/chpmv.f                              |     7 +-
     BLAS/SRC/chpr.f                               |     7 +-
     BLAS/SRC/chpr2.f                              |     7 +-
     BLAS/SRC/crotg.f                              |    97 -
     BLAS/SRC/crotg.f90                            |   229 +
     BLAS/SRC/cscal.f                              |     8 +-
     BLAS/SRC/csrot.f                              |     8 +-
     BLAS/SRC/csscal.f                             |     8 +-
     BLAS/SRC/cswap.f                              |     8 +-
     BLAS/SRC/csymm.f                              |     7 +-
     BLAS/SRC/csyr2k.f                             |     7 +-
     BLAS/SRC/csyrk.f                              |     7 +-
     BLAS/SRC/ctbmv.f                              |     7 +-
     BLAS/SRC/ctbsv.f                              |     7 +-
     BLAS/SRC/ctpmv.f                              |     7 +-
     BLAS/SRC/ctpsv.f                              |     7 +-
     BLAS/SRC/ctrmm.f                              |     7 +-
     BLAS/SRC/ctrmv.f                              |     7 +-
     BLAS/SRC/ctrsm.f                              |     7 +-
     BLAS/SRC/ctrsv.f                              |     7 +-
     BLAS/SRC/dasum.f                              |     8 +-
     BLAS/SRC/daxpy.f                              |     8 +-
     BLAS/SRC/dcabs1.f                             |     8 +-
     BLAS/SRC/dcopy.f                              |     8 +-
     BLAS/SRC/ddot.f                               |     8 +-
     BLAS/SRC/dgbmv.f                              |     7 +-
     BLAS/SRC/dgemm.f                              |    15 +-
     BLAS/SRC/dgemv.f                              |     7 +-
     BLAS/SRC/dger.f                               |     7 +-
     BLAS/SRC/dnrm2.f                              |   132 -
     BLAS/SRC/dnrm2.f90                            |   199 +
     BLAS/SRC/drot.f                               |     8 +-
     BLAS/SRC/drotg.f                              |   109 -
     BLAS/SRC/drotg.f90                            |   151 +
     BLAS/SRC/drotm.f                              |     8 +-
     BLAS/SRC/drotmg.f                             |    25 +-
     BLAS/SRC/dsbmv.f                              |     7 +-
     BLAS/SRC/dscal.f                              |     8 +-
     BLAS/SRC/dsdot.f                              |     8 +-
     BLAS/SRC/dspmv.f                              |     7 +-
     BLAS/SRC/dspr.f                               |     7 +-
     BLAS/SRC/dspr2.f                              |     7 +-
     BLAS/SRC/dswap.f                              |     8 +-
     BLAS/SRC/dsymm.f                              |     7 +-
     BLAS/SRC/dsymv.f                              |     7 +-
     BLAS/SRC/dsyr.f                               |     7 +-
     BLAS/SRC/dsyr2.f                              |     7 +-
     BLAS/SRC/dsyr2k.f                             |     7 +-
     BLAS/SRC/dsyrk.f                              |     7 +-
     BLAS/SRC/dtbmv.f                              |     7 +-
     BLAS/SRC/dtbsv.f                              |     7 +-
     BLAS/SRC/dtpmv.f                              |     7 +-
     BLAS/SRC/dtpsv.f                              |     7 +-
     BLAS/SRC/dtrmm.f                              |     7 +-
     BLAS/SRC/dtrmv.f                              |     7 +-
     BLAS/SRC/dtrsm.f                              |     7 +-
     BLAS/SRC/dtrsv.f                              |     7 +-
     BLAS/SRC/dzasum.f                             |    10 +-
     BLAS/SRC/dznrm2.f                             |   140 -
     BLAS/SRC/dznrm2.f90                           |   209 +
     BLAS/SRC/icamax.f                             |     8 +-
     BLAS/SRC/idamax.f                             |     8 +-
     BLAS/SRC/isamax.f                             |     8 +-
     BLAS/SRC/izamax.f                             |     8 +-
     BLAS/SRC/lsame.f                              |     5 +-
     BLAS/SRC/meson.build                          |    29 -
     BLAS/SRC/sasum.f                              |     8 +-
     BLAS/SRC/saxpy.f                              |     8 +-
     BLAS/SRC/scabs1.f                             |     8 +-
     BLAS/SRC/scasum.f                             |     8 +-
     BLAS/SRC/scnrm2.f                             |   140 -
     BLAS/SRC/scnrm2.f90                           |   209 +
     BLAS/SRC/scopy.f                              |     8 +-
     BLAS/SRC/sdot.f                               |     8 +-
     BLAS/SRC/sdsdot.f                             |     8 +-
     BLAS/SRC/sgbmv.f                              |     7 +-
     BLAS/SRC/sgemm.f                              |    15 +-
     BLAS/SRC/sgemv.f                              |     7 +-
     BLAS/SRC/sger.f                               |     7 +-
     BLAS/SRC/snrm2.f                              |   132 -
     BLAS/SRC/snrm2.f90                            |   199 +
     BLAS/SRC/srot.f                               |     8 +-
     BLAS/SRC/srotg.f                              |   109 -
     BLAS/SRC/srotg.f90                            |   151 +
     BLAS/SRC/srotm.f                              |     8 +-
     BLAS/SRC/srotmg.f                             |    27 +-
     BLAS/SRC/ssbmv.f                              |     7 +-
     BLAS/SRC/sscal.f                              |     8 +-
     BLAS/SRC/sspmv.f                              |     7 +-
     BLAS/SRC/sspr.f                               |     7 +-
     BLAS/SRC/sspr2.f                              |     7 +-
     BLAS/SRC/sswap.f                              |     8 +-
     BLAS/SRC/ssymm.f                              |     7 +-
     BLAS/SRC/ssymv.f                              |     7 +-
     BLAS/SRC/ssyr.f                               |     7 +-
     BLAS/SRC/ssyr2.f                              |     7 +-
     BLAS/SRC/ssyr2k.f                             |     7 +-
     BLAS/SRC/ssyrk.f                              |     7 +-
     BLAS/SRC/stbmv.f                              |     7 +-
     BLAS/SRC/stbsv.f                              |     7 +-
     BLAS/SRC/stpmv.f                              |     7 +-
     BLAS/SRC/stpsv.f                              |     7 +-
     BLAS/SRC/strmm.f                              |     7 +-
     BLAS/SRC/strmv.f                              |     7 +-
     BLAS/SRC/strsm.f                              |     7 +-
     BLAS/SRC/strsv.f                              |     7 +-
     BLAS/SRC/xerbla.f                             |     5 +-
     BLAS/SRC/xerbla_array.f                       |     8 +-
     BLAS/SRC/zaxpy.f                              |     8 +-
     BLAS/SRC/zcopy.f                              |     8 +-
     BLAS/SRC/zdotc.f                              |    12 +-
     BLAS/SRC/zdotu.f                              |    12 +-
     BLAS/SRC/zdrot.f                              |    48 +-
     BLAS/SRC/zdscal.f                             |     8 +-
     BLAS/SRC/zgbmv.f                              |     7 +-
     BLAS/SRC/zgemm.f                              |    14 +-
     BLAS/SRC/zgemv.f                              |     7 +-
     BLAS/SRC/zgerc.f                              |     7 +-
     BLAS/SRC/zgeru.f                              |     7 +-
     BLAS/SRC/zhbmv.f                              |     7 +-
     BLAS/SRC/zhemm.f                              |     7 +-
     BLAS/SRC/zhemv.f                              |     7 +-
     BLAS/SRC/zher.f                               |     7 +-
     BLAS/SRC/zher2.f                              |     7 +-
     BLAS/SRC/zher2k.f                             |     7 +-
     BLAS/SRC/zherk.f                              |     7 +-
     BLAS/SRC/zhpmv.f                              |     7 +-
     BLAS/SRC/zhpr.f                               |     7 +-
     BLAS/SRC/zhpr2.f                              |     7 +-
     BLAS/SRC/zrotg.f                              |    98 -
     BLAS/SRC/zrotg.f90                            |   229 +
     BLAS/SRC/zscal.f                              |     8 +-
     BLAS/SRC/zswap.f                              |     8 +-
     BLAS/SRC/zsymm.f                              |     7 +-
     BLAS/SRC/zsyr2k.f                             |     7 +-
     BLAS/SRC/zsyrk.f                              |     7 +-
     BLAS/SRC/ztbmv.f                              |     7 +-
     BLAS/SRC/ztbsv.f                              |     7 +-
     BLAS/SRC/ztpmv.f                              |     7 +-
     BLAS/SRC/ztpsv.f                              |     7 +-
     BLAS/SRC/ztrmm.f                              |     7 +-
     BLAS/SRC/ztrmv.f                              |     7 +-
     BLAS/SRC/ztrsm.f                              |     7 +-
     BLAS/SRC/ztrsv.f                              |     7 +-
     BLAS/TESTING/CMakeLists.txt                   |     2 +-
     BLAS/TESTING/cblat1.f                         |    80 +-
     BLAS/TESTING/cblat2.f                         |    36 +-
     BLAS/TESTING/cblat3.f                         |    34 +-
     BLAS/TESTING/dblat1.f                         |    83 +-
     BLAS/TESTING/dblat2.f                         |    36 +-
     BLAS/TESTING/dblat3.f                         |    34 +-
     BLAS/TESTING/sblat1.f                         |    79 +-
     BLAS/TESTING/sblat2.f                         |    36 +-
     BLAS/TESTING/sblat3.f                         |    34 +-
     BLAS/TESTING/zblat1.f                         |    80 +-
     BLAS/TESTING/zblat2.f                         |    36 +-
     BLAS/TESTING/zblat3.f                         |    34 +-
     BLAS/blas.pc.in                               |     2 +-
     CBLAS/CMakeLists.txt                          |    72 +-
     CBLAS/cblas.pc.in                             |     4 +-
     CBLAS/cmake/cblas-config-build.cmake.in       |     4 +-
     CBLAS/cmake/cblas-config-install.cmake.in     |     8 +-
     CBLAS/examples/CMakeLists.txt                 |     4 +-
     CBLAS/examples/cblas_example1.c               |     2 +-
     CBLAS/examples/cblas_example2.c               |     2 +-
     CBLAS/include/CMakeLists.txt                  |     4 +-
     CBLAS/include/cblas.h                         |   704 +-
     CBLAS/include/cblas_f77.h                     |  1300 +-
     CBLAS/src/CMakeLists.txt                      |    15 +-
     CBLAS/src/cblas_caxpy.c                       |     4 +-
     CBLAS/src/cblas_ccopy.c                       |     4 +-
     CBLAS/src/cblas_cdotc_sub.c                   |     4 +-
     CBLAS/src/cblas_cdotu_sub.c                   |     4 +-
     CBLAS/src/cblas_cgbmv.c                       |    14 +-
     CBLAS/src/cblas_cgemm.c                       |     8 +-
     CBLAS/src/cblas_cgemv.c                       |    12 +-
     CBLAS/src/cblas_cgerc.c                       |     8 +-
     CBLAS/src/cblas_cgeru.c                       |     6 +-
     CBLAS/src/cblas_chbmv.c                       |    12 +-
     CBLAS/src/cblas_chemm.c                       |     8 +-
     CBLAS/src/cblas_chemv.c                       |    12 +-
     CBLAS/src/cblas_cher.c                        |     6 +-
     CBLAS/src/cblas_cher2.c                       |     6 +-
     CBLAS/src/cblas_cher2k.c                      |     8 +-
     CBLAS/src/cblas_cherk.c                       |     6 +-
     CBLAS/src/cblas_chpmv.c                       |    10 +-
     CBLAS/src/cblas_chpr.c                        |     6 +-
     CBLAS/src/cblas_chpr2.c                       |     6 +-
     CBLAS/src/cblas_cscal.c                       |     4 +-
     CBLAS/src/cblas_csscal.c                      |     4 +-
     CBLAS/src/cblas_cswap.c                       |     4 +-
     CBLAS/src/cblas_csymm.c                       |     8 +-
     CBLAS/src/cblas_csyr2k.c                      |     8 +-
     CBLAS/src/cblas_csyrk.c                       |     6 +-
     CBLAS/src/cblas_ctbmv.c                       |     6 +-
     CBLAS/src/cblas_ctbsv.c                       |     6 +-
     CBLAS/src/cblas_ctpmv.c                       |     4 +-
     CBLAS/src/cblas_ctpsv.c                       |     4 +-
     CBLAS/src/cblas_ctrmm.c                       |     6 +-
     CBLAS/src/cblas_ctrmv.c                       |     6 +-
     CBLAS/src/cblas_ctrsm.c                       |     6 +-
     CBLAS/src/cblas_ctrsv.c                       |     6 +-
     CBLAS/src/cblas_dasum.c                       |     2 +-
     CBLAS/src/cblas_daxpy.c                       |     4 +-
     CBLAS/src/cblas_dcopy.c                       |     4 +-
     CBLAS/src/cblas_ddot.c                        |     4 +-
     CBLAS/src/cblas_dgbmv.c                       |    10 +-
     CBLAS/src/cblas_dgemm.c                       |     8 +-
     CBLAS/src/cblas_dgemv.c                       |     8 +-
     CBLAS/src/cblas_dger.c                        |     6 +-
     CBLAS/src/cblas_dnrm2.c                       |     2 +-
     CBLAS/src/cblas_drot.c                        |     4 +-
     CBLAS/src/cblas_drotm.c                       |     4 +-
     CBLAS/src/cblas_dsbmv.c                       |     8 +-
     CBLAS/src/cblas_dscal.c                       |     4 +-
     CBLAS/src/cblas_dsdot.c                       |     4 +-
     CBLAS/src/cblas_dspmv.c                       |     6 +-
     CBLAS/src/cblas_dspr.c                        |     4 +-
     CBLAS/src/cblas_dspr2.c                       |     4 +-
     CBLAS/src/cblas_dswap.c                       |     4 +-
     CBLAS/src/cblas_dsymm.c                       |     8 +-
     CBLAS/src/cblas_dsymv.c                       |     8 +-
     CBLAS/src/cblas_dsyr.c                        |     4 +-
     CBLAS/src/cblas_dsyr2.c                       |     6 +-
     CBLAS/src/cblas_dsyr2k.c                      |     8 +-
     CBLAS/src/cblas_dsyrk.c                       |     6 +-
     CBLAS/src/cblas_dtbmv.c                       |     4 +-
     CBLAS/src/cblas_dtbsv.c                       |     4 +-
     CBLAS/src/cblas_dtpmv.c                       |     2 +-
     CBLAS/src/cblas_dtpsv.c                       |     2 +-
     CBLAS/src/cblas_dtrmm.c                       |     6 +-
     CBLAS/src/cblas_dtrmv.c                       |     4 +-
     CBLAS/src/cblas_dtrsm.c                       |     6 +-
     CBLAS/src/cblas_dtrsv.c                       |     4 +-
     CBLAS/src/cblas_dzasum.c                      |     2 +-
     CBLAS/src/cblas_dznrm2.c                      |     2 +-
     CBLAS/src/cblas_globals.c                     |     4 +-
     CBLAS/src/cblas_icamax.c                      |    12 +-
     CBLAS/src/cblas_idamax.c                      |    12 +-
     CBLAS/src/cblas_isamax.c                      |    12 +-
     CBLAS/src/cblas_izamax.c                      |    12 +-
     CBLAS/src/cblas_sasum.c                       |     2 +-
     CBLAS/src/cblas_saxpy.c                       |     4 +-
     CBLAS/src/cblas_scasum.c                      |     2 +-
     CBLAS/src/cblas_scnrm2.c                      |     2 +-
     CBLAS/src/cblas_scopy.c                       |     4 +-
     CBLAS/src/cblas_sdot.c                        |     4 +-
     CBLAS/src/cblas_sdsdot.c                      |     4 +-
     CBLAS/src/cblas_sgbmv.c                       |    10 +-
     CBLAS/src/cblas_sgemm.c                       |    10 +-
     CBLAS/src/cblas_sgemv.c                       |     8 +-
     CBLAS/src/cblas_sger.c                        |     6 +-
     CBLAS/src/cblas_snrm2.c                       |     2 +-
     CBLAS/src/cblas_srot.c                        |     4 +-
     CBLAS/src/cblas_srotm.c                       |     4 +-
     CBLAS/src/cblas_ssbmv.c                       |     6 +-
     CBLAS/src/cblas_sscal.c                       |     4 +-
     CBLAS/src/cblas_sspmv.c                       |     6 +-
     CBLAS/src/cblas_sspr.c                        |     4 +-
     CBLAS/src/cblas_sspr2.c                       |     4 +-
     CBLAS/src/cblas_sswap.c                       |     4 +-
     CBLAS/src/cblas_ssymm.c                       |     8 +-
     CBLAS/src/cblas_ssymv.c                       |     8 +-
     CBLAS/src/cblas_ssyr.c                        |     4 +-
     CBLAS/src/cblas_ssyr2.c                       |     6 +-
     CBLAS/src/cblas_ssyr2k.c                      |     8 +-
     CBLAS/src/cblas_ssyrk.c                       |     6 +-
     CBLAS/src/cblas_stbmv.c                       |     4 +-
     CBLAS/src/cblas_stbsv.c                       |     4 +-
     CBLAS/src/cblas_stpmv.c                       |     2 +-
     CBLAS/src/cblas_stpsv.c                       |     2 +-
     CBLAS/src/cblas_strmm.c                       |     6 +-
     CBLAS/src/cblas_strmv.c                       |     4 +-
     CBLAS/src/cblas_strsm.c                       |     6 +-
     CBLAS/src/cblas_strsv.c                       |     4 +-
     CBLAS/src/cblas_xerbla.c                      |     6 +-
     CBLAS/src/cblas_zaxpy.c                       |     4 +-
     CBLAS/src/cblas_zcopy.c                       |     4 +-
     CBLAS/src/cblas_zdotc_sub.c                   |     4 +-
     CBLAS/src/cblas_zdotu_sub.c                   |     4 +-
     CBLAS/src/cblas_zdscal.c                      |     4 +-
     CBLAS/src/cblas_zgbmv.c                       |    14 +-
     CBLAS/src/cblas_zgemm.c                       |     8 +-
     CBLAS/src/cblas_zgemv.c                       |    12 +-
     CBLAS/src/cblas_zgerc.c                       |     8 +-
     CBLAS/src/cblas_zgeru.c                       |     6 +-
     CBLAS/src/cblas_zhbmv.c                       |    12 +-
     CBLAS/src/cblas_zhemm.c                       |     8 +-
     CBLAS/src/cblas_zhemv.c                       |    12 +-
     CBLAS/src/cblas_zher.c                        |     6 +-
     CBLAS/src/cblas_zher2.c                       |     6 +-
     CBLAS/src/cblas_zher2k.c                      |     8 +-
     CBLAS/src/cblas_zherk.c                       |     6 +-
     CBLAS/src/cblas_zhpmv.c                       |    10 +-
     CBLAS/src/cblas_zhpr.c                        |     6 +-
     CBLAS/src/cblas_zhpr2.c                       |     6 +-
     CBLAS/src/cblas_zscal.c                       |     4 +-
     CBLAS/src/cblas_zswap.c                       |     4 +-
     CBLAS/src/cblas_zsymm.c                       |     8 +-
     CBLAS/src/cblas_zsyr2k.c                      |     8 +-
     CBLAS/src/cblas_zsyrk.c                       |     6 +-
     CBLAS/src/cblas_ztbmv.c                       |     6 +-
     CBLAS/src/cblas_ztbsv.c                       |     6 +-
     CBLAS/src/cblas_ztpmv.c                       |     4 +-
     CBLAS/src/cblas_ztpsv.c                       |     4 +-
     CBLAS/src/cblas_ztrmm.c                       |     6 +-
     CBLAS/src/cblas_ztrmv.c                       |     6 +-
     CBLAS/src/cblas_ztrsm.c                       |     6 +-
     CBLAS/src/cblas_ztrsv.c                       |     6 +-
     CBLAS/src/xerbla.c                            |    14 +-
     CBLAS/testing/CMakeLists.txt                  |    48 +-
     CBLAS/testing/c_c2chke.c                      |     2 +
     CBLAS/testing/c_c3chke.c                      |     2 +
     CBLAS/testing/c_d2chke.c                      |     2 +
     CBLAS/testing/c_d3chke.c                      |     2 +
     CBLAS/testing/c_s2chke.c                      |     2 +
     CBLAS/testing/c_s3chke.c                      |     2 +
     CBLAS/testing/c_xerbla.c                      |     4 +-
     CBLAS/testing/c_z2chke.c                      |     2 +
     CBLAS/testing/c_z3chke.c                      |     2 +
    
    opened by h-vetinari
Releases (0.9.0)
  • 0.9.0 (Apr 4, 2022)

    This release contains a slew of improvements, new kernels and APIs, bugfixes, and more (including lots of code reduction). It also contains foundational support for an exciting new class of expert functionality: creating new operations without the need to duplicate the middleware that sits between the API and kernels.

    Improvements present in 0.9.0:

    Framework:

    • Added various fields to obj_t that store function pointers to custom packm kernels, microkernels, etc., as well as accessor functions to set and query those fields. (Devin Matthews)
    • Enabled user-customized packm microkernels and variants via the aforementioned new obj_t fields. (Devin Matthews)
    • Moved edge-case handling out of the macrokernel and into the gemm and gemmtrsm microkernels. This also required updating the APIs and definitions of all existing microkernels in the kernels directory. Edge-case handling functionality is now facilitated via new preprocessor macros found in bli_edge_case_macro_defs.h. (Devin Matthews)
    • Avoid gemmsup thread barriers when not packing A or B. This boosts performance for many small multithreaded problems. (Field Van Zee, AMD)
    • Allow the 1m method to operate normally when single and double real-domain microkernels mix row and column I/O preference. (Field Van Zee, Devin Matthews, RuQing Xu)
    • Removed support for execution of complex-domain level-3 operations via the 3m and 4m methods.
    • Refactored herk, her2k, syrk, syr2k in terms of gemmt. (Devin Matthews)
    • Defined setijv and getijv to set/get vector elements.
    • Defined eqsc, eqv, and eqm operations to test equality between two scalars, vectors, or matrices.
    • Added new bounds checking to setijm and getijm to prevent use of negative indices.
    • Renamed membrk files/variables/functions to pba.
    • Store error-checking level as a thread-local variable. (Devin Matthews)
    • Add err_t* "return" parameter to bli_malloc_*() and friends.
    • Switched internal mutexes of the sba and pba to static initialization.
    • Changed return value method of bli_pack_get_pack_a(), bli_pack_get_pack_b().
    • Fixed a bug so that bli_init() can now be called more than once without segfaulting. (@lschork2, Minh Quan Ho, Devin Matthews)
    • Removed a sanity check in bli_pool_finalize() that prevented BLIS from being re-initialized. (AMD)
    • Fixed insufficient pool_t-growing logic in bli_pool.c, and always allocate at least one element in .block_ptrs array. (Minh Quan Ho)
    • Cleanups related to the error message array in bli_error.c. (Minh Quan Ho)
    • Moved language-related definitions from bli_macro_defs.h to a new header, bli_lang_defs.h.
    • Renamed BLIS_SIMD_NUM_REGISTERS to BLIS_SIMD_MAX_NUM_REGISTERS and BLIS_SIMD_SIZE to BLIS_SIMD_MAX_SIZE for improved clarity. (Devin Matthews)
    • Many minor bugfixes.
    • Many cleanups, including removal of old and commented-out code.
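
    The bounds checking added to setijm and getijm can be illustrated with a minimal sketch. It assumes a hypothetical column-major matrix view (mat_view_t) and hypothetical helpers (mat_setij, mat_getij); none of these are BLIS names, and the real functions operate on obj_t with different signatures. The point is only that signed indices let a negative value be rejected rather than wrap around into a large offset.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical column-major matrix view; this is NOT the BLIS obj_t. */
typedef struct {
    ptrdiff_t m;    /* number of rows */
    ptrdiff_t n;    /* number of columns */
    ptrdiff_t ld;   /* leading dimension (>= m) */
    double   *data; /* element (i, j) lives at data[i + j * ld] */
} mat_view_t;

/* Return true when (i, j) addresses a valid element. Because the
   indices are signed, a negative value is detectable here. */
static bool mat_index_ok(const mat_view_t *a, ptrdiff_t i, ptrdiff_t j)
{
    return i >= 0 && j >= 0 && i < a->m && j < a->n;
}

/* Store value at (i, j). Returns false and leaves the matrix
   untouched when the indices are out of range. */
bool mat_setij(mat_view_t *a, ptrdiff_t i, ptrdiff_t j, double value)
{
    if (!mat_index_ok(a, i, j)) return false;
    a->data[i + j * a->ld] = value;
    return true;
}

/* Load the element at (i, j) into *value. Returns false and leaves
   *value untouched when the indices are out of range. */
bool mat_getij(const mat_view_t *a, ptrdiff_t i, ptrdiff_t j, double *value)
{
    if (!mat_index_ok(a, i, j)) return false;
    *value = a->data[i + j * a->ld];
    return true;
}
```

    Returning a status instead of aborting is a design choice of this sketch; it lets the caller decide how to react to a bad index.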

    Compatibility:

    • Expanded BLAS layer to include support for ?axpby_() and ?gemm_batch_(). (Meghana Vankadari, AMD)
    • Added gemm3m APIs to BLAS and CBLAS layers. (Bhaskar Nallani, AMD)
    • Handle ?gemm_() invocations where m or n is unit by calling ?gemv_(). (Dipal M Zambare, AMD)
    • Removed option to finalize BLIS after every BLAS call.
    • Updated default definitions of bli_slamch() and bli_dlamch() to use constants from standard C library rather than values computed at runtime. (Devin Matthews)
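
    The ?gemm_() to ?gemv_() redirection rests on a simple identity: when n is 1, B and C are single columns, so C := alpha*A*B + beta*C is exactly a matrix-vector product. The sketch below shows this with naive reference loops. The names ref_dgemm_nn and ref_dgemv_n are hypothetical and only the no-transpose, column-major case is covered; it is not the BLIS implementation. The m == 1 case is analogous, applying the transposed operation to the single row.

```c
#include <stddef.h>

/* Reference y := alpha * A * x + beta * y, with A an m x k
   column-major matrix. Special handling of beta == 0 is omitted
   for brevity. */
void ref_dgemv_n(size_t m, size_t k, double alpha,
                 const double *a, size_t lda,
                 const double *x, double beta, double *y)
{
    for (size_t i = 0; i < m; ++i) {
        double acc = 0.0;
        for (size_t p = 0; p < k; ++p)
            acc += a[i + p * lda] * x[p];
        y[i] = alpha * acc + beta * y[i];
    }
}

/* Reference C := alpha * A * B + beta * C, no transposes,
   column-major storage. */
void ref_dgemm_nn(size_t m, size_t n, size_t k, double alpha,
                  const double *a, size_t lda,
                  const double *b, size_t ldb,
                  double beta, double *c, size_t ldc)
{
    if (n == 1) {
        /* B and C each hold one column, so hand the problem to the
           matrix-vector routine instead of the general loops. */
        ref_dgemv_n(m, k, alpha, a, lda, b, beta, c);
        return;
    }
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            double acc = 0.0;
            for (size_t p = 0; p < k; ++p)
                acc += a[i + p * lda] * b[p + j * ldb];
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}
```

    In an optimized library the benefit is that the level-2 path avoids the packing and blocking overhead that a level-3 operation would otherwise incur for a single column.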

    Kernels:

    • Added 512-bit SVE-based a64fx subconfiguration that uses empirically-tuned blocksizes. (Stepan Nassyr, RuQing Xu)
    • Added a vector-length agnostic armsve subconfig that computes blocksizes via an analytical model. (Stepan Nassyr)
    • Added vector-length agnostic d/s/sh gemm kernels for Arm SVE. (Stepan Nassyr)
    • Added gemmsup kernels to the armv8a kernel set for use in new Apple Firestorm subconfiguration. (RuQing Xu)
    • Added 512-bit SVE dpackm kernels (16xk and 10xk) with in-register transpose. (RuQing Xu)
    • Extended 256-bit SVE dpackm kernels by Linaro Ltd. to 512-bit for size 12xk. (RuQing Xu)
    • Reorganized register usage in bli_gemm_armv8a_asm_d6x8.c to accommodate clang. (RuQing Xu)
    • Added saxpyf/daxpyf/caxpyf kernels to zen kernel set. (Dipal M Zambare, AMD)
    • Added vzeroupper instruction to haswell microkernels. (Devin Matthews)
    • Added explicit beta == 0 handling in s/d armsve and armv7a gemm microkernels. (Devin Matthews)
    • Added a unique tag to branch labels to accommodate clang. (Devin Matthews, Jeff Hammond)
    • Fixed a copy-paste bug in the loading of kappa_i in the two assembly cpackm kernels in haswell kernel set. (Devin Matthews)
    • Fixed a bug in Mx1 gemmsup haswell kernels whereby the vhaddpd instruction is used with uninitialized registers. (Devin Matthews)
    • Fixed a bug in the power10 microkernel I/O. (Nicholai Tukanov)
    • Many other Arm kernel updates and fixes. (RuQing Xu)

    Extras:

    • Added support for addons, which are similar to sandboxes but do not require the user to implement any particular operation.
    • Added a new gemmlike sandbox to allow rapid prototyping of gemm-like operations.
    • Various updates and improvements to the power10 sandbox, including a new testsuite. (Nicholai Tukanov)

    Build system:

    • Added explicit support for AMD's Zen3 microarchitecture. (Dipal M Zambare, AMD, Field Van Zee)
    • Added runtime microarchitecture detection for Arm. (Dave Love, RuQing Xu, Devin Matthews)
    • Added a new configure option --[en|dis]able-amd-frame-tweaks that allows BLIS to compile certain framework files (each with the _amd suffix) that have been customized by AMD for improved performance (provided that the targeted configuration is eligible). By default, the more portable counterparts to these files are compiled. (Field Van Zee, AMD)
    • Added an explicit compiler predicate (is_win) for Windows in configure. (Devin Matthews)
    • Use -march=haswell instead of -march=skylake-avx512 on Windows. (Devin Matthews, @h-vetinari)
    • Fixed configure breakage on MacOSX by accepting either clang or LLVM in vendor string. (Devin Matthews)
    • Blacklist clang10/gcc9 and older for armsve subconfig.
    • Added a configure option to control whether or not to use @rpath. (Devin Matthews)
    • Added armclang detection to configure. (Devin Matthews)
    • Use @path-based install name on MacOSX and use relocatable RPATH entries for testsuite binaries. (Devin Matthews)
    • For the environment variables CC, CXX, FC, PYTHON, AR, and RANLIB, configure will now print an error message and abort if the user requests a particular tool and that tool is not found. (Field Van Zee, Devin Matthews)
    • Added symlink to blis.pc.in for out-of-tree builds. (Andrew Wildman)
    • Register optimized real-domain copyv, setv, and swapv kernels in zen subconfig. (Dipal M Zambare, AMD)
    • Added Apple Firestorm (A14/M1) subconfiguration, firestorm. (RuQing Xu)
    • Added armsve subconfig to arm64 configuration family. (RuQing Xu)
    • Allow using clang with the thunderx2 subconfiguration. (Devin Matthews)
    • Fixed a subtle substitution bug in configure. (Chengguo Sun)
    • Updated top-level Makefile to reflect a dependency on the "flat" blis.h file for the BLIS and BLAS testsuite objects. (Devin Matthews)
    • Mark xerbla_() as a "weak" symbol on MacOSX. (Devin Matthews)
    • Fixed a long-standing bug in common.mk whereby the header path to cblas.h was omitted from the compiler flags when compiling CBLAS files within BLIS.
    • Added a custom-made recursive sed script to the build directory.
    • Minor cleanups and fixes to configure, common.mk, and others.

    Testing:

    • Fixed a race condition in the testsuite when the SALT option (simulate application-level threading) is enabled. (Devin Matthews)
    • Test 1m method execution during make check. (Devin Matthews)
    • Test make install in Travis CI. (Devin Matthews)
    • Test C++ in Travis CI to make sure blis.h is C++-compatible. (Devin Matthews)
    • Disabled SDE testing of pre-Zen microarchitectures via Travis CI.
    • Added Travis CI support for testing Arm SVE. (RuQing Xu)
    • Updated SDE usage so that it is downloaded from a separate repository (ci-utils) in our GitHub organization. (Field Van Zee, Devin Matthews)
    • Updated octave scripts in test/3 to be robust against missing datasets and fixed a few minor issues.
    • Added test_axpbyv.c and test_gemm_batch.c test driver files to the test directory. (Meghana Vankadari, AMD)
    • Support all four datatypes in her, her2, herk, and her2k drivers in the test directory. (Madan mohan Manokar, AMD)

    Documentation:

    • Added documentation for: setijv, getijv, eqsc, eqv, eqm.
    • Added docs/Addons.md.
    • Added dedicated "Performance" and "Example Code" sections to README.md.
    • Updated README.md.
    • Updated docs/Sandboxes.md.
    • Updated docs/Multithreading.md. (Devin Matthews)
    • Updated docs/KernelHowTo.md.
    • Updated docs/Performance.md to report Fujitsu A64fx (512-bit SVE) results. (RuQing Xu)
    • Updated docs/Performance.md to report Graviton2 Neoverse N1 results. (Nicholai Tukanov)
    • Updated docs/FAQ.md with new questions.
    • Fixed typos in docs/FAQ.md. (Gaëtan Cassiers)
    • Various other minor fixes.
Owner
Projects by the Science of High-Performance Computing (formerly FLAME) group