VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP

Overview

VexCL


VexCL is a vector expression template library for OpenCL/CUDA. It has been created for ease of GPGPU development with C++. VexCL strives to reduce the amount of boilerplate code needed to develop GPGPU applications. The library provides convenient and intuitive notation for vector arithmetic, reductions, sparse matrix-vector products, etc. Multi-device and even multi-platform computations are supported. The source code of the library is distributed under the very permissive MIT license.

See VexCL documentation at http://vexcl.readthedocs.io/

Issues
  • Issue backend CUDA with boost 1.69.0 and cuda 10

    Hello, I want to use the CUDA backend of VexCL, but it doesn't work; I get the following error:

    Error explicit specialization of class "vex::traits::is_vector_expr_terminal<size_t, void>" must precede its first use C:\Utils\vexcl-master\vexcl\backend\cuda\texture_object.hpp 41

    I have boost 1.69, Visual Studio 2017 15.9.7, and CUDA 10. Maybe my boost version is too recent?

    opened by MaximeRoux 45
  • Support sparse vectors?

    It's a feature request.

    Is it possible to support sparse vectors? Right now, when adding a (big) dense vector a and a (very) sparse vector b, the effective flop rate is about 60K, as much of the computation is wasted on + 0 (i.e., the missing elements of the sparse vector).

    question 
    opened by byzhang 40
  • Solving an ODE w/ Interpolation

    I have been using ODEINT for doing Monte Carlo simulations of a parachute system descending to the ground, modeled as a point mass. I would like to utilize VexCL to leverage GPUs to speed up computation. I have mostly been able to modify your symbolic example for my particular equations of motion, but I have run into an issue. I need to interpolate a 3D wind field as a function of the position of the system in order to determine the drag forces acting on the system. I would ideally like to leverage the texture interpolation capabilities of the GPU to do this. Do you have any suggestions?

    Thanks

    If it helps, here is a description of the system model:

    [image: equations of motion]

    opened by agerlach 25
  • Using existing context in VexCL

    Dear all,

    a quick question: is it possible to tell VexCL to reuse an OpenCL context/device/queue already initialized by another library? It would be nice to have something similar to ViennaCL's viennacl::ocl::setup_context() and viennacl::ocl::switch_context(), which take already initialized cl_context and cl_device_id objects as input.

    Thanks!

    opened by apeternier 25
  • writing to vex::multivector<> on CPU raises exception EXC_BAD_ACCESS on OS X 10.9

    While trying to debug some VexCL code on the CPU, I notice that writing to a vex::multivector<> generates an exception, EXC_BAD_ACCESS (code=2).

    For example, the snippet

    #define VEXCL_SHOW_KERNELS
    #include "vexcl/vexcl.hpp"
    
    int main() {
       vex::Context ctx(vex::Filter::Type(CL_DEVICE_TYPE_CPU) && vex::Filter::DoublePrecision);
    
       vex::multivector<double, 16> u2(ctx.queue(), 5);   
       u2(0) = 0.0;
    }
    

    generates a suitable-looking kernel, but crashes with an exception when the write to u2 is performed. But running it on a GPU device works as expected. This is on OS X 10.9.

    Because it works on the GPU this has the appearance of an implementation bug with the OpenCL shipped with 10.9. Unfortunately I don't have access to a different one which would enable me to check. Is there any way to debug such issues, or is the only realistic approach to report it to Apple and hope for the best?

    opened by ds283 25
  • User-defined stencil operators are broken in MacOS X's OpenCL.

    It appears to mostly work with Apple's implementation; however, utests does fail in one case:

    User-defined stencil operator: ................................. failed.
    

    I'll investigate further myself, but I just included this here for future reference.

    opened by raedwulf 24
  • Using CMake targets, fixes issues with pthread not being added

    This moves to using the CMake Boost:: targets if they are available, and makes them if they are not. The problem it fixes (-pthread not being added when it is required on some systems) should be fixed even when the Boost:: targets are not available (old CMake, or semi-old CMake combined with newer Boost).

    (The threads test case does not build on a CentOS 7 system without this patch.)

    opened by henryiii 23
  • AMD SI Cards weird results

    A bug with AMD SI cards was reported in this thread: https://community.amd.com/message/2869393#comment-2869393

    A solution was also proposed in that thread. If I were to implement that solution within VexCL, how would I go about doing that? Essentially, a null kernel would need to be passed with every kernel. It can safely be skipped behind an #ifdef for other cards. I can show the results of an example that illustrates this case:

    [email protected]:~/build/vexcl/examples$ ./complex_simple

    1. Hainan (AMD Accelerated Parallel Processing)

    X * Y = (0,16) * (16,0) = (-5.62355e+303,7.1998e-304)i
    X * Y = (1,15) * (15,1) = (1.76736e+186,-2.529e-186)i
    X * Y = (2,14) * (14,2) = (-5.43986e-256,-7.4239e-199)i
    X * Y = (3,13) * (13,3) = (-1.15801e-125,-1.42689e-184)i
    X * Y = (4,12) * (12,4) = (-1.5799e-103,-2.30511e+235)i
    X * Y = (5,11) * (11,5) = (2.83923e+103,-2.783e+307)i
    X * Y = (6,10) * (10,6) = (-4.95723e+305,1.35145e+188)i
    X * Y = (7,9) * (9,7) = (-2.73677e-48,-0.00275755)i
    X * Y = (8,8) * (8,8) = (-2.26843e-106,-1.45955e-201)i
    X * Y = (9,7) * (7,9) = (1.79762e+106,2.97045e+201)i
    X * Y = (10,6) * (6,10) = (-nan,2.14326e-308)i
    X * Y = (11,5) * (5,11) = (-8.49166e-200,5.26253e+199)i
    X * Y = (12,4) * (4,12) = (-3.88897e+306,2.78449e+188)i
    X * Y = (13,3) * (3,13) = (1.69952e+184,1.77589e-234)i
    X * Y = (14,2) * (2,14) = (-1.94762e-104,-4.35661e+232)i
    X * Y = (15,1) * (1,15) = (-1.82884e-128,1.00333e-232)i
    X / Y = (0,16) / (16,0) = (0,256)
    X / Y = (1,15) / (15,1) = (0,226)
    X / Y = (2,14) / (14,2) = (0,200)
    X / Y = (3,13) / (13,3) = (0,178)
    X / Y = (4,12) / (12,4) = (0,160)
    X / Y = (5,11) / (11,5) = (0,146)
    X / Y = (6,10) / (10,6) = (0,136)
    X / Y = (7,9) / (9,7) = (0,130)
    X / Y = (8,8) / (8,8) = (0,128)
    X / Y = (9,7) / (7,9) = (0,130)
    X / Y = (10,6) / (6,10) = (0,136)
    X / Y = (11,5) / (5,11) = (0,146)
    X / Y = (12,4) / (4,12) = (0,160)
    X / Y = (13,3) / (3,13) = (0,178)
    X / Y = (14,2) / (2,14) = (0,200)
    X / Y = (15,1) / (1,15) = (0,226)

    If I run it the second time:

    [email protected]:~/build/vexcl/examples$ ./complex_simple

    1. Hainan (AMD Accelerated Parallel Processing)

    X * Y = (0,16) * (16,0) = (0,1)i
    X * Y = (1,15) * (15,1) = (0.132743,0.99115)i
    X * Y = (2,14) * (14,2) = (0.28,0.96)i
    X * Y = (3,13) * (13,3) = (0.438202,0.898876)i
    X * Y = (4,12) * (12,4) = (0.6,0.8)i
    X * Y = (5,11) * (11,5) = (0.753425,0.657534)i
    X * Y = (6,10) * (10,6) = (0.882353,0.470588)i
    X * Y = (7,9) * (9,7) = (0.969231,0.246154)i
    X * Y = (8,8) * (8,8) = (1,0)i
    X * Y = (9,7) * (7,9) = (0.969231,-0.246154)i
    X * Y = (10,6) * (6,10) = (0.882353,-0.470588)i
    X * Y = (11,5) * (5,11) = (0.753425,-0.657534)i
    X * Y = (12,4) * (4,12) = (0.6,-0.8)i
    X * Y = (13,3) * (3,13) = (0.438202,-0.898876)i
    X * Y = (14,2) * (2,14) = (0.28,-0.96)i
    X * Y = (15,1) * (1,15) = (0.132743,-0.99115)i
    X / Y = (0,16) / (16,0) = (0,256)
    X / Y = (1,15) / (15,1) = (0,226)
    X / Y = (2,14) / (14,2) = (0,200)
    X / Y = (3,13) / (13,3) = (0,178)
    X / Y = (4,12) / (12,4) = (0,160)
    X / Y = (5,11) / (11,5) = (0,146)
    X / Y = (6,10) / (10,6) = (0,136)
    X / Y = (7,9) / (9,7) = (0,130)
    X / Y = (8,8) / (8,8) = (0,128)
    X / Y = (9,7) / (7,9) = (0,130)
    X / Y = (10,6) / (6,10) = (0,136)
    X / Y = (11,5) / (5,11) = (0,146)
    X / Y = (12,4) / (4,12) = (0,160)
    X / Y = (13,3) / (3,13) = (0,178)
    X / Y = (14,2) / (2,14) = (0,200)
    X / Y = (15,1) / (1,15) = (0,226)

    The results look fine now. The output of clinfo:

    [email protected]:~/build/vexcl/examples$ clinfo
    Number of platforms: 1
    Platform Profile: FULL_PROFILE
    Platform Version: OpenCL 2.1 AMD-APP (2639.3)
    Platform Name: AMD Accelerated Parallel Processing
    Platform Vendor: Advanced Micro Devices, Inc.
    Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices

    Platform Name: AMD Accelerated Parallel Processing
    Number of devices: 1
    Device Type: CL_DEVICE_TYPE_GPU
    Vendor ID: 1002h
    Board name: AMD Radeon (TM) R5 M330
    Device Topology: PCI[ B#1, D#0, F#0 ]
    Max compute units: 5
    Max work items dimensions: 3
    Max work items: 1024 / 1024 / 1024
    Max work group size: 256
    Preferred vector width (char/short/int/long/float/double): 4 / 2 / 1 / 1 / 1 / 1
    Native vector width (char/short/int/long/float/double): 4 / 2 / 1 / 1 / 1 / 1
    Max clock frequency: 750Mhz
    Address bits: 64
    Max memory allocation: 1596905472
    Image support: Yes
    Max number of images read/write arguments: 128 / 8
    Max image 2D width/height: 16384 / 16384
    Max image 3D width/height/depth: 2048 / 2048 / 2048
    Max samplers within kernel: 16
    Max size of kernel argument: 1024
    Alignment (bits) of base address: 2048
    Minimum alignment (bytes) for any datatype: 128
    Single precision floating point capability: Denorms: No, Quiet NaNs: Yes, Round to nearest even: Yes, Round to zero: Yes, Round to +ve and infinity: Yes, IEEE754-2008 fused multiply-add: Yes
    Cache type: Read/Write; Cache line size: 64; Cache size: 16384
    Global memory size: 2146349056
    Constant buffer size: 65536; Max number of constant args: 8
    Local memory type: Scratchpad; Local memory size: 32768
    Max pipe arguments / active reservations / packet size: 0 / 0 / 0
    Max global variable size / preferred total size: 0 / 0
    Max read/write image args: 0
    Max on device events / queues: 0 / 0; Queue on device max / preferred size: 0 / 0
    SVM capabilities: Coarse grain buffer: No, Fine grain buffer: No, Fine grain system: No, Atomics: No
    Preferred platform / global / local atomic alignment: 0 / 0 / 0
    Kernel preferred work group size multiple: 64
    Error correction support: 0
    Unified memory for Host and Device: 0
    Profiling timer resolution: 1
    Device endianess: Little
    Available: Yes; Compiler available: Yes
    Execution capabilities: Execute OpenCL kernels: Yes; Execute native function: No
    Queue on Host properties: Out-of-Order: No, Profiling: Yes
    Queue on Device properties: Out-of-Order: No, Profiling: No
    Platform ID: 0x7f5a050f49f0
    Name: Hainan
    Vendor: Advanced Micro Devices, Inc.
    Device OpenCL C version: OpenCL C 1.2
    Driver version: 2639.3
    Profile: FULL_PROFILE
    Version: OpenCL 1.2 AMD-APP (2639.3)
    Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event

    opened by skn123 20
  • Random results of SPMV on non-squared matrix

    I found that sometimes SpMV on a non-square matrix may generate random results. Here is an example. I made the following change to examples/utests.cpp:

    --- i/examples/utests.cpp
    +++ w/examples/utests.cpp

    @@ -381,16 +381,22 @@ int main(int argc, char *argv[]) {
                        row.push_back(col.size());
                    }
    
    -               std::vector<double> x(m);
    +               std::vector<double> x(m * 40 * 1024);
                    std::vector<double> y(n);
    -               std::generate(x.begin(), x.end(), []() { return (double)rand() / RAND_MAX; });
    +               std::generate(x.begin(), x.end(), []() { return 0; });
    +               // std::generate(x.begin(), x.end(), []() { return (double)rand() / RAND_MAX; });
    
                    vex::SpMat <double> A(ctx.queue(), y.size(), x.size(), row.data(), col.data(), val.data());
    -               vex::vector<double> X(ctx.queue(), x);
    +               vex::vector<double> X(ctx.queue(), x.size());
    +    X = 0;
                    vex::vector<double> Y(ctx.queue(), y.size());
    
                    Y = A * X;
                    copy(Y, y);
    +    for (auto i: y) {
    +      std::cout << i << "\t";
    +    }
    +    std::cout << std::endl;
    
                    double res = 0;
                    for(size_t i = 0; i < y.size(); i++) {
    

    I expect all the elements of y to be 0. However, in my context:

    1. Intel(R) Core(TM) i7-3930K CPU @ 3.20GHz
    2. GeForce GTX 680

    they may not be (not always; if they come out all 0, change the 40 to something else, and it is easy to see random numbers on the first run):

    Sparse matrix-vector product for non-square matrix: ............
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
    [followed by several hundred random nonzero values, e.g. 0.0623769 0.952414 0.456983 ..., where zeros were expected] failed.

    bug 
    opened by byzhang 20
  • Problem compiling on Linux (tests/threads.cpp)

    Hi ddemidov, and thanks for your beautiful work! Trying to compile on Linux I obtain two errors:

    1. The first concerns constants.hpp: OPENCL/vexcl/vexcl/constants.hpp:178:1: error: ‘one_div_two_pi’ is not a member of ‘boost::math::constants’. I have solved this by commenting out line 178 of constants.hpp.

    2. The second is about tests/threads.cpp, and I haven't solved it yet:

    /home/formica/OPENCL/vexcl/tests/threads.cpp:24:74: required from here
    /usr/include/boost/thread/detail/thread.hpp:114:9: error: ‘boost::thread::thread(boost::thread&)’ is private
    In file included from /usr/include/c++/4.7/memory:66:0,
                     from /usr/include/boost/config/no_tr1/memory.hpp:21,
                     from /usr/include/boost/smart_ptr/shared_ptr.hpp:27,
                     from /usr/include/boost/shared_ptr.hpp:17,
                     from /usr/include/boost/thread/pthread/thread_data.hpp:10,
                     from /usr/include/boost/thread/thread.hpp:17,
                     from /usr/include/boost/thread.hpp:13,
                     from /home/formica/OPENCL/vexcl/tests/threads.cpp:2:
    /usr/include/c++/4.7/bits/stl_construct.h:77:7: error: within this context

    Take care: I had to use cl.hpp version 1.1 to start the compilation with my old NVIDIA GPU.

    Can you help me?

    opened by formica-multiuso 19
  • added the restricted keywords

    Dear Denis, I have added restricted and restricted_const keywords; however, I have not observed any increase in performance when using them.

    I also added an extra CMakeLists which only adds the VexCL files, so they are visible if you use the CMakeLists as a project file.

    I also edited some lines of the reduction kernel so that it doesn't copy from smem to sdata and back to smem again; this was a useless copy, and removing it saves a single instruction (hurray :)) while not breaking the code.

    Sincerely, Boris Smidt.

    opened by borissmidt 18
  • ambiguous overload for ‘operator<<' Error.

    I am facing this error after including VexCL. Even if I specify the backend, it shows the error.

    #include <vexcl/vexcl.hpp>

    from /home/kafi/Documents/Projects/CODD-pro-lib_integration/main.cpp:14:
    /usr/include/boost/date_time/date_generators.hpp: In member function ‘virtual std::string boost::date_time::nth_kday_of_month<date_type>::to_string() const’:
    /usr/include/boost/date_time/date_generators.hpp:237:8: error: ambiguous overload for ‘operator<<’ (operand types are ‘std::basic_ostream’ and ‘int’)
      236 |     ss << 'M'
          |     ~~~~~~~~~
          |     |
          |     std::basic_ostream
      237 |        << static_cast(month_) << '.'
          |        ^~ ~~~~~~~~~~~~~~~~~~~~~~~~
          |           |
          |           int
    In file included from /usr/include/c++/9/iterator:64,
                     from /usr/include/CL/cl.hpp:219,

    opened by kafi350 1
  • Copying Memory location of an OpenCl backend using VexCL and using that location with Boost.compute

    Is it possible to copy the memory location of an OpenCL buffer using VexCL and use that location with Boost.Compute? Boost.Compute uses its own vector type..

    opened by kafi350 6
  • can't build doc from tarball

    It looks like the docs can only be built from a git checkout:

    Running Sphinx v4.0.1                                                                                                  
                                                                                                                                                                                                                                                  
    Configuration error:                           
    There is a programmable error in your configuration file:
    
    Traceback (most recent call last):
      File "/usr/lib/pypy3.7/site-packages/sphinx/config.py", line 323, in eval_config_file
        exec(code, namespace)
      File "/var/tmp/portage/dev-cpp/vexcl-1.4.2/work/vexcl-1.4.2/docs/conf.py", line 78, in <module>
        version = git_version()
      File "./git_version.py", line 40, in git_version
        raise ValueError("Cannot find the version number!")
    ValueError: Cannot find the version number!
    
    opened by Alessandro-Barbieri 3
  • CUDA only support

    I was tinkering with VexCL on an NVIDIA Jetson Nano, which only supports CUDA. It appears that if OpenCL isn't available at all, the project files have issues (even if I install the OpenCL headers, the lack of a library still causes a failure). Any hints?

    -- Looking for CL/cl_platform.h
    -- Looking for CL/cl_platform.h - not found
    CMake Warning at CMakeLists.txt:180 (message):
      The JIT interface requires OpenCL headers to be available. You can
      download them from https://github.com/KhronosGroup/OpenCL-Headers
      Set OpenCL_INCLUDE_DIR to the location of the headers. For now,
      disabling the JIT target.

    CMake Error at CMakeLists.txt:244 (add_library):
      add_library cannot create ALIAS target "VexCL::Backend" because target
      "OpenCL" does not already exist.

    opened by ByronFaber 4
  • Support for Block SpMV

    Hi, first of all, thanks for the development and maintenance of this great library! I would like to know if there are any plans to support block sparse matrices and block SpMV with constant-size blocks. This would speed up the kernel by reducing the indexing overhead for this class of matrices.

    Thank you.

    opened by daaugusto 12
Releases(1.4.3)
  • 1.4.3(Nov 9, 2021)

  • 1.4.2(Apr 27, 2021)

    • Two years worth of minor fixes and improvements.
    • Added source_generator::num_groups() returning the number of workgroups on the compute device.
    • Make push_compile_options, push_program_header behave in a cumulative way.
    • Added profiler::reset().
    • Added vector::at().
    • Support mixed precision in vex::copy().
  • 1.4.1(May 4, 2017)

  • 1.4.0(Apr 19, 2017)

    • Modernize cmake build system. Provide VexCL::OpenCL, VexCL::Compute, VexCL::CUDA, VexCL::JIT imported targets, so that users may just
      add_executable(myprogram myprogram.cpp)
      target_link_libraries(myprogram VexCL::OpenCL)
      

      to build a program using the corresponding VexCL backend. Also stop polluting global cmake namespace with things like add_definitions(), include_directories(), etc. See http://vexcl.readthedocs.io/en/latest/cmake.html.

    • Make vex::backend::kernel::config() return a reference to the kernel, so that it is possible to configure and launch the kernel in a single line: K.config(nblocks, nthreads)(queue, prm1, prm2, prm3);.
    • Implement vector<T>::reinterpret<U>() method. It returns a new vector that reinterprets the same data (no copies are made) as the new type.
    • Implemented a new backend: JIT. The backend generates and compiles C++ kernels with OpenMP support at runtime. The code will not be more efficient than hand-written OpenMP code, but it allows one to easily debug the generated code with a host-side debugger. The backend may also be used to develop and test new code when other backends are not available.
    • Let VEX_CONSTANTs be cast to their values in host code, so that a constant defined with VEX_CONSTANT(name, expr) can be used in host code as name. Constants are still usable in vector expressions as name().
    • Allow passing generated kernel args for each GPU (#202). Kernel args packed into std::vector will be unpacked and passed to the generated kernels on respective devices.
    • Reimplemented vex::SpMat as vex::sparse::ell, vex::sparse::crs, vex::sparse::matrix (which automatically chooses one of the two formats based on the current compute device), and vex::sparse::distributed<format> (which may span several compute devices). The new matrix-vector products are normal vector expressions, while the old vex::SpMat could only be used in additive expressions. The old implementation is still available. vex::sparse::ell is now converted from the host-side CRS format on the compute device, which makes the conversion faster.
    • Bug fixes and minor improvements.
  • 1.3.3(Apr 6, 2015)

    • Added the vex::tensordot() operation. Given two tensors (arrays of dimension greater than or equal to one), A and B, and a list of axes pairs (where each pair represents corresponding axes from the two tensors), it sums the products of A's and B's elements over the given axes. Inspired by Python's numpy.tensordot operation.
    • Expose constant memory space in OpenCL backend.
    • Provide shortcut filters vex::Filter::{CPU,GPU,Accelerator} for OpenCL backend.
    • Added Boost.Compute backend. Core functionality of the Boost.Compute library is used as a replacement for the Khronos C++ API, which seems to be getting more and more outdated. The Boost.Compute backend is still based on OpenCL, so there are two OpenCL backends now. Define VEXCL_BACKEND_COMPUTE to use this backend and make sure the Boost.Compute headers are in the include path.
  • 1.3.2(Sep 4, 2014)

  • 1.3.1(May 14, 2014)

  • 1.3.0(Apr 14, 2014)

    • API breaking change: vex::purge_kernel_caches() family of functions is renamed to vex::purge_caches() as the online cache now may hold objects of arbitrary type. The overloads that used to take vex::backend::kernel_cache_key now take const vex::backend::command_queue&.
    • The online cache is now purged whenever vex::Context is destroyed. This allows for clean release of OpenCL/CUDA contexts.
    • Code for random number generators has been unified between OpenCL and CUDA backends.
    • Fast Fourier Transform is now supported both for OpenCL and CUDA backends.
    • vex::backend::kernel constructor now takes optional parameter with command line options.
    • Performance of CLOGS algorithms has been improved.
    • VEX_BUILTIN_FUNCTION macro has been made public.
    • Minor bug fixes and improvements.
  • 1.2.0(Apr 2, 2014)

    • API breaking change: the definition of VEX_FUNCTION family of macros has changed. The previous versions are available as VEX_FUNCTION_V1.
    • Wrapping code for clogs library is added by @bmerry (the author of clogs).
    • vector/multivector iterators are now standard-conforming iterators.
    • Other minor improvements and bug fixes.
  • 1.1.2(Dec 24, 2013)

    • reduce_by_key() may take several tied keys (see e09d249d565b71ed7aea18bf3368ab2bf4ff7713).
    • It is possible to reduce OpenCL vector types (cl_float2, cl_double4, etc).
    • VEXCL_SHOW_KERNELS may be an environment variable as well as a preprocessor macro. This makes it possible to control kernel source output without recompiling the program.
    • Added compute capability filter for the CUDA backend (vex::Filter::CC(major, minor)).
    • Fixed compilation errors and warnings generated by Visual Studio.
  • 1.1.1(Dec 5, 2013)

    Sorting algorithms may take tuples of keys/values (in fact, any Boost.Fusion sequence will do). One will have to explicitly specify the comparison functor in this case. Both host and device variants of the comparison functor should take 2n arguments, where n is the number of keys. The first n arguments correspond to the left set of keys, and the second n arguments correspond to the right set of keys. Here is an example that sorts values by a tuple of two keys:

    vex::vector<int>    keys1(ctx, n);
    vex::vector<float>  keys2(ctx, n);
    vex::vector<double> vals (ctx, n);
    
    struct {
        VEX_FUNCTION(device, bool(int, float, int, float),
                "return (prm1 == prm3) ? (prm2 < prm4) : (prm1 < prm3);"
                );
        bool operator()(int a1, float a2, int b1, float b2) const {
            return std::make_tuple(a1, a2) < std::make_tuple(b1, b2);
        }
    } comp;
    
    vex::sort_by_key(std::tie(keys1, keys2), vals, comp);
    
    Source code(tar.gz)
    Source code(zip)
  • 1.1.0(Nov 29, 2013)

    • vex::SpMat<> class uses the CUSPARSE library on the CUDA backend when the VEXCL_USE_CUSPARSE macro is defined. This results in a more efficient sparse matrix-vector product, but disables inlining of the SpMV operation.
    • Provided an example of CUDA backend interoperation with Thrust.
    • When VEXCL_CHECK_SIZES macro is defined to 1 or 2, then runtime checks for vector expression correctness are enabled (see #81, #82).
    • Added sort() and sort_by_key() functions.
    • Added inclusive_scan() and exclusive_scan() functions.
    • Added reduce_by_key() function. Only works with single-device contexts.
    • Added convert_<type>() and as_<type>() builtin functions for OpenCL backend.
    Source code(tar.gz)
    Source code(zip)
  • 1.0.0(Nov 15, 2013)

    CUDA backend is added!

    As of v1.0.0, VexCL provides two backends: OpenCL and CUDA. In order to choose one of them, the user has to define the VEXCL_BACKEND_OPENCL or VEXCL_BACKEND_CUDA macro. If neither is defined, the OpenCL backend is chosen by default. One also has to link against either libOpenCL.so (OpenCL.dll on Windows) or libcuda.so (cuda.dll).
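Backend selection can be sketched as a minimal translation unit (assuming VexCL is on the include path):

```cpp
// Define the backend macro before including any VexCL headers;
// without it, the OpenCL backend is used by default.
#define VEXCL_BACKEND_CUDA
#include <vexcl/vexcl.hpp>
```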

    For the CUDA backend to work, the CUDA Toolkit has to be installed, and the NVIDIA CUDA compiler driver nvcc has to be in the executable PATH and usable at runtime.

    Benchmarks show that the CUDA backend is a couple of percent more efficient than the OpenCL backend, except for matrix-vector multiplication on multiple devices (there are some issues with asynchronous memory transfers in the CUDA driver API). Note that the first run of a program will take longer than usual, because the nvcc compiler is invoked several times to compile each of the compute kernels used in the program. Subsequent runs will use the offline kernel cache and complete faster.

    Also:

    • Added vex::Filter::General: modifiable container for device filters.
    • vex::Filter::Env supports OCL_POSITION environment variable.
    • Vector views (reduction, permutation) are all working with vector expressions.
    • Added vex::reshape() function for reshaping of multidimensional expressions.
    • Added vex::cast() function for changing deduced type of an expression.
    • Added vex::Filter::Extension and vex::Filter::GLSharing filters for the OpenCL backend (thanks, @johneih!)
    • VEXCL_SPLIT_MULTIEXPRESSIONS macro allows componentwise splitting of large multiexpressions.
    • Various bug fixes.
    Source code(tar.gz)
    Source code(zip)
  • 0.8.5(Oct 9, 2013)

    • Sparse matrix-vector product for OpenCL vector types:
        vex::SpMat <cl_double2> A;
        vex::vector<cl_double2> x, y;
        y = A * x;
    
    • Added raw_pointer() function. See 'Raw pointers' section in README.
    • Fixed compilation for 32bit Visual Studio (thanks to @d-meiser for reporting!).
    • Other bug fixes.
    Source code(tar.gz)
    Source code(zip)
  • 0.8.4-r1(Sep 30, 2013)

  • 0.8.4(Sep 20, 2013)

    • Allow user-defined functions in symbolic expressions
    • Introduced address-of and dereference operators in vector expressions. This makes the following possible:
    /*
     * Assign 42 to either y or z, depending on value of x. The trick with
     * address_of/dereference is unfortunately required because in C99 (which
     * OpenCL is based on) result of ternary operator is not an lvalue.
     */
    vex::tie( *if_else( x < 0.5, &y, &z ) ) = 42;
    
    • vex::reduce() accepts slices of vector expressions. vex::reduce() calls may be nested.
    • vex::element_index() optionally accepts a length (number of elements). This makes it possible to reduce stateless vector expressions, which could be useful, e.g., for Monte Carlo experiments.
    • Added missing builtin functions.
    • Introduced constants in vector expressions. Instances of std::integral_constant<T,v>, or constants from the vex::constants namespace (which are currently wrappers for boost::math::constants), will not be passed as kernel parameters, but will be written as literals into the kernel source. Users may introduce their own constants with the help of the VEX_CONSTANT macro.
    Source code(tar.gz)
    Source code(zip)
  • 0.8.3(Sep 14, 2013)

    • An FFT transform may be used as if it were a first-class vector expression, and not just an additive transform (#54). This does not mean that expressions involving FFTs will result in a single kernel launch.
    • Allowed purging of online kernel caches. This makes a complete cleanup of OpenCL contexts possible, which should be useful for libraries.
    • Offline kernel caching. Saves time on first-time compilation. See comments in 1aedcd27a79277fbaddae68eb42e407e1e6b9394.
    Source code(tar.gz)
    Source code(zip)
  • 0.8.2(Sep 11, 2013)

  • 0.8.1(Sep 9, 2013)

  • 0.8.0(Sep 9, 2013)

    API changes:

    • There are no more non-owning multivectors. The multivector class now has only two template parameters: type and number of components.
    • vex::tie() now returns vex::expression_tuple instead of a non-owning multivector. This makes it possible to tie vectors of different types, or even writable expressions (e.g. slices), together.
    • The order of vex::make_temp<> template parameters has changed (the required Tag comes first, followed by the optional Type). When unspecified, the type is deduced automatically from the given expression.
    • MPI support is dropped (moved to 'mpi' branch).
    Source code(tar.gz)
    Source code(zip)
  • 0.7.6(Sep 5, 2013)

  • 0.7.4(Sep 18, 2013)

    • Added missing max, min, abs OpenCL functions
    • Added vex::extents helper object
    • Added vex::multi_array container with minimal functionality
    • Allowed disabling of constructors that use the static VexCL context
    • Added vex::get_reductor<T,R>() function
    • Code cleanup
    Source code(tar.gz)
    Source code(zip)
  • 0.7.5(Aug 7, 2013)
