A bug with AMD SI cards was reported in this thread
https://community.amd.com/message/2869393#comment-2869393
A solution was also proposed on that thread. If I were to implement that solution within vexcl how do I go about doing that? Essentially, with every kernel, a null kernel would need to passed. It can safely be ignored with an #ifdef declaration for other cards. I can show the results of an example that illustrate this case:
[email protected]:~/build/vexcl/examples$ ./complex_simple
- Hainan (AMD Accelerated Parallel Processing)
X * Y = (0,16) * (16,0) = (-5.62355e+303,7.1998e-304)i
X * Y = (1,15) * (15,1) = (1.76736e+186,-2.529e-186)i
X * Y = (2,14) * (14,2) = (-5.43986e-256,-7.4239e-199)i
X * Y = (3,13) * (13,3) = (-1.15801e-125,-1.42689e-184)i
X * Y = (4,12) * (12,4) = (-1.5799e-103,-2.30511e+235)i
X * Y = (5,11) * (11,5) = (2.83923e+103,-2.783e+307)i
X * Y = (6,10) * (10,6) = (-4.95723e+305,1.35145e+188)i
X * Y = (7,9) * (9,7) = (-2.73677e-48,-0.00275755)i
X * Y = (8,8) * (8,8) = (-2.26843e-106,-1.45955e-201)i
X * Y = (9,7) * (7,9) = (1.79762e+106,2.97045e+201)i
X * Y = (10,6) * (6,10) = (-nan,2.14326e-308)i
X * Y = (11,5) * (5,11) = (-8.49166e-200,5.26253e+199)i
X * Y = (12,4) * (4,12) = (-3.88897e+306,2.78449e+188)i
X * Y = (13,3) * (3,13) = (1.69952e+184,1.77589e-234)i
X * Y = (14,2) * (2,14) = (-1.94762e-104,-4.35661e+232)i
X * Y = (15,1) * (1,15) = (-1.82884e-128,1.00333e-232)i
X / Y = (0,16) / (16,0) = (0,256)
X / Y = (1,15) / (15,1) = (0,226)
X / Y = (2,14) / (14,2) = (0,200)
X / Y = (3,13) / (13,3) = (0,178)
X / Y = (4,12) / (12,4) = (0,160)
X / Y = (5,11) / (11,5) = (0,146)
X / Y = (6,10) / (10,6) = (0,136)
X / Y = (7,9) / (9,7) = (0,130)
X / Y = (8,8) / (8,8) = (0,128)
X / Y = (9,7) / (7,9) = (0,130)
X / Y = (10,6) / (6,10) = (0,136)
X / Y = (11,5) / (5,11) = (0,146)
X / Y = (12,4) / (4,12) = (0,160)
X / Y = (13,3) / (3,13) = (0,178)
X / Y = (14,2) / (2,14) = (0,200)
X / Y = (15,1) / (1,15) = (0,226)
If I run it the second time
[email protected]:~/build/vexcl/examples$ ./complex_simple
- Hainan (AMD Accelerated Parallel Processing)
X * Y = (0,16) * (16,0) = (0,1)i
X * Y = (1,15) * (15,1) = (0.132743,0.99115)i
X * Y = (2,14) * (14,2) = (0.28,0.96)i
X * Y = (3,13) * (13,3) = (0.438202,0.898876)i
X * Y = (4,12) * (12,4) = (0.6,0.8)i
X * Y = (5,11) * (11,5) = (0.753425,0.657534)i
X * Y = (6,10) * (10,6) = (0.882353,0.470588)i
X * Y = (7,9) * (9,7) = (0.969231,0.246154)i
X * Y = (8,8) * (8,8) = (1,0)i
X * Y = (9,7) * (7,9) = (0.969231,-0.246154)i
X * Y = (10,6) * (6,10) = (0.882353,-0.470588)i
X * Y = (11,5) * (5,11) = (0.753425,-0.657534)i
X * Y = (12,4) * (4,12) = (0.6,-0.8)i
X * Y = (13,3) * (3,13) = (0.438202,-0.898876)i
X * Y = (14,2) * (2,14) = (0.28,-0.96)i
X * Y = (15,1) * (1,15) = (0.132743,-0.99115)i
X / Y = (0,16) / (16,0) = (0,256)
X / Y = (1,15) / (15,1) = (0,226)
X / Y = (2,14) / (14,2) = (0,200)
X / Y = (3,13) / (13,3) = (0,178)
X / Y = (4,12) / (12,4) = (0,160)
X / Y = (5,11) / (11,5) = (0,146)
X / Y = (6,10) / (10,6) = (0,136)
X / Y = (7,9) / (9,7) = (0,130)
X / Y = (8,8) / (8,8) = (0,128)
X / Y = (9,7) / (7,9) = (0,130)
X / Y = (10,6) / (6,10) = (0,136)
X / Y = (11,5) / (5,11) = (0,146)
X / Y = (12,4) / (4,12) = (0,160)
X / Y = (13,3) / (3,13) = (0,178)
X / Y = (14,2) / (2,14) = (0,200)
X / Y = (15,1) / (1,15) = (0,226)
The results look fine now.
The result of clinfo is
[email protected]:~/build/vexcl/examples$ clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.1 AMD-APP (2639.3)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 1
Device Type: CL_DEVICE_TYPE_GPU
Vendor ID: 1002h
Board name: AMD Radeon (TM) R5 M330
Device Topology: PCI[ B#1, D#0, F#0 ]
Max compute units: 5
Max work items dimensions: 3
Max work items[0]: 1024
Max work items[1]: 1024
Max work items[2]: 1024
Max work group size: 256
Preferred vector width char: 4
Preferred vector width short: 2
Preferred vector width int: 1
Preferred vector width long: 1
Preferred vector width float: 1
Preferred vector width double: 1
Native vector width char: 4
Native vector width short: 2
Native vector width int: 1
Native vector width long: 1
Native vector width float: 1
Native vector width double: 1
Max clock frequency: 750Mhz
Address bits: 64
Max memory allocation: 1596905472
Image support: Yes
Max number of images read arguments: 128
Max number of images write arguments: 8
Max image 2D width: 16384
Max image 2D height: 16384
Max image 3D width: 2048
Max image 3D height: 2048
Max image 3D depth: 2048
Max samplers within kernel: 16
Max size of kernel argument: 1024
Alignment (bits) of base address: 2048
Minimum alignment (bytes) for any datatype: 128
Single precision floating point capability
Denorms: No
Quiet NaNs: Yes
Round to nearest even: Yes
Round to zero: Yes
Round to +ve and infinity: Yes
IEEE754-2008 fused multiply-add: Yes
Cache type: Read/Write
Cache line size: 64
Cache size: 16384
Global memory size: 2146349056
Constant buffer size: 65536
Max number of constant args: 8
Local memory type: Scratchpad
Local memory size: 32768
Max pipe arguments: 0
Max pipe active reservations: 0
Max pipe packet size: 0
Max global variable size: 0
Max global variable preferred total size: 0
Max read/write image args: 0
Max on device events: 0
Queue on device max size: 0
Max on device queues: 0
Queue on device preferred size: 0
SVM capabilities:
Coarse grain buffer: No
Fine grain buffer: No
Fine grain system: No
Atomics: No
Preferred platform atomic alignment: 0
Preferred global atomic alignment: 0
Preferred local atomic alignment: 0
Kernel Preferred work group size multiple: 64
Error correction support: 0
Unified memory for Host and Device: 0
Profiling timer resolution: 1
Device endianess: Little
Available: Yes
Compiler available: Yes
Execution capabilities:
Execute OpenCL kernels: Yes
Execute native function: No
Queue on Host properties:
Out-of-Order: No
Profiling : Yes
Queue on Device properties:
Out-of-Order: No
Profiling : No
Platform ID: 0x7f5a050f49f0
Name: Hainan
Vendor: Advanced Micro Devices, Inc.
Device OpenCL C version: OpenCL C 1.2
Driver version: 2639.3
Profile: FULL_PROFILE
Version: OpenCL 1.2 AMD-APP (2639.3)
Extensions: cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf cl_amd_media_ops cl_amd_media_ops2 cl_amd_popcnt cl_khr_image2d_from_buffer cl_khr_spir cl_khr_gl_event