[Altera SoC Experience Tour] + Officially Entering OpenCL Mode

dongliudongliu posted on 2015-7-22 15:07:46

Last edited by dongliudongliu on 2015-7-22 15:24

Reposted from: http://bbs.eeworld.com.cn/thread-455862-1-1.html

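It has been a bumpy road lately. The Lark board I started with looks very high-end, but there is simply too little documentation for it, and building every module from scratch was not realistic for my application.
After talking with 影子 of EEWorld, and with her help, I swapped boards with forum member @chenzhufly, who had the ArrowSoC. That board is better documented; at the very least there are plenty of tutorials and materials on RocketBoard.
Everything looked perfect, but after finishing all the labs I found that Altera's promised "OpenCL development support" was only a slogan: I searched the whole official site and could not find a BSP for this board. I asked @Alex, an Arrow employee, and the answer was that there is no BSP for the time being.
So I had no choice but to switch again, this time to a board that does support OpenCL development: Terasic's DE1-SoC, the best value for money of the boards on offer. The person who swapped with me was the expert @coyoo (author of 《深入理解Altera FPGA应用设计》). I have to say, this forum really is full of hidden talent.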

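I was lucky to take part in this contest, lucky to try three different boards (there are only four in total, so that is quite a deal), and lucky to meet a group of technical heavyweights. All in all, I came out well ahead.

Some of you will surely ask: what exactly are you trying to build that absolutely requires OpenCL?

More than one person has asked this. When I first saw the contest I did not plan to sign up: I am no longer a student and do not have much spare time for competitions. But I had just seen the deep neural network accelerator (DNN by FPGA) that Altera developed together with Baidu, presented at a global computing conference, and I happened to have an idea of my own (work related) for building a convolutional neural network on an FPGA. With all these coincidences lining up, I signed up without hesitation.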

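What are neural networks good for? They imitate the way the human brain is organized, achieving cognition by passing messages among a large number of neurons. The simplest example is object recognition: when a person sees a table, they know it is a table and not a stool, because it matches the features of a "table". Through extensive training, the brain has recorded those "table" features in the weights between neurons. A computer looking at a table through a camera sees only a pile of pixel values (RGB). Shallow processing such as median filtering, correlation, or Sobel filtering cannot recognize the "table" feature; it only presents one dimension of the information to the user and leaves the judgment to the user. To organize the information effectively, we need to build a large number of neurons with identical function, each performing the most basic operation (accumulate the inputs and, when a condition is met, send an output to the next neuron). Layer upon layer, this eventually yields deep cognitive ability, and the neurons at the very end can directly answer "this is a table", "this is a stool", or "this is a chair".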
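To make the "accumulate the inputs, output when a condition is met" idea concrete, here is a minimal sketch of one such neuron in C++. The function name `neuron_fires`, the weighted-sum form, and the simple threshold test are illustrative assumptions, not code from the contest project.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single neuron: accumulates the weighted inputs and "fires"
// (passes a signal to the next neuron) only when the sum exceeds a threshold.
bool neuron_fires(const std::vector<float>& inputs,
                  const std::vector<float>& weights,
                  float threshold) {
    assert(inputs.size() == weights.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < inputs.size(); ++i) {
        sum += inputs[i] * weights[i];  // accumulate the inputs
    }
    return sum > threshold;             // the "condition is met" test
}
```

With inputs {1, 1}, weights {0.5, 0.5} and a threshold of 0.9 the sum is 1.0 and the neuron fires; with inputs {1, 0} the sum is 0.5 and it stays silent.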
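A convolutional neural network makes some approximations on top of the neural network described above: neurons in the same layer share their weights, which reduces the number of connections and makes the network easier to implement on a computer.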
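Weight sharing can be illustrated with a tiny 1-D convolution: every output element reuses the same small set of weights instead of owning its own. This is only a sketch under simplifying assumptions (1-D data, no bias, no activation); `conv1d_shared` and `dense_weight_count` are made-up names for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 1-D "valid" convolution in cross-correlation form: every output element
// is computed with the same shared kernel weights.
std::vector<float> conv1d_shared(const std::vector<float>& input,
                                 const std::vector<float>& kernel) {
    assert(!kernel.empty() && input.size() >= kernel.size());
    std::vector<float> out(input.size() - kernel.size() + 1, 0.0f);
    for (std::size_t o = 0; o < out.size(); ++o) {
        for (std::size_t k = 0; k < kernel.size(); ++k) {
            out[o] += input[o + k] * kernel[k];
        }
    }
    return out;
}

// Number of weights a fully connected layer of the same shape would need,
// for comparison with kernel.size() in the shared case.
std::size_t dense_weight_count(std::size_t n_in, std::size_t n_out) {
    return n_in * n_out;
}
```

For a 4-element input and a 2-element kernel there are 3 outputs: the shared version stores 2 weights, while a fully connected layer of the same shape would store 4 x 3 = 12.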

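Well, after all that, it boils down to one sentence: my algorithm is currently implemented in C/C++ and CUDA, and if I want to move it to an FPGA, OpenCL is the fastest route, which is also the most important part of this trial. (Until now, FPGA development meant VHDL/Verilog, and design plus simulation, verification and debugging take too long to finish in the short term. Besides, for now I only care about the algorithm, not the low-level implementation. If the most basic functionality works, this stage counts as done; resource, timing and performance optimization can come later.)

After getting the board, I read the official documentation carefully and set up the OpenCL environment.

Since time is short today, I will not go into OpenCL syntax and structure; let's go straight to the examples.

Write the image to the TF card, following the procedure in my earlier post. When that is done, set the SW10 DIP switch to "01010" (this is important: if the FPGA is not configured, the scripts later on will lock up), then power on.
Here is a picture: [image not preserved in this archive]

On the PC, open Putty, set the baud rate to 115200, and log in as user root with no password.

You can see that the system is Poky 8.0 (Yocto Project 1.3 Reference Distro) 1.3 socfpga ttyS0, the same as the default system on the Lark board earlier.
Run ls: the current directory contains several examples.
First, a warm-up: run the script that initializes the OpenCL environment:
source ./init_opencl.sh
It finishes almost immediately. Let's open the script and see what is inside.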
root@socfpga:~/vector_Add# cat ~/init_opencl.sh
export ALTERAOCLSDKROOT=/home/root/opencl_arm32_rte
export AOCL_BOARD_PACKAGE_ROOT=$ALTERAOCLSDKROOT/board/c5soc
export PATH=$ALTERAOCLSDKROOT/bin:$PATH
export LD_LIBRARY_PATH=$ALTERAOCLSDKROOT/host/arm32/lib:$LD_LIBRARY_PATH
insmod $AOCL_BOARD_PACKAGE_ROOT/driver/aclsoc_drv.ko

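It first sets several environment variables:
ALTERAOCLSDKROOT
AOCL_BOARD_PACKAGE_ROOT
PATH
LD_LIBRARY_PATH
Then it runs insmod to load the driver.
From this we can tell that the OpenCL service is provided by the driver module $AOCL_BOARD_PACKAGE_ROOT/driver/aclsoc_drv.ko.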
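Every path in the script hangs off a single root directory. As a sketch of that relationship, the following C++ helper derives the same paths from the value of ALTERAOCLSDKROOT. The struct `OpenclEnv` and the function `derive_opencl_env` are hypothetical names; only the path suffixes come from the script shown above.

```cpp
#include <cassert>
#include <string>

// Hypothetical container for the locations the init script sets up.
struct OpenclEnv {
    std::string sdk_root;            // ALTERAOCLSDKROOT
    std::string board_package_root;  // AOCL_BOARD_PACKAGE_ROOT
    std::string bin_dir;             // prepended to PATH
    std::string lib_dir;             // prepended to LD_LIBRARY_PATH
    std::string driver;              // module passed to insmod
};

// Rebuilds the paths the same way init_opencl.sh composes them.
OpenclEnv derive_opencl_env(const std::string& sdk_root) {
    OpenclEnv e;
    e.sdk_root = sdk_root;
    e.board_package_root = sdk_root + "/board/c5soc";
    e.bin_dir = sdk_root + "/bin";
    e.lib_dir = sdk_root + "/host/arm32/lib";
    e.driver = e.board_package_root + "/driver/aclsoc_drv.ko";
    return e;
}
```

Calling `derive_opencl_env("/home/root/opencl_arm32_rte")` yields the driver path `/home/root/opencl_arm32_rte/board/c5soc/driver/aclsoc_drv.ko`, which is the module the script loads.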
OK, all set. Now go into the helloworld directory.
root@socfpga:~# cd helloworld/
root@socfpga:~/helloworld# ls
hello_world.aocx  helloworld

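This directory contains two files, hello_world.aocx and helloworld. The former runs on the FPGA (in OpenCL terms, the kernel) and the latter runs on the ARM (in OpenCL terms, the host program). How the two are built is shown in a figure (not preserved in this archive).

The steps to run it are as follows: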
root@socfpga:~/helloworld# aocl program /dev/acl0 hello_world.aocx
aocl program: Running reprogram from /home/root/opencl_arm32_rte/board/c5soc/arm32/bin
Reprogramming was successful!
root@socfpga:~/helloworld# ./helloworld
Querying platform for info:
==========================
CL_PLATFORM_NAME                      = Altera SDK for OpenCL
CL_PLATFORM_VENDOR                   = Altera Corporation
CL_PLATFORM_VERSION                   = OpenCL 1.0 Altera SDK for OpenCL, Version 14.0

Querying device for info:
========================
CL_DEVICE_NAME                         = de1soc_sharedonly : Cyclone V SoC Development Kit
CL_DEVICE_VENDOR                      = Altera Corporation
CL_DEVICE_VENDOR_ID                   = 4466
CL_DEVICE_VERSION                      = OpenCL 1.0 Altera SDK for OpenCL, Version 14.0
CL_DRIVER_VERSION                      = 14.0
CL_DEVICE_ADDRESS_BITS                = 64
CL_DEVICE_AVAILABLE                   = true
CL_DEVICE_ENDIAN_LITTLE                = true
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE       = 32768
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE    = 0
CL_DEVICE_GLOBAL_MEM_SIZE             = 536870912
CL_DEVICE_IMAGE_SUPPORT                = false
CL_DEVICE_LOCAL_MEM_SIZE             = 16384
CL_DEVICE_MAX_CLOCK_FREQUENCY          = 1000
CL_DEVICE_MAX_COMPUTE_UNITS          = 1
CL_DEVICE_MAX_CONSTANT_ARGS          = 8
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE    = 134217728
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS    = 3
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS    = 8192
CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE    = 1024
CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR = 4
CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT = 2
CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT = 1
CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG = 1
CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT = 1
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE= 0
Command queue out of order?          = false
Command queue profiling enabled?       = true
Using AOCX: hello_world.aocx

Kernel initialization is complete.
Launching the kernel...

Thread #2: Hello from Altera's OpenCL Compiler!

Kernel execution is complete.

As you can see, it ran successfully.
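The memory sizes in the device query are printed as raw byte counts. A small helper makes them easier to read; this is just a convenience sketch, and the name `human_size` is made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Converts an exact multiple of 1024 into the largest binary unit that
// still gives a whole number; other values are returned in bytes.
std::string human_size(std::uint64_t bytes) {
    const char* units[] = {"B", "KiB", "MiB", "GiB"};
    int u = 0;
    while (bytes >= 1024 && bytes % 1024 == 0 && u < 3) {
        bytes /= 1024;
        ++u;
    }
    return std::to_string(bytes) + " " + units[u];
}
```

Applied to the log above, CL_DEVICE_GLOBAL_MEM_SIZE (536870912) is 512 MiB, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE (134217728) is 128 MiB, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE (32768) is 32 KiB, and CL_DEVICE_LOCAL_MEM_SIZE (16384) is 16 KiB.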
If you want to see the source code, you can find it in DE1-SoC_openCL_BSP.zip, under examples/helloworld/.
Files with the .cl suffix are kernels. The kernel for the example above is:
// AOC kernel demonstrating device-side printf call
__kernel void hello_world(int thread_id_from_which_to_print_message) {
// Get index of the work item
unsigned thread_id = get_global_id(0);

if(thread_id == thread_id_from_which_to_print_message) {
printf("Thread #%u: Hello from Altera's OpenCL Compiler!\n", thread_id);
}
}

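It looks like a C function, except that it is prefixed with the "__kernel" keyword, which specifies that it runs on the device (the FPGA). Altera's OpenCL tools can compile it into an FPGA bitstream configuration file.
The function here is very simple: each thread just checks whether its own thread ID equals the one specified by the host. If it does, the thread prints a line; all other threads stay silent.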
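To see the "only one thread speaks" behaviour without a board, the kernel body can be emulated on the CPU by running the work-items one after another. This is a sketch, not how the FPGA executes it: `hello_world_item` and `emulate_launch` are hypothetical names, the loop index stands in for get_global_id(0), and the launch size of 8 with target ID 2 is assumed to match the "Thread #2" line in the log.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Host-side stand-in for the kernel body; `thread_id` plays the role of
// get_global_id(0). Appends the message to `out` instead of printing it.
bool hello_world_item(unsigned thread_id,
                      int thread_id_from_which_to_print_message,
                      std::string& out) {
    if (thread_id == static_cast<unsigned>(thread_id_from_which_to_print_message)) {
        char buf[96];
        std::snprintf(buf, sizeof(buf),
                      "Thread #%u: Hello from Altera's OpenCL Compiler!\n",
                      thread_id);
        out += buf;
        return true;
    }
    return false;  // every other work-item stays silent
}

// Emulates a 1-D launch by running the work-items sequentially.
std::string emulate_launch(unsigned global_size, int id_to_print) {
    std::string out;
    for (unsigned id = 0; id < global_size; ++id) {
        hello_world_item(id, id_to_print, out);
    }
    return out;
}
```

`emulate_launch(8, 2)` returns exactly one line, the one for thread 2; asking for an ID outside the launch range, such as 9, returns an empty string.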

(To be continued in the replies)

dongliudongliu posted on 2015-7-22 15:21:35

Next, let's see what the host program looks like.
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cstring>
#include "CL/opencl.h"
#include "AOCL_Utils.h"

using namespace aocl_utils;

#define STRING_BUFFER_LEN 1024

// Runtime constants
// Used to define the work set over which this kernel will execute.
static const size_t work_group_size = 8;// 8 threads in the demo workgroup
// Defines kernel argument value, which is the workitem ID that will
// execute a printf call
static const int thread_id_to_output = 2;

// OpenCL runtime configuration
static cl_platform_id platform = NULL;
static cl_device_id device = NULL;
static cl_context context = NULL;
static cl_command_queue queue = NULL;
static cl_kernel kernel = NULL;
static cl_program program = NULL;

// Function prototypes
bool init();
void cleanup();
static void device_info_ulong( cl_device_id device, cl_device_info param, const char* name);
static void device_info_uint( cl_device_id device, cl_device_info param, const char* name);
static void device_info_bool( cl_device_id device, cl_device_info param, const char* name);
static void device_info_string( cl_device_id device, cl_device_info param, const char* name);
static void display_device_info( cl_device_id device );

// Entry point.
int main() {
cl_int status;

if(!init()) {
return -1;
}

// Set the kernel argument (argument 0)
status = clSetKernelArg(kernel, 0, sizeof(cl_int), (void*)&thread_id_to_output);
checkError(status, "Failed to set kernel arg 0");

printf("\nKernel initialization is complete.\n");
printf("Launching the kernel...\n\n");

// Configure work set over which the kernel will execute
size_t wgSize[3] = {work_group_size, 1, 1};
size_t gSize[3] = {work_group_size, 1, 1};

// Launch the kernel
status = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, gSize, wgSize, 0, NULL, NULL);
checkError(status, "Failed to launch kernel");

// Wait for command queue to complete pending events
status = clFinish(queue);
checkError(status, "Failed to finish");

printf("\nKernel execution is complete.\n");

// Free the resources allocated
cleanup();

return 0;
}

/////// HELPER FUNCTIONS ///////

bool init() {
cl_int status;

if(!setCwdToExeDir()) {
return false;
}

// Get the OpenCL platform.
platform = findPlatform("Altera");
if(platform == NULL) {
printf("ERROR: Unable to find Altera OpenCL platform.\n");
return false;
}

// User-visible output - Platform information
{
char char_buffer[STRING_BUFFER_LEN];
printf("Querying platform for info:\n");
printf("==========================\n");
clGetPlatformInfo(platform, CL_PLATFORM_NAME, STRING_BUFFER_LEN, char_buffer, NULL);
printf("%-40s = %s\n", "CL_PLATFORM_NAME", char_buffer);
clGetPlatformInfo(platform, CL_PLATFORM_VENDOR, STRING_BUFFER_LEN, char_buffer, NULL);
printf("%-40s = %s\n", "CL_PLATFORM_VENDOR ", char_buffer);
clGetPlatformInfo(platform, CL_PLATFORM_VERSION, STRING_BUFFER_LEN, char_buffer, NULL);
printf("%-40s = %s\n\n", "CL_PLATFORM_VERSION ", char_buffer);
}

// Query the available OpenCL devices.
scoped_array<cl_device_id> devices;
cl_uint num_devices;

devices.reset(getDevices(platform, CL_DEVICE_TYPE_ALL, &num_devices));

// We'll just use the first device.
device = devices[0];

// Display some device information.
display_device_info(device);

// Create the context.
context = clCreateContext(NULL, 1, &device, NULL, NULL, &status);
checkError(status, "Failed to create context");

// Create the command queue.
queue = clCreateCommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &status);
checkError(status, "Failed to create command queue");

// Create the program.
std::string binary_file = getBoardBinaryFile("hello_world", device);
printf("Using AOCX: %s\n", binary_file.c_str());
program = createProgramFromBinary(context, binary_file.c_str(), &device, 1);

// Build the program that was just created.
status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
checkError(status, "Failed to build program");

// Create the kernel - name passed in here must match kernel name in the
// original CL file, that was compiled into an AOCX file using the AOC tool
const char *kernel_name = "hello_world";// Kernel name, as defined in the CL file
kernel = clCreateKernel(program, kernel_name, &status);
checkError(status, "Failed to create kernel");

return true;
}

// Free the resources allocated during initialization
void cleanup() {
if(kernel) {
clReleaseKernel(kernel);
}
if(program) {
clReleaseProgram(program);
}
if(queue) {
clReleaseCommandQueue(queue);
}
if(context) {
clReleaseContext(context);
}
}

// Helper functions to display parameters returned by OpenCL queries
static void device_info_ulong( cl_device_id device, cl_device_info param, const char* name) {
cl_ulong a;
clGetDeviceInfo(device, param, sizeof(cl_ulong), &a, NULL);
printf("%-40s = %lu\n", name, a);
}
static void device_info_uint( cl_device_id device, cl_device_info param, const char* name) {
cl_uint a;
clGetDeviceInfo(device, param, sizeof(cl_uint), &a, NULL);
printf("%-40s = %u\n", name, a);
}
static void device_info_bool( cl_device_id device, cl_device_info param, const char* name) {
cl_bool a;
clGetDeviceInfo(device, param, sizeof(cl_bool), &a, NULL);
printf("%-40s = %s\n", name, (a?"true":"false"));
}
static void device_info_string( cl_device_id device, cl_device_info param, const char* name) {
char a[STRING_BUFFER_LEN];
clGetDeviceInfo(device, param, STRING_BUFFER_LEN, &a, NULL);
printf("%-40s = %s\n", name, a);
}

// Query and display OpenCL information on device and runtime environment
static void display_device_info( cl_device_id device ) {

printf("Querying device for info:\n");
printf("========================\n");
device_info_string(device, CL_DEVICE_NAME, "CL_DEVICE_NAME");
device_info_string(device, CL_DEVICE_VENDOR, "CL_DEVICE_VENDOR");
device_info_uint(device, CL_DEVICE_VENDOR_ID, "CL_DEVICE_VENDOR_ID");
device_info_string(device, CL_DEVICE_VERSION, "CL_DEVICE_VERSION");
device_info_string(device, CL_DRIVER_VERSION, "CL_DRIVER_VERSION");
device_info_uint(device, CL_DEVICE_ADDRESS_BITS, "CL_DEVICE_ADDRESS_BITS");
device_info_bool(device, CL_DEVICE_AVAILABLE, "CL_DEVICE_AVAILABLE");
device_info_bool(device, CL_DEVICE_ENDIAN_LITTLE, "CL_DEVICE_ENDIAN_LITTLE");
device_info_ulong(device, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, "CL_DEVICE_GLOBAL_MEM_CACHE_SIZE");
device_info_ulong(device, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, "CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE");
device_info_ulong(device, CL_DEVICE_GLOBAL_MEM_SIZE, "CL_DEVICE_GLOBAL_MEM_SIZE");
device_info_bool(device, CL_DEVICE_IMAGE_SUPPORT, "CL_DEVICE_IMAGE_SUPPORT");
device_info_ulong(device, CL_DEVICE_LOCAL_MEM_SIZE, "CL_DEVICE_LOCAL_MEM_SIZE");
device_info_ulong(device, CL_DEVICE_MAX_CLOCK_FREQUENCY, "CL_DEVICE_MAX_CLOCK_FREQUENCY");
device_info_ulong(device, CL_DEVICE_MAX_COMPUTE_UNITS, "CL_DEVICE_MAX_COMPUTE_UNITS");
device_info_ulong(device, CL_DEVICE_MAX_CONSTANT_ARGS, "CL_DEVICE_MAX_CONSTANT_ARGS");
device_info_ulong(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, "CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE");
device_info_uint(device, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, "CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS");
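// As posted, this line printed the label "CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS", which is why
// that name appears twice in the log; the value queried is CL_DEVICE_MEM_BASE_ADDR_ALIGN.
device_info_uint(device, CL_DEVICE_MEM_BASE_ADDR_ALIGN, "CL_DEVICE_MEM_BASE_ADDR_ALIGN");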
device_info_uint(device, CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE, "CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE");
device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR");
device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT");
device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT");
device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG");
device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT");
device_info_uint(device, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE, "CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE");

{
   cl_command_queue_properties ccp;
   clGetDeviceInfo(device, CL_DEVICE_QUEUE_PROPERTIES, sizeof(cl_command_queue_properties), &ccp, NULL);
   printf("%-40s = %s\n", "Command queue out of order? ", ((ccp & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE)?"true":"false"));
   printf("%-40s = %s\n", "Command queue profiling enabled? ", ((ccp & CL_QUEUE_PROFILING_ENABLE)?"true":"false"));
}
}

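The host program is fairly long. Its main execution flow is:
Initialize the platform, find the device, print the device information, create the device context, create a command queue in that context, load the device code, build the device code, create the kernel object, set the kernel arguments, run the kernel, wait for the kernel to finish, and release all objects.
This is the most basic OpenCL flow. It is rather tedious, but once you are familiar with it, it is almost always these same steps with very few code changes; what really needs careful design is the kernel.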
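As a quick reference, the twelve steps can be lined up against the standard OpenCL 1.x API calls that usually carry them out. The table below is a sketch: the pairing for the first two steps and for loading the binary assumes that the AOCL_Utils helpers (`findPlatform`, `getDevices`, `createProgramFromBinary`) wrap `clGetPlatformIDs`, `clGetDeviceIDs` and `clCreateProgramWithBinary`, which the listing above does not show. `HostStep` and `host_steps` are made-up names.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row per step of the host flow: what happens, and the OpenCL call
// that typically does it.
struct HostStep {
    std::string what;
    std::string api;
};

std::vector<HostStep> host_steps() {
    return {
        {"initialize the platform",       "clGetPlatformIDs"},
        {"find the device",               "clGetDeviceIDs"},
        {"print the device information",  "clGetDeviceInfo"},
        {"create the context",            "clCreateContext"},
        {"create the command queue",      "clCreateCommandQueue"},
        {"load the device code",          "clCreateProgramWithBinary"},
        {"build the device code",         "clBuildProgram"},
        {"create the kernel object",      "clCreateKernel"},
        {"set the kernel arguments",      "clSetKernelArg"},
        {"run the kernel",                "clEnqueueNDRangeKernel"},
        {"wait for the kernel to finish", "clFinish"},
        {"release all objects",           "clRelease* (kernel, program, queue, context)"},
    };
}
```

On an FPGA target the "build" step does not compile source at run time; the program is created from the precompiled .aocx binary, as the `createProgramFromBinary` call in the listing suggests.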

(To be continued in the replies)

dongliudongliu posted on 2015-7-22 15:27:35

All right, one more example and then off to bed.
Go up one directory, switch into vector_Add, and run it:

root@socfpga:~/helloworld# cd ..
root@socfpga:~# ls
README          helloworld     opencl_arm32_rte  vector_Add
boardtest       init_opencl.sh swapper
root@socfpga:~# cd vector_Add/
root@socfpga:~/vector_Add# ls
vectorAdd    vectorAdd.aocx
root@socfpga:~/vector_Add# aocl program /dev/acl0 vectorAdd.aocx
aocl program: Running reprogram from /home/root/opencl_arm32_rte/board/c5soc/arm32/bin
Reprogramming was successful!
root@socfpga:~/vector_Add# ./vectorAdd
Initializing OpenCL
Platform: Altera SDK for OpenCL
Using 1 device(s)
de1soc_sharedonly : Cyclone V SoC Development Kit
Using AOCX: vectorAdd.aocx
Launching for device 0 (1000000 elements)

Time: 107.127 ms
Kernel time (device 0): 6.933 ms

Verification: PASS

This is a vector addition example, a classic of parallel computing. The kernel is:
__kernel void vectorAdd(__global const float *x,
                     __global const float *y,
                     __global float *restrict z)
{
// get index of the work item
int index = get_global_id(0);

// add the vector elements
z[index] = x[index] + y[index];
}
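For checking results on the host, the same computation can be written as a plain loop in which the loop variable takes the place of get_global_id(0). This is a reference sketch only; the name `vector_add_reference` is an assumption and is not part of the Altera example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the vectorAdd kernel: one loop iteration corresponds
// to one work-item, and `index` to get_global_id(0).
std::vector<float> vector_add_reference(const std::vector<float>& x,
                                        const std::vector<float>& y) {
    assert(x.size() == y.size());
    std::vector<float> z(x.size());
    for (std::size_t index = 0; index < x.size(); ++index) {
        z[index] = x[index] + y[index];
    }
    return z;
}
```

Because each element is computed independently of the others, the iterations can run in any order or all at once, which is what makes this a natural fit for a parallel device.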

(To be continued in the replies)

dongliudongliu posted on 2015-7-22 15:29:11

The host program is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "CL/opencl.h"
#include "AOCL_Utils.h"

using namespace aocl_utils;

// OpenCL runtime configuration
cl_platform_id platform = NULL;
unsigned num_devices = 0;
scoped_array<cl_device_id> device; // num_devices elements
cl_context context = NULL;
scoped_array<cl_command_queue> queue; // num_devices elements
cl_program program = NULL;
scoped_array<cl_kernel> kernel; // num_devices elements
scoped_array<cl_mem> input_a_buf; // num_devices elements
scoped_array<cl_mem> input_b_buf; // num_devices elements
scoped_array<cl_mem> output_buf; // num_devices elements

// Problem data.
const unsigned N = 1000000; // problem size
scoped_array<scoped_aligned_ptr<float> > input_a, input_b; // num_devices elements
scoped_array<scoped_aligned_ptr<float> > output; // num_devices elements
scoped_array<scoped_array<float> > ref_output; // num_devices elements
scoped_array<unsigned> n_per_device; // num_devices elements

// Function prototypes
float rand_float();
bool init_opencl();
void init_problem();
void run();
void cleanup();

// Entry point.
int main() {
// Initialize OpenCL.
if(!init_opencl()) {
return -1;
}

// Initialize the problem data.
// Requires the number of devices to be known.
init_problem();

// Run the kernel.
run();

// Free the resources allocated
cleanup();

return 0;
}

/////// HELPER FUNCTIONS ///////

// Randomly generate a floating-point number between -10 and 10.
float rand_float() {
return float(rand()) / float(RAND_MAX) * 20.0f - 10.0f;
}

// Initializes the OpenCL objects.
bool init_opencl() {
cl_int status;

printf("Initializing OpenCL\n");

if(!setCwdToExeDir()) {
return false;
}

// Get the OpenCL platform.
platform = findPlatform("Altera");
if(platform == NULL) {
printf("ERROR: Unable to find Altera OpenCL platform.\n");
return false;
}

// Query the available OpenCL device.
device.reset(getDevices(platform, CL_DEVICE_TYPE_ALL, &num_devices));
printf("Platform: %s\n", getPlatformName(platform).c_str());
printf("Using %d device(s)\n", num_devices);
for(unsigned i = 0; i < num_devices; ++i) {
printf("%s\n", getDeviceName(device[i]).c_str());
}

// Create the context.
context = clCreateContext(NULL, num_devices, device, NULL, NULL, &status);
checkError(status, "Failed to create context");

// Create the program for all device. Use the first device as the
// representative device (assuming all device are of the same type).
std::string binary_file = getBoardBinaryFile("vectorAdd", device[0]);
printf("Using AOCX: %s\n", binary_file.c_str());
program = createProgramFromBinary(context, binary_file.c_str(), device, num_devices);

// Build the program that was just created.
status = clBuildProgram(program, 0, NULL, "", NULL, NULL);
checkError(status, "Failed to build program");

// Create per-device objects.
queue.reset(num_devices);
kernel.reset(num_devices);
n_per_device.reset(num_devices);
input_a_buf.reset(num_devices);
input_b_buf.reset(num_devices);
output_buf.reset(num_devices);

for(unsigned i = 0; i < num_devices; ++i) {
// Command queue.
queue[i] = clCreateCommandQueue(context, device[i], CL_QUEUE_PROFILING_ENABLE, &status);
checkError(status, "Failed to create command queue");

// Kernel.
const char *kernel_name = "vectorAdd";
kernel[i] = clCreateKernel(program, kernel_name, &status);
checkError(status, "Failed to create kernel");

// Determine the number of elements processed by this device.
n_per_device[i] = N / num_devices; // number of elements handled by this device

// Spread out the remainder of the elements over the first
// N % num_devices.
if(i < (N % num_devices)) {
   n_per_device[i]++;
}

// Input buffers.
input_a_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,
   n_per_device[i] * sizeof(float), NULL, &status);
checkError(status, "Failed to create buffer for input A");

input_b_buf[i] = clCreateBuffer(context, CL_MEM_READ_ONLY,
   n_per_device[i] * sizeof(float), NULL, &status);
checkError(status, "Failed to create buffer for input B");

// Output buffer.
output_buf[i] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
   n_per_device[i] * sizeof(float), NULL, &status);
checkError(status, "Failed to create buffer for output");
}

return true;
}

// Initialize the data for the problem. Requires num_devices to be known.
void init_problem() {
if(num_devices == 0) {
checkError(-1, "No devices");
}

input_a.reset(num_devices);
input_b.reset(num_devices);
output.reset(num_devices);
ref_output.reset(num_devices);

// Generate input vectors A and B and the reference output consisting
// of a total of N elements.
// We create separate arrays for each device so that each device has an
// aligned buffer.
for(unsigned i = 0; i < num_devices; ++i) {
input_a[i].reset(n_per_device[i]);
input_b[i].reset(n_per_device[i]);
output[i].reset(n_per_device[i]);
ref_output[i].reset(n_per_device[i]);

for(unsigned j = 0; j < n_per_device[i]; ++j) {
   input_a[i][j] = rand_float();
   input_b[i][j] = rand_float();
   ref_output[i][j] = input_a[i][j] + input_b[i][j];
}
}
}

void run() {
cl_int status;

const double start_time = getCurrentTimestamp();

// Launch the problem for each device.
scoped_array<cl_event> kernel_event(num_devices);
scoped_array<cl_event> finish_event(num_devices);

for(unsigned i = 0; i < num_devices; ++i) {

// Transfer inputs to each device. Each of the host buffers supplied to
// clEnqueueWriteBuffer here is already aligned to ensure that DMA is used
// for the host-to-device transfer.
cl_event write_event[2];
status = clEnqueueWriteBuffer(queue[i], input_a_buf[i], CL_FALSE,
   0, n_per_device[i] * sizeof(float), input_a[i], 0, NULL, &write_event[0]);
checkError(status, "Failed to transfer input A");

status = clEnqueueWriteBuffer(queue[i], input_b_buf[i], CL_FALSE,
   0, n_per_device[i] * sizeof(float), input_b[i], 0, NULL, &write_event[1]);
checkError(status, "Failed to transfer input B");

// Set kernel arguments.
unsigned argi = 0;

status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_a_buf[i]);
checkError(status, "Failed to set argument %d", argi - 1);

status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &input_b_buf[i]);
checkError(status, "Failed to set argument %d", argi - 1);

status = clSetKernelArg(kernel[i], argi++, sizeof(cl_mem), &output_buf[i]);
checkError(status, "Failed to set argument %d", argi - 1);

// Enqueue kernel.
// Use a global work size corresponding to the number of elements to add
// for this device.
//
// We don't specify a local work size and let the runtime choose
// (it'll choose to use one work-group with the same size as the global
// work-size).
//
// Events are used to ensure that the kernel is not launched until
// the writes to the input buffers have completed.
const size_t global_work_size = n_per_device[i];
printf("Launching for device %d (%d elements)\n", i, global_work_size);

status = clEnqueueNDRangeKernel(queue[i], kernel[i], 1, NULL,
   &global_work_size, NULL, 2, write_event, &kernel_event[i]);
checkError(status, "Failed to launch kernel");

// Read the result. This is the final operation.
status = clEnqueueReadBuffer(queue[i], output_buf[i], CL_FALSE,
   0, n_per_device[i] * sizeof(float), output[i], 1, &kernel_event[i], &finish_event[i]);

// Release local events.
clReleaseEvent(write_event[0]);
clReleaseEvent(write_event[1]);
}

// Wait for all devices to finish.
clWaitForEvents(num_devices, finish_event);

const double end_time = getCurrentTimestamp();

// Wall-clock time taken.
printf("\nTime: %0.3f ms\n", (end_time - start_time) * 1e3);

// Get kernel times using the OpenCL event profiling API.
for(unsigned i = 0; i < num_devices; ++i) {
cl_ulong time_ns = getStartEndTime(kernel_event[i]);
printf("Kernel time (device %d): %0.3f ms\n", i, double(time_ns) * 1e-6);
}

// Release all events.
for(unsigned i = 0; i < num_devices; ++i) {
clReleaseEvent(kernel_event[i]);
clReleaseEvent(finish_event[i]);
}

// Verify results.
bool pass = true;
for(unsigned i = 0; i < num_devices && pass; ++i) {
for(unsigned j = 0; j < n_per_device[i] && pass; ++j) {
   if(fabsf(output[i][j] - ref_output[i][j]) > 1.0e-5f) {
   printf("Failed verification @ device %d, index %d\nOutput: %f\nReference: %f\n",
         i, j, output[i][j], ref_output[i][j]);
   pass = false;
   }
}
}

printf("\nVerification: %s\n", pass ? "PASS" : "FAIL");
}

// Free the resources allocated during initialization
void cleanup() {
for(unsigned i = 0; i < num_devices; ++i) {
if(kernel && kernel[i]) {
   clReleaseKernel(kernel[i]);
}
if(queue && queue[i]) {
   clReleaseCommandQueue(queue[i]);
}
if(input_a_buf && input_a_buf[i]) {
   clReleaseMemObject(input_a_buf[i]);
}
if(input_b_buf && input_b_buf[i]) {
   clReleaseMemObject(input_b_buf[i]);
}
if(output_buf && output_buf[i]) {
   clReleaseMemObject(output_buf[i]);
}
}

if(program) {
clReleaseProgram(program);
}
if(context) {
clReleaseContext(context);
}
}
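One detail in the host program worth isolating is how the N elements are divided among devices: each device gets N / num_devices elements, and the first N % num_devices devices get one extra. A standalone sketch of that rule follows; the function name `split_elements` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Splits n elements across num_devices the same way the host program does:
// an even share for everyone, plus one extra for the first (n % num_devices).
std::vector<unsigned> split_elements(unsigned n, unsigned num_devices) {
    assert(num_devices > 0);
    std::vector<unsigned> per_device(num_devices, n / num_devices);
    for (unsigned i = 0; i < n % num_devices; ++i) {
        ++per_device[i];
    }
    return per_device;
}
```

With the single device on this board, `split_elements(1000000, 1)` gives one share of 1000000, matching "Launching for device 0 (1000000 elements)" in the log. With three devices the shares would be 333334, 333333 and 333333.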

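Adding two vectors of one million elements each took 107.127 ms of wall-clock time, of which the kernel itself accounted for 6.933 ms. You can try doing the computation on the ARM alone, see how long it takes, and compare the performance.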
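For the ARM-only comparison, a CPU baseline could look like the sketch below. It assumes a C++11 compiler (for `std::chrono`), which may or may not be what the board's toolchain provides; `time_cpu_vector_add` is a made-up name. No timings are quoted here because none were measured.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Adds x and y on the CPU and returns the elapsed time in milliseconds.
// The output vector is sized before the clock starts, so allocation is
// not included in the measurement.
double time_cpu_vector_add(const std::vector<float>& x,
                           const std::vector<float>& y,
                           std::vector<float>& z) {
    assert(x.size() == y.size());
    z.resize(x.size());
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < x.size(); ++i) {
        z[i] = x[i] + y[i];
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

When comparing, keep in mind which FPGA figure you compare against: the 107.127 ms in the log is wall-clock time for the whole run() sequence including buffer transfers, while 6.933 ms is the kernel time reported by event profiling.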

That's all for today. Good night, everyone!

(End)
