Qualcomm Developer Zone

Parallel Patterns Demo

Operating System	Cloud Service/Platform	Skill Level	Areas of Focus
Android		Intermediate	Low Power, Gaming, Embedded

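Objective

When running certain algorithms, we need to measure how the hardware processors perform. We may be able to run an algorithm on the processor best suited to it, or split the algorithm into parts and run them on different processors, which may yield better performance.

I want to provide a solution for optimizing application and device performance.

Materials / Parts List / Tools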

• Snapdragon Heterogeneous Compute SDK v1.0.0 - Linux

• android-ndk-r14b-linux-x86_64

Source Code / Samples / Executable Applications

• Source Code

Additional Resources

• opencl-sdk-1.2.2

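Build / Assembly Instructions

The following tools are used in this project:

1. An Android device with a Snapdragon 845 processor, with the Heterogeneous Compute SDK, the Hexagon SDK, and the application installed.

2. Ubuntu 18.04 LTS

3. USB Type-C data cable

4. The application is developed on top of the Heterogeneous Compute SDK.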

Deploying the Project

1. Download the Qualcomm Heterogeneous Compute SDK from https://developer.qualcomm.com/download/snapdragon-heterogeneous-compute-sdk-1.0.0.deb?referrer=node/35864 and install it on the PC.

2. Download android-ndk-r14b-linux-x86_64 from https://dl.google.com/android/repository/android-ndk-r14b-linux-x86_64.zip and install it on the PC.

3. Download the Hexagon DSP SDK from https://developer.qualcomm.com/download/hexagon/hexagon-sdk-v3-3-3-linux.zip?referrer=node/6116 and install it on the PC.

4. Build the application and libraries and copy them to the device. Refer to the readme.md file for build instructions.

5. Push the application to the "/data/local/tmp" directory on the device and run it:

./hetcompute_sample_ParallelPatternsDemo

6. If everything works, upload the code to GitHub.

Workflow

1. What should we do first?

First, we need to create the processor kernel channel used to control the hardware.

/1.0.0/samples/ParallelPatternsDemo.cc

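#include <sys/time.h> // gettimeofday

// Returns the current wall-clock time in milliseconds since the epoch.
long getCurrentTimeMsec()
{
    struct timeval stuCurrentTime;
    gettimeofday(&stuCurrentTime, NULL);
    // Combine seconds and microseconds arithmetically instead of formatting
    // and re-parsing a string. Unsigned arithmetic keeps wrap-around well
    // defined on targets where long is 32 bits wide.
    unsigned long msec = (unsigned long)stuCurrentTime.tv_sec * 1000UL
                         + (unsigned long)(stuCurrentTime.tv_usec / 1000);
    return (long)msec;
}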
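The timer above reads the wall clock, which can jump if the system time is adjusted while a measurement is running. As a hedged alternative, the sketch below measures elapsed time with the monotonic std::chrono::steady_clock from the C++ standard library. The helper name elapsedMsec is hypothetical and is not part of the sample or the SDK.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of the sample or the SDK): runs `work` once
// and returns the elapsed time in milliseconds, using a monotonic clock.
template <typename Work>
double elapsedMsec(Work&& work)
{
    const auto begin = std::chrono::steady_clock::now();
    work();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - begin).count();
}
```

A call such as elapsedMsec([&] { matrix_parallel_scan(matrixDataA); }) would time one pattern without separate begin and end variables.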

// VEC_MATRIX_SIZE, LOOP_NUM, process_calc_time, and the matrix_* helpers are
// assumed to be declared earlier in the sample file (not shown in this excerpt).
int main(int argc, char *argv[])
{
    hetcompute::runtime::init();

    if (argc > 2) {
        HETCOMPUTE_ILOG("********************************************");
        HETCOMPUTE_ILOG("eg: ./hetcompute_sample_ParallelPatternsDemo");
        HETCOMPUTE_ILOG("********************************************");
        hetcompute::runtime::shutdown(); // shut the runtime down on the error path too
        return -1;
    }

    // Create the input and output vectors.
    std::vector<int> matrixDataA(VEC_MATRIX_SIZE, 1); // every element of A starts at 1
    std::vector<int> matrixDataB(VEC_MATRIX_SIZE, 0); // every element of B starts at 0

    // Optional: initialize the input vector with random numbers instead.
    /*std::random_device              random_device;
    std::mt19937                    generator(random_device());
    const int                       min_value = 0, max_value = RANDOM_MAX_VALUE;
    std::uniform_int_distribution<> dist(min_value, max_value);
    auto                            gen_dist = std::bind(dist, std::ref(generator));
    std::generate(matrixDataA.begin(), matrixDataA.end(), gen_dist);*/

    // CPU - inclusive parallel scan (pscan_inclusive):
    // data[x] = data[x - 1] + data[x]
    matrix_parallel_scan(matrixDataA);

    // CPU - parallel iteration (pfor_each):
    // B[i] += A[i] * 2 for every element
    matrix_iteration_process(matrixDataA, matrixDataB);

    // CPU - parallel reduction (preduce):
    // sum of all elements
    matrix_preduce_process(matrixDataA);

    hetcompute::runtime::shutdown();
    return 0;
}

// Times LOOP_NUM rounds of hetcompute::pfor_each. Each round adds twice every
// element of DataA to the matching element of DataB.
// Note: both vectors are passed by value, so the caller's vectors are unchanged.
void matrix_iteration_process(std::vector<int> DataA, std::vector<int> DataB)
{
    unsigned long begin_process_time = 0;
    unsigned long end_process_time = 0;

    // Run the parallel loop with pfor_each.
    begin_process_time = getCurrentTimeMsec();
    for (size_t x = 0; x < LOOP_NUM; x++) {
        // Capture by reference so the input vector is not copied on every round.
        hetcompute::pfor_each(size_t(0), DataA.size(), [&DataA, &DataB](size_t i) {
            DataB[i] += DataA[i] * 2;
        });
    }
    end_process_time = getCurrentTimeMsec();

    process_calc_time = (float)(end_process_time - begin_process_time);
    HETCOMPUTE_ILOG("matrix_iteration_process function adds matrix A data * 2 to B, %d times consume time: %f ms.",
        LOOP_NUM, process_calc_time);
}
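To check what the parallel loop above produces, a sequential reference can be useful. The sketch below uses only the C++ standard library and applies the same per-element update, b[i] += a[i] * 2, a given number of times. The name iterationReference is hypothetical and is not part of the sample or the SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential reference (not part of the sample or the SDK):
// applies b[i] += a[i] * 2 to every element, `loops` times, and returns b.
std::vector<int> iterationReference(const std::vector<int>& a, std::vector<int> b, std::size_t loops)
{
    for (std::size_t x = 0; x < loops; ++x) {
        for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
            b[i] += a[i] * 2;
        }
    }
    return b;
}
```

Because each index is updated independently of the others, the order in which a parallel runtime visits the indices should not change the result, so the output can be compared element by element.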

// Times LOOP_NUM rounds of hetcompute::preduce, each summing all elements.
void matrix_preduce_process(std::vector<int> inputData)
{
    unsigned long begin_process_time = 0;
    unsigned long end_process_time = 0;
    const int identity = 0; // neutral element for addition
    int parallel_sum = 0;

    // Run the parallel reduction with preduce.
    begin_process_time = getCurrentTimeMsec();
    for (size_t x = 0; x < LOOP_NUM; x++) {
        parallel_sum = hetcompute::preduce(size_t(0), inputData.size(), identity,
                                           // Sums the sub-range [f, l) into init; capture by
                                           // reference to avoid copying the vector each round.
                                           [&inputData](size_t f, size_t l, int& init) {
                                               for (size_t k = f; k < l; ++k) {
                                                   init += inputData[k];
                                               }
                                           },
                                           // Joins the partial sums of two sub-ranges.
                                           std::plus<int>());
    }
    end_process_time = getCurrentTimeMsec();

    process_calc_time = (float)(end_process_time - begin_process_time);
    HETCOMPUTE_ILOG("matrix_preduce_process function matrix[0] + .. + matrix[%d], %d times consume time: %f ms, sum: %d.",
        VEC_MATRIX_SIZE - 1, LOOP_NUM, process_calc_time, parallel_sum);
}
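The preduce call above takes two callables: one that folds a sub-range [f, l) into a partial result, and one that joins partial results. The sketch below is a sequential model of that idea using only the C++ standard library. It is not the SDK implementation; the name chunkedReduce and the fixed chunk size are assumptions made for illustration, and how the SDK actually partitions the range is not described in this article.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sequential model of a chunked reduction (not the SDK code).
// rangeFn(f, l, partial) folds the sub-range [f, l) into `partial`;
// joinFn(a, b) combines two partial results.
template <typename T, typename RangeFn, typename JoinFn>
T chunkedReduce(std::size_t first, std::size_t last, T identity,
                std::size_t chunk, RangeFn rangeFn, JoinFn joinFn)
{
    if (chunk == 0) {
        chunk = 1; // guard against a zero step
    }
    T result = identity;
    for (std::size_t f = first; f < last; f += chunk) {
        const std::size_t l = std::min(f + chunk, last);
        T partial = identity;        // every chunk starts from the identity
        rangeFn(f, l, partial);      // fold one chunk
        result = joinFn(result, partial);
    }
    return result;
}
```

With an associative join such as std::plus<int> and a true identity such as 0, the result does not depend on the chunk size, which is what lets a runtime split the range freely.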

// Times one in-place inclusive scan: data[x] = data[x - 1] + data[x].
// Note: inputData is passed by value, so the scan runs on a local copy and
// the caller's vector is unchanged.
void matrix_parallel_scan(std::vector<int> inputData)
{
    unsigned long begin_process_time = 0;
    unsigned long end_process_time = 0;

    begin_process_time = getCurrentTimeMsec();
    hetcompute::pscan_inclusive(inputData.begin(), inputData.end(), std::plus<int>());
    end_process_time = getCurrentTimeMsec();

    process_calc_time = (float)(end_process_time - begin_process_time);
    HETCOMPUTE_ILOG("matrix_parallel_scan function matrix[x] = matrix[x - 1] + matrix[x]. Consume time: %f ms.", process_calc_time);
}
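The inclusive scan used above replaces each element with the running total of all elements up to and including it. A sequential equivalent in the C++ standard library is std::partial_sum, sketched below as a reference for checking results. The name inclusiveScanReference is hypothetical and is not part of the sample or the SDK.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical sequential reference (not part of the sample or the SDK):
// returns the inclusive prefix sums of `data`.
std::vector<int> inclusiveScanReference(std::vector<int> data)
{
    // Writing back into the same range performs the scan in place.
    std::partial_sum(data.begin(), data.end(), data.begin(), std::plus<int>());
    return data;
}
```

For an input of all ones, as in this demo, element x of the scanned vector is x + 1, so the last element equals the vector size.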

Contributors

Name	Company
Shen Tao shentao1012@thundersoft.com	Thundersoft
Yang Rong yangrong0925@thundersoft.com	Thundersoft
Wu kouzw0723@thundersoft.com	Thundersoft
