What is meant by a programming model

Hello World with Data Parallel C++

A short tutorial for those familiar with C++, as an introduction to the programming of the future


With hardware accelerators for HPC and AI systems, programming will become increasingly parallel in the future. Exciting times for developers who work with C++ today. One of the fundamental innovations that oneAPI brings is the Data Parallel C++ programming model, DPC++ for short: a modern, parallel C++ for heterogeneous architectures, based on Khronos SYCL. This short tutorial gives C++ connoisseurs a clear picture. "Hello World" doesn't make much sense in a programming model whose point is to do many things in parallel, so we start with vector addition as the "Hello World" of parallel programming. The operation we want to implement is SAXPY, short for single-precision A times X plus Y. In C or C++ it can be written as follows:

for (size_t i = 0; i < length; i++) {
    Z[i] += A * X[i] + Y[i];
}


There are many other ways to write this operation in C++. For example, we could use ranges, which would make the code look a little more like the SYCL version below.
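One reading of this, as a minimal sketch assuming C++20 ranges and a variable length defined elsewhere (not a listing from the original article):

#include <ranges>   // C++20

for (std::size_t i : std::views::iota(std::size_t{0}, length)) {
    Z[i] += A * X[i] + Y[i];   // same SAXPY body as above
}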

Here is the same loop, this time programmed in SYCL; the explanation then follows step by step:
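The original listing is not reproduced here; the following sketch shows what the SYCL form of the loop might look like (the kernel name saxpy and the variable length are assumptions):

h.parallel_for<class saxpy>(sycl::range<1>{length}, [=](sycl::id<1> it) {
    const size_t i = it[0];       // convert the SYCL id to an ordinary index
    Z[i] += A * X[i] + Y[i];
});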


As you might guess, parallel_for denotes a for loop that can execute in parallel. The body of the loop is expressed as a lambda; the lambda is the code that looks like [..] {..}.

The loop iterator is expressed in terms of sycl::range and sycl::id. In our simple example, both are one-dimensional, as the <1> indicates. SYCL ranges and IDs can be one-, two-, or three-dimensional; OpenCL and CUDA have the same restriction.

It might feel a little strange to write loops this way, but it is consistent with how lambda expressions work. Anyone who has ever programmed with Parallel STL, TBB, Kokkos, or RAJA will recognize the pattern.

You may be wondering about the template argument to parallel_for. It is one way of naming the kernel, which is necessary because SYCL may be used not only with the device compiler but also with a different host C++ compiler; in that case the two compilers need a way to agree on the kernel name. With many SYCL compilers, such as Intel DPC++, this is not necessary, and the option -fsycl-unnamed-lambda tells the compiler not to worry about kernel names. We will not attempt to explain the h in h.parallel_for at this point, but will come back to it later.

Challenges of heterogeneous programming

The challenges of heterogeneous programming include different types of processing elements and, often, different types of memory. These things make compilers and runtimes more complicated. The SYCL programming model allows such heterogeneous execution, but at a much higher level of abstraction than OpenCL, and not everything has to be explicit. In contrast to other common GPU programming models, SYCL kernels can be embedded in the host program flow, which improves readability.

With DPC++ there is another elementary requirement: whenever we want to compute on a device, we have to create a work queue:

sycl::queue q(sycl::default_selector{});

The default selector prefers a GPU if one is available and otherwise falls back to a CPU. We can also create queues that are tied to specific types of devices:
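For example, a sketch using the predefined SYCL 1.2.1 device selectors might look like this:

sycl::queue q_cpu(sycl::cpu_selector{});    // CPU via the OpenCL runtime
sycl::queue q_gpu(sycl::gpu_selector{});    // a GPU, if one is present
sycl::queue q_host(sycl::host_selector{});  // the host device, no OpenCL involved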

 
The host and CPU selectors can produce significantly different results even when they target the same hardware. The host selector may use a sequential implementation optimized for debugging, while the CPU selector uses the OpenCL runtime and runs across all cores. In addition, the OpenCL just-in-time (JIT) compiler may generate different code because it is a completely different compiler. So do not assume that host and CPU mean the same thing in SYCL just because the host is a CPU.

Managing data in SYCL

The canonical method of managing data in SYCL is through buffers. A SYCL buffer is an opaque container. While this is an elegant design, some applications would rather work with pointers, which the Unified Shared Memory (USM) extension provides (more on this later).
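A minimal sketch of what creating such buffers might look like (the names h_X, d_X, the fill values, and length are assumptions, not taken from the original listing):

std::vector<float> h_X(length, 1.0f);   // host containers
std::vector<float> h_Y(length, 2.0f);
std::vector<float> h_Z(length, 3.0f);

sycl::buffer<float, 1> d_X{h_X.data(), sycl::range<1>(h_X.size())};   // SYCL takes ownership
sycl::buffer<float, 1> d_Y{h_Y.data(), sycl::range<1>(h_Y.size())};
sycl::buffer<float, 1> d_Z{h_Z.data(), sycl::range<1>(h_Z.size())};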


In the example just presented, the user allocates a C++ container on the host and then hands it over to SYCL. Until the destructor of the SYCL buffer is called, the user may only access the data through a SYCL mechanism. The most important aspect of SYCL data management with buffers are the SYCL accessors, which are explained below.

Since device code may require a different compiler or code-generation mechanism than the host code, sections of device code must be clearly marked. Below we see what this looks like in SYCL 1.2.1: we use the submit method to enqueue work on the device queue q, and the command group we submit receives an opaque handler h, against which we launch kernels, in this case via parallel_for.
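A sketch of the complete command group, assuming the buffers d_X, d_Y, d_Z introduced above:

q.submit([&](sycl::handler& h) {
    // accessors declare how the kernel uses each buffer
    auto X = d_X.get_access<sycl::access::mode::read>(h);
    auto Y = d_Y.get_access<sycl::access::mode::read>(h);
    auto Z = d_Z.get_access<sycl::access::mode::read_write>(h);
    h.parallel_for<class saxpy>(sycl::range<1>{length}, [=](sycl::id<1> it) {
        const size_t i = it[0];
        Z[i] += A * X[i] + Y[i];
    });
});
q.wait();   // block until the kernel has finished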


We can synchronize the execution of code on the devices with the wait() method. There are finer-grained ways to synchronize, but we start with the simplest, the sledgehammer approach.

Some users may find the code above a bit verbose, especially compared to Kokkos. The Intel DPC++ compiler therefore supports a more concise syntax, which we cover below.

The last piece of the puzzle

 
Let's come back to the SYCL accessors, the final piece of the puzzle for our first SYCL program. Accessors may be unfamiliar to GPU programmers, but they have some nice properties compared with other methods. While SYCL allows the programmer to move data explicitly, for example with the copy() method, this is not necessary with accessors: they generate a dataflow graph that the compiler and runtime can use to move data at the right time. This is particularly effective when several kernels are called one after another.

In this case, the SYCL implementation deduces that the data is reused and does not copy it back to the host unnecessarily. Data movement can also be scheduled asynchronously, i.e. overlapped with execution on the device. While experienced GPU programmers can do this by hand, SYCL accessors often lead to better performance than OpenCL programs in which the programmer has to move data explicitly.
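For illustration, a hedged sketch of two kernels chained on the same buffer; through the accessors the runtime sees that d_Z is reused and can keep it on the device between the two submissions (the kernel names scale and shift are assumptions):

q.submit([&](sycl::handler& h) {
    auto Z = d_Z.get_access<sycl::access::mode::read_write>(h);
    h.parallel_for<class scale>(sycl::range<1>{length}, [=](sycl::id<1> i) {
        Z[i] *= 2.0f;
    });
});
q.submit([&](sycl::handler& h) {
    auto Z = d_Z.get_access<sycl::access::mode::read_write>(h);
    h.parallel_for<class shift>(sycl::range<1>{length}, [=](sycl::id<1> i) {
        Z[i] += 1.0f;
    });
});
// no explicit copy between the two kernels is required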

Because programming models that manage memory through pointers have difficulties with SYCL accessors, the USM extension makes accessors unnecessary. While USM puts the burden of data movement and synchronization on the programmer, it improves compatibility with legacy code that uses pointers.

Our first SYCL program

Here are all the components of our SAXPY program in SYCL:
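The full listing is not reproduced here; the following compact sketch, modeled on the buffer-based approach described above (with assumed values for length, the scalar A, and the initial data), shows how the pieces fit together:

#include <cstddef>
#include <iostream>
#include <vector>
#include <CL/sycl.hpp>
namespace sycl = cl::sycl;   // SYCL 1.2.1 places everything in cl::sycl; some implementations already provide this alias

int main() {
    const size_t length = 1024;
    const float A = 2.0f;

    std::vector<float> h_X(length, 1.0f);
    std::vector<float> h_Y(length, 2.0f);
    std::vector<float> h_Z(length, 3.0f);

    sycl::queue q(sycl::default_selector{});

    {   // buffer scope: data is written back to the vectors when the buffers are destroyed
        sycl::buffer<float, 1> d_X{h_X.data(), sycl::range<1>(length)};
        sycl::buffer<float, 1> d_Y{h_Y.data(), sycl::range<1>(length)};
        sycl::buffer<float, 1> d_Z{h_Z.data(), sycl::range<1>(length)};

        q.submit([&](sycl::handler& h) {
            auto X = d_X.get_access<sycl::access::mode::read>(h);
            auto Y = d_Y.get_access<sycl::access::mode::read>(h);
            auto Z = d_Z.get_access<sycl::access::mode::read_write>(h);
            h.parallel_for<class saxpy>(sycl::range<1>{length}, [=](sycl::id<1> it) {
                const size_t i = it[0];
                Z[i] += A * X[i] + Y[i];
            });
        });
        q.wait();
    }

    std::cout << "Z[0] = " << h_Z[0] << std::endl;   // expected: 3 + 2*1 + 2 = 7
    return 0;
}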


The full source code for this example is available in this GitHub repository:

https://github.com/jeffhammond/dpcpp-tutorial

While this program works correctly and can run on many platforms, some users will find it rather verbose. It is also incompatible with libraries and frameworks that need to manage memory through pointers. To address this within SYCL 1.2.1, Intel has added the USM extension to DPC++, which supports pointer-based memory management.

USM supports two important usage models, which we present below. The first provides automatic data movement between host and device; the second is used to move data to and from devices explicitly.

The details can be found in the provisional SYCL 2020 specification. To get started, all you need to know is the following: the q argument is the queue associated with the device on which the allocated data lives (either permanently or temporarily):
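A sketch using the templated USM allocation functions from the provisional SYCL 2020 API (length is an assumption):

// shared allocation: migrates automatically between host and device
float *X = sycl::malloc_shared<float>(length, q);
// device allocation: lives on the device and must be moved explicitly
float *Xd = sycl::malloc_device<float>(length, q);

// ... use the pointers in kernels ...

sycl::free(X, q);
sycl::free(Xd, q);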


When we use device allocation, data must be moved explicitly, for example with the SYCL memcpy method, which behaves like std::memcpy (i.e. the destination comes first):
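A sketch, assuming a device allocation Zd and a host vector h_Z of the same length:

q.memcpy(Zd, h_Z.data(), length * sizeof(float));   // host -> device, destination first
q.wait();
// run kernels that work on Zd here
q.memcpy(h_Z.data(), Zd, length * sizeof(float));   // device -> host
q.wait();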


If we use USM, accessors are no longer required, which means we can simplify the above kernel code:
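With USM pointers X, Y, and Z, the command group might shrink to something like this (a sketch; note that no accessors are created):

q.submit([&](sycl::handler& h) {
    h.parallel_for<class saxpy_usm>(sycl::range<1>{length}, [=](sycl::id<1> it) {
        const size_t i = it[0];
        Z[i] += A * X[i] + Y[i];
    });
});
q.wait();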

 
The complete working examples for both versions of USM can be found in this repo with the names saxpy-usm.cc and saxpy-usm2.cc.

In the meantime, in case you have been wondering why the opaque handler h was required in each of these programs: it turns out that it is ultimately not needed at all. The following equivalent implementation is possible with the provisional SYCL 2020 specification, which also makes kernel lambda names optional. Together, these two small changes make the SYCL kernel the same length as the original C++ loop shown at the beginning of this tutorial:
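A sketch of this shortened form, using the queue shortcut from the provisional SYCL 2020 specification (no handler, no kernel name; X, Y, Z are USM pointers):

q.parallel_for(sycl::range<1>{length}, [=](sycl::id<1> i) {
    Z[i] += A * X[i] + Y[i];
}).wait();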


We started with three lines of code that run sequentially on a CPU. We end with three lines of code running in parallel on CPUs, GPUs, FPGAs, and other devices.

Of course, not everything will be as simple as SAXPY, but at least you now know that SYCL does not make simple things hard, and that it builds on a range of modern C++ features and universal concepts such as "parallel for" rather than introducing new things that have to be learned first.


is "Principal Engineer" of the Intel Data Center Group.