
Measuring launch overheads with OneAPI and OpenMP on CPUs and GPUs

The basic idea behind this short test is to measure the overhead of launching parallel operations (OpenMP parallel regions, OpenMP target offload, SYCL kernels). All codes are available at github.com/reguly/tests. We do this using the following basic template:

  for (int imax = 1<<10; imax <= 1<<28; imax = imax<<1) {
    auto start0 = std::chrono::steady_clock::now();
    for (int iter = 0; iter < 100; iter++) {
      // the first iteration is a warm-up and is not timed
      if (iter == 1) start0 = std::chrono::steady_clock::now();
#pragma omp parallel for
      for (int i = 0; i < imax; i++)
        A[i]   += 1.0;
    }

    auto start = std::chrono::steady_clock::now();
    std::chrono::duration<double> elapsed_seconds = start-start0;
    std::cout << "Size "<< imax*4 <<" Time: " << elapsed_seconds.count() << "s\n";
    // 2x for read+write; 99 timed iterations (the first is excluded)
    std::cout << "Achieved bandwidth " << (double)imax * sizeof(float) * 2 * 99 / elapsed_seconds.count() / 1000000000.0 <<"\n";
  }

Since some versions (such as SYCL, which compiles kernels just-in-time) have significant overheads on the first iteration, we do not time it - hence the factor of 99 (the number of timed iterations) in the bandwidth calculation.

For OpenMP offload:

#pragma omp target data map(A[0:sizemax])
  for (int imax = 1<<10; imax <= sizemax; imax = imax<<1) {
    auto start0 = std::chrono::steady_clock::now();
    for (int iter = 0; iter < 100; iter++) {
      if (iter == 1) start0 = std::chrono::steady_clock::now();
      // set boundary conditions
#pragma omp target teams distribute parallel for schedule(static,1)
      for (int i = 0; i < imax; i++)
        A[i]   += 1.0;
    }
    ...

And for SYCL:

    gpu_selector device_selector;
    queue q(device_selector);
    buffer  Ab(A, range(sizemax));

    for (int imax = 1<<10; imax <= sizemax; imax = imax<<1) {
      auto start0 = std::chrono::steady_clock::now();
      for (int iter = 0; iter < iters; iter++) {
        if (iter == 1) {q.wait_and_throw(); start0 = std::chrono::steady_clock::now();}

        // set boundary conditions
        q.submit([&](handler &h) {
            accessor A(Ab, h);
            h.parallel_for(range<1>(imax),[=](id<1> it) {
                int i = it.get(0);
                A[i]   += 1.0;
                });});
#ifdef WAIT
        q.wait_and_throw();
#endif
      }
      q.wait_and_throw();
      ...

Note that in SYCL the launch is asynchronous by default, so the CPU has the option of waiting for the operation to complete or carrying on. In terms of launch overheads, this means that if the CPU waits for each kernel to complete after launching it, the launch overhead is fully exposed, whereas if it does not wait, the launch may be overlapped with the execution of the previously launched kernel. Similar functionality might be achievable for OpenMP offload, but I am not sure how to combine the pragmas to do this.

Our expectation is that SYCL with wait between launches will be about the same speed as OpenMP offload, and SYCL without wait between launches will be faster, at least for smaller sizes where launch overheads are proportionally significant.

i5-6500k
The i5-6500 figure shows runtimes for 99 repeated launches on both the CPU and the integrated GPU. Plain OpenMP performs best - the runtime is highly tuned for spawning parallel regions with very little overhead. Still, at smaller sizes, time scales sublinearly with size. At the opposite end, OpenMP offload on the integrated GPU and SYCL with wait between kernel launches on the integrated GPU scale almost exactly the same, confirming the fixed overhead of offloading to the GPU. OpenMP offload code running on the CPU incurs an extra overhead compared to plain OpenMP. SYCL running on the GPU without waiting between kernels shows a significant improvement at smaller sizes - this is the effect of overlapping kernel launches on the CPU with compute on the GPU. SYCL running on the CPU still has considerable overhead compared to OpenMP - likely due to having to go through a driver. The difference between waiting and not waiting between kernel launches on the CPU is relatively small, because the same hardware is launching and executing the kernels.

Moving to the i9-10920X and the Iris Xe MAX GPU shows results that are less clear-cut. Overall, the overheads at smaller sizes are larger than on the i5-6500k - this may be due either to a difference in setup between the two systems, or to this system having more cores. On the CPU, OpenMP is still the fastest, with SYCL without waiting a close second - the overhead of kernel launches is only significant at really small sizes (below 65KB). Up to 16MB array sizes we see good cache locality (this CPU has 19 MB of L3 cache). OpenMP offload running on the CPU is an obvious outlier; at this point I am not sure why. SYCL with waiting between kernels and OpenMP offload perform similarly, and SYCL without waiting is faster at smaller sizes, as expected. What is surprising, though, is that at very small sizes SYCL without waiting is even faster than native OpenMP running on the CPU, despite the kernels executing on separate hardware - this warrants further investigation.

Finally, looking at an older Xeon E5-2640 v3 and an NVIDIA P100, we see similar results as the i5, except the launch latency for the CPU appears to be higher than for the NVIDIA GPU. The OpenMP offload version running on the CPU is still performing poorly, as in the case of the i9.

So, why is this relevant? We are working on adding SYCL support to the OPS DSL for multi-block structured meshes. One of our key benchmark applications is CloverLeaf, a hydrodynamics application for capturing shocks; it is representative of large 3D finite volume/difference codes. We are looking at two problem sizes: 96^3, with a total memory footprint of ~240MB, and 256^3, with a total memory footprint of ~4.5GB. Here are some results:

i5-6500 (timings in sec)

| Version      | OMP   | SYCL  |
|--------------|-------|-------|
| Wait 256^3   | 24.99 | 25.67 |
| Nowait 256^3 | 25.05 | 24.87 |
| Wait 96^3    | 1.40  | 2.51  |
| Nowait 96^3  | 1.39  | 2.27  |

i9-10920X (timings in sec)

| Version      | OMP   | SYCL  |
|--------------|-------|-------|
| Wait 256^3   | 27.40 | 26.19 |
| Nowait 256^3 | 27.39 | 25.65 |
| Wait 96^3    | 1.41  | 2.97  |
| Nowait 96^3  | 1.41  | 2.67  |

As you can see, for the larger problem there is virtually no overhead in total time - SYCL is sometimes even faster than OpenMP. However, for the smaller problem size the slowdown is quite significant. The main reason for this is the boundary condition computations: a large number of very small loops, operating only on the boundary faces. Here are timings for this step:

i5-6500 update halo (timings in sec)

| Version    | OMP  | SYCL |
|------------|------|------|
| Wait 256^3 | 1.00 | 3.92 |
| Wait 96^3  | 0.11 | 1.78 |

i9-10920X update halo (timings in sec)

| Version    | OMP  | SYCL |
|------------|------|------|
| Wait 256^3 | 0.59 | 3.2  |
| Wait 96^3  | 0.09 | 2.52 |

So clearly, these kernels are in a region where the difference between OpenMP and SYCL with wait on the CPU is large. On the plot for the i5-6500 above, there is an obvious region with a 10x difference, which then slowly narrows - this explains the 3.9-16x slowdown of this operation. For the i9-10920X the tests showed smaller differences between OpenMP and SYCL with wait on the CPU, yet CloverLeaf's update halo actually shows even larger differences than on the i5-6500. This discrepancy also needs further investigation. (It might be that while the data fits in cache for our tests, it does not for the actual application, and there are strided memory accesses happening.)
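To make the halo-update problem concrete, here is an illustrative boundary loop (my own sketch, not CloverLeaf's actual code): on an N^3 grid, each face is a separate small parallel loop over only N*N points, so for N = 96 each launch touches ~9,200 points (~36 KB of floats) - squarely inside the launch-overhead-dominated region measured above.

```cpp
#include <vector>

// Illustrative halo/boundary update on an N^3 grid: each face is its
// own tiny parallel loop over N*N points, so launch overhead dominates
// at small N. Only the two k-direction faces are shown; the other two
// directions are analogous.
void update_halo(std::vector<float> &A, int N) {
  auto idx = [N](int i, int j, int k) { return (i * N + j) * N + k; };
#pragma omp parallel for
  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      A[idx(i, j, 0)]     = A[idx(i, j, 1)];     // copy from first interior plane
      A[idx(i, j, N - 1)] = A[idx(i, j, N - 2)]; // copy from last interior plane
    }
}
```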