Simple Operations on Vectors & Complex Numbers
In this oneAPI project, we re-implement the inner product of simple vectors and the multiplication of complex numbers. In fact, the latter can be seen as an extension of the inner product of simple vectors. We hope to further extend these results to operations on matrices.
Project status: Under Development
Overview / Usage
Linear algebra is an important branch of mathematics whose main objects of study are vectors and matrices. In quantum computing, linear algebra plays an important role and is regarded as the "language" of quantum computing, because vector and matrix algebraic structures can be used to represent quantum states and the corresponding operations.
Thus, we focus on basic operations on vectors and complex numbers.
We can use the existing programs in the oneAPI_Intro module to achieve our goal.
Methodology / Approach
DPC++ programs are standard C++. The program is invoked on the host computer and offloads computation to the accelerator. You will use DPC++'s queue, buffer, device, and kernel abstractions to direct which parts of the computation and data should be offloaded.
The DPC++ compiler and the oneAPI libraries automate the tedious and error-prone aspects of compute and data offload, but still allow you to control how computation and data are distributed for best performance. The compiler knows how to generate code for both the host and the accelerator, how to launch computation on the accelerator, and how to move data back and forth.
In the programs below you will use a data parallel algorithm with DPC++ to leverage the computational power of heterogeneous computers. The DPC++ platform model includes a host computer and a device. The host offloads computation to the device, which could be a GPU, FPGA, or multi-core CPU.
As a first step in a DPC++ program, create a queue. Offload computation to a device by submitting tasks to the queue. You can choose a CPU, GPU, FPGA, or other device through a selector. This program uses a default queue q, which means the DPC++ runtime selects the most capable device available at runtime using the default selector. You will learn more about devices, device selectors, and the concepts of buffers, accessors, and kernels in the upcoming modules, but here is a simple DPC++ program to get you started.
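A minimal sketch of such a starter program, assuming a SYCL 2020-style compiler (the header name and selector syntax vary slightly between DPC++ versions):

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
  // The default selector picks the most capable device available at runtime.
  sycl::queue q;
  std::cout << "Running on: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  return 0;
}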
The device and host can either share physical memory or have distinct memories. When the memories are distinct, offloading computation requires copying data between host and device. DPC++ does not require you to manage the data copies: by creating buffers and accessors, DPC++ ensures that the data is available to host and device without any effort on your part. DPC++ also allows you explicit control over data movement to achieve best performance.
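As a hedged illustration of this implicit data movement (the vector name data and the doubling kernel are ours, not the project's):

#include <sycl/sycl.hpp>
#include <vector>

int main() {
  std::vector<float> data(8, 1.0f);
  sycl::queue q;
  sycl::buffer buf(data); // Hands the host data to the SYCL runtime.
  q.submit([&](sycl::handler &h) {
    sycl::accessor a(buf, h, sycl::read_write);
    // Double every element on the device; the runtime copies data as needed.
    h.parallel_for(sycl::range<1>(data.size()),
                   [=](sycl::id<1> i) { a[i] *= 2.0f; });
  });
  // host_accessor waits for the device work, then exposes the data on the host.
  sycl::host_accessor result(buf, sycl::read_only);
  return result[0] == 2.0f ? 0 : 1;
}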
In a DPC++ program, we define a kernel, which is applied to every point in an index space. For simple programs like this one, the index space maps directly to the elements of the array. The kernel is encapsulated in a C++ lambda function. The lambda function is passed a point in the index space as an array of coordinates. For this simple program, the index space coordinate is the same as the array index. The parallel_for in the program below applies the lambda to the index space. The index space is defined in the first argument of the parallel_for as a one-dimensional range from 0 to N-1.
The parallel_for is nested inside another lambda function, which is passed as an argument where we submit to the queue in the program below. The DPC++ runtime invokes the outer lambda when the accelerator connected to the queue is ready. The handler argument to the lambda allows operations inside it to define their data dependencies with other computation that may be executed on the host or on devices. You will see more of this in later modules.
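The sketch below shows how these pieces nest (N and the fill kernel are illustrative): the outer lambda is the command group submitted to the queue, and the inner lambda is the kernel that parallel_for applies across the index space:

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  constexpr size_t N = 16;
  std::vector<int> data(N, 0);
  sycl::queue q;
  {
    sycl::buffer buf(data);
    q.submit([&](sycl::handler &h) { // Outer lambda: the command group.
      sycl::accessor a(buf, h, sycl::write_only);
      // Inner lambda: the kernel, applied to every index i in [0, N).
      h.parallel_for(sycl::range<1>(N),
                     [=](sycl::id<1> i) { a[i] = static_cast<int>(i[0]); });
    });
  } // Leaving this scope destroys the buffer and writes results back to data.
  for (size_t i = 0; i < N; ++i) std::cout << data[i] << " ";
  std::cout << "\n";
  return 0;
}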
Finally, the program does a q.wait() on the queue. The earlier submit operation queues up an operation to be performed at a later time and immediately returns. If the host wants to see the result of the computation, it must wait for the work to complete with a wait. Sometimes the device will encounter an error; q.wait_and_throw() is a way for the host to capture and handle an error that has happened on the device.
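A minimal sketch of this error-handling pattern (the handler body here simply prints the message; a real program might do more):

#include <sycl/sycl.hpp>
#include <exception>
#include <iostream>

int main() {
  // An asynchronous-exception handler: errors raised on the device are
  // delivered to this lambda when the host calls q.wait_and_throw().
  auto exception_handler = [](sycl::exception_list e_list) {
    for (const std::exception_ptr &e : e_list) {
      try {
        std::rethrow_exception(e);
      } catch (const sycl::exception &ex) {
        std::cerr << "Caught SYCL exception: " << ex.what() << "\n";
      }
    }
  };

  sycl::queue q(sycl::default_selector_v, exception_handler);
  q.submit([&](sycl::handler &h) {
    h.single_task([=]() { /* device work would go here */ });
  });
  q.wait_and_throw(); // Block, then surface any device-side errors on the host.
  return 0;
}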
Building on these DPC++ techniques, we explain the main parts of our code below.
For the inner product of vectors, inspect the code: it presents one input buffer (vector1) for which SYCL buffer memory is allocated. The accessor associated with it, vector1_accessor, set to read/write mode, gives access to the contents of the buffer. Then we add another input vector, vector2, and print it out. After that, we add another SYCL buffer, vector2_buffer, and an accessor for vector2_buffer as well. Then we calculate the inner product of the given vectors.
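The project's exact source is not reproduced here, but a minimal sketch of the steps just described might look like this (names such as vector1_buffer and num_elements follow the description above; finishing the reduction with a host-side sum is one simple choice):

#include <sycl/sycl.hpp>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
  constexpr size_t num_elements = 8;
  std::vector<int> vector1(num_elements, 2);
  std::vector<int> vector2(num_elements, 3);
  std::vector<int> products(num_elements, 0);

  sycl::queue q;
  {
    sycl::buffer vector1_buffer(vector1);
    sycl::buffer vector2_buffer(vector2);
    sycl::buffer product_buffer(products);

    q.submit([&](sycl::handler &h) {
      sycl::accessor v1(vector1_buffer, h, sycl::read_only);
      sycl::accessor v2(vector2_buffer, h, sycl::read_only);
      sycl::accessor out(product_buffer, h, sycl::write_only);
      // Multiply the two vectors element by element on the device.
      h.parallel_for(sycl::range<1>(num_elements),
                     [=](sycl::id<1> i) { out[i] = v1[i] * v2[i]; });
    });
  } // Buffer destructors copy the products back to the host vector.

  // Sum the element-wise products on the host to get the inner product.
  int inner_product = std::accumulate(products.begin(), products.end(), 0);
  std::cout << "inner product = " << inner_product << "\n"; // 8 * 2 * 3 = 48
  return 0;
}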
For the multiplication of complex numbers, we first query for a custom device specific to a vendor; if it is a GPU device from that vendor, we give it the highest rating, 3. The second preference is given to any GPU device and the third to a CPU device. We define in_vect1 and in_vect2, the vectors of num_elements complex numbers that are the inputs to the parallel function.
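A hedged sketch of such a vendor-rating selector, written as a SYCL 2020 callable (the project may instead derive from the older device_selector class):

#include <sycl/sycl.hpp>
#include <iostream>
#include <string>

// Rate devices as described above: 3 for a GPU from the requested vendor,
// 2 for any other GPU, 1 for a CPU; negative ratings are never selected.
int custom_selector(const sycl::device &dev) {
  std::string vendor = dev.get_info<sycl::info::device::vendor>();
  if (dev.is_gpu() && vendor.find("Nvidia") != std::string::npos) return 3;
  if (dev.is_gpu()) return 2;
  if (dev.is_cpu()) return 1;
  return -1;
}

int main() {
  sycl::queue q(custom_selector);
  std::cout << "Selected: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  return 0;
}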
Then we set up the input buffers and the output buffer, and submit a command group function object to the queue. The accessors for the input buffers are set to read mode and the accessor for the output buffer to write mode. Finally, we compare the two output vectors, from the parallel and the scalar versions; they should be equal.
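A minimal sketch of this kernel and the parallel-versus-scalar check, with sycl::float2 standing in for the project's Complex2 class (x() holds the real part, y() the imaginary part):

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  constexpr size_t num_elements = 4;
  std::vector<sycl::float2> in_vect1(num_elements, sycl::float2{2.f, 3.f});
  std::vector<sycl::float2> in_vect2(num_elements, sycl::float2{4.f, 5.f});
  std::vector<sycl::float2> out_parallel(num_elements, sycl::float2{0.f, 0.f});

  sycl::queue q;
  {
    sycl::buffer b1(in_vect1), b2(in_vect2), b3(out_parallel);
    q.submit([&](sycl::handler &h) {
      sycl::accessor a1(b1, h, sycl::read_only);
      sycl::accessor a2(b2, h, sycl::read_only);
      sycl::accessor a3(b3, h, sycl::write_only);
      // (a+bi)(c+di) = (ac - bd) + (ad + bc)i, one product per work-item.
      h.parallel_for(sycl::range<1>(num_elements), [=](sycl::id<1> i) {
        float a = a1[i].x(), b = a1[i].y();
        float c = a2[i].x(), d = a2[i].y();
        a3[i] = sycl::float2{a * c - b * d, a * d + b * c};
      });
    });
  }

  // Scalar check on the host: should match the device results exactly.
  for (size_t i = 0; i < num_elements; ++i) {
    float a = in_vect1[i].x(), b = in_vect1[i].y();
    float c = in_vect2[i].x(), d = in_vect2[i].y();
    std::cout << "(" << out_parallel[i].x() << ", " << out_parallel[i].y()
              << "i) vs (" << a * c - b * d << ", " << a * d + b * c << "i)\n";
  }
  return 0;
}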
In the main function, we first declare the input and output vectors of the Complex2 class and initialize them: the inputs are initialized as the code shows, and the outputs are initialized with 0.
We pass in the name of the vendor whose device we want to query; here we choose "Nvidia".
The queue constructor is then passed the exception handler, and we call DpcppParallel with the required inputs and output. Finally, we print the outputs of the parallel function.
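Stitching the pieces together, a hedged end-to-end sketch of this main flow (DpcppParallel here is our stand-in for the project's function of the same name, again using sycl::float2 in place of Complex2):

#include <sycl/sycl.hpp>
#include <exception>
#include <iostream>
#include <string>
#include <vector>

// Vendor-based rating, as sketched earlier.
int custom_selector(const sycl::device &dev) {
  std::string vendor = dev.get_info<sycl::info::device::vendor>();
  if (dev.is_gpu() && vendor.find("Nvidia") != std::string::npos) return 3;
  if (dev.is_gpu()) return 2;
  if (dev.is_cpu()) return 1;
  return -1;
}

// Stand-in for the project's DpcppParallel: multiply complex numbers
// (sycl::float2: x() real, y() imaginary) element by element on the device.
void DpcppParallel(sycl::queue &q, std::vector<sycl::float2> &v1,
                   std::vector<sycl::float2> &v2,
                   std::vector<sycl::float2> &out) {
  sycl::buffer b1(v1), b2(v2), b3(out);
  q.submit([&](sycl::handler &h) {
    sycl::accessor a1(b1, h, sycl::read_only);
    sycl::accessor a2(b2, h, sycl::read_only);
    sycl::accessor a3(b3, h, sycl::write_only);
    h.parallel_for(sycl::range<1>(v1.size()), [=](sycl::id<1> i) {
      a3[i] = sycl::float2{a1[i].x() * a2[i].x() - a1[i].y() * a2[i].y(),
                           a1[i].x() * a2[i].y() + a1[i].y() * a2[i].x()};
    });
  });
  q.wait_and_throw(); // Surface any device errors before returning.
}

int main() {
  // Minimal handler: rethrow (and thus terminate); a real one would recover.
  auto exception_handler = [](sycl::exception_list e_list) {
    for (const std::exception_ptr &e : e_list) std::rethrow_exception(e);
  };
  std::vector<sycl::float2> in_vect1(4, sycl::float2{1.f, 2.f});
  std::vector<sycl::float2> in_vect2(4, sycl::float2{3.f, 4.f});
  std::vector<sycl::float2> out_vect(4, sycl::float2{0.f, 0.f});

  sycl::queue q(custom_selector, exception_handler);
  DpcppParallel(q, in_vect1, in_vect2, out_vect);
  for (auto &c : out_vect)
    std::cout << "(" << c.x() << ", " << c.y() << "i)\n";
  return 0;
}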