Hello. And welcome!
Connex-S is a wide vector (array) accelerator, with 128 or more lanes, similar in spirit to NVIDIA GPUs or to the Intel x86 AVX and ARM NEON SIMD units. It is an educational vector processor, in the same vein as the (V-)DLX processor: Politehnica University of Bucharest uses it in the Functional Electronics course and offers a Verilog simulator for it. It is essentially an open architecture: see the Verilog code available in the Functional Electronics course.
Very important: if you want to try out the Connex-S vector processor, please download the OPINCAA library, which contains, besides the Connex-S assembler (which is very easy to learn), a small, simple-to-use simulator for Connex-S written in C++.
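To give a flavor of the programming model, here is a minimal, illustrative OPINCAA-style kernel. Caution: the macro and register names below (BEGIN_KERNEL, EXECUTE_IN_ALL, R1..R3, LS) are assumptions based on our OPINCAA papers, not verbatim library API; please consult the examples shipped with the library for the authoritative syntax.

```cpp
// Illustrative OPINCAA-style kernel (names are assumed, see the note above):
// element-wise addition of two rows of the local store (LS), one i16 per lane.
// Include the OPINCAA headers from the distribution before using this.

void defineVecAddKernel() {
    BEGIN_KERNEL("vec_add");      // start recording the kernel
    EXECUTE_IN_ALL(               // execute on all lanes (no mask)
        R1 = LS[0];               // load the first operand row
        R2 = LS[1];               // load the second operand row
        R3 = R1 + R2;             // lane-wise i16 addition
        LS[2] = R3;               // store the result row
    )
    END_KERNEL("vec_add");        // done; the kernel is JIT-assembled at run time
}
```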
Please use the mailing list associated with this site: https://groups.google.com/forum/#!forum/connex-tools .
Note: if you have extra time, please take a look at the webpage (and this paper) describing the very interesting Connex memory, a different project, which is not directly related to the Connex processor.
ISA
The Connex-S Instruction Set Architecture (ISA) is presented in Table 1; a complete description is available in [Gheorghe M. Ştefan. The Connex Instruction Set Architecture, 2015], which can be found, for example, in the OPINCAA library distribution.
Table 1. The Connex-S instructions with OPINCAA syntax, with the usual mnemonics in bold. R(d) is an arbitrary destination register and R(s1) is the first source register of a binary operator (d, s1, s2 ∈ {0..31}). All instructions take vector operands of element type i16, unless otherwise specified. imm is the immediate constant operand, with imm ∈ {−32768..32767} unless otherwise specified. In the first column, the instruction category, inspired by the scan vector model [Guy E. Blelloch. Vector Models for Data-parallel Computing. MIT Press, Cambridge, MA, USA, 1990], is given in parentheses.
Documentation
Further documentation can be found in the PhD thesis of the author.
We also published an ACM Transactions on Embedded Computing Systems (TECS) paper - you can find the PDF here or here. We also have a WPMVP (Workshop on Programming Models for SIMD/Vector Processing) 2019 paper - you can find the PDF here or here.
Examples of compilation
Coming soon: an online compiler.
Benchmarks used
We use standard benchmarks of different input sizes:
- MatMul, matrix multiplication,
- SSD, Sum of Squared Differences, and
- SAD, Sum of Absolute Differences, used in embedded Computer Vision applications (see the scalar reference sketch after this list).
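For readers unfamiliar with these kernels, below is a plain scalar C++ reference of what SSD and SAD compute; this is our own sketch for clarity, not the benchmark code itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>  // std::abs

// Scalar reference: Sum of Squared Differences of two i16 arrays.
int64_t ssd(const int16_t *a, const int16_t *b, size_t n) {
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i) {
        int32_t d = int32_t(a[i]) - int32_t(b[i]);
        acc += int64_t(d) * d;   // widen before multiplying to avoid overflow
    }
    return acc;
}

// Scalar reference: Sum of Absolute Differences of two i16 arrays.
int64_t sad(const int16_t *a, const int16_t *b, size_t n) {
    int64_t acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += std::abs(int32_t(a[i]) - int32_t(b[i]));
    return acc;
}
```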
Also, we compiled some of the Polybench suite tests, namely:
- Covar (covariance),
- Correl (correlation).
You can find a benchmark repository (with examples of compilation) here.
The performance model
The performance model for the Connex-S processor that we present in the paper is accurate: it has an error of less than 3.4% w.r.t. measurements on the actual system with the Connex-S processor.
Experimental Results
We presented Figure 1 in our paper submission, with a few experiments for all the types supported by our back end. The reason we obtain very good results for SAD.f16 (compared to SSD.f16) is that we heavily optimize the emulation of absolute value (more exactly, of the operations x < 0 and fneg x, where fneg is floating-point negation) for _Float16.
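To see why this optimization is cheap, recall that in the IEEE 754 binary16 format the sign is a single bit (bit 15), so fneg, fabs and the x < 0 test all reduce to bitwise integer operations on the raw 16-bit pattern held in a lane. The snippet below is our own illustration of this identity, not the generated back-end code.

```cpp
#include <cstdint>

// IEEE 754 binary16: bit 15 is the sign bit. On the raw bit pattern,
// fneg flips bit 15 and fabs clears it. Since Connex-S keeps f16 values
// in 16-bit integer lanes, each of these is a single lane-wise ALU op,
// much cheaper than a full floating-point compare-and-select sequence.
inline uint16_t f16_neg(uint16_t x) { return x ^ 0x8000u; }
inline uint16_t f16_abs(uint16_t x) { return x & 0x7FFFu; }

// The test "x < 0" is just a sign-bit test (NaN inputs aside).
inline bool f16_is_negative(uint16_t x) { return (x & 0x8000u) != 0; }
```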
Figure 1. Semi-log plot with speedups of the benchmarks tiled on Connex-S with 128 lanes, at 100 MHz, w.r.t. the dual-core ARM Cortex-A9 at 667 MHz with 2x128-bit NEON SIMD (not tiled, since GCC for ARM does not tile automatically).
We can run the same generated C++ OPINCAA program on Connex-S machines of different widths, since the number of lanes is an OPINCAA program environment variable, CVL, and we perform JIT vector assembling. In Figure 2, we present the speedups achieved by the generated OPINCAA programs when running on Connex-S processors of different widths. The performance of the benchmarks varies as the width of the machine increases, a trend that is not linear but asymptotic, due to the large communication overhead and the overhead of the prologue, epilogue and scalar code of the loop.
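As a sketch of how a width-generic program can be organized around CVL, the snippet below strip-mines a loop by the lane count; reading CVL with getenv is only our assumption about the mechanism (CVL may equally be a variable of the OPINCAA runtime), and the default of 128 lanes is illustrative.

```cpp
#include <cstddef>
#include <cstdlib>   // std::getenv, std::strtoul

// Number of vector iterations (strips) needed to cover n elements when
// the machine width is taken from the CVL environment variable.
// Assumed mechanism: CVL read via getenv; default to 128 lanes.
size_t numVectorIterations(size_t n) {
    const char *cvl = std::getenv("CVL");
    size_t lanes = cvl ? std::strtoul(cvl, nullptr, 10) : 128;
    return (n + lanes - 1) / lanes;   // full strips plus a partial tail
}
```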
Figure 2. Semi-log plot with the speedups of the i16 tiled benchmarks on the Connex-S processor with a number of lanes between 32 and 1024, clocked at 100 MHz, w.r.t. the dual-core ARM Cortex-A9 at 667 MHz with NEON SIMD. Note that the experiments with 256, 512 and 1024 lanes are estimated (with very good precision), since the Zynq FPGA does not accommodate these large designs. All benchmarks have a memory footprint on Connex-S of 1 MB, 4 times larger than the 256 KB capacity of the LS memory.
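Because the 1 MB footprint exceeds the 256 KB local store (LS), the data must be streamed through the LS tile by tile. Below is a generic, sequential sketch of such tiling; the names are ours (runKernelOnTile is a hypothetical helper), and the real code also overlaps the ARM-Connex transfers with computation.

```cpp
#include <algorithm>  // std::min
#include <cstddef>
#include <cstdint>

// Stream an i16 array whose footprint exceeds the local store through
// LS-sized tiles. runKernelOnTile stands for: DMA the tile into the LS,
// then launch the Connex-S kernel on it.
constexpr size_t LS_BYTES   = 256 * 1024;
constexpr size_t TILE_ELEMS = LS_BYTES / sizeof(int16_t);  // 128K i16 elements

void processTiled(const int16_t *data, size_t n,
                  void (*runKernelOnTile)(const int16_t *, size_t)) {
    for (size_t off = 0; off < n; off += TILE_ELEMS) {
        size_t len = std::min(TILE_ELEMS, n - off);
        runKernelOnTile(data + off, len);
    }
}
```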
The following experiments were presented in a different paper of ours [give citation].
In Figure 3, all benchmarks employ arrays with elements of the native type i16, and also of the emulated types i32 and f16 where appropriate. The benchmarks perform:
- dot product, on arrays of 64K i16 or f16 elements, or 32K i32 elements;
- sum-reduction of the population counts of words (CtPop-Reduce), on an array of 128K elements of type i16 (running it on i32 is less efficient);
- matrix multiplication (MatMul), for sizes 128x128, 170x170 and 256x256, with the second matrix already transposed to allow better vectorization;
- Sum of Squared Differences (SSD) and Sum of Absolute Differences (SAD), standard functions used in computer vision, computing statistics for all pairs from two groups of 64, 64 or 32 collections of 1024 i16, f16 or i32 elements, respectively;
- Covar-128, the Polybench covariance benchmark, for input data of size 128x128 elements.
All these kernels, except MatMul-128/170/256, have input data of 256 KB, the size of the LS memory. MatMul-256 can support i32 only with manual tiling, because its memory footprint is larger than the 256 KB of the SPM. Note that, w.r.t. one ARMv7 core, manual optimization of the i16 tests can yield an average extra speedup of 1.077x on Connex-S, with a maximum of 1.316x.
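The remark about the transposed second matrix matters for vectorization: with B stored transposed, both operands of each inner product are read with unit stride, which maps directly onto wide vector loads followed by a sum-reduction. The scalar sketch below is our own illustration, not the benchmark source.

```cpp
#include <cstddef>
#include <cstdint>

// MatMul with the second matrix pre-transposed (BT[j*n+k] == B[k*n+j]):
// the innermost loop touches A[i][*] and BT[j][*] contiguously, so a
// vectorizer can turn it into wide loads, a lane-wise multiply-add, and
// one final reduction per output element.
void matmulBT(const int16_t *A, const int16_t *BT, int32_t *C, size_t n) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            int32_t acc = 0;                        // i32 accumulator
            for (size_t k = 0; k < n; ++k)          // unit stride in both operands
                acc += int32_t(A[i * n + k]) * int32_t(BT[j * n + k]);
            C[i * n + j] = acc;
        }
}
```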
Figure 3. Semi-log plot with speedups of the benchmarks on Connex-S with 128 lanes, at 100 MHz, w.r.t. the dual-core ARM Cortex-A9 at 667 MHz with 2x128-bit NEON SIMD.
We can run the same generated C++ OPINCAA program from one of the above benchmarks on Connex-S machines of different widths, since we represent the number of lanes as the program variable CVL. We also noticed that the time to compile a program with a Connex-S back end increases when we use a larger width. Therefore, generic OPINCAA programs can be compiled with a single Connex-S back end, possibly of a smaller width, to keep the compilation time small.
In Figure 4, we present the speedups achieved by the generated OPINCAA programs when running on Connex-S processors of different widths. The performance of the benchmarks varies as the width of the machine increases, a trend that is not proportional, due to the large communication overhead and the overhead of the prologue, epilogue and scalar code of the loop.
Figure 4. Semi-log plot with the speedups of the i16 benchmarks on Connex-S with a number of lanes between 32 and 1024, clocked at 100 MHz, w.r.t. the dual-core ARM Cortex-A9 at 667 MHz with NEON SIMD. Note that the experiments with 256, 512 and 1024 lanes are estimated, since the Zynq FPGA does not accommodate these large designs.
The CodeGen Tool for LLVM’s Instruction Selection Pass from Connex-S Assembly Code
Our Connex-S processor has no native support for floating-point or 32-bit integer arithmetic.
To emulate these arithmetic operations efficiently, we automatically generate custom C++ code for the LLVM Instruction Selection (ISel) pass, which is then included in the Connex-S back end. This code generation is performed with our OPINCAA library.
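For intuition about what such emulation involves, here is our own scalar sketch of a 32-bit addition built from 16-bit halves with explicit carry propagation; the actual lowering rules are the generated ISel C++ code, and the real instruction sequence may differ (e.g., by using a hardware add-with-carry if available).

```cpp
#include <cstdint>

// Emulate a 32-bit add on a machine with 16-bit lanes: each i32 value is
// split into low/high halves (held in two vector registers on Connex-S),
// and the carry out of the low half is propagated into the high half.
void add_i32_via_i16(uint16_t aLo, uint16_t aHi,
                     uint16_t bLo, uint16_t bHi,
                     uint16_t &rLo, uint16_t &rHi) {
    uint32_t lo = uint32_t(aLo) + uint32_t(bLo);
    uint16_t carry = uint16_t(lo >> 16);            // 0 or 1
    rLo = uint16_t(lo);
    rHi = uint16_t(aHi + bHi + carry);              // wraps like the hardware would
}
```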
Example:
- go to folder opincaalib/examples/Emulate_f16/ADD_SUB_f16_manual/
- run:
make CodeGen_ISel
Note: The source code of our Connex-S processor LLVM back end can be found, for example, at: https://reviews.llvm.org/D97783, https://reviews.llvm.org/D97638, https://reviews.llvm.org/D60052 .
Note: A great polyhedral modelling and optimization tutorial can be found here.