AVX

AVX (Advanced Vector Extensions) is an extension to the x86 instruction set architecture (ISA) proposed by Intel. It lets a single instruction operate on multiple values at once. For example, VADDPS with 256-bit registers adds 8 pairs of single-precision floating-point numbers in one instruction.

Read the AVX documents [pdf] and [pdf], and understand the semantics of the AVX instructions (e.g. VMOVUPS, VMULPS, VADDPS). All the AVX instructions are listed at the end of the first pdf, and the semantics of each instruction can be found in the second pdf. The following is an excerpt describing the semantics of VMOVUPS.

Instr.	Op / En	64/32-bit mode support	CPUID Feature Flag	Description
VMOVUPS ymm1, ymm2/m256	RM	V / V	AVX	Move unaligned packed single-precision floating-point from ymm2/mem to ymm1.
VMOVUPS ymm2/m256, ymm1	MR	V / V	AVX	Move unaligned packed single-precision floating-point from ymm1 to ymm2/mem.
VMOVUPS xmm1 {k1}{z}, xmm2/m128	FVM-RM	V / V	AVX512VL AVX512F	Move unaligned packed single-precision floating-point values from xmm2/m128 to xmm1 using writemask k1.
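
To make the first two rows of the table concrete, here is a minimal C sketch of the load form (memory to ymm) and the store form (ymm to memory) using the intrinsics `_mm256_loadu_ps` and `_mm256_storeu_ps`, which compilers typically translate to VMOVUPS. The function names `copy8_avx` and `copy8_unaligned` are hypothetical, and the sketch assumes GCC or Clang (it relies on `__attribute__((target("avx")))` and `__builtin_cpu_supports`); on a CPU without AVX it falls back to `memcpy`.

```c
#include <assert.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Built for AVX even without -mavx (GCC/Clang-specific attribute). */
__attribute__((target("avx")))
static void copy8_avx(const float *src, float *dst)
{
    __m256 v = _mm256_loadu_ps(src);  /* load form:  ymm <- m256, no alignment needed */
    _mm256_storeu_ps(dst, v);         /* store form: m256 <- ymm, no alignment needed */
}
#endif

/* Hypothetical helper: copies 8 floats (256 bits) between arbitrary addresses. */
void copy8_unaligned(const float *src, float *dst)
{
    assert(src != NULL && dst != NULL);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx")) {  /* runtime check for AVX */
        copy8_avx(src, dst);
        return;
    }
#endif
    memcpy(dst, src, 8 * sizeof(float));  /* portable fallback */
}
```

The "unaligned" in the description means that neither pointer has to be a multiple of 32 bytes, so the helper can be called on any offset into a float array.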

Jump-start task

Assuming a CPU that supports AVX, write assembly programs using AVX, in both 256-bit and 512-bit versions (the 512-bit version requires AVX-512 support), that perform the following operations:

Addition of two vectors in assembly code
Inner product of two matrices in assembly code

For example, the addition of two vectors implemented in AVX should have the same functionality as the following C code:

// a, b and c are arrays of floats with the same length, which is larger than 100,000.
// len is that common length.
void add(float *a, float *b, float *c, int len) {
    for (int i = 0; i < len; i++)
        c[i] = a[i] + b[i];
}
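
As a starting point for the 256-bit version, the loop above could be expressed with AVX intrinsics roughly as follows. This is only a sketch in C, not the required assembly solution: the names `add_avx` and `add_avx_impl` are hypothetical, GCC or Clang is assumed, and the code falls back to the scalar loop when AVX is not available at run time.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Built for AVX even without -mavx (GCC/Clang-specific attribute). */
__attribute__((target("avx")))
static void add_avx_impl(const float *a, const float *b, float *c, int len)
{
    int i = 0;
    /* Main loop: 8 floats per iteration, i.e. one 256-bit register. */
    for (; i + 8 <= len; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  /* unaligned load of a[i..i+7] */
        __m256 vb = _mm256_loadu_ps(b + i);  /* unaligned load of b[i..i+7] */
        __m256 vc = _mm256_add_ps(va, vb);   /* 8 additions at once (VADDPS) */
        _mm256_storeu_ps(c + i, vc);         /* unaligned store to c[i..i+7] */
    }
    /* Scalar tail for the remaining len % 8 elements. */
    for (; i < len; i++)
        c[i] = a[i] + b[i];
}
#endif

/* Hypothetical AVX counterpart of add(); produces the same results. */
void add_avx(const float *a, const float *b, float *c, int len)
{
    assert(len >= 0);
#if defined(__x86_64__) || defined(__i386__)
    if (__builtin_cpu_supports("avx")) {  /* runtime check for AVX */
        add_avx_impl(a, b, c, len);
        return;
    }
#endif
    for (int i = 0; i < len; i++)         /* portable fallback */
        c[i] = a[i] + b[i];
}
```

The scalar tail matters because len is not guaranteed to be a multiple of 8. An assembly version needs the same two parts: a vector loop, then a remainder loop.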

Minimal-requirement task

At a minimum, students taking this project should deliver the following:

Choose an iterative machine-learning algorithm, for example, logistic regression with a fixed Hessian matrix [pdf], [link].
Implement the selected algorithm using AVX instructions. The implementation can be in assembly code or in C code with AVX annotations, compiled with gcc.
Conduct a performance study and measure the advantage of using AVX by comparing the performance of the AVX code with that of ordinary (non-AVX) assembly code.
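
One possible shape for the performance comparison is a small timing harness that runs a kernel many times and reports the average time per call. The sketch below is an assumption about how such a study might be set up, not a prescribed method: `add_fn`, `add_scalar` and `time_add` are hypothetical names, and `clock()` measures CPU time, which suits single-threaded runs only.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical signature shared by every kernel under test. */
typedef void (*add_fn)(const float *, const float *, float *, int);

/* Plain scalar baseline. */
void add_scalar(const float *a, const float *b, float *c, int len)
{
    for (int i = 0; i < len; i++)
        c[i] = a[i] + b[i];
}

/* Returns the average CPU time in seconds of one call to fn, over reps calls. */
double time_add(add_fn fn, const float *a, const float *b, float *c,
                int len, int reps)
{
    assert(reps > 0);
    fn(a, b, c, len);                 /* warm-up call, not timed */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        fn(a, b, c, len);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / reps;
}
```

Calling `time_add` once with the baseline and once with an AVX kernel of the same signature gives the two numbers to compare. Note that gcc may auto-vectorize the scalar loop at higher optimization levels, so it is worth inspecting the generated assembly or compiling the baseline with `-fno-tree-vectorize`.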

Requirements for Mid-term report

Read the provided materials and understand the opcodes needed for the target algorithm
Design the workflow of the calculation
Design the operations within each step of the workflow

Bonus

We provide extra credit for work that goes beyond the minimal requirements and yields interesting and inspiring discoveries about AVX. For example:

Implement the standard Hessian matrix rather than the approximated Hessian matrix for logistic regression.
Implement multi-threading with AVX.