How do I achieve the theoretical maximum of 4 FLOPs per cycle
Unlocking the full potential of modern processors is a constant pursuit in high-performance computing. A key metric in this quest is FLOPS (floating-point operations per second), a measure of a machine's processing power. Many CPUs theoretically boast a maximum of 4 FLOPs per cycle, a tantalizing figure that represents peak performance. But how do you actually achieve this theoretical maximum? Reaching this pinnacle requires a deep understanding of hardware architecture, instruction sets, and software optimization strategies. This article delves into the methods and considerations for maximizing FLOPS, bridging the gap between theoretical limits and real-world performance.
Understanding the 4 FLOPs/Cycle Limit
Modern CPUs employ vector (SIMD) units such as SSE and AVX (Advanced Vector Extensions), allowing them to perform multiple floating-point operations within a single clock cycle. On the pre-FMA Intel processors discussed below, the figure of 4 double-precision (64-bit) FLOPs per cycle comes from issuing one packed SSE addition (two operations) and one packed SSE multiplication (two more) in parallel each cycle; wider vectors such as AVX raise the ceiling further. This theoretical limit assumes ideal conditions, where the CPU is constantly fed with data and the instructions are perfectly arranged for vectorized execution.
However, real-world applications rarely achieve this ideal scenario. Bottlenecks like memory access speed, data dependencies, and branch mispredictions can significantly hinder performance. Understanding these limitations is crucial for optimizing code and achieving higher FLOPS.
For example, if memory access is slower than the CPU's processing speed, the CPU will spend cycles waiting for data, effectively reducing the FLOPS achieved; this is often referred to as being "memory-bound." Likewise, if instructions within a loop depend on the results of previous iterations, vectorization becomes difficult, limiting the number of FLOPs per cycle.
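To make the dependency problem concrete, here is a minimal sketch (the function names are hypothetical, not from the question below): a single running sum serializes every addition on the adder's multi-cycle latency, while several independent partial sums let the pipeline overlap them, which is the same trick the question's code uses with its five sum/mul variables. Note that splitting the sum changes the floating-point rounding order slightly.

    // A loop-carried dependency: each += must wait for the previous one.
    double sum_serial(const double* a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i];                     // next iteration depends on s
        return s;
    }

    // Three independent accumulators hide the add latency.
    double sum_pipelined(const double* a, int n) {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0;
        int i = 0;
        for (; i + 2 < n; i += 3) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
        }
        for (; i < n; i++) s0 += a[i];     // handle the remainder
        return s0 + s1 + s2;
    }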
Optimizing Code for Maximum FLOPS
Getting close to the theoretical 4 FLOPs/cycle requires careful code optimization. Vectorization is paramount, ensuring that operations are performed on arrays of data rather than individual elements. Compilers play a crucial role in this process, but developers can further enhance vectorization using intrinsics or libraries specifically designed for vector operations.
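As a sketch of compiler-driven vectorization (the function is a made-up example; the vectorization-report flag exists in newer GCC releases): a loop whose iterations are independent and free of pointer aliasing is easy for the compiler to vectorize.

    // Compile with, e.g.: g++ -O3 -march=native -fopt-info-vec axpy.cpp
    // and GCC will report which loops it vectorized.
    void axpy(double a, const double* __restrict x,
              double* __restrict y, int n) {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];   // independent iterations: vectorizable
    }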
Loop unrolling, another optimization technique, reduces loop overhead and allows for better instruction scheduling. By processing multiple loop iterations concurrently, CPU resources are used more efficiently. Furthermore, minimizing branch mispredictions through techniques like branch prediction hints or loop restructuring can significantly improve performance.
Consider this example: a simple loop adding two arrays. Without vectorization, each addition would be performed individually. With vectorization, the CPU can add multiple elements simultaneously, drastically increasing the FLOPS. Furthermore, using intrinsics or specialized libraries can fine-tune the code for specific hardware architectures.
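A hedged sketch of that array addition with explicit SSE2 intrinsics (it assumes 16-byte-aligned pointers and that n is a multiple of two; _mm_loadu_pd would handle unaligned data):

    #include <emmintrin.h>   // SSE2 intrinsics

    // Adds two arrays two doubles at a time in 128-bit registers.
    void add_arrays(const double* a, const double* b, double* c, int n) {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(a + i);          // load two doubles
            __m128d vb = _mm_load_pd(b + i);
            _mm_store_pd(c + i, _mm_add_pd(va, vb));  // one packed add = 2 FLOPs
        }
    }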
The Role of Hardware and Architecture
While software optimization is essential, hardware plays an equally important role. The CPU's architecture, cache hierarchy, and memory bandwidth all influence the achievable FLOPS. Choosing a CPU with wider vector registers, higher clock speeds, and faster memory access can significantly impact performance.
Furthermore, employing multiple cores and threads through parallelization can further increase FLOPS, especially for applications that can be broken down into independent tasks. Efficiently distributing the workload across multiple cores is crucial for realizing the full potential of multi-core processors, as the sketch below shows.
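A minimal OpenMP sketch (the benchmark in the answer below distributes its work the same way with #pragma omp parallel; the function name here is made up):

    // Each thread processes an independent chunk of the array, so no
    // synchronization is needed inside the loop.
    // Compile with: g++ -O2 -fopenmp scale.cpp
    void scale(double* a, int n, double factor) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] *= factor;
    }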
For instance, a CPU with a larger L1 cache can keep more data close to the processing cores, reducing the need for slower memory accesses. Likewise, a processor with higher memory bandwidth can feed data to the CPU more efficiently, preventing memory bottlenecks.
Benchmarking and Analysis
Measuring the FLOPS actually achieved is vital for evaluating optimization efforts. Benchmarking tools provide insights into performance bottlenecks and help identify areas for improvement. Analyzing performance metrics like CPU utilization, memory access patterns, and cache hit rates can guide further optimization strategies.
Tools like LINPACK and Intel's VTune Amplifier provide detailed performance analysis, allowing developers to pinpoint the specific areas of the code that limit FLOPS. By understanding the performance characteristics of different code sections, developers can focus their optimization efforts on the most critical areas.
For example, if benchmarking reveals that a particular loop is memory-bound, optimizing memory access patterns or using prefetching techniques can significantly improve performance. Likewise, if CPU utilization is low, parallelizing the code can make better use of the available resources and increase FLOPS.
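For instance, a software-prefetch sketch (the look-ahead distance of 16 elements is a guess that would need tuning per machine; prefetch hints never fault, so reading past the end of the array is harmless):

    #include <xmmintrin.h>   // _mm_prefetch

    double sum_with_prefetch(const double* a, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            // Ask the cache to start fetching data needed ~16 iterations ahead.
            _mm_prefetch((const char*)(a + i + 16), _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }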
- Prioritize vectorization using compiler options, intrinsics, or specialized libraries.
- Optimize memory access patterns to reduce bottlenecks.
- Analyze the application's performance characteristics using benchmarking tools.
- Identify performance bottlenecks, such as memory access or branch mispredictions.
- Implement targeted optimizations based on the analysis results.
Well-optimized code that leverages the hardware's capabilities can approach the theoretical 4 FLOPs per cycle, especially in compute-intensive tasks.
[Infographic Placeholder: Illustrating the flow of data and instructions within a CPU, highlighting the impact of vectorization and memory access on FLOPS.]
FAQ
Q: What are the primary factors limiting FLOPS in real-world applications?
A: Memory bandwidth, data dependencies between instructions, and branch mispredictions are common factors limiting FLOPS. Inefficient code that doesn't fully use the available vectorization capabilities can also hinder performance.
Maximizing FLOPS requires a multifaceted approach encompassing hardware selection, code optimization, and thorough performance analysis. While achieving the theoretical maximum of 4 FLOPs per cycle is difficult, by applying the methods discussed in this article, developers can significantly improve application performance and narrow the gap between theoretical limits and real-world results. Explore resources like the Intel Developer Zone and AMD Developer Central for more advanced optimization techniques. Continuous benchmarking and analysis are essential for fine-tuning performance and staying abreast of the latest developments in hardware and software.
Question & Answer:
How can the theoretical peak performance of 4 floating point operations (double precision) per cycle be achieved on a modern x86-64 Intel CPU?
As far as I understand, it takes 3 cycles for an SSE add and 5 cycles for a mul to complete on most of the modern Intel CPUs (see for example Agner Fog's 'Instruction Tables'). Due to pipelining, one can get a throughput of one add per cycle, if the algorithm has at least three independent summations. Since that is true for both the packed addpd and the scalar addsd versions, and SSE registers can contain two doubles, the throughput can be as much as two flops per cycle.
Furthermore, it seems (although I've not seen any proper documentation on this) that add's and mul's can be executed in parallel, giving a theoretical max throughput of four flops per cycle.
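Spelled out, the arithmetic behind that figure would be:

    addpd: 1 instruction/cycle x 2 doubles = 2 flops/cycle
    mulpd: 1 instruction/cycle x 2 doubles = 2 flops/cycle
    both issued in parallel on separate ports = 4 flops/cycle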
However, I've not been able to replicate that performance with a simple C/C++ program. My best attempt resulted in about 2.7 flops/cycle. If anyone can contribute a simple C/C++ or assembler program which demonstrates peak performance, that'd be greatly appreciated.
My attempt:
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include <sys/time.h>

    double stoptime(void) {
        struct timeval t;
        gettimeofday(&t, NULL);
        return (double) t.tv_sec + t.tv_usec/1000000.0;
    }

    double addmul(double add, double mul, int ops){
        // Need to initialise differently otherwise compiler might optimise away
        double sum1=0.1, sum2=-0.1, sum3=0.2, sum4=-0.2, sum5=0.0;
        double mul1=1.0, mul2= 1.1, mul3=1.2, mul4= 1.3, mul5=1.4;
        int loops=ops/10;          // We have 10 floating point operations inside the loop
        double expected = 5.0*add*loops + (sum1+sum2+sum3+sum4+sum5)
                        + pow(mul,loops)*(mul1+mul2+mul3+mul4+mul5);

        for (int i=0; i<loops; i++) {
            mul1*=mul; mul2*=mul; mul3*=mul; mul4*=mul; mul5*=mul;
            sum1+=add; sum2+=add; sum3+=add; sum4+=add; sum5+=add;
        }
        return sum1+sum2+sum3+sum4+sum5+mul1+mul2+mul3+mul4+mul5 - expected;
    }

    int main(int argc, char** argv) {
        if (argc != 2) {
            printf("usage: %s <num>\n", argv[0]);
            printf("number of operations: <num> millions\n");
            exit(EXIT_FAILURE);
        }
        int n = atoi(argv[1]) * 1000000;
        if (n<=0)
            n=1000;

        double x = M_PI;
        double y = 1.0 + 1e-8;
        double t = stoptime();
        x = addmul(x, y, n);
        t = stoptime() - t;
        printf("addmul:\t %.3f s, %.3f Gflops, res=%f\n", t, (double)n/t/1e9, x);
        return EXIT_SUCCESS;
    }
Compiled with:
    g++ -O2 -march=native addmul.cpp ; ./a.out 1000

produces the following output on an Intel Core i5-750, 2.66 GHz:

    addmul:  0.270 s, 3.707 Gflops, res=1.326463
That is, just about 1.4 flops per cycle. Looking at the assembler code with g++ -S -O2 -march=native -masm=intel addmul.cpp, the main loop seems kind of optimal to me:
    .L4:
        inc     eax
        mulsd   xmm8, xmm3
        mulsd   xmm7, xmm3
        mulsd   xmm6, xmm3
        mulsd   xmm5, xmm3
        mulsd   xmm1, xmm3
        addsd   xmm13, xmm2
        addsd   xmm12, xmm2
        addsd   xmm11, xmm2
        addsd   xmm10, xmm2
        addsd   xmm9, xmm2
        cmp     eax, ebx
        jne     .L4
Changing the scalar versions to packed versions (addpd and mulpd) would double the flop count without changing the execution time, and so I'd get just short of 2.8 flops per cycle. Is there a simple example which achieves four flops per cycle?
Nice little program by Mysticial; here are my results (run just for a few seconds though):
- gcc -O2 -march=nocona: 5.6 Gflops out of 10.66 Gflops (2.1 flops/cycle)
- cl /O2, openmp removed: 10.1 Gflops out of 10.66 Gflops (3.8 flops/cycle)
It all seems a bit complex, but my conclusions so far:
- gcc -O2 changes the order of independent floating point operations with the aim of alternating addpd and mulpd's if possible. The same applies to gcc-4.6.2 -O2 -march=core2.
- gcc -O2 -march=nocona seems to keep the order of floating point operations as defined in the C++ source.
- cl /O2, the 64-bit compiler from the SDK for Windows 7, does loop-unrolling automatically and seems to try to arrange operations so that groups of three addpd's alternate with three mulpd's (well, at least on my system and for my simple program).
- My Core i5 750 (Nehalem architecture) doesn't like alternating add's and mul's and seems unable to run both operations in parallel. However, if grouped in 3's, it suddenly works like magic.
- Other architectures (possibly Sandy Bridge and others) appear to be able to execute add/mul in parallel without problems if they alternate in the assembly code.
- Although difficult to admit, on my system cl /O2 does a much better job at low-level optimising operations for my system and achieves close to peak performance for the little C++ example above. I measured between 1.85-2.01 flops/cycle (I used clock() in Windows, which is not that precise. I guess I need to use a better timer - thanks Mackie Messer).
- The best I managed with gcc was to manually loop unroll and arrange additions and multiplications in groups of three. With g++ -O2 -march=nocona addmul_unroll.cpp I get at best 0.207 s, 4.825 Gflops, which corresponds to 1.8 flops/cycle, which I'm quite happy with now.
In the C++ code I've replaced the for loop with:
    for (int i=0; i<loops/3; i++) {
        mul1*=mul; mul2*=mul; mul3*=mul;
        sum1+=add; sum2+=add; sum3+=add;
        mul4*=mul; mul5*=mul; mul1*=mul;
        sum4+=add; sum5+=add; sum1+=add;

        mul2*=mul; mul3*=mul; mul4*=mul;
        sum2+=add; sum3+=add; sum4+=add;
        mul5*=mul; mul1*=mul; mul2*=mul;
        sum5+=add; sum1+=add; sum2+=add;

        mul3*=mul; mul4*=mul; mul5*=mul;
        sum3+=add; sum4+=add; sum5+=add;
    }
And the assembly now looks like:
    .L4:
        mulsd   xmm8, xmm3
        mulsd   xmm7, xmm3
        mulsd   xmm6, xmm3
        addsd   xmm13, xmm2
        addsd   xmm12, xmm2
        addsd   xmm11, xmm2
        mulsd   xmm5, xmm3
        mulsd   xmm1, xmm3
        mulsd   xmm8, xmm3
        addsd   xmm10, xmm2
        addsd   xmm9, xmm2
        addsd   xmm13, xmm2
        ...
I've done this exact task before. But it was mainly to measure power consumption and CPU temperatures. The following code (which is fairly long) achieves close to optimal on my Core i7 2600K.
The key thing to note here is the massive amount of manual loop-unrolling as well as the interleaving of multiplies and adds…
The full project can be found on my GitHub: https://github.com/Mysticial/Flops
Warning:
If you decide to compile and run this, pay attention to your CPU temperatures!!!
Make sure you don't overheat it. And make sure CPU-throttling doesn't affect your results!
Furthermore, I take no responsibility for whatever damage may result from running this code.
Notes:
- This code is optimized for x64. x86 doesn't have enough registers for this to compile well.
- This code has been tested to work well on Visual Studio 2010/2012 and GCC 4.6. ICC 11 (Intel Compiler 11) surprisingly has trouble compiling it well.
- These are for pre-FMA processors. In order to achieve peak FLOPS on Intel Haswell and AMD Bulldozer processors (and later), FMA (Fused Multiply Add) instructions will be needed. Those are beyond the scope of this benchmark.
    #include <emmintrin.h>
    #include <omp.h>
    #include <iostream>
    #include <cstdlib>   // for malloc/free and system()
    using namespace std;

    typedef unsigned long long uint64;

    double test_dp_mac_SSE(double x,double y,uint64 iterations){
        register __m128d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;

        //  Generate starting data.
        r0 = _mm_set1_pd(x);
        r1 = _mm_set1_pd(y);

        r8 = _mm_set1_pd(-0.0);

        r2 = _mm_xor_pd(r0,r8);
        r3 = _mm_or_pd(r0,r8);
        r4 = _mm_andnot_pd(r8,r0);
        r5 = _mm_mul_pd(r1,_mm_set1_pd(0.37796447300922722721));
        r6 = _mm_mul_pd(r1,_mm_set1_pd(0.24253562503633297352));
        r7 = _mm_mul_pd(r1,_mm_set1_pd(4.1231056256176605498));
        r8 = _mm_add_pd(r0,_mm_set1_pd(0.37796447300922722721));
        r9 = _mm_add_pd(r1,_mm_set1_pd(0.24253562503633297352));
        rA = _mm_sub_pd(r0,_mm_set1_pd(4.1231056256176605498));
        rB = _mm_sub_pd(r1,_mm_set1_pd(4.1231056256176605498));
        rC = _mm_set1_pd(1.4142135623730950488);
        rD = _mm_set1_pd(1.7320508075688772935);
        rE = _mm_set1_pd(0.57735026918962576451);
        rF = _mm_set1_pd(0.70710678118654752440);

        uint64 iMASK = 0x800fffffffffffffull;
        __m128d MASK = _mm_set1_pd(*(double*)&iMASK);
        __m128d vONE = _mm_set1_pd(1.0);

        uint64 c = 0;
        while (c < iterations){
            size_t i = 0;
            while (i < 1000){
                //  Here's the meat - the part that really matters.

                r0 = _mm_mul_pd(r0,rC);
                r1 = _mm_add_pd(r1,rD);
                r2 = _mm_mul_pd(r2,rE);
                r3 = _mm_sub_pd(r3,rF);
                r4 = _mm_mul_pd(r4,rC);
                r5 = _mm_add_pd(r5,rD);
                r6 = _mm_mul_pd(r6,rE);
                r7 = _mm_sub_pd(r7,rF);
                r8 = _mm_mul_pd(r8,rC);
                r9 = _mm_add_pd(r9,rD);
                rA = _mm_mul_pd(rA,rE);
                rB = _mm_sub_pd(rB,rF);

                r0 = _mm_add_pd(r0,rF);
                r1 = _mm_mul_pd(r1,rE);
                r2 = _mm_sub_pd(r2,rD);
                r3 = _mm_mul_pd(r3,rC);
                r4 = _mm_add_pd(r4,rF);
                r5 = _mm_mul_pd(r5,rE);
                r6 = _mm_sub_pd(r6,rD);
                r7 = _mm_mul_pd(r7,rC);
                r8 = _mm_add_pd(r8,rF);
                r9 = _mm_mul_pd(r9,rE);
                rA = _mm_sub_pd(rA,rD);
                rB = _mm_mul_pd(rB,rC);

                r0 = _mm_mul_pd(r0,rC);
                r1 = _mm_add_pd(r1,rD);
                r2 = _mm_mul_pd(r2,rE);
                r3 = _mm_sub_pd(r3,rF);
                r4 = _mm_mul_pd(r4,rC);
                r5 = _mm_add_pd(r5,rD);
                r6 = _mm_mul_pd(r6,rE);
                r7 = _mm_sub_pd(r7,rF);
                r8 = _mm_mul_pd(r8,rC);
                r9 = _mm_add_pd(r9,rD);
                rA = _mm_mul_pd(rA,rE);
                rB = _mm_sub_pd(rB,rF);

                r0 = _mm_add_pd(r0,rF);
                r1 = _mm_mul_pd(r1,rE);
                r2 = _mm_sub_pd(r2,rD);
                r3 = _mm_mul_pd(r3,rC);
                r4 = _mm_add_pd(r4,rF);
                r5 = _mm_mul_pd(r5,rE);
                r6 = _mm_sub_pd(r6,rD);
                r7 = _mm_mul_pd(r7,rC);
                r8 = _mm_add_pd(r8,rF);
                r9 = _mm_mul_pd(r9,rE);
                rA = _mm_sub_pd(rA,rD);
                rB = _mm_mul_pd(rB,rC);

                i++;
            }

            //  Need to renormalize to prevent denormal/overflow.
            r0 = _mm_and_pd(r0,MASK);
            r1 = _mm_and_pd(r1,MASK);
            r2 = _mm_and_pd(r2,MASK);
            r3 = _mm_and_pd(r3,MASK);
            r4 = _mm_and_pd(r4,MASK);
            r5 = _mm_and_pd(r5,MASK);
            r6 = _mm_and_pd(r6,MASK);
            r7 = _mm_and_pd(r7,MASK);
            r8 = _mm_and_pd(r8,MASK);
            r9 = _mm_and_pd(r9,MASK);
            rA = _mm_and_pd(rA,MASK);
            rB = _mm_and_pd(rB,MASK);
            r0 = _mm_or_pd(r0,vONE);
            r1 = _mm_or_pd(r1,vONE);
            r2 = _mm_or_pd(r2,vONE);
            r3 = _mm_or_pd(r3,vONE);
            r4 = _mm_or_pd(r4,vONE);
            r5 = _mm_or_pd(r5,vONE);
            r6 = _mm_or_pd(r6,vONE);
            r7 = _mm_or_pd(r7,vONE);
            r8 = _mm_or_pd(r8,vONE);
            r9 = _mm_or_pd(r9,vONE);
            rA = _mm_or_pd(rA,vONE);
            rB = _mm_or_pd(rB,vONE);

            c++;
        }

        r0 = _mm_add_pd(r0,r1);
        r2 = _mm_add_pd(r2,r3);
        r4 = _mm_add_pd(r4,r5);
        r6 = _mm_add_pd(r6,r7);
        r8 = _mm_add_pd(r8,r9);
        rA = _mm_add_pd(rA,rB);

        r0 = _mm_add_pd(r0,r2);
        r4 = _mm_add_pd(r4,r6);
        r8 = _mm_add_pd(r8,rA);

        r0 = _mm_add_pd(r0,r4);
        r0 = _mm_add_pd(r0,r8);

        //  Prevent Dead Code Elimination
        double out = 0;
        __m128d temp = r0;
        out += ((double*)&temp)[0];
        out += ((double*)&temp)[1];

        return out;
    }

    void test_dp_mac_SSE(int tds,uint64 iterations){
        double *sum = (double*)malloc(tds * sizeof(double));
        double start = omp_get_wtime();

    #pragma omp parallel num_threads(tds)
        {
            double ret = test_dp_mac_SSE(1.1,2.1,iterations);
            sum[omp_get_thread_num()] = ret;
        }

        double secs = omp_get_wtime() - start;
        uint64 ops = 48 * 1000 * iterations * tds * 2;
        cout << "Seconds = " << secs << endl;
        cout << "FP Ops  = " << ops << endl;
        cout << "FLOPs   = " << ops / secs << endl;

        double out = 0;
        int c = 0;
        while (c < tds){
            out += sum[c++];
        }

        cout << "sum = " << out << endl;
        cout << endl;

        free(sum);
    }

    int main(){
        //  (threads, iterations)
        test_dp_mac_SSE(8,10000000);

        system("pause");
    }
Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

    Seconds = 55.5104
    FP Ops  = 960000000000
    FLOPs   = 1.7294e+010
    sum = 2.22652
The machine is a Core i7 2600K @ 4.4 GHz. Theoretical SSE peak is 4 flops * 4.4 GHz = 17.6 GFlops. This code achieves 17.3 GFlops - not bad.
Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

    Seconds = 117.202
    FP Ops  = 7680000000000
    FLOPs   = 6.55279e+010
    sum = 17.8122
Theoretical SSE peak is 4 flops * 4 cores * 4.4 GHz = 70.4 GFlops. Actual is 65.5 GFlops.
Let's take this one step further. AVX…
    #include <immintrin.h>
    #include <omp.h>
    #include <iostream>
    #include <cstdlib>   // for malloc/free and system()
    using namespace std;

    typedef unsigned long long uint64;

    double test_dp_mac_AVX(double x,double y,uint64 iterations){
        register __m256d r0,r1,r2,r3,r4,r5,r6,r7,r8,r9,rA,rB,rC,rD,rE,rF;

        //  Generate starting data.
        r0 = _mm256_set1_pd(x);
        r1 = _mm256_set1_pd(y);

        r8 = _mm256_set1_pd(-0.0);

        r2 = _mm256_xor_pd(r0,r8);
        r3 = _mm256_or_pd(r0,r8);
        r4 = _mm256_andnot_pd(r8,r0);
        r5 = _mm256_mul_pd(r1,_mm256_set1_pd(0.37796447300922722721));
        r6 = _mm256_mul_pd(r1,_mm256_set1_pd(0.24253562503633297352));
        r7 = _mm256_mul_pd(r1,_mm256_set1_pd(4.1231056256176605498));
        r8 = _mm256_add_pd(r0,_mm256_set1_pd(0.37796447300922722721));
        r9 = _mm256_add_pd(r1,_mm256_set1_pd(0.24253562503633297352));
        rA = _mm256_sub_pd(r0,_mm256_set1_pd(4.1231056256176605498));
        rB = _mm256_sub_pd(r1,_mm256_set1_pd(4.1231056256176605498));
        rC = _mm256_set1_pd(1.4142135623730950488);
        rD = _mm256_set1_pd(1.7320508075688772935);
        rE = _mm256_set1_pd(0.57735026918962576451);
        rF = _mm256_set1_pd(0.70710678118654752440);

        uint64 iMASK = 0x800fffffffffffffull;
        __m256d MASK = _mm256_set1_pd(*(double*)&iMASK);
        __m256d vONE = _mm256_set1_pd(1.0);

        uint64 c = 0;
        while (c < iterations){
            size_t i = 0;
            while (i < 1000){
                //  Here's the meat - the part that really matters.

                r0 = _mm256_mul_pd(r0,rC);
                r1 = _mm256_add_pd(r1,rD);
                r2 = _mm256_mul_pd(r2,rE);
                r3 = _mm256_sub_pd(r3,rF);
                r4 = _mm256_mul_pd(r4,rC);
                r5 = _mm256_add_pd(r5,rD);
                r6 = _mm256_mul_pd(r6,rE);
                r7 = _mm256_sub_pd(r7,rF);
                r8 = _mm256_mul_pd(r8,rC);
                r9 = _mm256_add_pd(r9,rD);
                rA = _mm256_mul_pd(rA,rE);
                rB = _mm256_sub_pd(rB,rF);

                r0 = _mm256_add_pd(r0,rF);
                r1 = _mm256_mul_pd(r1,rE);
                r2 = _mm256_sub_pd(r2,rD);
                r3 = _mm256_mul_pd(r3,rC);
                r4 = _mm256_add_pd(r4,rF);
                r5 = _mm256_mul_pd(r5,rE);
                r6 = _mm256_sub_pd(r6,rD);
                r7 = _mm256_mul_pd(r7,rC);
                r8 = _mm256_add_pd(r8,rF);
                r9 = _mm256_mul_pd(r9,rE);
                rA = _mm256_sub_pd(rA,rD);
                rB = _mm256_mul_pd(rB,rC);

                r0 = _mm256_mul_pd(r0,rC);
                r1 = _mm256_add_pd(r1,rD);
                r2 = _mm256_mul_pd(r2,rE);
                r3 = _mm256_sub_pd(r3,rF);
                r4 = _mm256_mul_pd(r4,rC);
                r5 = _mm256_add_pd(r5,rD);
                r6 = _mm256_mul_pd(r6,rE);
                r7 = _mm256_sub_pd(r7,rF);
                r8 = _mm256_mul_pd(r8,rC);
                r9 = _mm256_add_pd(r9,rD);
                rA = _mm256_mul_pd(rA,rE);
                rB = _mm256_sub_pd(rB,rF);

                r0 = _mm256_add_pd(r0,rF);
                r1 = _mm256_mul_pd(r1,rE);
                r2 = _mm256_sub_pd(r2,rD);
                r3 = _mm256_mul_pd(r3,rC);
                r4 = _mm256_add_pd(r4,rF);
                r5 = _mm256_mul_pd(r5,rE);
                r6 = _mm256_sub_pd(r6,rD);
                r7 = _mm256_mul_pd(r7,rC);
                r8 = _mm256_add_pd(r8,rF);
                r9 = _mm256_mul_pd(r9,rE);
                rA = _mm256_sub_pd(rA,rD);
                rB = _mm256_mul_pd(rB,rC);

                i++;
            }

            //  Need to renormalize to prevent denormal/overflow.
            r0 = _mm256_and_pd(r0,MASK);
            r1 = _mm256_and_pd(r1,MASK);
            r2 = _mm256_and_pd(r2,MASK);
            r3 = _mm256_and_pd(r3,MASK);
            r4 = _mm256_and_pd(r4,MASK);
            r5 = _mm256_and_pd(r5,MASK);
            r6 = _mm256_and_pd(r6,MASK);
            r7 = _mm256_and_pd(r7,MASK);
            r8 = _mm256_and_pd(r8,MASK);
            r9 = _mm256_and_pd(r9,MASK);
            rA = _mm256_and_pd(rA,MASK);
            rB = _mm256_and_pd(rB,MASK);
            r0 = _mm256_or_pd(r0,vONE);
            r1 = _mm256_or_pd(r1,vONE);
            r2 = _mm256_or_pd(r2,vONE);
            r3 = _mm256_or_pd(r3,vONE);
            r4 = _mm256_or_pd(r4,vONE);
            r5 = _mm256_or_pd(r5,vONE);
            r6 = _mm256_or_pd(r6,vONE);
            r7 = _mm256_or_pd(r7,vONE);
            r8 = _mm256_or_pd(r8,vONE);
            r9 = _mm256_or_pd(r9,vONE);
            rA = _mm256_or_pd(rA,vONE);
            rB = _mm256_or_pd(rB,vONE);

            c++;
        }

        r0 = _mm256_add_pd(r0,r1);
        r2 = _mm256_add_pd(r2,r3);
        r4 = _mm256_add_pd(r4,r5);
        r6 = _mm256_add_pd(r6,r7);
        r8 = _mm256_add_pd(r8,r9);
        rA = _mm256_add_pd(rA,rB);

        r0 = _mm256_add_pd(r0,r2);
        r4 = _mm256_add_pd(r4,r6);
        r8 = _mm256_add_pd(r8,rA);

        r0 = _mm256_add_pd(r0,r4);
        r0 = _mm256_add_pd(r0,r8);

        //  Prevent Dead Code Elimination
        double out = 0;
        __m256d temp = r0;
        out += ((double*)&temp)[0];
        out += ((double*)&temp)[1];
        out += ((double*)&temp)[2];
        out += ((double*)&temp)[3];

        return out;
    }

    void test_dp_mac_AVX(int tds,uint64 iterations){
        double *sum = (double*)malloc(tds * sizeof(double));
        double start = omp_get_wtime();

    #pragma omp parallel num_threads(tds)
        {
            double ret = test_dp_mac_AVX(1.1,2.1,iterations);
            sum[omp_get_thread_num()] = ret;
        }

        double secs = omp_get_wtime() - start;
        uint64 ops = 48 * 1000 * iterations * tds * 4;
        cout << "Seconds = " << secs << endl;
        cout << "FP Ops  = " << ops << endl;
        cout << "FLOPs   = " << ops / secs << endl;

        double out = 0;
        int c = 0;
        while (c < tds){
            out += sum[c++];
        }

        cout << "sum = " << out << endl;
        cout << endl;

        free(sum);
    }

    int main(){
        //  (threads, iterations)
        test_dp_mac_AVX(8,10000000);

        system("pause");
    }
Output (1 thread, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

    Seconds = 57.4679
    FP Ops  = 1920000000000
    FLOPs   = 3.34099e+010
    sum = 4.45305
Theoretical AVX peak is 8 flops * 4.4 GHz = 35.2 GFlops. Actual is 33.4 GFlops.
Output (8 threads, 10000000 iterations) - Compiled with Visual Studio 2010 SP1 - x64 Release:

    Seconds = 111.119
    FP Ops  = 15360000000000
    FLOPs   = 1.3823e+011
    sum = 35.6244
Theoretical AVX peak is 8 flops * 4 cores * 4.4 GHz = 140.8 GFlops. Actual is 138.2 GFlops.
Now for some explanations:
The performance-critical part is obviously the 48 instructions inside the inner loop. You'll notice that they're broken into 4 blocks of 12 instructions each. Each of these 12-instruction blocks is completely independent of the others - and takes on average 6 cycles to execute.
So there are 12 instructions and 6 cycles between issue-to-use. The latency of multiplication is 5 cycles, so it's just enough to avoid latency stalls.
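Spelling out that timing argument:

    issue time per 12-instruction block: 12 instructions / 2 ports = 6 cycles
    reuse distance of each register:     12 instructions = 6 cycles
    6 cycles >= 5-cycle multiply latency, so the dependency chains never stall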
The normalization step is needed to keep the data from over/underflowing. This is needed since the do-nothing code will slowly increase/decrease the magnitude of the data.
So it's actually possible to do better than this if you just use all zeros and get rid of the normalization step. However, since I wrote the benchmark to measure power consumption and temperature, I had to make sure the flops were on "real" data, rather than zeros - as the execution units may very well have special case-handling for zeros that uses less power and produces less heat.
More Results:
- Intel Core i7 920 @ 3.5 GHz
- Windows 7 Ultimate x64
- Visual Studio 2010 SP1 - x64 Release
Threads: 1
    Seconds = 72.1116
    FP Ops  = 960000000000
    FLOPs   = 1.33127e+010
    sum = 2.22652

Theoretical SSE Peak: 4 flops * 3.5 GHz = 14.0 GFlops. Actual is 13.3 GFlops.
Threads: 8

    Seconds = 149.576
    FP Ops  = 7680000000000
    FLOPs   = 5.13452e+010
    sum = 17.8122

Theoretical SSE Peak: 4 flops * 4 cores * 3.5 GHz = 56.0 GFlops. Actual is 51.3 GFlops.
My processor temps hit 76C on the multi-threaded run! If you run these, be sure the results aren't affected by CPU throttling.
- 2 x Intel Xeon X5482 Harpertown @ 3.2 GHz
- Ubuntu Linux 10 x64
- GCC 4.5.2 x64 - (-O2 -msse3 -fopenmp)
Threads: 1
    Seconds = 78.3357
    FP Ops  = 960000000000
    FLOPs   = 1.22549e+10
    sum = 2.22652

Theoretical SSE Peak: 4 flops * 3.2 GHz = 12.8 GFlops. Actual is 12.3 GFlops.
Threads: 8

    Seconds = 78.4733
    FP Ops  = 7680000000000
    FLOPs   = 9.78676e+10
    sum = 17.8122

Theoretical SSE Peak: 4 flops * 8 cores * 3.2 GHz = 102.4 GFlops. Actual is 97.9 GFlops.