J.C. Huang, T. Leng - Generalized Loop-Unrolling - a Method for Program Speed-Up
To avoid this problem, we need to modify the unrolled loop to prevent the pointer from getting out of range:

⇔
    count = 0;
    while (lp != NULL)
    {
        lp = lp->next;
        if (lp != NULL)
        {
            lp = lp->next;
            if (lp != NULL)
            {
                lp = lp->next;
                count += 3;
            }
            else { count += 2; break; }
        }
        else { count++; break; }
    }

Another possible solution is to attach a special sentinel node named NULL_NODE at the end of the list. The link field of this node points to the node itself, as depicted below.

[Figure: a linked list terminated by the sentinel NULL_NODE, whose link field points back to NULL_NODE itself]

Note that a self-pointing sentinel can be used in other applications to unroll a loop whose length is unknown at the beginning of its execution.

Its use allows us to unroll the loop k times, and to test at the end of every k iterations whether the pointer is at the sentinel node:

⇔
    count = 0;
    p[0] = lp;                       /* p[0..k-1] are auxiliary pointers */
    p[1] = p[0]->next;
    p[2] = p[1]->next;
    ...
    p[k-1] = p[k-2]->next;
    while (p[k-1] != NULL_NODE)      //unrolled loop
    {
        count += k;
        p[0] = p[k-1]->next;
        p[1] = p[0]->next;
        p[2] = p[1]->next;
        ...
        p[k-1] = p[k-2]->next;
    }
    while (p[0] != NULL_NODE)
    {
        p[0] = p[0]->next;
        count++;
    }                                //end of the unrolled loop

We would expect improved performance from the transformation.
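The sentinel technique can be sketched in C for the fixed case k = 3. This is only an illustrative sketch under stated assumptions, not the authors' code: the node type, the function names count_simple and count_unrolled3, and the realization of NULL_NODE as a statically allocated node whose link points to itself are all choices made here for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type; the text does not show its declaration. */
struct node {
    struct node *next;
};

/* Self-pointing sentinel: following `next` from it stays on it,
   so stepping past the end of a list is harmless. */
static struct node NULL_NODE = { &NULL_NODE };

/* Reference version: one sentinel test per node. */
static size_t count_simple(const struct node *lp)
{
    size_t count = 0;
    while (lp != &NULL_NODE) {
        lp = lp->next;
        count++;
    }
    return count;
}

/* Unrolled with k = 3: one sentinel test per three nodes. */
static size_t count_unrolled3(const struct node *lp)
{
    size_t count = 0;
    const struct node *p0 = lp;
    const struct node *p1 = p0->next;
    const struct node *p2 = p1->next;

    while (p2 != &NULL_NODE) {   /* p0, p1, p2 are three real nodes */
        count += 3;
        p0 = p2->next;
        p1 = p0->next;
        p2 = p1->next;
    }
    while (p0 != &NULL_NODE) {   /* nodes left over after the unrolled loop */
        p0 = p0->next;
        count++;
    }
    return count;
}
```

For the sketch to work, every list must end in &NULL_NODE rather than NULL; an empty list is represented by &NULL_NODE itself, for which both functions return 0.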
The gains come not only from reducing the loop overhead but also from compacting the computation performed in the loop bodies. In an experiment where the loop is unrolled three times and the linked lists used are of sizes 100 and 500, the average speed-up factor is approximately 1.19.

Example 4. The power algorithm for computing r ≡ a^n (mod m), where a, n, m, and r are integers.

    r = 1;
    while (n > 0)
    {
        d = n % 2;
        if (d == 1)
        {
            r = r * a;
            r = r % m;
        }
        a = a * a;
        a = a % m;
        n = (n - d) / 2;
    }

After unrolling the loop three times, the loop predicate becomes (n>0) && (n>2) && (n>4).
This condition can be simplified to (n>4). If the original loop iterates I times, the unrolled loop iterates at most (I/3)+2 times (I/3 unrolled iterations and up to 2 iterations of the original loop). In this case, because of the data dependence between iterations, not much instruction reduction can be achieved.
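A minimal C sketch of Example 4 unrolled three times is given here for illustration. The function names, the POWER_STEP macro, the use of uint64_t, and the restriction 0 < m <= 2^32 (so that every product fits in 64 bits) are assumptions of this sketch and do not come from the text. The unrolled loop is guarded by the simplified predicate (n > 4) stated above; since n is halved on every pass, the weaker test n >= 4 would also guarantee three iterations, and in either case the trailing loop finishes whatever is left.

```c
#include <assert.h>
#include <stdint.h>

/* One pass of the original loop body. A macro is used so that the
   unrolled loop contains three inline copies of the body. */
#define POWER_STEP(r, a, n, m)                       \
    do {                                             \
        if ((n) & 1u) (r) = ((r) * (a)) % (m);       \
        (a) = ((a) * (a)) % (m);                     \
        (n) >>= 1;                                   \
    } while (0)

/* Original loop: returns a^n mod m, assuming 0 < m <= 2^32. */
static uint64_t power_mod(uint64_t a, uint64_t n, uint64_t m)
{
    uint64_t r = 1 % m;   /* gives 0 when m == 1 */
    a %= m;
    while (n > 0) {
        POWER_STEP(r, a, n, m);
    }
    return r;
}

/* Same computation with the loop unrolled three times. */
static uint64_t power_mod_unrolled3(uint64_t a, uint64_t n, uint64_t m)
{
    uint64_t r = 1 % m;
    a %= m;
    while (n > 4) {               /* three more passes are guaranteed */
        POWER_STEP(r, a, n, m);
        POWER_STEP(r, a, n, m);
        POWER_STEP(r, a, n, m);
    }
    while (n > 0) {               /* leftover passes of the original loop */
        POWER_STEP(r, a, n, m);
    }
    return r;
}
```

Each pass still depends on the values of r, a, and n produced by the previous pass, so the three copies cannot be merged into cheaper code; the only work saved per unrolled iteration is two evaluations of the loop test.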
Therefore, the performance gain can only be obtained from the reduction in the number of condition tests.

Example 5. Compute the GCD of two positive integers x and y by using the so-called binary algorithm. This algorithm requires no division operations (which may be time-consuming), and relies solely on subtraction, shifts, and bitwise operations. It has been shown that the binary algorithm is about 15% to 20% faster than Euclid's algorithm (Example 2).

    L1:   while (((x | y) & 1) == 0)
          {
              x >>= 1;
              y >>= 1;
              ++common_power_of_two;
          }
          while ((x & 1) == 0) x >>= 1;
    L2:   while (y != 0)
          {
    L2.1:     while ((y & 1) == 0) y >>= 1;
              temp = y;
              y = abs(x - y);
              x = temp;
          }
          gcd = x << common_power_of_two;

First, we unroll loop L1 three times:

⇔
    while (((x | y) & 7) == 0)       //unrolled loop
    {
        x >>= 3;
        y >>= 3;
        common_power_of_two += 3;
    }
    while (((x | y) & 1) == 0)
    {
        x >>= 1;
        y >>= 1;
        ++common_power_of_two;
    }                                //end of the unrolled loop

Similarly, we unroll both loops L2 and L2.1 two times as follows:

⇔
    while (y != 0)
    {
        while ((y & 3) == 0) y >>= 2;        //unrolled loop L2.1
        if ((y & 1) == 0) y >>= 1;           //end of the unrolled loop
        if (x != y)
        {
            temp = y;
            y = abs(x - y);
            while ((y & 3) == 0) y >>= 2;    //unrolled loop L2.1
            if ((y & 1) == 0) y >>= 1;       //end of the unrolled loop
            x = y;
            y = abs(temp - y);
        }
        else
        {
            temp = y;
            y = 0;
            break;
        }
    }

For this example program, the experimental results show that we are able to achieve a speed-up factor of 1.15 on average.

In conclusion, we have presented a method for speeding up programs by unrolling their loop constructs.
Our preliminary investigation reveals that, for most real-world programs, the degree of speed-up that can be achieved is modest. Nevertheless, it can be easily done, and does not require any special hardware to implement: it works on any platform. Since the degree of speed-up is proportional to the number of times a loop is iterated, it is economically justifiable to apply the method only to programs that will be used often (such as library routines), and to loops that may be iterated a great many times during execution.

References

[1] J. C. Huang, "State Constraints and Pathwise Decomposition of Programs", IEEE Trans. on Software Engineering, Vol. 16, No. 8, August 1990.
[2] John L. Hennessy and David A. Patterson, "Computer Architecture: A Quantitative Approach", 2nd Edition, 1995.
[3] L. J. Hendren and G. R. Gao, "Designing Programming Languages for Analyzability: A Fresh Look at Pointer Data Structure", IEEE, 1992.
[4] Michael J. Wolfe, "High Performance Compilers for Parallel Computing", 1996.
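
As a supplement to Example 5, the original and the unrolled binary GCD loops can be placed side by side in a compilable form. This is a sketch under stated assumptions rather than the authors' program: the function names, the use of unsigned operands (with abs(x - y) replaced by an explicit comparison, since abs is not defined for unsigned values), and the shorter variable name shift in place of common_power_of_two are choices made here. Both functions assume x > 0 and y > 0, as in the example.

```c
#include <assert.h>

/* Absolute difference of two unsigned values. */
static unsigned abs_diff(unsigned a, unsigned b)
{
    return (a > b) ? a - b : b - a;
}

/* Original binary algorithm (loops L1, L2, L2.1 of Example 5). */
static unsigned gcd_binary(unsigned x, unsigned y)
{
    unsigned shift = 0, temp;
    while (((x | y) & 1u) == 0) { x >>= 1; y >>= 1; ++shift; }
    while ((x & 1u) == 0) x >>= 1;
    while (y != 0) {
        while ((y & 1u) == 0) y >>= 1;
        temp = y;
        y = abs_diff(x, y);
        x = temp;
    }
    return x << shift;
}

/* L1 unrolled three times; L2 and L2.1 unrolled two times. */
static unsigned gcd_binary_unrolled(unsigned x, unsigned y)
{
    unsigned shift = 0, temp;
    while (((x | y) & 7u) == 0) { x >>= 3; y >>= 3; shift += 3; }
    while (((x | y) & 1u) == 0) { x >>= 1; y >>= 1; ++shift; }
    while ((x & 1u) == 0) x >>= 1;
    while (y != 0) {
        while ((y & 3u) == 0) y >>= 2;   /* two passes of L2.1 at once */
        if ((y & 1u) == 0) y >>= 1;      /* possible odd leftover pass */
        if (x != y) {
            temp = y;
            y = abs_diff(x, y);          /* nonzero because x != y */
            while ((y & 3u) == 0) y >>= 2;
            if ((y & 1u) == 0) y >>= 1;
            x = y;
            y = abs_diff(temp, y);
        } else {
            break;                       /* x already holds the odd part of the GCD */
        }
    }
    return x << shift;
}
```

The test x != y inside the unrolled L2 loop is what keeps the second copy of L2.1 safe: it guarantees that y is nonzero before its trailing zero bits are stripped.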