Short-Form Guide to Vectorization/Parallelization
The Cray YMP-EL has four identical vector processors.
In order to make optimal use of the machine, there is a clear priority
of goals:
1) Make sure your code vectorizes well.
This ensures that one individual processor may be put to
good use for your code.
2) Only after #1 has been taken care of should you also
parallelize your code, i.e. ensure that a good part of it
can be broken into four pieces which can execute concurrently
on the four processors.
Both tasks will be described below. The vectorization task
is equally important for other vector processors, like the
Cray-1, -XMP, -YMP, -C90, -T90, -J90, the CDC Cyber 205/ETA-10,
or the modern Cray-X1. The parallelization task comes in
only for parallel processors, i.e. the Cray-XMP, -YMP, -C90,
-T90, -J90, -X1, but also MPP (massively parallel) systems
like the Cray-T3E. In this respect there is no difference in
principle between the very first parallel vector machine, the
Cray-XMP, and the most modern Cray-X1.
The actual transformation of your code into vector operations
and parallel tasks can be taken care of automatically by the
compiling/optimizing systems. The compilers available on our
Cray YMP-EL are:
cc Cray C compiler (for details see man cc)
cf77 Cray FORTRAN-77 compiler (for details see man cf77)
It is the task of the programmer, however, to make sure that
the compiler/optimizer can do good work, that is that the
algorithms used are fit for vectorization/parallelization.
That is the topic of the present text.
Vectorization essentially implies that there is an inner
loop where the individual iterations are independent of
each other. Typical example:
x(i) = a(i) * c + b(i)
Here, the calculation for one value of i is totally
independent of the calculation for any other value of i.
This independence ensures that the results
do not depend upon the sequence of operations, i.e. some
operations can be carried out quasi-simultaneously.
The pipelining inherent in vector operations is logically
equivalent to parallel operations.
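Written out as a complete loop, the typical example above might
look as follows (a minimal sketch; the loop bound n and the arrays
are assumed to be declared elsewhere):
c all iterations are independent, so the loop vectorizes
do 10 i=1,n
10 x(i) = a(i) * c + b(i)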
Let us start with a simple example. We will use FORTRAN-77
(in reality, nothing beyond FORTRAN IV is used)
throughout in the hope that it will be understandable to
everyone.
do 1 i=2,64000
1 x(i) = x(i-1) * x(i)
or sequentially
x(2)=x(1)*x(2) x(3)=x(2)*x(3) x(4)=x(3)*x(4) etc
This loop is recursive. Each multiply needs to wait for the
result of the previous multiply. Nothing can be done in parallel.
This loop neither vectorizes nor can it be parallelized
without changing the algorithm. Recursion is to be strictly
avoided. It is bad even for scalar processors, which have been
pipelined to some degree for many years (from the CDC 6600 onward).
This property changes drastically if i-1 is replaced by i+1
in the above loop 1:
x(i) = x(i+1) * x(i)
or sequentially
x(1)=x(2)*x(1) x(2)=x(3)*x(2) x(3)=x(4)*x(3) etc
Now all multiplies are independent of each other; no result is reused.
The only requirement is not to change the sequence of
operations. That is exactly what pipelining provides: the
sequence is preserved, but calculation #2 is started long
before result #1 becomes available. Therefore, this new
loop vectorizes (all operations independent) as well as
parallelizes perfectly. Of course, we have an entirely
different algorithm now presumably solving a totally
different problem.
A slightly more complex case is
do 2 i=1,n
do 2 k=1,n
s=0.0
do 3 j=1,n
3 s=s+a(i,j)*b(j,k)
2 c(i,k)=s
This obviously is a matrix multiply by inner products. The
addition is recursive like the multiplication in loop 1.
Therefore, this code will not vectorize. One could say that
all the (i,k) operations are independent of each other;
therefore the code can be parallelized, i.e. split into
four tasks for the four processors, each task computing some
of the scalar products. However, since the inner loop still does
not vectorize, each processor would be used very inefficiently.
This should therefore not be done; there are more efficient
ways to do a matrix multiply.
One generally helpful recipe for vectorization of recursive
loops (which can be applied only by the programmer, not by any
automatic vectorizer) is the exchange of inner and outer
loops. Look at this rewrite of our matrix multiply
do 4 i=1,n
do 4 k=1,n
4 c(i,k)=0.0
do 5 j=1,n
do 5 i=1,n
do 5 k=1,n
5 c(i,k)=c(i,k)+a(i,j)*b(j,k)
Loop 4 is negligible compared to loop 5. The innermost of
the loops 5 (the k-loop) is vectorizable: the operations
for all k, given fixed i and j, are independent of each other.
So our primary goal, a vectorizable inner loop, has been achieved.
However, the code is far from optimal yet. One reason is
that FORTRAN arrays are stored by column: the first index
runs fastest, i.e. it addresses consecutive storage locations.
The innermost loop should always increment addresses
consecutively, since that is the most efficient form
of memory access. Therefore, it is good to exchange the i-loop
and the k-loop:
do 5 j=1,n
do 5 k=1,n
do 5 i=1,n
5 c(i,k)=c(i,k)+a(i,j)*b(j,k)
Now, the inner loop not only vectorizes, but it does so
with stride 1, the optimum memory access pattern. For given j and k,
all the i elements are independent of each other and are found
in storage with an address increment of 1. This is the optimal vector
loop. Furthermore, n should be a multiple of 64 or large with
respect to 64. This condition arises since the length of the
vector registers within the early Cray machines is 64, so that
any vector operation is broken down into chunks of size 64.
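As an illustration only (a conceptual sketch, not what the compiler
actually emits; the auxiliary variables is and ilim are our own),
the strip-mined form of this loop would look roughly like this:
c the i-loop is processed in chunks of at most 64 elements,
c matching the length of the vector registers
do 8 j=1,n
do 8 k=1,n
do 8 is=1,n,64
ilim=min(is+63,n)
do 8 i=is,ilim
8 c(i,k)=c(i,k)+a(i,j)*b(j,k)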
Next, let us consider parallelization. For fixed j, all operations
in k are independent of each other. Therefore, one could dispatch
n/4 of the k operations to each of the 4 processors, to be executed
independently of each other, and not necessarily in sequence.
However, because of the recursion in j, this must be done for each
value of j independently. That necessitates n synchronization
operations between the 4 processors, quite a big overhead.
Can we improve upon that? Again, exchanging the loop sequence
solves the problem:
do 5 k=1,n
do 5 j=1,n
do 5 i=1,n
5 c(i,k)=c(i,k)+a(i,j)*b(j,k)
The inner (i) loop is the same as before, optimally vectorizable.
The j loop is recursive and can neither be vectorized nor
parallelized. The k-operations are independent, however. So we
can dispatch k=1...n/4, k=n/4+1...n/2, k=n/2+1...3n/4,
k=3n/4+1...n to the four processors to be executed independently.
We do not care about the sequence of k executions. We have only
a single rendezvous operation left: all processors done. This
is the optimum parallelization.
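One way to picture this split (a sketch only: the subroutine name
mm4 and the range arguments k1, k2 are our own, and the actual
dispatch to the four processors is left to the compiler/autotasking
system rather than done by hand like this):
c processor 1 would execute call mm4(c,a,b,n,1,n/4),
c processor 2 call mm4(c,a,b,n,n/4+1,n/2), and so on
subroutine mm4(c,a,b,n,k1,k2)
real c(n,n),a(n,n),b(n,n)
do 5 k=k1,k2
do 5 j=1,n
do 5 i=1,n
5 c(i,k)=c(i,k)+a(i,j)*b(j,k)
return
end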
Another topic in vectorization is irregular memory access
patterns. These may appear in several different forms; most of them
can be dealt with by the gather operation:
do 6 i=1,n
6 v(i)=x(index(i))
and the scatter operation
do 7 i=1,n
7 v(index(i))=x(i)
These two operations are implemented at the hardware level in
modern vector computers like our Cray YMP-EL. In order to work on
irregularly stored data elements, good programming practice
is to collect the required input data into contiguous arrays
(gather), do the computation on these contiguous arrays, and
then distribute the results to the desired locations (scatter
operation). The loops 6 and 7 above will
be recognized by the Cray vectorizer and will be transformed
into the corresponding hardware instructions.
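A minimal sketch of this pattern (the arrays v, x and index are
those of loops 6 and 7; the middle loop is arbitrary and only
stands in for the real computation):
c gather the irregularly stored elements into v
do 10 i=1,n
10 v(i)=x(index(i))
c do the actual work on the contiguous array v
do 11 i=1,n
11 v(i)=2.0*v(i)+1.0
c scatter the results back to their original locations
do 12 i=1,n
12 x(index(i))=v(i)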
Unfortunately the matrix multiply example above is not useful
for trying it out on the YMP-EL. This is because the compiler
in many cases recognizes the intent (a matrix multiply) and,
instead of compiling the loops instruction by instruction,
generates a call to a highly optimized assembler routine for
matrix multiply.
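If you want that library performance explicitly, you could also
call such a routine yourself. This assumes that the standard BLAS
routine SGEMM is provided by the Cray scientific library on your
system (check the library documentation); n, a, b and c are as in
the loops above:
c compute c = 1.0*a*b + 0.0*c with the library routine
call sgemm('n','n',n,n,n,1.0,a,n,b,n,0.0,c,n)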
While this is a very short guide, the basic principles outlined here
should carry you a long way toward writing good Cray YMP-EL programs.
There is also a vast body of literature on these topics; search for
"vectorization" and/or "parallelization". These topics are important
because the difference in runtime between unoptimized or bad code,
i.e. single-processor scalar code, and optimized parallel vector code
can be as large as a factor of 100. For example, matrix multiply
ranges from 4.7 MFLOPS in the worst case to 450 MFLOPS in the best
case on our YMP-EL.