Communication: Unfavorable Strides in Cache Memory Systems (RNR Technical Report RNR-92-015)

An important issue in obtaining high performance on a scientific application running on a cache-based computer system is the behavior of the cache when data are accessed at a constant stride. Others who have discussed this issue have noted an odd phenomenon in such situations: A few particular innocent-looking strides result in sharply reduced cache efficiency. In this article, this problem is analyzed, and a simple formula is presented that accurately gives the cache efficiency for various cache parameters and data strides. © 1995 John Wiley & Sons, Inc.


INTRODUCTION
Scientists accustomed to running large computationally intensive applications on Cray supercomputers have never had to concern themselves with cache issues. However, with the recent sharp rise in the floating point performance of RISC workstations, many scientists are now using these systems for serious computations, and cache issues can no longer be avoided. Another avenue from which supercomputer scientists have been introduced to cache memories is the recent incorporation of RISC processors into highly parallel supercomputers. In any event, it is clear that serious programmers need to understand better how caches operate, so that they can implement their algorithms in ways that optimize potential performance.
An important concept in this article is memory stride, i.e., the increment in memory address, measured in words, between successive elements fetched or stored in the important inner loops of an application program. Many important scientific applications do not feature exclusively stride one data access but instead feature large nonunit strides. For instance, many codes perform similar operations on each dimension of a two- or three-dimensional array. Performing computations in the first dimension of a Fortran program (or the last dimension of a C program) can be done with unit stride, but the strides of the computations in the other dimensions are typically large values, and significantly degraded performance may result when the codes are ported to cache-based systems without change.
One solution to this problem is to rewrite the code to employ array transpositions between the computational steps in each dimension. In this way all computation can be done at unit stride. But such revision may require substantial effort, and it may still not result in significant performance improvement unless the time spent in stride one computation is substantial enough to offset the cost of the array transpositions.
As a result, many problems of this sort are simply ignored, as scientists accept with a certain fatalism that their codes will not perform very well. However, for some programs the reduction in performance is sufficiently large that it is worthwhile to make an effort to understand and alleviate this problem.

DEFINITIONS AND NOTATION
To better understand the phenomenon of performance reduction with strides, consider the following model of a cache memory system. First assume that the cache is configured with $R = 2^r$ cache lines, and assume that each cache line contains $W = 2^w$ words, so that a total of RW words can be cached.
It will be assumed that this cache memory system operates as follows. When a word at a virtual address A is fetched, it is placed in cache location Q, where Q is determined by zeroing the bits in the address to the left of the rightmost r + w bits and then shifting the resulting integer to the right by w bits (i.e., dividing by W). Note that this operation produces an integer Q in the range 0 ≤ Q < R.
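The mapping just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name cache_address is hypothetical, and addresses are treated as plain word indices.

```python
def cache_address(A, r, w):
    """Map word address A to a cache location Q, 0 <= Q < 2**r.

    Hypothetical helper: keeps the rightmost r + w bits of A,
    then shifts right by w bits (i.e., divides by W = 2**w).
    """
    mask = (1 << (r + w)) - 1   # ones in the rightmost r + w bits
    return (A & mask) >> w


# Example with R = 2**5 = 32 lines of W = 2**4 = 16 words each.
r, w = 5, 4
R, W = 1 << r, 1 << w
for A in (0, 15, 16, 73, 511, 512, 585):
    Q = cache_address(A, r, w)
    assert Q == (A % (R * W)) // W   # same result via mod and integer division
    print(A, Q)
```

The bit-mask form and the arithmetic form (A mod RW, then integer division by W) are interchangeable here because R and W are powers of two.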
When a single word is requested, all W words of the W-long cache line that it resides in are also fetched.
Many cache-based systems employ "associativity sets." This means that up to C cache lines with the same cache address, as determined by the mapping function described above, can be stored simultaneously in the cache. In this way, potentially RC lines or RCW words may be cached. When a request is made for data that are not in cache, its cache line replaces one of the C lines currently stored at the cache location where it is assigned by the mapping function. On some systems the least recently used line is replaced, while on others the line replaced is determined by some unspecified "random" procedure. The above model of an associative cache is satisfied by many, but not all, current RISC systems.
If the stride S of a vector fetch is unity, then W consecutive words reside on the same cache line. This is obviously a very favorable situation. The situation is similarly quite favorable if the memory stride is some integer less than W, since in that case many cache lines contain multiple words required by the CPU. Many scientific applications, however, involve strides larger than W, so that each cache line retrieved from memory contains at most one word required by the CPU. This last case will be the focus of this article.
Unfortunately, at some strides even RC words cannot be cached because some of the associativity sets are overutilized, while others are underutilized. Let us consider a vector fetch of L words with stride S and ask what fraction of the L resulting cache lines remain in the cache when the fetch is complete. This question is of interest for two reasons: (1) a computation may need to access this same set of L words again, and (2) if this vector fetch was a single row of a matrix stored in column major order (as in Fortran), the next W rows of the matrix reside in these same cache lines. Either way, performance will be significantly improved if these data can remain in the cache.
Accordingly, the efficiency E of a vector fetch of length L will be defined as T/L, where T is the number of cache lines that still remain in the cache when the vector fetch operation is complete, and where L is the vector length. For simplicity, in the following it will be assumed that L = RC.
An obvious example of an inefficient stride is a large power of two. Then all cache lines will be fetched into the same location of the cache, and the other R - 1 locations will be completely unutilized. In other words, at most C lines of these data can be stored in the cache. The resulting efficiency is only 1/R. Clearly if an application program has arrays whose dimensions are large powers of two, these arrays should be "padded," such as by declaring their leading dimensions (in Fortran) to be slightly larger than a power of two. In this way, accesses of successive rows of data from such an array will have cache addresses that are slightly offset, resulting in much more efficient cache utilization. Most users of Cray systems are familiar with this tuning technique, since it eliminates bank conflicts that may reduce performance by factors as high as 10 or 20 [1].
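As a sanity check on these definitions, the efficiency E = T/L can be estimated by direct simulation. The sketch below is an illustration under stated assumptions, not the author's code: the name simulated_efficiency is hypothetical, the default values R = 32, C = 4, W = 16 are illustrative, the fetched words are taken to lie at addresses kS for k = 1, ..., L, and the stride is assumed to be at least W so that every fetch touches a distinct cache line. Under those assumptions the replacement policy does not matter, because each associativity set simply retains at most C of the lines mapped to it.

```python
from collections import Counter


def simulated_efficiency(S, R=32, C=4, W=16, L=None):
    """Fraction of the L fetched lines still cached after the fetch."""
    if L is None:
        L = R * C                       # the text assumes L = RC
    # Number of fetched lines that map to each cache location Q.
    hits = Counter(((k * S) % (R * W)) // W for k in range(1, L + 1))
    # Each associativity set retains at most C of its lines.
    T = sum(min(n, C) for n in hits.values())
    return T / L


print(simulated_efficiency(512))   # large power of two: every line maps to Q = 0
print(simulated_efficiency(80))    # 80 = 16 * 5 spreads evenly over all 32 addresses
```

With stride 512 only the C lines in a single set survive, which gives the 1/R efficiency noted above.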

CACHE EFFICIENCY WITH NON-POWER-OF-TWO STRIDES
It may come as a surprise to some that large power-of-two strides are not the only particularly unfavorable strides for cache memory systems [4]. To facilitate concrete discussion in the following, we will consider the particular case R = 32, C = 4, and W = 16. These values match the cache parameters of the IBM RS 6000/320 system. We will also assume in the following that the vector length L of the fetch is 128. When S = 72, it turns out that in 128 consecutive fetches, the respective cache lines neatly fill the 32 × 4 array, resulting in perfect utilization of the cache (except that only one word in each cache line may actually be required by the CPU). The resulting efficiency E is unity, even though 72 is divisible by 8, a highly unfavorable situation on many vector computers. Now consider S = 73, a completely favorable stride for most vector computers. In this case the cache efficiency is only about 0.414. The efficiencies for strides 16-256 are shown in Figure 1. This is obviously a very complicated function.
This curious phenomenon has been noted by others [2, 3, 4, 6]. One way to understand it is to list the cache addresses of consecutively fetched cache lines in a 128-long vector fetch, with stride 73, horizontally in a seven-wide table (see Table 1). This table also includes the notation R to indicate instances when a cache replacement would occur. It is clear from examining this table that the root cause of this poor performance is the very nearly periodic behavior of these cache addresses. In particular, these addresses are nearly periodic with a period of seven.
Recall that virtual address bits higher than position r + w are ignored when placing the cache line in the cache. Thus we may in general write the cache address Q of the k-th word fetched as
$$Q(k) = \mathrm{int}\left(\frac{\mathrm{mod}(kS,\, RW)}{W}\right),$$
where int denotes the greatest integer function, and where mod denotes the modulo operation (i.e., the remainder when the first argument is divided by the second). The function Q(k) is precisely periodic with period RW. But when the stride S is exactly (or very nearly) a simple fraction of RW, then this function is also precisely (or very nearly precisely) periodic with period nint(RW/S), where nint denotes the nearest integer function. From these facts one can compute the approximate cache efficiency E for this example (recall that the cache efficiency was defined above as the fraction of cache lines that remain in the cache when the vector fetch is complete). In Table 1, the first 4 × 7 = 28 fetches completely fill cache addresses 4, 9, 13, 18, 22, 27, and 31, except that address nine has one line empty. Thereafter approximately 3/4 of the fetches result in a replacement. Thus we have the approximation
$$E \approx \frac{128 - \frac{3}{4}(128 - 28)}{128} = \frac{53}{128} = 0.4140625,$$
which in this case exactly matches the actual efficiency determined by counting replacements in Table 1.
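A listing in the spirit of Table 1 can be generated from the cache address mapping. The following sketch assumes fetches at addresses kS for k = 1, ..., L (which reproduces the addresses 4, 9, 13, ... quoted above) and marks a fetch with R when its associativity set already holds C lines. The function name address_table is hypothetical.

```python
from collections import Counter


def address_table(S, R=32, C=4, W=16, L=128, width=7):
    """Return rows of (Q, replaced) pairs for fetches k = 1..L."""
    filled = Counter()                  # lines mapped so far to each address Q
    entries = []
    for k in range(1, L + 1):
        Q = ((k * S) % (R * W)) // W
        replaced = filled[Q] >= C       # set already full: a replacement occurs
        filled[Q] += 1
        entries.append((Q, replaced))
    return [entries[i:i + width] for i in range(0, L, width)]


rows = address_table(73)
for row in rows:
    print(" ".join(f"{Q:2d}{'R' if rep else ' '}" for Q, rep in row))

replacements = sum(rep for row in rows for _, rep in row)
print("efficiency:", (128 - replacements) / 128)
```

Printing the rows seven to a line makes the near-periodicity visible: each column drifts downward by one address only after many rows.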
As we have seen, the replacement frequency G = 3/4 used in the above calculation results from the fact that 7 × 73 = 511 differs from 512 by only one. In general, define the minimum difference D as follows:
$$D = \min\left(\mathrm{mod}(RW, S),\; S - \mathrm{mod}(RW, S)\right) = \left| S \cdot \mathrm{nint}(RW/S) - RW \right|.$$
When D is zero (i.e., when S is a large power of two, such as 64), then the corresponding value of G may easily be seen to be unity. When D = 1, then G = 3/4; when D = 2, then G = 1/2; when D = 3, then G = 1/4; and when D ≥ 4, then G = 0. In other words, when D is at least as large as the set associativity size C, then successive fetches move to a different cache address before a given associativity set is exhausted. In general, the replacement frequency G is given by the formula
$$G = \max\left(0,\; 1 - \frac{D}{C}\right).$$
Suppose that S/(RW) is very close to a simple fraction a/b, b ≤ R, so that D = |bS - aRW| is small. Compute G from the above formula. Generalizing from the above example, note that the first bC fetches will completely fill the b associativity sets whose addresses are those that nearly repeat. Thereafter, the fraction G (approximately) of the fetches will result in replacements. Thus a general formula that is an approximation to the cache efficiency E for general strides and cache parameters is given by
$$E = \frac{L - G\,(L - bC)}{L}.$$
Note that when L is large, E ≈ 1 - G as expected.
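The approximation can be evaluated directly once a fraction a/b is known. In the sketch below the function names are hypothetical, and the closed form G = max(0, 1 - D/C) is inferred from the values of G listed above for C = 4. The fraction a/b is supplied by hand rather than computed.

```python
def replacement_frequency(D, C):
    """G = max(0, 1 - D/C): fraction of later fetches that cause a replacement."""
    return max(0.0, 1.0 - D / C)


def formula_efficiency(S, a, b, R=32, C=4, W=16, L=128):
    """Approximate efficiency E = (L - G (L - b C)) / L for S/(RW) close to a/b."""
    D = abs(b * S - a * R * W)          # minimum difference for this fraction
    G = replacement_frequency(D, C)
    return (L - G * (L - b * C)) / L


# Stride 73: 73/512 is close to 1/7, so a = 1, b = 7, D = 1, G = 3/4.
print(formula_efficiency(73, a=1, b=7))
# Stride 64: 64/512 = 1/8 exactly, so D = 0 and G = 1.
print(formula_efficiency(64, a=1, b=8))
```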
A graph of the efficiencies for various strides in the standard case used above, computed with the above formula, is shown in Figure 2. By comparing Figures 1 and 2, it is clear that this formula is very accurate, particularly at the "spikes," which are the cases of greatest interest. In fact, the replacement count G(L - bC), which is the key subexpression of this formula, is (with one exception) always within one of the actual value whenever G is nonzero.

A RANDOM STRIDE APPROXIMATION
When the difference D is greater than C, the formula above gives perfect efficiency, since G in that case is zero. However, the actual efficiency is somewhat less than unity for many such cases, resulting in a low-level background "noise" (compare Figs. 1 and 2). This phenomenon can be explained by noting that when the stride S is a substantial fraction of RW, the operation mod(kS, RW) is a good pseudorandom number generator, and a certain number of "collisions" can be expected to occur in the resulting cache addresses. In fact, this operation is a member of the widely studied class of linear congruential pseudorandom number generators [5, p. 9].
If one assumes that the assignment of memory fetches to the R addresses is actually random, then one can compute the expected cache efficiency by applying techniques of probability and statistics. The probability P(k) that an individual address contains exactly k entries after an L-long fetch is given by the formula for a binomial distribution:
$$P(k) = \binom{L}{k}\, p^{k} (1 - p)^{L - k},$$
where p = 1/R. The expected number of replacements F is then
$$F = R \sum_{k=C+1}^{L} (k - C)\, P(k).$$
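The random stride estimate is easy to evaluate numerically. The sketch below assumes that each address holding k > C entries accounts for k - C replacements, so that F is R times the expected excess per address and the expected efficiency is (L - F)/L. The function name random_stride_efficiency is hypothetical.

```python
from math import comb


def random_stride_efficiency(R=32, C=4, L=128):
    """Expected efficiency if fetches land on the R addresses at random."""
    p = 1.0 / R

    def P(k):
        # Binomial probability that a given address receives exactly k lines.
        return comb(L, k) * p**k * (1.0 - p)**(L - k)

    # Expected replacements: an address holding k > C lines has lost k - C of them.
    F = R * sum((k - C) * P(k) for k in range(C + 1, L + 1))
    return (L - F) / L


print(random_stride_efficiency())
```

For R = 32, C = 4, and L = 128 this evaluates to roughly 0.8077.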

FINDING SIMPLE FRACTIONS
One detail was omitted from the above discussion: how can one compute the minimum difference D for a given stride, or in other words, how does one determine the best simple fraction approximation a/b to S/(RW)? The straightforward scheme of computing |bS - aRW| for all pairs of integers a and b less than R, in order to find the minimum value of this expression, is time-consuming when R is even moderate in size.
A more direct and elegant means to find these rational approximations a/b is to employ the Euclidean algorithm [5, p. 319] as follows. Start with the 2-long vector V = (S, RW) and the 2 × 2 identity matrix. At a given step let x be the smaller entry of V, let y be the larger entry, and let X and Y be the columns of the 2 × 2 matrix corresponding to x and y. Compute q = int(y/x). Then replace y by y - qx and X by X + qY. This process continues until one entry of the vector V is zero. At that point one column of the final matrix will contain the original vector (with any common factor divided out) and the other column will contain a close rational approximation. In this application, the Euclidean algorithm may be halted whenever an entry of the matrix exceeds R.
The operation of this algorithm in this application is more easily understood by an example. Let us consider the particular parameters as above, with the stride S = 197. In other words, we wish to find a good simple fraction approximation a/b to 197/512. The algorithm proceeds as shown below. The value of q used in each step (computed from the previous step's vector) is shown at the right.
Step   Vector V      Matrix columns          q
0      (197, 512)    (1, 0)      (0, 1)
1      (197, 118)    (1, 2)      (0, 1)      2
2      (79, 118)     (1, 2)      (1, 3)      1
3      (79, 39)      (2, 5)      (1, 3)      1
4      (1, 39)       (2, 5)      (5, 13)     2
5      (1, 0)        (197, 512)  (5, 13)     39

In this case the desired pair of integers (a, b) is in the next-to-last column generated in the matrix, i.e., (5, 13). Note that 5/13 = 0.384615..., which is very close to 197/512 = 0.384765.... In this particular example, where S = 197, the resulting values a = 5 and b = 13 yield D = 1, so that G = 0.75 and E = 0.5546875.
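The matrix form of the Euclidean algorithm described above can be sketched in Python as follows. The function names euclid_columns and closest_simple_fraction are hypothetical. Choosing the generated column with b ≤ R that minimizes |bS - aRW| is one reading of the selection rule, not necessarily the author's exact procedure. For simplicity the sketch runs the algorithm to completion instead of halting once a matrix entry exceeds R.

```python
def euclid_columns(S, RW):
    """Run the matrix form of the Euclidean algorithm on V = (S, RW).

    Returns the matrix columns (a, b) in the order they are generated.
    """
    v = [S, RW]
    cols = [(1, 0), (0, 1)]             # columns of the 2 x 2 identity matrix
    generated = []
    while v[0] != 0 and v[1] != 0:
        i = 0 if v[0] <= v[1] else 1    # index of the smaller entry x
        j = 1 - i                       # index of the larger entry y
        q = v[j] // v[i]
        v[j] -= q * v[i]                # y <- y - q x
        cols[i] = (cols[i][0] + q * cols[j][0],
                   cols[i][1] + q * cols[j][1])   # X <- X + q Y
        generated.append(cols[i])
    return generated


def closest_simple_fraction(S, R=32, W=16):
    """Pick the generated column (a, b), b <= R, with the smallest |bS - aRW|."""
    RW = R * W
    candidates = [(a, b) for a, b in euclid_columns(S, RW) if 0 < b <= R]
    if not candidates:
        return None                     # no usable column was generated
    return min(candidates, key=lambda ab: abs(ab[1] * S - ab[0] * RW))


print(euclid_columns(197, 512))
a, b = closest_simple_fraction(197)
print(a, b, abs(b * 197 - a * 512))
```

For S = 197 the columns come out in the order (1, 2), (1, 3), (2, 5), (5, 13), (197, 512), so the selected pair is (5, 13) with D = 1.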

IMPROVING CACHE PERFORMANCE OF DATA ACCESS WITH STRIDES
We have demonstrated a fairly simple scheme that can accurately predict the phenomenon of unusual slow-downs for particular strides. It should be emphasized, however, that the above analysis and conclusions depend on the particular model assumed above for an associative cache. This model is satisfied by many, but not all, of the currently popular RISC systems. What can a programmer do if his or her program features a particularly unfavorable stride? The most straightforward solution is to "pad" (slightly increase) the dimensions of arrays having such dimensions. This solution has the advantage that in most cases only dimension statements need to be changed, and the executable part of the program does not need to be altered. Some space is "wasted" in this manner, but the resulting performance improvement is almost certainly worth the additional memory required.
There does not appear to be a simple formula giving the optimal amount of padding for a given unfavorable stride (i.e., array dimension) S, but in practice it suffices to merely evaluate the efficiency function described above for S + 1, S + 2, etc. until an efficient stride is found. In examples the author has studied, it appears that a pad of only one or two is effective in most cases.
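The search over S + 1, S + 2, ... is straightforward to automate. The sketch below is only an illustration: it substitutes a direct simulation of the cache model for the closed-form efficiency formula, and the function names, the 0.9 efficiency threshold, and the max_pad limit are all assumptions made for the example.

```python
from collections import Counter


def simulated_efficiency(S, R=32, C=4, W=16, L=128):
    """Direct simulation, used here in place of the closed-form formula."""
    hits = Counter(((k * S) % (R * W)) // W for k in range(1, L + 1))
    return sum(min(n, C) for n in hits.values()) / L


def find_pad(S, threshold=0.9, max_pad=16):
    """Smallest pad p >= 0 such that stride S + p reaches the threshold."""
    for pad in range(max_pad + 1):
        if simulated_efficiency(S + pad) >= threshold:
            return pad
    return None                          # no efficient stride within max_pad


pad = find_pad(73)
print(pad, simulated_efficiency(73 + pad))
```

A compiler or preprocessor could apply the same loop to each leading array dimension, which is the kind of automatic adjustment the article goes on to suggest.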
However, this type of tuning should not be necessary, nor should it be necessary for programmers to analyze whether their strides are unfavorable. By applying techniques such as those described in this article, compilers should be able to detect unfavorable strides and automatically adjust the appropriate array dimensions. Such adjustments will need to be optional, since they technically depart from the Fortran 77 standard, but they will likely be welcomed by the majority of users who prefer the compiler to shield them from such unsavory features of the underlying architecture.

FIGURE 2 Cache efficiencies using the formula.
In the random stride approximation, the sum defining F runs over k ≥ C + 1, and the resulting expected efficiency is E = (L - F)/L. For the example parameters above, this formula yields E = 0.807714.... The actual average efficiency, determined from the data in Figure 1, is 0.892334.... This indicates that the operation mod(kS, RW) actually behaves somewhat better than a true random number generator.
Here the final column generated, (197, 512), is identical to the original vector. If S is divisible by a power of two, then the final column generated will be the original vector with the common power of two divided out. In that case, and if both entries of the final column are less than or equal to R, then this final column should be selected for (a, b) instead of the previously generated column. If for a given stride S, no pair (a, b), b ≤ R is found that satisfies |bS - aRW| < C, then the periodic effect does not exist, and the stride may be considered a favorable stride.