A Cache System Design for CMPs with Built-In Coherence Verification

This work reports an effective cache system design for Chip Multiprocessors (CMPs). It introduces built-in logic for verification of cache coherence in CMPs realizing a directory based protocol. The design is developed around the cellular automata (CA) machine, invented by John von Neumann in the 1950s. A special class of CA, referred to as single length cycle 2-attractor cellular automata (TACA), has been employed to detect inconsistencies in the cache line states of processors' private caches. The TACA module captures the coherence status of the CMPs' cache system and memorizes any inconsistent recording of cache line states during the processors' references to a memory block. Theory has been developed to enable a TACA to analyse the cache state updates and then settle to an attractor state, indicating a quick decision on a faulty recording of cache line status. The introduction of segmentation of the CMPs' processor pool ensures better efficiency in determining the inconsistencies by reducing the number of computation steps in the verification logic. The hardware requirement of the verification logic shows that the overhead of the proposed coherence verification module is much lower than that of conventional verification units and is insignificant with respect to the cost of the CMPs' cache system.


Introduction
The continual search for performance enhancement in computation has resulted in a variety of modifications to processor design. This ultimately led to the inevitable transition toward multicore architectures, the Chip Multiprocessors (CMPs), with thousands of processor cores on a chip. The increasing number of cores in CMPs, however, threatens the reliability and dependability of a design [1]. A number of works [2][3][4][5] have addressed these issues from different perspectives. Further, the low supply voltage in today's semiconductor technology narrows the noise margin and increases susceptibility to the various factors causing transient faults [6] in CMPs.
Most of the fault tolerant schemes reported so far for CMPs are based on spatial redundancy techniques, which may not be effective for faults in on-chip hardware components [2]. Although the cache and memory components are protected by Error Correcting Codes and other techniques, the logic circuits commonly serving the multiple cores remain error prone [2].
A CMP's memory subsystem is made up of multilevel caches, including a private cache for each processor core. It demands a very efficient realization of cache coherence protocols. The cache coherence controller (CC) is dedicated to ensuring coherency of shared data in the CMPs' cache system. Such a prime hardware component can also be subject to faults as well as design defects. A fault in the CC has a serious effect on the correctness of computation as well as on the power efficiency of a system. The schemes proposed in the literature [2,4,7,8] for ensuring coherency in CMPs with thousands of cores incur huge communication overhead along the global wires. In [2], a verification logic has been proposed to detect errors in the coherence controller. It targets a system realizing a snoopy protocol.
The snoopy protocols are easy to implement but are not very scalable [9]. For large scale CMPs, updating and invalidating caches following snoop based protocols become impractical [10].

[Figure 1: The tiled CMPs organization — 16 tiles, each with a CPU core, router, directory, L1 I/D caches, and an L2 cache slice.]

Several variants of directory based coherence protocols have been proposed in [5,10,11]. However, the verification of cache data inconsistencies resulting from defects in such systems is yet to be addressed. The above concerns motivate us to design an effective cache system for CMPs by developing a scheme to determine the accuracy of maintaining data consistency in a CMPs' cache system realizing a directory based protocol. It targets the design of built-in logic for verification of cache coherence that can function at speed and is cost effective. To explore such a design, we consider the cellular automata (CA) tool [12], invented by John von Neumann in the 1950s. As a CA can handle a large volume of data and can efficiently be employed to make a decision, a CA based built-in verification logic is proposed to ensure accuracy in the functioning of the directory based cache system in tiled CMPs [11]. A special class of CA structure, referred to as the single length cycle 2-attractor cellular automata (TACA), is introduced for the design. The TACA analyses the status of the CMPs' cache updates and settles to an attractor state (point state), indicating any faulty recording of the cache line status and the sharing vector [9] stored in the directory of the cache system. The hardware realization of the CA based design enables a quick decision on cache coherency. Further, the introduction of segmentation of the CMPs' processor pool assures better efficiency of the design.
It reduces the computation steps while making a decision on any inconsistency in each cache update. The basic concept of the CA based solution is reported in [13,14]. The precise contributions of this paper can be summarized as follows: (i) A built-in verification logic for the CMPs' cache system, realizing a directory based protocol, is proposed. For ease of understanding, the design is detailed with the basic 3-state MSI protocol. However, the proposed methodology is also applicable to MESI, MOSI, MOESI, and other protocols.
(ii) The verification logic is developed around an unconventional tool, called cellular automata (CA). The modular structure of the CA is exploited to enable scalable design.
(iii) Design of a high-speed verification unit for cache, harnessing the feature of CA to memorize information, is reported.
(iv) The verification unit is realized for full map as well as for the limited directory based cache systems.
(v) Relevant CA theory has been developed to provide the theoretical basis of the cache system design.
(vi) Experimental results establishing the claim have been reported.
The following section (Section 2) highlights the coherence issues in CMPs' cache system. Section 3 narrates CA theory relevant for the current design. The CA based verification unit is introduced in Section 4, and Section 5 describes the design in detail. The hardware realization and delay overhead reduction through introduction of segmentation are reported in Sections 6 and 7, respectively. A test structure that memorizes the inconsistent recording of cache line states during the processors' references to a memory block is reported in Section 8. The simulation results establishing the effectiveness of the CA based design and its hardware requirement are reported in Section 9. A sketch of the verification unit for limited directory based system is shown in Section 10. In Section 11, we provide a brief on the related works. Section 12 concludes the paper.

Cache Coherence in CMPs
In Chip Multiprocessors (CMPs) with a large number of on-chip cores, a shared bus is not a good choice due to area overhead and bus contention [11]. The alternative to the shared bus is a tiled CMPs architecture, composed of an array of identical building blocks (tiles) connecting the cores with a point-to-point unordered network [11,15]. In this work, we consider the ATAC processor architecture [15], which uses a tiled multicore architecture (Figure 1). It provides a low-latency, energy-efficient network for global, long distance communication. Each core in ATAC consists of a private L1 and a shared L2 cache. The L2 typically follows Nonuniform Cache Access (NUCA). An L1 cache miss in ATAC generates coherence messages. The other L1 cache line states of the system are updated in accordance with these coherence messages.
In general, for large scale CMPs, a directory based cache coherence system is desirable [11]. A directory, in a directory based system, is a collection of sharing vectors [9]. Each vector corresponds to a data block B and maintains pointers to the processors that have cached copies of B. The directory also stores the state of the cached copies in the L1 caches. The organization of a sharing vector is shown in Figure 2. The "d" specifies the dirty bit; d = 1 (true) implies that some cache holds the latest copy of B and the main memory copy is invalid. The p_i's are the presence bits; p_i = 1 implies that processor P_i holds a cached copy of B.
In the current design of the coherence verification logic/unit, we consider a distributed directory based system. Figure 3 describes the logical steps followed to maintain cache coherence on a read miss in CMPs realizing the distributed directory based protocol. The system consists of processors P_i, P_j, and P_k with local memory modules L2_i, L2_j, and L2_k, respectively. The distributed directories D_i, D_j, and D_k are also local to the corresponding processors P_i, P_j, and P_k. In such a system, if a processor (P_i) requests block B, which is not present in P_i's L1 (C_i), P_i encounters a read miss and consults its communication assistance unit to find the home (say processor P_j, i.e., L2_j) of block B. The request then goes to P_j's site and the directory D_j is consulted. If the sharing vector of B, stored in D_j, shows that d = 0, that is, L2_j has the valid copy of block B, then P_j sends block B to P_i and updates the sharing vector corresponding to B at D_j by setting p_i = 1. On the other hand, if d = 1, that is, the copy of block B at L2_j is not a valid copy and P_k holds the dirty copy (as shown in Figure 3), then P_k sends block B to P_i and also writes it back to L2_j (the home site for block B). P_j then updates the sharing vector of block B at D_j. Here we have assumed a 4-hop communication (request to directory → reply with owner → request to owner → reply to requester) to resolve the read miss. However, the proposed verification logic is also compatible with a system realizing 3-hop communication (request to directory → forward to owner → reply to requester), as in [16].
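The directory-side update on a read miss, as walked through above, can be sketched in a few lines of Python. This is an illustrative model only: the class and function names are our own, and the message hops between sites are abstracted away.

```python
class SharingVector:
    """Full-map sharing vector for one block B (cf. Figure 2)."""
    def __init__(self, n_cores):
        self.d = 0                 # dirty bit
        self.p = [0] * n_cores     # presence bits p_1 ... p_n

def handle_read_miss(vec, requester):
    """Directory-side update at the home node on a read miss by a core."""
    if vec.d == 1:
        # the owner holds the dirty copy: it supplies B to the requester
        # and writes B back to the home L2, so the block becomes clean
        vec.d = 0
    # in either case the home records the requester as a sharer
    vec.p[requester] = 1
    return vec
```

For example, if P_k owns a dirty copy (d = 1, p_k = 1) and P_i read-misses, the model clears d, keeps p_k set, and sets p_i.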

CA Preliminaries
A cellular automaton (CA) consists of a number of cells organized in the form of a lattice. It can be viewed as an autonomous finite state machine (FSM). In a two-state 3-neighborhood CA, each CA cell stores either 0 or 1, referred to as the present state (PS) S_i(t) of the ith cell at time t. The next state (NS) of the cell at (t + 1) is

S_i(t + 1) = f_i(S_{i-1}(t), S_i(t), S_{i+1}(t)),

where S_{i-1}(t) and S_{i+1}(t) are the present states of the left and right neighbors of the ith cell at time t and f_i is the next state function (Figure 4). The state of all the cells S(t) = (S_1(t), S_2(t), ..., S_n(t)) at t is the present state of the CA. Therefore, the next state of an n-cell CA is

S(t + 1) = (f_1(S_0, S_1, S_2), f_2(S_1, S_2, S_3), ..., f_n(S_{n-1}, S_n, S_{n+1})).

The next state function of the ith CA cell can be expressed in the form of a truth table (Table 1). The decimal equivalent of the 8 outputs (NS) is called the "rule" of the cell [12]. There are 256 rules in a 2-state 3-neighborhood CA. Two such rules, 254 and 255, are illustrated in Table 1. The first row lists the 2^3 (8) possible combinations of present states S_{i-1}, S_i, and S_{i+1}. The last two rows indicate the next states of the ith cell at time (t + 1), defining rule 254 (S_i(t+1) = S_{i-1} + S_i + S_{i+1}, where + denotes logical OR) and rule 255 (S_i(t+1) = 1), respectively.
The rule vector R = ⟨R_1, R_2, ..., R_i, ..., R_n⟩ configures the cells of a CA. If all the R_i's are the same, that is, R_1 = R_2 = ⋅⋅⋅ = R_n, the CA is a uniform CA; otherwise, it is a nonuniform/hybrid CA [17]. In Figure 4, the left (right) neighbor of the leftmost (rightmost) terminal cell is permanently fixed to the 0-state. Such a CA is a null boundary CA. Each combination of present states in the first row of Table 1 is a Min Term of the 3-variable (S_{i-1}, S_i, S_{i+1}) switching function and is referred to as a Rule Min Term (RMT).
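The rule mechanics above can be made concrete in a few lines of Python. This is a sketch with our own helper name: `ca_step` models a null boundary CA whose ith cell applies the 8-bit rule `rules[i]` as a lookup table indexed by the RMT number.

```python
def ca_step(state, rules):
    """One synchronous step of a 2-state 3-neighborhood null boundary CA."""
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0       # null boundary
        right = state[i + 1] if i < n - 1 else 0  # null boundary
        rmt = (left << 2) | (state[i] << 1) | right   # RMT number 0..7
        nxt.append((rules[i] >> rmt) & 1)             # bit `rmt` of the rule
    return nxt

# Rule 254 is the OR of the three neighbors; rule 255 always outputs 1:
# ca_step([0, 0, 1, 0], [254] * 4) → [0, 1, 1, 1]
```

With rule 254, a single "1" spreads to its neighbors at every step, while the all 0s and all 1s states map to themselves.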

Definition 1 (RMT). A combination of present states shown in the 1st row of Table 1 is an RMT. Column 011 of Table 1 is the 3rd RMT. The next states corresponding to this RMT are 1 for both rules 254 and 255.
Definition 2 (reversible and irreversible CA). A CA is reversible if its states form only cycles in the state transition diagram (all states are reachable); otherwise, it is irreversible (Figure 5).

Definition 3 (attractor and attractor basin). A set of states of a CA that forms a loop (cycle) is called an attractor. An attractor (α) forms an α-basin with the states that lead to the attractor.
Definition 4 (depth). The depth of a CA is defined as the length of the longest path from a state to an attractor in the state transition diagram. The depth of the CA shown in Figure 5 is 5 (2→10→6→4→5→7).
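Definitions 3 and 4 can be checked mechanically for a small CA. The brute-force sketch below (our own helper names; null boundary, rules given as 8-bit lookup tables) enumerates all states, finds the single length cycle attractors (fixed points), and computes the depth; it assumes every state eventually reaches a fixed point, which holds for the rule-254 CA used later in the design.

```python
from itertools import product

def ca_step(state, rules):
    """One step of a 2-state 3-neighborhood null boundary CA."""
    n = len(state)
    out = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        rmt = (left << 2) | (state[i] << 1) | right
        out.append((rules[i] >> rmt) & 1)
    return out

def attractors_and_depth(n, rules):
    """Fixed points and the longest distance to a fixed point (the depth)."""
    nxt = {s: tuple(ca_step(list(s), rules))
           for s in product([0, 1], repeat=n)}
    fixed = [s for s in nxt if nxt[s] == s]
    def dist(s):
        k = 0
        while nxt[s] != s:      # assumes every state reaches a fixed point
            s, k = nxt[s], k + 1
        return k
    return fixed, max(dist(s) for s in nxt)
```

For the 4-cell uniform CA ⟨254, 254, 254, 254⟩ this reports exactly two attractors (all 0s and all 1s) and depth 3 = n − 1.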
Definition 5 (passive and active RMT). An RMT of a rule is passive if the next state it defines equals the present state S_i of the cell; otherwise, it is active. For example, the next state for RMT 0 (000) is 0 and that for RMT 2 (010) is 1 in rule 254 (Table 1); that is, these two RMTs are passive. However, RMT 0 in rule 255 is active as its next state is 1.

[Figure 3: Directory based protocol: read miss to a block — requester P_i, home directory, and owner of B (P_k).]

Overview of Verification Logic
A defect in the computing logic of the cache coherence controller (CC) can lead to faulty next state computation in CMPs. Even if the computing logic operates correctly, faulty recording of state(s) may result from fault(s) in the communication network of the CMPs' cache system. These faulty recordings introduce inconsistencies in the cache data states. The identification of cache data inconsistencies in such a system, while maintaining cache coherence, is addressed in [2]. The solution reported there detects errors in a CC; it targets a snoopy protocol based cache system and involves complex data structures as well as computation intensive steps. In [18], we also report the design of a verification unit for a CC working in a snoopy protocol based system. The cache coherence protocol in the ATAC processor architecture, named ACKwise, couples directory and snoopy protocols [15]. In addition to probable defects in the CC and faults in the communication network, the faulty update of the sharing vector is a major concern in ATAC. The sharing vector may be subjected to faults even if the sharing status of a cache block is recorded correctly, making the coherence verification process hard to realize. Incorporating separate verification units for the sharing status and the sharing vector would be extremely costly. Therefore, we formulate the problem of coherence verification in ATAC-like architectures as the verification of the compatibility of sharing status and sharing vector on each cache state update. To prove the effectiveness of the proposed cellular automata (CA) based verification logic, we consider the ATAC (tiled CMPs) architecture realizing the directory based cache coherence system with the MSI protocol. However, the scheme is also applicable to MESI/MOSI/MOESI protocol based designs. Figure 6 describes the basic 3-state MSI protocol.
Table 2 displays the effect of faulty recording in the sharing vector, resulting in a faulty ("F") state (column 5) with a full-map directory. The entry "All p's are 0s," in column 1 of the first row, represents that none of the processors'/tiles' L1 caches has a copy of block B. On the other hand, the entry "All p's are 1s," in column 1 of the second row, represents that all the processors' caches have a copy of B. Similarly, the entry "p_i is 1 and all others are 0s" indicates that only processor P_i has a copy of block B and there is no other cached copy of B.
On a read/write operation (event) of a processor P_i (Figure 3), the corresponding sharing vector for block B is noted in column 3 of Table 2. The contents of column 4 represent the possible faulty recordings of the sharing vector. For example, row 1 of column 1 notes the sharing vector prior to "P_i read". Initially, B does not have a cached copy (represented by "All p's are 0s"). On a read by P_i (noted in column 2 of row 1), the desired sharing vector is "p_i is 1 and all others are 0s" (column 3 of row 1). All the possible faulty recordings of the sharing vector are shown in column 4. Column 5 notes the effect of the fault ("faulty (F)"). The classification of faulty cases is noted in column 6. The consideration of columns 1, 4, and 5 of Table 2 indicates that the proposed fault detection unit should respond with "F" for the states of cached copies of B (cache lines for B) when the following cases occur:

Case 1. Processor P_i reads/writes block B, but P_i's presence bit (p_i) in the sharing vector for B is not updated to "1"; that is, it is still "0".

Case 2. P_i reads/writes and some other processors' (P_j's) presence bit(s) (p_j's) is (are) set to "1" instead of P_i's presence bit (p_i).
Case 3. P_i writes and some other processors' (P_j's) presence bits are not updated; that is, the p_j's remain "1".
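The fault-free sharing-vector update implied by Table 2 and the three cases can be sketched as follows. This is an illustrative model with our own names: `expected_vector` computes the presence bits a correct update would produce, and `classify` flags any deviating recording as "F".

```python
def expected_vector(prior_p, event_core, op):
    """Fault-free presence bits after `event_core` performs `op` on block B."""
    if op == 'read':
        p = list(prior_p)
        p[event_core] = 1          # Case 1: p_i must become 1
    else:                          # 'write'
        p = [0] * len(prior_p)     # Case 3: the other sharers are invalidated
        p[event_core] = 1
    return p

def classify(prior_p, event_core, op, recorded_p):
    """'NF' if the recorded vector matches the fault-free update, else 'F'."""
    return 'NF' if recorded_p == expected_vector(prior_p, event_core, op) else 'F'
```

For instance, a read by P_1 that leaves p_1 at "0" (Case 1) or a write by P_2 that leaves another sharer's bit at "1" (Case 3) is classified as "F".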
It is to be noted that Cases 2 and 3 cover all possible faults that affect more than one bit. The proposed CA based logic (shown in Figure 7) to realize the verification in the cache coherence system, therefore, is designed so that it correctly responds with either "NF" (nonfaulty) or "F" (faulty) following the above three cases. It employs an n-cell CA for CMPs with private (L1) caches C_1, C_2, ..., C_n, where C_i is the cache attached to processor P_i and the ith CA cell corresponds to cache C_i. Now, with each read/write operation, the cache line state (2-bit sharing status Q_i0 Q_i1, assuming the MSI protocol) as well as the presence bits (p's) in the sharing vector are updated. In a fault-free system, the updates of the sharing status and the sharing vector should be compatible. Therefore, the sharing status, more specifically the MSB of the sharing status (in the current design), and the presence bits of the sharing vector are fed as input to the verification unit to form the n-bit compatibility status (CS). The n-cell CA of the verification unit is then run for a certain number (t) of time steps with CS as the seed. The CA settles in one of two attractors, indicating "NF" or "F".

[Figure 6: The MSI protocol state transitions on P_i/P_j read and write hits and misses, with invalidation signals to the sharers; Figure 7: the verification logic fed by the 2n-bit sharing status.]

Input. Presence bits and MSBs of sharing status of cache lines.
Output. Decision on presence of fault.
(1) Form the n-bit compatibility status (CS), where CS_i = Q_i0 ⊕ p_i, for all i = 1 to n.
(2) Run the CA selected for the verification logic for t time steps (t = (n − 1) for the current design) with CS as the initial state (seed). It reaches the attractor state.
(3) Check the least significant (LS) cell of the CA (LSB of the attractor). If it is "0", the recording of sharing status and sharing vector is fault-free; otherwise, it is faulty.
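The three steps above can be sketched directly in Python (an illustrative model with our own helper names; the CA is the uniform null boundary rule-254 CA described in the next section):

```python
def ca_step(state, rules):
    """One step of a 2-state 3-neighborhood null boundary CA."""
    n = len(state)
    out = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        rmt = (left << 2) | (state[i] << 1) | right
        out.append((rules[i] >> rmt) & 1)
    return out

def cavu_full_dir(msb_status, presence):
    """msb_status[i] = MSB (Q_i0) of cache C_i's line state,
    presence[i] = presence bit p_i of the sharing vector."""
    n = len(presence)
    state = [q ^ p for q, p in zip(msb_status, presence)]  # step (1): CS
    for _ in range(n - 1):                                 # step (2): t = n - 1
        state = ca_step(state, [254] * n)
    return 'NF' if state[-1] == 0 else 'F'                 # step (3): LS cell
```

For example, after a fault-free "P_1 read", msb_status = presence = [1, 0, 0, 0] gives "NF", whereas a missed presence-bit update (presence = [0, 0, 0, 0]) gives "F".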

The Design of CAVU FULL-DIR
The cellular automata (CA) based solution for the verification of cache inconsistencies, introduced in the earlier section, demands the following: (i) The CA constructed for the verification unit CAVU FULL-DIR should form single length cycle attractors.
(ii) The number of attractors should be minimal, preferably two: one (α_1) corresponding to "NF" (nonfaulty) and the other (α_2) corresponding to "F" (faulty), and the CA should correctly report all the faulty cases of Table 2.
(iii) The attractors α_1 and α_2 should differ in at least one position (say, the LSB) so that the decision on "NF" or "F" can be taken at speed by sensing a single bit of the attractor, that is, the least significant cell of the CA.
The effect of a fault propagates through the CA as it strides over time and affects the LSB of the attractor. Therefore, the incidence of a fault is translated as a switch from the α_1-attractor basin to the α_2-basin (say, from the 0-basin of Figure 8 to its 1-basin). The following subsections characterize the single length cycle attractor CA that can be employed for the current design.

Single Length Cycle Attractor

The next state of a single length cycle attractor is the attractor itself. In a single length cycle attractor CA, for at least one RMT (Section 3) of each cell rule R_i of the CA, the cell is passive (Definition 5). It implies that the state change in cell i is S_i → S_i; that is, the cell retains its state. This is summarized in the following property.
In [19], based on Property 1, the 256 CA rules are classified in 9 groups (groups 0-8). Rule 254 (11111110) is in group 5 as it follows Property 1 for 5 RMTs (RMTs 0, 2, 3, 6, and 7 are passive (Table 1)). A CA configured with the rules that maintain Property 1 for its RMTs is a probable CA with the single length cycle attractors [19]. The construction of single length cycle 2-attractor CA ( Figure 8) can be followed from the theory of Reachability Tree for attractors introduced in [20].

Reachability Tree for Attractors.
The Reachability Tree for Attractors (RTA) is a binary tree representing the single length cycle attractors of a CA [20]. Each node is constructed with RMT(s) of a rule that follows Property 1. The left edge is the 0-edge and the right edge is the 1-edge (Figure 9). For an n-cell CA, the number of levels in the tree is (n + 1). The root node is at level 0 and the leaf/terminal nodes are at level n. The nodes at level i are constructed from the RMTs of the (i + 1)th CA cell rule R_{i+1}. The decimal numbers within a node at level i represent the RMTs of rule R_{i+1} based on which cell (i + 1) can change its state. The RMTs of a rule for which we follow the 0-edge or the 1-edge are noted in brackets. The number of leaf nodes in an RTA denotes the number of single length cycle attractors of the CA, and a sequence of edges from the root to a leaf node, representing an n-bit binary string, is the attractor state. The 0-edge and 1-edge represent 0 and 1, respectively. For example, the number of single length cycle attractor states in the CA ⟨254, 254, 254, 254⟩ of Figure 9 is 2 (α_1 and α_2). The root node (level 0) of the RTA is constructed from the passive RMTs 0, 2, and 3, as cell 1 (rule 11111110) can change its state following any one of these passive RMTs (null boundary). As the state of the left neighbor of cell 1 is always 0, the passive RMTs 6 and 7 of rule 254 are don't care RMTs for cell 1. Similarly, as the right neighbor of cell 4 is always 0, the RMTs 3, 5, and 7 are don't care for cell 4. The RMTs of two consecutive CA cell rules R_i and R_{i+1} are related while the CA changes its state [20]. This relationship between the RMTs of R_i and R_{i+1} is shown in Table 3: if the ith CA cell changes its state following RMT k, then the (i + 1)th cell changes its state following RMT 2k mod 8 or (2k + 1) mod 8. For example, if the 1st cell in Figure 9 changes its state following RMT 0, then the 2nd cell changes its state following RMT 0 or 1 (from Table 3).
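The RMT relation of Table 3 can be expressed as a one-line helper (our own name; the mod-8 form is inferred from the examples in the text, e.g., RMT 0 of a cell leads to RMTs 0 and 1 of the next cell):

```python
def rmt_successors(k):
    """RMTs of rule R_{i+1} reachable when cell i changes state on RMT k."""
    return [(2 * k) % 8, (2 * k + 1) % 8]
```

For instance, RMT 3 of R_i leads to RMTs 6 and 7 of R_{i+1}, and RMT 5 leads to RMTs 2 and 3, consistent with the hybridization proof later in the paper.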
The RTA of Figure 9 can be generalized to depict the attractors for an n-cell CA. The generalized RTA of the CA ⟨254, 254, ..., 254⟩ is shown in Figure 10. The 0-edge at node B of Figure 9 evolves to node D with the same set of RMTs ({0}); that is, nodes B and D are equivalent, and therefore the transition B → D is replaced by the transition B → B (Figure 10). Similarly, the transition C → F in Figure 9 is replaced by the transition C → C in Figure 10. Such transitions between equivalent nodes hold from level 1 to level (n − 2). For the last cell of the CA (⟨254, 254, ..., 254⟩), some of the RMTs of B and C (e.g., RMT 1 of B and RMT 7 of C) are don't care RMTs. Therefore, level (n − 1) is shown separately (nodes G and K in Figure 10). The RTA of Figure 10 depicts that the n-cell uniform CA with rule 254 forms 2 single length cycle attractors (all 0s and all 1s). The theory reported in this subsection guides us to select the appropriate TACA rules required for the current design.

CA Rule
Property 2. In a 3-neighborhood null boundary CA, the n-cell uniform TACA should have either an all 0s or an all 1s attractor or both. The attractors (say, α_1 and α_2) differ in 2^0, 2^1, ..., or 2^{⌊log n⌋} consecutive terminal bits [21].
The above properties and theorems enable the identification of 15 rules as TACA rules (Table 4). Further, from Property 3 and Corollary 8, it can be concluded that there are only 5 rules, 218, 234, 248, 250, and 254, that have only the all 0s and all 1s single length cycle attractors. However, from the NSRT diagrams introduced in [21], it can be shown that only rule 254 forms a uniform TACA for all lengths. The other four rules form multilength cycle attractors along with the two (all 0s and all 1s) single length cycle attractors.
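As a quick sanity check of the rule finally adopted, the single length cycle attractors (fixed points) of the uniform null boundary rule-254 CA can be enumerated by brute force for small n (a sketch with our own helper names; we check rule 254 only, since it is the rule used in the design):

```python
from itertools import product

def ca_step(state, rules):
    """One step of a 2-state 3-neighborhood null boundary CA."""
    n = len(state)
    out = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        rmt = (left << 2) | (state[i] << 1) | right
        out.append((rules[i] >> rmt) & 1)
    return out

def fixed_points(rule, n):
    """All single length cycle attractors of the n-cell uniform CA."""
    return [s for s in product([0, 1], repeat=n)
            if ca_step(list(s), [rule] * n) == list(s)]
```

For every n checked, rule 254 yields exactly the two attractors all 0s and all 1s, matching the RTA argument above.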
[Figure 11: Hardware realization of CAVU_FULL-DIR — flip-flops (FF) holding the 2-bit sharing status Q_i0 Q_i1 of each cache line and the presence bit p_i of the (n + 1)-bit sharing vector feed the CA cells.]

From the state transition diagram of the uniform CA configured with rule 254 (Figure 8), it can be observed that the appearance of "1"(s) in the initial state acts as a switch from the 0-basin to the (2^n − 1)-basin (referred to as the 1-basin in the figure). The 4-cell CA ⟨254, 254, 254, 254⟩, when initialized with the all 0s seed, follows the 0-basin (attractor with LSB "0"); on the other hand, when initialized with a nonzero seed, the CA follows the 15-basin (1-basin), that is, the attractor with LSB 1. This behavior of rule 254 matches the design requirement for CAVU_FULL-DIR. The design is detailed in the following subsection.

The CAVU FULL-DIR .
Let the attractor of the CA formed for all the cases of column 3 ("NF" cases) of Table 2 be α_1 (all 0s), and for the other cases ("F" cases) let the attractor be α_2 (all 1s). The best selection of CA rules is, therefore, one that satisfies Cond 1: the attractor set A_1 = {α_1} belonging to "NF" and the set A_2 = {α_2} belonging to "F" are different.
For the current design, we consider a uniform CA with rule 254. In the "NF" case, the CA representing the cache system traces through the α_1-attractor basin (0-basin). If there is a fault in the cache coherence system, the CA traces through the α_2-basin (1-basin). Hence, an attractor LSB of "0" signifies a "nonfaulty" recording and "1" signifies a "faulty" recording. While representing the states "M," "S," and "I" of a cache line in the sharing status, we consider 11 for "M," 10 for "S," and 00 for "I". If a cache line for B is in the "M" or "S" state in C_i (the ith processor's cache), the ith bit of the sharing vector (p_i) corresponding to block B is 1. That is, in a nonfaulty case, the MSB of the sharing status at C_i and the ith bit of the sharing vector (p_i) are equal (either both 1 or both 0).

Theorem 10. The MSB of the sharing status (encoded as "11" for Modified, "10" for Shared, and "00" for Invalid) suffices for checking the compatibility of the sharing status with the sharing vector.
Proof. When the cache line states are represented in the sharing status, the states "M," "S," and "I" are encoded as 11, 10, and 00, respectively. Whenever a block (say, block B) is in the "M" state in one cache, the block's state in the other caches should ideally be "I" to maintain coherence. The cache holding the block in the "M" state becomes a dirty sharer. Whenever one or more caches hold the block in the "S" state on read(s), the corresponding cache(s) (processor(s)) become clean sharer(s). In both cases of clean and dirty sharing, the presence bit(s) in the sharing vector should be set to "1". This "1" in the presence bits conforms to the "1" in the MSB of the state code for "M" (11) or "S" (10). Hence, checking the compatibility of the sharing status and the sharing vector can be performed by checking the MSB of the sharing status against the presence bits.
To ensure correctness in the recording of the sharing status and the sharing vector on an update of data block B, the CAVU_FULL-DIR accepts the compatibility status, formed by XORing the MSB (Q_i0) of the 2-bit sharing status (Q_i0 Q_i1) of B at C_i and the ith presence bit (p_i) of the sharing vector for B, as the initial state of the ith cell of the CA selected for the test design. The CA is then run for t = (n − 1) time steps, and the state of its least significant cell (LSB of the attractor) indicates the presence of fault(s) either in the sharing status or in the sharing vector.
The scheme reported above is described for the MSI protocol. However, the scheme also applies to the MOSI/MESI/MOESI protocols. For example, if the states "M," "O," "E," "S," and "I" of MOESI are represented by 111, 101, 110, 100, and 000 (001, 010, and 011 being don't cares), respectively, then the same logic as applied in the case of MSI is effective for the verification of MOESI. The MSBs of the state codes for "M," "O," "E," and "S" are chosen as 1 (Theorem 10). Accordingly, the corresponding presence bits should be 1, and the compatibility of the two can be checked by applying the XOR logic as in MSI (Section 4). In general, for any state code of arbitrary length, if every state representing sharing (dirty or clean) is encoded with a "1" in the MSB, the proposed logic applies.
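The MOESI generalization can be illustrated in two lines (the 3-bit encoding follows the text; the helper name is ours):

```python
# 3-bit MOESI codes from the text; MSB = 1 exactly for the sharing states.
MOESI = {'M': 0b111, 'O': 0b101, 'E': 0b110, 'S': 0b100, 'I': 0b000}

def compatibility_bit(state_code, presence_bit):
    """CS_i = MSB of the state code XOR p_i; 0 means compatible."""
    return ((state_code >> 2) & 1) ^ presence_bit
```

The resulting bits form the compatibility status fed to the same rule-254 CA as in the MSI case.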
The design of CAVU_FULL-DIR reported in this section requires an n-cell CA for CMPs with n processor cores, and the CA needs to run for t (= depth of the CA) time steps to decide on any inconsistency in the recording of cache line states. The number of computation steps t = (n − 1), that is, the delay, can however be reduced by introducing segmentation of the CMPs' processor pool.

Delay Minimization
In segmentation, an n-core CMPs processor pool is considered as a collection of 2^k (k = 1, 2, ...) segments, each of m = n/2^k cores. At each transition from a current cache line state to the next state (Table 2) and the corresponding update of the presence bit in the sharing vector and the sharing status, the proposed verification unit forms 2^k m-cell CA. For example, consider the case k = 1 (Figure 12). The n-bit compatibility status of the n-core CMPs is partitioned into 2^k = 2^1 = 2 halves/segments, which are fed to two CA, CA_1 and CA_2, respectively. The compatibility status from C_1, C_2, ..., C_{n/2} is fed to CA_1 and that of C_{n/2+1}, ..., C_n to CA_2 (assuming n is a power of 2). CA_1 and CA_2 are then run in parallel for m − 1 = (n/2 − 1) time steps. The resulting attractors of CA_1 and CA_2 indicate an inconsistency (Figure 12), if one exists. That is, by sensing the LSBs of the two attractors (called check bits) of CA_1 and CA_2, the presence of a fault can be detected.
Segmentation effectively reduces the number of computation steps of the verification logic by a factor of 2^k (the number of segments). However, this is achieved at the cost of check bits: the number of check bits equals the number of segments (2^k).
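The segmented scheme can be sketched as follows (our own helper names; the segments are run sequentially here for clarity, whereas the hardware runs them in parallel):

```python
def ca_step(state, rules):
    """One step of a 2-state 3-neighborhood null boundary CA."""
    n = len(state)
    out = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        rmt = (left << 2) | (state[i] << 1) | right
        out.append((rules[i] >> rmt) & 1)
    return out

def cavu_segmented(cs, k):
    """Split the n-bit compatibility status into 2**k segments, run each
    m-cell rule-254 CA for m - 1 steps, and OR the 2**k check bits."""
    n = len(cs)
    m = n // 2 ** k                      # assumes n is a power of 2
    check_bits = []
    for s in range(2 ** k):
        seg = list(cs[s * m:(s + 1) * m])
        for _ in range(m - 1):
            seg = ca_step(seg, [254] * m)
        check_bits.append(seg[-1])       # LSB of this segment's attractor
    return 'F' if any(check_bits) else 'NF'
```

With n = 8 and k = 1, each half runs for only 3 steps instead of the 7 needed by the unsegmented 8-cell CA.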
The verification unit introduced in the earlier sections decides on the inconsistencies after each transaction. However, instead of verifying each individual transaction, sometimes we need to maintain a log book for a set of transactions. The target is to keep track of whether one or more transactions are faulty. The CA has the capability to store/memorize information [17], and this feature has been successfully exploited to synthesize a more efficient (high-speed) test design, reported in the next section.

High-Speed Verification
The proposed high-speed verification unit (CAVU_HIGH-SPEED) is capable of verifying a set of transactions. For a set of m transactions, if the jth transaction (j ≤ m) is faulty, the CAVU_HIGH-SPEED keeps a trace of this transaction till the completion of all m transactions. That is, for an instance of a faulty transaction, CAVU_HIGH-SPEED captures and memorizes it.
For each transaction in the n-processor core CMPs, with compatibility status CS, an n-cell CA is formed at the CAVU_HIGH-SPEED. CS_i (the ith bit of CS) is used to set the ith CA cell rule. If the CA is then run for a certain number (t) of steps, it settles to an attractor. During the execution of transactions, if one or more transactions are found to be faulty, the effect of the fault propagates to the least significant cell (LSB) of the CA (attractor). For an instance of faulty transaction(s), the CA settles to an attractor with LSB "1", and when all the transactions are nonfaulty, the CA settles to an attractor with LSB "0". The precise steps followed to realize the CAVU_HIGH-SPEED design are noted in the following algorithm. Unlike CAVU_FULL-DIR, the current design demands the synthesis of uniform as well as hybrid CA. That is, (i) the CA formed should be a single length cycle attractor CA; (ii) when all the transactions are nonfaulty, a uniform CA is formed with an attractor LSB of "0" and the CA traces through the 0-basin of the attractor; (iii) in the presence of one or more faulty transactions, a hybrid SACA (single length cycle single-attractor CA) having an attractor LSB of "1" should be formed (say, the all 1s attractor) and the CA traces through the 1-basin; (iv) the hybridization of the uniform CA converts it into an SACA so that the occurrence of a fault is translated as a one-way switch from the 0-basin to the 1-basin. This helps in preserving the trace of a faulty transaction, if any.
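The memorizing behavior can be modelled as follows. This is an illustrative sketch, not the synthesized hardware: cells whose compatibility bit is 1 in some transaction are permanently hybridized with rule 255, which drives the CA into the 1-basin and holds it there across later (even fault-free) transactions.

```python
def ca_step(state, rules):
    """One step of a 2-state 3-neighborhood null boundary CA."""
    n = len(state)
    out = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        rmt = (left << 2) | (state[i] << 1) | right
        out.append((rules[i] >> rmt) & 1)
    return out

def cavu_high_speed(transactions):
    """transactions: list of n-bit compatibility-status vectors."""
    n = len(transactions[0])
    rules = [254] * n
    state = [0] * n
    for cs in transactions:
        for i, bit in enumerate(cs):
            if bit:
                rules[i] = 255   # one-way hybridization memorizes the fault
        for _ in range(n):       # settle toward the attractor
            state = ca_step(state, rules)
    return 'F' if state[-1] else 'NF'
```

A single faulty transaction anywhere in the sequence leaves the CA in the all 1s attractor, so the final LSB reports "F" even after subsequent nonfaulty transactions.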
For example, the CA of Figure 8 can be chosen for the nonfaulty transactions. It then traces through the 0-basin (self-loop). In the presence of a faulty transaction, a hybrid CA of the type shown in Figure 13 is formed. The hybrid CA traces through the 1-basin, that is, the attractor with LSB "1".

Proof. Let us consider that the uniform TACA with rule R is hybridized with rule 255 at the (i + 1)th cell. Since the passive RMTs of rule 255 are 2, 3, 6, and 7, the passive RMTs for the ith cell can be RMTs 1 and 5 (for which the RMTs of the (i + 1)th cell are 2 and 3) and RMTs 3 and 7 (for which the RMTs of the (i + 1)th cell are 6 and 7). Since R is a TACA rule, RMT 5 of R cannot be passive (Property 4). Further, the RMTs on which the (i + 2)th cell (configured with R) can change its state are 4, 6, and 7 (RMT 5 is active). The 0-branch (created with passive RMT 0 of R) of the RTA is stopped due to hybridization, as rule 255 does not have RMT 0 as passive. Now, for the continuity of the nonzero branch, either of RMTs 1 and 3 and one of RMTs 4 and 6 must be passive. With these sequences of passive RMTs of rule R and rule 255, only one path from the root to the (i + 1)th node, as well as only one path from the (i + 1)th node to the leaf, is possible in the RTA of the CA; that is, the RTA has only one path from the root to a leaf node, which corresponds to the only single length cycle attractor. Further, it can be verified from the NSRT diagram [21] of the hybrid CA that the hybridization does not introduce any additional multilength cycle. Hence, the CA resulting from hybridization is an SACA.

Proof. A uniform TACA configured with rule 254 forms only two attractors, A1 = 00⋅⋅⋅0 and A2 = 11⋅⋅⋅1 (Theorem 9); that is, the RTA has only two branches leading to two leaf nodes (A1 and A2). The all-0s branch is followed on the self-replication of RMT 0. The passive RMTs of rule 254 are RMTs 0, 2, 3, 6, and 7 and those of rule 255 are 2, 3, 6, and 7. Now, if the uniform CA ⟨254, 254, . . ., 254⟩ is hybridized with rule 255 at any position, it blocks the "0" branch (the all-0s attractor), thus leaving only the all-1s attractor. Further, it can be verified from the NSRT diagram [21] of the hybrid CA that the hybridization does not introduce any additional multilength cycle. Hence, the resulting CA is an SACA.
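For small n, the SACA claim in the proofs above can be cross-checked by brute force: hybridize a uniform rule-254 CA with rule 255 at one cell and enumerate all 2^n configurations, confirming that every one of them converges to the single all-1s attractor. A sketch under the same assumed model (null-boundary elementary CA; names illustrative):

```python
from itertools import product

def ca_step(rules, state):
    """One synchronous update of a null-boundary elementary CA."""
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        rmt = (left << 2) | (state[i] << 1) | right  # RMT index 0..7
        nxt.append((rules[i] >> rmt) & 1)
    return nxt

def converges_to_all_ones(n, pos):
    """Hybridize <254,...,254> with rule 255 at cell `pos` and check that
    every initial configuration reaches the unique all-1s fixed point."""
    rules = [254] * n
    rules[pos] = 255
    for seed in product([0, 1], repeat=n):
        state = list(seed)
        for _ in range(2 * n):      # ample steps for the CA to settle
            state = ca_step(rules, state)
        if state != [1] * n:
            return False
    return True
```

The rule-255 cell becomes 1 after one step and rule 254 never erases a 1, so the 1s spread to both ends regardless of the seed, matching the single-attractor argument.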
For the current realization, we consider hybridization (with the hybrid rule 255) of a uniform CA formed with rule 254. Such hybridization allows merger of the attractor basins of the uniform CA (the 0-basin of the uniform CA is merged with the 1-basin of the hybrid CA). That is, for the "NF" case of Section 4, the system traces through the A1-basin (0-basin, Figure 8) of the uniform CA, and on a fault the hybrid CA is formed and the system traces through the merged A2-basin (15-basin, Figure 13) of the hybrid CA. Figure 14 describes the operation of the CAVU HIGH-SPEED. Let us consider a system of 4 caches and a set of 5 transactions on the caches, among which the 1st transaction results in a single fault and the 2nd transaction results in a double fault (indicated by the 1s in the compatibility status). The 0th transaction is nonfaulty (compatibility status 0000, as shown in Figure 14). So, a 4-cell uniform CA ⟨254, 254, 254, 254⟩ is formed. The CA is then run for 1 time step with the all-0s initial seed. It produces the next state 0000. Since transaction 1 is faulty, it results in a hybrid CA, and the system makes the one-way switch to the 1-basin.
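The multi-transaction operation of Figure 14 can likewise be sketched in software: carry the CA state across transactions (one step per transaction) so that a fault, once captured, is memorized, and settle to the attractor at the end. As before, this assumes a null-boundary elementary CA with compatibility bit 1 → rule 255 and 0 → rule 254, and iterates a few extra steps to the fixed point rather than reproducing the paper's exact step counts; the names and fault positions are illustrative.

```python
def ca_step(rules, state):
    """One synchronous update of a null-boundary elementary CA."""
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        rmt = (left << 2) | (state[i] << 1) | right  # RMT index 0..7
        nxt.append((rules[i] >> rmt) & 1)
    return nxt

def cavu_high_speed(transactions):
    """Each transaction is its compatibility status (one bit per cache).
    A single faulty transaction switches the CA into the 1-basin for good."""
    n = len(transactions[0])
    state = [0] * n                                # all-0s initial seed
    for cs in transactions:
        rules = [255 if bit else 254 for bit in cs]
        state = ca_step(rules, state)              # one step per transaction
    # settle to the attractor under the last transaction's rule vector
    rules = [255 if bit else 254 for bit in transactions[-1]]
    for _ in range(n):
        state = ca_step(rules, state)
    return state[-1] == 1                          # LSB 1 => a fault occurred

# The 4-cache, 5-transaction scenario of Figure 14 (fault positions assumed):
txns = [[0, 0, 0, 0],   # transaction 0: nonfaulty
        [0, 1, 0, 0],   # transaction 1: single fault
        [1, 0, 1, 0],   # transaction 2: double fault
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
```

Running cavu_high_speed(txns) reports a fault even though the last transactions are clean, since rule 254 never erases a 1: the switch to the 1-basin is one-way.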

Experimental Results
The performance of the proposed verification unit is evaluated in Multi2Sim [22]. A module realizing the verification unit is developed and augmented into Multi2Sim. The standard programs in the SPLASH-2 benchmark suite [23] are used as the workload. The L1 cache in each core is unified for instructions and data, and L2 is shared. The test environment and parameters described in Table 5 are considered for the experiments, and the benchmark programs are run with the input data sets listed in Table 6: (i) Operating system: Ubuntu 12.04 LTS (64-bit).
We evaluate the percentage of memory references (load/store) that result in a state change in the caches (Figure 15) to determine how frequently the verification unit needs to verify transactions. If the number of memory references resulting in state changes increases, the system becomes vulnerable to even a single fault. Figure 15 shows that this fraction of memory references increases with the number of cores, due to the increase in memory traffic resulting from coherence misses and other effects.
Some faults in the system do not lead to errors and hence remain hard to detect. Figure 16 depicts the percentage of faults turning into errors. For our evaluation, we have randomly injected faults into various parts of the CC at intervals of 1000 and 10,000 clock cycles. The error coverage (percentage of errors detected) for fault injection intervals of 10,000 cycles and 1000 cycles is reported in Figures 17 and 18, respectively.
These results show that the proposed CA based verification unit (CAVU FULL-DIR) ensures error coverage almost the same as that of the scheme reported in [2]. Table 7 compares the CA based schemes with and without segmentation (Section 7). Column 1 notes the number of processor cores. Columns 2 and 3 report the number of computation steps and the number of check bits required to decide on the incoherency without segmentation. The requirements of the segmentation based scheme are shown in columns 4-6.
In columns 5-10 of Table 8, we report the area overhead (gate counts, that is, FFs and 2-input NAND/XOR gates) of the CA based designs for CMPs with 16 to 256 processor cores (column 1). The area computation follows the units specified in mcnc.genlib [24]. The requirements for the design reported in [2] are given in columns 2-4 for comparison. Columns 5-8 show the overhead (gate count and area) of the verification logic for MSI (Section 4), without segmentation of the processor pool. The figures of columns 9 and 10 represent the area requirements for the MESI and MOESI protocols, respectively. Comparison of the figures noted in columns 4 and 8 reveals that the CA based verification logic achieves a considerable reduction in area.
The overhead of the CAVU HIGH-SPEED is shown in Table 9. The gate counts are provided in columns 2, 3, and 4. The area, computed as per [24], is reported in column 5. Column 6 reproduces the figures for CAVU FULL-DIR for comparison. It can be observed that the hardware overhead of CAVU HIGH-SPEED is the same as that of CAVU FULL-DIR. However, the CAVU HIGH-SPEED is better suited when a set of transactions, rather than each individual transaction, needs to be verified. For a system of n processor cores, the design of CAVU FULL-DIR requires (n − 1) computation steps to decide on a defect after each transaction whereas, for k transactions, the CAVU HIGH-SPEED ensures a correct decision on a fault in exactly (n + k − 2) computation steps of the CA hardware ((k − 1) steps for the first (k − 1) transactions and (n − 1) steps for the last transaction). The design thus achieves a speedup of k(n − 1)/(n + k − 2) over the design of CAVU FULL-DIR for a set of k transactions.
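The step-count comparison is plain arithmetic on the formulas above. The figures below use an illustrative choice of n and k (not results from the paper):

```python
def steps_full_dir(n, k):
    """CAVU FULL-DIR: k independent verifications, (n - 1) CA steps each."""
    return k * (n - 1)

def steps_high_speed(n, k):
    """CAVU HIGH-SPEED: (k - 1) single-step runs plus (n - 1) settling steps."""
    return (k - 1) + (n - 1)

n, k = 16, 100                          # e.g., 16 cores, 100 transactions
full = steps_full_dir(n, k)             # 100 * 15 = 1500 steps
fast = steps_high_speed(n, k)           # 99 + 15  = 114 steps
speedup = full / fast                   # roughly 13x for this choice of n, k
```

As k grows, the speedup k(n − 1)/(n + k − 2) approaches (n − 1), so the benefit is most pronounced for long transaction logs.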

Verification for Limited Directory
The design described in Sections 4 and 5 is tuned to a full-map directory, in which traditionally a sharing vector is maintained for a block B to indicate the cached copies of B in the system. This sharing vector is of length (n + 1) for a system of n processor cores. As a result, the directory storage overhead is quite unacceptable for CMPs with thousands of processor cores. The alternative scheme that considers a compact directory organization is the limited directory protocol [25, 26]. In such an organization, the (n + 1)-bit sharing vector of block B is replaced by a fixed number of pointers to the processors' caches that hold a copy of B. In this section, we address the design of a verification logic for a system with a limited directory, considering a non-broadcast-based solution to handle the case where the directory runs short of pointers [25].
In a system with a limited directory protocol, the pointers indicating processor ids/caches need to be decoded. The structure of the limited directory, shown in Figure 19, indicates that caches C2 and C3 (corresponding to processors P2 and P3) are currently sharing a block B and that at most four processors can share block B simultaneously (as 4 fields are reserved for the pointers). That is, the update of the sharing status and the pointers involves a small and fixed number of entries. Hence, the CA based verification unit for the limited directory (CAVU LIM-DIR) can be realized with an r-cell CA (r ≪ n), where r is the maximum number of sharers allowed for a block, thereby reducing the time to decide the coherence status. Figure 20 describes an architecture of CAVU LIM-DIR with four pointers. For each cache transaction, after the pointer update, the pointer is decoded to access the corresponding cache and the status of the block is read from the cache. Depending on the number of pointers r (in Figure 20, r = 4), an r-cell CA is formed. The MSB of the status of a block (B) in the cache, pointed to by the ith pointer, is used to set the ith CA cell rule. If the MSB of the status is "1", the corresponding CA cell rule is set to 254; otherwise, it is set to rule 255. This differs from the rule selection for CAVU FULL-DIR. Once the CA cell rules are set, the r-cell CA is run for (r − 1) time steps, and the LSB of the attractor ("1"/"0") indicates the presence/absence of a fault in the limited directory entry. The overhead of CAVU LIM-DIR, considering a four-pointer representation of the limited directory (as shown in Figure 19), is illustrated in Table 10. Column 1 represents the number of cores, and columns 2 and 3 note the gate counts (number of FFs and 2-input NANDs). Column 4 records the area overhead. The area overhead of the CAVU FULL-DIR is reproduced in column 5 for comparison.
The results show a considerable reduction in gate count and area overhead of CAVU LIM-DIR compared to those of CAVU FULL-DIR.
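The limited-directory check can be sketched the same way. Note that here, per the rule selection in the text, an MSB of 1 selects rule 254 and an MSB of 0 selects rule 255. The model below keeps the earlier assumptions (null-boundary elementary CA, rightmost cell as LSB, iteration to the fixed point rather than the exact step count; names illustrative):

```python
def ca_step(rules, state):
    """One synchronous update of a null-boundary elementary CA."""
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        rmt = (left << 2) | (state[i] << 1) | right  # RMT index 0..7
        nxt.append((rules[i] >> rmt) & 1)
    return nxt

def cavu_lim_dir(status_msbs):
    """status_msbs[i]: MSB of block B's state in the cache named by pointer i.
    MSB 1 -> rule 254, MSB 0 -> rule 255 (the rule selection given in the text).
    Returns True when the attractor's LSB is 1, i.e., a fault is present."""
    r = len(status_msbs)
    rules = [254 if msb else 255 for msb in status_msbs]
    state = [0] * r                     # all-0s seed
    for _ in range(2 * r):              # settle to the fixed-point attractor
        state = ca_step(rules, state)
    return state[-1] == 1
```

Because r is a small constant (four in Figure 19), the check settles in a handful of steps regardless of the number of cores, which is the source of the time and area savings over CAVU FULL-DIR.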

Related Work
The schemes ensuring coherency in CMPs with a large number of cores are reported in [2-5, 11, 27]. These deal with the interactions between the on-chip interconnection network and the cache coherence protocol. Liqun et al., in [3], propose a solution with an interconnection network composed of wires with varying latency, bandwidth, and energy characteristics, where coherence operations are intelligently mapped to the appropriate wires of the heterogeneous interconnect. Zhao et al. [28] propose an alternative novel L2 cache architecture, where each processor has a split private and shared L2 cache. When data is loaded, it is placed in the private L2 or shared L2 according to its state (exclusive or shared). This scheme efficiently utilizes the on-chip L2 capacity while ensuring low average access latency; it employs a snooping cache coherence protocol and verifies it with a formal verification method. A network caching architecture was proposed in [5] to address the issues of the on-chip memory cost of the directory and long L1 cache miss latencies; the directory information is stored in the network interface component, thus eliminating the directory structure from the L2 cache. A verification logic that can dynamically detect errors in the coherence controller (CC) has been proposed in [2]. Ros et al. have proposed the coherence protocol Dico-CMP for tiled CMP architectures [11]. Dico-CMP avoids indirection by providing a block from the owner node instead of the home node, thus reducing the network traffic compared to broadcast-based protocols. Further, a scalable organization of the directory based on duplicating tags has been proposed to ensure that the directory bank size is independent of the number of tiles. However, the verification of cache coherence, an important problem, has not been addressed properly, possibly due to the lack of efficient verification tools [29].
A verification logic that caters only to snoop-based systems is reported in [2].

Conclusion
A solution for quick determination of data inconsistencies in caches, as well as in the recording of the sharing vectors, in a system realizing a directory based cache coherence protocol is reported. It avoids rigorous computation and communication overhead, assuring a robust and scalable design, especially for a system with thousands of processor cores. The design is shown to be highly flexible, catering to different cache coherence protocols, for example, MSI/MESI/MOESI. The introduction of segmentation of the CMPs' processor pool ensures better efficiency in making decisions on the inconsistencies in maintaining cache coherence. Further, the design has been extended to cope with limited directory based protocols.