This work reports an effective cache system design for Chip Multiprocessors (CMPs). It introduces built-in logic for the verification of cache coherence in CMPs realizing a directory based protocol. The design is developed around the cellular automata (CA) machine, invented by John von Neumann in the 1950s. A special class of CA, referred to as single length cycle 2-attractor cellular automata (TACA), is employed to detect inconsistencies in the cache line states of the processors’ private caches. The TACA module captures the coherence status of the CMPs’ cache system and memorizes any inconsistent recording of cache line states during the processors’ references to a memory block. Theory has been developed to enable a TACA to analyse the cache state updates and then settle to an attractor state, indicating a quick decision on a faulty recording of cache line status. The introduction of segmentation of the CMPs’ processor pool further improves efficiency in determining inconsistencies by reducing the number of computation steps in the verification logic. The hardware requirement of the verification logic shows that the overhead of the proposed coherence verification module is much lower than that of conventional verification units and is insignificant with respect to the cost of the CMPs’ cache system.
1. Introduction
The continual search for performance enhancement in computation has resulted in a variety of modifications to processor design. This ultimately led to the inevitable transition toward multicore architecture, the Chip Multiprocessors (CMPs), with thousands of processor cores on a chip. The increasing number of cores in CMPs, however, threatens the reliability and dependability of a design [1]. A number of works [2–5] have addressed these issues from different perspectives. Further, the low supply voltage in today’s semiconductor technology narrows the noise margin and increases susceptibility to the various factors causing transient faults [6] in CMPs.
Most of the fault tolerant schemes reported so far for CMPs are based on spatial redundancy techniques, which may not be effective for faults in on-chip hardware components [2]. Although the cache and memory components are protected by Error Correcting Codes and other techniques, the logic circuits commonly serving the multiple cores remain error prone [2].
A CMP’s memory subsystem is made up of multilevel caches, including a private cache for each processor core. It demands a very efficient realization of cache coherence protocols. The cache coherence controller (CC) is dedicated to ensuring the coherency of shared data in the CMPs’ cache system. Such a prime hardware component can also be subject to faults as well as design defects. A fault in the CC seriously affects the correctness of computation as well as the power efficiency of a system. The schemes proposed in the literature [2, 4, 7, 8] for ensuring coherency in CMPs with thousands of cores incur huge communication overhead along the global wires. In [2], a verification logic has been proposed to detect errors in the coherence controller; it targets a system realizing a snoopy protocol.
The snoopy protocols are easy to implement but do not scale well [9]. For large scale CMPs, updating and invalidating caches following snoop based protocols becomes impractical [10]. Several variants of directory based coherence protocols have been proposed in [5, 10, 11]. However, the verification of cache data inconsistencies resulting from defects in such systems is yet to be addressed.
The above concerns motivate us to design an effective cache system for CMPs by developing a scheme that verifies the accuracy of data consistency maintenance in a CMPs’ cache system realizing a directory based protocol. It targets the design of built-in cache coherence verification logic that can function at speed and is cost effective. To explore such a design, we consider the cellular automata (CA) tool [12] invented by John von Neumann in the 1950s. As CA can handle large volumes of data and can efficiently be employed to make a decision, a CA based built-in verification logic is proposed to ensure the accuracy of the directory based cache system in tiled CMPs [11]. A special class of CA structure, referred to as the single length cycle 2-attractor cellular automata (TACA), is introduced for the design. The TACA analyses the status of the CMPs’ cache updates and settles to an attractor state (point state) indicating any faulty recording of the cache line status and the sharing vector [9] stored in the directory of the cache system. The hardware realization of the CA based design enables a quick decision on cache coherency. Further, the introduction of segmentation of the CMPs’ processor pool improves the efficiency of the design by reducing the computation steps while deciding on inconsistency, if any, in each cache update. The basic concept of the CA based solution is reported in [13, 14]. The precise contributions of this paper can be summarized as follows:
A built-in verification logic for a CMPs cache system realizing a directory based protocol is proposed. For ease of understanding, the design is detailed with the basic 3-state MSI protocol. However, the proposed methodology is applicable to MESI, MOSI, MOESI, and others.
The verification logic is developed around an unconventional tool, called cellular automata (CA). The modular structure of the CA is exploited to enable scalable design.
Design of a high-speed verification unit for cache, harnessing the feature of CA to memorize information, is reported.
The verification unit is realized for full map as well as for the limited directory based cache systems.
Relevant CA theory has been developed to provide the theoretical basis of the cache system design.
Experimental results establishing the claim are reported.
The following section (Section 2) highlights the coherence issues in CMPs’ cache system. Section 3 narrates CA theory relevant for the current design. The CA based verification unit is introduced in Section 4, and Section 5 describes the design in detail. The hardware realization and delay overhead reduction through introduction of segmentation are reported in Sections 6 and 7, respectively. A test structure that memorizes the inconsistent recording of cache line states during the processors’ references to a memory block is reported in Section 8. The simulation results establishing the effectiveness of the CA based design and its hardware requirement are reported in Section 9. A sketch of the verification unit for limited directory based system is shown in Section 10. In Section 11, we provide a brief on the related works. Section 12 concludes the paper.
2. Cache Coherence in CMPs
In Chip Multiprocessors (CMPs) with a large number of on-chip cores, a shared bus is not a good choice due to area overhead and bus contention [11]. The alternative to the shared bus is a tiled CMPs architecture. A tile is composed of an array of identical building blocks connecting the cores with a point-to-point unordered network [11, 15]. In this work, we consider the ATAC processor architecture [15], which uses a tiled multicore architecture (Figure 1) with a low-latency, energy-efficient network for global, long distance communication. Each core in ATAC has a private L1 cache and a slice of the shared L2 cache. The L2 typically follows Nonuniform Cache Access (NUCA). An L1 cache miss in ATAC generates coherence messages, and the other L1 cache line states of the system are updated in accordance with these messages.
The tiled CMPs organization.
In general, for large scale CMPs, a directory based cache coherence system is desirable [11]. A directory, in a directory based system, is a collection of sharing vectors [9]. Each vector corresponds to a data block B and maintains pointers to the processors that have cached copies of B. The directory also stores the states of the cached copies in the L1 caches. The organization of a sharing vector is shown in Figure 2. The “d” specifies the dirty bit; d=1 (true) implies that some cache holds the latest copy of B and the main memory copy is invalid. The pi’s are the presence bits; pi=1 implies that processor Pi has a cached copy of B. In the current design of the coherence verification logic/unit, we consider a distributed directory based system.
Sharing vector for full-map directory.
Figure 3 describes the logical steps followed to maintain cache coherence on a read miss in CMPs realizing the distributed directory based protocol. The system consists of processors Pi, Pj, and Pk with local memory modules L2i, L2j, and L2k, respectively. The distributed directories Di, Dj, and Dk are also local to the corresponding processors Pi, Pj, and Pk. In such a system, if a processor (Pi) requests block B, which is not present in Pi’s L1 (Ci), Pi encounters a read miss and consults its communication assistance unit to find the home (say processor Pj, i.e., L2j) of block B. The request then goes to Pj’s site and the directory Dj is consulted. If the sharing vector of B, stored in Dj, shows that d=0, that is, L2j has the valid copy of block B, then Pj sends block B to Pi and updates the sharing vector corresponding to B at Dj by setting pi=1. On the other hand, if d=1, that is, the copy of block B at L2j is not valid and Pk holds the dirty copy (as shown in Figure 3), then Pk sends block B to Pi and also writes it back to L2j (the home site for block B). Pj then updates the sharing vector of block B at Dj. Here we have assumed a 4-hop communication (request to directory → reply with owner → request to owner → reply to requester) to resolve the read miss. However, the proposed verification logic is also compatible with a system realizing 3-hop communication (request to directory → forward to owner → reply to requester), as in [16].
Directory based protocol: read miss to a block.
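The read-miss handling described above can be sketched in software. The following is a minimal behavioral model under the full-map MSI directory of Figure 3; the class and member names (Directory, dirty, presence, owner) are our own illustration, not the paper’s hardware.

```python
# Behavioral sketch of full-map directory read-miss handling (MSI).
# Hypothetical names: Directory, dirty, presence, owner.

class Directory:
    def __init__(self, n):
        self.dirty = False          # the 'd' bit of the sharing vector
        self.presence = [0] * n     # presence bits p1..pn
        self.owner = None           # index of the dirty sharer, if any

    def read_miss(self, i):
        """Processor Pi misses on block B whose home directory is self."""
        if self.dirty:
            # The owner holds the only valid copy: it supplies B to Pi
            # and writes it back to the home L2; B becomes clean-shared.
            supplier = self.owner
            self.dirty = False
            self.owner = None
        else:
            supplier = 'home L2'    # the home memory copy is valid
        self.presence[i] = 1        # record Pi as a sharer
        return supplier

d = Directory(4)
print(d.read_miss(0))      # home L2 supplies B; p1 is set
d.dirty, d.owner = True, 2  # suppose P3 later dirtied B
print(d.read_miss(1))      # the dirty owner (index 2) supplies B, writes back
print(d.presence)
```

The sketch follows the 4-hop flow assumed in the text; a 3-hop variant would only change who forwards the block, not the directory bookkeeping.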
3. CA Preliminaries
A cellular automaton (CA) consists of a number of cells organized in the form of a lattice. It can be viewed as an autonomous finite state machine (FSM). In a two-state 3-neighborhood CA, each cell stores either 0 or 1, referred to as the present state (PS) S_i^t of cell i at time t, and the next state (NS) of cell i at time (t+1) is

S_i^(t+1) = f_i(S_(i-1)^t, S_i^t, S_(i+1)^t),    (1)

where S_(i-1)^t and S_(i+1)^t are the present states of the left and right neighbors of the ith cell at time t and f_i is the next state function (Figure 4). The state of all the cells S^t = (S_1^t, S_2^t, …, S_n^t) at t is the present state of the CA. Therefore, the next state of an n-cell CA is S^(t+1) = (f_1(S_0^t, S_1^t, S_2^t), f_2(S_1^t, S_2^t, S_3^t), …, f_n(S_(n-1)^t, S_n^t, S_(n+1)^t)).
An n-cell null boundary CA.
The next state function f_i of the ith CA cell can be expressed in the form of a truth table (Table 1). The decimal equivalent of its 8 outputs (NS) is called the “rule” R_i of the cell [12]. There are 256 rules in a 2-state 3-neighborhood CA. Two such rules, 254 and 255, are illustrated in Table 1. The first row lists the 2^3 (8) possible combinations of the present states S_(i-1)^t, S_i^t, and S_(i+1)^t. The last two rows indicate the next states of the ith cell at time (t+1), defining the rules 254 (f_i = S_(i-1)^t + S_i^t + S_(i+1)^t, the logical OR) and 255 (f_i = 1), respectively.
Next state functions.
PS:    111  110  101  100  011  010  001  000 | Rule
RMT:   (7)  (6)  (5)  (4)  (3)  (2)  (1)  (0) |
NS:     1    1    1    1    1    1    1    0  | 254
NS:     1    1    1    1    1    1    1    1  | 255
The rule vector R = ⟨R1, R2, …, Ri, …, Rn⟩ configures the cells of a CA. If all the Ri’s are the same, that is, R1 = R2 = ⋯ = Rn, the CA is a uniform CA; otherwise, it is a nonuniform/hybrid CA [17]. In Figure 4, the left (right) neighbor of the leftmost (rightmost) terminal cell is permanently fixed at the 0-state; such a CA is a null boundary CA.
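One synchronous update step of such a null boundary CA can be sketched as follows. The helper names (ca_step, rule_bit) are our own; the rule numbering and RMT indexing follow Table 1.

```python
# Minimal sketch of one update step of an n-cell, 2-state, 3-neighborhood
# null boundary CA. rules[i] is the 8-bit rule of cell i (uniform CA: all equal).

def rule_bit(rule, rmt):
    """Next state of a cell for a given RMT under an 8-bit CA rule."""
    return (rule >> rmt) & 1

def ca_step(state, rules):
    """One synchronous step; state is a list of 0/1 cell values."""
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0       # null boundary
        right = state[i + 1] if i < n - 1 else 0  # null boundary
        rmt = 4 * left + 2 * state[i] + right     # RMT number 0..7
        nxt.append(rule_bit(rules[i], rmt))
    return nxt

# Uniform CA <254,254,254,254>: rule 254 is the OR of the neighborhood,
# so a lone 1 spreads to both neighbors in one step.
print(ca_step([0, 0, 1, 0], [254] * 4))
```

The same function handles hybrid CA by passing a nonuniform rule vector, e.g. `[1, 236, 165, 69]` for the CA of Figure 5.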
Definition 1 (RMT).
A combination of present states shown in the 1st row of Table 1 is a Min Term of the 3-variable switching function f_i(S_(i-1)^t, S_i^t, S_(i+1)^t) and is referred to as a Rule Min Term (RMT).
Column 011 of Table 1 is the 3rd RMT. The next states corresponding to this RMT are 1 for both rules 254 and 255.
Definition 2 (reversible and irreversible CA).
A CA is reversible if its states form only cycles in the state transition diagram (all states are reachable); otherwise, it is irreversible (Figure 5).
A 4-cell irreversible CA 1,236,165,69.
Definition 3 (attractor and attractor basin).
A set of states of a CA that forms a loop (cycle) is called an attractor. An attractor (α) forms an α-basin with the states that lead to it.
The cycles (7→7 and 9→1→9) of Figure 5 are the two attractors of the CA 1,236,165,69. The 7-basin of the CA contains 12 states including the attractor state 7.
Definition 4 (depth).
The depth of a CA is defined as the length of the longest path from a state to an attractor in the state transition diagram.
The depth of the CA shown in Figure 5 is 5 (2→10→6→4→5→7).
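The attractors and depth of Figure 5 can be recovered by exhaustive simulation. The sketch below models the 4-cell hybrid CA ⟨1,236,165,69⟩ with states as 4-bit integers; the function names (ca_next, dist_to_cycle) are illustrative, not from the paper.

```python
# Recover attractors and depth of the 4-cell null boundary CA <1,236,165,69>
# (Figure 5) by simulating every one of its 16 states.

def ca_next(s, rules, n=4):
    """Next state of an n-cell null boundary CA; states are n-bit integers."""
    bits = [(s >> (n - 1 - i)) & 1 for i in range(n)]
    nxt = 0
    for i in range(n):
        left = bits[i - 1] if i > 0 else 0
        right = bits[i + 1] if i < n - 1 else 0
        rmt = 4 * left + 2 * bits[i] + right
        nxt = (nxt << 1) | ((rules[i] >> rmt) & 1)
    return nxt

rules = [1, 236, 165, 69]

def dist_to_cycle(s):
    """Number of steps from state s to its attractor cycle (0 if s is on it)."""
    seen = {}
    t = 0
    while s not in seen:
        seen[s] = t
        s = ca_next(s, rules)
        t += 1
    return seen[s]   # time at which the cycle was first entered

print(ca_next(7, rules), ca_next(9, rules), ca_next(1, rules))  # 7 7->7, 9->1->9
print(max(dist_to_cycle(s) for s in range(16)))                 # depth of the CA
```

Running this confirms the attractors 7→7 and 9→1→9 and the depth 5 (the path 2→10→6→4→5→7) quoted in the text.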
Definition 5 (active and passive RMT).
An RMT x0y (x1y) of a CA rule is called passive (self-replicating) if the next state for that RMT is 0 (1); that is, the cell retains its state. Otherwise, if the next state for an RMT x0y (x1y) is 1 (0), the RMT is active (non-self-replicating).
For example, in rule 254 the next state for RMT 0 (000) is 0 and for RMT 2 (010) is 1 (Table 1); that is, these two RMTs are passive. However, RMT 0 in rule 255 is active, as its next state is 1 (Table 1).
4. Overview of Verification Logic
A defect in the computing logic of the cache coherence controller (CC) can lead to faulty next state computation in CMPs. Even if the computing logic operates correctly, faulty recording of state(s) may result from fault(s) in the communication network of the CMPs cache system. These faulty recordings introduce inconsistencies in the cache data states. The identification of such inconsistencies, while maintaining cache coherence, is addressed in [2]. The solution reported there detects errors in a CC; it targets a snoopy protocol based cache system and involves complex data structures as well as computation intensive steps. In [18], we also report the design of a verification unit for a CC working in a snoopy protocol based system.
The cache coherence protocol in the ATAC processor architecture, ACKwise [15], couples directory and snoopy protocols. In addition to probable defects in the CC and faults in the communication network, the faulty update of the sharing vector is a major concern in ATAC. The sharing vector may be subject to faults even if the sharing status of a cache block is recorded correctly, making the coherence verification process hard to realize. Incorporating separate verification units for the sharing status and the sharing vector would be extremely costly. Therefore, we formulate the problem of coherence verification in ATAC-like architectures as the verification of the compatibility of the sharing status and the sharing vector on each cache state update. To establish the effectiveness of the proposed cellular automata (CA) based verification logic, we consider the ATAC (tiled CMPs) architecture realizing the directory based cache coherence system with the MSI protocol. However, the scheme is also applicable to MESI/MOSI/MOESI protocol based designs.
Figure 6 describes the basic 3-state MSI protocol. Table 2 displays the effect of faulty recording in the sharing vector, resulting in a faulty (“F”) state (column 5) with a full-map directory. The entry “All p’s are 0s” in column 1 of the first row represents that none of the processors’/tiles’ L1 caches has a copy of block B. On the other hand, the entry “All p’s are 1s” in column 1 of the second row represents that all the processors’ caches have a copy of B. Similarly, the entry “pi is 1 and all others are 0s” indicates that only processor Pi has a copy of block B and there is no other cached copy of B.
Cache sharing vector update.
Current sharing vector (1)    | Event (2) | Desired sharing vector (3)         | Faulty sharing vector (4)        | Fault effect (5) | Case (6)
All p's are 0s                | Pi reads  | pi is 1 and all others are 0s      | All p's are 0s                   | Faulty (F)       | Case 1
                              |           |                                    | pj is 1 and all others are 0s    | Faulty (F)       | Case 2
                              | Pi writes | pi is 1 and all others are 0s      | All p's are 0s                   | Faulty (F)       | Case 1
                              |           |                                    | pj is 1 and all others are 0s    | Faulty (F)       | Case 2
All p's are 1s                | Pi writes | pi is 1 and all others are 0s      | pj is 1 and all others are 0s    | Faulty (F)       | Case 2
                              |           |                                    | pi & pj are 1 and others are 0s  | Faulty (F)       | Case 3
                              |           |                                    | pi is 1 and others are 1s & 0s   | Faulty (F)       | Case 3
p's are 1s & 0s               | Pi reads  | pi is 1 and all others are 1s & 0s | pi is 0 and others are 1s & 0s   | Faulty (F)       | Case 1
                              | Pi writes | pi is 1 and all others are 0s      | pi is 1 and others are 1s & 0s   | Faulty (F)       | Case 3
pj is 1 and all others are 0s | Pi writes | pi is 1 and all others are 0s      | pi & pj are 1 and others are 0s  | Faulty (F)       | Case 3
State transition diagram of MSI protocol.
On a read/write operation (event) of a processor Pi (Figure 3), the corresponding sharing vector for block B is noted in column 3 of Table 2. The contents of column 4 represent the possible faulty recordings of the sharing vector. For example, row 1 of column 1 notes the sharing vector prior to “Pi reads”. Initially, B does not have a cached copy (represented by “All p’s are 0s”). On a read by Pi (noted in column 2 of row 1), the desired sharing vector is “pi is 1 and all others are 0s” (column 3 of row 1). All the possible faulty recordings of the sharing vector are shown in column 4. Column 5 notes the effect of the fault (“Faulty (F)”). The classification of faulty cases is noted in column 6. Columns 1, 4, and 5 of Table 2 indicate that the proposed fault detection unit should respond with “F” for the states of cached copies of B (cache lines for B) when the following cases occur:
Case 1.
Processor Pi reads/writes block B, but Pi’s presence bit (pi) in sharing vector for B is not updated to “1”; that is, it is still “0”.
Case 2.
Pi reads/writes and some other processors’ (Pj’s) presence bit(s) (pj’s) is (are) set to “1,” instead of Pi’s presence bit (pi).
Case 3.
Pi writes and some other processors’ (Pj’s) presence bits are not updated; that is, pj’s remain “1”.
It is to be noted that Cases 2 and 3 cover all possible faults that affect more than one bit. The proposed CA based logic (shown in Figure 7) for realizing the verification in the cache coherence system is therefore designed so that it correctly responds with either “NF” (nonfaulty) or “F” (faulty) for the above three cases. It employs an n-cell CA for CMPs with n private (L1) caches (C1, C2, …, Cn, where Ci is the cache attached to processor Pi and the ith CA cell corresponds to cache Ci). With each read/write operation, the cache line state (sharing status Qi0Qi1, assuming MSI protocol) as well as the presence bits (p’s) in the sharing vector are updated. In a fault-free system, the updates of the sharing status and the sharing vector should be compatible. Therefore, the sharing status (more specifically, the MSB of the sharing status in the current design) and the presence bits of the sharing vector are fed as input to the verification unit to form the n-bit compatibility status (CS). The n-cell CA of the verification unit is then run for a certain number (t) of time steps with CS as the seed. The CA settles either in an attractor designated as X1, corresponding to “NF” for a nonfaulty recording, or in an attractor X2, corresponding to “F” (faulty) for the instance of a fault. Therefore, by observing the attractor (“NF” or “F”), a fault in the coherence controller logic can be detected. Observing the attractor, moreover, reduces to sensing the LSB of the attractor (the least significant cell of the CA). The steps in realizing the verification unit for a full-map directory are summarized in the following algorithm.
CA based verification unit for full-map directory.
Algorithm 1 (FUNCTION-CAVUFULL-DIR).
Input. Presence bits and MSBs of sharing status of cache lines.
Output. Decision on presence of fault.
Form the n-bit compatibility status (CS), where CSi = Qi0 ⊕ pi, ∀i = 1 to n.
Run the CA selected for verification logic for t time steps (t=(n-1) for the current design) with CS as the initial state (seed). It reaches the attractor state.
Check LS cell of the CA (LSB of attractor). If it is “0”, then record of sharing status and sharing vector is fault-free; otherwise, it is faulty.
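Algorithm 1 can be sketched in software as follows. This is a minimal model, not the hardware unit: the function name cavu_full_dir is ours, and the rule-254 CA step is written directly as the OR of the neighborhood.

```python
# Software sketch of Algorithm 1 (FUNCTION-CAVUFULL-DIR), assuming the MSI
# sharing status encoding 11 (M), 10 (S), 00 (I) used in the text.

def cavu_full_dir(status_msb, presence):
    """status_msb[i]: MSB Qi0 of the sharing status at cache Ci;
    presence[i]: presence bit pi. Returns 0 ('NF') or 1 ('F')."""
    n = len(presence)
    # Step 1: n-bit compatibility status, CS_i = Qi0 XOR pi
    cs = [q ^ p for q, p in zip(status_msb, presence)]
    # Step 2: run the uniform rule-254 CA (neighborhood OR, null boundary)
    # for t = n-1 steps; any 1 in the seed floods the whole CA
    for _ in range(n - 1):
        cs = [(cs[i - 1] if i else 0) | cs[i] | (cs[i + 1] if i < n - 1 else 0)
              for i in range(n)]
    # Step 3: the least significant cell of the attractor is the verdict
    return cs[-1]

# Nonfaulty: P1 holds B in 'S' (MSB 1) and p1 = 1, all others invalid
print(cavu_full_dir([1, 0, 0, 0], [1, 0, 0, 0]))  # 0 -> 'NF'
# Faulty (Case 2): P1 read B but p2 was set instead of p1
print(cavu_full_dir([1, 0, 0, 0], [0, 1, 0, 0]))  # 1 -> 'F'
```

Since a 1 propagates one cell per step under rule 254, t = n−1 steps suffice for a fault at any position to reach the least significant cell.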
5. The Design of CAVUFULL-DIR
The cellular automata (CA) based solution for the verification of cache inconsistencies, introduced in the earlier section, demands the following:
The CA constructed for the verification unit CAVUFULL-DIR should form single length cycle attractors.
The number of attractors should be minimum, preferably two attractors: one (X1) corresponds to “NF” (nonfaulty) and the other (X2) corresponds to “F” (faulty) and the CA should correctly report all the faulty cases of Table 2.
The attractors X1 and X2 should differ at least at one position (say, LSB) so that the decision on “NF” or “F” (sensing a single bit of the attractor, that is, least significant cell of the CA) can be taken at speed.
The effect of a fault propagates through the CA as it strides over time and influences the LSB of the attractor. Therefore, the incidence of a fault is translated as a switch from the X1 attractor basin to the X2-basin (say, from the 0-basin of Figure 8 to its 1-basin). The following subsections characterize the single length cycle attractor CA that can be employed for the current design.
State transition diagram of 254,254,254,254.
5.1. Single Length Cycle Attractor
The next state of a single length cycle attractor is the attractor itself. In a single length cycle attractor CA, for at least one RMT (Section 3) of each cell rule Ri of R, cell i is passive (Definition 5); that is, the state change of cell i is d→d (d ∈ {0,1}). This is summarized in the following property.
Property 1.
A rule Ri can lead to the formation of single length cycle attractor(s) if at least one of its RMTs is passive; that is, at least one of the RMTs 0(000), 1(001), 4(100), or 5(101) is 0 and/or at least one of the RMTs 2(010), 3(011), 6(110), or 7(111) is 1 [19].
In [19], based on Property 1, the 256 CA rules are classified in 9 groups (groups 0–8). Rule 254 (11111110) is in group 5 as it follows Property 1 for 5 RMTs (RMTs 0, 2, 3, 6, and 7 are passive (Table 1)). A CA configured with the rules that maintain Property 1 for its RMTs is a probable CA with the single length cycle attractors [19]. The construction of single length cycle 2-attractor CA (Figure 8) can be followed from the theory of Reachability Tree for attractors introduced in [20].
5.2. Reachability Tree for Attractors
Reachability Tree for attractors (RTA) is a binary tree representing single length cycle attractors of a CA [20]. Each node is constructed with RMT(s) of a rule that follows Property 1. The left edge is the 0-edge and the right edge is 1-edge (Figure 9). For an n-cell CA, the number of levels in the tree is (n+1). Root node is at level 0 and the leaf/terminal nodes are at level n. The nodes at level i are constructed from the RMTs of (i+1)th CA cell rule Ri+1. The decimal numbers within a node at level i represent the RMTs of the CA cell rule Ri+1 based on which the cell (i+1) can change its state. The RMTs of a rule for which we follow 0-edge or 1-edge are noted in the bracket. The number of leaf nodes in an RTA denotes the number of single length cycle attractors of the CA and a sequence of edges from the root to a leaf node, representing an n-bit binary string, is the attractor state. The 0-edge and 1-edge represent 0 and 1, respectively. For example, the number of single length cycle attractor states in the CA 254,254,254,254 of Figure 9 is 2 (X1 and X2). The root node (level 0) of the RTA is constructed from passive RMTs 0, 2, and 3 as cell 1 (rule 11111110) can change its state following any one of the passive RMTs (null boundary). As the state of left neighbor of cell 1 is always 0, the passive RMTs 6 and 7 of rule 254 are the do not care RMTs for cell 1. Similarly, as the right neighbor of cell 4 is always 0, the passive RMTs 3, 5, and 7 are do not care for cell 4.
RT for attractors of 254,254,254,254.
The RMTs of two consecutive CA cell rules Ri and Ri+1 are related while the CA changes its state [20]. This relationship between the RMTs of Ri and Ri+1 is shown in Table 3. It implies that if the ith CA cell changes its state following RMT Ki, then (i+1)th cell changes its state following RMT Ki+1. For example, if the 1st cell in Figure 9 changes its state following RMT 0, then the 2nd cell changes its state following RMTs 0 and 1 (from Table 3).
Relationship between RMTs of cell i and cell (i+1) for next state computation.
RMT Ki of ith rule | RMTs Ki+1 of (i+1)th rule
0/4                | 0, 1
1/5                | 2, 3
2/6                | 4, 5
3/7                | 6, 7
The RTA of Figure 9 can be generalized to depict the attractors for an n-cell CA. The generalized RTA of the CA 254,254,…,254 is shown in Figure 10. The 0-edge at B′ of Figure 9 evolves to node D′ with the same set of RMTs ({0}); that is, nodes B′ and D′ are equivalent and, therefore, transition B′ to D′ is replaced by the transition B′ to B′ (Figure 10). Similarly, the transition C′→ F′ in Figure 9 is replaced by the transition C′→ C′ in Figure 10. Such transitions between equivalent states are true for level 1 to level (n-2). For the last cell of a CA (〈254,254,…,254〉), some of the RMTs of B′ and C′ (e.g., RMT 1 of B′ and RMT 7 of C′) are do not care RMTs. Therefore, level (n-1) is shown separately (G′ and K′ in Figure 10). RTA of Figure 10 depicts that the n-cell uniform CA with rule 254 forms 2-single length cycle attractors (all 0s and all 1s).
RT for attractors of 254,254,…,254.
5.3. CA Rule Selection for the CAVUFULL-DIR
As described in the earlier section, a CA with only two single length cycle attractors X1 and X2 (TACA) is the best choice for the current design. To reduce the search space, we identify the CA rules that form only single length cycle 2-attractors (TACA) for all lengths (n). The following properties and theorems guide the selection of the appropriate TACA rules for the CAVUFULL-DIR.
Property 2.
In a 3-neighborhood null boundary CA, an n-cell uniform TACA should have either an all 0s or an all 1s attractor or both. The attractors (say, A1 and A2) differ in consecutive 2^0 or 2^1 or … or 2^(log n) terminal bits [21].
Property 3 (see [21]).
The rules of groups 3, 4, and 5 can only form single length cycle 2-attractor CA (TACA).
Property 4 (see [21]).
For all the TACA rules, RMT 5 is an active RMT.
Property 5 (see [21]).
For any TACA rule, at least one of RMTs 2 and 3 as well as at least one of RMTs 0 and 7 must be passive.
Theorem 6.
In a single length cycle uniform CA with all 0s attractor, the RMT 0 of the rule selected for CA must be passive.
Proof.
The root node of an RTA can only be formed with one or more RMTs of the RMT set {0,1,2,3}. Among these, RMTs 0 and 1 can be 0 (passive) and RMTs 2 and 3 can be 1 (passive). For all 0s attractor, either of RMTs 0 and 1 should be passive. Now, the next cell RMT for RMT 0 is {0,1} and it is {2,3} for RMT 1 (Table 3). If only RMT 1 is passive (RMT 0 is active), its next cell RMTs (i.e., {2,3}) having value “1” block the 0-edge (edge labelled with RMT value 0) of the RTA. Hence, RMT 0 must be passive for the continuity of 0-edge which corresponds to the all 0s attractor.
Theorem 7.
In a single length cycle uniform CA with all 1s attractor, RMTs 3, 6, and 7 must be passive.
Proof.
In RTA, among the RMTs of root node, if RMT 0 and/or 1 is passive, the RTA follows a 0-edge and if 2 and/or 3 is passive, it follows 1-edge. So, for an all 1s attractor, the root node RMTs should be 2 and/or 3. For RMT 2 as passive, the next cell RMTs are {4,5} (Table 3). RMTs 4 and 5, having value 0, cannot contribute to all 1s attractor. Now, for RMT 3 as passive RMT, the next cell RMTs are {6,7}. So, for continuity of the 1-edge (edge labelled with RMT value 1), the RMT 7 must be passive. However, since for the last cell RMT 7 is do not care, RMT 6 must also be a passive RMT, hence the proof.
Corollary 8.
In 3-neighborhood null boundary, the uniform CA constructed with only 16 rules (out of 256) can generate all 0s and all 1s single length cycle attractors.
Proof.
The proof is obvious as for all 0s attractor RMT 0 must be passive and for all 1s attractor RMTs 3, 6, and 7 are to be passive. Therefore, such CA rules vary in RMTs 1, 2, 4, and 5, leading to 16 possible combinations, that is, 16 possible CA rules.
Theorem 9.
The uniform CA with rule 254 has only two single length cycle attractors (all 0s and all 1s states).
Proof.
The self-replicating RMTs of rule 254 (11111110) are RMTs 0, 2, 3, 6, and 7. That is, root node of the RTA of the uniform CA configured with rule 254 (Figure 10) contains the self-replicating RMTs 0, 2, and 3. Now, the RMTs for the next cell rule (following Table 3) are (0, 1) for RMT 0, (4,5) for RMT 2, and (6,7) for RMT 3. Since RMTs 1, 4, and 5 are non-self-replicating, they will not appear in the RTA. Only RMTs 0, 6, and 7 appear. So, there can be two sets of self-replicating RMTs, set 0={0} and set 1={3,6,7}, that create two distinct paths from the root to two leaf nodes (X1 and X2 in Figure 10), hence producing two attractors: all 0s (X1=000⋯0) and all 1s (X2=111⋯1).
The above properties and theorems enable identification of 15 rules as the TACA rules (Table 4). Further, from Property 3 and Corollary 8, it can be concluded that there are only 5 rules 218, 234, 248, 250, and 254 that have only all 0s and all 1s single length cycle attractors. However, from the NSRT diagrams introduced in [21], it can be shown that only rule 254 forms a uniform TACA for all lengths. The other four rules form multilength cycle attractors along with the two (all 0s and all 1s) single length cycle attractors.
CA rules for single length 2-attractor CA (TACA).
Group | Rules for TACA
3     | 38, 52
4     | 46, 106, 116, 120, 166, 180, 235, 249
5     | 174, 239, 244, 253, 254
From the state transition diagram of the uniform CA configured with rule 254 (Figure 8), it can be observed that the appearance of “1”(s) in the initial state acts as a switch from the 0-basin to the (2^n − 1)-basin (referred to as the 1-basin in the figure). The 4-cell CA ⟨254,254,254,254⟩, when initialized with the all 0s seed, stays in the 0-basin (attractor with LSB “0”); on the other hand, when initialized with a nonzero seed, the CA settles in the 15-basin (1-basin), that is, in the attractor with LSB “1”. This behavior of rule 254 matches the design requirement for the CAVUFULL-DIR. The design is detailed in the following subsection.
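This behavior of ⟨254,254,254,254⟩ can be checked exhaustively. The sketch below, with our own helper name next254, verifies that the 4-cell uniform rule-254 CA has exactly two fixed points (all 0s and all 1s) and that every nonzero seed reaches the all 1s attractor within n−1 steps.

```python
# Exhaustive check of the claimed behavior of the 4-cell uniform CA
# <254,254,254,254>: rule 254 is the OR of the 3-cell neighborhood.

N = 4

def next254(s):
    """One step of the uniform rule-254 null boundary CA; s is a 4-bit state."""
    bits = [(s >> (N - 1 - i)) & 1 for i in range(N)]
    nxt = 0
    for i in range(N):
        left = bits[i - 1] if i > 0 else 0
        right = bits[i + 1] if i < N - 1 else 0
        nxt = (nxt << 1) | (left | bits[i] | right)
    return nxt

fixed_points = [s for s in range(2 ** N) if next254(s) == s]
print(fixed_points)                 # the single length cycle attractors

for s in range(1, 2 ** N):          # every nonzero seed
    t = s
    for _ in range(N - 1):          # run for n-1 time steps
        t = next254(t)
    assert t == 2 ** N - 1          # settles in the 15-basin attractor
```

The same loop with larger N illustrates why t = n−1 steps are always sufficient: a 1 at a terminal cell needs exactly n−1 steps to flood the whole CA.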
5.4. The CAVUFULL-DIR
Let the attractor of the CA for all the cases of column 3 (“NF” cases) of Table 2 be X1 (all 0s) and for the other cases (“F” cases) be X2 (all 1s). The CA rules must therefore be selected such that the attractor A1 = X1, corresponding to “NF”, and the attractor A2 = X2, corresponding to “F”, are distinct.
For the current design, we consider a uniform CA with rule 254. In the “NF” case, the CA representing the cache system traces through the X1-attractor basin (0-basin). If there is a fault in the cache coherence system, the CA traces through the X2-basin (1-basin). Hence, an attractor LSB of “0” signifies a “nonfaulty” recording and “1” signifies a “faulty” recording.
While representing the states “M,” “S,” and “I” of a cache line in the sharing status, we consider 11 for “M,” 10 for “S,” and 00 for “I”. If a cache line for B is in the “M” or “S” state in Ci (the ith processor’s cache), the ith bit of the sharing vector (pi) corresponding to block B is 1. That is, in a nonfaulty case, the MSB of the sharing status at Ci and the ith bit of the sharing vector (pi) are equal (either both 1 or both 0).
Theorem 10.
The MSB of sharing status (denoted as “11” for Modified, “10” for Shared, and “00” for Invalid) suffices to be considered for checking compatibility of sharing status with sharing vector.
Proof.
When the cache line states are represented in the sharing status, the states “M,” “S,” and “I” are encoded as 11, 10, and 00, respectively. Whenever a block (say, block B) is in the “M” state in one cache, the block’s state in the other caches should be “I” to maintain coherence; the cache holding the block in the “M” state is a dirty sharer. Whenever one or more caches hold the block in the “S” state following read(s), those caches (processors) are clean sharers. In both clean and dirty sharing, the corresponding presence bit(s) in the sharing vector should be set to “1”. This “1” in the presence bits conforms to the “1” in the MSB of the state code for “M” (11) or “S” (10). Hence, the compatibility of the sharing status and the sharing vector can be checked using only the MSB of the sharing status and the presence bits.
To ensure correctness in the recording of the sharing status and the sharing vector at an update of data block B, the CAVUFULL-DIR forms a compatibility status by XORing the MSB (Qi0) of the 2-bit sharing status (Qi0Qi1) of B at Ci with the ith presence bit (pi) of B's sharing vector; this compatibility status serves as the initial state of the ith cell of the CA selected for the test design. The CA is then run for t = (n-1) time steps, and the state of its least significant cell (the LSB of the attractor) indicates the presence of fault(s) in either the sharing status or the sharing vector.
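Under the conventions above, the check can be modelled in software as follows (an illustrative sketch of CAVUFULL-DIR; the function name and input encoding are our own, not from the hardware design):

```python
def cavu_full_dir(sharing_status, sharing_vector):
    """Return 1 if the recorded states are inconsistent, else 0.
    sharing_status: per-cache 2-bit codes, e.g. "11"=M, "10"=S, "00"=I.
    sharing_vector: per-cache presence bits of the directory entry."""
    n = len(sharing_status)
    # Compatibility status: MSB of each status XORed with the presence bit.
    state = [int(s[0]) ^ p for s, p in zip(sharing_status, sharing_vector)]
    for _ in range(n - 1):          # run the uniform CA 254, 254, ..., 254
        state = [((state[i - 1] if i else 0) | state[i] |
                  (state[i + 1] if i < n - 1 else 0))   # rule 254 = OR
                 for i in range(n)]
    return state[-1]                # LSB of the attractor

# Consistent recording: C0 holds B in "M", all others "I", presence bit at C0.
print(cavu_full_dir(["11", "00", "00", "00"], [1, 0, 0, 0]))  # 0 (nonfaulty)
# Faulty recording: C2 claims "S" but its presence bit is 0.
print(cavu_full_dir(["11", "00", "10", "00"], [1, 0, 0, 0]))  # 1 (faulty)
```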
The scheme reported above is described for the MSI protocol. However, it also applies to the MOSI/MESI/MOESI protocols. For example, if the states "M," "O," "E," "S," and "I" of MOESI are represented by 111, 101, 110, 100, and 000, respectively (001, 010, and 011 are don't cares), then the same logic as applied in the case of MSI is effective for verifying MOESI. The MSBs of the state codes for "M," "O," "E," and "S" are chosen as 1 (Theorem 10). Accordingly, the corresponding presence bits should be 1, and the compatibility of the two can be checked by applying XOR logic as in MSI (Section 4). Hence, for any state code of arbitrary length, the proposed logic applies if each state representing a sharing (dirty or clean) is assigned a binary code having "1" in its MSB.
6. Hardware Realization
The hardware realization of CAVUFULL-DIR is shown in Figure 11. If the state of the cached copy of B at the ith cache Ci is "Invalid," that is, Qi0Qi1 = 00, and the ith bit in the sharing vector (pi) is "0," then the XOR of Qi0 and pi is 0⊕0 = "0". The uniform CA 254, 254, …, 254, loaded with such an all-0s seed, is then run for (n-1) time steps and settles to the attractor 00⋯0. However, with B's state at Ci as "Invalid" (Qi0Qi1 = 00), if the ith bit in the sharing vector (pi) is "1," then the XOR of Qi0 and pi is 0⊕1 = "1". If the uniform CA 254, 254, …, 254 is loaded with such a nonzero seed (00⋯1⋯0) and run for (n-1) time steps, it settles to the attractor 11⋯1 (Figure 8). Hence, an attractor LSB (state of the least significant CA cell) of "0" indicates the absence of fault, and "1" indicates the presence of single or multiple faults. The part of the next state logic shown in Figure 11 is also shared in realizing the next state logic of cells i-1 and i+1.
Hardware realization of CAVUFULL-DIR.
The design of CAVUFULL-DIR reported in this section requires an n-cell CA for CMPs with n processor cores, and the CA needs to run for t (= depth of CA) time steps to decide on an inconsistency in the recording of cache line states. This delay of t = (n-1) computation steps, however, can be reduced by introducing segmentation of the CMPs' processor pool.
7. Delay Minimization
In segmentation, an n-core CMPs processor pool is considered as a collection of 2^q (q = 1, 2, …) segments, each of m = n/2^q cores. At each transition from a current cache line state to the next state (Table 2) and the corresponding update of the presence bit in the sharing vector and the sharing status, the proposed verification unit forms 2^q m-cell CA. For example, consider the case q = 1 (Figure 12). The n-bit compatibility status of the n-core CMPs is partitioned into 2^q = 2 halves/segments which are fed to two CA, CA1 and CA2, respectively. The compatibility status from C1, C2, …, Cn/2 is fed to CA1 and that of Cn/2+1, …, Cn to CA2 (assuming n is a power of 2). CA1 and CA2 are then run in parallel for m-1 = (n/2 - 1) time steps. The resulting attractors of CA1 and CA2 indicate an inconsistency, if any exists (Figure 12). That is, by sensing the LSBs of the two attractors (called check bits) of CA1 and CA2, the presence of a fault can be detected.
Verification unit with segmentation.
The segmentation effectively reduces the number of computation steps of the verification logic by a factor of 2^q (the number of segments). This is achieved, however, at the cost of check bits: the number of check bits equals the number of segments (2^q).
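The segmented check can be sketched as follows (illustrative; it assumes the n-bit compatibility status is split into contiguous equal segments, one rule-254 CA per segment):

```python
def rule254_step(state):
    # Rule 254: next state is the OR of the 3-cell neighbourhood (null boundary).
    n = len(state)
    return [(state[i - 1] if i else 0) | state[i] |
            (state[i + 1] if i < n - 1 else 0) for i in range(n)]

def segmented_check(compat, q):
    """Split the n-bit compatibility status into 2^q equal segments and run
    one rule-254 CA per segment for (m-1) steps; return the 2^q check bits."""
    n = len(compat)
    m = n // (2 ** q)
    check_bits = []
    for s in range(2 ** q):
        seg = compat[s * m:(s + 1) * m]
        for _ in range(m - 1):
            seg = rule254_step(seg)
        check_bits.append(seg[-1])      # LSB of this segment's attractor
    return check_bits

print(segmented_check([0, 0, 0, 0, 0, 0, 0, 0], 1))  # [0, 0]: no fault
print(segmented_check([0, 0, 0, 0, 0, 1, 0, 0], 1))  # [0, 1]: fault in segment 2
```

Each segment needs only (m-1) steps instead of (n-1), at the price of one check bit per segment.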
The verification unit introduced in the earlier sections decides on inconsistencies after each transaction. However, instead of verifying each individual transaction, we sometimes need to maintain a log for a set of transactions; the target is to keep track of whether one or more transactions are faulty. The CA has the capability to store/memorize information [17], and this feature has been exploited to synthesize a more efficient (high-speed) test design, reported in the next section.
8. High-Speed Verification
The proposed high-speed verification unit (CAVUHIGH-SPEED) is capable of verifying a set of transactions. For a set of k transactions, if the jth transaction (j≤k) is faulty, the CAVUHIGH-SPEED keeps a trace of this transaction till completion of all the k transactions. That is, for an instance of a faulty transaction, CAVUHIGH-SPEED captures it and memorizes it.
For each transaction in n-processor core CMPs, with compatibility status CS, an n-cell CA is formed at the CAVUHIGH-SPEED. CSi (the ith bit of CS) is used to set the ith CA cell rule. If the CA is then run for a certain number (t) of steps, it settles to an attractor. During the execution of k transactions, if one or more transactions are found to be faulty, the effect of the fault propagates to the least significant cell (LSB) of the CA (attractor). In the presence of faulty transaction(s), the CA settles to an attractor with LSB "1", and when all the transactions are nonfaulty, the CA settles to an attractor with LSB "0". The precise steps followed to realize the CAVUHIGH-SPEED design are noted in the following algorithm.
Algorithm 2 (FUNCTION-CAVUHIGH-SPEED).
Input. Presence bits and MSBs of sharing status for k transactions.
Output. Decision on presence of single or multiple faults after k transactions.
Initialize the n-cell CA with all 0s.
For i = 1 to n: Si = 0; store it in St. (2)
For j = 1 to k repeat (a) to (e):
(a) Perform the jth transaction (read/write).
(b) For i = 1 to n: CSi = Qi0 ⊕ pi. (3)
(c) Construct the CA (CSi sets the ith cell rule: CSi = 0 sets rule Ro and CSi = 1 sets rule Rh).
(d) Run the CA 1 step to compute St+1 from St.
(e) Store St+1 in St.
Run the CA for (n-2) time steps, considering St as the present state. It settles to an attractor state A. The LSB ("1"/"0") of the attractor indicates the presence/absence of fault(s).
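Algorithm 2 can be modelled in software as follows (an illustrative sketch, with Ro = 254 and Rh = 255 as in the text and a null boundary assumed; the function name is our own):

```python
RULE_NF, RULE_F = 254, 255   # Ro and Rh of Algorithm 2

def step(state, rules):
    """One synchronous step of a null-boundary 3-neighbourhood CA."""
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        rmt = 4 * left + 2 * state[i] + right
        nxt.append((rules[i] >> rmt) & 1)
    return nxt

def cavu_high_speed(compat_per_txn):
    """compat_per_txn[j] is the n-bit compatibility status (CS) of the jth
    transaction.  Returns the attractor LSB: 1 if any transaction in the
    batch was faulty, else 0."""
    n = len(compat_per_txn[0])
    state = [0] * n                     # all-0s initial CA state
    rules = [RULE_NF] * n
    for cs in compat_per_txn:
        rules = [RULE_F if b else RULE_NF for b in cs]  # CSi sets the cell rule
        state = step(state, rules)      # one CA step per transaction
    for _ in range(n - 2):              # settle the last CA to its attractor
        state = step(state, rules)
    return state[-1]

print(cavu_high_speed([[0, 0, 0, 0]] * 5))  # 0: all five transactions nonfaulty
print(cavu_high_speed([[0, 0, 0, 0], [0, 1, 0, 0],
                       [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]))  # 1
```

The second call returns 1 even though only the middle transaction was faulty: the one-way switch into the 1-basin memorizes the fault across the remaining nonfaulty transactions.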
Unlike CAVUFULL-DIR, the current design demands synthesis of uniform as well as hybrid CA. That is,
the CA formed should be a single length cycle attractor CA;
when all the transactions are nonfaulty, a uniform CA is formed with LSB of attractor "0" and the CA traces through the 0-basin of the attractor;
in the presence of one or more faulty transactions, a hybrid SACA having LSB of attractor "1" should be formed (say, with the all-1s attractor) and the CA traces through the 1-basin;
the hybridization of the uniform CA converts it into an SACA so that the occurrence of a fault is translated into a one-way switch from the 0-basin to the 1-basin. This helps preserve the trace of a faulty transaction, if any.
For example, the CA of Figure 8 can be chosen for the nonfaulty transactions; it then traces through the 0-basin (self-loop). In the presence of a faulty transaction, a hybrid CA of the type shown in Figure 13 is formed. The hybrid CA traces through the 1-basin, that is, toward the attractor with LSB "1".
State transition after hybridization.
CA254,255,254,254
CA254,255,255,254
Theorem 11.
A uniform TACA with rule R is converted into an SACA, when hybridized with rule 255 at single or multiple positions (cells), if RMTs 0 and 7, either of RMTs 1 and 3, and one of RMTs 4 and 6 are passive in R.
Proof.
Let us consider that the uniform TACA with rule R is hybridized with rule 255 at the (i+1)th cell. Since the passive RMTs of rule 255 are 2, 3, 6, and 7, the passive RMTs for the ith cell can be RMTs 1 and 5 (for which the RMTs of the (i+1)th cell are 2 and 3) and RMTs 3 and 7 (for which the RMTs of the (i+1)th cell are 6 and 7). Since R is a TACA rule, RMT 5 of R cannot be passive (Property 4). Further, the RMTs on which the (i+2)th cell (configured with R) can change its state are 4, 6, and 7 (RMT 5 is active). The 0 branch (created with passive RMT 0 of R) of the RTA is blocked by the hybridization, as rule 255 does not have RMT 0 as passive. Now, for the continuity of the nonzero branch, either of RMTs 1 and 3 and one of RMTs 4 and 6 must be passive. With these sequences of passive RMTs of rule R and rule 255, only one path from the root to the (i+1)th node, as well as only one path from the (i+1)th node to the leaf, is possible in the RTA of the CA; that is, the RTA has only one path from the root to a leaf node, which corresponds to the only single length cycle attractor. Further, it can be verified from the NSRT diagram [21] that the hybridization does not introduce any additional multilength cycle. Hence, the CA resulting from hybridization is an SACA.
For example, a uniform CA constructed with TACA rule 174, 244, or 254, when hybridized with rule 255 at single or multiple positions, is transformed into an SACA.
Corollary 12.
The uniform TACA with rule 254 (Ro) is converted into an SACA when hybridized with rule 255 (Rh) at single or multiple positions.
Proof.
A uniform TACA configured with rule 254 forms only two attractors, X1 = 00⋯0 and X2 = 11⋯1 (Theorem 9); that is, the RTA has only two branches leading to two leaf nodes (X1 and X2). The all-0s branch is followed on the self-replication of RMT 0. The passive RMTs of rule 254 are RMTs 0, 2, 3, 6, and 7, and those of rule 255 are 2, 3, 6, and 7. Now, if the uniform CA 254, 254, …, 254 is hybridized with rule 255 at any position, the "0" branch (all-0s attractor) is blocked, leaving only the all-1s attractor. Further, it can be verified from the NSRT diagram [21] that the hybridization does not introduce any additional multilength cycle. Hence, the resulting CA is an SACA.
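Corollary 12 can be checked exhaustively for a small CA: enumerating all states of the 5-cell CA 254, 254, 255, 254, 254 confirms that every state falls into a single fixed-point (all-1s) attractor (an illustrative brute-force check, not a substitute for the RTA/NSRT argument):

```python
from itertools import product

def step(state, rules):
    n = len(state)
    return [(rules[i] >> (4 * (state[i - 1] if i else 0)
                          + 2 * state[i]
                          + (state[i + 1] if i < n - 1 else 0))) & 1
            for i in range(n)]

def cycle_of(state, rules):
    """Iterate the CA from `state` until a state repeats; return the cycle."""
    seen, trace = {}, []
    s = tuple(state)
    while s not in seen:
        seen[s] = len(trace)
        trace.append(s)
        s = tuple(step(list(s), rules))
    return tuple(trace[seen[s]:])

n = 5
hybrid = [254, 254, 255, 254, 254]   # uniform 254 hybridized with 255 at cell 2
cycles = {cycle_of(list(s), hybrid) for s in product([0, 1], repeat=n)}
print(cycles)   # a single fixed-point attractor: all 1s
```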
For the current realization, we consider hybridization (with Rh=255) of a uniform CA formed with rule Ro=254. Such hybridization allows merger of attractor basins of the uniform CA (0-basin of uniform CA is merged with the 1-basin of hybrid CA). That is, for “NF” case of Section 4, the system traces through the X1-basin (0-basin, Figure 8) of the uniform CA and for a fault the hybrid CA is formed and the system traces through the merged X2-basin (15-basin, Figure 13) of hybrid CA.
Figure 14 describes the operation of the CAVUHIGH-SPEED. Let us consider a system of 4 caches and a set of 5 transactions on the caches, among which the 1st transaction results in a single fault and the 2nd transaction results in a double fault (indicated by the 1s in the compatibility status). The 0th transaction is nonfaulty (compatibility status 0000, as shown in Figure 14), so a 4-cell uniform CA 254, 254, 254, 254 is formed. The CA is run for 1 time step with the all-0s initial seed and produces the next state 0000. Since transaction 1 is faulty, it results in a hybrid CA configured with rules 254 and 255 (254, 255, 254, 254). The hybrid CA, when run for 1 step from the 0000 state, produces the next state 0100. Transaction 2, a faulty transaction with double faults at cell 1 and cell 2, also results in a hybrid CA 254, 255, 255, 254. This CA, when run for 1 step from the 0100 state, produces the next state 1110. Transactions 3 and 4 are nonfaulty; hence, the CA constructed for each of them is the uniform CA with rule 254 (254, 254, 254, 254). The CA for transaction 3, run for 1 step from the 1110 state, results in the state 1111. The CA for transaction 4 is then run for (n-1) = 3 steps and reaches the attractor 1111. The "1" in the LSB of the attractor indicates the presence of faulty transactions.
Functioning of the CAVUHIGH-SPEED.
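The state trace of this example can be reproduced with a short simulation (illustrative; the rule vectors below encode the compatibility status of each transaction, 1 → rule 255):

```python
def step(state, rules):
    """One step of a null-boundary 3-neighbourhood CA; rules[i] is the
    Wolfram rule number of cell i."""
    n = len(state)
    nxt = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        nxt.append((rules[i] >> (4 * left + 2 * state[i] + right)) & 1)
    return nxt

transactions = [
    [254, 254, 254, 254],   # txn 0: nonfaulty   (compatibility 0000)
    [254, 255, 254, 254],   # txn 1: single fault (0100)
    [254, 255, 255, 254],   # txn 2: double fault (0110)
    [254, 254, 254, 254],   # txn 3: nonfaulty
    [254, 254, 254, 254],   # txn 4: nonfaulty
]
state = [0, 0, 0, 0]
trace = []
for rules in transactions:          # one CA step per transaction
    state = step(state, rules)
    trace.append(state)
for _ in range(2):                  # last CA runs (n-1) = 3 steps in total
    state = step(state, transactions[-1])
print(trace)   # [[0,0,0,0], [0,1,0,0], [1,1,1,0], [1,1,1,1], [1,1,1,1]]
print(state)   # [1, 1, 1, 1]: LSB 1 flags the earlier faulty transactions
```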
9. Experimental Results
The performance of the proposed verification unit is evaluated in Multi2Sim [22]. A module realizing the verification unit is developed and integrated into Multi2Sim. The standard programs in the SPLASH-2 benchmark suite [23] are used as the workload. The L1 cache in each core is unified for instructions and data, and L2 is shared. The test environment and parameters described in Table 5 are considered for the experiments. The benchmark programs are run with the input data sets listed in Table 6.
We evaluate the percentage of memory references (load/store) that result in a state change in the caches (Figure 15) to determine how frequently the verification unit needs to verify transactions. If the number of memory references resulting in state changes increases, the system becomes vulnerable to even a single fault. Figure 15 shows that such memory references increase with the number of cores. This is due to the increase in memory traffic resulting from coherence misses and related effects.
Percentage of memory references causing state change in caches.
Some faults in the system do not lead to errors and hence remain hard to detect. Figure 16 depicts the percentage of faults turning into errors. For our evaluation, we have randomly injected faults at various parts of the coherence controller (CC) at intervals of 1000 and 10,000 clock cycles. The error coverage (percentage of errors detected) for fault injection intervals of 10,000 cycles and 1000 cycles is reported in Figures 17 and 18, respectively. These show that the proposed CA based verification unit (CAVUFULL-DIR) ensures error coverage almost the same as that of the scheme reported in [2].
Percentage of faults leading to errors.
Error coverage with fault injection interval of 10000 clock cycles.
Error coverage with fault injection interval of 1000 clock cycles.
Table 7 reports the comparison of the CA based schemes with and without segmentation (Section 7). Column 1 notes the number of processor cores. Columns 2 and 3 report the number of computation steps and the number of check bits required to decide on the incoherency (without segmentation). The requirements in segmentation based scheme are shown in columns 4–6.
Performance in CA segmentation.
| Number of cores | Comp. steps (no seg.) | Check bits (no seg.) | Segments | Comp. steps (seg.) | Check bits (seg.) |
|-----------------|-----------------------|----------------------|----------|--------------------|-------------------|
| (1)             | (2)                   | (3)                  | (4)      | (5)                | (6)               |
| 16              | 16                    | 1                    | 2        | 8                  | 2                 |
|                 |                       |                      | 4        | 4                  | 4                 |
| 256             | 256                   | 1                    | 2        | 128                | 2                 |
|                 |                       |                      | 4        | 64                 | 4                 |
| 1024            | 1024                  | 1                    | 2        | 512                | 2                 |
|                 |                       |                      | 4        | 256                | 4                 |
|                 |                       |                      | 8        | 128                | 8                 |

Columns 2-3 refer to CAVUFULL-DIR without segmentation and columns 4-6 to CAVUFULL-DIR with segmentation.
In columns 5–10 of Table 8, we report the area overhead (gate counts: FFs and 2-input NAND/XOR gates) of the CA based designs for CMPs with 16 to 256 processor cores (column 1). The area computation follows the units specified in mcnc.genlib [24]. The requirements of the design reported in [2] are given in columns 2–4 for comparison. Columns 5–8 show the overhead (gate count and area) of the verification logic for MSI (Section 4), without segmentation of the processor pool. Columns 9 and 10 give the area requirements for the MESI and MOESI protocols, respectively. Comparison of the figures in columns 4 and 8 reveals that the CA based verification logic achieves a considerable reduction in area.
Hardware requirements for CAVUFULL-DIR.
| Number of cores | Scheme [2]: FFs | NANDs | Area (units) | CAVUFULL-DIR (MSI): FFs | NANDs | XORs | Area (units) | MESI: Area (units) | MOESI: Area (units) |
|-----|--------|-----|-----------|-----|------|-----|---------|--------|--------|
| (1) | (2)    | (3) | (4)       | (5) | (6)  | (7) | (8)     | (9)    | (10)   |
| 16  | 12824  | 159 | 47824016  | 16  | 144  | 16  | 259840  | 259840 | 259840 |
| 32  | 21152  | 159 | 78737552  | 32  | 288  | 32  | 519680  | 519680 | 519680 |
| 64  | 41512  | 159 | 154313872 | 64  | 576  | 64  | 1039360 | 519680 | 519680 |
| 128 | 80432  | 159 | 298784912 | 128 | 1152 | 128 | 2078720 | 519680 | 519680 |
| 256 | 159288 | 159 | 591498384 | 256 | 2304 | 256 | 4157440 | 519680 | 519680 |
The overhead of the CAVUHIGH-SPEED is shown in Table 9. The gate counts are provided in columns 2, 3, and 4. The area computed as per [24] is reported in column 5; column 6 reproduces the area of CAVUFULL-DIR for comparison. It can be observed that the hardware overhead of CAVUHIGH-SPEED is the same as that of CAVUFULL-DIR. However, CAVUHIGH-SPEED is preferable when a set of transactions, rather than each individual transaction, needs to be verified. For a system of n processor cores, the design of CAVUFULL-DIR requires (n-1) computation steps to make a decision on a defect after each transaction, whereas, for k transactions, the CAVUHIGH-SPEED ensures a correct decision on a fault in exactly (k+n-2) computation steps ((k-1) steps for the first (k-1) transactions and (n-1) steps for the last transaction) of the CA hardware. The design thus achieves a speedup of
Sk = k(n-1)/(k+n-2) (4)
over the design of CAVUFULL-DIR for a set of k transactions.
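A quick numeric reading of (4) (illustrative): the gain vanishes for a single transaction and approaches (n-1) for large batches.

```python
def speedup(k, n):
    """S_k = k(n-1)/(k+n-2): CAVUFULL-DIR needs (n-1) steps per transaction,
    CAVUHIGH-SPEED needs (k+n-2) steps for a batch of k transactions."""
    return k * (n - 1) / (k + n - 2)

print(speedup(1, 256))             # 1.0: no gain for a single transaction
print(round(speedup(1000, 256)))   # 203: a large batch approaches n-1 = 255
```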
Hardware requirements for high-speed design (CAVUHIGH-SPEED).
| Number of cores | CAVUHIGH-SPEED: FFs | NANDs | XORs | Area (units) | CAVUFULL-DIR: Area (units) |
|-----|-----|------|-----|---------|---------|
| (1) | (2) | (3)  | (4) | (5)     | (6)     |
| 16  | 16  | 144  | 16  | 259840  | 259840  |
| 32  | 32  | 288  | 32  | 519680  | 519680  |
| 64  | 64  | 540  | 64  | 1039360 | 1039360 |
| 128 | 128 | 1152 | 128 | 2078720 | 2078720 |
| 256 | 256 | 2304 | 256 | 4157440 | 4157440 |
10. Verification for Limited Directory
The design described in Sections 4 and 5 is tuned to a full-map directory, in which a sharing vector is traditionally maintained for a block B to indicate the cached copies of B in the system. This sharing vector is of length (n+1) for a system of n processor cores. As a result, the directory storage overhead is unacceptable for CMPs with thousands of processor cores. An alternative scheme with a compact directory organization is the limited directory protocol [25, 26]. In such an organization, the (n+1)-bit sharing vector of block B is replaced by a fixed number of pointers to the processors' caches that hold a copy of B. In this section, we address the design of a verification logic for a system with a limited directory, considering a non-broadcast-based solution to handle the case wherein the directory runs short of pointers [25].
In a system with limited directory protocol, the pointers indicating processor ids/caches need to be decoded. The structure of limited directory, shown in Figure 19, indicates that caches C2 and C3 (corresponding to processors P2 and P3) are currently sharing a block B and at most four processors can share block B simultaneously (as 4 fields are for the pointers). That is, the update of sharing status and the pointers involves a small and fixed number of entries. Hence, the CA based verification unit for limited directory (CAVULIM-DIR) can be realized with an r-cell CA (r≪n), where r is the maximum number of sharers allowed for a block and thereby reduces the time to decide the coherence status.
Sharing vector for limited directory.
Figure 20 describes an architecture of CAVULIM-DIR with four pointers. For each cache transaction, after the pointers are updated, each pointer is decoded to access the corresponding cache, and the status of the block is read from that cache. Depending on the number of pointers r (in Figure 20, r = 4), an r-cell CA is formed. The MSB of the status of a block (B) in the cache pointed to by the ith pointer is used to set the ith CA cell rule: if the MSB of the status is "1", the corresponding CA cell rule is set to 254; otherwise, it is set to rule 255. This differs from the rule selection of CAVUFULL-DIR. Once the CA cell rules are set, the r-cell CA is run for (r-1) time steps and the LSB of the attractor ("1"/"0") indicates the presence/absence of a fault in the limited directory entry.
Verification unit for limited directory.
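A software sketch of the CAVULIM-DIR check (illustrative; the text does not state the CA's initial seed, so an all-0s seed is assumed here, under which a rule-255 cell flags a pointer whose target cache does not actually share the block):

```python
def step(state, rules):
    n = len(state)
    return [(rules[i] >> (4 * (state[i - 1] if i else 0)
                          + 2 * state[i]
                          + (state[i + 1] if i < n - 1 else 0))) & 1
            for i in range(n)]

def cavu_lim_dir(pointed_msbs):
    """pointed_msbs[i]: MSB of the block's state in the cache named by the
    ith directory pointer (1 = M/S, 0 = I).  Returns 1 on inconsistency."""
    r = len(pointed_msbs)
    # MSB "1" -> rule 254; otherwise rule 255 (pointer to a nonsharing cache).
    rules = [254 if m else 255 for m in pointed_msbs]
    state = [0] * r                  # assumed all-0s seed (not stated in text)
    for _ in range(r - 1):
        state = step(state, rules)
    return state[-1]                 # LSB of the attractor

print(cavu_lim_dir([1, 1, 1, 1]))   # 0: every pointer names a genuine sharer
print(cavu_lim_dir([1, 0, 1, 1]))   # 1: pointer 1 names a cache holding "I"
```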
The overhead of CAVULIM-DIR, considering a four-pointer representation of the limited directory (as shown in Figure 19), is illustrated in Table 10. Column 1 represents the number of cores and columns 2 and 3 note the gate counts (number of FFs and 2-input NANDs). Column 4 records the area overhead. The area overhead of the CAVUFULL-DIR has been reproduced in column 5 for comparison. The results show considerable reduction in gate count and the area overhead in CAVULIM-DIR compared to those of CAVUFULL-DIR.
Hardware requirements for CAVULIM-DIR.
| # cores | CAVULIM-DIR (four pointers): # FFs | # NANDs | Area (units) | CAVUFULL-DIR (Section 4): Area (units) |
|-----|-----|-----|--------|---------|
| (1) | (2) | (3) | (4)    | (5)     |
| 16  | 4   | 114 | 345216 | 259840  |
| 32  | 4   | 114 | 345216 | 519680  |
| 64  | 4   | 114 | 345216 | 1039360 |
| 128 | 4   | 114 | 345216 | 2078720 |
| 256 | 4   | 114 | 345216 | 4157440 |

The four-pointer CAVULIM-DIR hardware is independent of the core count, so its figures (columns 2-4) remain constant across the rows.
11. Related Work
The schemes ensuring coherency in CMPs with a large number of cores are reported in [2–5, 11, 27]. These deal with the interactions between the on-chip interconnection network and the cache coherence protocol. Liqun et al. [3] propose a solution with an interconnection network composed of wires of varying latency, bandwidth, and energy characteristics, in which coherence operations are intelligently mapped to the appropriate wires of the heterogeneous interconnect. Zhao et al. [28] propose an alternative L2 cache architecture, where each processor has a split private and shared L2 cache. When data is loaded, it is placed in the private or shared L2 according to its state (exclusive or shared). This scheme efficiently utilizes the on-chip L2 capacity while ensuring low average access latency; it employs a snooping cache coherence protocol and verifies it with a formal verification method. A network caching architecture was proposed in [5] to address the on-chip memory cost of the directory and long L1 cache miss latencies; the directory information is stored in the network interface component, thus eliminating the directory structure from the L2 cache. A verification logic that can dynamically detect errors in the coherence controller (CC) has been proposed in [2]. Ros et al. have proposed the coherence protocol Dico-CMP for tiled CMP architectures [11]. Dico-CMP avoids indirection by providing a block from the owner node instead of the home node, thus reducing the network traffic compared to broadcast-based protocols. Further, a scalable directory organization based on duplicating tags has been proposed to ensure that the directory bank size is independent of the number of tiles. However, the verification of cache coherence, an important problem, has not been addressed adequately, possibly due to the lack of an efficient verification tool [29]; the verification logic reported in [2] caters to snoop based systems only.
12. Conclusion
A solution for the quick determination of data inconsistencies in caches, as well as in the recording of the sharing vectors, in a system realizing a directory based cache coherence protocol is reported. It avoids rigorous computation and communication overhead, assuring a robust and scalable design, especially for a system with thousands of processor cores. The design is shown to be highly flexible, catering to different cache coherence protocols, for example, MSI/MESI/MOESI. The introduction of segmentation of the CMPs' processor pool ensures better efficiency in deciding on the inconsistencies in maintaining cache coherence. Further, the design has been extended to cope with the limited directory based protocol.
Competing Interests
The authors declare that they have no competing interests.
References

[1] K. Olukotun and L. Hammond, "The future of microprocessors," ACM Queue, vol. 3, no. 7, pp. 26–29, 2005.
[2] H. Wang, S. Baldawa, and R. Sangireddy, "Dynamic error detection for dependable cache coherency in multicore architectures," in Proceedings of the 21st International Conference on VLSI Design (VLSI DESIGN '08), Hyderabad, India, January 2008, pp. 279–284.
[3] C. Liqun, N. Muralimanohar, K. Ramani, R. Balasubramonian, and J. B. Carter, "Interconnect-aware coherence protocols for chip multiprocessors," in Proceedings of the 33rd International Symposium on Computer Architecture (ISCA '06), Boston, Mass, USA, June 2006, pp. 339–350.
[4] R. E. Ahmed, "Energy-aware cache coherence protocol for chip-multiprocessors," in Proceedings of the Canadian Conference on Electrical and Computer Engineering (CCECE '06), Ottawa, Canada, May 2006, pp. 82–85.
[5] J. Wang, Y. Xue, H. Wang, and D. Wang, "Network caching for chip multiprocessors," in Proceedings of the IEEE 28th International Performance Computing and Communications Conference (IPCCC '09), Scottsdale, Ariz, USA, December 2009, pp. 341–348.
[6] R. Gong, K. Dai, and Z. Wang, "Transient fault recovery on chip multiprocessor based on dual core redundancy and context saving," in Proceedings of the 9th International Conference for Young Computer Scientists (ICYCS '08), Hunan, China, November 2008, pp. 148–153.
[7] A. Yamawaki and M. Iwane, "Coherence maintenances to realize an efficient parallel processing for a cache memory with synchronization on a chip-multiprocessor," in Proceedings of the 8th International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '05), December 2005.
[8] C. Fensch, N. Barrow-Williams, R. D. Mullins, and S. Moore, "Designing a physical locality aware coherence protocol for chip-multiprocessors," IEEE Transactions on Computers, vol. 62, no. 5, pp. 914–928, 2013.
[9] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 3rd edition, 2003.
[10] A. Ros, M. E. Acacio, and J. M. García, "Scalable directory organization for tiled CMP architectures," in Proceedings of the International Conference on Computer Design (CDES '08), July 2008, pp. 112–118.
[11] A. Ros, M. E. Acacio, and J. M. García, "A direct coherence protocol for many-core chip multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 12, pp. 1779–1792, 2010.
[12] S. Wolfram, Cellular Automata and Complexity: Collected Papers, Addison-Wesley, 1994.
[13] M. Dalui, K. Gupta, and B. K. Sikdar, "Directory based cache coherence verification logic in CMPs cache system," in Proceedings of the 1st International Workshop on Many-Core Embedded Systems (MES '13), in conjunction with ISCA '13, Tel-Aviv, Israel, June 2013, pp. 33–40.
[14] M. Dalui and B. K. Sikdar, "Design of directory based cache coherence protocol verification logic in CMPs around TACA," in Proceedings of the 11th International Conference on High Performance Computing and Simulation (HPCS '13), Helsinki, Finland, July 2013, pp. 318–325.
[15] G. Kurian, J. E. Miller, J. Psota, J. Eastep, J. Liu, J. Michel, L. C. Kimerling, and A. Agarwal, "ATAC: a 1000-core cache-coherent processor with on-chip optical network," in Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT '10), Vienna, Austria, September 2010, pp. 477–488.
[16] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy, "The directory-based cache coherence protocol for the DASH multiprocessor," in Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA '90), 1990, pp. 148–159.
[17] P. Pal Chaudhuri, D. Roy Chowdhury, S. Nandi, and S. Chatterjee, Additive Cellular Automata: Theory and Applications, vol. 1, IEEE Computer Society Press, Los Alamitos, Calif, USA, 1997.
[18] M. Dalui and B. K. Sikdar, "An efficient test design for verification of cache coherence in CMPs," in Proceedings of the 9th IEEE International Conference on Dependable, Autonomic and Secure Computing (DASC '11), Sydney, Australia, December 2011, pp. 328–334.
[19] S. Das, N. N. Naskar, S. Mukherjee, M. Dalui, and B. K. Sikdar, "Characterization of CA rules for SACA targeting detection of faulty nodes in WSN," in Proceedings of the 9th International Conference on Cellular Automata for Research and Industry (ACRI '10), Ascoli Piceno, Italy, September 2010.
[20] S. Das, S. Mukherjee, N. N. Naskar, and B. K. Sikdar, "Characterization of single cycle CA and its application in pattern classification," vol. 252, pp. 181–203, 2009.
[21] M. Dalui, Ph.D. thesis, IIEST Shibpur, Howrah, India, 2014.
[22] R. Ubal, B. Jang, P. Mistry, D. Schaa, and D. Kaeli, "Multi2Sim: a simulation framework for CPU-GPU computing," in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT '12), ACM, September 2012, pp. 335–344.
[23] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: characterization and methodological considerations," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pp. 24–36.
[24] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, R. Brayton, and A. L. Sangiovanni-Vincentelli, "SIS: a system for sequential circuit synthesis," Tech. Rep. UCB/ERL M92/41, Electronics Research Laboratory, 1992.
[25] A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, "An evaluation of directory schemes for cache coherence," in Proceedings of the 15th International Symposium on Computer Architecture (ISCA '88), Honolulu, Hawaii, USA, May–June 1988.
[26] M. Mahmoud and A. Wassal, "Hybrid limited-pointer linked-list cache directory and cache coherence protocol," in Proceedings of the 2nd International Japan-Egypt Conference on Electronics, Communications and Computers (JEC-ECC '13), Cairo, Egypt, December 2013, pp. 77–82.
[27] E. Jerger, Ph.D. thesis, University of Wisconsin-Madison, 2008.
[28] X. Zhao, K. Sammut, and F. He, "Formal verification of a novel snooping cache coherence protocol for CMP," in Proceedings of the CMP-MSI: Workshop on Chip Multiprocessor Memory Systems and Interconnects, 2007.
[29] F. Pong and M. Dubois, "Verification techniques for cache coherence protocols," ACM Computing Surveys, vol. 29, no. 1, pp. 82–126, 1997.