Adaptive Fault-Tolerant Routing in 2D Mesh with Cracky Rectangular Model

This paper focuses on routing in two-dimensional mesh networks. We propose a novel faulty block model, the cracky rectangular block, for fault-tolerant adaptive routing. All faulty nodes and faulty links are enclosed in blocks of this type, which are convex structures, in order to avoid routing livelock. Additionally, the model constructs an interior spanning forest for each block in order to stay in contact with the nodes inside the block. The block construction procedure is dynamic and fully distributed, and the construction algorithm is simple and easy to implement. The block is fully adaptive: it dynamically adjusts its size according to the state of the network, whether faults emerge or are repaired, without shutting down the system. Based on this model, we also develop a distributed fault-tolerant routing algorithm. We then give a formal proof that, with this algorithm, messages always reach their destinations if and only if the destination nodes remain connected to the mesh network. The new model and routing algorithm therefore maximize the availability of the nodes in the network, which is a noticeable overall improvement in the fault tolerance of the system.


Introduction
In recent decades, many researchers have studied communication operations in networks with fixed topologies, including the modeling of architectures and routing algorithms for parallel computers and for cluster or medium-area communication networks (such as metropolitan networks covering a town or a small region). The quality of such networks strongly depends on the correct and efficient execution of communication operations.
Direct networks [1] have become a popular architecture for communication networks, especially in massively parallel computer systems. In a direct network, each node (computer) is connected to only a few other nodes, its neighbours, according to the topology of the network, and nodes communicate with each other by exchanging messages. The mesh is one of the most important topologies of direct networks. Low-dimensional meshes in particular, owing to their low node degree, are more popular than high-dimensional meshes. Currently, most parallel computer architectures are based on the two-dimensional mesh topology, for example, Seitz et al. 1988 [2], Intel Touchstone DELTA [3,4], and Intel Paragon.
Several models based on direct networks have been studied ([5][6][7][8][9]), especially the two-dimensional mesh ([10][11][12][13][14][15][16], etc.), for communication operations. These papers mainly focus on how to route messages in the two-dimensional mesh. Routing is the process of sending messages from source nodes to destination nodes through some intermediate nodes. A very important aspect of message routing is the ability to route from a source node to a destination node while avoiding all faulty nodes and links.
Basically, there are two types of message routing: (1) deterministic routing, in which the route between any given pair of nodes is determined in advance of transmission, and (2) adaptive routing, in which a message may take any path between its source and its final destination; that is, the path is constructed adaptively during routing.
The advantage of deterministic routing algorithms is that they are simple and easy to implement. However, adaptive routing can reduce network latency and increase network throughput, and its most attractive feature is that it can tolerate more faults than deterministic routing [17]. Adaptive routing has therefore emerged as an attractive field. Most papers in this field consider how to build a path between a source and a destination node that avoids the faulty nodes, and most of this work uses the disconnected rectangular block fault model [11]. The disconnected rectangular blocks are composed of the faulty nodes and their neighboring nonfaulty nodes, grouped so as to maintain a rectangular shape. As a result, adaptive routing can tolerate faulty nodes by bypassing these rectangles. However, in order to maintain its rectangular shape, a block has to include some nonfaulty nodes, called unsafe nodes in these papers. These unsafe nodes are never used until their corresponding blocks recover, and messages are never sent to them, although they should be (as illustrated in Figure 1).
Chien and Kim [18] present a partially adaptive algorithm for mesh networks. The basic idea is to route around any convex faulty region. If the faulty regions are not naturally convex, good nodes and links are marked as faulty until the regions become convex. However, once faults are located on a boundary, all nodes on that boundary must be marked faulty in order to tolerate them. Boppana and Chalasani [10] use f-chains and f-rings, an extension of the disconnected rectangular block fault model, to route messages around faults, and the f-chain addresses the boundary problem in Chien and Kim's paper. However, f-chains and f-rings may connect with each other, which makes the routing algorithm more complex than that of [18]. In [11], Su and Shin assume a node to be the basic fault element. They construct the blocks based only on the faulty nodes; thus they can tolerate only faulty nodes, not faulty links. Overall, the construction of these faulty regions is static; that is, once the regions are constructed, all the nodes in them, including the good ones, can no longer take part in routing. The faulty regions are not self-adaptive; that is, if some of the faulty nodes in a region are repaired, the region stays as it was, although it could release some good nodes and shrink while keeping its convex shape.
Adaptive fault-tolerant routing techniques are also used in WSN (Wireless Sensor Networks), MEMS (Micro-Electro-Mechanical Systems), and SoC (System on Chip) to increase usability and robustness as well as overall performance. The network topology most often adopted in those domains is the 2D mesh. As a result, in recent years, a number of studies have focused on fault-tolerant routing in WSN and NoC [19][20][21][22].
In this paper, we concentrate on fault-tolerant adaptive routing in the two-dimensional mesh. We consider not only faulty nodes but also faulty links incident with any node. Unlike the papers mentioned above, the novel cracky rectangular block strategy introduced here can route messages both around a cracky rectangular block and along the cracks in the block (the cracks are only a metaphor; the messages are actually routed along the connected links inside the faulty blocks). We can therefore route messages to nodes both outside and inside the faulty blocks, which is a noticeable overall improvement in the fault tolerance of the system. At the same time, the cracky rectangular block is fully self-adaptive and can tolerate dynamic faults. For example, when some of the faulty nodes or faulty links in a block are repaired, the original block may become a smaller block or split into several smaller ones, each keeping its rectangular shape. Tolerating dynamic faults can extend the run-time life of a multicomputer and thus increase its reliability.
The rest of this paper is organized as follows. Section 2 describes the basic routing algorithm in the two-dimensional mesh. Section 3 introduces the cracky rectangular block strategy, including the cracky rectangular block model and the routing algorithm on it. This section also describes how the rectangular blocks adapt themselves to the state of the network. Section 4 proves that a message will be delivered to any destination in the mesh as long as the mesh remains connected. Section 5 concludes and presents possible directions for future work.

A Basic Routing Function in Two-Dimensional Mesh.
Consider a network  = (, ). In each node V, for each message  with final destination V  arriving on a link ⟨, V⟩, we denote by  V (, V  ) ⊂ Γ(V) the subset of V's neighbours that bring  closer to its destination if V  ≠ V; otherwise, the message is absorbed by V. This is a routing function, and this kind of routing is said to be local because it is independent of what happens in the rest of the network and can be computed locally by each router.
The basic routing function is a classical greedy routing function   in the two-dimensional mesh ( 1 ,  2 ) as follows.Let  = ( 1 ,  2 ),  = ( 1 ,  2 ) be two different nodes in ( 1 ,  2 ).A message  with destination  received by the router of  arriving from node   is sent to a node of   (  , ), that is, a set of at most two nodes { 1 ,  2 } (and at least one node) defined as follows.There are at most two nodes , then the routing function will choose to send the message to  1 (resp.,  2 ).If this is not possible (e.g., the link incident with the chosen node is faulty), then the routing function tries to send the message to  2 (resp.,  1 ).Moreover, if both the links ⟨,  1 ⟩ and ⟨,  2 ⟩ are faulty, then the router of  can route  to any node of Γ() \ { 1 ,  2 }.
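The greedy behaviour described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the coordinate convention (nodes as integer pairs in an `n1` x `n2` mesh), and the representation of faulty links as a set of unordered pairs are all assumptions made for illustration. The sketch returns the usable neighbours that bring the message closer to its destination (at most two) and, if both of those links are faulty, falls back to the remaining usable neighbours.

```python
from typing import FrozenSet, List, Set, Tuple

Node = Tuple[int, int]
Link = FrozenSet[Node]  # an undirected link is an unordered pair of nodes


def neighbours(u: Node, n1: int, n2: int) -> List[Node]:
    """Return the neighbours of u that lie inside an n1 x n2 mesh."""
    x, y = u
    candidates = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in candidates if 0 <= a < n1 and 0 <= b < n2]


def distance(u: Node, v: Node) -> int:
    """Mesh (Manhattan) distance between two nodes."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])


def greedy_next_hops(u: Node, dest: Node, n1: int, n2: int,
                     faulty_links: Set[Link]) -> List[Node]:
    """Candidate next hops for a message held by u with destination dest.

    Neighbours that reduce the distance are preferred. If every such link
    is faulty, any other neighbour reachable over a non-faulty link is
    allowed (hypothetical fallback order: mesh enumeration order).
    """
    if u == dest:
        return []  # the message is absorbed by its destination
    usable = [w for w in neighbours(u, n1, n2)
              if frozenset((u, w)) not in faulty_links]
    closer = [w for w in usable if distance(w, dest) < distance(u, dest)]
    return closer if closer else usable
```

With no faults, a message at (1, 1) heading for (3, 3) in a 4 x 4 mesh has the two candidates (2, 1) and (1, 2); marking both of those links as faulty leaves only the detour candidates (0, 1) and (1, 0), which is the situation that can lead to the blocking discussed next.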

Blocking Situation and Its Traditional Solution.
Consider now that a unique message  is transmitted in the two-dimensional mesh ( 1 ,  2 ). As we show below, when some link faults occur that do not disconnect the network, the basic two-dimensional routing function does not guarantee that  will reach its destination: the message can be blocked in a part of the network. As shown in Figure 2(a), the basic routing function can lead to blocking situations because of the structure of the faulty links. In this example the message always remains in the subgraph induced by nodes , , and  and never reaches its destination node . This message  is in a livelock, a situation in which a message keeps moving indefinitely without reaching its destination.
It is well known that adaptive routing may cause livelock. Routing without livelock is therefore one of the most important design issues for communication operations in multicomputer systems (note that we consider only livelock in this paper; deadlock can be handled with established methods [1,23,24]). Currently, livelock is well addressed by the traditional disconnected rectangular block fault model (rectangular block for short). However, in this model the usability and robustness of the mesh network gradually decrease as the number of faulty nodes increases. The experiments in [25] show that the distribution of faulty nodes tends to turn the whole mesh into one "big block": with the rectangular model, only one faulty block is left when the node fault rate is 15 percent and the size of the two-dimensional mesh is 100 × 100. Consequently, the whole mesh becomes useless because this big faulty block occupies the entire mesh region; we call this the "big block" problem. The novel cracky rectangular faulty block strategy, introduced in the next section, makes full use of the nonfaulty nodes/links in the mesh. All the nonfaulty nodes/links that would have been included in the original rectangular faulty blocks can now become candidate routing nodes/links.

Adaptive Fault-Tolerant Strategy with Cracky Rectangular Block
In order to solve the livelock situation and the big block problem, we propose a novel strategy for fault-tolerant routing. We use the cracky rectangular block to avoid livelock and, if needed, to traverse every connected internal node of a block. Therefore, we can transmit each message to any node, not only outside a block but also inside it, as in Figure 2(b), where the message can reach the inside nodes , , and , which are forbidden in the original rectangular block. Formally, a rectangular block (( 1 , ℎ 1 ), ( 2 , ℎ 2 )) is a submesh (ℎ 1 −  1 + 1, ℎ 2 −  2 + 1) of the mesh ( 1 ,  2 ) induced by the nodes  = ( 1 ,  2 ) with   ≤   ≤ ℎ  , for each  ∈ {1, 2}. Let  = ( 1 ,  2 ) be a node. By definition, if, for each  ∈ {1, 2}, we have   − 1 ≤   ≤ ℎ  − 1, then  belongs to the inside part of the rectangular block . Otherwise, if  = 1,  = 2 (or  = 2,  = 1) and   ∈ {ℎ  ,   } and   ≤   ≤ ℎ  , then  belongs to the border of the rectangular block.
A cracky rectangular block (cracky block for short) is a rectangular block together with an internal spanning forest induced by all the connected nodes inside the block. All the roots of the forest belong to the border of the cracky block, and the spanning forest connects every internal node to its root if and only if that node remains connected.
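To make the inside/border distinction concrete, here is a small Python sketch that classifies a node relative to a rectangular block given as ((l1, h1), (l2, h2)). The name `l` for the lower bound and the function name `classify` are assumptions, since the lower-bound symbol is not legible in this text. The sketch also assumes that "inside" means strictly between the bounds in both dimensions, and that every other node of the block lies on its border; the inequality as printed above may differ from this reading.

```python
from typing import Tuple

Node = Tuple[int, int]
Block = Tuple[Tuple[int, int], Tuple[int, int]]  # ((l1, h1), (l2, h2))


def classify(u: Node, block: Block) -> str:
    """Classify u as 'inside', 'border', or 'outside' of a rectangular block.

    Assumption: a node is inside when l_i < u_i < h_i in both dimensions,
    and on the border when it belongs to the block but is not inside.
    """
    (l1, h1), (l2, h2) = block
    x, y = u
    if not (l1 <= x <= h1 and l2 <= y <= h2):
        return "outside"
    if l1 < x < h1 and l2 < y < h2:
        return "inside"
    return "border"
```

For the block ((2, 5), (1, 4)), the node (3, 2) is classified as inside, (2, 3) and (5, 4) as border nodes, and (6, 2) as outside.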
Figure 3 presents two instances of the cracky block in a two-dimensional mesh, which are  and , respectively. is a general cracky block, while  is a cracky block which is induced by the faulty links on the boundary of the mesh, and it is an incomplete cracky block.

Construction of the Cracky Rectangular Block.
Each node's activities are based on a message-driven mechanism. There are two types of messages routed in the mesh. The first is the entity message (message for short), which is routed between any pair of nodes. The second is the system message, which can be sent only between neighbours and whose content mainly describes the status of the sender, such as the node's faulty degree and the detailed state of its faulty links. The former is the entity used for computing or communication, and the latter maintains the usability and robustness of the network; in other words, it is used to construct the cracky block when faults occur in the mesh, in order to avoid the livelock situation mentioned above.
In the beginning, all the nodes work well; that is, there is no faulty node or faulty link. Any node can send messages to and receive messages from any of its neighbours, according to the basic routing strategy. When some nodes or links fail, the failed nodes, or the nodes incident with failed links, immediately evaluate their current status and then send system messages containing that status to their connected neighbours as soon as possible, to tell them in detail what has happened. Once the neighbours receive the system messages, they at once evaluate their own current status from their latest status and the received messages. They notify their connected neighbours of their current status if and only if it differs from their previous status. Finally, a stable cracky block is constructed through this exchange of system messages.
Before presenting our distributed algorithm to construct the cracky block, we give some definitions. For a two-dimensional mesh ( 1 ,  2 ), the faulty degree of a node  = ( 1 ,  2 ) is the number of failed links incident with , and we denote it by (). From the observation of a cracky block, there are three types of nodes in a mesh network: faulty nodes, good nodes, and border nodes. A faulty node  belongs to the interior of a cracky block and has 0 ≤ () ≤ 4; conversely, a good node lies outside every cracky block and has () = 0. A border node belongs to the border of a cracky block and has 0 ≤ () ≤ 1. For example, in Figure 3, , , and  in the cracky blocks  and  are faulty nodes,  and  are border nodes, and any node outside of  and  is a good node. Given any node , let  * () be the set of neighbors  of  such that the link ⟨, ⟩ is not faulty. Moreover, we set an order in  * () as follows:  0 ,  1 , . . .,  | * ()|−1 . Let   be the node that sends a message to node , and let   be the node that will receive the message sent by .

Input: f(V): faulty degree of node V. Output: s(V): status of node V.

Table 1: Given a node V with f(V) = 1, judging its s(V) according to its incident failed links.

s(V)    f(V) = 1
{N}     The link ⟨V, (V)⟩ failed
{}      The link ⟨V, (V)⟩ failed
{}      The link ⟨V, (V)⟩ failed
{}      The link ⟨V, (V)⟩ failed

We denote by () the status of , and () is one of the elements of the status set  = {{}, {}, {}, {}, {, }, {, }, {, }, {, }}. The status of a node indicates which type of node it is. In addition, there are two more statuses, the empty set 0 and the universal set 1: () = 0 identifies a faulty node, () = 1 identifies a good node, and if () ∈ , then  must be a border node. For example, if () = {}, then the node  is located on the north border of a cracky block, like  in Figure 3, and if () = {, }, then  is at the northeast corner, just like . In total, a system message can be sent to four neighbours, and   will be , , , and  according to (), (), (), and (). A system message sent by node  is () = {  } ∪ (), which includes the destination neighbour and the sender's current status. For example, () = {, } means that this message will be sent to () and () = . We define an operation to implement the status judgment of a node that receives a new system message. The corresponding algorithm to update the status of any node in the mesh network is given by Algorithm 1.
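The worked example that accompanies Figure 4 combines a node's current status with each incoming value by set intersection, starting from the universal set 1. The following Python sketch assumes that intersection is the whole update rule, which the text does not state explicitly. The direction letters N, E, S, W, the names `GOOD`, `FAULTY`, `update_status`, `settle`, and `node_type`, and the concrete sets used in the example are all hypothetical; in particular, how the value to intersect is derived from a received system message is not recoverable from this text.

```python
from functools import reduce
from typing import FrozenSet, Iterable

Status = FrozenSet[str]

GOOD: Status = frozenset("NESW")  # universal set, written 1 in the text
FAULTY: Status = frozenset()      # empty set, written 0 in the text


def update_status(current: Status, received: Status) -> Status:
    """Assumed update rule: intersect the current status with the received set."""
    return current & received


def settle(initial: Status, received_sets: Iterable[Status]) -> Status:
    """Apply the update once per received set, in arrival order."""
    return reduce(update_status, received_sets, initial)


def node_type(status: Status) -> str:
    """Map a status to the node types named in the text."""
    if status == GOOD:
        return "good"
    if status == FAULTY:
        return "faulty"
    return "border"
```

Under these assumptions, a good node that receives {N, W} twice settles on {N, W} (a northwest corner), one that receives {W} becomes a west border node, and one that receives two disjoint sets such as {N} and {S} ends with the empty status of a faulty node. Because intersection can only shrink a status, repeated exchanges of this kind must eventually stop changing it.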
At the beginning of the construction, every node runs the procedure initial status to determine its status () according to its faulty degree (). After finishing this procedure, the node runs the procedure notice status to send system messages to its neighbours according to its status (). Once a neighbour receives a message of this type, it runs the procedure update status to refresh its status. This process is repeated until every node's status becomes stable. Finally, some cracky blocks emerge in the mesh. For example, Figure 4 shows a distributed process to construct a cracky block. We pick four nodes, , , , and , to describe how the algorithm performs. During the first phase, node  initializes its status to () = 0 because () = 2; meanwhile () = 1, () = 1, and () = 1 as a result of () = 0, () = 0, and () = 0, respectively. In the second phase, according to the algorithm, only node  sends system messages to its connected neighbours, which are nodes  and . Finally, node  receives two system messages and refreshes its status by () = (1 ∩ {, }) ∩ {, } = {, }, so it becomes the northwest corner of a cracky block for the moment. For nodes  and , () = 1 ∩ {} = {} and () = (1 ∩ {}) ∩ {} = 0, so they are a west border node and a faulty node, respectively. When the algorithm stops, the border of the cracky block is as shown in Figure 4 by the bold line. The macroscopic construction of a block thus depends on the microscopic, distributed message exchange among the relevant nodes.

    else if V is a faulty node then
        (V) ← 
    end if
      ∈  * (V)
    check (  ) for all   until ∃ ∈  * (V) s.t. () = ℎ
    (V) ← ℎ
    pred(V) ← 
end procedure
Algorithm 2: Let V be a node in a cracky rectangular block; make it hung when possible.

Then we construct the spanning forest for the faulty nodes. We say that a faulty node is hung if and only if it chooses exactly one neighbor as its predecessor. Denote by pred(V) the predecessor of V, and let Succ(V) = {V  ∈ Γ(V) : V  is a faulty node and V = pred(V  )}. We consider an order over the elements of Succ(V) and denote by succ  (V) the th element of Succ(V), with 1 ≤  ≤  V = |Succ(V)|. A node V is said to be final if Succ(V) = 0. A node which is not hung is free. We denote by (V) the state of a node inside the block, which takes one of the two values {ℎ, }, standing for hung and free. After running Algorithm 2, the spanning forest for the block is complete; like  and  in Figure 3, every node inside blocks  and  finds exactly one predecessor and is marked as hung.
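The hanging rule can be simulated centrally to see what forest it produces. The sketch below is an illustration under stated assumptions rather than the distributed Algorithm 2 itself: border nodes are treated as roots that are hung from the start, an interior node becomes hung as soon as one neighbour reachable over a non-faulty link is hung, and nodes are processed in breadth-first order. The function name `build_spanning_forest` and the representation of faulty links as unordered pairs are hypothetical.

```python
from collections import deque
from typing import Dict, FrozenSet, Set, Tuple

Node = Tuple[int, int]
Link = FrozenSet[Node]


def build_spanning_forest(border: Set[Node], interior: Set[Node],
                          faulty_links: Set[Link]) -> Dict[Node, Node]:
    """Return pred for every interior node that can be hung.

    Interior nodes cut off from the border by faulty links stay free
    and therefore do not appear in the result.
    """
    pred: Dict[Node, Node] = {}
    hung: Set[Node] = set(border)      # roots of the forest
    queue = deque(sorted(border))      # deterministic processing order
    while queue:
        w = queue.popleft()
        x, y = w
        for v in ((x, y + 1), (x + 1, y), (x, y - 1), (x - 1, y)):
            if (v in interior and v not in hung
                    and frozenset((v, w)) not in faulty_links):
                pred[v] = w            # v chooses w as its only predecessor
                hung.add(v)
                queue.append(v)
    return pred
```

The successor sets follow from the result: the successors of V are the nodes whose predecessor is V, and a node with no successors is final. Each hung interior node has exactly one predecessor, so following predecessors from any hung node ends at a border node.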

The Cracky Rectangular Block Is Stable.
A node V is said to be stable if pred(V)'s status can never change to a free status, during the running time of the algorithm.In particular the nodes of the cracky block are stable.A cracky block  is said to be stable if all the nodes belonging to  are stable.
To prove that the cracky block is stable, we assume that there exists a set  of nonstable nodes.If  = {V}, then V is free and can never become stable during the running time of Algorithm 2, so in this case there is no stable node in the neighborhood of V because otherwise V will choose this node as predecessor.Using the same argument for each node  of  * (V), there is no stable node in the neighborhood of .But since the graph is connected, V is necessarily joined by a path to a node  of the border of the block.So we are in contradiction with the fact that the nodes of the border are stable.If || ≥ 2, then since the graph is connected there exists at least one node  in  which is adjacent to a node V ∉ .Clearly, V is stable.From the second loop of the algorithm and since there is an order in the neighbors of  for the choice of its predecessor, there exists a step in the algorithm which leads the node  to choose V as predecessor.After this step, let  ←  \ {}.Using the same arguments, after some steps of the algorithm, the set  would be empty.So all the nodes are stable.

Adaptive Routing with the Cracky Rectangular Blocks.
In this section, we give the global fault-tolerant routing strategy. Primarily, once a message encounters a cracky block, the message bypasses the block, which encloses the faulty nodes/links, along its border nodes in a clockwise (or counterclockwise) manner. In particular, while it bypasses the cracky block, the message traverses by depth-first search each interior spanning tree rooted at a border node. Finally, the message leaves the cracky block from the corner nearest to the destination and continues with the basic routing function; otherwise, the message is absorbed by an interior node, which must be the destination node.

Input: V: node V routing messages, V  : the node who sends messages to node V. Output: V  : the node who will receive the messages.
(1) procedure Routing(V)
(2)   if V is a good node then
(3)     basic routing function with node V
(4)   else if V  = pred(V) then
(10)      V  ← succ 1 (V)
(11)    end if
(12)  else if V is a border node then
(13)    routing according to Table 2.
(14)  else if V is a node that belongs to the border of mesh then
(15)    if V is final then
(16)      V  ← V 
(17)    else
(18)      V  ← succ 1 (V)
(19)      succ 1 (V) ← V 
(20)    end if
(21)  end if
(22) end procedure
Algorithm 3: The novel routing algorithm based on cracky rectangular blocks.

We now give the complete local routing function run in each node V of ( 1 ,  2 ), as shown in Algorithm 3. This algorithm is based on the basic routing function defined in Section 2.
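The depth-first traversal of one interior tree can be expressed as a purely local rule that uses only pred and the ordered successor list, in the spirit of the predecessor/successor cases of Algorithm 3. Since several lines of Algorithm 3 are not legible in this text, the rule below is an assumed, standard token-passing depth-first walk and not a transcription of the authors' algorithm; the names `next_hop` and `dfs_walk` and the dictionary representation are hypothetical.

```python
from typing import Dict, Hashable, List, Optional

NodeId = Hashable


def next_hop(v: NodeId, sender: Optional[NodeId],
             pred: Dict[NodeId, NodeId],
             succ: Dict[NodeId, List[NodeId]]) -> Optional[NodeId]:
    """Local rule: decide where v forwards a message received from sender."""
    children = succ.get(v, [])
    if not children:
        return sender                  # a final node bounces the message back
    if sender == pred.get(v):
        return children[0]             # coming from above: go to first successor
    i = children.index(sender)         # coming back from the i-th successor
    if i + 1 < len(children):
        return children[i + 1]         # continue with the next successor
    return pred.get(v)                 # all successors visited: go back up


def dfs_walk(root: NodeId, pred: Dict[NodeId, NodeId],
             succ: Dict[NodeId, List[NodeId]]) -> List[NodeId]:
    """Follow next_hop from the root until the message leaves the tree."""
    path = [root]
    sender: Optional[NodeId] = None    # the root has no predecessor in the tree
    v: NodeId = root
    while True:
        nxt = next_hop(v, sender, pred, succ)
        if nxt is None:
            return path
        path.append(nxt)
        sender, v = v, nxt
```

For a root r with successors a and b, where a has a single successor c, the walk is r, a, c, a, r, b, r. Every node of the tree is visited, which is the property needed for a message whose destination lies inside the block to be absorbed there.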
The cracky rectangular block and the adaptive fault-tolerant algorithm together make up the fault-tolerant strategy, and Algorithm 3 can be used to send a message from any connected node to an arbitrary connected node. For example, in Figure 3, the good node  wants to send a message to the node , but node  is a faulty node located in the interior of the cracky block ; the algorithm sends this message along the path shown in the figure. The faulty node  sending a message to another faulty node  can also be accomplished by the algorithm. If a good node   wants to communicate with another good node , the routing path will be as depicted in the figure.

Self-Adaptive and Faulty Boundary Independency of Cracky Rectangular Block.
For high performance and usability, the cracky blocks should be self-adaptive. As we know, the emergence of cracky blocks in a mesh is the result of nodes managing themselves in a distributed and independent manner. The status of a node is closely related to that of its neighbours. Therefore the size and shape of a block change dynamically according to the faulty nodes. In other words, if some of the faulty nodes are repaired, the original block may become smaller or split into several smaller blocks. Conversely, if some good nodes or links fail, new cracky blocks appear or some of the original cracky blocks grow.
Given a two-dimensional mesh ( 1 ,  2 ), let V(V 1 , V 2 ) be a faulty node in cracky block (( 1 , ℎ 1 ), ( 2 , ℎ 2 )). Let (  , V 1 ) with  1 ≤   ≤ ℎ 1 , and let (V 2 ,   ) with  2 ≤   ≤ ℎ 2 . When V has been repaired such that (V) = 0, we should check whether () = 0 (resp., () = 0). If so, then  (resp., ) may be removed from the block , and  will become at most four smaller blocks. In addition, these new cracky blocks remain stable. The removed row or column may become part of the new border of those new cracky blocks, or its nodes may become good nodes outside any block, so they remain hung, and their successors also remain hung. To implement the above, when a node and its incident links are repaired, we simply send a recovery signal to its four neighbours so that they rerun the procedure initial status in Algorithm 1. Recursively, the recovery signal is forwarded to the nodes connected to the faulty nodes that received the signal, until it meets a good node outside the cracky block.
For example, Figure 5(a) shows a cracky block, and ,  are two faulty nodes with () = 2, () = 1, and () = 0. When the two nodes  and  have been repaired, they both change to good nodes with () = () = 0. The cracky block will become like Figure 5(b), and , , and  become the new border, and the cracky block remains stable (note that we do not give the details because of the page limitation).
Our adaptive fault-tolerant routing strategy is independent of faulty boundaries; that is, if a fault occurs on the boundary of the mesh, the strategy still works. Lines 14 to 21 in Algorithm 3 give a solution to this situation. When some of the boundary nodes of the mesh have failed, the corresponding cracky block is constructed like  in Figure 3. As shown, it is an incomplete cracky block. If   wants to send a message to , the message first goes to node ℎ according to the basic routing function. Because ℎ is a border node of , the message is sent to , which is a boundary node of the mesh. After the message has traversed all the successors of , it is bounced back to the node ℎ to continue routing and finally finds its destination. To sum up, because of this rebound function, the message continues routing when it encounters a mesh boundary.

The Cracky Rectangular Model Is Creditable
The next two propositions show that, with the above algorithm, each message will reach its destination: if a message arrives in a node of a cracky block , then (i) if its destination is in , it will reach it; (ii) if its destination is outside , it will leave  closer to its destination than before. We claim that, in the following routings, except in the special case that  moves along a quasi-Hamiltonian cycle of a cracky block, the routing of  never increases the -dimension distance.
To prove the claim, it is clear that by the routing algorithm, if  do not meet a cracky block, it cannot move from a node  = (

Figure 1: An example of disconnected rectangular blocks. Note that nonfaulty nodes, such as nodes  and , in a block will never be used in routing again.

Figure 2: (a) Livelock situation in ( 1 ,  2 ): node  tries to send a message  to node , but  is in a livelock induced by nodes , , and  caused by the faulty links. (b) The cracky rectangular block solves the livelock problem, and the message  is sent to its destination along the node sequence , , , , , , , ℎ, , , , , and .

Figure 4: Construction of a cracky block depends on system message exchange. The dashed lines represent faulty links, and the bold line forms the border of a cracky block. The arrows denote system messages.
Figure 5: A sample of a self-adaptive block. (b) Cracky block after nodes , , and  are fixed.
Figure 3:  and  are two cracky rectangular blocks.  is a general cracky block, while  is a cracky block induced by faulty links on the boundary of the mesh, and it is an incomplete cracky block. Nodes , , and  in the cracky blocks  and  are faulty nodes,  and  are border nodes, and any node outside of  and  is a good node. With the cracky rectangular block, not only does the good node   stay connected with another good node  as usual, but the good nodes ,   can also send messages to the faulty nodes , , respectively; moreover, the faulty nodes can communicate with each other, such as nodes  and .

Proposition 1.
Consider a cracky block . By using Algorithm 3, if a message  has its destination  in  and if it arrives in a node of the cracky block, then it will reach its destination.
Proof. Consider a subgraph  of the mesh induced by the nodes of a cracky block  = (( 1 , ℎ 1 ), ( 2 , ℎ 2 )). By definition of Algorithm 3, a message moving on  follows a circuit crossing each node at least once. Consider a message  moving in the mesh, with destination ( 1 ,  2 ) ∈ (), reaching a node   ∈ (). By definition of the routing function, since  1 ≤  1 ≤ ℎ 1 and  2 ≤  2 ≤ ℎ 2 ,  will never want to take an arc out of . Thus, it follows the circuit of  induced by the algorithm. So,  will arrive in node .

Proposition 2.
Consider a cracky block . By using the algorithm, if a message  has its destination  outside  and if it arrives in a node of the cracky block, then it will leave  and be closer to its destination than before.
Proof. Let () be a message with destination ( 1 ,  2 ). Suppose that () moves from a node ( 1 ,  2 ) to another node   = (  1 ,   2 ) by the dimension  according to the routing algorithm; that is, |  −  | = max{|  −  | : 1 ≤  ≤ 2} and |  −   | − 1 = |   −   |.