Appendix B

Appendix B A Hardware Implementation

It is a tradition of our laboratory to implement our designs as actual hardware prototypes, rather than limiting ourselves to software simulations. In many cases, this approach has led us to identify problems and possible improvements which were not apparent in simulation. Such was the case, for example, for the MicTree prototype described in subsection 2.2.4, which revealed the limitation of fixed-size cells. To a lesser extent, designing a prototype of our FPGA also revealed some ways to improve on our self-replication and self-repair mechanisms.

In this appendix, we will describe the logic layout of our MuxTreeSR (for MuxTree with self-repair) prototype (section B.1), and then cursorily describe the MuxNet (for network of MuxTreeSR elements) circuit (section B.2), an attempt to implement a 4x4 array of MuxTreeSR elements into a single commercial FPGA.

B.1 MuxTreeSR

In this section, we will describe in detail our prototype of the FPGA we developed in this thesis. After a brief overview (subsection B.1.1) of the system, we will present the logic layout of our circuit following a top-down approach: starting from the cellular automaton used for replication (subsection B.1.2), through the logic required for rerouting the array in the presence of faulty elements (subsection B.1.3) and the state machine which control the array's operation (subsection B.1.4), to the actual core of the MuxTreeSR element, consisting, as we have seen, of a programmable function (subsection B.1.5), a switch block for long-distance connections (subsection B.1.6), and a configuration register (subsection B.1.7).

B.1.1 Overview

Our laboratory's teaching activities include a number of laboratory sessions aimed at introducing the principles of logic design. In order to accomplish this task, we use a set of logidules(logic modules), plastic cubes which contain standard logic circuits (AND gates, RAMs, etc.). These modules can be connected together (not unlike a puzzle), automatically providing the circuits' power supply as well as the minimal connections between modules.

Aside from being an invaluable teaching tool, the logidules are also an interesting platform for the development of prototypes, as they can be easily exploited for the support circuitry which invariably surrounds any prototype. It is therefore not surprising that we decided to exploit the same approach in the design of the prototypes related to the Embryonics project, which we call Biodules (for biologically-inspired logic modules). The prototype of MuxTreeSR, the FPGA presented in this thesis, is no exception, and was embedded in a module called Biodule 603 (Fig. B-1).

Each Biodule 603 is designed to contain a single MuxTreeSR element, along with four copies of the cellular automaton used to define the blocks' membranes (the four copies were required in order to be able to use all the Biodules, as explained below in subsection B.1.2). The Biodule 603's displays include:

four 7-segment displays (in the four corners) which show the state of the corresponding CA elements;
a set of five 7-segment displays which show the value of the element's configuration;
a set of four LEDs (top and bottom) used to indicate the source of the signals (in case of reconfiguration);
two sets of LEDs which show the value of the functional outputs and the input and output values of the flip-flops of each of the two copies of the element's programmable function;
a single LED which lights up whenever a fault is detected in the element;
a three-color LED which denotes the state of the element (green for active, red for dead, yellow for spare or during the configuration and repair processes).

In addition to the displays, each Biodule 603 also contains a rotary encoder used to select one of ten possible faults to introduce in the element, and a push-button which activates the selected fault: a single push will introduce a temporary fault (which will disappear whenever the FPGA is reset), while a double push will introduce a permanent fault.

Invisible to the user, the Biodule 603 contains a small printed circuit board on which all the required circuitry is mounted. Aside from the displays mentioned above and the components (such as resistor nets) required for their control, the only "active" circuits are a Xilinx XC4013PQ240-4 FPGA [111], used to implement the MuxTreeSR element, and a 256k-bit EEPROM used to store the Xilinx's configuration bitstream.

Using a Xilinx FPGA with a locally-stored configuration to implement our elements allows us the opportunity of upgrading our design at a latter date without needing to build a different set of Biodules. In fact, the amount of programmable logic offered by the Xilinx chip is much greater than that required to implement our MuxTreeSR elements: our design uses approximately 15% of the available resources, a figure which includes all the circuitry required to control the displays. On the other hand, the XC4013 is the smallest Xilinx circuit available with enough I/O pins to satisfy our requirements: our design occupies only a fraction of the programmable logic, but uses 100% of the available pins.

Like all logidules, the Biodules 603 can be joined together to form a two-dimensional array (Fig. B-2), with the connections implemented either through the automatic contacts on the perimeter of the box (8 per side) or through four 16-bit-wide connectors. As of now, 18 modules are available, allowing us to build a 6x3 array of MuxTreeSR elements.

To operate the array, the following signals must be provided externally:

a configuration bitstream, including the input sequence for the cellular automaton, the test pattern, and the blocks' configuration;
two clocks, a faster one for the configuration, and a slower one for the operation of the array;
a functional reset, which resets the elements' flip-flops to the default value defined in their configuration;
a global reset, which resets the current configuration of the MuxTreeSR array;
a power reset, which resets the configuration of the Xilinx chips.

B.1.2 The CA Level

The topmost level in the logic layout of our element is, of course, the interface between the I/Os of our element and the pins of the Xilinx chip. With a few exceptions (e.g., the encoding/decoding of the signals controlling the 7-segment displays), this relationship is one-to-one (i.e., a pin for each I/O line), and is therefore of no particular interest.

The topmost relevantlevel in our layout is therefore the cellular automaton which defines the size of our blocks (Fig. B-3). As we have seen, the CA elements are placed outside of the MuxTreeSR elements (Fig. 3.29), and in theory a single CA element per Biodule would be sufficient to realize our system. However, since each Biodule contains a single MuxTreeSR element, and since the CA membrane must completely surround the active logic elements, we were forced, in order to avoid "wasted" modules, to introduce four CA elements into each Biodule. This redundancy, which is not crucial since we use only a fraction of the Xilinx's programmable logic, allows us to obtain an usable logic element for every available Biodule.

Our first implementation of the cellular automaton was functionally equivalent to the automaton described in section 3.5, which allowed a fairly trivial transition to hardware. We soon realized, however, that the limitations imposed by this automaton (notably, the requirement that blocks be perfect squares) were too restrictive, and we improved its functionality to that described in subsection 4.3.3. The new automaton allows both the definition of rectangular blocks and the assignment of spare columns of elements. Unfortunately, this improved functionality comes at the cost of augmented complexity, and the hardware implementation of the new automaton is far from trivial.

As we have seen in section A.2, the new automaton uses 23 states and a neighborhood of 5, with a total of 207 transition rules. A trivial implementation of this automaton would require a look-up table of 23⁵ 5-bit words, a 5-bit register to hold the present state, and eight sets of 5-bit-wide connections with the neighboring elements (four input and four output busses). Even if it were possible to reduce the size of the look-up table (since the automaton uses a fraction of all possible transition rules), such an approach is obviously out of the question, if not for the size of the register (a 5-bit register might be barely acceptable), at least for the size of the busses.

Fortunately, it is possible to enormously simplify the layout of the elements through two techniques:

serializing the connections, which reduces the size of the busses to a single line, and
hard-wiring the transition rules, which does away with the need for a look-up table.

These techniques, while not applicable to all cellular automata, are extremely useful for our automaton which, as we mentioned, was designed for exactly this kind of hardware implementation.

The resulting circuit (Fig. B-4) consists of four D-type flip-flops, a few logic gates, and four 1-bit-wide connections (inputs from the south and west neighbors, outputs to the north and east neighbors). Its operation, because of the hard-wired transitions, is not apparent at first glance, and requires some explaining.

The configuration stream for the automaton (Fig. 3.29) is stored in an EEPROM (Electrically-Erasable Programmable Read-Only Memory) outside of the array, and enters the array from the lower left corner element. The four possible states of the automaton are encoded using three bits:

Table B-1: Encoding of the states of the automaton.
State	Encoding
Horizontal wall	001
Spare column	011
Vertical wall	101
Junction	111

The states enter the array left-bit-first, so that the first bit of each state is always a 1. Each element receives information indifferently from the south or the west inputs. As long as the element receives only 0s, nothing happens, and the automaton remains in the quiescent state 0000. As the first 1arrives, it is stored in the first flip-flop ( Q3), and shifted from there to the second ( Q2) and the third ( Q1). At this point, when Q1=1 (since the first bit of an incoming state is always 1) and Q0=0 (since the element has not yet been configured), the automaton makes a decision:

if Q3=0(that is, if the incoming state is either a horizontal wall or a spare column) and the data is coming in from the south and not from the west, then the incoming state is ignored and the flip-flops reset, since horizontal walls and spare columns should only propagate horizontally;
if Q3=1and Q2=0(that is, if the incoming state is a vertical wall) and the data is coming in from the west and not from the south, then the incoming state is ignored and the flip-flops reset, since vertical walls should only propagate vertically;
in all other cases, the flip-flops are not reset and the incoming state is allowed to shift once more, so that Q0=1 and Q1and Q2uniquely identify the state of the element.

Once the element is configured (Q0=1), the shift chain is broken: Q2, Q1, and Q0remain fixed, thus setting the state of the element, while the incoming stream passes through Q3 and is propagated to the north (if the state is a vertical wall or a junction) and/or to the east (if the state is a horizontal wall, a spare column or a junction). The state itself is then decoded to provide the required signals which are used to redirect the flow of the FPGA's configuration.

This small circuit allows the user to very simply define rectangular membranes of any given size and to program the frequency and placement of the spare columns, and is thus functionally equivalent to the automaton described in subsection 4.3.3.

B.1.3 The Self-Repair Level

Immediately below the cellular automaton level we find a level dedicated to the rerouting of the connections for self-repair (Fig. B-5).

As we have seen in chapter 4, our self-repair mechanism reconfigures the array so as to avoid using the faulty elements, and reroutes the connections around them. The minimal set of connections that need to be rerouted to implement the pattern shown in Fig. 4.17 consists of:

on the north side, the input and output long-distance busses NIBUSand NOBUS, the short-distance output NO, and the input and output configuration lines C_NIand C_NO (that is, the lines used to propagate the configuration through all the elements in a block);
on the south side, the input and output long-distance busses SIBUSand SOBUS, the short-distance input SI, and the input and output configuration lines C_SIand C_SO;
on the west side, if the element is active the output long-distance bus WOBUSreceives the corresponding output of the internal switch block and the short-distance output WOreceives the short-distance input SI, while if the element is dead, WOBUS receives the input long distance bus EIBUSand WOreceives the short-distance input EI;
on the east side, if the element is active the output long-distance bus EOBUSreceives the corresponding output of the internal switch block, the short-distance output EOreceives the short-distance input SI, and the output configuration line C_EO receives the signal generated within the element, while if the element is dead, EOBUSreceives the input long distance bus WIBUS, EOreceives the short-distance input WI, and C_EO receives the input configuration line I_C_WI;

In addition, we put at this level one extra multiplexer whose function is to connect the global configuration line GL_C_O to the west input configuration line I_C_WIin entry point elements. Placing the multiplexer at this level is not strictly necessary, but simplifies the layout.

B.1.4 The Control Logic

Figure B-6 shows the core of the MuxTreeSR element. The three main subcircuits described in chapter 4 (the programmable function, the switch block, and the configuration register) are clearly visible. Also visible is a fourth subcircuit, labeled STATE, which has not so far been described.

In this subcircuit (Fig. B-7), we collected all of the control logic required to handle all the phases of the operation of our array. Its size, while consequent, is not quite as important as it might appear, since it also includes a considerable amount of logic required for the operation of the Biodule (logic which would then disappear in a VLSI implementation). Moreover, some of the signals generated in this subcircuit are required not by the system itself, but rather by the use of a Xilinx FPGA, which is somewhat limited in its versatility and has some peculiar requirements which complicate the design of our control system (for example, it has difficulty handling bidirectional busses crossing multiple chips, a non-trivial problem where our global control lines are concerned).

The heart of the control logic is a state machine, implementing the state graph of Fig. B-8. Each element of the array can be in one of 7 possible states, and the transitions are controlled by two signals:

A global hold signal G_H_I. This signal is used to "freeze" the execution of the circuit. It can be seen as a single global line which is usually pulled to 0by a pull-down resistor outside the array. Any element can pull it to 1by setting the G_H_O signal. It is set to 1while the register is being tested, while the configuration bitstream is propagating, and while the reconfiguration mechanism is active.
An internal fault detection signal F_SPOT. This signal is set to 1if and only if a fault is detected inside the element itself.

The seven states operate as follows:

S_TST( 000) is the initial state. In this state, the element is waiting for the test pattern ( subsection 4.2.5 ) to come in and check the register, and imposing a 1on G_H_O. If a fault is spotted during this test, the element immediately dies ( DEAD). If, on the other hand, no fault is spotted, the global hold signal is released. As soon as all the elements have released the hold signal ( G_H_I=0), the element starts to wait for the configuration to arrive ( S_CFG).
S_DIE( 001) is the state acquired by the faulty element while the repair process is occurring. As soon as the reconfiguration is complete ( G_H_I=0), the element dies ( DEAD).
S_FLT( 010) is a temporary state, introducing a one-clock-cycle delay between the detection of a fault and the beginning of the repair process. It is not in fact required for the operation of the array, but is a consequence of certain limitations of the Xilinx chip, and should disappear in a VLSI implementation.
S_REP( 100) is the state acquired by all non-faulty elements during the repair process. The elements go back to the active state ( S_ACT) as soon as the reconfiguration is finished.
S_CFG( 101) is the configuration state. Elements remain in this state while they are waiting for the array to be configured. In this state, they pull the global hold line to 1 as long as their configuration register is not full. Once configured, they release the global hold line. Once the global hold line goes to 0 (i.e., when all the elements have been configured), the array begins operation and all elements go to state S_ACT.
S_ACT( 110) is the active state. All elements, once configured, remain in this state and execute their function until the circuit is reset or the repair mechanism is activated.
DEAD( 111) is the state of faulty elements after the reconfiguration is complete. Elements in this state are not part of the operation of the array, and the only exit from this state is through a global reset of the FPGA.

Among the other signals generated in this subcircuit, we will mention:

G_K_O, a semi-global signal which propagates the KILL command (subsection 4.3.5) throughout a column of blocks;
R_COL_O, which propagates downwards the information that the column contains spare elements;
R_C_EO, which is set to 1if the current element is or has been involved in the repair process;
R_C_WO, which is set to 1 to signal the completion of the repair process;
V_DEVand H_DEV, which control the rerouting of the connections;
REG_EN, which enables the shifting of the array, either during the configuration phase or during the repair process;
REG_RST, which resets the configuration register.

Apart from the above, the subcircuit also contains some intermediary control signals and some control logic for the 7-segment displays in the Biodule.

B.1.5 The Programmable Function

The subcircuit implementing the programmable function (Fig. B-9) is a straightforward implementation of the layout shown in Fig. 4.14. It consists of two copies of the original function plus some test logic.

The left (Fig.B-10) and the right (Fig.B-11) copies are identical, except for the different positioning of the faults. The layout corresponds exactly to the one shown in Fig. 4-2, with the exception of a slightly modified input selector, which nevertheless performs the same function. Note the relatively complex control logic for the flip flop, required to allow it to be set to a default value stored in the configuration ( FF_DEF) and to be chained with the configuration register ( FF_CFG ).

As far as the test logic is concerned (Fig. B-12), it is a direct implementation of the circuit described in subsection 4.2.3, with the comparator for fault detection, the third copy of the flip-flop, and the 2-out-of-3 majority function FF_OUT.

B.1.6 The Switch Block

There is not much to say about the switch block (Fig. B-13), which is a straightforward implementation of the circuit described in subsection 4.1.2. It allows any one of the four output busses to carry the signal coming from the three other input busses or the output N_OUTof the element.

B.1.7 The Configuration Register

The hardware implementation of our configuration register led us to one relatively important modification to the propagation mechanism for the bitstream. This modification (which in no way alters our overall approach) concerns the mechanism which fills the registers with the appropriate value.

In the mechanism we described in chapter 4, the registers within a block were chained together to form one long shift chain. This approach, while perfectly feasible, requires however a dedicated global line to signal the end of the block's configuration, as well as some relatively important amount of additional logic to detect such an event.

In order to avoid this penalty, we altered our mechanism so as to fill the registers in order (Fig. B-14): rather than filling all registers in series, we fill the first, then the second, and so on until the last in the block. The new mechanism does away with the need to detect and propagate the end of the configuration at the expense of an additional bit in the configuration register. In fact, with this approach we require a way to detect that a register has been completely filled, so as to proceed to the next. We found that the least expensive way to obtain such behavior was to set the first (head) bit of the configuration of each element to 1: the end of the configuration is then signaled by the arrival of the 1to the head of the register.

Aside from this modification, the implementation of our configuration register (Fig. B-15) is identical to the design introduced in chapter 4. The register, which in this implementation is 24-bit wide⁴⁰, receives its configuration from one of three possible sources:

if the array is expecting the test sequence of subsection 4.2.5 to arrive, all the registers will receive it in parallel from the global configuration line;
if the element is on the bottom row of a block, or if the element is undergoing reconfiguration, the input will come from the west input line⁴¹;
if none of the above conditions is true, than the register will receive its configuration from the south input line.

Once the register is full ( C0=1), the propagation path is fixed according to the patterns of Fig. 4.5:

the output line to the north propagates the input line from the south unless the element is in the bottom row of the block, in which case it propagates the input from the west, or an entry point, in which case it propagates the global configuration line;
the output line to the south propagates the input line from the north unless the element is in the top row of the block, in which case it propagates the input from the south input line;
during configuration, the output line to the east propagates the input line from the north in all cases, while during self-repair it is used to shift the configuration of an element to its east neighbor.

Finally, the register subcircuit of Fig. B-15 contains the test logic described in subsection 4.2.5, slightly augmented to handle some Xilinx-related issues.

B.2 MuxNet

As we have seen, the Biodule 603 is the main hardware prototype we realized in order to test the validity of our mechanisms. However, the size of the circuit required to implement a single MuxTreeSR element prevents us from realizing systems which require more than a few logic elements. In the long term, we hope to overcome this difficulty through the realization of a dedicated VLSI circuit which will contain an important number of elements. In the short term, however, such a solution is not available to us.

To obtain a larger number of programmable elements, we investigated the possibility of exploiting a system based on an array of Xilinx FPGAs mounted on a single printed circuit board and configured so as to implement an array of MuxTreeSR elements. Such a system, while far from allowing the same density as a VLSI chip, would nevertheless allow us to obtain a much larger number of elements than an array of Biodules, particularly if we were not limited to a single MuxTreeSR element for each Xilinx chip.

The first step in the design of this system was therefore an analysis of the number of MuxTreeSR elements we can fit into a single Xilinx FPGA. To this end, we defined a layout consisting of a 4x4 array of our logic elements (Fig. B-16). Without attempting major optimizations in the layout of the elements⁴², we removed the logic dedicated exclusively to the Biodule displays, and tried to determine the smallest Xilinx FPGA capable of containing the whole array. Running our design through the Xilinx routing software, we determined that the smallest FPGA which can hold the entire array is a XC4025HQ240. A system based on an array of such chips could thus allow us to obtain a fairly large array of MuxTreeSR elements, and indeed would be an interesting intermediate step in the creation of our VLSI circuit, likely to be realized as soon as we arrive at a quasi-definitive version of our FPGA.

40.The extra bits are unused in the current implementation, but were introduced in view of possible future alterations in the design.

41.Note that the special case of the entry points was taken care of at the self-repair level (subsection B.1.3).

42.The optimal solution to this kind of design would be to create a "hard macro" of a MuxTreeSR element, that is, define by hand the configuration of the circuit. Such a solution, however, would be extremely time-consuming.