A Standard Cell Approach for MagnetoElastic NML Circuits

D. Giri*, M. Vacca†, G. Causapruno*, Wenjing Rao†, M. Graziano*, M. Zamboni*
* Politecnico di Torino, Department of Electronics and Telecommunications, Corso Duca degli Abruzzi, 24, 10129 Torino, Italy
† University of Illinois at Chicago, Electrical and Computer Engineering Department, 851 S. Morgan St, Chicago (IL)

Email: {marco.vacca, giovanni.causapruno, mariagrazia.graziano, maurizio.zamboni}@polito.it, wenjing@uic.edu

Abstract—Among emerging technologies, Quantum dot Cellular Automata (QCA) plays a fundamental role. Its magnetic version, normally called NanoMagnet Logic (NML), is particularly interesting thanks to its ability to work at room temperature and to combine logic and memory in the same device. Magnetic circuits also have the potential for very low power consumption. Unfortunately, classic NML circuits are normally driven (clocked) by a current that generates a clocked magnetic field, which nullifies the possibility of actually obtaining low power circuits.

We have recently developed a technology-friendly solution, the MagnetoElastic NML (ME-NML), where magnetic circuits are driven through an electric field, and not with a current, drastically reducing the power consumption. In this paper we start to explore the architectural consequences of this new magnetic technology. The analysis is performed using as a benchmark a Galois multiplier, a systolic architecture particularly suited for QCA and NML technologies. The layout is precisely described and the resulting circuit is modeled and simulated in VHDL. The obtained results are remarkable. The circuit area is reduced by a factor of 4 compared to the classic NML approach. This, coupled with the intrinsically lower power consumption of the different clock, leads to a 50-fold reduction in power consumption. Moreover, the particular structure of magnetoelastic NML makes it possible to define a library of standard cells that can be easily used by designers and automatic layout tools to design circuits, greatly helping future research in this field.

Index Terms—NanoMagnet Logic, MagnetoElastic Effect, Low Power Circuits, Galois Field Multiplier

I. INTRODUCTION

Among emerging technologies Quantum dot Cellular Automata (QCA) [1] has drawn in recent years a considerable amount of attention. Its magnetic implementation, NanoMagnet Logic (NML) [2], is particularly interesting because it offers unique features unavailable in current CMOS technology. The basic unit is a single domain nanomagnet. Thanks to its rectangular shape and sizes smaller than 100 nm, only two stable states are possible (Fig. 1.A) and they can be used to represent logic values [2]. Since the basic cell is a magnet, NML couples logic and memory in the same device [3]. Moreover it is one of the few emerging technologies that is feasible with current technological processes [2] and works at room temperature. These unique features make NML one of the most attractive technologies, alternative to classic CMOS transistors.

The distinctive characteristic of NML (and QCA) technology is the necessity to use a clock mechanism to successfully switch cells from one logic state to the other. Circuits are created by placing magnets on a plane, as shown in Fig. 1.B. Theoretically information should propagate through the circuit thanks to the magnetic interaction among neighboring magnets, but this interaction alone is not sufficient. Magnets must be forced into an unstable state through an external means, such as a magnetic field, which lowers the barrier between the two stable states [4]. When the clock field is removed, magnets are free to switch according to the input element, thereby propagating the information. Another limitation is that, due to thermal noise [5], only a limited number of elements can be cascaded, otherwise the error probability in information propagation increases greatly. For this reason a multiphase clock system is adopted. Circuits are divided into areas, called clock zones, each including a limited number of magnets (Fig. 1.B). Different clock signals are applied to different clock zones. In [6] a 3-phase clock system is adopted, while in Fig. 1.C a 4-phase system is depicted. The signals are identical, just shifted by 90°. Thanks to the multiphase clock system, the magnets of a clock zone switch according to neighboring magnets that are in a stable (HOLD) state. Magnets in the RESET state have no influence on signal propagation.

The clock is one of the most important drawbacks of NML technology. Aside from the magnetic field clock [2], other mechanisms have been developed, such as an STT-current clock [7]. Both of these solutions use a current and therefore lead to a high power consumption. Recently we have developed an innovative solution based on an electric field instead of a magnetic field, the magnetoelastic clock [8]. This solution is similar to the one presented in [9] but is technology-friendly, and it reaches a very low power consumption even when all power losses in the clock generation network are considered. One of the positive side effects of our clock solution is that it leads to a limited number of possible basic structures, which define a set of Standard Cells. This predefined set of cells can be easily used to design circuits both with custom layout and with automated tools [10], greatly helping the development of NML technology. In this paper we propose a first analysis of the implications of the magnetoelastic clock at the circuit layout level. The analysis is performed using as a benchmark a Galois multiplier, a systolic architecture particularly suited for NML (and QCA) technology. The results that we present here show that this clock solution allows for much more compact layouts, greatly reducing both circuit area and power consumption compared to magnetic field based NML.

II. MAGNETOELASTIC CLOCK

If a magnetic field is used as the clock mechanism, it can be generated by a current flowing through a wire placed under the magnet plane. Fig. 2.A shows the clock generation network. The current flowing through the wire generates a magnetic field parallel to the short side of the magnets, forcing them into the RESET state. A ferrite yoke surrounds the wire, providing better confinement of the magnetic flux lines. This clock solution gives circuits a peculiar structure, where clock zones are made of parallel stripes (Fig. 2.B). Every stripe corresponds to the clock wire required to generate the magnetic field. This clock zone layout has important consequences on circuit architectures [11]. While this clock mechanism was demonstrated both theoretically and experimentally [2], its main drawback is the high power loss due to Joule dissipation inside the clock wires.

To overcome this problem we proposed and studied [8] [12] a clock mechanism based on an electric field instead of a magnetic field. The general idea is depicted in Fig. 2.C. Magnets are placed on a piezoelectric substrate. The material chosen is PZT (Lead-Zirconate-Titanate), one of the best piezoelectric materials available. Two electrodes, one on each side of the magnets, generate the electric field when a voltage is applied to them. The electric field induces a strain in the piezoelectric substrate, and the corresponding mechanical deformation of the magnets changes their magnetization through the magnetoelastic effect. Therefore the application of an electric field effectively forces magnets into the RESET state. However, since a voltage is used instead of a current, clock losses are orders of magnitude smaller. The structure is equivalent to a capacitor, and the only losses are due to the charging and discharging of the capacitor. Losses can be evaluated as $CV^2$, where the capacitance $C$ is lower than $1 \text{ nF}$ and the voltage $V$ is a few hundred millivolts (more details are not reported for space reasons but can be found in [8]). The energy losses are therefore very small. An example of circuit layout is shown in Fig. 2.D. Every clock zone is based on a mechanically isolated cell. Clock zone sizes can vary between 3 and 5 magnets, depending on how strict the requirements of the lithographic process are, because mechanical isolation is obtained by patterning the PZT with lithography. Communication among the magnets of different clock zones takes place through the top and bottom borders, since electrodes are placed on both sides of the zone. Logic circuits are based on AND/OR gates as shown in [13].
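The capacitor model of the clock losses can be illustrated with a short numerical sketch. The function names and the example values below are assumptions introduced only for illustration: the text bounds the capacitance and voltage but does not give exact figures, so the numbers are placeholders chosen inside those bounds.

```python
def clock_energy_joules(capacitance_f: float, voltage_v: float) -> float:
    """Energy lost per clock cycle by one clock zone, evaluated as C * V^2."""
    return capacitance_f * voltage_v ** 2


def clock_power_watts(capacitance_f: float, voltage_v: float,
                      frequency_hz: float) -> float:
    """Average clock power of one zone that is charged once per clock period."""
    return clock_energy_joules(capacitance_f, voltage_v) * frequency_hz


if __name__ == "__main__":
    # Hypothetical values, NOT taken from the paper:
    example_c = 1e-15   # 1 fF electrode capacitance (assumed)
    example_v = 0.3     # 300 mV, i.e. "a few hundred millivolts" (assumed)
    example_f = 100e6   # 100 MHz clock frequency
    print(clock_power_watts(example_c, example_v, example_f), "W per clock zone")
```

Because the loss scales with $V^2$, halving the driving voltage in this model cuts the clock energy by a factor of four.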

III. STANDARD CELLS LIBRARY

The layout of MagnetoElastic NML (ME-NML) circuits is based on mechanically isolated islands of limited size, each one corresponding to a clock zone. This layout was chosen according to the fabrication process limitations [8], but it has an interesting consequence: the number of possible magnet patterns inside a clock zone is reasonably small. Thanks to this characteristic it is possible to define a library of Magnetic Standard Cells, each one corresponding to a particular magnet configuration inside a clock zone. Once a finite set of all the conceivable magnet patterns within a clock zone has been defined, any kind of circuit can be easily designed. The standard cells library is shown in Fig. 3. This approach is also particularly interesting in the perspective of a future ad hoc simulation and synthesis tool for this technology [10].

A. Standard Cells

Cell height and width can vary between three and five nanomagnets, and must be chosen according to the logic requirements and the fabrication process limitations. A $3 \times 3$ layout is the most efficient because it has the shortest critical path, i.e. the lowest number of cascaded magnets between an input and an output. A smaller number of magnets in the critical path leads to a higher clock frequency and a lower error probability during signal propagation. With $3 \times 3$ cells, the electrode width is equal to $40 \text{ nm}$, a value compatible with the minimum width of metal-1 wires in current CMOS technology [14]. The cell width and height can be increased to five to simplify the fabrication process (larger cells and electrodes are easier to fabricate), but at the cost of decreasing the achievable clock frequency and increasing the error probability in the switching process [8].

<table>
<thead>
<tr>
<th>Cell type</th>
<th>Standard Cells</th>
</tr>
</thead>
<tbody>
<tr>
<td>Wire</td>
<td></td>
</tr>
<tr>
<td>AND</td>
<td></td>
</tr>
<tr>
<td>OR</td>
<td></td>
</tr>
<tr>
<td>'0'</td>
<td></td>
</tr>
<tr>
<td>'1'</td>
<td></td>
</tr>
<tr>
<td>Double wire</td>
<td></td>
</tr>
<tr>
<td>Inverter</td>
<td></td>
</tr>
<tr>
<td>Crosswire</td>
<td></td>
</tr>
</tbody>
</table>

Fig. 3. Standard cell library elements with size of 3×3 magnets.

Fig. 3 shows the layout of all possible cell types included in the standard cell library (only the 3×3 case is reported for space reasons). Each table row corresponds to a different type of cell. Cells are classified by type, and for each type different orientations are possible. This means that all cells of a specific type can be obtained with a horizontal and/or vertical flip of the base cell. A cell can represent either a logic gate or a simple wire. The word wire, in the field of nanomagnetic logic, stands for a series of horizontally or vertically adjacent magnets. Wires can be single or double. “Double” means that two signals are routed in parallel through the cell. Single wires can have different lengths, depending on whether they connect an input and an output on the same cell side or at opposite cell corners. Double wires always have the same length.

A crosswire cell [2] is used when two wires must cross each other without interference (NML at the time of writing is still a planar technology). The library we created uses AND, OR and INVERTER as its logic gate set. The inverter is simply implemented by an even number of horizontally aligned nanomagnets, because an odd number results in no signal inversion. AND/OR logic gates are obtained by cutting one corner of a magnet [13]. The different shape of those magnets gives them a preferential state, which they leave only when both inputs, from above and below, are up or down, thereby implementing an AND/OR logic function.

B. VHDL Model

To simulate circuits we developed an RTL (Register Transfer Level) model, written in VHDL, for each standard cell [15]. The model helps to easily manage complexity and hierarchy. The multiphase clock system gives NML (and QCA) circuits a peculiar behavior. In particular, the propagation delay of a signal through a clock zone is equivalent to the behavior of a D-latch. As can be understood from Fig. 1, every clock zone samples new data at every clock cycle. As a consequence, every standard cell can be modeled by a register, to emulate signal propagation, and an ideal logic gate without delay, to represent the logic function. This is true for AND, OR and inverter cells [15]. Wires are modeled simply by a D-latch. Therefore the propagation delay of an NML wire depends on the number of clock zones the wire passes through. In other words, this kind of wire can be considered equivalent to a pipelined interconnect in standard CMOS. We chose a 4-phase clock system for our design (as explained in Section IV), so a wire routed through \( N \) clock zones needs \( N/4 \) clock cycles for the information to pass through.
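The latch-like behavior of a wire can be mimicked with a few lines of Python. This is only a behavioral sketch under simplifying assumptions (zone $k$ samples its predecessor during phase $k \bmod 4$, and time advances one phase at a time); it is not the VHDL model of [15], and the function names are hypothetical.

```python
from fractions import Fraction


def wire_delay_cycles(n_zones: int, n_phases: int = 4) -> Fraction:
    """Delay of a wire crossing n_zones clock zones, in clock cycles (N / phases)."""
    return Fraction(n_zones, n_phases)


def simulate_wire(bit: int, n_zones: int, n_phases: int = 4) -> int:
    """Count the phase steps a bit needs to reach the last zone of a wire.

    Each zone behaves like a latch: during its own phase it samples the value
    held by the previous zone (the first zone samples the wire input).
    """
    zones = [None] * n_zones          # None means "no valid data yet"
    step = 0
    while zones[-1] is None:
        active_phase = step % n_phases
        for i in range(n_zones):
            if i % n_phases == active_phase:
                source = bit if i == 0 else zones[i - 1]
                if source is not None:
                    zones[i] = source
        step += 1
    return step


if __name__ == "__main__":
    steps = simulate_wire(1, n_zones=8)
    print(steps, "phase steps =", Fraction(steps, 4), "clock cycles")
```

In this simplified model the bit advances one zone per phase step, so a wire of $N$ zones takes $N$ steps, which matches the $N/4$ clock cycles stated above for a 4-phase clock.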

Each standard cell is represented by its corresponding RTL model. Every type of standard cell is identified by a single VHDL description. Various parameters are used to differentiate cells of the same type. The parameters used are highlighted in Fig. 4: cell length and width, cell orientation, clock phase and cell position in the circuit layout. The model includes a hierarchical bottom-up evaluation of the occupied area and power dissipation, described in Fig. 4 [15]. The actual computation of area and power is first performed by the lowest layer: each standard cell computes its own area and power consumption. Then every processing element (PE) of the Galois multiplier, the test circuit described in this paper (see Section IV), computes its total area and power consumption. A PE is at a higher hierarchical level than a standard cell, so it simply computes its total area and power consumption as the sum of the area and power of every standard cell it contains. The total area and power of the whole Galois multiplier is then computed in the same way, as the sum of the total area and power of each PE.
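The bottom-up accumulation described above can be sketched as a small class hierarchy. The class and attribute names are hypothetical, and the numbers in the usage example are arbitrary placeholders rather than values from the VHDL model of [15].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StandardCell:
    """Lowest layer: one clock zone reporting its own area and power."""
    area_um2: float
    power_uw: float


@dataclass
class ProcessingElement:
    """Middle layer: totals are the sums over the contained standard cells."""
    cells: List[StandardCell] = field(default_factory=list)

    @property
    def area_um2(self) -> float:
        return sum(cell.area_um2 for cell in self.cells)

    @property
    def power_uw(self) -> float:
        return sum(cell.power_uw for cell in self.cells)


@dataclass
class GaloisMultiplier:
    """Top layer: totals are the sums over the processing elements."""
    pes: List[ProcessingElement] = field(default_factory=list)

    @property
    def area_um2(self) -> float:
        return sum(pe.area_um2 for pe in self.pes)

    @property
    def power_uw(self) -> float:
        return sum(pe.power_uw for pe in self.pes)


if __name__ == "__main__":
    cell = StandardCell(area_um2=0.0625, power_uw=0.5)   # placeholder values
    pe = ProcessingElement(cells=[cell] * 4)
    gfm = GaloisMultiplier(pes=[pe] * 3)
    print(gfm.area_um2, gfm.power_uw)
```

Keeping the totals as derived properties means that a change to one standard cell propagates to the PE and multiplier figures without any extra bookkeeping.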

Every standard cell evaluates its total number of magnets starting from the height and length (in terms of magnets) received as input parameters. The occupied area is calculated by multiplying the physical cell length and width, also considering the separation space among magnets and the area occupied by electrodes. In ME-NML circuits, magnets are 50 × 65 nm² and the separation space considered is 20 nm. Electrodes are 40 nm wide for 3×3 cells and 70 nm wide for bigger cells.
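A possible form of the area calculation is sketched below. The geometric details are assumptions made for illustration: the 50 nm magnet side is taken as horizontal, one electrode is placed on the left and one on the right of the cell, the 20 nm gap is counted only between magnets, and no outer margin is added. The model of [15] may count these contributions differently.

```python
MAGNET_WIDTH_NM = 50    # short side of a magnet (assumed horizontal)
MAGNET_HEIGHT_NM = 65   # long side of a magnet (assumed vertical)
MAGNET_GAP_NM = 20      # separation between neighboring magnets


def electrode_width_nm(n_cols: int, n_rows: int) -> int:
    """Electrodes are 40 nm wide for 3x3 cells and 70 nm wide for bigger cells."""
    return 40 if (n_cols == 3 and n_rows == 3) else 70


def cell_area_nm2(n_cols: int, n_rows: int) -> int:
    """Area of one standard cell in nm^2 under the assumptions stated above."""
    if not (3 <= n_cols <= 5 and 3 <= n_rows <= 5):
        raise ValueError("cell sizes range from 3 to 5 magnets per side")
    width = (n_cols * MAGNET_WIDTH_NM
             + (n_cols - 1) * MAGNET_GAP_NM
             + 2 * electrode_width_nm(n_cols, n_rows))
    height = n_rows * MAGNET_HEIGHT_NM + (n_rows - 1) * MAGNET_GAP_NM
    return width * height


if __name__ == "__main__":
    for size in (3, 4, 5):
        print(f"{size}x{size} cell: {cell_area_nm2(size, size) / 1e6:.5f} um^2")
```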

Power losses in NML circuits depend on two main components: the power dissipated by nanomagnets during their switching phase, and the power lost in the clock generation network. The switching power consumption, required to force magnets into the reset state, is equal to the height of the energy barrier between the stable and reset states multiplied by the number of magnets and the switching frequency. This holds because in ME-NML, unlike magnetic NML, adiabatic switching is not used, in order to achieve the maximum clock frequency. Indeed, adiabatic switching reduces the switching power consumption at the cost of a reduced clock frequency. The energy barrier value is around $180 \times k_B T$. Every clock zone, together with its two electrodes, behaves as a capacitor, so the clock consumption for one cell corresponds to the energy needed to charge the electrode capacitance ($C V^2$). The values of capacitance ($C$) and voltage ($V$) are calculated starting from the cell sizes and the materials selected.
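The switching component can be written as a one-line formula. The sketch below assumes a temperature of 300 K to stand for room temperature, which is not stated explicitly in the text, and the function name is hypothetical.

```python
K_B = 1.380649e-23  # Boltzmann constant in J/K


def switching_power_watts(n_magnets: int, frequency_hz: float,
                          barrier_kbt: float = 180.0,
                          temperature_k: float = 300.0) -> float:
    """Switching power = energy barrier * number of magnets * switching frequency.

    barrier_kbt is the barrier height in units of k_B*T (about 180 for ME-NML).
    temperature_k = 300 K is an assumed room-temperature value.
    """
    barrier_joules = barrier_kbt * K_B * temperature_k
    return barrier_joules * n_magnets * frequency_hz


if __name__ == "__main__":
    # Power of a single magnet switched at 100 MHz
    print(switching_power_watts(n_magnets=1, frequency_hz=100e6), "W")
```

Passing `barrier_kbt=30.0` reproduces the same formula for the adiabatic magnetic clock case, where a lower average energy per switching event is assumed.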

IV. GALOIS FIELD MULTIPLIER

To verify and validate the proposed approach we have used a Galois Field Multiplier (GFM) as a case study. It is a highly scalable and regular architecture that has many applications in coding theory, computer algebra and cryptography. A Galois Field $GF(q)$ is a field consisting of a finite number of elements ($q$ elements) together with the description of two operations (addition and multiplication) that can be performed on pairs of elements. A Galois Field exists, and is unique, only for $q = p^m$, where $p$ is a prime number and $m$ a positive integer.

Binary Galois Fields $GF(2^m)$ can be implemented very efficiently with VLSI gates. $GF(2^1)$ is the smallest possible field: it contains only the elements 0 and 1, and the two operations are performed modulo 2. Addition is obtained with a logical XOR, and multiplication with a logical AND. When the value of $m$ in the binary field is greater than 1, ordinary modulo operations do not apply. Each element of the field can be uniquely represented by a polynomial of degree up to $m - 1$ with coefficients in $GF(2)$. The following algorithm illustrates how to multiply two polynomials $a(t)$ and $b(t)$, belonging to $GF(2^m)$, modulo an irreducible polynomial $p(t)$ of degree $m$.

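```plaintext
r(t) := 0
for i = m-1 downto 0 do
    r(t) := t*r(t) + a_i*b(t)        # shift the partial result, add b(t) when a_i = 1
    if degree(r(t)) = m then
        r(t) := r(t) - p(t)          # reduce modulo p(t); over GF(2) this is an XOR
return r(t)
```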
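The algorithm can be checked in software with a direct bit-level transcription. In the sketch below polynomials are encoded as integers (bit $k$ holds the coefficient of $t^k$), and the example uses $p(t) = t^4 + t + 1$, an irreducible polynomial chosen here only for illustration since the text does not name the polynomial used in the circuit.

```python
def gf2m_multiply(a: int, b: int, p: int, m: int) -> int:
    """Multiply a(t) and b(t) in GF(2^m) modulo the irreducible polynomial p(t).

    a and b must have degree below m; p must include its t^m term.
    """
    r = 0
    for i in range(m - 1, -1, -1):   # scan the bits of a from a_(m-1) down to a_0
        r <<= 1                      # r(t) := t * r(t)
        if (a >> i) & 1:
            r ^= b                   # ... + a_i * b(t), addition in GF(2) is XOR
        if (r >> m) & 1:
            r ^= p                   # degree reached m: subtract (XOR) p(t)
    return r


if __name__ == "__main__":
    P = 0b10011                      # t^4 + t + 1 (assumed example polynomial)
    # (t^2 + t) * (t^2 + t + 1) = t^4 + t, which reduces to 1 modulo p(t)
    print(gf2m_multiply(0b0110, 0b0111, P, 4))
```

The loop processes one bit of $a(t)$ per iteration, which mirrors the serial feeding of $dataA$ in the hardware multiplier.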

The circuit schematic of the Galois Field Multiplier, for the case $GF(2^4)$, is shown in Fig. 5. Addition and multiplication symbols, inscribed in circles, correspond respectively to XOR and AND gates. The detail on the right side of Fig. 5 shows the implementation of the combinational logic using only AND, OR and INVERTER gates, which are the only ones available in our NML standard cell library. The serial input ($dataA$) and the feedback enabling the summation with the primitive polynomial are critical paths: their length increases proportionally to the field degree. Pipelining is employed to break those paths, reducing the multiplication delay at the price of an increase in circuit area. With a pipelined architecture the critical path is the same for any parallelism of the multiplier, but the serial bits of $dataA$ must now be fed one every two clock cycles.
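Since the library offers only AND, OR and INVERTER cells, each XOR of the schematic has to be decomposed. One common decomposition is sketched below; it is an assumption for illustration and may differ from the gate arrangement actually drawn in Fig. 5.

```python
def inverter(a: int) -> int:
    """INVERTER cell on a single bit."""
    return a ^ 1


def and_gate(a: int, b: int) -> int:
    return a & b


def or_gate(a: int, b: int) -> int:
    return a | b


def xor_from_library(a: int, b: int) -> int:
    """XOR built as (a AND NOT b) OR (NOT a AND b), using library gates only."""
    return or_gate(and_gate(a, inverter(b)), and_gate(inverter(a), b))


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_from_library(a, b))
```

This form needs two inverters, two AND gates and one OR gate per XOR, which gives a rough idea of why the combinational part of each processing element spans several standard cells.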

As mentioned in Section III, NML circuits are intrinsically pipelined: for every 4 consecutive clock zones crossed (assuming a 4-phase clock), a signal accumulates a propagation delay of 1 clock cycle. It is therefore important to use regular architectures like systolic arrays to avoid long interconnections and maximize performance. Systolic arrays are architectures composed of identical processing elements with a highly regular layout. The Galois Field Multiplier is one of those systolic architectures and is therefore highly suitable for NML technology. From Fig. 5 it is possible to identify each Processing Element (PE). Besides the first and the last, which are slightly different, all the others are identical. Since this is valid for any parallelism of the multiplier, only three different processing elements need to be designed. Therefore a Galois multiplier with any number of bits can be designed simply by combining the first PE, the desired number of central PEs and the last PE.

A. Magnetoelastic GFM

Fig. 6 shows the circuit layout of a 4-bit GFM implemented with ME-NML technology. Different colors identify different clock zones. We have chosen a 4-phase clock system because it leads to a more regular layout than a 3-phase clock. Electrodes inside the clock zones are not depicted, for the sake of clarity. In Fig. 6 signal paths are highlighted with arrows. Every clock zone corresponds to one of the standard cells in Fig. 3. The central processing elements are identical, while the first and the last processing elements are slightly different. The result is an extremely compact and regular circuit layout. The GFM is also perfectly scalable, because adding more bits means adding more central processing elements, which are all equal.

Due to the intrinsic circuit pipelining, a new input can be given to signal $dataA$ every 6 clock cycles. As a consequence a multiplication can be completed in $6N$ clock cycles, where $N$ is the number of bits of the multiplier. To improve data throughput, signal interleaving can be adopted [3]. Six multiplications must be executed in parallel, so that at every clock cycle a new bit from a different multiplication is fed to the circuit. In this way the throughput is improved by a factor of 6. Moreover, to reach the highest possible throughput, the PE input and the feedback of the last PE have to be reset to zero whenever the first bit of a new operation arrives. A reset signal (rst) was therefore routed to the circuit and synchronized with the incoming input signals.
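The effect of interleaving on throughput follows from simple counting, as the sketch below shows. The function names are hypothetical, and the model ignores the initial pipeline fill time.

```python
from fractions import Fraction


def cycles_per_multiplication(n_bits: int, input_interval: int = 6) -> int:
    """Clock cycles needed by one multiplication: one input bit every input_interval cycles."""
    return input_interval * n_bits


def throughput(n_bits: int, input_interval: int = 6,
               interleaved_ops: int = 1) -> Fraction:
    """Multiplications completed per clock cycle, in steady state.

    Up to input_interval independent multiplications can share the circuit,
    each one using a different clock cycle of the input interval.
    """
    if not 1 <= interleaved_ops <= input_interval:
        raise ValueError("interleaved_ops must be between 1 and input_interval")
    return Fraction(interleaved_ops,
                    cycles_per_multiplication(n_bits, input_interval))


if __name__ == "__main__":
    plain = throughput(4)
    interleaved = throughput(4, interleaved_ops=6)
    print(plain, interleaved, interleaved / plain)
```

With full interleaving the throughput no longer depends on the input interval: one multiplication completes every $N$ cycles on average.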

B. Magnetic Clock GFM

For comparison with the layout obtained with the magnetoelastic clock, we designed the Galois Field Multiplier using the classic magnetic field clock. The layout of the 2-bit version is depicted in Fig. 7. The 4-bit multiplier is not shown because the schematic of the 2-bit version is easier to understand. Since the particular structure of the GFM requires feedback signals, a more complex structure is required than the simple layout of Fig. 2.B. It is called snake clock and is thoroughly described in [6]. There are 3 clock phases, and clock wires are placed alternately above and under the magnet plane. Placing clock wires above and under the plane was later suggested also in [2]. In NML, for a signal to propagate in a particular direction, clock zones must be crossed in the right order from 1 to 3. With the layout shown in Fig. 2.B signals can move only from left to right. To enable feedback and allow signal propagation also from right to left, phases 2 and 3 must be swapped. To permit this swap, the corresponding clock wires must be twisted. The correct order of clock phases to enable signal propagation in both directions is shown in Fig. 7, where the area marked by an X corresponds to the area where clock wires are twisted. Magnets cannot be placed in that area. More details on the snake-clock scheme can be found in [6].

Just like the implementation with the magnetoelastic clock, the GFM can be assembled using only three different PEs. To simulate and analyze this version of the GFM we have again implemented an RTL model described in VHDL. Details on the model can be found in [15]. For the sake of clarity we briefly report here how the area and power are evaluated in this model. The area is that of the rectangle circumscribing the circuit. Power consumption is instead given by two components: the power dissipated by nanomagnets during their switching phase, and the power dissipated by clock wires due to the Joule effect. The value of $30k_BT$ is chosen as the average energy dissipated by a single nanomagnet during the switching phase, since an adiabatic clock is used in this case [2]. The power consumption due to magnet switching is simply obtained by multiplying this energy by the number of magnets and the frequency. The main contribution to the power consumption is however the dissipation due to the Joule effect. A high current is necessary to generate a magnetic field strong enough to force a reset. This power component is evaluated as the power dissipated by a $3 \text{ mA}$ current [2] flowing in a copper wire as long as all the clock zones put together, with a cross section as wide as a clock zone and $400 \text{ nm}$ high.
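The Joule component can be estimated from the wire geometry with $P = I^2 R$ and $R = \rho L / (w h)$. In the sketch below the copper resistivity is the usual room-temperature bulk value, while the wire length and width in the example are hypothetical placeholders, since the text specifies them only in terms of the clock zone dimensions.

```python
COPPER_RESISTIVITY_OHM_M = 1.68e-8   # bulk copper at room temperature
WIRE_HEIGHT_M = 400e-9               # wire thickness given in the text


def wire_resistance_ohm(length_m: float, width_m: float,
                        height_m: float = WIRE_HEIGHT_M) -> float:
    """Resistance of a rectangular wire: R = rho * L / (w * h)."""
    return COPPER_RESISTIVITY_OHM_M * length_m / (width_m * height_m)


def joule_power_watts(current_a: float, length_m: float, width_m: float,
                      height_m: float = WIRE_HEIGHT_M) -> float:
    """Power dissipated in the clock wire: P = I^2 * R."""
    return current_a ** 2 * wire_resistance_ohm(length_m, width_m, height_m)


if __name__ == "__main__":
    # 3 mA from the text; 100 um length and 500 nm width are assumed values.
    power = joule_power_watts(3e-3, length_m=100e-6, width_m=500e-9)
    print(power * 1e6, "uW")
```

Because the wire length grows with the total extent of the clock zones, this component scales with the circuit size, unlike the per-magnet switching term.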

As will be clear from the results presented in Section V, the area of this version of the GFM is bigger than the area of the one implemented with the magnetoelastic clock. This also increases the circuit latency, so new data must be fed to the circuit every 10 clock cycles instead of 6. Similarly, 10 multiplications instead of 6 must be interleaved to reach maximum throughput.

V. RESULTS

The performance of the two implementations, in terms of throughput, area and power, is now compared side by side. Area and power consumption are summarized in Table I, varying the number of bits from 4 to 32. As discussed in Section IV, the latency, i.e. the number of clock cycles between one input and the next, is 6 for the magnetoelastic clock and 10 for the snake clock. As a consequence the throughput in case of the magnetoelastic clock (assuming the same clock frequency of 100 MHz in both cases) is around 30% higher. Using data interleaving the throughput is maximized and is equal in both cases, but for the magnetoelastic clock only 6 operations instead of 10 must be executed in parallel.
As can be seen from Table I, clock losses are extremely high in the snake-clock case and very small in the magnetoelastic case. Combining the much smaller clock losses with the reduced area, the total power consumption becomes 50 times lower with the magnetoelastic clock, which is a remarkable result.

VI. CONCLUSIONS

We have presented a detailed analysis of NML circuits based on the magnetoelastic clock. A set of standard cells, covering all possible clock zone configurations, was developed and used to create the complete layout of an N-bit Galois multiplier. The circuit was modeled and then simulated using an RTL model written in VHDL. A power analyzer was embedded inside the model, making it possible to evaluate the circuit area and power consumption. Results show that the magnetoelastic clock reduces the circuit area by a factor of 4 and the total power consumption by a factor of 50.

As future work we will continue to investigate the layout of circuits based on this clock solution, which greatly enhances NML technology. We are also conducting a detailed material analysis to further reduce power consumption.

Table I

<table>
<thead>
<tr>
<th>N of bits</th>
<th>4</th>
<th>8</th>
<th>16</th>
<th>32</th>
</tr>
</thead>
<tbody>
<tr>
<td>AREA (µm²)</td>
<td>Magnetoelastic</td>
<td>14.07</td>
<td>28.63</td>
<td>57.76</td>
</tr>
<tr>
<td></td>
<td>Snake</td>
<td>56.60</td>
<td>107.29</td>
<td>208.67</td>
</tr>
<tr>
<td>POWER (µW): Magnets Switching</td>
<td>Magnetoelastic</td>
<td>0.072</td>
<td>0.148</td>
<td>0.299</td>
</tr>
<tr>
<td></td>
<td>Snake</td>
<td>0.023</td>
<td>0.046</td>
<td>0.092</td>
</tr>
<tr>
<td>POWER (µW): Clock Generation</td>
<td>Magnetoelastic</td>
<td>1.196</td>
<td>2.435</td>
<td>4.913</td>
</tr>
<tr>
<td>POWER (µW): Total Power</td>
<td>Magnetoelastic</td>
<td>1.27</td>
<td>2.58</td>
<td>5.21</td>
</tr>
<tr>
<td></td>
<td>Snake</td>
<td>69.67</td>
<td>132.02</td>
<td>256.76</td>
</tr>
</tbody>
</table>

The area of the magnetoelastic GFM is about four times smaller than that of the snake-clock GFM. The reasons are twofold. First, nanomagnets have different sizes: $50 \times 65 \text{ nm}^2$ for the magnetoelastic GFM and $60 \times 90 \text{ nm}^2$ for the snake-clock implementation. Second, the magnetoelastic layout is intrinsically more compact, with almost no wasted space, while in the snake-clock case there are many clock zone regions without magnets due to clock constraints. Regarding power consumption, the gap between the magnetoelastic and snake clocks is much wider. The intrinsic power consumption due to magnet switching is higher in the magnetoelastic case, but the biggest source of power dissipation is the loss in the clock generation network.
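The improvement factors can be recomputed directly from the entries of Table I. The sketch below copies the three value columns in the order in which they appear in the table, without assigning them to specific bit widths; the variable and function names are hypothetical.

```python
# Values copied from Table I, column by column in their order of appearance.
area_um2 = {
    "magnetoelastic": [14.07, 28.63, 57.76],
    "snake": [56.60, 107.29, 208.67],
}
total_power_uw = {
    "magnetoelastic": [1.27, 2.58, 5.21],
    "snake": [69.67, 132.02, 256.76],
}
me_switching_uw = [0.072, 0.148, 0.299]
me_clock_uw = [1.196, 2.435, 4.913]


def ratios(snake, magnetoelastic):
    """Snake-to-magnetoelastic ratio for each table column."""
    return [s / m for s, m in zip(snake, magnetoelastic)]


area_ratios = ratios(area_um2["snake"], area_um2["magnetoelastic"])
power_ratios = ratios(total_power_uw["snake"], total_power_uw["magnetoelastic"])

if __name__ == "__main__":
    print("area ratios: ", [round(r, 2) for r in area_ratios])
    print("power ratios:", [round(r, 2) for r in power_ratios])
    # Consistency check: switching + clock should match the total (magnetoelastic)
    for s, c, t in zip(me_switching_uw, me_clock_uw,
                       total_power_uw["magnetoelastic"]):
        print(round(s + c, 3), "vs", t)
```

The area ratios stay close to 4 and the power ratios close to 50 in every column, in line with the factors quoted in the text.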