[Article] VLSI Implementation of a Non-Binary Decoder Based on the Analog Digital Belief Propagation

Original Citation:

Availability:
This version is available at: http://porto.polito.it/2518906/ since: October 2013

Publisher:
IEEE / Institute of Electrical and Electronics Engineers

Published version:
DOI:10.1109/TSP.2014.2330804

Terms of use:
This article is made available under terms and conditions applicable to Open Access Policy Article ("Public - All rights reserved") , as described at http://porto.polito.it/terms_and_conditions.html

Porto, the institutional repository of the Politecnico di Torino, is provided by the University Library and the IT-Services. The aim is to enable open access to all the world. Please share with us how this access benefits you. Your story matters.

(Article begins on next page)
VLSI Implementation of a Non-Binary Decoder Based on the Analog Digital Belief Propagation

Muhammad Awais, Guido Masera, Guido Montorsi and Maurizio Martina
Politecnico di Torino, Department of Electronics and Telecommunication, Italy

Abstract—This work presents the VLSI hardware implementation of a novel Belief Propagation (BP) algorithm introduced in [1] and named as Analog Digital Belief Propagation (ADBP). The ADBP algorithm works on factor graphs over linear models and uses messages in the form of Gaussian like probability distributions by tracking their parameters. In particular, ADBP can deal with system variables that are discrete and/or wrapped. A variant of ADBP can then be applied for the iterative decoding of a particular class of non binary codes and yields decoders with complexity independent of alphabet size $M$, thus allowing to construct efficient decoders for digital transmission systems with unbounded spectral efficiency. In this work, we propose some simplifications to the updating rules for ADBP algorithm that are suitable for hardware implementation. In addition, we analyze the effect of finite precision on the decoding performance of the algorithm. A careful selection of quantization scheme for input, output and intermediate variables allows us to construct a complete ADBP decoding architecture that performs close to the double precision implementation and shows a promising complexity for large values of $M$. Finally, synthesis results of the main processing elements of ADBP are reported for 45 nm standard cell ASIC technology.

Index Terms—Belief propagation, APP estimation, iterative decoding, non binary LDPC, Analog Digital Belief Propagation, VLSI decoder, decoder architecture.

I. INTRODUCTION AND MOTIVATION.

Since their rediscovery by MacKay, LDPC codes have been extensively adopted in both next-generation wired and wireless standards due to their near-Shannon limit performance. Moreover, LDPC codes over non binary alphabets of size $M$ can show better performance over their binary counterparts with proper encoding design and code length [2]. However, the significant improvement comes along with the penalty of high decoding complexity. The locally optimal, yet the most complex, iterative decoding algorithm of non binary LDPC codes is the belief propagation (BP) algorithm. Since the size of messages varies with the size of the alphabet $M$, a straightforward implementation of BP results in memory and complexity requirements of the order of $O(M)$ and $O(M^2)$ respectively. In order to reduce the complexity of non binary decoding, several suboptimal decoding schemes have been proposed in recent years.

The first straightforward simplification is obtained at check nodes by replacing the discrete convolution of messages, with complexity $O(M^2)$, with the product of the message Fourier transforms. The use of FFT brings down the complexity to $O(M \log M)$. In [3], the authors introduce a log-domain version of this approach that has advantages in terms of numerical stability.

Some further simplifications have been proposed in [4] with the Extended Min Sum (EMS) algorithm, where message vectors are reduced in size by keeping only those elements in the alphabet with higher reliability. In [5] the same authors propose a hardware implementation of the EMS decoding algorithm for non-binary LDPC codes.

In [6] the Min-Max algorithm is introduced with a reduced complexity implementation called selective implementation, which can reduce by a factor 4 the operations required at the check nodes; however, complexity is still in the order of $O(M^2)$.

Several studies on VLSI implementation of non binary decoders based on the previous algorithms have been presented in literature [7]–[13]. The results of such studies confirm that all non binary decoders require complexity growing with the size of the alphabet $M$.

The ADBP proposed in [1] represents a breakthrough in the reduction of the complexity and memory requirements with respect to previous proposed algorithms, as for ADBP both complexity and memory requirements are independent of the size $M$ of the alphabet. The main simplification of ADBP is due to the fact that messages are not stored as vector of size $M$ containing the likelihood of the discrete variables (or equivalently their log-likelihood ratios-LLR) but rather as the two moments, or related quantities, of some suitable predefined class of Gaussian-like distributions. ADBP can be casted into the general class of expectation-propagation algorithms described by Minka [14]. The main contribution in [1] is the definition of a suitable class of distributions for the messages and the derivation of the updating equations for the message parameters at the sum and repetition operations of the graph.

It should be noticed that ADBP cannot be applied to all types of linear codes over $GF(M)$ as multiplication by field elements different from $\pm 1$ is not allowed in the graph. This ensemble of codes has been analyzed in [15] and [16], where it is shown that it is capacity achieving as its distance spectrum approaches that of random codes as the underlying graph connectivity grows.

The exact ADBP updating equations however are not suitable for a straightforward implementation due to the
presence of complex non linear operations. Some simplifications to the updating equations have been presented in [17]. In this paper we tackle the problem of VLSI implementation of ADBP with a systematic approach. In section II we start by reporting the exact updating equations of ADBP and consider its special application to the decoding of non binary codes. In section III we introduce some simplifications to the updating equations and evaluate their impact on the performance of the decoder. In section IV we present the results of the fixed point implementation of ADBP obtained by optimizing the bit width of input, output and intermediate quantities of the decoder processing elements. In section V we report the architecture of the designed core processors and the synthesis results for $M = 10$ up to $M = 64$. The provided results confirm that implementation of ADBP is feasible with small complexity and more importantly that complexity is independent of $M$.

II. The ADBP algorithm.

ADBP is a particular version of the BP algorithm that allows to perform in a very efficient way the BP for linear systems where variables can be either discrete or wrapped or both.

From the complexity point of view ADBP is equivalent to Gaussian BP over linear system as described in [18].

Let us define the class of Gaussian messages as

$$
G \triangleq \left\{ G(\mu, K, x) \propto e^{-\frac{1}{2}(x-K)^T K^{-1} (x-K)}, K \in \mathbb{R}^+, \mu \in \mathbb{R} \right\}
$$

where $\mu$ and $K$ denote respectively the mean and concentration of a Gaussian message. With continuous variables and messages belonging to $G$, the following simple updating rules for linear real systems can be derived corresponding to the sum, repetition and axis scaling operations.

$$
\text{Sum} \rightarrow G_1 * G_2 = G(\mu_1 + \mu_2, (K_1^{-1} + K_2^{-1})^{-1}) \in G \quad (1)
$$

$$
\text{Repetition} \rightarrow G_1 \cdot G_2 = \frac{\mu_1 K_1 + \mu_2 K_2}{K_1 + K_2} \in G \quad (2)
$$

$$
\text{Scaling} \rightarrow aG = G(a\mu, K/a^2) \in G. \quad (3)
$$

In (4), the symbol $*$ indicates the convolution operation and $s_M(x) \triangleq \sum_i \delta(x - i\Delta)$ is the train of pulses. The discretization of system variables on the other side, induces a sampling of their corresponding messages, leading to the introduction of the class of sampled gaussian messages: $S \triangleq S[G]$ where we have defined the sampling operator

$$
S[L(x)] \triangleq s_M(x) \cdot L(x),
$$

Notice that Fourier transforms of sampled gaussian messages are wrapped gaussian messages with imaginary mean.

When the wrapping period $M$ is a multiple of the sampling interval, wrapping and sampling operators commute $(S[W[L]] = W[S[L]])$, so that it is possible to introduce the class of Digital messages or $D$-Messages, which consists of messages that are both wrapped and sampled:

$$
D \triangleq W[S[G]] = S[W[G]].
$$

Examples of linear systems that use discrete and wrapped variables i.e. integers in the range $[0, M - 1]$ are linear non-binary encoders, whose non binary code-words $c$ satisfy

$$
H_C = 0
$$

where the mod $M$ sum is assumed in the equation and the coefficients of the parity check matrix $H$ are bounded to be in the set $\{1, 0, -1\}$.

ADBP using D-Message can then be applied for the iterative decoding of members of this code ensemble, yielding decoders with complexity independent from the cardinality of the alphabet $M$.

In [1] it is shown that, in contrast to what happens for gaussian messages (equations (1-3)), wrapped messages are not closed under multiplication and sampled gaussian messages are not closed under convolution. As a consequence $D$-messages are not closed w.r.t repetition and sum operations.

In particular, while the updating equation (2) is reliable also for wrapped messages for large concentration of the messages, it fails to provide accurate results when the concentration is small. This is due to the non negligible effect of the aliasing on the replicas introduced by wrapping. For the same reason, equation (1) fails to provide accurate results for sampled gaussian messages with high concentration.

However, accurate approximations of the output messages belonging to the same class of the inputs can be found in both cases. This is obtained by exploiting the correspondence between members of the class of wrapped gaussian messages of period $M$ with those of the class of Von Mises or Tychonov messages with the same period:

$$
V \triangleq \left\{ V(\mu, K_0, x) \propto e^{K_0 \cos(A(x - \mu))}, K_0 \in \mathbb{R}^+, \mu \in [0, M] \right\},
$$
Where $A \triangleq \frac{2e}{\pi}$. The mapping between the two distributions preserves the mean and transforms the concentration according to

$$K_{vi} \triangleq f^{-1} \left( \frac{K_{vi}}{A^2} \right), \quad i = 1, 2$$

$$f(K_v) \triangleq \left( \frac{2\log \left( \frac{I_0(K_v)}{I_1(K_v)} \right)}{K_v} \right)^{-1}$$

Where $I_n(x)$ denotes the modified Bessel functions of order $n$.

In summary, by skipping the details (mentioned in [1]), in the repetition operation, the output distribution $W[G_1] \cdot W[G_2]$ can be approximated for low message concentrations as

$$K_{vi} \triangleq f^{-1} \left( \frac{K_{vi}}{A^2} \right), \quad \mu = (l_i + \alpha_i), \quad \gamma_i = \alpha_i.K_{vi}, \quad i = 1, 2$$

$$K_{v3} = 1/f \left( \sqrt{K_{v1}^2 + K_{v2}^2 + 2K_{v1}K_{v2} \cos(\mu_1 - \mu_2)} \right)$$

$$\mu_3 = \frac{1}{A} \arctan \left( \frac{K_{v1} \sin(\mu_1) + K_{v2} \sin(\mu_2)}{K_{v1} \cos(\mu_1) + K_{v2} \cos(\mu_2)} \right)$$

Similarly, a good approximation of the message $S[G_1] \cdot S[G_2]$ for the sum operation, which is valid for large message concentrations can be obtained by exploiting the same correspondence, but in the transform domain, yielding:

$$K_{vi} \triangleq f^{-1} \left( \frac{1}{K_{vi}} \right), \quad \mu = (l_i + \alpha_i), \quad \gamma_i = \alpha_i.K_{vi}, \quad i = 1, 2$$

$$K_{v3} = 1/f \left( \sqrt{K_{v1}^2 + K_{v2}^2 + 2K_{v1}K_{v2} \cos(\gamma_1 - \gamma_2)} \right)$$

$$2\gamma_3 = \log \left( \frac{K_{v1}e^{\gamma_1} + K_{v2}e^{\gamma_2}}{K_{v1}e^{-\gamma_1} + K_{v2}e^{-\gamma_2}} \right), \quad l_3 = l_1 + l_2$$

where the real quantity $\mu$ is expressed by separating its integer ($l$) and fractional part ($\alpha$) in $[-0.5, 0.5]$. The ADBP algorithm which uses exactly the updating rules (7)-(8) is named as exact-ADBP algorithm and denoted in the following with the acronym eADBP.

### III. SIMPLIFICATIONS OF THE UPDATING EQUATIONS.

Although ADBP algorithm introduces the fundamental complexity breakthrough of making the iterative decoding of non binary codes independent from the alphabet size $M$, equations (7) and (8) for updating the D-message parameters are still too complex for a hardware implementation. In this section we introduce some simplifications of the ADBP updating equations for the specific purpose of using it for decoding of non binary codes in a digital transmission system.

The considered full transmission system for high spectral efficiencies, shown in Fig.1, consists of an mod-$M$ LDPC encoder, an M-PAM modulator, a wrapped AWGN channel and the ADBP decoder.

The encoder is a regular LDPC encoder with $K = 3000$ input $M$-ary symbols, and $N = 5000$ output symbols. The constant variable degree is $d_v = 2$ and the check node degree is $d_e = 5$.

The $M$-ary output symbols $c$ of the encoder are transmitted using a $M$-PAM constellation, with a natural order mapping $x = c - (M - 1)/2$. The outputs of the wrapped gaussian channel are obtained wrapping the output of a regular AWGN channel in the interval $[-M/2, M/2]$. Wrapping reduces the channel capacity but at the same time make it input symmetric, so that transmission of the all zero sequence can be assumed with the employed linear encoder.

Furthermore, the likelihoods at the output of the wrapped gaussian channel with $M$-PAM inputs take naturally the form of digital messages with parameters

$$\mu = y + (M - 1)/2 \mod M$$

$$K = \frac{1}{\sigma^2} = N_0 M^2 - 1$$

so that it can be easily interfaced with ADBP.

The ADBP decoder takes as input messages the pairs $(\mu, K)$ and performs 10 iterations using the flooding schedule. The Tanner graph of the code is reported in Fig.2.

#### A. Simplifications of the repetition update equations

As pointed out in the Fig. 2, at repetition nodes one of the involved messages is always the channel message. The standard deviation of this message, related to $E_s/N_0$

$$1$$

The use of ADBP in conjunction with the regular AWGN channel, as well as with other types of modulation set requires some modifications of the computation of the input messages parameters. Since in this paper we focus primarily in the implementation issues, we decide to use the wrapped AWGN to simplify the analysis. Notice that the loss of the capacity induced by wrapping becomes negligible for large value of $E_s/N_0$. 

![Figure 1: Block diagram of high spectral efficiency transmission system employing ADBP](image1)

![Figure 2: Tanner graph of LDPC code adopted in proposed ADBP algorithm](image2)
over the channel, is typically much smaller than the wrapping period $M$. In this case the exact expression of ADBP (7) can be replaced by the simpler expression (2) that neglects the aliasing effect of replicas of the wrapped gaussian. Notice however that for the proper computation of the output mean in (2) one should consider, among all possible replicas associated to the wrapped gaussian distribution, those that have closest $\mu_i \mod M$. As a consequence the following approximation can be used to obtain the mean of $W[G1] \cdot W[G2]$.

$$\mu_3 \approx \frac{\mu_1 K_{sw,1} + \mu_2 K_{sw,2}}{K_{sw,1} + K_{sw,2}} \mod M$$  \hspace{1cm} (9)

Where $\mu_1 = (\mu_1 + k_1 M)$ and $\mu_2 = (\mu_2 + k_2 M)$ , and the integers $k_1$ and $k_2$ should be chosen so as to minimize $|\mu_1 - \mu_2|$.

Eq. (9) requires 2 multiplications, one sum and one division. In order to simplify it, we first derive the indexes associated to the maximum and minimum values of the concentrations.

$$L = \arg \max_i (K_{sw,i}), \quad l = \arg \min_i (K_{sw,i})$$

we then write (9) as

$$\mu_3 = \mu_L + A(\mu_l - \mu_M/2) \mod M$$  \hspace{1cm} (10)

where

$$A \triangleq \frac{K_{sw,l}}{K_{sw,1} + K_{sw,2}}$$

and the notation $(\mu_l - \mu_M/2)$ means that the difference $\mu_l - \mu_M$ should be taken mod $M$ in the range $[-M/2, M/2]$.

Expression (10) only requires two sums and the multiplication by the number $A$, which is always bounded in $[0,1]$.

B. Simplifications of sum update equations

In sum update the use of the correct expression (8) instead of (1) is required as the concentration of messages actually increases during iterations. In this case the simplification of (8) is obtained by considering the following approximation for the function $f(x)$

$$f(x) \approx -(2 \log(x/2))^{-1}$$

which is valid for $x < 1$.

Using this approximation we can write the concentration of the characteristic functions as

$$K_{sw,i} = 2 \exp(-K_{sw,i}/2)$$

and approximate the output message concentration as follows:

$$K_{sw,3} = -2 \log \left( \frac{1}{2} \sqrt{K_{sw,1}^2 + K_{sw,2}^2 + 2K_{sw,1}K_{sw,2} \cosh(\gamma_1 - \gamma_2)} \right) \approx -\log \left( \frac{1}{4} \left( K_{sw,1}^2 + K_{sw,2}^2 + 2K_{sw,1}K_{sw,2} \cosh(\gamma_1 - \gamma_2) \right) \right) \approx -\log \left( e^{-K_{sw,1}} + e^{-K_{sw,2}} + 2e^{-K_{sw,1}+K_{sw,2}} \cosh(\gamma_1 - \gamma_2) \right) \approx \min^* \left( K_{sw,1}, K_{sw,2}, \frac{K_{sw,1} + K_{sw,2}}{2} - |\gamma_1 - \gamma_2| \right)$$  \hspace{1cm} (11)

where we introduced the familiar operator $\max^*(a,b) = \log(e^a + e^b) = \max(a,b) + \log(1 + e^{-|a-b|})$ and the derived operators $\min^*(a,b) = -\max^*(-a,-b)$ and $|x|^* = \max^*(x, -x)$.

Similarly we can derive the following expression for the output mean (3rd equation in (8))

$$2\gamma_3 = \log \left( \frac{K_{sw,1}e^{\gamma_1} + K_{sw,2}e^{\gamma_2}}{K_{sw,1}e^{-\gamma_1} + K_{sw,2}e^{-\gamma_2}} \right) \approx \max^* \left( \frac{K_{sw,1}}{2} + \gamma_1, \frac{K_{sw,2}}{2} + \gamma_2 \right) - \max^* \left( \frac{K_{sw,1}}{2} - \gamma_1, \frac{K_{sw,2}}{2} - \gamma_2 \right)$$  \hspace{1cm} (12)

A further simplification to the min‘ that neglects the correction term $\log(1 + e^{-|a-b|})$ and obtains the true min is considered. The ADBP algorithm that uses the updating rules of (10), (11) and (12) with true min approximation is named as the simplified-ADBP and denoted by the acronym sADBP.

IV. FIXED POINT MODEL

The fixed point (FP) model of proposed algorithm is implemented in C language. All variables in the decoding algorithm are represented in 2’s compliment notation as $[I,F]$ with $I$ bits for integer part (which includes the sign bit unless stated otherwise) and $F$ bits for fractional part. The performance of the decoder is expressed as symbol error rate (SER) as a function of signal to noise ratio ($E_s/N_0$).

At first, the approximations used in Eq.(10)-(12) are validated by simulating in double precision (DP), both eADBP and sADBP algorithms. Fig. 3.a shows the results obtained from this step for $M = 10, 16, 32$ and $64$. For all the four cases of $M$ reported in the figure, the simulation results show that the sADBP algorithm performs close to the eADBP algorithm and therefore, can be safely investigated for FP implementation.

The initial step of FP implementation of sADBP decoder is the determination of appropriate number of $I$ and $F$ bits for input quantities (i.e. $\mu$ and $K$ ). This acts as a starting point to implement the internal datapath of decoder keeping in view the performance and complexity constraints. To achieve this, we analyze the DP performance offered by the $s$ADBP$ algorithms when quantized inputs are applied. This analysis is carried out in two steps.

Quantization of $\mu$ only: In this step, only the $\mu$ is quantized as $[I,F]$ and applied to the input of sum and repetition functional units whereas, the $K$ and internal data path is in DP. As discussed in section II, the $\mu$ of D-messages with wrapping period $M$ is a real number that lies in the range $[0,M]$. Therefore, the 2’s compliment representation of integer part of $\mu$ requires $\lceil \log_2 M \rceil$ bits for the magnitude and 1 bit for the sign. Since the input $\mu$ is always positive, most significant sign bit is always ‘0’ and neglected. Whereas, the number of fractional bits of $\mu$ is a design parameter that could be changed to
(b) DP performance of $eADBP$ algorithm with quantized $\mu$ only

Figure 3: ADBP Simulation Results (a) DP performance of $eADBP$ and $sADBP$ algorithms (b) (c) (d) Effects of input quantizations on the DP implementation of $sADBP$ algorithm.

Figure 4: Complete FP simulation results for $sADPB$.

which shows the $sADBP$ performance with quantized $\mu$ with fixed $\lceil \log_2 M \rceil$ integer and variable (2 to 5) fractional bits. For the three cases of $M$ shown in Fig. 3.b, the simulation results demonstrate that at least 4 fractional bits are required for $\mu$ in order to obtain performance within 0.3 dB to the DP implementation.

**Quantization of $\mu$ and $K$:** In this step, both inputs i.e. $\mu$ and $K$ are quantized as $[I, F]$ and applied to functional units. $\mu$ is quantized as obtained from the above step whereas, for $K$ we simulate various combinations of integer and fractional bits. The simulation results of this step are shown in Fig. 3.c. Since the magnitude of $K$ increases with $E_s/N_o$, the schemes with large $I$ and $F \geq 3$ are required to achieve low SER at high $E_s/N_o$. For example, $K[5,3]$ shows superior results for both $M = 10$ and $32$ while $K[4,3]$ also shows comparable results with fewer number of bits. From the simulation results, we conceived that considerable improvement in SER performance could be achieved at high $E_s/N_o$ by first scaling down the magnitude of $K$ before applying

...
to the input of functional units. Whereas, the updated \( K \) at the output of the units is scaled up by the same factor. This scaling helps to represent a large value of \( K \) with lower number of integer bits and avoids overflow during intermediate calculations thus achieving better numerical stability. In addition, it is also conceived that scaling factor of 16 when used with [4,3] quantization scheme provides a good performance-complexity trade off. The results of this step are shown in Fig. 3.d for \( M = 10 \) and 32 which shows significant improvement in SER performance of sADBP algorithm achieved with the help of \( K \) scaling. It may be noted that the same [4,3] quantization for \( K \) and scaling factor of 16 is also applicable for \( M = 16 \) and 64.

In the next step, the internal datapath of decoder is optimized for FP implementation using the quantized inputs (\( \mu \) and \( K \)). The signals involved at each intermediate step of computation are optimized by simulating various combinations of integer and fractional bits. Figure 4 shows the complete FP simulation results of the proposed decoder. The curves in the figure show that for each case of \( M \), the FP implementation of sADBP performs very close to the DP implementations of sADBP and eADBP algorithm. The hardware architecture and datapath complexity details specific to the FP simulation curves of Fig. 4 are discussed in the next section.

V. Hardware architecture and Synthesis Results.

The ADBP decoding architecture is similar to standard non binary (NB) LDPC decoders. As discussed in Sec. III and shown in Fig. 2, the Tanner graph for ADBP consists of two kinds of computation nodes i.e. sum nodes and repetition nodes that operate on \( d_c \) and \( d_o \) messages respectively. The decoding process involves a flooding schedule in which a single decoding iteration consists of two steps: a) Horizontal scan: in which all sum nodes update their output messages and b) Vertical scan: in which all repetition nodes update their output messages. As discussed before, a single message in ADBP decoding is a vector containing two values i.e. \( \mu \) and \( K \) of the corresponding message class.

In this section, we will first discuss the digital architecture for binary (i.e. two input) sum and repetition nodes and then we will propose a generalized processing element (PE) that implements the extension of these operations over \( d_c(d_o) \) messages.

**Binary Repetition Node**

The repetition node implements the function \( W[G1]*W[G2] \) using simplified version (10). This equation is implemented as shown by the steps in Algorithm 1.

Algorithm 1 Repetition Node Update for sADBP Algorithm

1. Inputs: \( K_{wi}, \mu_i = (l_i + a_i), i = 1,2 \)
2. Outputs: \( K_{w3}, \mu_3 \)
3. Scaling: \( K_{xj} = \frac{K_{wi}}{16}, i = 1,2 \)
4. \( K_{w3} \)-Computation:
   5. \( K_{x3} = K_{x1} + K_{x2}, K_{w3} = K_{x3} \times 16 \)
6. \( \mu_3 \)-Computation:
   7. if \( (K_{x1} > K_{x2}) \) then
      8. \( \mu_1 = \mu_2 - \mu_1, K_1 = K_{x2}, \mu_2 = \mu_1 \)
   else
      9. \( \mu_1 = \mu_1 - \mu_2, K_1 = K_{x1}, \mu_2 = \mu_2 \)
11. end if
12. if \( (\mu_1 > \frac{w}{2}) \) then
13. \( \mu_3 = \mu_1 - M, \mu_4 = \mu_3 \)
14. else if \( (\mu_1 < -\frac{w}{2}) \) then
15. \( \mu_3 = \mu_1 + M, \mu_4 = \mu_3 \)
16. else
17. \( \mu_4 = \mu_1 \)
18. end if
19. \( v_3 = \frac{K_{x3}}{K_{w3}}, a_1 = K_1 \times v_3 \)
20. \( a_2 = a_1 \times \mu_4 \)
21. \( \mu_3 = \mu_2 + a_2 \)

Fig. 5.A shows the datapath architecture of binary repetition node functional unit (RN-FU). CMP1-CMP3 are binary comparators, A1-A4 are 2’s complement binary adders, M1-M2 are binary multipliers and Mux1-4 are the 2 input multiplexers. The figure also shows the input, output and intermediate variables of Listing 1. \( \mu_i \) and \( K_{wi}, i = 1,2 \) are the input mean and scaled concentrations respectively. In FP implementation, the wrapped message concentration \( K_{wi,j} \) at the input of the node is represented in \([I,F] \) form as \([8,0] \) which is scaled down by 16 by simply moving the decimal point to left by 4 places i.e. \( K_{x} = [4,4] \). Reverse is true for scaling up by 16 at the output of the node. In order to decrease the hardware complexity and propagation delay, we adopted division via multiplication by reciprocal technique where the quantity \( v_3 = \frac{1}{K_{w3}} \) is implemented using a look up table (LUT). The repetition node is implemented as a fully pipelined structure with pipelining depth \( L = 3 \) (shown in dotted lines in Fig.5.A).

**Binary Sum Node**

The binary sum node implements the function \( S[G1]*S[G2] \) using (11) and (12) and the main computational steps are shown in Algorithm 2.

The hardware architecture reported in Fig.5.b consists of binary comparators Cmp1-Cmp5 that compute the true min of two inputs, 2’s complement adders A1-A16, multipliers M1-M3 and a LUT to implement \( v_3 = \frac{1}{K_{w3}} \). The sum node is also a fully pipelined structure with \( L = 3 \) pipeline stages.

It may be noted that although the generalized architecture for both sum and repetition node remains the same for all values of \( M \), the complexity as well as decoding performance is dependent upon the \([I,F] \) representation of all variables. For the FP simulation results shown in Fig. 4, the binary sum and repetition nodes datapath
Algorithm 2 Sum Node Update for sADB Algorithm

1. Inputs: $K_{i,j}, \mu_i = (l_i + \alpha_i), i = 1, 2$
2. Outputs: $K_{s,3} \mu_3$
3. Scaling: $K_{s,3} = 2^{16}, i = 1, 2$
4. $K_{s,3}$-Computation:
5. for $i = 1$ to $2$
6. $l_i = [\mu_i + 0.5], \alpha_i = \mu_i - l_i, \gamma_i = \alpha_i \times K_{s,3}$
7. end for
8. $\gamma_{12} = \gamma_1 - \gamma_2, \gamma_{21} = \gamma_2 - \gamma_1$
9. $\gamma_s = \max(\gamma_{12}, \gamma_{21})$
10. $y_1 = \frac{K_{x1}}{2} + \frac{K_{x2}}{2}$
11. $y_3 = \min(K_{s,1}, K_{s,2})$
12. $y_2 = y_1 - \gamma_s$
13. $K_{s,3} = \min(y_3, y_2), K_{s,3} = 16 \times K_{s,3}$
14. $\mu_{s,3}$-Computation:
15. $x_1 = y_1 + \frac{K_{x1}}{2}, x_2 = y_2 + \frac{K_{x2}}{2}$
16. $x_3 = y_1 - \frac{K_{x1}}{2}, x_4 = y_2 - \frac{K_{x2}}{2}$
17. $x_5 = \max(x_3, x_4), x_6 = \min(x_1, x_2)$
18. $\gamma_3 = \frac{1}{2}(x_5 - x_6), \delta_3 = \frac{1}{2}x_5 - \frac{1}{2}x_6$
19. $\alpha_3 = x_3 \times \gamma_3, l_3 = l_1 + l_2$
20. $\mu_3 = l_3 + \alpha_2$
21. $\mu_3 = (\mu_3 + M) \mod M$

detail is given in Tab. I.a and I.b. The left most column of both tables shows all computation steps involved in the updates of Algorithm 1 and 2 whereas, the next columns show for $M = 10$ to 64, the quantization detail $[I, F]$ of operands as well as of the result of each step. The quantities $\mu_i, K_{s,3}, K_{s,3}, K_{s,3}, l_i : i = 1, 2, 3$ are always positive therefore, MSB of their magnitude part is always '0' and hence excluded from $[I, F]$ representation. For all other quantities the $[I, F]$ includes the sign bit. In some cases where the operands have unequal number of I and F bits; arithmetic shift left(right) operation is performed on one or both of them in order to align their decimal points. In addition, the tables also show the hardware modules (of Fig. 5.A and 5.B) involved in each step along with their complexity in terms of bitwidth.

The details of Tab. I.a and I.b reveal that moving from $M = 10$ to 64 requires a 1 bit increase in the bitwidth of those hardware modules which involve $\mu_i$ in computation. This results in a slight but affordable increase in overall complexity of sum and repetition functional nodes.

**Processing Element**

The sum and repetition functions discussed above are associative i.e. for three inputs $I_1, I_2, I_3$

\[ \oplus(I_1, I_2, I_3) = \oplus(\oplus(I_1, I_2), I_3) \]
### Table I: Sum and repetition functional nodes complexity detail specific to FP performance curves of Fig. 4

<table>
<thead>
<tr>
<th>Operation / Signals</th>
<th>Quantization Details ([I,F])</th>
<th>Hardware Modules</th>
<th>Bitwidth</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>(M=64)</td>
<td>(M=16)</td>
<td>(M=10,16)</td>
</tr>
<tr>
<td>(\mu_1, \mu_2)</td>
<td>[6,4]</td>
<td>[4,4]</td>
<td>&lt;[4,4]</td>
</tr>
<tr>
<td>(K_{11}, K_{12}, K_{23})</td>
<td>[8,0]</td>
<td>[8,0]</td>
<td>[8,0]</td>
</tr>
<tr>
<td>(l_i = m_i + \frac{3}{2}, i = 1, 2)</td>
<td>[6,4]</td>
<td>[4,4]</td>
<td>[4,4]</td>
</tr>
<tr>
<td>(a_i = \frac{\mu_1 - l_i}{l_i}, i = 1, 2)</td>
<td>[4,4]</td>
<td>[4,4]</td>
<td>[4,4]</td>
</tr>
<tr>
<td>(y_i = l_i + K_{22}, i = 1, 2)</td>
<td>[4,4]</td>
<td>[4,4]</td>
<td>[4,4]</td>
</tr>
<tr>
<td>(y_1 = y_1 - y_2)</td>
<td>[2,7]</td>
<td>[2,7]</td>
<td>[2,7]</td>
</tr>
<tr>
<td>(y_2 = y_1 - y_1)</td>
<td>[2,7]</td>
<td>[2,7]</td>
<td>[2,7]</td>
</tr>
<tr>
<td>(y_3 = \max(y_1, y_2))</td>
<td>[3,6]</td>
<td>[3,6]</td>
<td>[3,6]</td>
</tr>
<tr>
<td>(\frac{K_{31}}{2}, i = 1, 2)</td>
<td>[4,3]</td>
<td>[4,3]</td>
<td>[4,3]</td>
</tr>
<tr>
<td>(x_i = \gamma_i + \frac{K_{32}}{2})</td>
<td>[2,7]</td>
<td>[2,7]</td>
<td>[2,7]</td>
</tr>
<tr>
<td>(x_3 = \gamma_1 - \frac{K_{11}}{2})</td>
<td>[2,7]</td>
<td>[2,7]</td>
<td>[2,7]</td>
</tr>
<tr>
<td>(x_4 = \gamma_2 - \frac{K_{12}}{2})</td>
<td>[2,7]</td>
<td>[2,7]</td>
<td>[2,7]</td>
</tr>
<tr>
<td>(x_5 = \min(x_3, x_4))</td>
<td>[5,3]</td>
<td>[5,3]</td>
<td>[5,3]</td>
</tr>
<tr>
<td>(x_6 = \min(x_1, x_2))</td>
<td>[4,3]</td>
<td>[4,3]</td>
<td>[4,3]</td>
</tr>
<tr>
<td>(\gamma_1 = \frac{(x_3 - x_5)}{2})</td>
<td>[5,3]</td>
<td>[5,3]</td>
<td>[5,3]</td>
</tr>
<tr>
<td>(v_3 = \frac{K_{33}}{4})</td>
<td>[2,4]</td>
<td>[2,4]</td>
<td>[2,4]</td>
</tr>
<tr>
<td>(a_3 = c_1 \times y_3)</td>
<td>[2,4]</td>
<td>[2,4]</td>
<td>[2,4]</td>
</tr>
<tr>
<td>(b_3 = a_1 + f_2)</td>
<td>[7,0]</td>
<td>[7,0]</td>
<td>[7,0]</td>
</tr>
<tr>
<td>(\mu_3 = \frac{y_2 + a_3}{2})</td>
<td>[3,6]</td>
<td>[3,6]</td>
<td>[3,6]</td>
</tr>
<tr>
<td>(\mu_3 = (\mu_3 + 3) \mod M)</td>
<td>[6,4]</td>
<td>[6,4]</td>
<td>[6,4]</td>
</tr>
</tbody>
</table>

#### (a) Sum Node Functional Unit

#### (b) Repetition Node Functional Unit
Where ⊕ operator denotes the binary sum or repetition function. In order to extend these functions to process more than two input messages, the following update rule must be satisfied

\[ I_{y,i} = \oplus^n_{j \neq i}(I_{x,j}) \]  
(13)

\( I_{x,j}(I_{y,j}) \) denote the input(output) \([\mu, K]\) message vectors and \( n \) denotes the node degree \((d_v \text{ or } d_c)\). The above equation states that the output message corresponding to edge \( i \) of a node (sum or repetition) is the result of ⊕ operator over all input messages except the input message \( i \). The update rule of (13) is implemented in this work by adopting a forward-backward (FB) strategy with serial read and write. Fig. 5.c. shows the generalized architecture of proposed FB processing element (PE) which consists of three functional units (FUs) that implement the ⊕ operator and two last in first out (LIFO) memory units which hold the intermediate results. The functionality of PE explained here in context of CN processing holds equally applicable for VN processing.

Thanks to the pipelined implementation of FUs, the PE is able to process one edge of Tanner graph per clock cycle. The number of parity check equations in pipeline is \( L = 3 \). In the first \( L \) clock cycles PE receives the first message of \( L \) parity equations, in the next \( L \) cycles it receives the second message of \( L \) equations and so on. This process continues until all \( d_e \) messages of \( L \) parity check equations have been received. At the output, the messages are produced in reverse order i.e. starting from message number \( d_e \) of equation number \( L \) up to message number 1 of equation number 1.

The width (i.e. number of bits per row) of LIFO1 and LIFO2 is equal to sum of total number of bits for FP representation of \( \mu \) and \( K \), whereas the depth (i.e. number of rows) is given as

\[ d_{lifo} = (d_e - 1)L \]  
(14)

Both sum and repetition PE’s have a throughput of one edge per clock cycle. Therefore, if the clock frequency is \( C \), the PE’s for both sum and repetition operator will provide the full set of updated edges \( E = N \eta \) in \( 2N \eta(Nd_e + L')/C \) seconds, where \( L' \) is an additional constant that takes into account the latency of the two processors.

More generally, with \( P \) parallel processors the throughput is given as

\[ T = C \frac{PR_c \log M}{2N \eta(d_e + \frac{L}{N})} \approx C \frac{PR_c \log_2 M}{2N \eta d_e P} \]  
(15)

Where \( R_c \) denotes the code rate and \( d_e \) is the average variable degree of LDPC code, \( N \eta \) is the number of decoding iterations.

The synthesizable IP core of the ADBP decoder is written in VHDL and synthesis is performed on 45nm standard cell ASIC technology using Synopsys Design Vision tool at a target clock rate \( C = 300 \) MHz.

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>2540</td>
<td>7/10</td>
<td>9</td>
<td>1200</td>
<td>0.0106</td>
<td>0.0155</td>
<td>18</td>
<td>3.5</td>
<td>7.5</td>
</tr>
<tr>
<td>5</td>
<td>2540</td>
<td>7/10</td>
<td>12</td>
<td>2000</td>
<td>0.0115</td>
<td>0.0150</td>
<td>18</td>
<td>2.5</td>
<td>5.5</td>
</tr>
<tr>
<td>6</td>
<td>2540</td>
<td>7/10</td>
<td>14</td>
<td>2000</td>
<td>0.0111</td>
<td>0.0149</td>
<td>18</td>
<td>1.5</td>
<td>3.5</td>
</tr>
<tr>
<td>7</td>
<td>2540</td>
<td>7/10</td>
<td>21</td>
<td>2000</td>
<td>0.0124</td>
<td>0.0169</td>
<td>18</td>
<td>11.5</td>
<td>9.5</td>
</tr>
<tr>
<td>8</td>
<td>2530</td>
<td>20/50</td>
<td>9</td>
<td>1700</td>
<td>0.0110</td>
<td>0.0195</td>
<td>22.5</td>
<td>10.5</td>
<td>8.5</td>
</tr>
<tr>
<td>9</td>
<td>2530</td>
<td>20/50</td>
<td>14</td>
<td>2000</td>
<td>0.0111</td>
<td>0.0191</td>
<td>22.5</td>
<td>10.5</td>
<td>8.5</td>
</tr>
<tr>
<td>10</td>
<td>2530</td>
<td>20/50</td>
<td>15</td>
<td>2000</td>
<td>0.0112</td>
<td>0.0190</td>
<td>22.5</td>
<td>11.5</td>
<td>10.5</td>
</tr>
<tr>
<td>11</td>
<td>2530</td>
<td>20/50</td>
<td>21</td>
<td>2000</td>
<td>0.0124</td>
<td>0.0179</td>
<td>22.5</td>
<td>12.5</td>
<td>10.5</td>
</tr>
<tr>
<td>12</td>
<td>3010</td>
<td>30/50</td>
<td>9</td>
<td>2200</td>
<td>0.0119</td>
<td>0.0168</td>
<td>27</td>
<td>10.5</td>
<td>11.5</td>
</tr>
<tr>
<td>13</td>
<td>3010</td>
<td>30/50</td>
<td>22</td>
<td>2000</td>
<td>0.0149</td>
<td>0.0168</td>
<td>27</td>
<td>12.5</td>
<td>12.5</td>
</tr>
<tr>
<td>14</td>
<td>3010</td>
<td>30/50</td>
<td>24</td>
<td>2000</td>
<td>0.0150</td>
<td>0.0168</td>
<td>27</td>
<td>14.5</td>
<td>13.5</td>
</tr>
</tbody>
</table>

Table II: Synthesis Results of ADBP processing nodes.

Table II shows the synthesis results of the main processing elements of ADBP algorithm for various values of \( M \). Following notations are adopted in table to represent the area figures.

- \( A_{\text{Sum}} \) and \( A_{\text{Rep}} \) denote respectively the area in \( \mu \)m\(^2\) of a two input sum and repetition node FU.
- \( A_{\text{LiFO}} \) denote the area in \( \mu \)m\(^2\) of a single LIFO for a given node degree \( n \).
- \( A_{\text{CN}} \) and \( A_{\text{VN}} \) denote respectively the area in \( \mu \)m\(^2\) of a single sum and repetition PE of degree \( n \).

The table also reports the gate count (G.C) i.e. total number of 2-input Nand gates for both CN and VN PEs for various values of \( M \) and node degrees \( n \). The throughput for the whole decoder is also reported in table which is calculated using (15) for full serial architecture i.e. \( P = 1 \) with \( R_c = \frac{3}{5} \) and \( d_v = 2 \).

The exact comparison in terms of performance and hardware complexity of complete ADBP decoder with the state of the art NB LDPC decoders is difficult due to different design choices e.g. cardinality of Galois fields, CMOS technology process, operating frequency, parallelism, different block lengths, code rates and variable (check) node degree distributions. However, the complexity comparison at the PE level is possible and presented here. One recent work in the domain of non binary LDPC decoders is [13] in which the authors propose an \( M = 32 \) decoder with Trellis based implementation of forward backward check node. The main processing core of [13] consists of an iterative decoder processor (IDP) which implements the combined functionality of a single CN with \( d_e = 27 \) and VN with \( d_v = 4 \). The IDP is synthesized on 90 nm technology and has a gate count G.C of 5 Million gates. In case of ADBP, the combined area of sum and repetition PE’s of degrees \( d_e = 27 \) and \( d_v = 4 \) respectively is 0.038 mm\(^2\) at 45nm. This area is multiplied by 4 to obtain equivalent area at 90nm technology. Finally, the result is divided by the area of a single 2-input nand gate to obtain the G.C at 90nm which is 0.06 Million gates. This clearly demonstrates the logic area advantage of ADBP over the existing NB decoders. In addition, the non binary LDPC decoder of [13] achieves a throughput of 234 Mbit/s with parameters \( N \eta, C, K \) and \( R_c \) equal to 5, 250MHz, 726 and 0.86 respectively. For the same parameters, the proposed ADBP decoder is able to achieve a throughput of 833Mbits/sec.
which is almost 3.5 times higher than the decoder in [13]. However, the simple regular mod-M encoder used in this paper actually does not provides performances competitive with those of LDPC encoders constructed on fields. The exceptional complexity reduction achieved from using the ADBP however motivates for further research effort in the design of good LDPC encoders within the class.

The synthesis results of Tab. II clearly demonstrate that the ADBP algorithm achieves a comparable throughput with affordable complexity. In addition the complexity increases very slightly moving to higher cardinalities which shows feasibility of this algorithm for high values of M.

VI. Conclusions.

After almost two decades of research, binary LDPC codes have gained a wide diffusion in several fields and excellent implementations exist for their decoding. However, the problem of developing efficient decoders for non binary LDPC codes is far from being solved. In particular, the implementation complexity of non binary LDPC decoders tends to grow rapidly with the cardinality of the symbol alphabet, which severely limits the achievable spectral efficiency. In the first part of this paper, we introduce several simplifications in the previously proposed Analog Digital Belief Propagation algorithm. Provided simulation results show that these approximations do not affect significantly decoding performance, but enable the practical implementation of ADBP. In the second part of the paper, the fixed point model of ADBP is developed, showing that between 5 and 10 bits must be allocated to represent external and internal quantities with limited or null effect on performance. Finally, in the last part of the work, we propose and detail the implementation architecture of key processing nodes. Synthesis results obtained for multiple sizes of the symbol alphabet prove that: (i) the required area to implement processing nodes is affordable, and (ii) the complexity grows very weakly with the size of the alphabet.

References