[Article] Multiplierless, Folded 9/7 - 5/3 Wavelet VLSI Architecture

Original Citation:

Availability:
This version is available at: http://porto.polito.it/1788971/ since: April 2008

Publisher:
IEEE

Published version:
DOI:10.1109/TCSII.2007.900354

Terms of use:
This article is made available under terms and conditions applicable to Open Access Policy Article ("Public - All rights reserved"), as described at http://porto.polito.it/terms_and_conditions.html

Porto, the institutional repository of the Politecnico di Torino, is provided by the University Library and the IT-Services. The aim is to enable open access to all the world. Please share with us how this access benefits you. Your story matters.

(Article begins on next page)
Multiplierless, folded 9/7 - 5/3 wavelet VLSI architecture

Maurizio Martina, Member IEEE, Guido Masera, Member IEEE

Abstract—This paper proposes a multiplierless VLSI architecture for the famous 9/7 wavelet filters. The novelty of this architecture is the possibility to compute the 5/3 wavelet results into the 9/7 data-path with a reduced number of adders compared to other solutions. The multiplierless architecture has been characterized in terms of performance through simulations into a JPEG2000 environment and compared to other solutions. Implementation on a 0.13 µm standard cell technology shows that the proposed architecture compared to other multiplierless architectures requires a reduced amount of logic with excellent performance.

Index Terms—JPEG2000, Wavelet, Multiplierless Implementation, Filter Bank, VLSI

I. INTRODUCTION

The Discrete Wavelet Transform (DWT) is widely employed in many image and video compression systems due to its excellent decorrelation properties [1]. In particular, many famous coders have been proposed to effectively compress images or frames processed via the DWT [2]. Besides, the DWT is employed in JPEG2000, the new standard for image compression, where the 9/7 [3] and the 5/3 [4] wavelet filters are employed as the default filters for lossy and lossless compression respectively [5].

Several recent publications describe efficient implementations of JPEG2000 encoders and decoders [6]. Moreover, many research works have faced the problem of reducing the DWT complexity. This issue has been investigated mainly from two perspectives: i) reducing the memory access overhead [7], [8], [9]; ii) reducing the DWT computational complexity [9], [10]. Recently, reduced complexity solutions, involving multiplierless implementations either through filter banks [11], [12], [13], [14] or lifting scheme [15], [16], [17] have been proposed.

The aim of this paper is to embed the 5/3 wavelet computation into the 9/7, in order to exploit as much as possible the 5/3 results to achieve the 9/7 ones. To the best of our knowledge, this is the first work where not only hardware resources, but also the output values obtained via the 5/3 wavelet filters are used to compute the 9/7 results. In [11], [12] and [14] the 9/7 filters symmetry is exploited by adding input samples (xi) together to obtain wi = xi+i+1 + xi+i, w2 = xi+i+2 + xi+i−2 and so on. Some multiplierless solutions (e.g. [11], [12]) are based on butterfly structures where the wi values are first added together, then partial results are shifted and combined to obtain the output values. On the other hand in [14] wi are first partially shifted and then added. To obtain the 5/3 wavelet results different shift amounts have to be applied to each wi, as detailed in section II. As a consequence, the solution proposed in [14] is more suited to embed the 5/3 calculation and reuse partial results instead of adding resource to support both the 9/7 and the 5/3 wavelet filters. The proposed architecture, that is derived from the multiplierless 9/7 filter bank (FB) proposed in [14], embodies the 5/3 FB and further reduces the number of adders required from 21 to 19, granting the same performance. It is worth pointing out that, even if it is not a goal of this work, the proposed architecture allows on-the-fly switching between the 5/3 and 9/7 wavelet filters.

II. THEORETICAL DERIVATION

Let’s consider the filter bank shown in Fig. 1, where

\[ H(z) = \sum h[i]z^{-i} \] and \[ G(z) = \sum g[i]z^{-i} \] are the low pass and high pass analysis filters with length k and l respectively, and \[ \tilde{H}(z) = \sum \tilde{h}[i]z^{-i} \] and \[ \tilde{G}(z) = \sum \tilde{g}[i]z^{-i} \] the low pass and high pass synthesis ones with length \( k \) and \( l \).

In [14] it has been proved that the 9/7 wavelet filters can be decomposed into two stages. The first stage involves only rational factors, whereas the second one requires multiplications with 5 constants:

1. Considering the two analysis filters \( J_1 = (a+b+3/2)/2a, K_2 = (b+1)/8a, K_3 = 1/32a, J_1 = (r + 1/2)/2r \) and \( J_2 = 1/8r \), where \( r \) is the real solution of the third order equation

\[ 1/20 + 4/20x + 10/20x^2 + x^3 = 0 \] (1)

and \( a \) and \( b \) are the product and the sum of the two complex conjugate solutions respectively.

As suggested in [14], accepting a small loss in terms of Peak Signal to Noise Ratio (PSNR) the 5 constants can be approximated with few power of 2 values: \( K_1 \approx 2 + 1/16, K_2 \approx 1, K_3 \approx 1/8 + 1/16 + 1/32, J_1 \approx 1 + 1/4 \) and \( J_2 \approx 1/4 + 1/8 \). Considering the two analysis filters \( h_{0.7}^{9.7}[n] \) and \( g_{0.7}^{9.7}[n] \) as vectors \((h_{0.7}^{9.7} \text{ and } g_{0.7}^{9.7})\), we can represent them as the product of a matrix and a vector, where the matrix contains the rational factors and the vector contains the constants - see (2) and (3). Besides, \( h_{0.7}^{9.7}[n] \) and \( g_{0.7}^{9.7}[n] \) symmetry suggests, for the sake of simplicity, to concentrate only on taps with index
For the low-pass coefficients we have (see Table I):

\[ h^{9,7} = M \cdot K \]

and

\[ g^{9,7} = N^{R_5,C_3} \cdot J \]

where

\[ M = \begin{pmatrix}
\frac{6}{8} & \frac{2}{8} & \frac{4}{8} \\
\frac{2}{8} & \frac{8}{8} & \frac{4}{8} \\
0 & 0 & \frac{2}{8}
\end{pmatrix}, \quad K = \begin{bmatrix}
K_1 \\
K_2 \\
K_3
\end{bmatrix}, \quad J = \begin{bmatrix}
J_1 \\
J_2
\end{bmatrix} \]

and

\[ N = \begin{pmatrix}
\frac{6}{8} & \frac{2}{8} & \frac{2}{8} \\
\frac{2}{8} & \frac{4}{8} & \frac{2}{8} \\
0 & 0 & \frac{2}{8}
\end{pmatrix}, \quad N^{R_5,C_3} = \begin{pmatrix}
\frac{6}{8} & \frac{2}{8} & \frac{2}{8} \\
\frac{2}{8} & \frac{4}{8} & \frac{2}{8} \\
0 & 0 & \frac{2}{8}
\end{pmatrix} \]

Since \( G(z) = \tilde{H}(-z) \), odd terms have different sign and \( N \) can be directly derived from \( M \) changing the sign of odd terms (see Table I \( n \in \{1, 3\} \)). As it can be observed \( N^{R_5,C_3} \) is obtained removing the 5th row and the 3rd column of \( N \).

Let’s focus on the first stage defined by \( M \) and \( N^{R_5,C_3} \); grouping together some of the first stage elements we obtain the 5/3 filter. For the sake of simplicity, we exploit the 9/7 and the 5/3 filters symmetry to group together samples that ought to be multiplied by the same tap \( h \). So that we have

\[ w_0 = x_i, \quad w_1 = x_{i+1} + x_{i-1}, \quad w_2 = x_{i+2} + x_{i-2} \]

and so on.

For the low-pass coefficients we have (see Table I):

\[ y^{9,7}_i = h^{9,7}[0]x_i + h^{9,7}[1](x_{i+1} + x_{i-1}) + \cdots = \left( \frac{6}{8}K_1 - \frac{8}{8}K_2 + \frac{2}{8}K_3 \right)w_0 + \cdots \]

\[ = \left( \frac{6}{8}w_0 + \frac{4}{8}w_1 + \frac{1}{8}w_2 \right)K_1 + \cdots \]

where

\[ p1 = \frac{6}{8}w_0 + \frac{4}{8}w_1 + \frac{1}{8}w_2 \]  

\[ p2 = \frac{8}{8}w_0 + \frac{7}{8}w_1 + \frac{4}{8}w_2 + \frac{1}{8}w_3 \]  

\[ p3 = \frac{2}{8}w_0 + \frac{2}{8}w_1 + \frac{2}{8}w_2 + \frac{1}{8}w_3 \]  

Since for the 5/3 low-pass coefficient holds true:

\[ y^{5,3}_i = \frac{6}{8}w_0 + \frac{2}{8}w_1 - \frac{1}{8}w_2 \]

we can rewrite \( p1 \) as

\[ p1 = \frac{6}{8}w_0 + \frac{2}{8}w_1 + \frac{2}{8}w_2 - \frac{1}{8}w_2 \]

\[ = \frac{6}{8}w_0 + \frac{2}{8}w_1 - \frac{1}{8}w_2 + \frac{2}{8}w_2 \]

\[ = y^{5,3}_i + \frac{2}{8}w_1 + \frac{2}{8}w_2 \]  

With similar considerations we derive for the high-pass coefficient:

\[ y^{5,3}_h = q1J1 - q2J2 \]  

where

\[ q1 = \frac{6}{8}w_0 - \frac{4}{8}w_1 + \frac{1}{8}w_2 \]  

\[ q2 = \frac{8}{8}w_0 - \frac{7}{8}w_1 + \frac{4}{8}w_2 - \frac{1}{8}w_3 \]

Then

\[ y^{5,3}_h = \frac{8}{8}w_0 - \frac{4}{8}w_1 \]  

So that we can build a folded architecture that exploits the 5/3 coefficients to obtain the 9/7 ones rewriting \( p2 \) and \( q1 \) as

\[ p2 = z^{5,3}_i + \frac{3}{8}w_1 + \frac{4}{8}w_2 + \frac{1}{8}w_3 \]  

\[ q1 = z^{5,3}_h - \frac{2}{8}w_1 + \frac{2}{8}w_2 \]

with

\[ z^{5,3}_i = \frac{8}{8}w_0 + \frac{4}{8}w_1 \]  

\[ z^{5,3}_h = \frac{6}{8}w_0 - \frac{2}{8}w_1 - \frac{1}{8}w_2 \]

III. PERFORMANCE AND EXPERIMENTAL RESULTS

The multiplierless FB described in section II allows to obtain lossless compression when the 5/3 wavelet is selected. On the other hand when the 9/7 is employed the performance detailed in [14] are achieved. In order to compare the proposed FB architecture with the multiplierless LS 9/7 solution proposed in [17], simulations inside the JPEG2000 image coding standard [5] framework have been performed.

A free implementation of a JPEG2000 codec written in C language, openjpeg [18], that is Class-1 Profile-1 compliant with the standard, has been employed for our tests. Five standard images have been used for the test: ‘Lenna’ \( 256 \times 256 \) (I=1), ‘Barbara’ \( 512 \times 512 \) (I=2), ‘Boat’ \( 512 \times 512 \) (I=3), ‘Goldhill’ \( 512 \times 512 \) (I=4) and ‘Fingerprint’ \( 512 \times 512 \) (I=5) [19]. The number of wavelet decomposition levels \( L \) has been varied from 1 to 3 for 256 \( \times 256 \) images and from 1 to 4 for 512 \( \times 512 \) images. Different compression rates have been imposed, namely 1, 0.5, 0.25 and 0.125 bit per pixel (bpp), precinct and code-block size are the encoder default values [5].
The results of our experiments are summarized in Table II where in column $A$ the PSNR values obtained with the original openjpeg implementation are shown. In column $B$ we give the results of the multiplierless 9/7 filter bank described in [14]. These results are obtained implementing the multiplierless 9/7 FB described in [14] at the encoder side and decoding the bitstream with the standard 9/7 LS decoder. In column $C$ the results obtained with the multiplierless 9/7 LS described in [17] are shown. These results are obtained implementing the multiplierless 9/7 LS described in [17] at the encoder side and decoding the bitstream with the standard 9/7 LS decoder. We can observe that $B$ and $C$ grant very close performance compared to the ones available with the original openjpeg implementation ($A$). As detailed in [14], [15] and [17] this relevant figure is achieved thanks to the filtering zeros position that is extremely close to the original 9/7 filters one. Furthermore, in some cases non linearities caused by the quantization and the optimal truncation performed by EBCOT produces slightly better results in terms of PSNR with $B$ or $C$ than with $A$. Since in some other cases $A$ is better than $B$ or $C$, we can conclude that the difference among the three models is very limited (from few cents to some fractions of dB). This confirms that the hardware simplification achieved through [14] and [17] multiplierless solutions does not worsen the DWT performance.

### IV. Proposed Architecture

The proposed architecture is based on the multiplierless data-path (Fig. 2) that implements (4) and (10), embedding the 5/3 calculation. The main idea is to consider that the downsampling required by filter bank implementation (see Fig. 1) allows to alternatively generate a low pass output sample and a high pass output sample. Considering (8) and (18), we can observe that $y_{l_{i}}$ and $z_{h_{i}}$ differ only in the sign of the $w_{1}$ term. The same consideration can be extended to $y_{h_{i}}$ and $z_{l_{i}}$ observing (13) and (17). As a consequence, exploiting these properties, during the low pass cycle both $y_{h_{i}}$ and $z_{l_{i}}$ are produced, whereas during the high pass cycle $y_{l_{i}}$ and $z_{h_{i}}$ are generated.

In order to reduce the complexity of the proposed architecture to generate the 9/7 results, we can rewrite (4) and (10) as a function of $y_{l_{i}}$, $z_{l_{i}}$, $y_{h_{i}}$, $z_{h_{i}}$ and $w_{1}$ with $i \in \{0, 1, 2, 3, 4\}$:

$$
y_{l_{i}}^{9,7} = \left( 2 + \frac{1}{16} \right) y_{l_{i}}^{5,3} - z_{l_{i}}^{5,3} + \frac{1}{4} \left( \frac{1}{4} + \frac{1}{8} \right) (w_{0} + w_{2}) + \left( \frac{1}{2} - \frac{1}{16} \right) w_{4} + \left( w_{1} - \frac{1}{16} w_{3} \right) + \frac{1}{2} w_{2} \right) \right] \right] \right] \right]$$

$$
y_{h_{i}}^{9,7} = \left( 1 + \frac{1}{4} \right) y_{h_{i}}^{5,3} - \left( \frac{1}{4} + \frac{1}{8} \right) y_{l_{i}}^{5,3} + \frac{1}{4} \left( \frac{1}{4} + \frac{1}{2} + \frac{1}{4} w_{3} \right) + \left( w_{1} - \frac{1}{16} w_{3} \right) + \frac{1}{2} w_{2} \right) \right] \right] \right] \right]$$

In (19) and (20), the terms required to generate $y_{l_{i}}^{9,7}$ and $y_{h_{i}}^{9,7}$ are split in three main contributions written on three different lines. As an example the last contributions of $y_{l_{i}}^{9,7}$ and $y_{h_{i}}^{9,7}$ (third line) differ only in the sign of the first term. Thus the proposed architecture can be obtained by combining the three contributions and adding few multiplexers to account for $y_{l_{i}}^{9,7}$ and $y_{h_{i}}^{9,7}$. In Fig. 2, the proposed architecture is shown where solid lines represent 16 bits wide data and dashed lines represent control signals. The gray box on the top of Fig. 2 contains the adders required to generate the $w_{1}$ terms. The light-gray box labeled with 5/3 contains the programmable adders/subtracters required to compute $y_{l_{i}}^{5,3}$, $z_{l_{i}}^{5,3}$, $y_{h_{i}}^{5,3}$ and $z_{h_{i}}^{5,3}$ as described in (8), (17), (13) and (18). The three gray boxes in the center of the figure implement the three contributions of (19) and (20) employing three, four and two adders respectively and few multiplexers to select the low pass or high pass contributions. In particular, the lo_hin signal
The architecture described in section IV has been described in VHDL resorting to only 19 adders. The logical synthesis results on a 0.13 \( \mu \)m standard cell technology with a target clock frequency of 200 MHz confirmed the reduced amount of logic required (2.69 kgares) and its low power consumption (9.74 mW). In Table III, several FB and LS architectures are considered and their performance (PSNR and throughput), complexity (kgares) and power consumption summarized. If we consider multiplierless implementations [11] requires 43 adders, [13] requires 32 adders, [12] requires 27 adders, [14] requires 21 adders, whereas the proposed architecture requires only 19 adders. As far as multiplierless FB architectures are concerned, logical synthesis results summarized in Table III take into account the combinational logic (including the adders) and the register placed on the data-path output (\( y \)). On the other hand, even if the multiplierless LS architectures require only 19 [15] or 15 [17] adders, the LS serial nature imposes to use a greater number of registers compared to FB architectures, leading to an increase in terms of complexity. As a consequence, we can observe that the proposed multiplierless FB architecture with embedded 5/3 computation shows a computational complexity reduction with respect to other multiplierless solutions, together with excellent performance in terms of PSNR and a reduced power consumption. Moreover, the inherent flexibility of the proposed architecture, that supports both the 9/7 and 5/3 wavelet filters, has been obtained with no penalties in terms of complexity or performance.

It is known that several systematic methods to design complexity-aware multiplierless filters have been proposed in the literature (e.g. [20], [21], [22], [23]). Even if this work does not make use of these techniques, it is interesting to compare the number of adders required by the proposed architecture (19 adders) to the number of adders achieved with a systematic method. In [22] a heuristic search algorithm based on a genetic algorithm, called CSDC, is developed. However, said \( N \) the number of taps, CSDC shows to achieve best results on long filters, \( N \geq 60 \). Moreover [23], where the KMSD algorithm is proposed, proves that the n-Dimensional Reduced Adder Graph (RAG-n) algorithm [21] is best for short filters \( N < 12 \). Since this is the case of the proposed architecture, we investigate the number of adders required to implement (19) and (20) resorting to the Canonical Signed Digit (CSD) representation [20] and to the RAG-n algorithm [21]. In order to make the comparison as fair as possible we consider the 5/3 block outputs and the \( w_i \) values as inputs (\( \hat{x} \)) and we apply the CSD and the RAG-n methodologies to the equivalent low pass (\( \hat{h} \)) and high pass (\( \hat{g} \)) filters \( \hat{y}_h^g = \hat{h} \cdot \hat{x} \) and \( \hat{y}_h^g = \hat{g} \cdot \hat{x} \) with \( \hat{h} = [33/16 -1 7/128 1/4 23/128 -1/64 7/256] \) (\( \hat{h} \) length is 7) and \( \hat{g} = [5/4 -3/8 0 -11/64 1/8 3/64 0] \) (\( \hat{g} \) length is 5). This allows to obtain a multiplierless architecture able to support both the 9/7 and the 5/3 wavelet filters. To simplify the comparison we split the number of required adders in four contributions: 1) 4 adders to generate the \( w_i \) values, 2) 4 adders required by the 5/3 block, 3) 6 adders to combine the 7 elements produced by \( \hat{h} \), 4) \( m \) adders required by the filters coefficients when CSD or RAG-n is employed (\( m_{\text{CSD}} \) or \( m_{\text{RAG-n}} \)). Since the low pass and the high pass outputs are alternatively generated, the adders required by \( \hat{h} \) and \( \hat{g} \) can be shared introducing few multiplexers. It can be easily obtained that \( m_{\text{CSD}} = 6 \) and \( m_{\text{RAG-n}} = 7 \), if adders sharing is employed \( m_{\text{CSD}} = 7 \), leading to 21 adders. On the other hand exploiting RAG-n precomputed table [24], we obtain \( m_{\text{RAG-n}} = 5 \) and \( m_{\text{RAG-n}} = 5 \), thus \( m_{\text{RAG-n}} = 5 \), leading to 19 adders. It is worth observing that \( \hat{h} \) and \( \hat{g} \) can be represented as cost 0, cost 1 and cost 2 RAG-n elements, so the heuristic part of the RAG-n algorithm is not employed. As a consequence the result obtained with the RAG-n algorithm is optimal. Since the proposed architecture requires the same number of adders required by the RAG-n solution, the proposed architecture is Pareto-optimal.

In order to have an accurate estimation of the power consumption figure of the proposed architecture, switching activity values are required. To this purpose, the proposed architectures drives the multiplexers to select low pass or high pass output values whereas the \( S97_{53n} \) signal allows to select between the 9/7 and the 5/3 results.

V. IMPLEMENTATION RESULTS AND COMPARISON

<table>
<thead>
<tr>
<th>Arch.</th>
<th>Type [dB]</th>
<th>PSNR</th>
<th>M</th>
<th>A</th>
<th>R</th>
<th>kgares</th>
<th>P [mW]</th>
<th>T</th>
</tr>
</thead>
<tbody>
<tr>
<td>[10]-I</td>
<td>FB</td>
<td>9</td>
<td>14</td>
<td>9</td>
<td>14.65</td>
<td>35.66</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>[13]</td>
<td>FB</td>
<td>0</td>
<td>32</td>
<td>9</td>
<td>-</td>
<td>-</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>[11]</td>
<td>FB</td>
<td>0</td>
<td>33</td>
<td>9</td>
<td>5.39</td>
<td>33.90</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>[12]</td>
<td>FB</td>
<td>0</td>
<td>27</td>
<td>9</td>
<td>4.17</td>
<td>15.15</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>[14]</td>
<td>FB</td>
<td>0</td>
<td>21</td>
<td>9</td>
<td>2.81</td>
<td>12.38</td>
<td>1</td>
<td></td>
</tr>
<tr>
<td>[10]-II</td>
<td>LS</td>
<td>A</td>
<td>4</td>
<td>8</td>
<td>6</td>
<td>12.45</td>
<td>29.43</td>
<td>2</td>
</tr>
<tr>
<td>[16]</td>
<td>LS</td>
<td>A</td>
<td>4</td>
<td>8</td>
<td>4</td>
<td>10.10</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>[9]</td>
<td>LS</td>
<td>A</td>
<td>2</td>
<td>4</td>
<td>20</td>
<td>-</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td>[15]</td>
<td>LS</td>
<td>C</td>
<td>0</td>
<td>19</td>
<td>14</td>
<td>7.43</td>
<td>11.36</td>
<td>2</td>
</tr>
<tr>
<td>[17]</td>
<td>LS</td>
<td>C</td>
<td>0</td>
<td>15</td>
<td>14</td>
<td>3.79</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
<td>Our</td>
<td>FB</td>
<td>B</td>
<td>0</td>
<td>19</td>
<td>9</td>
<td>2.69</td>
<td>9.74</td>
<td>1</td>
</tr>
</tbody>
</table>

Figure 2. Proposed folded 9/7 and 5/3 FB architecture – 16 bits wide data-path
architecture has been interfaced to an external memory and a micro controller through a FIFO and a simple control unit as depicted in Fig. 3. LINE_LEN is the number of processed samples (i.e. the current row or column length) and start allows to start the elaboration, whereas a done is generated when LINE_LEN pixels have been elaborated. Besides, the S97_53n signal allows to switch between the 9/7 and the 5/3 filters. Finally, the architecture validates every correct output value (y) through a valid signal.

The complete design flow (including place and route) has been performed for the proposed architecture with its control unit. Annotating the design activity on 256 samples of the standard image 'Lenna' 256 × 256 into Cadence-Encounter power estimation environment with fclk=200MHz, we obtained an average power consumption of 10.27 mW.

Through VHDL simulation we observe that the proposed architecture requires 264 clock cycles to elaborate 256 samples (i.e the current row or column length) and to generate 128 low-pass and 128 high-pass results at standard image 'Lenna' 256 × 256 at 200 MHz, being able to sustain 30 frames per second of HD video sequences (1440 × 1080) with an average power consumption of 10.27 mW.

VI. CONCLUSIONS

In this paper, a novel multiplierless 9/7 wavelet VLSI architecture has been presented. The novelty of the paper stems from the use of the 5/3 wavelet results to decrease the 9/7 architecture complexity. The proposed architecture has been compared in terms of performance and complexity with other multiplierless 9/7 filter banks and lifting scheme solutions showing lower complexity with comparable performance. Finally, it is noticeable that the proposed architecture can run at 200 MHz, being able to sustain 30 frames per second of HD video sequences (1440 × 1080) with an average power consumption of 10.27 mW.

REFERENCES