Pipelined Viterbi Decoder Using FPGA

Convolutional encoding is used in almost all digital communication systems to improve the Bit Error Rate (BER), and most applications need a high throughput rate. The Viterbi algorithm is the standard solution for the decoding process. The nonlinear, feedback nature of the Viterbi decoder makes a high-speed implementation difficult. One promising approach to high throughput in the Viterbi decoder is pipelining. This study applies a carry-save technique, which has the advantage that the critical path in the Add-Compare-Select (ACS) feedback loop runs in one direction only and the carry ripple in the "Add" part of the ACS unit is removed. Simulation and implementation show how this technique improves the throughput of the Viterbi decoder. The design complexity of the bit-pipelined architecture is evaluated and demonstrated using Verilog HDL simulation. A general software algorithm that simulates a Viterbi decoder was also developed. Our research is concerned with the implementation of Viterbi decoders on Field Programmable Gate Arrays (FPGAs). FPGAs are generally slower than custom integrated circuits, but they can be configured in the lab in a few hours, whereas fabrication takes a few months. The design was implemented in Verilog HDL and synthesized for Xilinx FPGAs.


INTRODUCTION
Convolutional codes are preferred over block codes because of their simple decoding using the Viterbi algorithm with soft decisions. The convolutional coder is used in many digital transmission systems (deep space communication, satellite communication, cellular and most wireless communication) where the signal-to-noise ratio is low. It is also used in storage devices such as hard disks and compact disks to improve data retrieval and lengthen the lifetime of the components (Lin and Costello, 1982).
The convolutional coder achieves error-free transmission by adding enough redundancy to the source symbols. The choice of the convolutional code depends mostly on the application and is a compromise between complexity, power, performance and bandwidth. In this design, a pipelined Viterbi decoder is implemented that supports a convolutional code of rate 1/2 and constraint length K = 7 with generator sequences $g_0 = 133_8$ and $g_1 = 171_8$.
This study examines the effect of the quantizer limits and the number of soft-decision bits on the gain of the Viterbi decoder, as well as the speed increase obtained by pipelining and its cost in hardware complexity. All of this is carried out using MATLAB simulation and implementation on FPGA cards.
The classical method to realize the Viterbi decoder is iterative calculation with memory trace-back, which is effective for a moderate decoding speed and a long constraint length. For a high-speed decoder, iterative calculation and memory trace-back become the bottlenecks for the throughput, which also depends on the word length. This study presents a pipelined method to realize the Viterbi decoder on an FPGA, in which the critical path is broken down, saving time and increasing the speed of the decoder. The gain of the decoder, and hence the required Signal-to-Noise Ratio (SNR), is also affected by the number of soft-decision bits and the quantizer limits. With all of these, the throughput can reach 4 times that of the normal Viterbi decoder.
Fig. 1: Viterbi decoder main units
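As an illustration of the code described above, the following is a minimal Python sketch of a rate 1/2, K = 7 feed-forward encoder with the generators $133_8$ and $171_8$. The function names and the bit-ordering convention (newest input bit aligned with the most significant generator bit) are assumptions made for this example, not details taken from the design.

```python
def parity(x: int) -> int:
    """Return the XOR of all bits of x."""
    return bin(x).count("1") & 1


def conv_encode(bits, generators=(0o133, 0o171), K=7):
    """Rate 1/n feed-forward convolutional encoder (illustrative sketch).

    Assumed convention: the newest input bit is aligned with the MSB of
    each K-bit generator; the state holds the K-1 previous input bits.
    """
    state = 0
    out = []
    for b in bits:
        reg = (b << (K - 1)) | state      # K-bit shift register contents
        for g in generators:
            out.append(parity(reg & g))   # one output bit per generator
        state = reg >> 1                  # shift: oldest bit drops out
    return out


# The impulse response reproduces the generator bit patterns.
impulse = conv_encode([1, 0, 0, 0, 0, 0, 0])
print(impulse[0::2])  # g0 taps: 1011011 (133 octal)
print(impulse[1::2])  # g1 taps: 1111001 (171 octal)
```

Feeding a single 1 followed by zeros is a quick way to check that the taps match the intended generator sequences.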

Viterbi decoder architecture:
The Viterbi decoder can be implemented with three basic units, as in Fig. 1. The Branch Metrics Unit (BMU) calculates the branch metrics for each incoming bit. The Add-Compare-Select Unit (ACSU) adds the branch metrics to the path metrics to calculate the new path metrics. It then compares the path metrics and selects the minimum or the maximum, according to the implementation, as the best path. The Survivor Memory Unit (SMU) processes the outputs of the ACSU to produce the decoded bits.
The operational speed of the Viterbi decoder is limited by the ACS unit. To implement a high-speed Viterbi decoder, we can introduce pipelining. This is easy to do in the BMU and SMU, since they are purely feed-forward paths. The problem lies in the ACSU because of its feedback path: in this unit the path metrics are updated at each time step, which causes the feedback. This can be seen from the state diagram in Fig. 2. A closer look at the state diagram, showing the ACSU, is given in Fig. 3. A high throughput rate is achieved by making the critical path very short. The critical path of a synchronous circuit is defined as the path between two buffers (e.g., flip-flops) with the largest propagation delay, and it therefore determines the maximum achievable clock frequency of the circuit. As shown in Fig. 4, the critical path here is in the feedback loop.
At bit level, we can show the critical path in detail, assuming a word length of 4 bits. As shown in Fig. 4 (Fettweis and Meyr, 1991), the critical path extends over 4 adders and 4 maximum (or minimum) selections. This holds for carry-ripple adders, where the carry must propagate from the LSB to the MSB. The compare decision must start at the MSB of the maximum (or minimum) selector and proceed down to the LSB; hence the critical path depends on the word length.
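To make the roles of the three units concrete, here is a small software sketch of a hard-decision Viterbi decoder in Python. It is not the hardware design of this study: the Hamming-distance branch metric, the minimum-metric selection, the full-frame trace-back from state zero and all function names are assumptions chosen for brevity.

```python
def parity(x):
    return bin(x).count("1") & 1


def conv_encode(bits, gens=(0o133, 0o171), K=7):
    """Reference encoder used only to exercise the decoder below."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state
        out += [parity(reg & g) for g in gens]
        state = reg >> 1
    return out


def viterbi_decode(received, gens=(0o133, 0o171), K=7):
    """Hard-decision Viterbi decoding of a zero-terminated frame."""
    n = len(gens)
    n_states = 1 << (K - 1)
    INF = float("inf")
    pm = [0] + [INF] * (n_states - 1)        # encoder starts in state 0
    survivors = []
    for t in range(0, len(received), n):
        symbol = received[t:t + n]
        new_pm = [INF] * n_states
        decision = [(0, 0)] * n_states       # (previous state, input bit)
        for state in range(n_states):
            if pm[state] == INF:
                continue                     # state not reachable yet
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                expected = [parity(reg & g) for g in gens]
                # BMU: Hamming distance between expected and received bits
                bm = sum(e != r for e, r in zip(expected, symbol))
                nxt = reg >> 1
                cand = pm[state] + bm        # ACSU: add
                if cand < new_pm[nxt]:       # ACSU: compare and select
                    new_pm[nxt] = cand
                    decision[nxt] = (state, b)
        pm = new_pm
        survivors.append(decision)
    # SMU: trace back from state 0, where a flushed frame must end
    state, bits = 0, []
    for decision in reversed(survivors):
        state, b = decision[state]
        bits.append(b)
    return bits[::-1]


message = [1, 0, 1, 1, 0, 0, 1, 0] + [0] * 6   # 6 zeros flush the encoder
coded = conv_encode(message)
coded[3] ^= 1                                  # inject two channel errors
coded[17] ^= 1
print(viterbi_decode(coded) == message)
```

The inner `cand < new_pm[nxt]` update is the recursion that forms the feedback loop discussed in this section: each new path metric depends on the metric selected in the previous step.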

Carry-Save (CS) representation:
In carry-save representation, the carry does not propagate to the next adder; instead it is saved alongside the sum of the next adder, as shown in Fig. 5b. Figure 5a shows the carry-ripple representation, in which the second adder must wait for the carry of the first adder, so the word length affects the speed of the adder. In carry-save mode the two adders take the same time.
In CS the carry and the sum are combined into a new digit, $v_i = s_i + c_i$, which can take the values $v_i \in \{0, 1, 2\}$, so that the represented number is $S = \sum_i v_i 2^i = \sum_i (s_i + c_i) 2^i$.
The carry-save representation has a redundant entry: both ($c_i = 1$, $s_i = 0$) and ($c_i = 0$, $s_i = 1$) give a sum equal to 1. This is shown in Table 1.
The task of the carry-save maximum selection (Wicker, 1995; Ciletti, 1999) is to select the maximum of the inputs A and B, which are in carry-save format. The output is G = max(A, B), where $A = \sum_i a_i 2^i$, $B = \sum_i b_i 2^i$ and $G = \sum_i g_i 2^i$, with $a_i, b_i, g_i \in \{0, 1, 2\}$.
The carry-save maximum selection starts at the MSB, and there are three cases for $a_{MSB}$ and $b_{MSB}$, where the MSB has index w-1 for a word length of w bits. Figure 7 shows the two bit levels w-1 and w-2.
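The digit-wise definition above can be checked numerically. The following Python sketch is only an illustration; the LSB-first list layout and the function names are assumptions of this example. It evaluates a carry-save word and adds a binary operand with one independent full adder per bit position, so that no carry ripples across the word.

```python
def cs_value(c, s):
    """Value of a carry-save word; c and s are LSB-first bit lists."""
    return sum((ci + si) << i for i, (ci, si) in enumerate(zip(c, s)))


def cs_add(c, s, x):
    """Add the binary word x to the carry-save word (c, s).

    Each bit position is handled by one full adder working on its own:
    the carry it produces is stored as the carry bit of the next
    position instead of being propagated through the other adders.
    """
    w = len(s)
    new_s = [0] * (w + 1)
    new_c = [0] * (w + 1)
    for i in range(w):
        total = c[i] + s[i] + x[i]   # full adder with three input bits
        new_s[i] = total & 1         # sum bit stays at position i
        new_c[i + 1] = total >> 1    # carry bit is saved at position i+1
    return new_c, new_s


c, s = [0, 1, 0, 1], [1, 1, 0, 1]      # digits v = 1, 2, 0, 2 -> value 21
x = [1, 0, 1, 1]                       # binary 13
c2, s2 = cs_add(c, s, x)
print(cs_value(c, s), cs_value(c2, s2))

# Redundancy: (c, s) = (1, 0) and (0, 1) both encode the digit value 1.
print(cs_value([1], [0]) == cs_value([0], [1]))
```

Because every loop iteration depends only on the bits of its own position, all positions could be evaluated in parallel in hardware, which is why the adder delay no longer grows with the word length.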
Case-1: $a_{w-1} - b_{w-1} = 2$: In this case the minimum value of A is $2^w$, obtained by setting all $a_i = 0$ for $i < w-1$. The maximum value of B, obtained by setting all $b_i = 2$ for $i < w-1$, is $\sum_{i=0}^{w-2} 2 \cdot 2^i = 2^w - 2$. The difference $A_{min} - B_{max} = 2$ is positive, so A is the maximum and G = A.

Case-2: $a_{w-1} - b_{w-1} = 0$: Here $a_{w-1} = b_{w-1}$ and $g_{w-1}$ equals either of them, so no decision is made until the next stage is examined. It makes no difference whether $g_{w-1}$ is assigned from $a_{w-1}$ or from $b_{w-1}$.

Case-3: $a_{w-1} - b_{w-1} = 1$: In this case (taking $a_{w-1} = 1$, $b_{w-1} = 0$) the minimum value of A is $2^{w-1}$, obtained by setting all $a_i = 0$ for $i < w-1$, and the maximum value of B is $2^w - 2$, obtained by setting all $b_i = 2$ for $i < w-1$. The difference is $A_{min} - B_{max} = 2 - 2^{w-1}$. This case is called a pre-decision for A: since the difference can be greater than, less than or equal to zero, no final decision can be made at this bit level, and the decision is made at the next level. There are three possible cases:

Case-3.1: $b_{w-2} = 0$: In this case the minimum of A is $2^{w-1}$ and the maximum of B is $2^{w-1} - 2$. The difference $A_{min} - B_{max} = 2$, so A is the maximum and G = A.

Case-3.2: $a_{w-2} = 0$, $b_{w-2} = 1$: In this case the minimum of A is $2^{w-1}$ and the maximum of B is $2^{w-1} - 2 + 2^{w-2}$. The difference $A_{min} - B_{max} = 2 - 2^{w-2}$, so again no final decision can be made at this level, as in Case-3 above, and we must look at the next level.

Case-3.3: $a_{w-2} = 0$, $b_{w-2} = 2$: Up to this level A equals B, so the pre-decision for A is removed and the decision procedure starts again on the next bit level. No error is introduced by assigning $g_{w-1} = a_{w-1}$ and $g_{w-2} = a_{w-2}$, since the two upper digits of A and of B represent the same value.
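The case analysis above can be summarized as a small MSB-first state machine. The following Python sketch is an illustration under stated assumptions, not the circuit of this study: the running difference `d` is a hypothetical bookkeeping variable in which 0 means "equal so far", +1 or -1 a pre-decision for A or B, and +2 or -2 a final decision.

```python
from itertools import product


def cs_max_select(a, b):
    """Select G = max(A, B) digit by digit, MSB first.

    a and b are lists of carry-save digits in {0, 1, 2}, MSB first.
    d is the scaled running difference of the digits seen so far:
    0 equal, +1/-1 pre-decision for A/B, +2/-2 final decision.
    """
    d = 0
    g = []
    for ai, bi in zip(a, b):
        if abs(d) < 2:
            # The remaining lower digits can shift the difference by less
            # than 2 units of the current weight, so |d| >= 2 is final.
            new_d = max(-2, min(2, 2 * d + (ai - bi)))
        else:
            new_d = d
        # The digit comes from the branch that was leading before this
        # level; if the branches were equal, from the one leading now.
        lead = d if d != 0 else new_d
        g.append(ai if lead >= 0 else bi)
        d = new_d
    return g


def value(digits):
    """Integer value of an MSB-first digit list."""
    v = 0
    for x in digits:
        v = 2 * v + x
    return v


# Exhaustive check for a word length of 3 digits.
ok = all(
    value(cs_max_select(a, b)) == max(value(a), value(b))
    for a in product(range(3), repeat=3)
    for b in product(range(3), repeat=3)
)
print(ok)
```

A pre-decision for A followed by the digits $a = 0$, $b = 2$ returns the machine to the equal state, which corresponds to Case-3.3: the digits already emitted from A remain valid because they carry the same value as those of B.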
These cases are shown in the flow chart for branch A in Fig. 8 and for branch B in Fig. 9. The design of the Carry-Save Maximum (CSM) selection circuit was proposed in Gemmeke et al. (2002) and Gierenz et al. (2000). The design is based on local comparison: for each branch at each state there is a CSM selection circuit that compares its metric with the maximum of the other branch metrics.
Figures 8 and 9 show that the pre-decision ($d_p$) flag determines whether $a_i$ or $b_i$ is the maximum at the current bit level. By inhibiting the current bits of each branch with its pre-decision flag and OR-ing the two inhibited branches, we obtain the maximum branch bit at the current bit level. This is expressed as a function in Eq. (1) and (2).

CS redundancy suppression:
The first step in designing the Carry-Save Maximum selection is the redundancy suppression circuit (Gemmeke et al., 2002; Gierenz et al., 2000), which eliminates the redundant entry in Table 1 to simplify the maximum selection circuit and reduce its complexity. This is shown in Fig. 10. Figure 8 shows the two required indicators for each branch: one indicates the pre-decision ($d_p$) and the other indicates the final decision ($d_f$). These flags must represent the five states in Fig. 8 and 9, as shown in Table 2. To take the following bits of the word into consideration, the difference of the following bits should be inhibited by the decision flags. Table 3 shows the possible values of the current input decision flags ($dp_i^a$, $df_i^a$, $dp_i^b$, $df_i^b$), the input difference of the current bits $\delta_i = \{c_a, s_a\} - \{c_b, s_b\}$ and the next output decision flags ($dp_{i-1}^a$, $df_{i-1}^a$, $dp_{i-1}^b$, $df_{i-1}^b$) for the five states {(A = B), (A > B), (A < B), (pre-decision A), (pre-decision B)}, where Table 2 represents the truth table of Fig. 11.
The channel model consists of an additive white Gaussian noise source that adds noise to the modulated bits according to the SNR setting. The receiver demodulates the data to produce 4-bit quantized soft decisions, which are provided to the Viterbi decoder. A bit error rate comparator compares the decoded data with the source data to obtain the bit error rate. The interface panel of the MATLAB simulation program is shown in Fig. 13. The user can enter all parameters, so the simulation environment can be modified to evaluate their effect on the Viterbi decoder performance.
The first entry box lets the user enter the number of soft-decision bits, which quantizes the received data to 2^(soft-bits) levels. The second entry gives the choice of truncation of the received bits, that is, determination of the limits of the quantizer, to be at the constellation points or beyond them. The signal-to-noise ratio (SNR = $E_b/N_0$) is entered in dB as a parameter in the third entry: $SNR_{dB} = 10 \log_{10}(E_b/N_0)$, so $E_b = N_0 \cdot 10^{SNR_{dB}/10}$, and for BPSK there is one bit per symbol, so $E_b = E_s$. The packet size can be varied in the fourth entry, and the last entry box is for the generator sequences. The MATLAB code was written in two versions: one implements the carry-ripple Viterbi decoder and the other implements the pipelined carry-save decoder, while the Simulink model simulates the Viterbi decoder in the carry-ripple architecture. All architectures give exactly the same simulation results for the same parameters.
The blocks of the Simulink model are shown in Fig. 14. The first block generates packets of random binary data of fixed length using a Bernoulli model. These bits feed a pad block that appends zeros to each packet to flush the encoder, that is, to end each frame at state zero. A convolutional encoder encodes the packet according to the given generator sequences. These bits are then modulated as BPSK. At the receiver, a quantizer with the given number of soft bits and truncation point quantizes the data. The Viterbi decoder decodes the quantized data. Bit error rate calculations are then done on the decoded data and the source data.
Since the convolutional code is standardized by the 802.11 committee, its performance on multipath channels should be acceptable. Our verification consisted of checking that the bit-level pipelined implementation matches the performance of the standard implementation exactly. For this purpose, testing with one channel model is sufficient. We expect the performance of our implementation to match the standard for any channel type. Since hardware-level simulations take considerably longer, we did not perform multipath simulations at hardware level.
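The effect of redundancy suppression on a single digit can be sketched in a few lines of Python. The canonical encoding chosen here, in which the digit value 1 is always stored as (c, s) = (0, 1), is an assumption of this example; the circuit of Fig. 10 may use the opposite convention. The function names are likewise hypothetical.

```python
def suppress_redundancy(c, s):
    """Map a carry-save digit (c, s) to a canonical pair of equal value.

    Assumed canonical form: the digit value 1 is always encoded as
    (c, s) = (0, 1), so the pair (1, 0) never appears at the output.
    """
    v = c + s
    return (1, 1) if v == 2 else (0, v)


def digit_difference(a_cs, b_cs):
    """Difference of two carry-save digits after redundancy suppression."""
    ca, sa = suppress_redundancy(*a_cs)
    cb, sb = suppress_redundancy(*b_cs)
    return (ca + sa) - (cb + sb)


for c in (0, 1):
    for s in (0, 1):
        print((c, s), "->", suppress_redundancy(c, s))
print(digit_difference((1, 0), (0, 1)))  # both digits have value 1
```

With only three distinct digit encodings left, the comparison logic that follows has fewer input combinations to distinguish.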
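The channel model described above can be sketched in Python as follows. This is an illustration, not the MATLAB code of this study: the mapping of bit 1 to +1 and bit 0 to -1, the unit symbol energy, the `rate` parameter (set to 1.0 by default so that $E_b = E_s$) and the function names are assumptions of this example.

```python
import math
import random


def noise_sigma(ebno_db, es=1.0, rate=1.0):
    """Noise standard deviation per real dimension for a given Eb/N0 in dB.

    Assumptions: symbol energy es, Eb = es / rate, two-sided noise
    power spectral density N0/2.
    """
    ebno = 10 ** (ebno_db / 10)      # dB to linear ratio
    eb = es / rate
    n0 = eb / ebno
    return math.sqrt(n0 / 2)


def bpsk_awgn(bits, ebno_db, rng=None, rate=1.0):
    """Map bits to +1/-1 (assumed mapping) and add white Gaussian noise."""
    rng = rng or random.Random(0)    # fixed seed for repeatable runs
    sigma = noise_sigma(ebno_db, rate=rate)
    return [(1.0 if b else -1.0) + rng.gauss(0.0, sigma) for b in bits]


print(noise_sigma(0.0))              # N0 = 1, so sigma = sqrt(1/2)
print(bpsk_awgn([0, 1, 1, 0], ebno_db=5.0))
```

Setting `rate=0.5` would account for the redundancy of a rate 1/2 code when Eb refers to information bits; whether the simulation of this study does so is not stated, so the default leaves it out.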

Coding gain vs. soft-decision bits:
The number of soft-decision bits affects the gain of the code. Figure 16 shows the effect of varying the number of soft-decision bits on the error rate vs. signal-to-noise ratio for the UMTS convolutional code with generator sequences $g_0 = 557_8$, $g_1 = 663_8$, $g_2 = 711_8$, rate R = 1/3 and constraint length K = 9. It shows a good improvement in BER for 4-bit soft decisions over 3-bit soft decisions. Increasing the number of soft-decision bits beyond 4 gives little additional gain, as shown.

Coding gain vs. peaks points (limits of the quantizer):
Limiting the incoming data has a strong effect on the BER. The limit is determined by the peak value. Figure 17 shows the quantizer with limits at the constellation points +1, -1, whereas Fig. 18 shows the quantizer with limits at +2, -2 and the constellation points at +1, -1. In both cases the range is divided into the same number of levels.
Figure 19 shows the effect of the quantizer limits for 3-bit and 4-bit soft decisions. From Fig. 19 we notice a gain of approximately 0.5 dB for peak = 2 over peak = 1.

Viterbi decoder circuit implementation:
The Viterbi decoder design was captured in Verilog HDL, simulated with the Verilogger and ModelSim programs and synthesized with Xilinx ISE on a Virtex-II FPGA platform. A high-level model was written in MATLAB and Simulink to carry out all the necessary performance simulations for different parameters.
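The role of the peak value can be illustrated with a simple uniform quantizer in Python. This is a sketch under assumptions: the mid-point rounding rule, the placement of the outermost levels exactly at the limits and the function name are choices made for this example and may differ from the quantizer used in the simulation.

```python
import math


def quantize(x, soft_bits=4, peak=1.0):
    """Uniform soft-decision quantizer with clipping at +peak and -peak.

    Returns an integer level in 0 .. 2**soft_bits - 1. The outermost
    levels sit exactly on the limits (an assumption of this sketch).
    """
    levels = 1 << soft_bits
    clipped = max(-peak, min(peak, x))       # limit the incoming sample
    step = 2 * peak / (levels - 1)           # spacing between levels
    return int(math.floor((clipped + peak) / step + 0.5))


# Same received samples, same number of levels, two different limits.
samples = [-1.6, -1.0, -0.2, 0.4, 1.0, 1.8]
print([quantize(x, soft_bits=3, peak=1.0) for x in samples])
print([quantize(x, soft_bits=3, peak=2.0) for x in samples])
```

With peak = 1 every sample beyond a constellation point collapses onto the outermost level, whereas peak = 2 keeps some resolution there at the cost of coarser steps between the constellation points.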

SYNTHESIS RESULTS
After the Viterbi decoder was verified in Simulink, it was synthesized with the Xilinx (2000) ISE 5.1i development tool for prototyping on an FPGA chip. We synthesized two versions of the Verilog HDL code, one using the carry-ripple technique and one using the carry-save technique. Figure 20 shows the timing report of the two techniques; it shows a speed improvement of the pipelined carry-save technique over the carry-ripple one by a factor of 4.09.
Figure 21 shows the hardware complexity of the two versions, before and after pipelining. Pipelining needs more hardware to accommodate the cut sets, the redundancy suppression circuits and the latency of the pipelining.
Many designs of the Viterbi decoder are implemented without taking the pipelining technique into consideration, while Santhi et al. (2008)
Fig. 19 legend (limits of the quantizer): 3 bits at peak point 2; 3 bits at peak point 1; 4 bits at peak point 2; 4 bits at peak point 1

Fig. 2: State diagram

Fig. 3: A state diagram showing adders and maximum selection

Fig. 4: State 00 of the state diagram in bit level with word length equal 4, in carry-ripple format

Fig. 8: The maximum selection flow chart, indicating the difference $a_i - b_i$, which is local for branch A

Fig. 10: Redundancy suppression circuit, with input and output table

Fig. 11: Block diagram of current and next decision flags and current bits of branch A and B
$df_{i-1}^a$ = {($dp_i^a$ ($df_i^a$ ($C'_a$ + $S'_a$))) + ($dp_i^a$ ($df_i^a$ ($C'_a$ $S'_a$))) + ($dp_i^a$ ($dp_i^b$ ($C'_b$ $S'_b$))') + (($df_i^a$ ($C'_a$ $S'_a$)) ($dp_i^b$ ($C'_b$ $S'_b$))'

Fig. 12: Simple block diagram of the coding and decoding system

Fig. 14: Simulink block diagram

Table 1: Redundancy in CS-representation
$c_i$   $s_i$   $v_i = s_i + c_i$
0       0       0
0       1       1
1       0       1
1       1       2

Table 2: Indication of decision flags $d^a$