Jacobi Solver: A Fast FPGA-based Engine System for Jacobi Method

Abstract: The classical Jacobi method is widely used for solving linear systems, but it is time-consuming when millions of linear equations must be solved. In this study, we design a novel FPGA-based Jacobi Solver. The kernel of the Jacobi Solver is a pipeline-friendly iteration algorithm that eliminates the data dependence between iteration steps, which makes it well suited to a deeply pipelined hardware architecture. The experimental results show that the Jacobi Solver can solve more than 6.5 million linear equations in one second and achieves up to 341x speedup over a single-thread CPU version.


INTRODUCTION
With abundant computational resources available, scientific computing problems are growing in size and complexity. Many of these problems require solving linear systems of large dimension, e.g., $>10^{7}$ (Young, 2003). Numerous methods have been proposed to solve linear systems, such as the Direct Solution method, Jacobi method, Gauss-Seidel method, Conjugate Gradient method and Multigrid method (Saad and Van Der Vorst, 2000). Among these methods, the Jacobi method is relatively simple and independent of the numbering of the unknowns (Cavallaro and Luk, 1988). A further advantage is that rounding errors do not accumulate; they are restricted to the last operation. However, the Jacobi method is very time-consuming when millions of linear equations must be solved.
Fortunately, the Jacobi method has considerable potential for parallelization. For example, O'Leary and White (1985) introduce a parallel scheme based on multi-splittings of the coefficient matrix and prove the convergence of this method under sufficient conditions on the coefficient matrix and its splittings (Bru and Fuster, 1990). Morris and Prasanna (2005) implemented a single iteration of the Jacobi method on an FPGA, without considering convergence, and achieved a 1.7x speedup over a CPU implementation for one iteration. Wang et al. (2009) present a GPU-based implementation of the Jacobi method. Their experimental results show that the performance of the GPU-based Jacobi method increases linearly with the scale of the matrix and finally reaches a 3x speedup over the GPU-based dense linear system solver of Barrachina et al. (2008).
Our study also focuses on modifying the Jacobi method to make it suitable for FPGA architecture. We design an FPGA-based Jacobi Solver whose core is a new pipeline-friendly iteration algorithm. The main idea of this algorithm is to eliminate the data dependence between iteration steps. To achieve this goal, we reconstruct the workflow of the classical Jacobi method with algorithmic transformations. After enabling pipelining of the different computation stages, we implement the Jacobi Solver in hardware logic and run it on a Virtex-6 SX475T FPGA. To evaluate the performance of our FPGA-based Jacobi Solver, we also implement three CPU-based Jacobi method solutions: a single-thread version, a multi-thread version and an MPI-based version. These CPU-based solutions are all fully optimized. The experimental results show that our FPGA-based Jacobi Solver is significantly faster than all CPU-based solutions: it achieves a maximum of 341x speedup over the single-thread CPU-based solution, 115x speedup over the multi-thread CPU-based solution and 35x speedup over the MPI-based solution.

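Fundamentals of Jacobi method:
The Jacobi method is used to solve linear systems whose matrix has no zeros along its main diagonal. Each diagonal element is solved for and an approximate value is plugged in. Following the expressions in Bronshtein et al. (1985), we are given a system of n linear equations:

$$Ax = b \qquad (1)$$

where,

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{pmatrix}, \quad x = (x_1, x_2, \dots, x_n)^{T}, \quad b = (b_1, b_2, \dots, b_n)^{T} \qquad (2)$$

Then A can be decomposed into a diagonal component D and the remainder R:

$$A = D + R \qquad (3)$$

where,

$$D = \begin{pmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & a_{nn} \end{pmatrix}, \quad R = \begin{pmatrix} 0 & a_{12} & \cdots & a_{1n} \\ a_{21} & 0 & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & 0 \end{pmatrix} \qquad (4)$$

The solution is then iteratively solved by:

$$x^{(k+1)} = D^{-1}\left(b - R\,x^{(k)}\right) \qquad (5)$$

Eq. (5) can be converted into an element-based formula:

$$x_i^{(k+1)} = \frac{1}{a_{ii}}\Big(b_i - \sum_{j \neq i} a_{ij}\,x_j^{(k)}\Big), \quad i = 1, 2, \dots, n \qquad (6)$$

where k is the iteration count. The $x_i^{(k+1)}$ will be computed iteratively until the solution $x = (x_1, x_2, \dots, x_n)^{T}$ reaches convergence. The condition for convergence is that, for a given precision $\varepsilon$, for example $\varepsilon = 10^{-9}$, the following condition is satisfied:

$$\lVert x^{(k+1)} - x^{(k)} \rVert < \varepsilon \qquad (7)$$

Based on Eq. (6), we write the pseudo-code of the Jacobi method in Algorithm 1. Line 7 to Line 12 in Algorithm 1 update component $x_i$ in the current (k+1)-th iteration, and δ in Line 9 accumulates $\sum_{j \neq i} a_{ij}\,x_j^{(k)}$. When all components $(x_1, x_2, \dots, x_n)^{T}$ of solution x are updated, the algorithm checks whether the result has converged. If solution $x = (x_1, x_2, \dots, x_n)^{T}$ reaches convergence at the $i$-th iteration, the iteration process finishes. Otherwise the algorithm starts the next iteration step.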
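As an illustration, the element-based update of Eq. (6) and the convergence check described above can be sketched in Python. This is a minimal software sketch, not the hardware design: the function name `jacobi`, the zero initial guess, the iteration cap `max_iter` and the use of the largest component-wise change as the convergence measure are assumptions made for the example.

```python
def jacobi(A, b, x0=None, eps=1e-9, max_iter=10_000):
    """Solve A x = b with the classical Jacobi iteration of Eq. (6)."""
    n = len(b)
    x = list(x0) if x0 is not None else [0.0] * n  # assumed initial guess
    for k in range(max_iter):
        x_new = [0.0] * n
        for i in range(n):
            delta = 0.0  # accumulates sum over j != i of a_ij * x_j^(k)
            for j in range(n):
                if j != i:
                    delta += A[i][j] * x[j]
            x_new[i] = (b[i] - delta) / A[i][i]
        # Assumed stopping rule: largest component change below eps.
        if max(abs(x_new[i] - x[i]) for i in range(n)) < eps:
            return x_new, k + 1
        x = x_new
    raise RuntimeError("Jacobi iteration did not converge")


# Diagonally dominant 2x2 system whose exact solution is (1, 2).
A = [[4.0, 1.0], [2.0, 5.0]]
b = [6.0, 12.0]
x, iterations = jacobi(A, b)
```

Note that every `x_new[i]` reads only values of `x` from the previous iteration, so a new iteration cannot begin until the previous one has produced all of its components.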
There is an obvious data dependency between δ and $x_j^{(k)}$ when a new iteration starts: δ cannot be updated until the components from the previous iteration are available. This data dependency prevents the classical Jacobi method from being implemented directly in hardware logic (Weaver et al., 2003). We therefore need to design a pipeline-friendly algorithm for the FPGA architecture.
Algorithm 1: The Classical Jacobi Method:
Choose an initial guess value $x^{(0)}$;
k = 0;
Check if convergence is reached;
while convergence not reached do
    for i := 1 step until n do
        δ = 0;
        for j := 1 step until n do
            if j ≠ i then
                δ = δ + $a_{ij} x_j^{(k)}$;
            end if
        end for
        $x_i^{(k+1)} = (b_i - δ) / a_{ii}$;
    end for
    Check whether convergence is reached;
    k = k + 1;
end while

The target hardware: We use Maxeler's MAX3 acceleration card to implement our engine system for the Jacobi method. The MAX3 acceleration card consists of one Virtex-6 SX475T FPGA and 24 GB of DDR3 onboard memory. Figure 1 illustrates the architecture of the Maxeler acceleration system we use. The host application runs on the conventional CPU and manages the interaction between the host and the FPGA accelerators. On the FPGA, kernels are the hardware designs implementing the arithmetic and logical computations needed within an algorithm. The manager is the collective term for the FPGA logic that orchestrates data flow between the kernels and off-chip I/O.
Dividing an application into kernels and a manager lets us separate computation from communication. This is beneficial because it enables deeply pipelined kernels without control flow constraints, which is key to achieving high performance. Managers use a streaming model for off-chip I/O to PCI Express and DDR3 memory and are optimized to achieve high use of the available bandwidth in off-chip communication channels, allowing kernels to run at peak performance (Lindtjorn et al., 2011).
Additionally, the MAX3 acceleration card comes with a compiling platform named MaxCompiler, provided by Maxeler Technologies. Using MaxCompiler, we only need to program the algorithm of the Jacobi method in Java; MaxCompiler then compiles and builds the Java code into a configurable bitstream file through the vendor's CAD software, which is integrated in MaxCompiler. The generated bitstream file can be loaded into the Maxeler acceleration card at runtime.
Design of FPGA-based Jacobi solver: In this section, we design a pipeline-friendly algorithm to eliminate the data dependence between iteration steps. We divide the complete large-scale linear systems into blocks, in which one block consists of a small number of linear equations. All equations in one block will be solved simultaneously when the pipeline-friendly algorithm finishes its iterations.
We can see from Line 9 in Algorithm 1 that updating the δ value is one important step of the Jacobi method. This step needs all components $x_j$ ($j = 1, 2, \dots, n$) that have been calculated in the previous iteration. If we simply migrate the step of updating δ to FPGA hardware logic, the computing process will fail because the calculation of $x_j$ needs to pass through a long sequence of pipeline stages in the FPGA. When a new iteration step starts, a number of $x_j$ ($j = k, k+1, k+2, \dots, n$) are still not available, which leads to an invalid update of δ.
To solve this problem, we reconstruct the classical Jacobi method into a pipeline-friendly Jacobi method, which is shown in Algorithm 2. The main improvement from using blocks is that a number of equations are solved simultaneously during one iteration step, whereas the original method solves only a single equation during one iteration step.
We assume T FPGA clock cycles are needed before all components $(x_1, x_2, \dots, x_n)^{T}$ are figured out for all the linear equations in a block. Before updating the δ value, we calculate every $x_i$ in parallel in the following steps:
• Step 1: In cycle $t_1$, the algorithm starts to calculate $x_i$ for the 1st equation in the block.
After T FPGA clock cycles elapse, all components $(x_1, x_2, \dots, x_n)^{T}$ for the 1st equation are figured out. Therefore the update of the first equation's δ can be correctly performed in the new iteration. When the second cycle of the new iteration starts, all components $(x_1, x_2, \dots, x_n)^{T}$ for the 2nd equation are also figured out, since T FPGA clock cycles have likewise elapsed since the calculation of $x_i$ for the 2nd equation started, so the update of δ for the 2nd equation can also be correctly performed. The process is the same for updating δ for the other equations in the block in the following cycles of the new iteration.

Algorithm 2: The Pipeline-Friendly Jacobi Method:
Choose initial guess solutions $x^{1}, x^{2}, \dots, x^{T}$ for all equations in the block;
k = 0;
Check if all equations in the block have reached convergence;
while not all equations in the block have reached convergence do
    for t := 1 step until T do
        $\delta^{t}$ = 0;
    end for
    for i := 1 step until n do
        for t := 1 step until T do
            for j := 1 step until n do
                if j ≠ i then
                    $\delta^{t} = \delta^{t} + a_{ij}\,x_j^{t,(k)}$;
                end if
            end for
            $x_i^{t,(k+1)} = (b_i^{t} - \delta^{t}) / a_{ii}$;
            $\delta^{t}$ = 0;
        end for
    end for
    Check whether the T equations in the current block have reached convergence;
    k = k + 1;
end while
Using this pipeline-friendly Jacobi method, we can overlap multiple computing stages for updating δ, which eliminates the data dependency between consecutive iteration steps. This method is easy to implement in FPGA hardware logic.
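The benefit of overlapping can be illustrated with a simplified cycle-count model. The sketch below is an assumption-laden toy model rather than a description of the actual hardware: it treats one iteration of one equation as a single pipeline operation with a latency of T cycles, and the function names and the values of `T` and `K` are hypothetical.

```python
def cycles_sequential(iterations, latency):
    """One equation: each iteration waits for the previous result."""
    return iterations * latency


def cycles_interleaved(iterations, latency, block_size):
    """block_size equations issued round-robin, one per clock cycle.

    Equation t issues its iteration k at cycle k * period + t, where the
    period is long enough for its previous result to leave the pipeline.
    """
    period = max(latency, block_size)
    last_issue = (iterations - 1) * period + (block_size - 1)
    return last_issue + latency  # cycle at which the last result emerges


T = 8    # hypothetical pipeline latency in clock cycles
K = 100  # hypothetical number of Jacobi iterations
one_equation = cycles_sequential(K, T)         # 800 cycles for 1 equation
eight_equations = cycles_interleaved(K, T, T)  # 807 cycles for 8 equations
```

Under these assumptions, a block of T equations finishes in almost the same number of cycles as a single equation, because the pipeline slots that would otherwise sit idle while waiting for a dependent result are filled by the other equations in the block.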
Since Algorithm 2 needs to solve T equations simultaneously during the iterations, we change the terms δ, $x_i$, $x_j$ and $b_i$ in Algorithm 1 to $\delta^{t}$, $x_i^{t}$, $x_j^{t}$ and $b_i^{t}$ in Algorithm 2 respectively, where:
• $\delta^{t}$ is the corresponding δ for the t-th equation in the block.
• $x_i^{t}$ and $x_j^{t}$ are the corresponding components $x_i$ and $x_j$ respectively for the t-th equation in the block.
• $b_i^{t}$ is the corresponding $b_i$ for the t-th equation in the block.
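The loop order of Algorithm 2 can be mimicked in software as follows. This is a minimal Python sketch under stated assumptions, not the MaxCompiler kernel: the function name `jacobi_block` is hypothetical, each slot t in the block is assumed to carry its own coefficient matrix and right-hand side, the initial guess is zero, and convergence is again measured by the largest component-wise change.

```python
def jacobi_block(systems, eps=1e-9, max_iter=10_000):
    """Iterate the T equations of one block in interleaved order.

    systems: list of (A, b) pairs, one per slot t in the block
             (per-slot A is an assumption of this sketch).
    Returns the list of solutions once every slot has converged.
    """
    T = len(systems)
    n = len(systems[0][1])
    x = [[0.0] * n for _ in range(T)]          # x[t][i] at iteration k
    for _ in range(max_iter):
        x_new = [[0.0] * n for _ in range(T)]  # x[t][i] at iteration k+1
        for i in range(n):
            # Assumption: in hardware the j loop is one unrolled pipeline
            # operation, so successive issues (t = 0, 1, ...) belong to
            # different slots and do not wait on each other.
            for t in range(T):
                A, b = systems[t]
                delta = 0.0
                for j in range(n):
                    if j != i:
                        delta += A[i][j] * x[t][j]
                x_new[t][i] = (b[i] - delta) / A[i][i]
        converged = all(
            max(abs(x_new[t][i] - x[t][i]) for i in range(n)) < eps
            for t in range(T)
        )
        x = x_new
        if converged:
            return x
    raise RuntimeError("block did not converge")


# Three small diagonally dominant systems, each with exact solution (1, 2).
block = [
    ([[4.0, 1.0], [2.0, 5.0]], [6.0, 12.0]),
    ([[3.0, 1.0], [1.0, 2.0]], [5.0, 5.0]),
    ([[10.0, 2.0], [3.0, 9.0]], [14.0, 21.0]),
]
solutions = jacobi_block(block)
```

In this sketch the block keeps iterating until its slowest slot converges, which mirrors the check on all T equations at the end of each pass in Algorithm 2.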

RESULTS AND DISCUSSION
We implement the pipeline-friendly Jacobi method on a Maxeler MAX3 acceleration card with one Virtex-6 SX475T FPGA. All four implementations of the Jacobi method (the FPGA version and the three CPU versions) are tested on eight datasets, which are generated randomly with linear dimensions ranging from small to large. The linear dimensions of the eight datasets are 2, 4, 8, 16, 32, 64, 128 and 200, respectively. Limited by the hardware resources provided by the Virtex-6 SX475T FPGA on one Maxeler MAX3 acceleration card, the maximum linear dimension that the Jacobi Solver can solve is 200. If more hardware resources are available, the maximum linear dimension can be extended.
Note that the amount of computation to be performed within the Jacobi method is proportional to the linear dimension. Therefore we keep the number of linear equations at 1 million in every dataset for performance comparison.
We present the number of processed equations per second for the different versions in Table 1 and the speedup of the FPGA version over the other versions in Table 2. For clarity, we plot Fig. 2 to compare the speedups in detail.
As shown in Table 1, the throughput of the FPGA-based implementation is high. In the two-dimension scenario, it can solve 6,544,444 equations in one second. In the two-hundred-dimension scenario, it can still solve about 297,474 equations in one second. These results are much higher than those of the three CPU versions.
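To put these throughput figures in perspective, the time needed to process one dataset of 1 million equations can be worked out directly. The short Python sketch below uses only the two FPGA figures quoted above; the variable names are chosen for illustration.

```python
EQUATIONS_PER_DATASET = 1_000_000  # dataset size stated in the text

# FPGA throughput quoted from Table 1 (equations per second), by dimension.
throughput = {2: 6_544_444, 200: 297_474}

for dimension, rate in throughput.items():
    seconds = EQUATIONS_PER_DATASET / rate
    print(f"dimension {dimension}: {seconds:.3f} s per dataset")

# Ratio of the two throughputs: roughly 22x for a 100x larger dimension.
slowdown = throughput[2] / throughput[200]
```

This gives roughly 0.153 s per dataset at dimension 2 and roughly 3.362 s at dimension 200.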
The pipeline-friendly Jacobi method enables us to deploy sufficient FPGA functional units and is easy to implement in deeply pipelined FPGA hardware logic. Furthermore, if more hardware resources on the FPGA platform are available, we can deploy more than one pipeline for this method in the FPGA, which will result in further acceleration.

CONCLUSION
This study presents Jacobi Solver, a fast FPGA-based design of the Jacobi method that is useful for solving large-scale linear systems. We design a pipeline-friendly Jacobi method, which eliminates the data dependency between iterations. This allows the Jacobi method not only to be implemented successfully in FPGA hardware logic, but also to achieve significant acceleration. Our experimental results show that Jacobi Solver is significantly more efficient than the CPU versions.

Fig. 1: Architecture of Maxeler acceleration card

Table 1: Performance comparison (solved equations per second)

As shown in Table 2, the speedup rises steadily as the dimension of each dataset increases, except against the MPI version. In our test cases, most of the best speedups are achieved on the largest dataset (i.e., the two-hundred-dimension scenario): 341x speedup over the single-thread CPU version and 115x speedup over the M-CPU (multi-thread CPU) version. Compared with the MPI-based solution, the FPGA solution achieves a maximum 35x speedup when the dimension is 32. When the dimension increases to 200, the FPGA solution is only 12x faster than the MPI-based solution. The main reason for this result is that, at larger dimensions, the time the MPI version saves in the computation process outweighs the time consumed by the overhead of forking 64 MPI processes. Our studies in this case are still limited by the number of MAX3 cards. If more MAX3 cards are available, we can further compare a multi-FPGA version with the MPI version.