Parallel Image Processing Techniques, Benefits and Limitations

The aim of digital image processing is to improve image quality and subsequently to perform feature extraction and classification. It is used effectively in computer vision, medical imaging, meteorology, astronomy, remote sensing and other related fields. The main problem is that it is generally a time-consuming process; parallel computing provides an efficient and convenient way to address this issue. The main purpose of this review is to provide a comparative study of existing contributions on implementing parallel image processing applications, with their benefits and limitations. Another important aspect of this study is a brief introduction to parallel computing and to the currently available parallel architectures, tools and techniques used for implementing parallel image processing. We also discuss the problems encountered when implementing parallel computing in various image processing applications and describe the role of parallel image processing in the field of medical imaging.

image processing to enhance and extract information from images. Several authors categorize image processing into three groups: low level, intermediate level and high level image processing (Soviany, 2003; Ashburner and Friston, 2005).
Low level image processing: It usually converts image data into image data. Examples are contrast enhancement, noise reduction, filter transformations and the calculation of features of input images such as contours and histograms.
Intermediate level image processing: These are more complex operations that derive abstractions from the image pixels, such as region labeling and object tracking.
High level image processing: This is knowledge-based processing concerned with interpreting the information extracted at the intermediate level, for example pattern recognition and object classification (Soviany, 2003). The basic flow of the different steps of image processing, such as image acquisition and image preprocessing, is shown in Fig. 1 (Drakos, 2014).
Having reviewed the basic steps of image processing, we now give a brief description of parallel computing and its importance in image processing.

PARALLEL COMPUTING AND ITS ENVIRONMENT
Parallel computing, or parallel processing, is the simultaneous use of multiple compute resources to solve a computational task (Saxena et al., 2013b). The main principle of parallel computing is to divide a task so that it executes in minimum time with maximum efficiency. Parallel computing can be implemented on several kinds of parallel machine: a cluster of computers, in which multiple PCs are connected by a high-speed network; a shared memory multiprocessor, in which multiple processors are connected to a single memory system; and a Chip Multi-Processor (CMP), which contains multiple processors (called cores) on a single chip (Saxena et al., 2013c; Fung and Mann, 2008; Edelman et al., 2006; Barney, 2014; Huang et al., 2011). The application areas of high performance or parallel computing include image processing; atmosphere, earth and environment; applied physics, nuclear physics and condensed matter; computer science, mathematics, electrical engineering and many more, as discussed in Barney (2014) and Slabaugh et al. (2010).
Basic concepts of parallel computing: Barney (2014) gave the basic terminology generally used in parallel computing.
Node: An individual "computer in a box". It typically comprises multiple CPUs/cores/processors, memory, network interfaces, etc. Nodes are networked together to form a supercomputer.
CPU/Processor/Core: Originally, a Central Processing Unit (CPU) was the single execution component of a computer. Later, multiple CPUs were incorporated into a node. Individual CPUs were then subdivided into multiple cores, each being a distinct execution unit.
Task: A logically discrete section of computational work. A task is typically a program or set of instructions executed by a core/processor. A parallel program consists of multiple tasks running on multiple processors.
Pipelining: Breaking a task into steps performed by different processing units, with inputs streaming through, much like an assembly line.

Shared Memory:
From the hardware point of view, it is a computer architecture in which all cores/processors have direct access to common physical memory. From the programming point of view, it is a model in which parallel tasks all have the same picture of memory and can directly address and access the same logical memory locations, regardless of where the physical memory actually exists.

Symmetric Multi-Processor (SMP):
It is a shared memory hardware architecture in which several processors share a single address space and have access to all resources.

Distributed memory:
From the hardware point of view, it refers to network-based access to physical memory. From the programming point of view, tasks can logically see only local machine memory and must use communications to access memory on other machines where other tasks are executing.
Communications: Parallel tasks typically need to exchange data. This can be accomplished in several ways, such as through a shared memory bus or over a network; the actual event of data exchange is commonly referred to as communications, regardless of the method employed.
Synchronization: The coordination of parallel tasks in real time, very often associated with communications. It is often implemented by establishing a synchronization point within an application where a task may not proceed further until other tasks reach the same or a logically equivalent point. Synchronization usually involves waiting by at least one task and can therefore cause a parallel application's execution time to increase.
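The coordination point described under Synchronization can be illustrated with a barrier. The following is a minimal sketch in Python, assuming the standard library's `threading.Barrier`; the function name `run_with_barrier` and the toy two-phase workload are hypothetical and serve only as an illustration. Each worker finishes its local phase, waits at the barrier, and only then reads the results of the other workers.

```python
import threading


def run_with_barrier(num_workers: int = 4) -> list:
    """Run a two-phase job in which phase 2 needs every worker's phase-1 result."""
    barrier = threading.Barrier(num_workers)   # synchronization point for all workers
    partial = [0] * num_workers                # phase-1 results, one slot per worker
    totals = [0] * num_workers                 # phase-2 results, one slot per worker

    def worker(rank: int) -> None:
        # Phase 1: purely local work (hypothetical workload).
        partial[rank] = (rank + 1) * 10
        # No worker may continue until all workers have reached this point.
        barrier.wait()
        # Phase 2: it is now safe to read the results of every other worker.
        totals[rank] = sum(partial)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


if __name__ == "__main__":
    print(run_with_barrier(4))
```

Without the barrier, a fast worker could sum `partial` before slower workers had written their entries. The waiting time spent at the barrier is the synchronization cost mentioned above.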

Granularity:
In parallel computing, it is a qualitative measure of the ratio of computation to communication.
Parallel overhead: The amount of time required to coordinate parallel tasks, as opposed to doing useful work.

Scalability:
It refers to a parallel system's ability to demonstrate a proportionate increase in parallel speedup as more compute resources are added.
Performance measures: These are a set of metrics used to quantify the quality of an algorithm (Navarro et al., 2014). The quality of a sequential algorithm is usually evaluated in terms of time and space (Rajaraman and Siva Ram Murthy, 2006; Dougherty, 2009). The quality of a parallel algorithm, however, also depends on the parallel architecture and the number of processors employed. The metrics and measures described by several authors for analyzing the performance of a parallel computing system are given below.

Parallel run time:
This is the time taken by a program executed on an n-processor parallel computer, denoted T(n). When n = 1, T(1) denotes the sequential run time of the program on a single processor (Rajaraman and Siva Ram Murthy, 2006).

Speedup (Navarro et al., 2014):
This is one of the most important measures in parallel computing. It measures how much faster a parallel algorithm runs with respect to the best sequential one. For a problem of size n, the expression for speedup is:

$$S_p = \frac{T_s(n,1)}{T(n,p)}$$

where, $T_s(n,1)$ is the time of the best sequential algorithm (i.e., $T_s(n,1) \le T(n,1)$) and $T(n,p)$ is the time of the parallel algorithm with p processors, both solving the same problem. Navarro et al. (2014) described when speedup is linear and when it is super-linear, as well as the different models of speedup: fixed-time, fixed-size and scaled speedup. We now discuss the speedup performance laws, also known as the laws of speedup: Amdahl's law, Gustafson's law and Sun and Ni's law.
Amdahl's law: It is often used to predict the theoretical maximum speedup obtainable with multiple processors.
According to this law, "the speedup of a program using multiple processors in parallel computing is limited by the sequential portion of the program". For example, if 95% of a program can be parallelized, the theoretical maximum speedup using parallel computing is 20 times, no matter how many processors are used, as shown in Fig. 2.
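The 20-times figure in the example above can be checked numerically. The sketch below assumes the usual statement of Amdahl's law, S(p) = 1 / ((1 - f) + f / p), where f is the parallelizable fraction and p the number of processors; the function names `amdahl_speedup` and `amdahl_limit` are hypothetical and chosen only for this illustration.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Theoretical speedup when `parallel_fraction` of the run time can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unchanged; the parallel part is divided among the processors.
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on the speedup as the number of processors grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    # With 95% parallelizable code the speedup approaches, but never exceeds, 20.
    for p in (1, 8, 64, 1024, 65536):
        print(p, round(amdahl_speedup(0.95, p), 2))
    print("limit:", round(amdahl_limit(0.95), 2))
```

The serial 5% alone accounts for 1/20 of the original run time, which is why adding processors cannot push the speedup past 20.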
Gustafson's law: This law states that increasing the problem size on larger machines can retain scalability with respect to the number of processors (Zhou et al., 2012; Rajaraman and Siva Ram Murthy, 2006).
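The contrast between the fixed-size and scaled views can be made concrete with a short worked calculation. The formulation below is the commonly used scaled-speedup form of Gustafson's law and is given here as an assumption, since the text above does not state a formula; f denotes the fraction of the parallel run time spent in parallelizable work and p the number of processors, both symbols introduced only for this illustration.

```latex
% Fixed-size (Amdahl) versus scaled (Gustafson) speedup, with parallel fraction f
S_{\text{Amdahl}}(p) = \frac{1}{(1 - f) + f/p},
\qquad
S_{\text{Gustafson}}(p) = (1 - f) + f\,p .

% Worked example with f = 0.95 and p = 64:
S_{\text{Amdahl}}(64) = \frac{1}{0.05 + 0.95/64} \approx 15.42,
\qquad
S_{\text{Gustafson}}(64) = 0.05 + 0.95 \times 64 = 60.85 .
```

Under these assumptions the scaled speedup keeps growing almost linearly with p because the parallel part of the workload grows with the machine, whereas the fixed-size speedup saturates.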

Sun and Ni's Law:
This is referred to as a memory-bounded model. When the speedup is computed with the problem size limited by the memory available in an n-processor system, it leads to a generalization of Amdahl's and Gustafson's laws (Zhou et al., 2012; Rajaraman and Siva Ram Murthy, 2006). Speedup and efficiency are expressed by the following equations (Zhou et al., 2012; Rajaraman and Siva Ram Murthy, 2006):

$$S_p = \frac{T_s(n,1)}{T(n,p)}, \qquad E_p = \frac{S_p}{p} = \frac{T_s(n,1)}{p\,T(n,p)}$$

where, $S_p$ is the speedup, $E_p$ is the efficiency of an algorithm with p processors, $T_s(n,1)$ is the time of the best sequential algorithm (i.e., $T_s(n,1) \le T(n,1)$) and $T(n,p)$ is the time of the parallel algorithm with p processors, both solving the same problem.
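In practice, speedup and efficiency are computed from measured run times. The sketch below assumes the standard definitions, speedup as the best sequential time divided by the parallel time and efficiency as speedup divided by the number of processors; the function names and the timings in the usage example are hypothetical and are not measurements from any of the cited studies.

```python
def speedup(t_sequential: float, t_parallel: float) -> float:
    """Speedup: best sequential run time divided by parallel run time."""
    if t_sequential <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_sequential / t_parallel


def efficiency(t_sequential: float, t_parallel: float, processors: int) -> float:
    """Efficiency: speedup divided by the number of processors used."""
    if processors < 1:
        raise ValueError("at least one processor is required")
    return speedup(t_sequential, t_parallel) / processors


if __name__ == "__main__":
    # Hypothetical timings: 12.0 s sequentially, 2.0 s in parallel on 8 processors.
    print(speedup(12.0, 2.0))        # speedup of the parallel run
    print(efficiency(12.0, 2.0, 8))  # fraction of ideal linear speedup achieved
```

An efficiency of 1.0 corresponds to linear speedup; values below 1.0 reflect parallel overhead such as communication and synchronization.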

Need for parallel computing in image processing:
Several studies describe the need for parallel computing in image processing. As already discussed, processing a gray scale image of size 1024 × 1024 requires a CPU to perform more than one million operations; for a color image this is multiplied by the number of channels (Olmedo et al., 2012). An efficient implementation of parallel computing can therefore reduce the processing time.
Different researchers have described image processing techniques that require parallel computing. For high-resolution images, Fung and Mann (2008) noted that images of 10000 × 10000 pixels require substantial computing power to complete operations in time. Akgün (2013) evaluated the performance of parallel image convolution filters on multi-core architectures using Java threads. Basic image processing techniques such as contrast enhancement and brightness improvement also need high computational power, as they involve several time-consuming steps. Alda Kika and Greca (2013) also discussed applications of image processing using Java threads, and many other researchers have worked in this field. The following comparison of benefits and limitations also covers our previously developed approaches (Saxena et al., 2013a).
The next section presents the benefits and limitations of the techniques given above by different authors.

Table 1 illustrates the benefits/advantages and comments/improvement areas of the contributions of different researchers.
Table 1 shows that numerous approaches have been developed so far for implementing parallel image processing using GPU, CUDA, Hadoop, OpenCV, OpenCL and many more. Some of them are very useful and informative. However, other methods implemented by different researchers are not described in Table 1. We now give a brief comparison of different image processing algorithms implemented in parallel using GPU and CUDA, in terms of time consumption on CPUs as well as on GPUs. Kaur (2013) has already carried out such a study. In Table 2 we add some more algorithms, which will help researchers study different image processing algorithms in terms of speedup.

PARALLEL ARCHITECTURE, TOOLS AND TECHNIQUES AVAILABLE FOR IMPLEMENTING PARALLEL IMAGE PROCESSING
In the previous section we saw that several authors have implemented different techniques or algorithms in parallel using different architectures and tools such as MATLAB, CUDA, Hadoop and many more. We now give a brief description of these techniques with their advantages and disadvantages (Hadoop Advantages and Disadvantages, 2015).

GPU (Graphics Processing Unit):
A GPU, also known as a Visual Processing Unit (VPU), is a graphics processing unit. A CPU contains few cores (the newest CPUs contain 4 or 8), while a GPU contains hundreds to thousands of cores, as shown in Fig. 3.
At present, a major challenge in image processing is that several of its applications need high computational power to attain high precision and real-time performance, which is not easy to achieve using a CPU. Every NVIDIA GPU has 8 to 240 parallel cores, and each core has four units: a floating point unit, a logic unit (for add, sub, mul, madd), a move and compare unit and a branch unit. The cores in a GPU are managed by a thread manager, which can manage 12,000+ threads. The GPU has developed into a very flexible and powerful processor that can be programmed using high-level languages. It supports 32-bit and 64-bit IEEE-754 floating point precision and offers many GFLOPS (Applications 2014). According to Srinivasan (2009), 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications and are available in laptops, desktops and clusters. It has been observed that GPU parallelism is doubling every year. GPUs provide high computational density (100s of ALUs) and memory bandwidth (100+ GB/s) (Nickolls, 2007). In a program, the GPU executes the kernel code and the CPU executes the serial code, which reduces the execution time of the program. While the GPU performs the calculations, CPU time cycles can be used for other high-priority tasks (Kaur and Nishi, 2010).
Advantages/Benefits: The benefits of using GPUs given in (Kaur and Nishi, 2010; Tariq, 2011) are as follows:
• They reduce power consumption.
• GPUs are genuinely programmable and support high precision, that is, 32-bit floating point throughout the pipeline.
• They provide portability, programmability and flexibility.
• In the GPU computing model, the CPU and GPU work together in a heterogeneous co-processing computing model.
Limitations/Drawbacks: Kaur and Nishi (2010) and Ruetsch and Oster (2008) give the following drawbacks:
• Gaining this speedup requires that algorithms are coded to reflect the GPU architecture, and programming for the GPU differs significantly from programming for traditional CPUs. In particular, incorporating GPU acceleration into pre-existing codes is more difficult than just moving from one CPU family to another; a GPU-savvy programmer will need to dive into the code and make significant changes to critical components.
• Incorporating GPU hardware into systems adds expense in terms of power consumption, heat production and cost. Some job mixes may be served more economically by systems that maximize the number of CPUs that can be brought to bear.

CUDA (Compute Unified Device Architecture):
It is a scalable parallel programming model and software environment for parallel computing (Inam, 1994). CUDA is a parallel programming standard released by NVIDIA (2007). It is generally used to develop software for graphics processors and to build a variety of general purpose GPU applications that are massively parallel and run on hundreds of GPU processors or cores. It uses a language that is very similar to C and has a high learning curve. It extends that language with GPU-specific features, including new API calls and new type qualifiers that apply to functions and variables. CUDA has specific functions called kernels; a kernel can be a function or a full program invoked by the CPU. It also provides shared memory and synchronization among threads.
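It is supported only on NVIDIA GPUs based on the Tesla architecture. The graphics cards that support CUDA are the GeForce 8-series, Quadro and Tesla (Kaur and Nishi, 2010; Inam, 2010). The heterogeneous architecture of CUDA is given in Inam (2010) and Saxena et al. (2014b). The working details of CUDA are given in Kaur and Nishi (2010).
Advantages/Benefits (Kaur and Nishi, 2010):
• It is specifically designed to run for non-graphics purposes.
• Its software development kit includes libraries and various debugging, profiling and compiling tools.
• Programming is simple and easy, as kernel calls are written in a C-like language.
• It provides faster downloads and readbacks to and from the GPU.
• It exposes a fast shared memory region (up to 48 KB per multiprocessor).
Limitations/Drawbacks (Kaur and Nishi, 2010):
• It is restricted to NVIDIA GPUs only.
• It runs its host code through a C++ compiler, so it does not support the full C standard.
• Texture rendering is not supported.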

Multithreading using Java:
Multithreading is a programming concept in which a program (process) is divided into two or more subprograms that can be executed at the same time. A multithreaded program has two or more parts that can run concurrently; each such part is called a thread. A thread is a dispatchable unit of work. Threads are light-weight processes within a process. A process is a collection of one or more threads and associated system resources. Java supports thread-based multitasking (Chapter Multithreaded Programming, 2015; Jain, 2015). The life cycle of a thread is shown in Fig. 4.
In Java we can construct single-threaded as well as multi-threaded applications. A multi-threaded Java program has many entry and exit points, which run concurrently with the main() method. Image processing applications have been implemented by different contributors using both the single-thread and the multithreading approach. In the multithreading approach [Article], the shared memory on which the threads operate is the matrix of image pixels. Java packages can be used to grab the pixel matrix of the image to be processed. Different threads then manipulate different parts of the matrix, depending on the algorithm. The work task and the part of the matrix that each thread manipulates are determined by the main thread. The time needed to manipulate the entire matrix, either by a single thread or by all the threads, is recorded.
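The decomposition described above, in which the main thread assigns each worker a part of the shared pixel matrix, can be sketched as follows. The sketch is written in Python rather than Java and is not the implementation of any cited author; the image is a plain list of rows of gray levels, and the names `split_rows`, `brighten_band` and `brighten_parallel` are hypothetical. Note that in CPython the global interpreter lock prevents pure-Python threads from running pixel arithmetic simultaneously, so the example illustrates the division of work rather than a measured speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(height: int, workers: int) -> list:
    """Divide `height` rows into at most `workers` contiguous (start, end) bands."""
    base, extra = divmod(height, workers)
    bands, start = [], 0
    for w in range(workers):
        end = start + base + (1 if w < extra else 0)
        if end > start:                      # skip empty bands when workers > rows
            bands.append((start, end))
        start = end
    return bands


def brighten_band(image, out, row_start, row_end, delta):
    """Brighten rows [row_start, row_end) of the shared matrix, clamping to 0..255."""
    for r in range(row_start, row_end):
        out[r] = [min(255, max(0, px + delta)) for px in image[r]]


def brighten_parallel(image, delta, workers=4):
    """Main thread: choose the bands, hand one band to each thread, collect the result."""
    out = [None] * len(image)                # shared output matrix, disjoint rows per thread
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(brighten_band, image, out, s, e, delta)
                   for s, e in split_rows(len(image), workers)]
        for f in futures:
            f.result()                       # wait for the band and re-raise any error
    return out


if __name__ == "__main__":
    img = [[0, 100, 250], [10, 20, 30], [255, 5, 128], [1, 2, 3], [200, 210, 220]]
    print(brighten_parallel(img, 10, workers=2))
```

Because every thread writes only to its own rows, no locking is needed on the output matrix. Operations that read neighbouring pixels, such as convolution filters, would additionally need the rows bordering each band.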
Advantages/Benefits: The benefits of using multithreading, as defined in Jain (2015), are given below:
• Threads share the same address space

Parallel computing toolbox in MATLAB (MATLAB Intro):
MATLAB is extensively used for developing and prototyping algorithms. It has several toolboxes, such as the image processing, signal processing and neural network toolboxes. From MATLAB 2010a onwards, the "Parallel Computation Toolbox" is also enabled for student use. Using it, we can solve computationally intensive and data-intensive tasks using multicore processors, GPUs and computer clusters. We can parallelize MATLAB applications without CUDA or MPI programming. It contains high-level constructs such as parallel for-loops, special array types and parallelized numerical algorithms. The toolbox lets us use the full processing power of multicore desktops by executing applications on workers (MATLAB computational engines) that run locally. Without changing the code, we can execute the same applications on a computer cluster or a grid computing service (MATLAB Distributed Computing Server™). We can run parallel applications interactively or in batch (Mathworks, Parallel Computing Toolbox, 2015).

Advantages/Benefits:
• It contains parallel for-loops (parfor), which are used for running task-parallel algorithms on multiple cores/processors.

Drawbacks/Limitations (article of MATLAB):
• Due to the high-level nature of MATLAB, it uses a lot of system resources.
• MATLAB is built on Java, and Java is built upon C. So when we run a MATLAB program, the computer is busy interpreting all that MATLAB code, which consumes extra time.
OpenCL: Open Computing Language (OpenCL) is an open, royalty-free parallel computing API designed to enable GPUs and other coprocessors (Article of Open CL). It is a standard for large-scale parallel processing. It can help image processing, but it is very low level and is designed to simplify taking advantage of many CPU cores and GPU stream processors.

Advantages/Benefits [Guide Open CL][Intro of Open CL]:
• Cross-vendor software portability
• It provides substantial acceleration in parallel programming.

Disadvantages/Limitations [Guide Open CL][Intro of Open CL]:
• It is not easy to learn.

OpenCV:
OpenCV is an image processing library created by Intel and maintained by Willow Garage (Smith, 2014). It is a library for computer vision that includes many generic image processing routines and high-level functions to support face recognition, etc. It is available for C, C++ and Python. Several image processing algorithms can be implemented easily using it.

Parallel image processing plays a vital role in medical imaging. There is rapidly growing interest in applying parallel computation to various medical imaging applications, and this trend is expected to continue as more sophisticated and challenging medical imaging and high-order data visualization problems arise. Much research on parallel image processing has been carried out on different medical image modalities such as MRI, CT, PET, X-ray, ultrasound and optical tomography, since processing these images requires numerous image processing algorithms such as diffeomorphic mapping, image denoising, image reconstruction, motion estimation, deformable registration and modeling. Kadah et al. (2011) summarize various parallel implementations of image processing techniques:
• an accelerated algorithm for brain fiber tracking
• a new 3D deformable registration algorithm for mapping brain datasets
• the low computational efficiency of the conventional active shape model (ACM) algorithm and the potential acceleration achieved when ACM is implemented on a parallel computation architecture
• an investigation of the potential of parallel computation in accelerating image algebraic reconstruction techniques
• a GPU-accelerated finite element solver for computing light transport in scattering media
• an investigation of how different throughput-oriented architectures can benefit the Compressed Sensing (CS) MRI reconstruction algorithm and what levels of acceleration are feasible on different modern platforms
• an implementation of a four-dimensional denoising algorithm on a GPU
• an accelerated automated process for creating complete patient-specific pediatric dosimetry phantoms from a small set of segmented organs in a child's CT scan
• the solution of nonlinear Partial Differential Equations (PDEs) of diffusion/advection type, which are fundamental to most problems in image analysis
• the mapping of an enhanced motion estimation algorithm to novel GPU architectures
Eklund et al. (2011b) showed how the computational power of cost-efficient GPUs can be used to speed up random permutation tests. Parallel computing has also been applied in radiotherapy planning. In addition to these, several studies on parallel medical image registration have been carried out.

CONCLUSION
As discussed above, parallel computing is of great significance in several image processing techniques such as edge detection, histogram equalization, noise removal, image registration, image segmentation, feature extraction, different optimization techniques and many more. It also plays a significant role in the field of medical imaging. In recent years a broad variety of approaches, each with its benefits and limitations, have been proposed for parallel image processing. The present review provides a brief introduction to image processing techniques and to the different tools and techniques for parallel image processing, with their respective features and limitations as given in Table 1. Parallel architectures, tools and techniques for parallel image processing are also discussed, with their advantages and limitations as in Table 2. They can be used in different image processing applications on the basis of their appropriateness, performance, computational cost in terms of time, and applicability. Parallel implementation of image processing is found to be an area of great interest to researchers because of its performance, suitability and availability. Some techniques were found to have limited applications and to need more computational knowledge. However, their performance can be improved by implementing them intelligently; for example, integrating the concepts of Java multithreading with MATLAB can give significant results. Finally, we have also discussed applications of parallel image processing in medical imaging, and we strongly recommend employing parallel image processing in various medical imaging techniques to obtain fast and efficient results for treatment planning.

Fig. 2: Amdahl's law

Fig. 3: Architecture of CPU and GPU [NVIDIA's Article]

Fig. 4: Life cycle of thread

Table 1: Analysis of different parallel implementations of image processing algorithms

Table 2: Analysis of the comparison of execution time of different image processing algorithms on CPU and GPU