Separation of Text Components from Complex Colored Images

Abstract: The objective of this study is to propose a new methodology for separating text from an image. A Gamma Correction Method is applied as a preprocessing technique to suppress non-text regions and retain text regions. Text segmentation is achieved by applying the Positional Connected Component Labeling, Text Region Extraction, Text Line Separation, Separation of Touching Text and Separation of Text Components algorithms. Finally, the position of the starting text component of each word and line is stored in a text file. Experiments are conducted on various images from the datasets collected and tagged by the ICDAR Robust Reading Dataset Collection Team. The proposed method achieves an average recall rate of 97.5% on the separation of text components in an image.


INTRODUCTION
Rapid development of digital technology has resulted in the digitization of all categories of materials. Text data present in images and video contain useful information for vehicle license plate detection, name plate recognition, keyword-based image search, content-based retrieval, text-based video indexing, video content analysis, document retrieval, address block location, etc. Recognition of the text data in document images depends on the efficient separation of text. Many methods have been proposed for text separation in images and videos. It is not easy to devise a unified method, as images may be low-contrast or complex and text varies in font size, style, color, orientation and alignment. Chethan and Kumar (2010) proposed an algorithm to remove graphics from documents captured using a cellular phone and to correct their skew. The basic process of this approach consists of three steps: first, vertical and horizontal projections were used to remove graphics from the images; second, a dilation operation was applied to the binary images and the dilated image was thinned; finally, the Hough transform was applied to estimate the skew angle.

LITERATURE REVIEW
A new text line location and separation algorithm for complex handwritten documents was proposed by Shi and Govindaraju (2004). The method used the concept of a fuzzy directional run length, which imitated an extended running path through a pixel of a document. The method partitioned complex documents to separate their content into text, in terms of text words or text lines, and other graphic areas. Peng et al. (2013) proposed a method to classify machine-printed text, handwritten text and overlapping text. The three classes were initially identified using a G-means based classification followed by a Markov Random Field (MRF) based relabeling procedure. An MRF based classification approach was then used to separate overlapping text into machine-printed and handwritten text using pixel-level features. Patil and Begum (2012) presented a method for discriminating handwritten and printed text in document images based on shape features. A K-nearest neighbor classifier based on minimum distance was used to classify the handwritten and printed text words. Yao et al. (2012) proposed a system that detected texts of arbitrary orientations in natural images. Their algorithm consists of four stages: component extraction, where pixels are grouped together to form connected components using a simple association rule; component analysis, to remove non-text parts; candidate linking, to link adjacent character candidates into pairs; and chain analysis, to discard chains with low classification scores. Coates et al. (2011) presented a text detection and recognition system based on a scalable feature learning algorithm and applied it to images of text in natural scenes.
A new method to locate text in images with complex backgrounds was presented by Gonzalez et al. (2012). The method efficiently combined MSER with a locally adaptive thresholding method. It is mainly composed of three stages: a segmentation stage to find character candidates; a connected component analysis based on fast-to-compute but robust features to accept characters and discard non-text objects; and finally a text line classifier based on gradient features and support vector machines. Phan et al. (2012) proposed novel symmetry features for text detection in natural scene images. Within a text line, the intra-character symmetry captured the correspondence between the inner and outer contours of a character, while the inter-character symmetry helped to extract information from the gap region between two consecutive characters. A formulation based on Gradient Vector Flow was used to detect both types of symmetry points. These points were then grouped into text lines using the consistency of sizes, colors, and stroke and gap thicknesses.
A real-time scene text localization and recognition method was presented by Neumann and Matas (2012). The probability of each Extremal Region (ER) was estimated using novel features calculated with O(1) complexity, and only ERs with locally maximal probability were selected and classified into character and non-character classes using an SVM classifier with the RBF kernel.
A top-down, projection-profile based algorithm to separate text blocks from image blocks in a Devanagari document was proposed by Khedekar et al. (2003). They analyzed the pattern produced by Devanagari text in the horizontal projection: a text block possesses a certain regularity in frequency and orientation and shows spatial cohesion.

PROPOSED METHODOLOGY
The aim of the proposed work is to separate the text components from the input image. The work flow of the system is shown in Fig. 1. The input image of the proposed system has a complex background with text in it. The first stage is preprocessing, which suppresses the non-text background details from the image by applying an appropriate gamma value. Otsu's thresholding algorithm is used to calculate the threshold value, which is applied to this image to create an output binary image.
The output may contain white and black text regions and some noise. In the next stage, the Text Region Extraction algorithm extracts the white text and black text regions from the binary image and stores those text regions as white foreground on a black background. The algorithm also removes very small and very large non-text regions from the image.

Preprocessing technique using the Gamma Correction Method (GCM):
The Gamma Correction Method proposed by Sumathi and Devi (2014) suppresses the non-text background details from the image by applying an appropriate gamma value, thereby removing non-text regions. The algorithm estimates the Gamma Value (GV) without any prior details of the imaging device by using texture measures. By applying this estimated gamma value to an input image (Fig. 2a, c, e), a background-suppressed image (Fig. 2b, d, f) is obtained. Otsu's thresholding algorithm is used to calculate the threshold value, which is applied to this image to create an output binary image (Fig. 3a, 4a, 5a). This binary image (I) is the input of the text extraction algorithm.
Step 1: Find the foreground (white) pixels and record the foreground column positions of the binary matrix image in a matrix (Position Matrix).
Step 2: Unmark all the cells of the Position Matrix (PM).
Step 15: Stop the procedure.
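A minimal sketch of the preprocessing stage described above, assuming the gamma value is already estimated (the paper's texture-measure estimation of GV is not reproduced here):

```python
def gamma_correct(gray, gamma):
    """Apply power-law (gamma) correction to a grayscale image.
    `gray` is a 2-D list of intensities in 0..255; `gamma` is assumed
    to be supplied (the paper estimates it from texture measures)."""
    return [[round(255 * (p / 255.0) ** gamma) for p in row] for row in gray]

def otsu_threshold(gray):
    """Return the Otsu threshold maximizing between-class variance."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var, w_b, sum_b = 0, -1.0, 0, 0.0
    for t in range(256):
        w_b += hist[t]                  # background weight
        if w_b == 0:
            continue
        w_f = total - w_b               # foreground weight
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b               # background mean
        m_f = (total_sum - sum_b) / w_f # foreground mean
        var = w_b * w_f * (m_b - m_f) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(gray, t):
    """Threshold a grayscale image into a 0/1 binary image."""
    return [[1 if p > t else 0 for p in row] for row in gray]
```

On a bimodal image, `otsu_threshold` lands between the two intensity clusters, so `binarize` separates text from background in one pass.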

Text Region Extraction (ETR) algorithm:
The binary image obtained in the previous phase may contain white and black text regions. In the Text Region Extraction algorithm, white text regions on a black background and black text regions on a white background are extracted from the binary image, and those text regions are stored as white foreground on a black background. The algorithm uses the PCCL algorithm to find connected components. It also removes very small and very large components from the image.
The algorithm for extracting the text region is presented below. Step 11: Stop the procedure.
To illustrate the method, Fig. 3a is taken. The regions of interest are G, O (white pixels) and T, e, s, c, o, D, i, e, t, s, l, o, s, e, a, s, t, o, n, e, i, n, 8, w, e, e, k, s (black pixels). The output after step 2 of the ETR algorithm is shown in Fig. 3b. As per step 3, Fig. 3a is reversed (Fig. 3c) to find out whether there is any black region-of-interest component. There are six components that look like a filled 'o' in Fig. 3c and one component that looks like a filled 'o' near the bottom-right corner in Fig. 3d. The white pixels inside the letter 'e' in Tesco, the letters 'o' in lose, 'a', the letter 'o' in stone and the digit '8' are treated as components by the Positional Connected Component Labeling algorithm. The white pixel inside the letter 'e' is also a component; however, according to step 2 of the ETR algorithm, it is treated as a very small component and removed. The six components appearing in Fig. 3b and the one component in Fig. 3d, 4d and 5d are removed after steps 6 and 7 of the ETR algorithm. The output of step 6 of ETR is shown in Fig. 3e, 4e and 5e and the output of step 7 in Fig. 3f. Merging the outputs of steps 6 and 7 of the ETR algorithm gives the output image L (Fig. 3g). Figure 4b to f and Fig. 5b to f show the stages involved when the algorithm is applied to Fig. 4a and c, respectively.

Separation of Text Row (STR) algorithm:
The aim of this algorithm is to break the image into row images by using the maximum and minimum row positions of the text components.
The separation of text lines consists of the following steps: Step 5: Stop the procedure. To illustrate the STR algorithm, Fig. 3g, 4g and 5g are taken. The output of step 4(ii) of the STR algorithm is shown in Fig. 6a, c, e, g, i and k. In Fig. 6g the component 'e' in the word 'stone' is the StartLabel, as its TopRowPosition is greater than that of the other components. The exact row height of component 'e', from column 1 to the maximum column of the image, is examined. Partial parts of the components l, o, s, e, a, s, t are found in Fig. 6g, so those components are decided to belong to the same row. Figure 6h is obtained after step 4(vi) is applied. The output rows R[] are shown in Fig. 6b, d, f, h, j and l. In Fig. 6i a part of the 'l', 'o', 's' components appears, but those components are not taken into consideration in the creation of a new row, as they have already been found in the previous row (step 5(vii) of STR) (Fig. 7).
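The row-breaking idea can be sketched as grouping component bounding boxes by vertical overlap. This is a simplification assumed for illustration; the STR steps track TopRowPosition and exact row height explicitly rather than using pairwise overlap:

```python
def group_into_rows(boxes):
    """Group component bounding boxes (top, bottom, left, right) into
    text rows: a component joins an existing row when its vertical
    extent overlaps the row's extent, otherwise it starts a new row."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[0]):  # scan top to bottom
        top, bottom = box[0], box[1]
        for row in rows:
            # vertical overlap test against the row's running extent
            if top <= row["bottom"] and bottom >= row["top"]:
                row["boxes"].append(box)
                row["top"] = min(row["top"], top)
                row["bottom"] = max(row["bottom"], bottom)
                break
        else:
            rows.append({"top": top, "bottom": bottom, "boxes": [box]})
    return rows
```

Each returned row carries its merged vertical extent, which corresponds to the minimum and maximum row positions the STR algorithm uses to crop row images.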

Detect and Split Touching Text (DSTT) algorithm:
This algorithm uses the component width size and an outlier algorithm to detect touching components. The separation of a touching component is done using morphological thinning and a lightly populated area algorithm.
The DSTT algorithm is presented by the following steps; the detected touching components are shown in Fig. 8c and d. The size of the component in Fig. 8c is greater than Expected_Component_size, so the component is split at the calculated junction points (Fig. 8f). The component is now split into 2 components (Fig. 8g and d). The algorithm of Text Position Detail (TPD) is given below (Table 1).
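Only the detection half of DSTT is sketched below: component widths are screened with an interquartile-range outlier rule. The IQR rule is an assumption (the paper does not specify its outlier test), and the thinning-based split itself is not reproduced:

```python
def find_touching(widths, k=1.5):
    """Return indices of component widths that are upper outliers by
    the IQR rule; unusually wide components are likely two or more
    touching characters. `k` is the conventional 1.5 IQR multiplier."""
    s = sorted(widths)
    n = len(s)
    q1 = s[n // 4]            # approximate first quartile
    q3 = s[(3 * n) // 4]      # approximate third quartile
    upper = q3 + k * (q3 - q1)
    return [i for i, w in enumerate(widths) if w > upper]
```

Because the widths are collected per row, the test adapts to each line's font size, matching the paper's remark that the method is insensitive to text size.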

EXPERIMENTAL RESULTS
The performance of the proposed technique has been evaluated based on precision, recall and F-score. Precision and recall rates have been computed based on the number of correctly detected characters (TP) in an image, in order to evaluate the efficiency and robustness of the algorithm. The metrics are as follows.
Definition 1: False Positives (FP)/false alarms are those regions in the image which are actually not characters of text, but have been detected by the algorithm as text.
The algorithms were implemented in MATLAB, and the experimentation was carried out on the ICDAR dataset consisting of 100 different images, as well as some images taken from the Web. Some of the experimental results have been shown in the sections above. The results show that the proposed method separates the text components of an image and can separate most text regions successfully, including text with different styles, sizes, fonts, orientations and colors. This approach resulted in an average precision rate of 88%, recall rate of 97.5% and F-score of 92.5%.

CONCLUSION AND RECOMMENDATIONS
This study presents a new algorithm for the separation of text region information in an image. The proposed method uses the positional connected component labeling, text region extraction, text line separation, separation of touching text and separation of text components algorithms. The proposed technique is an essential stage for most object recognition methods. The algorithm was applied to several images from the ICDAR datasets with text of different styles, sizes, fonts and alignments on complex backgrounds, and has shown promising results. Future work will concentrate on the next stage of developing a text recognition algorithm from the output obtained by the newly proposed text separation technique.

Fig. 1: Work flow of text separation from image

Step 3: Find the minimum position value (MinPos) and maximum position value (MaxPos) from the Position Matrix.
Step 4: Set the value of LABEL to 1 and Current Row (CR) to 1. AV[] = {}, PV[] = {}.
Step 5: Get the first unmarked value (umv) from the Position Matrix (PV[] = umv). Set CR to the row number of umv. FLAG = NOTPREVMARKED and L = LABEL. If no unmarked cell is found, go to step 15.
Step 6: Find the adjacent values (AV[]) of the position values (PV[]) from PM. The adjacent values are P-1, P and P+1 for a value P. If P is MinPos, the adjacent values are P and P+1. If P is MaxPos, the adjacent values are P-1 and P.
Step 7: Search for AV[] in rows CR, CR-1 and CR+1 and mark the corresponding cells. If any of these cells were already labeled by a previous pass, set FLAG = PREVMARKED and L = the label assigned to the already labeled cell. (Do not include CR-1 for the first row or CR+1 for the last row.)
Step 8: Increment CR by one.
Step 9: Scan CR and find the adjacent values (AV[]) of the cells marked in step 7 in row CR.
Step 10: PV[] = AV[]. Go to step 6 unless AV[] = Ø or CR > LAST ROW.
Step 11: Assign the value L to the corresponding cells marked during this pass in the input image (I).
Step 12: If FLAG is NOTPREVMARKED, increment LABEL by one.
Step 13: If any unmarked cells are found, go to step 5.
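The labeling that the PCCL steps produce can be illustrated with a standard queue-based connected component pass. This is not the paper's position-matrix bookkeeping, only a compact stand-in that yields the same labeled components:

```python
from collections import deque

def label_components(binary):
    """Label 8-connected foreground (1) pixels in a 2-D 0/1 image.
    Returns the label image and the number of components found."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 1
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] == 1 and labels[r][c] == 0:
                # flood-fill this component with the next label
                q = deque([(r, c)])
                labels[r][c] = next_label
                while q:
                    y, x = q.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and binary[ny][nx] == 1
                                    and labels[ny][nx] == 0):
                                labels[ny][nx] = next_label
                                q.append((ny, nx))
                next_label += 1
    return labels, next_label - 1
```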

Input: Binary image matrix I
Output: Text image matrix L, No_of_Text_Components
Step 1: Apply the PCCL algorithm to the input image I to produce image L1. The algorithm treats white pixels as foreground (region of interest).
Step 2: Find the size of each component and remove very small and very large components from image L1. (The white text components of I are extracted.)
Step 3: Reverse the image I (change white pixels to black and black pixels to white): RI = ~I.
Step 4: Apply the PCCL algorithm to the reversed image RI to produce image L2.
Step 5: Find the size of each component and remove very small and very large components from image L2. (The black text components of I are extracted.)
Step 6: Find all components (RCs) of L1 that fit inside the components of L2 and remove them from image L1.
Step 7: Find all components (RCs) of L2 that fit inside the components of L1 and remove them from image L2.
Step 8: Assign 1 to every pixel value greater than zero in images L1 and L2: L1 = (L1 > 0), L2 = (L2 > 0).
Step 9: L = L1 + L2. Apply the PCCL algorithm to L to assign the label values.
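Steps 2 and 5 above, removing very small and very large components by pixel count, can be sketched as follows; the size limits are parameters here because the paper does not state its thresholds:

```python
def filter_by_size(labels, min_size, max_size):
    """Given a label image (0 = background), zero out every component
    whose pixel count is outside [min_size, max_size], keeping only
    likely character-sized regions."""
    counts = {}
    for row in labels:
        for v in row:
            if v:
                counts[v] = counts.get(v, 0) + 1
    keep = {v for v, n in counts.items() if min_size <= n <= max_size}
    return [[v if v in keep else 0 for v in row] for row in labels]
```

Applied to L1 this keeps the white text components; applied to the labeling of the reversed image it keeps the black text components, as the ETR steps describe.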

Fig. 8 to 10: Detect and split of touching components. Figures 8 to 10 are obtained on the 2nd, 3rd and 4th iterations of step 6, respectively. The output of the algorithm is shown in Fig. 11. The algorithm is insensitive to the size of the text, as the component width size is calculated for each row.

Sequence Separation of Text Components (SSTC) algorithm: This algorithm is used to separate the individual text components from the row images. The components of each text line obtained by the text row algorithm are sorted in ascending order according to the leftmost column position of the component and are extracted one by one in that order. The SSTC algorithm is formulated by the following steps.
Input: Row images R[]
Output: Corrected row images R[] (labels assigned sequentially from the first component of the first row to the last component of the last row), NewLineNo[], Total_No_of_Component, TextComponents[]
Step 1: StartLabel = 1
Step 2: Repeat steps 3 to 7 for i = 1 to No_of_Row
Step 3: The PCCL algorithm is applied on R[i], as the number of components may have increased due to the Detect and Split Touching Text algorithm
Step 4: Reassign the labels of the text components in row R[i] in sequence order starting from StartLabel, according to the left column pixel position of each component
Step 5: NewLineNo[i] = MinLabel assigned in row R[i]
Step 6: Total_No_of_Component = MaxLabel assigned in row R[i]
Step 7: For j = StartLabel to Total_No_of_Component: i. [r c] = find(R[i] == j); ii. TextComponents[j] = R[i](min(r):max(r), min(c):max(c))
Step 8: StartLabel = MaxLabel assigned in row R[i] + 1
Step 9: Stop the procedure
According to step 4 of the SSTC algorithm, the components of each text row R[] are sorted in ascending order according to the leftmost column position of the component. The text components are shown in Fig. 12a to c. The array NewLineNo[] contains the starting text component number of each row.

Text Position Detail (TPD) algorithm: The details of the text position of each text component, row line and word are stored in text files. The starting text component number of each row is already obtained by the SSTC algorithm, and those details are stored in 'newlinedet.txt'. The space details of each text component are calculated to frame words.

Fig. 12: (a) to (c) Text separation
Table 1: Text row start positions of Fig. 2c
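Step 4 of SSTC, ordering each row's components by leftmost column and numbering them sequentially across rows, can be sketched as follows; the representation of a row as (left_column, component) pairs is a simplification assumed for illustration:

```python
def sequence_labels(rows_of_boxes):
    """Assign labels 1..N across rows, ordering each row's components
    by leftmost column. `rows_of_boxes` is a list of rows, each a list
    of (left_col, component_id) pairs. Returns the relabeled rows and
    NewLineNo, the first label assigned in each row."""
    label = 1
    relabeled, new_line_no = [], []
    for row in rows_of_boxes:
        new_line_no.append(label)          # step 5: first label of row
        ordered = []
        for left, comp in sorted(row, key=lambda p: p[0]):
            ordered.append((label, comp))  # step 4: sequential relabel
            label += 1
        relabeled.append(ordered)
    return relabeled, new_line_no
```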

Step 1: Open files to save the details of the text positions:
a. fidsp = fopen('spacedet.txt','wt')
b. fidword = fopen('worddet.txt','wt')
c. fidnl = fopen('newlinedet.txt','wt')
Step 2: The starting position of the component of each row is stored in newlinedet.txt using NewLineNo[]
Step 3: Repeat steps 4 to 6 for i = 1 to No_Of_Row
Step 4: The space gap between each pair of successive components is calculated and stored in sp[] and in 'spacedet.txt'. (SpaceGap = -1000 for the first character of each row.)
Step 5: The outlier value(s) of sp[] are calculated (excluding SpaceGaps of -1000)
Step 6: Outlier space gaps are treated as delimiters of words. These details are stored in worddet.txt
Step 7: Stop the procedure
To illustrate, the image in Fig. 2c is taken. In step 2 of the TPD algorithm, the starting text position calculated by the SSTC algorithm is saved in 'newlinedet.txt', and the component position, leftmost column position of the component, rightmost column position of the component, width size of the component and the gap between 2 adjacent text components (gap = -1000 for the first character of a new row) are stored in 'spacedet.txt'. To find the starting position of each word, sp[] (the gap between 2 adjacent text components) calculated for each row is examined. The outliers of sp[] for each row are calculated, and the text positions whose gaps are outliers (marked in red in Tables 2 to 5 and 8) or -1000 are saved as word gaps in 'worddet.txt'. The descriptions of the columns C1 to C5 and N in Tables 2 to 9 are given below: C1 - text position, C2 - minimum column position of the current text, C3 - maximum column position of the previous text, C4 - text width, C5 - gap between 2 adjacent texts (C2 - C3), N = -1000.
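The word-delimiting logic of steps 4 to 6 can be sketched as follows, with an IQR rule assumed in place of the paper's unspecified outlier computation, and -1000 used, as in the paper, as the row-start sentinel:

```python
def word_starts(gaps, k=1.5):
    """Return the component indices that start a word. `gaps[i]` is the
    gap between component i and its predecessor; -1000 marks the first
    component of a row. Among the real gaps, upper IQR outliers are
    treated as word delimiters (assumed in place of the paper's
    unspecified outlier test)."""
    real = sorted(g for g in gaps if g != -1000)
    n = len(real)
    if n >= 4:
        q1, q3 = real[n // 4], real[(3 * n) // 4]
        upper = q3 + k * (q3 - q1)
    else:
        upper = max(real) if real else 0  # too few gaps: no outliers
    return [i for i, g in enumerate(gaps)
            if g == -1000 or g > upper]
```

Inter-character gaps cluster tightly, so only the much larger inter-word gaps exceed the outlier bound, mirroring the entries marked as word gaps in Tables 2 to 5 and 8.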

Definition 2: False Negatives (FN)/misses are those regions in the image which are actually text characters, but have not been detected by the algorithm.
Definition 3: Precision rate (P) is the ratio of correctly detected characters to the sum of correctly detected characters and false positives: Precision = TP/(TP + FP) × 100%.
Definition 4: Recall rate (R) is the ratio of correctly detected characters to the sum of correctly detected characters and false negatives: Recall = TP/(TP + FN) × 100%.
Definition 5: F-score is the harmonic mean of the recall and precision rates.
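Definitions 3 to 5 reduce to a few lines:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-score (in percent, as reported in the
    paper) from character counts, per Definitions 3 to 5."""
    precision = tp / (tp + fp) * 100          # Definition 3
    recall = tp / (tp + fn) * 100             # Definition 4
    f_score = 2 * precision * recall / (precision + recall)  # Def. 5
    return precision, recall, f_score
```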

Table 2: Text space detail of Fig.

Table 6: Word start position of Fig. 2a

Table 9: Word start position of Fig. 2e