Recognition of Multi-lingual Handwritten Numerals Using Partial Derivatives

Recognition of multi-font and multi-lingual handwritten numerals has been a demanding requirement in this decade. This research work proposes a multi-lingual handwritten numeral recognition method that uses partial derivatives to classify handwritten numerals of five major Indian languages. The objective of the proposed work is to design and develop a recognition algorithm for multilingual handwritten numerals. This objective is achieved through the creation of handwritten numeral databases (data collection and preprocessing), a round off mean aspect ratio value based representation and the identification of features using partial derivatives. The features derived from partial derivatives are stored in a five dimensional column vector, which yielded recognition rates of 94.80, 95.89, 96.44, 95.81 and 92.03% for Kannada, Gurumukhi, Sindhi, Malayalam and Tamil handwritten numerals, respectively.


INTRODUCTION
Extraction of numerals from student answer scripts, identification of distances from traffic information boards and recognition of numeral information from tabular forms are some of the applications of a numeral recognition system. Recent computer and communication technologies, such as word processors with multiple fonts and font sizes, electronic mail and fax machines, have also increased the number of readers, the level of literacy and the ways in which people write. According to Plamondon and Srihari (2000), even with the evolution of these technologies, handwriting recognition is still challenging. Students use pen and paper to write language content, equations and graphical drawings rather than a notebook computer. Such computer systems usually have a keyboard and mouse as the interface for human-machine interaction. A limitation of input devices like the keyboard is that they have limited space to accommodate the symbols of a language.
For languages with a small character set, such as English, which has 26 upper case letters, 26 lower case letters and 10 numerals, the keyboard is easy to use, whereas for languages with larger character sets, such as Chinese (50000 characters), Japanese (35000 characters) and Tamil (246 characters), it is difficult to use.
The Japanese language, adopted from the Chinese character set, uses Kanji alphanumerics consisting of 6349 characters (Tappert et al., 1990). These requirements led to the development of on-line recognition systems (electronic tablets) and off-line recognition systems (optical recognition systems). The ultimate aim of these systems is to process handwritten data containing arbitrary user-written alphabets, symbols or graphical marks. To recognize a numeral on-line, the handwritten data has to be converted into digital data by using a special pen to write on an electronic surface.
In this system, the machine recognizes the handwritten data either while the user writes or at a later time. Some on-line recognizers are even capable of learning from the writer's samples, to adapt for subsequent recognition. In an off-line recognition system, the machine recognizes the handwritten character after the writing has been completed. The recognition process can be performed days, months or even a year later. Such a system also facilitates interaction with the user without a keyboard, and it would be of great help to physically (visually) challenged people when interfaced with a text-to-voice synthesizer. The aim of the research work is to design and develop a recognition algorithm for multilingual handwritten numerals.
The major works reported in the literature are presented in the following paragraphs in terms of the features used for recognition, the classification algorithms and the recognition rates. Sung-Bae (1996) normalized the input image to 16-by-16 pixels, which was later compressed to a 4-by-4 feature vector and used as a global feature. This resulted in a 96.05% recognition rate. Sabaei and Faez (1997) selected features such as Pseudo-Zernike and Legendre moments for recognition. For each Farsi numeral, 200 handwritten samples were collected from different persons. As the numerals 0, 4 and 6 each have different shapes, the total number of digit classes for Farsi numerals is 13, and the best recognition rate achieved is about 95% when the order of the moments is higher than five. Elnagar et al. (1997) proposed a method for recognizing handwritten numerals in Hindi based on structural descriptors. The process involves scanning the handwritten numeral, normalizing it to 30-by-30 pixels and then thinning it. In the second step, features like strokes and cavities are extracted and represented syntactically. Finally, the syntactic representation of the features is matched against a stored set of syntactic prototypes; a maximum recognition rate of 94% has been reported. Sanossian (1998) used three primitive features in a segment, namely boundary distance, pixel density and line distance from the centroid, to extract features from Hindi numerals and obtained an average recognition rate of 95.8%. Chen and Ng (1999) proposed a crossing feature coding method to extract the features and obtained a recognition accuracy of 91.3% for handwritten numerals and 99% for printed numerals. Zhang et al. (2000) extracted global features and fine segment features of handwritten numerals in eight directions and obtained a 98.5% recognition rate for handwritten numerals, using a feed forward neural network with two hidden layers as the classifier. Al-Omari (2001) used the object's Centre of Gravity (COG) and angle of orientation as key features of the shape. The testing set involved only 20 numerals for each class and the recognition rate was found to be 87.22%.
Haar wavelets and the discrete wavelet transform have been used by Mowlaei et al. (2002) for feature extraction and classification, which resulted in a 91.81% recognition rate. Zhang et al. (2004) proposed a hybrid feature extraction method for the recognition of numerals. It comprises geometrical features, namely end points, loops, joints, local segment features, middle line and convexity, together with coefficients of a two dimensional complex wavelet transformation, and has resulted in a 99.1% recognition rate. Mozaffari et al. (2005) used standard feature points for decomposing the numeral into its primitives.
Principal Component Analysis has been used to obtain global codes of the same size. Using a Nearest Neighbour Classifier (NNC), a 94.44% recognition rate has been achieved. Impedovo et al. (2006) proposed a new technique for zoning description in which the best classification results on the training sets were achieved by partitioning the pattern image into M = 9 zones when handwritten numeral digits were considered. It has also been mentioned that the optimal zoning still outperforms the traditional zoning method on the testing sets.
Purkait and Chanda (2010) applied morphological opening and closing operations to the preprocessed image and obtained 500 features. Line structuring elements along the major, minor, vertical and horizontal directions have been used to get four different images. This method yielded a 97.75% recognition accuracy. Hossain et al. (2011) proposed a rapid feature extraction method that computes the projection of each section formed by partitioning the image. This method resulted in a 94.12% recognition accuracy using a probabilistic neural network, 94.10% using a k-nearest neighbour classifier and 92.03% using a feed forward back propagation neural network. Majhi et al. (2011) developed a classifier for Oriya handwritten numerals. Curvature and image gradient features have been extracted from images normalized to 64-by-64 pixels. Using these primitive features, 2,592 features are obtained, which have been reduced to 64 using the Principal Component Analysis (PCA) technique. This feature extraction method yielded an accuracy of 98 and 94% for the gradient and curvature features, respectively. Kartar et al. (2011) used three different feature sets, namely projection histogram, distance profile and Background Directional Distribution (BDD), and reported recognition rates of 99.2, 98 and 99.13%, respectively, using an SVM classifier with a Radial Basis Function (RBF) kernel for Gurumukhi handwritten numerals, with only 150 samples. Mamatha et al. (2011) used the k-means clustering algorithm for classification and the directional chain code method to extract features from images resized to 30-by-30 pixels, and obtained a 96% recognition rate for Kannada numerals. Roy et al. (2012) proposed a quad tree based feature set with an SVM classifier and reported a recognition rate of 93.38% for Bangla numerals. Baheti and Kale (2013) used an affine invariant moments feature extraction approach for Gujarati numerals and reported a recognition rate of 94% using a Support Vector Machine (SVM) as the classifier. Baheti and Kale (2013) also used zone based pixel density values as the features for Gujarati numerals and reported a recognition rate of 94% using a neural network as the classifier. Medhi and Kalita (2014) proposed features based on blobs or stems in the shape of Assamese numerals. A decision tree has been used as the classification algorithm and an 80% recognition rate has been obtained.
Pirlo and Impedovo (2012) proposed an algorithm based on Voronoi diagrams and achieved 94% accuracy for handwritten Latin numerals. Reddy et al. (2012) applied projection profiles as the primary features and normalized the individual numerals to an image size of 64-by-64 pixels. A total of 832 features have been used and an accuracy of 99.3% has been obtained for the off-line Assamese language.
Bhattacharya and Chaudhuri (2009) created databases for two Indian scripts, namely Bangla and Devanagari. After applying multilayer perceptron classifiers, recognition accuracies of 98.2 and 99.04% have been obtained for Bangla and Devanagari numerals, respectively.
The objectives of the proposed research work are: (i) development of handwritten numeral databases for five different Indian languages, namely Kannada, Gurumukhi, Sindhi, Tamil and Malayalam; (ii) finding a round off mean aspect ratio value based representation scheme; (iii) identification and extraction of features using partial derivatives; and (iv) recognition of multilingual numerals using a distance metric.

HANDWRITTEN NUMERAL DATABASE COLLECTION FOR VARIOUS INDIAN LANGUAGES
To solve a pattern recognition problem, in which a feature vector and a classifier are identified to automate handwritten numeral recognition, it is essential to have data that represents the conditions under which the system is expected to operate.
One of the challenges faced in recognizing handwritten numerals of various Indian languages is the absence of standard databases. However, standard databases such as MNIST, Center of Excellence for Document Analysis and Recognition (CEDAR) and Centre for Pattern Recognition and Machine Intelligence (CENPARMI) are available for Latin numerals. According to the Eighth Schedule to the Indian Constitution, there are 22 official languages in India, namely Assamese, Bengali, Bodo, Dogri, Gujarati, Hindi, Kannada, Kashmiri, Konkani, Maithili, Malayalam, Manipuri, Marathi, Nepali, Oriya, Punjabi (written in the Gurumukhi script), Sanskrit, Santali, Sindhi, Tamil, Telugu and Urdu. Out of these, the scripts selected for this research work are Kannada, Gurumukhi, Malayalam, Sindhi and Tamil.

Data collection:
In the proposed work, handwritten samples were collected using a specialized tabular form. Each numeral of a particular language was written in a frame box of size 1.5 cm × 1.5 cm, on an A4 size white sheet with light gray background lines provided as separators. Each A4 size white sheet can accommodate a maximum of 400 handwritten numerals (25 square boxes in each column and 16 square boxes in each row), and six white sheets have been used to collect the individual numerals. Depending on their willingness, the candidates were asked to write the different numerals one or more times. The objective of the data collection was not disclosed to the candidates, who were restricted to writing only one numeral per square box.

Data sources:
The numerals of the various languages selected for this research work have been written by different categories of people, namely university and college students and research scholars. In a real life application, the type of pen used may be an ink pen, gel pen or ball point pen and the colour of the ink may also vary. In the proposed research work, only blue and black inks were allowed for filling in the tabular forms. The filled sheets were scanned at a resolution of 300 dots per inch. A Hewlett Packard scanner has been used for scanning and the scanned image has been stored as a binary image (Fig. 1).
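As a rough check on the size of the scanned material, the frame box size and the scanning resolution stated above fix the approximate number of pixels along one side of a box. The sketch below is illustrative only; the helper name `cm_to_pixels` is an assumption and is not part of the described system.

```python
CM_PER_INCH = 2.54


def cm_to_pixels(length_cm, dpi):
    """Convert a physical length in centimetres to pixels at a given scan resolution."""
    return length_cm / CM_PER_INCH * dpi


# A 1.5 cm frame box scanned at 300 dots per inch
box_side_px = cm_to_pixels(1.5, 300)
print(round(box_side_px))  # about 177 pixels per side
```

Each frame box therefore spans roughly 177 pixels per side before any segmentation or resizing.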

Roundoff mean aspect ratio value based representation for binary numeral images:
After performing image preprocessing and segmentation, image representation and feature selection of numerals play a vital role in a recognition system. Let L be a language and let the numeral pattern classes of L be denoted by P_1, P_2, ..., P_W, where W represents the number of pattern classes. Each pattern class P_i is represented by a pattern vector x. In order to obtain 'n' common descriptors for each pattern class of a language, the round off mean aspect ratio value based zoning representation scheme has been proposed in this research work. Using the proposed representation scheme, the number of zones has been identified along both axes and the zone size would be defined for all the numerals in a language. One of the advantages of the proposed zoning representation scheme is that it could handle various handwritten styles of different writers without affecting the shape of the numerals.

Representation of round off mean aspect ratio values in terms of zones:
Table 1 shows, for each language, the number of training samples used to obtain the mean aspect ratio value, the mean aspect ratio value, the round off mean aspect ratio value and the number of zones along the y and x axes.
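The computation of a round off mean aspect ratio value can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the aspect ratio is taken here as the height of the bounding box of the text pixels divided by its width, and the mean is rounded to one decimal place. Neither choice is stated in the text, and the function names are hypothetical.

```python
def bounding_box_aspect_ratio(image):
    """Height / width of the tightest box around the text pixels (value 1).

    `image` is a list of rows of 0/1 values. Taking height over width is an
    assumption made for this sketch.
    """
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    if not rows or not cols:
        raise ValueError("image contains no text pixels")
    height = rows[-1] - rows[0] + 1
    width = cols[-1] - cols[0] + 1
    return height / width


def round_off_mean_aspect_ratio(images, decimals=1):
    """Mean aspect ratio over the training images, rounded (precision assumed)."""
    ratios = [bounding_box_aspect_ratio(img) for img in images]
    return round(sum(ratios) / len(ratios), decimals)


# Two toy numerals: a wide one (2 x 4 box) and a tall one (3 x 2 box)
wide = [[0, 0, 0, 0, 0],
        [0, 1, 1, 1, 1],
        [0, 1, 0, 0, 1],
        [0, 0, 0, 0, 0]]
tall = [[1, 1],
        [1, 0],
        [0, 1]]
print(round_off_mean_aspect_ratio([wide, tall]))  # (0.5 + 1.5) / 2 = 1.0
```

How the rounded value is mapped to a number of zones along each axis is not specified here, so the sketch stops at the rounded ratio.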

Extraction of features using partial derivatives:
In this research work, five primitive features based on partial derivatives of a two dimensional numeral image have been identified and extracted from the density matrix of the numerals. The density matrix of a numeral has been defined as the number of white pixels (text pixels) in each zone. The primitive features are differences between the density of a zone and the densities of its neighbouring zones, beginning with zd(zd_xaxis + xincr, zd_yaxis) - zd(zd_xaxis, zd_yaxis) and ending with a second order expression in which 4zd(zd_xaxis, zd_yaxis) is subtracted from a sum of neighbouring zone densities, where 'zd' stands for the zone density of a numeral image, 'zd_xaxis' and 'zd_yaxis' stand for the position of a zone along the x and y axes and 'xincr' and 'yincr' are the corresponding increments. All the feature values are calculated and converted into absolute values. These feature values are stored for each numeral of a pattern class and the feature values are clustered based on the single linkage algorithm. The mean feature value of the numeral images in each cluster has been stored in the feature vector. The number of features obtained using these primitives varies with the round off mean aspect ratio value of a language. For Kannada and Tamil, the number of zones along the y axis is 9 and along the x axis is 10, so the number of features required is 395, whereas for Gurumukhi and Sindhi the number of zones along the y axis and x axis are 7 and 5, respectively, and the number of features is 141. Similarly, for Malayalam, the number of zones along both axes is 5 and hence the number of features is 97.
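The density matrix and the first of the primitive features, zd(x+1, y) - zd(x, y), can be sketched as below. This is an illustrative sketch rather than the authors' MATLAB code: the function names are hypothetical, text pixels are assumed to be stored as 1, and the image is assumed to have already been resized so that its sides are multiples of the zone size (16-by-16 in the proposed scheme; the toy example uses 2-by-2 zones to stay small).

```python
def zone_density(image, zone_size=16):
    """Count the text pixels (value 1) in each zone_size x zone_size zone.

    Returns a matrix with one row per zone along y and one column per zone along x.
    """
    rows, cols = len(image), len(image[0])
    if rows % zone_size or cols % zone_size:
        raise ValueError("image sides must be multiples of zone_size")
    return [
        [
            sum(image[r][c]
                for r in range(zy, zy + zone_size)
                for c in range(zx, zx + zone_size))
            for zx in range(0, cols, zone_size)
        ]
        for zy in range(0, rows, zone_size)
    ]


def forward_difference_x(zd):
    """Absolute value of zd(x+1, y) - zd(x, y) for horizontally adjacent zones."""
    return [[abs(row[x + 1] - row[x]) for x in range(len(row) - 1)] for row in zd]


# Toy 4 x 4 binary image split into 2 x 2 zones
image = [[1, 1, 0, 0],
         [1, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
zd = zone_density(image, zone_size=2)
print(zd)                        # [[3, 0], [0, 4]]
print(forward_difference_x(zd))  # [[3], [4]]
```

As a side observation on the reported numbers, and not a formula given in the text, the feature counts 395, 141 and 97 agree with 5mn - 3m - 3n + 2 for zone grids of 9 by 10, 7 by 5 and 5 by 5, respectively.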
Figures 2 to 7 show the extraction of a feature value from Kannada numeral five.

In the testing numerals algorithm, the features are then extracted and classified:
Step 4 : Extract the features based on partial derivatives
Step 5 : Apply the distance metric between the generated features and the features stored in the library for a particular language and assign the numeral the class of the nearest feature vector in the library
End of the algorithm
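The classification in Step 5 amounts to a nearest prototype search. The sketch below assumes the Euclidean distance as the distance metric and a library stored as a list of (class, prototype vector) pairs; both the data layout and the function names are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def classify(features, library):
    """Return the class of the library prototype nearest to `features`.

    `library` is a list of (numeral_class, prototype_vector) pairs; a class
    may contribute several prototypes, one per cluster.
    """
    best_class, best_distance = None, math.inf
    for numeral_class, prototype in library:
        distance = euclidean(features, prototype)
        if distance < best_distance:
            best_class, best_distance = numeral_class, distance
    return best_class


# Toy library with one prototype for numeral 0 and two for numeral 5
library = [(0, [0, 0, 0]), (5, [3, 4, 0]), (5, [10, 10, 10])]
print(classify([3, 3, 0], library))  # 5
```

Keeping several prototypes per class lets one numeral class cover different writing styles, at the cost of a larger library to search.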

EXPERIMENTAL RESULTS
The proposed feature extraction algorithm using partial derivatives has been implemented using MATLAB version R2007b. A Pentium Dual Core CPU with a processor speed of 2.0 GHz and 3.0 GB of RAM has been used for carrying out the experiments. Testing of sample handwritten numerals has been done for the languages Kannada, Gurumukhi, Sindhi, Tamil and Malayalam and the results obtained are discussed in the following sub sections.

Results and discussion for Kannada numerals:
The samples used in this research work for the Kannada language are 16,200 for training and 4,963 for testing. The number of clusters formed using the proposed training algorithm for the Kannada numerals 0 to 9 is listed in Table 2. The confusion matrix obtained using the proposed algorithm for the Kannada numerals 0 to 9, the number of samples tested and the corresponding percentage of accuracy are listed in Table 3.
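The percentage of accuracy reported alongside a confusion matrix can be derived as sketched below. The matrix values are invented toy numbers for two classes and are not results from this work; rows are assumed to hold the true class and columns the recognized class.

```python
def per_class_accuracy(confusion):
    """Percentage of correctly recognized samples for each true class (row)."""
    return [
        100.0 * row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(confusion)
    ]


def overall_accuracy(confusion):
    """Percentage of all tested samples that lie on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total


# Toy two-class confusion matrix (illustrative values only)
toy = [[48, 2],
       [5, 45]]
print(per_class_accuracy(toy))  # [96.0, 90.0]
print(overall_accuracy(toy))    # 93.0
```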

Results and discussion for Gurumukhi numerals:
The samples used in this research work for the Gurumukhi language are 16,200 for training and 4,901 for testing. The number of clusters formed using the proposed training algorithm for the Gurumukhi numerals 0 to 9 is listed in Table 4. The confusion matrix obtained using the proposed algorithm for the Gurumukhi numerals 0 to 9, the number of samples tested and the corresponding percentage of accuracy are listed in Table 5.

Results and discussion for Sindhi numerals:
The samples used in this research work for the Sindhi language are 16,200 for training and 5,374 for testing. The number of clusters formed using the proposed training algorithm for the Sindhi numerals 0 to 9 is listed in Table 6. The confusion matrix obtained using the proposed algorithm for the Sindhi numerals 0 to 9, the number of samples tested and the corresponding percentage of accuracy are listed in Table 7.

Results and discussion for Malayalam numerals:
The samples used in this research work for the Malayalam language are 14,400 for training and 6,474 for testing. The number of clusters formed using the proposed training algorithm for the Malayalam numerals 0 to 9 is listed in Table 8. The confusion matrix obtained using the proposed algorithm for the Malayalam numerals 0 to 9, the number of samples tested and the corresponding percentage of accuracy are listed in Table 9.

Results and discussion for Tamil numerals:
The samples used in this research work for the Tamil language are 10,800 for training and 9,106 for testing. The number of clusters formed using the proposed training algorithm for the Tamil numerals 0 to 9 is listed in Table 10. The confusion matrix obtained using the proposed algorithm for the Tamil numerals 0 to 9, the number of samples tested and the corresponding percentage of accuracy are listed in Table 11.
The proposed algorithm for multilingual recognition has been compared with the existing algorithms in the literature (Table 12). When compared to the existing algorithms, the proposed algorithm is found to give a better recognition rate for Malayalam. This is not the case for the other languages, for which the recognition rates of the proposed algorithm are lower. The reason is that the proposed algorithm has been designed to recognize multilingual numerals, whereas the existing algorithms have been designed to recognize numerals of a specific language only. Hence, as a first step, the objective of recognizing multilingual handwritten numerals has been achieved; in future, enhancements need to be made to improve the recognition accuracy of the proposed algorithm relative to the existing algorithms.

CONCLUSION
The objective of the proposed work is to recognize multilingual handwritten numerals. The languages used for experimentation are Kannada, Gurumukhi, Sindhi, Malayalam and Tamil, for which recognition accuracies of 94.80, 95.89, 96.44, 95.81 and 92.03%, respectively, have been obtained using the proposed algorithm.

LIMITATIONS
The handwritten numerals must be singly connected, which means that they should not be broken or fragmented. If a numeral is broken, the algorithm treats it as several numerals, one for each fragment. This is one of the most important limitations.
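Whether a numeral image violates this requirement can be checked by counting its connected fragments. The sketch below is one possible check, not part of the described algorithm; it assumes text pixels are stored as 1 and treats diagonally touching pixels as connected (8-connectivity), which is an assumption.

```python
from collections import deque


def count_fragments(image):
    """Count the 8-connected groups of text pixels (value 1) in a binary image."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    fragments = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and not seen[r][c]:
                # New fragment found: flood-fill it with a breadth-first search
                fragments += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and image[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
    return fragments


whole = [[1, 1, 0],
         [0, 1, 0],
         [0, 1, 1]]
broken = [[1, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 1, 1]]
print(count_fragments(whole))   # 1
print(count_fragments(broken))  # 2
```

An image for which the count exceeds one would be segmented into several numerals and is therefore outside the scope of the algorithm as described.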

SCOPE FOR FUTURE ENHANCEMENT
The zone size applied in the proposed representation scheme is 16-by-16. Various zone sizes could be tried in order to identify an optimum value. The single linkage clustering algorithm and the Euclidean distance metric have been applied for clustering the data, for which an average of 42% of the sample data are required as prototypes. This percentage could be reduced by using other clustering algorithms and distance metrics.
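The relationship between single linkage clustering and the fraction of samples kept as prototypes can be illustrated as follows. This is a sketch under assumptions: the clusters are obtained by cutting the single linkage hierarchy at a distance threshold (equivalent to grouping vectors connected by chains of neighbours no farther apart than the threshold), and the threshold value and function names are hypothetical. The text does not state how the hierarchy is cut.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def single_linkage_clusters(vectors, threshold):
    """Single linkage clusters obtained by cutting at `threshold`.

    Two vectors share a cluster when a chain of vectors, each at most
    `threshold` from the next, connects them (union-find implementation).
    """
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if euclidean(vectors[i], vectors[j]) <= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i, vector in enumerate(vectors):
        clusters.setdefault(find(i), []).append(vector)
    return list(clusters.values())


def cluster_prototypes(clusters):
    """Mean feature vector of each cluster, stored as the class prototype."""
    return [[sum(col) / len(cluster) for col in zip(*cluster)] for cluster in clusters]


samples = [[0, 0], [1, 0], [10, 10], [11, 10], [50, 50]]
clusters = single_linkage_clusters(samples, threshold=2)
protos = cluster_prototypes(clusters)
print(protos)                              # [[0.5, 0.0], [10.5, 10.0], [50.0, 50.0]]
print(100 * len(protos) / len(samples))    # 60.0 percent of samples kept as prototypes
```

A larger threshold merges more samples into each cluster and lowers the prototype percentage, which is the quantity the proposed enhancement aims to reduce.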

Fig. 1: Specially designed tabular form for data collection (Kannada numeral 5)

Fig. 2: Kannada binary image (numeral 5)
Fig. 4: Density matrix of the numeral in Fig. 3
Numeric value obtained using zd(x+1, y) - zd(x, y)

Training numerals algorithm:
Input : Numeral image for a language from the database
Output : Pattern vector stored in the library as a prototype
Step 1 : Enhance the binary numeral image
Step 2 : Resize the enhanced image based on the round off mean aspect ratio value of a language
Step 3 : Calculate the density matrix of the numeral based on the zones along the two axes
Step 4 : Extract the features based on partial derivatives
Step 5 : Repeat steps 1 to 4 to cluster the images using the single linkage algorithm and store the class prototype in the library
End of the algorithm

Testing numerals algorithm:
Input : Single numeral image for a language from the database
Output : Classification of the numeral based on the language
Step 1 : Enhance the binary numeral image
Step 2 : Resize the enhanced image based on the round off mean aspect ratio value of a language
Step 3 : Calculate the density matrix of the numeral based on the zones along the axes

Table 1: Representation of round off mean aspect ratio values in terms of zones for Kannada, Gurumukhi, Sindhi, Tamil and Malayalam

Table 2: Number of clusters for Kannada numerals

Table 4: Number of clusters for Gurumukhi numerals

Table 5: Recognition accuracy obtained for all numerals in Gurumukhi