A Study of Authorship Attribution in English and Tamil Emails

The aim of our study is to identify the authors of unknown emails written in Tamil and English. Recent approaches in authorship attribution show that, apart from lexical measures, some other features of written language are considerably effective as discriminators of author style. However, there have been no attempts to compare the attribution potential of these features. The aim of the present study, then, is to examine the effectiveness of several style markers in authorship attribution for two languages, English and Tamil. Equally important, we compare the usefulness of the chosen style markers across the two languages. The results show that high attribution effectiveness can be achieved in both languages.


INTRODUCTION
The study of identifying the owner (author) of a text, email, message or blog is called Authorship Attribution (AA). Currently, very few works on AA have been done for Tamil emails (Bagavandas et al., 2009) compared to English. Previous authorship studies use lexical, syntactic (Grieve, 2007; Luyckx and Daelemans, 2008), structural and content-specific features. Word-based features, including word length distribution, words per sentence and vocabulary richness, were successful in earlier authorship studies. Syntactic features, called style markers, consist of all-purpose function words.
The adoption of text classification techniques rooted in machine learning marked a pivotal turning point in authorship attribution studies. The use of such methods is straightforward: training texts are represented as labeled numerical vectors, and learning methods are used to find boundaries between classes (authors) that minimize some classification loss function. The nature of the boundaries depends on the learning method used. These methods facilitate the use of classes of boundaries that extend well beyond those implicit in methods that minimize distance. The earliest methods applied various types of neural networks using small sets of function words as features. Graham et al. (2005) used neural networks on a wide variety of features. Other studies used k-nearest neighbor (Zhao and Zobel, 2005), rule learners (Koppel and Schler, 2003; Abbasi and Chen, 2005) and Bayesian regression (Genkin et al., 2007; Madigan et al., 2005; Argamon et al., 2003). Support Vector Machine (SVM) learning has been found to be as suitable for text categorization as any other learning method, and the same has been found for authorship attribution (De Vel et al., 2001; Diederich et al., 2003). Winnow has also been used (Koppel et al., 2002; Argamon et al., 2009).
The studies since that of Mosteller and Wallace have shown the usefulness of function words for authorship attribution in different scenarios (Holmes et al., 2001a, b; Baayen et al., 2002; Binongo, 2003). Typical modern studies using function words in English use lists of a few hundred words, including pronouns, prepositions, auxiliary and modal verbs, conjunctions and determiners. Results of different studies using somewhat different lists of function words have been similar, indicating that the precise choice of function words is not crucial. For documents such as emails, formatting and structural features can be used for authorship attribution (Corney et al., 2002) (Fig. 1).

Materials:
Table 1 describes the sequence of operations of the system proposed in this study for email authorship categorization. The proposed system is a combination of the Fisher's linear discriminant (FLD) and radial basis function (RBF) algorithms.
Step 1: Emails are taken from the Enron database.
Step 2: Tokenize the information in the Enron emails. Create a dictionary of information. The template contains functional words such as prepositions, conjunctions, interjections, pronouns, verbs, adverbs and adjectives. This template is used to filter out irrelevant information that will not be used for authorship analysis.
Step 3: A signature for each email is created by extracting features based on lexical characters, lexical words and syntactic properties. The total number of features for each email signature is 322. The details of the features (Farkhund et al., 2008, 2010) are as follows:
• Lexical analysis based on characters
• Total characters per line (NC)
• Ratio of digits to total characters (RD_T_C)
• Ratio of letters to total characters (RL_T_C)
• Ratio of uppercase letters to total characters (RUCL_T_C)
• Ratio of spaces to total characters (RS_T_C)
Find the number of words and their numbers of occurrences (frequencies) in an email and in all the emails of the authors. Create a matrix with rows equal to the total number of unique words extracted from all emails of all authors. The number of columns is equal to the number of authors. Fill up the columns with the frequencies of words corresponding to the respective authors. Each column is treated as a signature, which is further transformed into a 2-dimensional pattern. Each pattern is labeled.
Step 4: The emails of each author are taken as a separate class. In this study, the emails of 100 authors are grouped into 100 classes. Fisher's linear discriminant method is used to create two projection vectors ϕ1 and ϕ2. These projection vectors transform the 322-dimensional signature into a 2-dimensional pattern. Fifty emails for each author have been considered and hence a total of 5000 (50 emails × 100 authors) signatures is obtained.
Step 5: A radial basis function network with 75 centers (any other value may be chosen) is trained on 20% of the emails of each author (10 emails × 100 authors = 1000 signatures) to obtain the final weights. Many neural networks are available; however, we preferred RBF as it learns nonlinear data effectively.
Step 6: The proposed system is tested using 80% of the 50 emails per author (40 emails × 100 authors = 4000 signatures).
Step 2 to step 4 are adopted to obtain the two-dimensional signatures of the testing emails. Each signature is processed with the final weights obtained in step 5. The output of the RBF is used for categorization of the authorship of an email.
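The word-frequency matrix described in step 3 can be sketched in Python as follows. This is a minimal illustration, not the implementation used in the study: the function name `word_frequency_matrix`, the toy corpus and the simple regular-expression tokenizer are all assumptions, and the tokenizer is only suitable for English text.

```python
import re
from collections import Counter


def word_frequency_matrix(emails_by_author):
    """Build a matrix with one row per unique word and one column per author.

    emails_by_author maps an author name to a list of email bodies.
    The tokenizer (lowercased runs of word characters) is an assumption.
    """
    authors = sorted(emails_by_author)
    counts = {author: Counter() for author in authors}
    for author in authors:
        for text in emails_by_author[author]:
            counts[author].update(re.findall(r"\w+", text.lower()))
    # Rows: every unique word found in any email of any author.
    vocab = sorted(set().union(*(c.keys() for c in counts.values())))
    # Columns: word frequencies for the respective authors.
    matrix = [[counts[author][word] for author in authors] for word in vocab]
    return vocab, authors, matrix


# Hypothetical toy corpus, not taken from the Enron data.
corpus = {
    "author_a": ["Please send the report today.", "The report is late."],
    "author_b": ["Thanks, I will send it."],
}
vocab, authors, matrix = word_frequency_matrix(corpus)
```

Each column of `matrix` then plays the role of one author's signature before dimensionality reduction.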

Methods:
Linear discriminant: Linear Discriminant Analysis (LDA) and the related Fisher's linear discriminant are methods used in statistics, pattern recognition and machine learning to find a linear combination of features which characterizes or separates two or more classes of objects or events. The resulting combination may be used as a linear classifier. This linear classification can be fine-tuned by applying a radial basis function to it. The mapping of the original vector 'X' onto a new vector 'Y' on a plane is done by a matrix transformation, which is given by:

$$Y = AX \tag{1}$$

where X is the signature vector and:

$$A = [\phi_1 \;\; \phi_2]^T \tag{2}$$

where ϕ1 is a projection vector (also called a discriminant vector) and ϕ2 is another projection vector. The 2-dimensional pattern obtained from the original 322-dimensional vector is denoted by 'y_i'. The vector 'y_i' is given by:

$$y_i = (u_i, v_i) = (X_i^T \phi_1, \; X_i^T \phi_2) \tag{3}$$

The vector set 'y_i' is obtained by projecting the original signatures 'X' of the 5000 signature patterns onto the space spanned by ϕ1 and ϕ2 by using Eq. (3).
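The projection of signatures onto two discriminant vectors can be sketched with NumPy as below. The study does not state how ϕ1 and ϕ2 are computed, so this sketch assumes the common multi-class formulation in which they are the two leading eigenvectors of the product of the pseudo-inverse of the within-class scatter matrix and the between-class scatter matrix. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def fisher_projection_vectors(X, labels, n_vectors=2):
    """Return a (d, n_vectors) matrix whose columns act as projection vectors.

    Assumption: the vectors are the leading eigenvectors of pinv(Sw) @ Sb.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        class_mean = Xc.mean(axis=0)
        centered = Xc - class_mean
        Sw += centered.T @ centered
        diff = (class_mean - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # The pseudo-inverse guards against a singular within-class scatter matrix.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-eigvals.real)[:n_vectors]
    return eigvecs[:, order].real


# Synthetic stand-in for signatures: 3 authors, 20 signatures each, 5 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(20, 5)) for m in (0.0, 2.0, 4.0)])
labels = np.repeat([0, 1, 2], 20)

Phi = fisher_projection_vectors(X, labels)  # columns stand in for phi_1, phi_2
Y = X @ Phi                                 # each row is one 2-D pattern (u_i, v_i)
```

With 322-dimensional signatures the same call would return a 322 × 2 matrix, and `X @ Phi` would produce the 2-dimensional patterns passed on to the classifier.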

Radial basis function:
The radial basis function network is a supervised neural network which uses a distance measure between the input pattern and the centers of the RBF nodes (Pandian and Sadiq, 2011). The summed distance is passed through an exponential activation function. This forms the outputs of the hidden nodes in the RBF network. A bias value is appended to the outputs of the nodes in the hidden layer. The outputs of the hidden layer are processed with the assigned labeled values (targets) to obtain the final weights, which are used for testing.
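A minimal NumPy sketch of such a network is given below. Several details are assumptions, because the text does not specify them: a Gaussian form for the exponential activation, a fixed width of 1.0, centers picked from the training patterns, one-hot targets, and final weights obtained with a pseudo-inverse (least squares). All function names are hypothetical.

```python
import numpy as np


def rbf_hidden(X, centers, width=1.0):
    """Hidden-layer outputs: exponential of the summed squared distance, plus a bias."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-d2 / (2.0 * width ** 2))
    return np.hstack([H, np.ones((X.shape[0], 1))])  # append the bias column


def train_rbf(X, targets, centers, width=1.0):
    """Final weights by least squares (pseudo-inverse); an assumed training rule."""
    return np.linalg.pinv(rbf_hidden(X, centers, width)) @ targets


def predict_rbf(X, centers, weights, width=1.0):
    return rbf_hidden(X, centers, width) @ weights


# Synthetic 2-D patterns for 3 authors, standing in for projected signatures.
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = np.vstack([rng.normal(m, 0.3, size=(15, 2)) for m in means])
labels = np.repeat([0, 1, 2], 15)
T = np.eye(3)[labels]   # one-hot targets, one column per author
centers = X[::8]        # 6 centers, two per author (an arbitrary choice)

W = train_rbf(X, T, centers)
pred = predict_rbf(X, centers, W).argmax(axis=1)
accuracy = (pred == labels).mean()
```

The number of centers plays the role of the 75 centers mentioned in step 5; the study notes that any other value could be used.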

RESULTS AND DISCUSSION FOR ENGLISH EMAILS
The plots in Fig. 2 to 4 define the characteristics of the emails of 100 authors based on the information mentioned in step 3. An email can be categorized to an author by averaging the signatures of the emails, as shown in Fig. 2. The brown plot shows the difference between successive authors. The average difference is 0.3511, which indicates that the authors can be categorized. Figure 3 presents the intersections of the ϕ1 and ϕ2 projection vectors. In Fig. 3, the signatures of 100 authors are projected into 2 dimensions using the ϕ1 and ϕ2 vectors. In this plot, very few authors' signatures overlap and the remaining authors' signatures are distinctly visible. In order to overcome the overlapping, RBF is used for correct categorization.
The RBF network is trained with the projected signature patterns along with their labels. A final weight matrix is obtained, which is then used to test the untrained emails. The output of the RBF either assigns the email to an author in the trained authors database or, otherwise, the email is attributed to some other author outside the database.
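The decision between a known author and an author outside the database could be sketched as below. The text does not say how that decision is made, so the rule used here (take the largest network output and reject it when it falls below a threshold) and the threshold value of 0.5 are purely illustrative assumptions, as is the function name.

```python
import numpy as np


def categorize(outputs, author_ids, threshold=0.5):
    """Map one vector of network outputs to an author, or to None.

    None stands for "some other author outside the database".
    The threshold rule is an assumption, not taken from the study.
    """
    outputs = np.asarray(outputs, dtype=float)
    best = int(outputs.argmax())
    if outputs[best] >= threshold:
        return author_ids[best]
    return None


known_authors = ["author_a", "author_b", "author_c"]   # hypothetical labels
confident = categorize([0.10, 0.90, 0.05], known_authors)
uncertain = categorize([0.20, 0.30, 0.10], known_authors)
```

Here `confident` is assigned to the second author, while `uncertain` is left unassigned because no output reaches the threshold.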

Problem statement and objectives for Tamil emails:
The problem addressed in this study is as follows:
• A suspicious Tamil email is under consideration. The Writing Style (WS) in this suspicious email may be the WS of one author or of more than one author. The number of suspects can be (S 1 , S 2 , …S n ).
• The WS 1-N is available in the database repository (R).

Initial approach:
Extract the WS of N authors using lexical, syntactic methods.
Cluster the WS of emails of each author and check for separability among the authors.
To enhance the identification of the anonymous author of the suspected email, reduce the higher-dimensional WS signatures of the authors to 2 dimensions. Preposition-1, preposition-2, preposition-3, preposition-4, adjectives, adverbs and conjunctions have their standard meanings.

Subsequently, use
The total number of words used as the basic dictionary is 1571 (work+action+prepositions+adjectives+adverbs+conjunctions). The numbers mentioned in parentheses are the totals in each category, whereas only a few words are given in Tables 2 to 4.
The sequence of operations of the proposed system in this study for Tamil email authorship association is as follows:
Step 1: Tamil emails of 50 authors are considered. Ten emails of each author are considered.
Step 3: Create a signature for each Tamil email by extracting features based on lexical characters, lexical words and syntactic properties. The total number of features for each email signature is 322. The details of the features (Farkhund et al., 2008; 2010) are given in Table 5. Obtain the number of words and the number of occurrences (frequencies) of information in the emails.
Create a matrix with rows equal to the total number of unique words extracted from all emails of all authors. The number of columns is equal to the number of authors.
Fill up the columns with the frequencies of words corresponding to the respective authors.
Treat each column as a signature. Label each 2-dimensional pattern.
Step 4: Take the emails of each author as a separate class. In this study, we group the emails of 50 authors into 50 classes. Create two projection vectors ϕ1 and ϕ2 using Fisher's linear discriminant method. These projection vectors transform the 322-dimensional signature into a 2-dimensional pattern.

RESULTS AND DISCUSSION FOR TAMIL EMAILS
The implementation of FLD, RBF and ESNN is done using Matlab 10. The plots in Fig. 2 to 8 define … Figure 9 presents the plot of the ϕ1 and ϕ2 projection vectors obtained using Eq. (2). Figure 10 presents the plots of (u, v) using Eq. (3) for 50 authors (each 2 emails). There is overlap at many points, as shown in Fig. 10.

CONCLUSION
In the first part of this study, because the signatures of a few authors overlap (Fig. 3), RBF has been used. The advantages of the proposed system are as follows:
• The size of the 322-dimensional signature pattern is reduced to 2 dimensions.
• The training of the RBF is faster, with less computational complexity.
• The size of the RBF topology is reduced from 322 to 2 in the input layer.
• Since the activation function used in RBF is nonlinear, the overlapping problem is solved.
The second part of this study presents Tamil email AA using FLD with RBF and FLD with ESNN. As there is overlapping of a few authors (Fig. 10), there is still some mismatching of results.
In this study, we have used RBF and FLD for both English and Tamil emails. In addition, ESNN was used for Tamil emails. In the future, the same methods can be tried on other languages that have valuable ancient texts.
In the future, the method described above can also be applied to a single bilingual document.

Fig. 1: Results of different studies of using function words in English

• Occurrences of alphabets to total characters (OA_T_C)
• Occurrences of special characters: < > j { } (OSC_T)
• Lexical word based analysis
• Number of Words (NW)
• Sentence length in terms of characters per line (SL)
• Average token length (ATL)
• Ratio of short words (1 to 3 characters) to T (RSWT)
• Ratio of word length frequency distribution of T (20 features) (RWLF)
• Average sentence length in terms of characters (ASLC)
• Ratio of characters in words to N (RCW)
• A word which occurs only once in the email document (SWO)
• A word which occurs only twice in the email document (TWO)
• Syntactic features
• Occurrences of punctuations (OP)
• Occurrences of function words (OFW)
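To illustrate how a few of the listed features could be computed, the following Python sketch covers some of the character-based and word-based features. It is not the feature extractor used in the study: the function names are hypothetical, T is assumed to mean the total number of words, and words are obtained by simple whitespace splitting, so punctuation stays attached to tokens.

```python
from collections import Counter


def character_features(line):
    """Character-based features of one line; ratios are relative to its length."""
    n = len(line)
    if n == 0:
        return {"NC": 0, "RD_T_C": 0.0, "RL_T_C": 0.0, "RUCL_T_C": 0.0, "RS_T_C": 0.0}
    return {
        "NC": n,                                            # total characters
        "RD_T_C": sum(ch.isdigit() for ch in line) / n,     # digits
        "RL_T_C": sum(ch.isalpha() for ch in line) / n,     # letters
        "RUCL_T_C": sum(ch.isupper() for ch in line) / n,   # uppercase letters
        "RS_T_C": line.count(" ") / n,                      # spaces
    }


def word_features(text):
    """Word-based features; T is assumed to be the total number of words."""
    words = text.lower().split()
    if not words:
        return {"NW": 0, "ATL": 0.0, "RSWT": 0.0, "SWO": 0, "TWO": 0}
    counts = Counter(words)
    total = len(words)
    return {
        "NW": total,                                              # number of words
        "ATL": sum(len(w) for w in words) / total,                # average token length
        "RSWT": sum(1 <= len(w) <= 3 for w in words) / total,     # short words
        "SWO": sum(1 for c in counts.values() if c == 1),         # words occurring once
        "TWO": sum(1 for c in counts.values() if c == 2),         # words occurring twice
    }


char_feats = character_features("Call me at 5 PM")
word_feats = word_features("the cat saw the dog and the bird")
```

Concatenating such values over all feature groups would give one signature vector per email; the remaining features in the list would be added in the same way.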

Fig. 2: Average frequency of all features

Ratio of short words (1 to 3 characters) to T    RSWT
Ratio of word length frequency distribution of T (20 features)    RWLF
Average sentence length in terms of characters    ASLC
Ratio of characters in words to N    RCW
A word which occurs only once in the email document    SWO
A word which occurs only twice in the email document    TWO
Syntactic features
Occurrences of punctuations    OP
Occurrences of function words    OFW

With each signature reduced to a 2-dimensional pattern, we consider ten emails for each author and hence obtain a total of 500 (10 Tamil emails × 50 authors) signatures.
Step 5: The radial basis function network is trained separately with 75 centers (any other value may be chosen) in the hidden layer. Similarly, the ESNN is trained separately with 21 reservoirs in the hidden layer. In each case, 20% of the emails are used (2 emails × 50 authors = 100 signatures) to get the final weights.
Step 6: RBF and ESNN are tested separately. Eighty percent of the 10 emails per author (8 emails × 50 authors = 400 signatures) are used. Adopt step 2 to step 4 to obtain the two-dimensional signatures of the testing emails. Process each signature with the final weights obtained in step 5. Use the outputs of the RBF/ESNN for AA.

Methods:
Echo State Neural Network (ESNN): The echo state neural network is a recurrent network (Jaeger, 2001a, b; Purushothaman and Suganthi, 2008). The echo state condition is that the spectral radius (the largest among the absolute values of the eigenvalues of a matrix, denoted by || ||) of the reservoir's weight matrix is less than one (||W|| < 1). This condition states that the input controls the dynamics of the ESNN and the effect of the initial states vanishes. The current design of ESNN parameters relies on the selection of the spectral radius. There are many possible weight matrices with the same spectral radius. They do not perform at the same level of mean square error (MSE) for functional approximation.
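The spectral-radius condition can be made concrete with a small NumPy sketch. It is only an illustration under stated assumptions: the reservoir weights are random and rescaled to a chosen spectral radius of 0.9, the state update uses tanh, and the readout weights are fitted by ridge regression. None of these choices is specified in the study, and all names and the toy data are hypothetical.

```python
import numpy as np


def make_reservoir(n_reservoir, spectral_radius=0.9, seed=0):
    """Random reservoir matrix rescaled so its spectral radius is below 1."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_reservoir, n_reservoir))
    rho = np.max(np.abs(np.linalg.eigvals(W)))  # largest absolute eigenvalue
    return W * (spectral_radius / rho)


def reservoir_states(inputs, W, W_in):
    """Drive the reservoir with a sequence of input patterns and collect its states."""
    x = np.zeros(W.shape[0])
    states = []
    for u in inputs:
        x = np.tanh(W_in @ u + W @ x)  # assumed state update
        states.append(x)
    return np.array(states)


def train_readout(states, targets, ridge=1e-6):
    """Readout (final) weights by ridge regression; an assumed training rule."""
    S = states
    return np.linalg.solve(S.T @ S + ridge * np.eye(S.shape[1]), S.T @ targets)


rng = np.random.default_rng(0)
n_reservoir = 21                                # reservoir size mentioned in step 5
W = make_reservoir(n_reservoir, spectral_radius=0.9)
W_in = rng.uniform(-0.5, 0.5, size=(n_reservoir, 2))
patterns = rng.normal(size=(10, 2))             # stand-in 2-D signatures
targets = np.eye(2)[np.arange(10) % 2]          # stand-in one-hot author labels

states = reservoir_states(patterns, W, W_in)
W_out = train_readout(states, targets)
outputs = states @ W_out
```

Two reservoirs generated with different seeds would share the same spectral radius yet generally give different errors, which matches the remark that matrices with the same spectral radius do not perform at the same level of MSE.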

Fig. 5: Pie chart for distribution of Tamil letters

Table 1: Steps of the proposed system

Tables 2 to 4 present the words used for filtering the Tamil emails and analyzing them for unique information. Work words indicate how an author writes an email and what clarity is present in the email. The number of work words indicates performance task requirements in an unambiguous manner. Action words indicate some actions present in the email.

Table 5: Features used in this study