A Grammatical Evolution Approach for Content Extraction of Electronic Commerce Website

Web content extraction, the problem of identifying and extracting interesting information from Web pages, plays an important role in integrating data from different sources for advanced information-based services. In this paper, an approach for extracting electronic commerce information from Web pages without any given template is investigated using the Grammatical Evolution (GE) method. Although much research has used the Xpath technique to extract the content of Web pages, the complexity of the Xpath grammar makes it too difficult for evolutionary tools to process automatically. Hence, a reduced language integrating Xpath and DOM techniques is given in BNF grammar form, which GE uses to generate the parsing solution. Moreover, a fitness evaluation method is proposed based on the fuzzy membership of the two parts of the chromosome. Finally, empirical results on several real Web pages show that the proposed technique can segment data records and extract data from them accurately, automatically and flexibly.


INTRODUCTION
Web content extraction is the problem of identifying and extracting interesting information from Web pages. It plays an important role in integrating data from different sources for advanced information-based services, such as customizable Web information gathering, comparative shopping, meta-search and so forth.
To solve this problem, approaches ranging from heuristics and meta-heuristics to methods based on data mining, statistics or ontologies have been used. Among them, Ziegler and Skubacz (2007) proposed an approach to extract real content from Web news pages using a particle swarm optimization algorithm. Chang et al. (2003) introduced a system called IEPAD that discovers extraction patterns from Web pages without user-labeled examples, using several pattern discovery techniques including PAT-trees, multiple string alignments and pattern matching algorithms. Qiu and Yang (2010) presented a set of novel techniques based on a page similarity measure, page clustering and wrapper generation to automatically extract data from E-Commerce Web sites. In Reis et al. (2004), traditional hierarchical clustering techniques are used to extract the desired news from Web news sites. McKeown et al. (2003) provided an article extraction module using the machine learning program Ripper.
In this study, our interest is in approaches and techniques for extracting electronic commerce information from Web pages automatically, rather than with a fixed template as in Zhai and Liu (2007). A typical example is the product list and description pages of electronic commerce Websites, as shown in Fig. 1a and b. Intuitively, these two kinds of page differ slightly in both layout and content. The former, called list pages, enumerate the list items, i.e., a number of goods with summary introductions; the latter, called detail pages, include an elaborate description of one product, such as the product name, image, description and price, for comparative e-shopping.
The main ideas of our attempt can be summarized as follows:
Adopting the GE algorithm to obtain the expression automatically. This meta-heuristic method generates the solution expression in a BNF grammatical search space rather than probing binary strings arbitrarily.
Reducing the grammar of Xpath and combining DOM operations with it. In this study we introduce a new language, Xpath-DOM, to locate the elements and attributes of the pages.
Using a two-segment chromosome representation, with an Xpath part and a DOM part, to characterize the solution and help calculate the fitness value.

GRAMMATICAL EVOLUTION
Grammatical Evolution (GE) (O'Neill and Ryan, 2001), a variant of Genetic Programming (GP) (Koza, 1992), is an evolutionary algorithm for automatic programming that combines a context-free grammar with a mapping from genotypes to phenotypes. The genotype selects production rules of a context-free grammar in Backus-Naur form and thereby creates a phenotype. Mathematically, the grammar G is a formal grammar in which all production rules have the form $V \to w$, where V is a nonterminal symbol and w is a sequence of terminal and nonterminal symbols. A context-free grammar can be represented by the quad-tuple $G = (V_T, V_N, P, S)$, where $V_T$ is a finite set of terminal symbols, $V_N$ is a finite set of nonterminal symbols, P denotes a set of production rules and S is the nonterminal symbol used as the start symbol.
The GE algorithm gradually replaces each nonterminal symbol with the right-hand side of a selected production rule, starting from the start symbol S. The rule is selected by the following mapping, Eq. (1):

$$\mathrm{Rule} = B \bmod R_N \qquad (1)$$

where B is a gene and $R_N$ is the number of rules for the specific nonterminal symbol. This symbol replacement process is repeated until no nonterminal symbols remain or the end of the chromosome is reached. If no valid expression has been produced by the end of the chromosome, the algorithm either repeats from the start of the chromosome (called the wrapping operation) or terminates the mapping and assigns a small fitness value to the chromosome. Owing to its generality, simplicity and efficiency, GE has been used successfully in many fields, such as symbolic regression (O'Neill and Ryan, 2001), the Santa Fe Ant Trail (O'Neill and Ryan, 2003), discovery of trigonometric identities (Ryan et al., 1998), robot control (Collins and Ryan, 2000) and financial prediction (Brabazon and O'Neill, 2003).
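The mapping of Eq. (1) together with the wrapping operation can be illustrated with a minimal sketch. It assumes the usual GE convention that the rule index is the gene value modulo the number of rules for the leftmost nonterminal. The toy grammar, the function name `map_genotype` and the `max_wraps` parameter are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional

# Hypothetical toy grammar: each nonterminal maps to its alternative productions.
GRAMMAR: Dict[str, List[List[str]]] = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["x"], ["y"]],
}


def map_genotype(genes: List[int], start: str = "<expr>",
                 max_wraps: int = 2) -> Optional[str]:
    """Expand the leftmost nonterminal with rule (gene mod number_of_rules).

    Returns the phenotype string, or None when the genes (including up to
    max_wraps re-reads of the chromosome) run out before the expression
    is complete.
    """
    if not genes:
        return None
    symbols = [start]
    used = 0                                  # number of genes consumed so far
    budget = len(genes) * (max_wraps + 1)     # reads allowed, wrapping included
    while True:
        # Position of the leftmost nonterminal, or None if the phenotype is done.
        index = next((i for i, s in enumerate(symbols) if s in GRAMMAR), None)
        if index is None:
            return " ".join(symbols)
        if used >= budget:
            return None                       # invalid individual
        rules = GRAMMAR[symbols[index]]
        gene = genes[used % len(genes)]       # wrapping: restart at gene 0
        symbols[index:index + 1] = rules[gene % len(rules)]
        used += 1


print(map_genotype([0, 1, 0, 1, 1, 1]))       # x * y
print(map_genotype([0, 1, 1]))                # y + y (needs one wrap)
print(map_genotype([0]))                      # None (never terminates)
```

Returning `None` stands in for the second option described above, where an unfinished mapping is abandoned and the caller assigns a small fitness value.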

THE PROPOSED APPROACH
In this section, an approach for extracting electronic commerce information from Web pages without any given template is investigated using the Grammatical Evolution (GE) method.
Termination testing: In this step, a termination condition is tested: either the maximum number of generations has been reached, or the fitness value of the best chromosome reaches or exceeds a predefined threshold. If the condition holds, the process terminates; otherwise a new generation of chromosomes is formed.

Xpath-DOM language definition in BNF:
Although much research has used the Xpath technique to extract the content of Web pages, the complexity of the Xpath grammar makes it difficult to process automatically. Hence, a simpler language is needed. Moreover, there is a "last mile" problem: even when the Xpath expression locates the probable position, the final content still cannot be extracted. In this case, the DOM tree structure can be used to simplify the whole Xpath expression. Therefore, in this paper, we propose a simple language that integrates Xpath and DOM, called the Xpath-DOM Language. First, we review the standard for grammar specification, i.e., Backus Naur Form (BNF).
Definition 1: Backus Naur Form (BNF): BNF is a notation for expressing a language grammar as Production Rules (PRs). A BNF grammar consists of the tuple <T, N, P, S>, where T is the set of terminals, N is the set of nonterminals, P is the set of PRs and S is the start symbol.
Definition 2: Xpath-DOM: The Xpath-DOM Language is the language whose PRs are defined by a Context Free Grammar (CFG) in BNF, as shown in Fig. 2. Now consider the example shown in Fig. 1. We can examine the HTML source with the help of a browser, as in Fig. 3, and obtain the DOM tree in the form of Fig. 4, in which green, blue and red nodes denote elements, text and attributes, respectively. According to the grammatical production rules in Fig. 2, we can write a two-part Xpath-DOM expression, with the Xpath part shown in an orange region and the DOM part in a green one. For instance, to get the entrance URL of the reviews, an expression of "/HTML/BODY/DIV [@ class = 'wmain'] //UL [@ class = 'list- ref).value" should be provided.
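The two-part idea (an Xpath part that locates nodes, followed by a DOM part that reads a value from each located node) can be sketched with Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The page fragment, the path, the `evaluate` helper and the `"text"` / `"@name"` convention for the DOM part are all assumptions made for illustration. They do not reproduce the grammar of Fig. 2, and the sketch presumes the page has already been converted to well-formed XML.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional

# Hypothetical page fragment, already converted to well-formed XML.
PAGE = """
<HTML><BODY>
  <DIV class="wmain">
    <UL class="list">
      <LI><A href="/reviews/1">Reviews</A></LI>
      <LI><A href="/reviews/2">More reviews</A></LI>
    </UL>
  </DIV>
  <DIV class="footer"><A href="/about">About</A></DIV>
</BODY></HTML>
"""


def evaluate(root: ET.Element, xpath_part: str,
             dom_part: str) -> List[Optional[str]]:
    """Locate nodes with the Xpath part, then apply the DOM part to each.

    dom_part is "text" for the node's text, or "@name" for an attribute
    (an illustrative convention, not the paper's syntax).
    """
    nodes = root.findall(xpath_part)
    if dom_part == "text":
        return [(node.text or "").strip() for node in nodes]
    if dom_part.startswith("@"):
        return [node.get(dom_part[1:]) for node in nodes]
    raise ValueError(f"unsupported DOM part: {dom_part!r}")


root = ET.fromstring(PAGE.strip())
xpath_part = "./BODY/DIV[@class='wmain']//UL[@class='list']/LI/A"
print(evaluate(root, xpath_part, "@href"))   # ['/reviews/1', '/reviews/2']
print(evaluate(root, xpath_part, "text"))    # ['Reviews', 'More reviews']
```

The path is written relative to the root element because `ElementTree` does not accept absolute paths on an element; a full XPath engine would allow the `/HTML/BODY/...` form used in the text.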

Chromosome foundation and transformation:
The proposed algorithm uses fixed-length chromosomes rather than variable-length ones. This restriction limits the creation of very large expressions and also reduces the search space. Formally, a chromosome C can be represented in binary and consists of a set of genes:

$$C = \{g_1, g_2, \ldots, g_l\}, \quad g_i \in [0, 255] \qquad (2)$$

where l is the fixed length of the chromosome and the upper bound 255 means that the number of PRs for each nonterminal symbol cannot exceed it.
Fig. 5: Process of fitness value evaluation
The genetic operators of crossover and mutation are applied to the population to form the next generation of chromosomes.
Crossover operation: In the crossover procedure, a number of new chromosomes are created to replace the chromosomes with the lowest fitness values in the current generation. The crossover probability in the present implementation is set to 0.95. A pair of parent chromosomes, selected from the current pool, is cut at a randomly chosen point and the right-hand sub-chromosomes are exchanged. The parents are selected by tournament selection: first, a group of K>2 randomly chosen chromosomes is formed; then the individual with the best fitness in the group is selected and the others are discarded.
Mutation operation: In this step, a random number in the interval (0, 1) is drawn for each unit (gene) of a chromosome. If this number is less than or equal to the mutation rate, the unit is replaced by a random value in the range (0, 255); otherwise it is left intact.
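The three operators described above can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation: the function names are hypothetical, and a seeded `random.Random` is passed in only to make the example reproducible.

```python
import random
from typing import Callable, List, Tuple

Chromosome = List[int]  # fixed-length list of genes, each in 0..255


def tournament_select(population: List[Chromosome],
                      fitness: Callable[[Chromosome], float],
                      k: int, rng: random.Random) -> Chromosome:
    """Form a random group of k chromosomes and keep the fittest one."""
    group = rng.sample(population, k)
    return max(group, key=fitness)


def one_point_crossover(a: Chromosome, b: Chromosome,
                        rng: random.Random) -> Tuple[Chromosome, Chromosome]:
    """Cut both parents at one random point and swap the right-hand parts."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome: Chromosome, rate: float,
           rng: random.Random) -> Chromosome:
    """Replace each gene by a random value in 0..255 with probability `rate`."""
    return [rng.randint(0, 255) if rng.random() <= rate else gene
            for gene in chromosome]


rng = random.Random(42)
parent_a, parent_b = [0] * 8, [255] * 8
child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
print(child_a, child_b)
print(mutate(child_a, 0.1, rng))
```

Because the chromosomes have a fixed length, one-point crossover preserves that length for both children.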

Fitness value evaluation:
The chromosome is split into two parts, the Xpath part and the DOM part, which are used to construct the respective features by the mapping processes shown in Fig. 5. According to these two features, the fitness function is defined by Eq. (3), in which the match() function calculates the fuzzy membership value of the parsed text in a tree node with respect to the object text, which is often held in a known database, and the Redundance_Xpath term computes the redundancy of the tag soup, which is defined in the next section. In the ideal case, according to Eq. (3), if the Xpath part and the DOM part locate the objective tag accurately, the fitness value is 1; if the noise information is much larger than the useful information, the fitness approaches 0.
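As a rough illustration of the match() idea only, the sketch below scores a parsed text against a list of known object texts with a string-similarity ratio in [0, 1]. The use of `difflib.SequenceMatcher` as the fuzzy membership, the normalization step and the sample product names are assumptions; the paper's actual membership function and the redundancy term of Eq. (3) are not reproduced here.

```python
from difflib import SequenceMatcher
from typing import Iterable


def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def match(parsed_text: str, known_texts: Iterable[str]) -> float:
    """Stand-in fuzzy membership: best similarity ratio against known texts.

    Returns 1.0 for an exact match (after normalization) and 0.0 when the
    parsed text is empty or nothing is known.
    """
    parsed = normalize(parsed_text)
    if not parsed:
        return 0.0
    return max(
        (SequenceMatcher(None, parsed, normalize(known)).ratio()
         for known in known_texts),
        default=0.0,
    )


known = ["Canon EOS 600D", "Nikon D90"]       # hypothetical database entries
print(match("  canon   eos 600d ", known))    # 1.0
print(match("Canon EOS", known))              # partial match, between 0 and 1
```

A score of this kind behaves as the text describes at the extremes: it is 1 when the located node holds exactly the object text and falls toward 0 as the extracted text diverges from it.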

EXPERIMENTAL EVALUATION
To investigate the performance of our approach, we tested the algorithm on several real electronic commerce Websites, ranging from comprehensive shopping centers to travel agency sites. To retrieve the content pages, we implemented a theme crawler for these sites in Java, which uses the Future pattern, a thread pool and a blocking queue to ensure concurrency and efficiency. Next, we converted the fetched pages into XML files using the HTML parser of [...]
[...] by several researchers for a long time, existing techniques are either inaccurate or not automatic. Our method neither makes any assumption about the structure of the pages nor needs a template, which is difficult to provide. A reduced language integrating Xpath and DOM techniques is given in BNF grammar form, which the Grammatical Evolution (GE) approach uses to generate the parsing solution. Empirical results on several real Web pages show that the proposed technique can segment data records and extract data from them accurately, automatically and flexibly.
Fig. 1: An example of list page and detail page in electronic commerce website
Overall phases: The schema and main processing steps of GE are the following, which are also illustrated in detail in the next sub-sections:
Initialization: This step includes the set-up of the population, coefficients and relevant parameters.
Definition of the evolutionary grammar: In this step, a context-free grammar describing all the possible expressions over the original set of features of both Xpath and DOM is created. A Context Free Grammar (CFG) of the Xpath-DOM Language is given in the next subsection.
Chromosome make-up: Every part of each chromosome in the genetic pool is generated randomly within an integer interval.
Fitness evaluation: Each chromosome g is evaluated in two parts, the Xpath part and the DOM part, which are related to the Xpath feature and the DOM feature, respectively. The fitness function then calculates a value that considers the impact of both.
Chromosome transformation: In this phase, genetic operators such as crossover and mutation are applied to produce the next generation of chromosomes.
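The phases listed above, together with a termination test on the generation count and the best fitness value, can be sketched as a single loop. This is a minimal sketch under stated assumptions: the `evolve` function, its parameter names and defaults (other than the 0.95 crossover probability), the choice of replacing the `n_children` least fit chromosomes each generation, and the toy fitness function are all hypothetical. A real run would map each chromosome to an Xpath-DOM expression and score the extracted text instead.

```python
import random
from typing import Callable, List, Tuple

Chromosome = List[int]


def evolve(fitness: Callable[[Chromosome], float], length: int = 8,
           pop_size: int = 20, n_children: int = 10,
           max_generations: int = 50, threshold: float = 1.0,
           crossover_rate: float = 0.95, mutation_rate: float = 0.05,
           k: int = 3, seed: int = 0) -> Tuple[Chromosome, float]:
    rng = random.Random(seed)
    # Initialization and chromosome make-up: random genes in 0..255.
    pop = [[rng.randint(0, 255) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(max_generations):
        # Fitness evaluation: order the pool from best to worst.
        pop.sort(key=fitness, reverse=True)
        # Termination testing: stop once the best fitness reaches the threshold.
        if fitness(pop[0]) >= threshold:
            break
        # Chromosome transformation: tournament selection, one-point
        # crossover and per-gene mutation.
        children = []
        while len(children) < n_children:
            p1 = max(rng.sample(pop, k), key=fitness)
            p2 = max(rng.sample(pop, k), key=fitness)
            if rng.random() <= crossover_rate:
                cut = rng.randint(1, length - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = list(p1)
            child = [rng.randint(0, 255) if rng.random() <= mutation_rate
                     else gene for gene in child]
            children.append(child)
        # The new chromosomes replace the least fit ones (assumed reading).
        pop = pop[:pop_size - n_children] + children
    best = max(pop, key=fitness)
    return best, fitness(best)


def toy_fitness(chromosome: Chromosome) -> float:
    """Hypothetical stand-in fitness: fraction of even genes, in [0, 1]."""
    return sum(gene % 2 == 0 for gene in chromosome) / len(chromosome)


best, score = evolve(toy_fitness)
print(best, score)
```

Because the fittest chromosomes survive each replacement step, the best fitness in this sketch never decreases from one generation to the next.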