
Study On Information Extraction Technology In Web Pages Of Review

Posted on: 2012-09-04
Degree: Master
Type: Thesis
Country: China
Candidate: Y J Yang
Full Text: PDF
GTID: 2178330332975995
Subject: Computer application technology
Abstract/Summary:
Product review pages have become a key factor in consumers' purchase decisions in e-commerce. As e-commerce has grown in recent years, the volume of product reviews has increased so sharply that potential buyers find it difficult to locate useful information in them, which greatly limits the practical value of reviews. This thesis analyzes the system architectures and algorithms of existing web page extraction approaches and, drawing on information extraction techniques, studies how to maintain high extraction accuracy and efficiency on large-scale collections of review pages.
The thesis analyzes typical information extraction systems and algorithms, identifies their advantages and disadvantages, and proposes a weight-based information extraction algorithm. The algorithm converts each crawled web page into a tag tree and preprocesses the tree, for example by removing noise. It then assigns a weight to each node from the bottom up, so that nodes with different tags at different levels receive different weights. Data regions are identified from similar subtrees in consecutive positions. The tag trees that contain data records are aligned to generate a base tree, which serves as the extraction template. Finally, all data records are aligned against the template and extracted. The algorithm is highly adaptable: it handles reviews with different structures by generating different extraction templates, and it requires little human intervention. The results show that the algorithm extracts review information effectively.
Based on this algorithm, the thesis designs a web page extraction system. Its functions include transforming a single web page into a tag tree, identifying data records in the tag tree, aligning the data records to generate a template, labeling the attributes in the template, and using the template to extract information from a series of web pages.
The system was compared with systems based on other algorithms. The results show that it achieves high accuracy with a high degree of automation and little human intervention, and that its running time is far superior to that of the other systems.
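The first stages of the pipeline described in the abstract (tag tree construction, noise removal, bottom-up weighting, and data region detection) can be illustrated with a minimal Python sketch. The abstract does not give the weight formula, the similarity test, or the noise rules, so everything specific here is an assumption made for illustration: the names `TagTreeBuilder`, `assign_weights`, and `find_data_regions`, the `TAG_WEIGHTS` table, the `DECAY` factor, and the 10% tolerance are all hypothetical and are not taken from the thesis. The sketch stops before template generation and record alignment.

```python
from html.parser import HTMLParser

# All constants below are illustrative assumptions, not values from the thesis.
NOISE_TAGS = {"script", "style", "noscript", "iframe"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
TAG_WEIGHTS = {"div": 2.0, "li": 2.0, "tr": 2.0, "p": 1.5, "span": 1.0}
DECAY = 0.5  # how strongly children contribute to their parent's weight


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.weight = 0.0


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from HTML, ignoring text and skipping noise subtrees."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0  # greater than 0 while inside a noise subtree

    def handle_starttag(self, tag, attrs):
        if self.skip_depth or tag in NOISE_TAGS:
            if tag not in VOID_TAGS:
                self.skip_depth += 1
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        # Mismatched end tags are simply ignored in this sketch.
        if self.current.parent is not None and self.current.tag == tag:
            self.current = self.current.parent


def assign_weights(node):
    """Assigns weights bottom-up: own tag weight plus a decayed sum of child weights."""
    child_sum = sum(assign_weights(child) for child in node.children)
    node.weight = TAG_WEIGHTS.get(node.tag, 1.0) + DECAY * child_sum
    return node.weight


def similar(a, b, tolerance):
    """Two subtrees count as similar if they share a tag and have close weights."""
    if a.tag != b.tag or not a.children or not b.children:
        return False
    return abs(a.weight - b.weight) <= tolerance * max(a.weight, b.weight)


def find_data_regions(node, tolerance=0.1, min_records=2):
    """Returns runs of consecutive, similar sibling subtrees found anywhere in the tree."""
    regions, run = [], []
    for child in node.children:
        if run and similar(run[-1], child, tolerance):
            run.append(child)
        else:
            if len(run) >= min_records:
                regions.append(run)
            run = [child]
    if len(run) >= min_records:
        regions.append(run)
    for child in node.children:
        regions.extend(find_data_regions(child, tolerance, min_records))
    return regions


def extract_regions(html):
    """Runs the sketch end to end on one page and returns the candidate data regions."""
    builder = TagTreeBuilder()
    builder.feed(html)
    assign_weights(builder.root)
    return find_data_regions(builder.root)
```

Calling `extract_regions` on a page whose reviews are rendered as repeated sibling `div` blocks would return those blocks as one candidate data region. Each node in a region is a candidate data record, which a later alignment step could merge into a base tree used as the extraction template.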
Keywords/Search Tags: product reviews, weight, tag-tree, information extraction