Research On WEB Information Extraction Based On The EM Algorithm And DOM Tree

Posted on: 2014-01-31
Degree: Master
Type: Thesis
Country: China
Candidate: S S Qiao
Full Text: PDF
GTID: 2248330398952397
Subject: Computer Science and Technology
Abstract/Summary:
The rapid growth of text resources on websites has made access to information increasingly difficult. Several problems are particularly challenging: extracting the information people need from massive data, structuring the unstructured data on websites, reusing extraction rules across diverse website templates, and the low efficiency of text classifiers. To address these problems, this paper proposes a WEB information extraction system based on the EM algorithm and the DOM tree, which has practical value and research significance. First, the paper studies in depth the problems of webpage classification, rule extraction and reuse, and text extraction. Traditional WEB information extraction systems are based on a single template: because website templates differ, the extracted rules cannot be reused and adaptability is low. For this reason, the paper uses a modified optimal sub-tree matching algorithm, based on DOM tree similarity, to compute the structural similarity of webpages and classify them; rules are then learned, extracted, and stored for similar webpages from the same website. Building on this webpage similarity, the paper next proposes a text extraction algorithm based on the center node and text-length features of the DOM tree. Second, a WEB text classifier processes the extracted text. Such classifiers are usually inefficient because they need a large number of labeled training samples, which experts must annotate at considerable cost in time and effort, whereas large quantities of unlabeled training text are readily available. Therefore, the extracted text is treated as unlabeled samples and converted to XML documents. For these XML documents, the paper designs a semi-supervised EM algorithm for rebuilding the training set (SSEM).
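The structural-similarity step described above can be illustrated with a small sketch. The abstract does not specify the modified sub-tree matching algorithm, so the code below substitutes the classic Simple Tree Matching procedure as a stand-in, and normalizes the match count by the combined tree size, which is also an assumption. The names `Node`, `TreeBuilder`, `simple_tree_matching`, and `structure_similarity` are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """A DOM element reduced to its tag name and ordered children."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree from HTML; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a, b):
    """Size of the largest order-preserving matching between trees a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matched roots


def structure_similarity(html_a, html_b):
    """Assumed normalization: 2 * matched nodes / total nodes, in [0, 1]."""
    a, b = parse(html_a), parse(html_b)
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))
```

Under these assumptions, pages generated from the same template should score close to 1 and pages with different layouts noticeably lower, so a threshold on `structure_similarity` could group pages before rules are learned for each group.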
To vectorize the documents, textual features are weighted with an optimized TF-IDF scheme augmented with positional weight information, which yields the feature vectors. The text vector space is then modeled with a GMM to improve the performance of the text classifier; this model is computationally efficient and easy to implement, and it reduces the complexity of the text vector space. In the experiments, the paper designs the system model and tests the components of the WEB information extraction system based on the EM algorithm and DOM tree. System tests verify that the system improves precision and recall, and analysis of algorithm efficiency indicates that it improves extraction efficiency and the running time of the whole system, achieving the expected results.
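The idea of combining a few labeled samples with many unlabeled ones through EM on a Gaussian mixture can be sketched as follows. This is not the SSEM algorithm itself, whose details the abstract does not give. It is a minimal illustration that assumes one 1-D Gaussian per class, scalar features instead of TF-IDF vectors, and at least one labeled sample for every class. The function names `m_step`, `e_step`, `semi_supervised_em`, and `classify` are hypothetical.

```python
import math


def m_step(xs, resp, n_classes, min_var):
    """Re-estimate mixture weights, means, and variances from responsibilities."""
    weights, means, variances = [], [], []
    for k in range(n_classes):
        nk = sum(r[k] for r in resp)
        mean = sum(r[k] * x for r, x in zip(resp, xs)) / nk
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, xs)) / nk
        weights.append(nk / len(xs))
        means.append(mean)
        variances.append(max(var, min_var))  # floor avoids a collapsed component
    return weights, means, variances


def e_step(x, weights, means, variances):
    """Posterior class probabilities for one sample, computed in log space."""
    logs = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    exps = [math.exp(value - top) for value in logs]
    total = sum(exps)
    return [e / total for e in exps]


def semi_supervised_em(labeled, unlabeled, n_classes, n_iter=50, min_var=1e-3):
    """Fit the mixture from (value, class) pairs plus unlabeled values.

    Assumes every class has at least one labeled sample.
    """
    labeled_x = [x for x, _ in labeled]
    # Labeled samples keep fixed one-hot responsibilities in every iteration.
    fixed = [[1.0 if k == c else 0.0 for k in range(n_classes)]
             for _, c in labeled]
    xs = labeled_x + list(unlabeled)

    # Initialize from the labeled samples alone.
    params = m_step(labeled_x, fixed, n_classes, min_var)
    for _ in range(n_iter):
        # E-step: soft-label only the unlabeled samples.
        soft = [e_step(x, *params) for x in unlabeled]
        # M-step: refit on labeled and soft-labeled samples together.
        params = m_step(xs, fixed + soft, n_classes, min_var)
    return params


def classify(x, params):
    """Index of the most probable class for x under the fitted mixture."""
    posterior = e_step(x, *params)
    return posterior.index(max(posterior))
```

A call such as `semi_supervised_em([(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)], [0.2, 0.5, 9.5, 10.2], 2)` returns the fitted `(weights, means, variances)`, which `classify` can then apply to new values. Extending the sketch to multi-dimensional feature vectors would require multivariate densities, which are omitted here for brevity.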
Keywords/Search Tags: Information Extraction, DOM Tree, EM Algorithm, Text Classifier