Research On SVM-Based Web Information Extraction Technology

Posted on: 2009-05-16, Degree: Master, Type: Thesis
Country: China, Candidate: J P Xiao
GTID: 2178360278480829, Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, the Web has become a global information center, and accessing its data efficiently and accurately is an increasingly urgent problem. Information extraction techniques have emerged to meet this need: they obtain the required information from Web pages so that it can serve as the basis for intelligent query systems and data mining systems.

Although a great deal of research has been done on Web information extraction, existing techniques still have shortcomings, such as low extraction accuracy, a low degree of automation, and limited applicability. This thesis builds on the support vector machine (SVM) classification method and explores both the theory and the practice of Web information extraction. Its main contributions are as follows.

Using a transductive support vector machine (TSVM) to classify large numbers of samples tends to cause problems such as low classification accuracy and long training time. After comparing existing improved TSVM classification algorithms, this thesis presents a TSVM algorithm based on incremental learning. By applying the idea of incremental learning to TSVM, together with a region multi-sample labeling rule and a label reset rule, the algorithm shortens training time and increases classification speed.

Because Web pages usually contain a large amount of information irrelevant to their subject, this thesis proposes a two-level noise filtering algorithm based on the DOM tree, built on a structural analysis of Web pages. Setting a reasonable "hyperlink granularity" helps ensure that noise relevance is judged correctly from the result of the sub-tree matching algorithm, so that irrelevant information is removed, the size of the generated DOM tree is reduced, and the data are easier to use in subsequent processing.

Based on the incremental-learning TSVM classification algorithm and the two-level DOM tree noise filtering algorithm, we designed a Web information extraction system based on the support vector machine. The system generates a DOM tree from each Web page and applies the two-level noise filtering algorithm to remove noise and reduce the size of the page. The key issue for the classification component is how to classify and extract the information users need from websites, that is, how to perform classified extraction of the data. Simulation results show that the system achieves high accuracy and recall while keeping extraction efficient.
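
The idea of filtering link-heavy regions out of a DOM tree can be illustrated with a minimal sketch. This is not the two-level algorithm of the thesis, and it does not implement sub-tree matching. It assumes, purely for illustration, that "hyperlink granularity" can be approximated by link density (the share of a subtree's text that sits inside `<a>` elements), that the page is well-formed XHTML, and that a cutoff of 0.5 is reasonable. The function names and the `threshold` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_length(node):
    """Number of characters of visible text in the subtree rooted at node."""
    return len("".join(node.itertext()).strip())


def link_density(node):
    """Share of the subtree's text that lies inside <a> elements (assumed noise signal)."""
    total = text_length(node)
    if total == 0:
        return 0.0  # nothing to judge, so do not flag empty elements
    link_chars = sum(text_length(a) for a in node.iter("a"))
    return link_chars / total


def content_text(node, threshold=0.5):
    """Collect text from the tree, skipping subtrees whose link density exceeds threshold.

    The threshold is a hypothetical cutoff, not a value taken from the thesis.
    """
    parts = []
    if node.text and node.text.strip():
        parts.append(node.text.strip())
    for child in node:
        # An inline link inside a kept block is content; a link-dominated block is noise.
        if child.tag == "a" or link_density(child) <= threshold:
            parts.extend(content_text(child, threshold))
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


if __name__ == "__main__":
    page = (
        "<html><body>"
        '<div><a href="/">Home</a> <a href="/news">News</a></div>'
        "<div><p>Support vector machines separate two classes with a "
        'maximum-margin hyperplane. See <a href="/svm">more</a>.</p></div>'
        "</body></html>"
    )
    print(" ".join(content_text(ET.fromstring(page))))
```

In this sketch a navigation bar made up almost entirely of links is dropped, while a paragraph that merely contains a link is kept. A real page would first need an HTML parser that tolerates malformed markup before a tree like this could be built.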
Keywords/Search Tags: Web information extraction, transductive SVM, incremental learning, XML, classification extraction