Research On SVM-Based Web Information Extraction Technology

Posted on:2009-05-16

Degree:Master

Type:Thesis

Country:China

Candidate:J P Xiao

Full Text:PDF

GTID:2178360278480829

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, the Web has become a global information center, and accessing its data efficiently and accurately is increasingly urgent. Information extraction addresses this need: it obtains the required information from Web pages and supplies it as the basis for intelligent query systems and data mining systems.

Although a great deal of research has been done on Web information extraction, technical shortcomings remain, such as low extraction accuracy, a low degree of automation, and limited practical applicability. This thesis builds on the support vector machine (SVM) classification method and explores both the theory and the practice of Web information extraction. Its main contributions are as follows.

First, using a transductive support vector machine (TSVM) to classify large numbers of samples tends to give low classification accuracy and long training times. After comparing existing improved TSVM classification algorithms, this thesis presents a TSVM algorithm based on incremental learning. By applying the idea of incremental learning to TSVM, together with a region multi-sample labeling rule and a label reset rule, the algorithm shortens training time and speeds up classification.

Second, because Web pages usually contain a large amount of information irrelevant to their subject, this thesis proposes a two-level noise filtering algorithm based on the DOM tree, built on a structural analysis of Web pages. Setting a reasonable "hyperlink granularity" helps ensure that noise relevance is judged correctly from the result of the sub-tree matching algorithm, so that irrelevant information is removed, the size of the generated DOM tree is reduced, and the data is easier to use in later processing.

Third, based on the incremental-learning TSVM classification algorithm and the two-level DOM tree noise filtering algorithm, we designed a Web information extraction system based on the support vector machine. The system builds a DOM tree from each Web page and applies the two-level noise filtering algorithm to remove noise and reduce the page's size. The key task of the classification component is to classify and extract the information users need from websites, that is, to perform classified data extraction. Simulation results show that the system achieves high accuracy and recall while keeping extraction efficient.
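
The abstract does not give the details of the noise filter, so the following is only a minimal sketch of the general idea of pruning link-heavy DOM sub-trees. It assumes that "hyperlink granularity" can be approximated by the share of a sub-tree's text that sits inside `<a>` elements; that reading, the `threshold` value, the `BLOCK_TAGS` set, and all function names are illustrative assumptions, and the thesis's two filtering levels and sub-tree matching step are not reproduced.

```python
from html.parser import HTMLParser

# Assumption: elements that never have a closing tag.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Assumption: only these container elements are candidates for removal.
BLOCK_TAGS = {"div", "ul", "ol", "table", "td", "nav", "aside", "section"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Node objects
        self.text = ""      # text found directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from reasonably well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def text_lengths(node, in_link=False):
    """Return (total_chars, link_chars) for the sub-tree rooted at node."""
    in_link = in_link or node.tag == "a"
    total = len(node.text.strip())
    link = total if in_link else 0
    for child in node.children:
        child_total, child_link = text_lengths(child, in_link)
        total += child_total
        link += child_link
    return total, link


def prune_noise(node, threshold=0.6):
    """Drop block-level sub-trees whose link-text ratio exceeds threshold."""
    kept = []
    for child in node.children:
        total, link = text_lengths(child)
        if child.tag in BLOCK_TAGS and total > 0 and link / total > threshold:
            continue  # treated as navigation or advertising noise
        prune_noise(child, threshold)
        kept.append(child)
    node.children = kept


def extract_text(node):
    """Collect remaining text (element text first, then its children)."""
    parts = [node.text.strip()] + [extract_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def filter_page(html, threshold=0.6):
    builder = TreeBuilder()
    builder.feed(html)
    prune_noise(builder.root, threshold)
    return extract_text(builder.root)
```

Calling `filter_page(html)` on a page whose navigation bar consists only of links would drop that bar while keeping body paragraphs that contain an occasional inline link. In a pipeline like the one described above, the remaining text would then be passed to the classifier.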

Keywords/Search Tags:

Web information extraction, transductive SVM, incremental learning, XML, classification extraction

Related items

1	Research On Nonlinear Incremental Feature Extraction For High Dimensional Data
2	Research Of Algorithms And Applications On Transductive Transfer Learning
3	Design And Implementation Of Enterprise News Information Classification Subsystem In Distributed Environment
4	Design And Implementation Of Web Information Extraction Rules
5	Short Text Classification And Information Extraction Research Based On Deep Learning
6	Incremental Learning With Label Noise For Face Recognition
7	The Research On Optimization Of ETL Process And Incremental Data Extraction
8	Research And Application Of Image Classification Based On Transductive Support Vector Machines
9	The Research Of Land Cover Information Extraction With Remote Sensing Data Based On Machine Learning
10	Question Intention Classification Based On Information Extraction