The Research And Implementation Of Web Text Classification That Use Table Information

Posted on: 2009-08-10 | Degree: Master | Type: Thesis
Country: China | Candidate: H X Gui | Full Text: PDF
GTID: 2178360272463922 | Subject: Computer application technology
Abstract/Summary:
With the development and widespread use of the Internet and other information technologies, the Web has become one of the most important ways to obtain information. There is an urgent need to search and classify documents quickly and precisely within this huge body of information. Techniques for extracting information from Web documents and for classifying Web text automatically are considered essential components of information processing, and they are attracting growing attention.

First, this thesis studies Web information extraction and presents a new model that extracts information from tables in Web documents based on table structure. The model consists of a table positioning module, a table structure preprocessing module, and a table information extraction and refactoring module. It extracts information from tables according to Web table structure tags and user-defined heuristic rules. Experimental results show that the model is well suited to extracting information from tables in Web documents.

Next, drawing on Web text classification techniques and ontology theory, we build a domain ontology from the characteristic information of Web tables and design a two-pass ("Two Times") classification model for Web text. In the first pass, the model classifies the test data with a Support Vector Machine classifier. For test data whose categories cannot be confirmed, the second pass extracts the characteristic information of Web tables and matches it by similarity against a classification model based on the domain ontology. Finally, experiments comparing the Two Times classification model with the Support Vector Machine classification model show that precision and recall improve significantly, which supports the validity of the model.
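The two-pass idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the names `TableCellCollector`, `table_terms`, `jaccard`, and `classify_two_stage`, both thresholds, the use of Jaccard overlap as the similarity measure, and the representation of the domain ontology as a plain mapping from category to term set are all assumptions made for illustration. The first pass is a pluggable callable returning a label and a confidence; a Support Vector Machine classifier would be plugged in there.

```python
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional, Set, Tuple


class TableCellCollector(HTMLParser):
    """Collects the text of <td>/<th> cells that appear inside a <table>.

    Simplification: nested tables are not handled specially.
    """

    def __init__(self) -> None:
        super().__init__()
        self.cells: List[str] = []
        self._table_depth = 0
        self._in_cell = False
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_depth += 1
        elif tag in ("td", "th") and self._table_depth > 0:
            self._in_cell = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
        elif tag in ("td", "th") and self._in_cell:
            # Normalize whitespace and keep only non-empty cells.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.cells.append(text)
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)


def table_terms(html: str) -> Set[str]:
    """Returns the lowercase words found in the table cells of a page."""
    parser = TableCellCollector()
    parser.feed(html)
    parser.close()
    return {word.lower() for cell in parser.cells for word in cell.split()}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap, used here as a stand-in similarity measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify_two_stage(
    html: str,
    first_stage: Callable[[str], Tuple[str, float]],
    ontology: Dict[str, Set[str]],
    confidence_threshold: float = 0.6,   # hypothetical value
    similarity_threshold: float = 0.1,   # hypothetical value
) -> Optional[str]:
    """First pass: trust the classifier when it is confident enough.
    Second pass: match table terms against each category's concept terms."""
    label, confidence = first_stage(html)
    if confidence >= confidence_threshold:
        return label

    terms = table_terms(html)
    best_label, best_score = None, 0.0
    for category, concept_terms in ontology.items():
        score = jaccard(terms, concept_terms)
        if score > best_score:
            best_label, best_score = category, score
    return best_label if best_score >= similarity_threshold else None
```

In this sketch a page is passed to the second stage only when the first-stage confidence falls below the threshold, which mirrors the abstract's handling of test data whose categories are not confirmed. A page with no matching table terms is left unclassified (`None`) rather than forced into a category.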
Keywords/Search Tags:information extraction, classification of Web text, heuristic method rules, domain ontology, similarity matching