Research Of Web Information Extraction Technique Based On Ontology And Text Feature

Posted on:2012-03-20

Degree:Master

Type:Thesis

Country:China

Candidate:B Zhang

Full Text:PDF

GTID:2248330395455568

Subject:Computer application technology

Abstract/Summary:

With the rapid development of the Internet, massive web data resources have become an important channel through which people obtain knowledge and information. A common problem is therefore how to get useful information out of numerous and jumbled data. Web information extraction was put forward to solve this problem. Although a great deal of research has been carried out on web information extraction, existing technology often lacks a description of the web data itself, and the semantic information contained in the data is not made clear. In addition, no proper solution for page noise handling has been proposed, so current technology adapts poorly to web pages with varied forms and structures.

To address these problems, and building on a study of Ontology technology and of the overall approach to web information extraction, this thesis puts forward a solution for page noise handling and an Ontology-driven information extraction pattern in which information positioning and information extraction are completed by file structure and feature matching. The proposal draws on research and analysis of the fundamental principles of conventional information extraction methods, extraction technology, design ideas and development status.

First, the system uses a spider to fetch related web pages from a given URL that contains many links, and preprocesses the fetched pages through document cleaning, code conversion and HTML parsing. By handling page noise with a text-feature method, non-standard HTML documents are transformed into XML-DOM trees almost free of page noise. Then, semantic information is added to these XML documents by establishing a related Ontology using the Ontology establishing method. At the same time, the information nodes are located with the help of XPath. Finally, the system transforms the source XML document into a new XML document through XSLT.

The experimental results show that the proposed web information extraction method handles web page noise well, and that the precision and recall of the system are also higher.
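
The pipeline described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the sample page, the link-density threshold used as the text feature, and the concept-to-XPath table standing in for the Ontology are all assumptions made for illustration. The Python standard library has no XSLT processor, so the final step builds the output XML document directly instead of applying a stylesheet.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already cleaned into well-formed XML.
PAGE = """
<html><body>
  <div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
  <div id="content">
    <h1 class="title">Ontology-driven extraction</h1>
    <p class="author">B Zhang</p>
    <p class="body">Web information extraction locates useful data inside noisy pages.</p>
  </div>
  <div id="footer"><a href="/terms">Terms</a> <a href="/ads">Advertise</a></div>
</body></html>
"""

# Assumed stand-in for the Ontology: each concept maps to an XPath expression.
ONTOLOGY_XPATHS = {
    "title": ".//h1[@class='title']",
    "author": ".//p[@class='author']",
    "body": ".//p[@class='body']",
}


def text_len(node):
    """Length of all text inside a node, ignoring surrounding whitespace."""
    return len("".join(node.itertext()).strip())


def link_density(node):
    """Share of a node's text that sits inside <a> elements (assumed text feature)."""
    total = text_len(node)
    if total == 0:
        return 1.0
    return sum(text_len(a) for a in node.iter("a")) / total


def prune_noise(root, max_density=0.5):
    """Remove <div> blocks whose text is mostly links (navigation, footers, ads)."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "div" and link_density(child) > max_density:
                parent.remove(child)
    return root


def extract(root, mapping):
    """Locate each concept by XPath and copy its text into a new XML document."""
    record = ET.Element("record")
    for concept, xpath in mapping.items():
        node = root.find(xpath)
        if node is not None:
            field = ET.SubElement(record, concept)
            field.text = "".join(node.itertext()).strip()
    return record


root = prune_noise(ET.fromstring(PAGE.strip()))
record = extract(root, ONTOLOGY_XPATHS)
print(ET.tostring(record, encoding="unicode"))
```

Link density is only one possible text feature; the threshold of 0.5 is arbitrary and would need tuning on real pages. With a library that supports XSLT, the `extract` step could be replaced by a stylesheet whose templates match the same XPath expressions.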

Keywords/Search Tags:

Web information extraction, Semantic, Text feature, Ontology, XML

Related items

1	Research Of Web Information Extraction Technology Based On Semantic
2	The Research Of Whole Palm Vein Recognition Algorithm
3	Short Text Classification Based On Integration Of Ontology And BTM Feature Extension
4	The Research Of Text Classification Based On Ontology
5	Web Text Mining And Information Retrieve Services Based On Ontology
6	Study On Chinese Text Classification Combined With Ontology
7	Unstructured Information Search Based On Ontology Semantics And Object Feature
8	An Ontology-Based Approach To Storage And Represent The Results Of Text Mining
9	Research On Text Intelligent Classification Based On Ontology
10	Feature Information Extraction And XML Document Exchange Technology Based On STEP Neutral File