
Research Of Web Information Extraction Technique Based On Ontology And Text Feature

Posted on: 2012-03-20  Degree: Master  Type: Thesis
Country: China  Candidate: B Zhang  Full Text: PDF
GTID: 2248330395455568  Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, massive web data resources have become an important channel through which people obtain knowledge and information. A common problem is therefore how to get useful information out of numerous and jumbled data, and web information extraction was put forward to solve it. Although a great deal of research has been carried out on web information extraction, existing techniques often lack a description of the Web data itself, and the semantic information the data contains is not made clear. Moreover, no proper solution for page noise handling has been proposed, so current techniques adapt poorly to web pages with varied forms and structures.
To address these problems, this thesis studies Ontology technology and the overall approach to web information extraction, and analyzes the fundamental principles, extraction techniques, design ideas and development status of conventional information extraction methods. On this basis it puts forward a solution for page noise handling and an Ontology-driven information extraction pattern in which information positioning and information extraction are completed by file structure and feature matching.
First, the system uses a spider to fetch related web pages from a given URL that contains many links, and preprocesses the fetched pages with document cleaning, code conversion and HTML parsing. By applying a text-feature-based method for page noise handling, non-standard HTML documents are transformed into XML-DOM trees almost free of page noise. Then, semantic information is added to these XML documents by establishing a related Ontology with an Ontology construction method. At the same time, the information nodes are located with the help of XPath. Finally, the system transforms the source XML document into a new XML document through XSLT.
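
The noise-handling and node-locating steps can be illustrated with a minimal Python sketch. The abstract does not say which text features the thesis uses, so link density (the share of a block's text that sits inside links) is an assumption made here for illustration, as are the sample page, the 0.5 threshold, and the `CONCEPT_PATHS` mapping from ontology concepts to XPath expressions. The page is assumed to be already cleaned into well-formed XML. The final XSLT step is left out because Python's standard library has no XSLT processor.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed already cleaned into well-formed XML.
PAGE = """
<html><body>
  <div id="nav"><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
  <div id="content">
    <h1 class="title">Acme Laptop 15</h1>
    <p class="price">4999</p>
    <p>A lightweight laptop with a long battery life and a bright display.</p>
  </div>
  <div id="footer"><a href="/terms">Terms</a> <a href="/privacy">Privacy</a></div>
</body></html>
"""

# Hypothetical mapping from ontology concepts to XPath expressions.
CONCEPT_PATHS = {
    "Product.name": ".//h1[@class='title']",
    "Product.price": ".//p[@class='price']",
}

def text_length(node):
    """Count the non-whitespace characters of all text under a node."""
    return len("".join("".join(node.itertext()).split()))

def link_density(node):
    """Share of a node's text that sits inside <a> elements (0.0 to 1.0)."""
    total = text_length(node)
    if total == 0:
        return 1.0  # an empty block carries no content, so treat it as noise
    linked = sum(text_length(a) for a in node.iter("a"))
    return linked / total

def remove_noise(root, threshold=0.5):
    """Drop <div> blocks whose link density exceeds the (assumed) threshold."""
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "div" and link_density(child) > threshold:
                parent.remove(child)
    return root

def extract(root, concept_paths):
    """Locate one node per concept using ElementTree's XPath subset."""
    record = {}
    for concept, path in concept_paths.items():
        node = root.find(path)
        record[concept] = node.text.strip() if node is not None and node.text else None
    return record

root = remove_noise(ET.fromstring(PAGE.strip()))
remaining = [div.get("id") for div in root.iter("div")]
record = extract(root, CONCEPT_PATHS)
print(remaining)
print(record)
```

In this sketch the navigation and footer blocks consist entirely of link text and are dropped, while the content block is kept and queried. `xml.etree.ElementTree` implements only a subset of XPath, so a full system would need a complete XPath and XSLT processor.
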
The experimental results show that the proposed web information extraction method handles web page noise well and improves the precision and recall of the system.
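
For reference, precision and recall in an extraction setting are usually computed as below. This is a sketch of the standard definitions, not the thesis's evaluation code, and the item sets are made-up placeholders rather than experimental data.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted items.

    precision = correct / number of extracted items
    recall    = correct / number of relevant items
    """
    extracted, relevant = set(extracted), set(relevant)
    correct = len(extracted & relevant)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(relevant) if relevant else 0.0
    return precision, recall

# Placeholder example: 4 items extracted, 3 truly relevant, 2 in common.
p, r = precision_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r)
```
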
Keywords/Search Tags: Web information extraction, Semantic, Text feature, Ontology, XML