XML-Based Web Information Extraction Technology Research

Posted on: 2009-10-08    Degree: Master    Type: Thesis
Country: China    Candidate: X B Shi    Full Text: PDF
GTID: 2208360245468763    Subject: Computer software and theory
Abstract/Summary:
With its rapid development, the Internet has become an important source of global information dissemination and sharing. Data on the Web has grown geometrically, and obtaining useful information from it has become increasingly difficult. "Information overload" has become a problem that urgently needs a solution. Ideally, people could query the Web for information just as they search a database. How to access and use the useful information on the Web has therefore become a research problem.

The Internet's characteristics, such as its massive scale, heterogeneous structure, and dynamic change, make Web information extraction different from traditional information extraction and bring new challenges. Extraction technology has been steadily enriched as demand has grown, and many information extraction methods have emerged both at home and abroad in recent years. These methods focus on the problems described above and have achieved good results overall, but each has limitations or flaws of varying degree in certain areas. To better address these problems and shortcomings, further research on Web information extraction is necessary.

In this thesis, the author uses standard XML technology to solve the problem of Web site information extraction and develops a specialized Cheating Event Information Extraction System (CEIES). Because the approach is based on standard XSLT, its powerful and flexible features make it possible to write simple, robust, and general rules. First, the target HTML page is fetched and the HTML file is converted into an XHTML file with an XML parser. The data query capabilities of XML are then used to query the XML library. DOM trees are used to store the rules in the rule base. Based on the usage of each key verb, as described by Case Grammar, partial information is extracted from each sentence and expressed as Knowledge Graphs. Joining the Knowledge Graphs integrates this partial information. Finally, the resulting items of information are stored in the CEIES database.
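
The abstract does not show any code, so the following is only a minimal sketch of the first stages it describes (repair HTML into a well-formed XHTML-like tree, then apply path-based rules), not the CEIES implementation. It assumes Python's standard library instead of XSLT: the limited XPath subset of `xml.etree.ElementTree` stands in for XSLT rules, and a plain dictionary stands in for the rule base. The names `XHTMLBuilder`, `to_xhtml`, `extract`, `RULES`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class XHTMLBuilder(HTMLParser):
    """Build a well-formed element tree from loosely written HTML.

    Simplistic repair only: unclosed elements are closed when an
    enclosing element closes or when the input ends. Assumes the
    page has a single root element.
    """

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()
        self.open_tags = []  # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        if tag in VOID_TAGS:
            self.builder.end(tag)  # close void elements immediately
        else:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        self.builder.start(tag, {k: v or "" for k, v in attrs})
        self.builder.end(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # ignore a stray closing tag
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)  # also closes unclosed children
            if top == tag:
                break

    def handle_data(self, data):
        if self.open_tags:  # drop text outside the root element
            self.builder.data(data)

    def close(self):
        super().close()
        while self.open_tags:  # close whatever is still open
            self.builder.end(self.open_tags.pop())
        return self.builder.close()


def to_xhtml(html_text):
    """Return the root element of the repaired document."""
    parser = XHTMLBuilder()
    parser.feed(html_text)
    return parser.close()


def extract(root, rules):
    """Apply each field's path rule and collect the matched text."""
    record = {}
    for field, path in rules.items():
        node = root.find(path)
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


# Hypothetical page: <br> is unclosed, and </span> and </html> are missing.
RAW_HTML = """
<html><body>
<div class="event">
  <span class="title">Exam fraud reported</span><br>
  <span class="date">2008-05-12
</div>
</body>
"""

# Hypothetical rule base: field name -> ElementTree path expression.
RULES = {
    "title": ".//span[@class='title']",
    "date": ".//span[@class='date']",
}

if __name__ == "__main__":
    print(extract(to_xhtml(RAW_HTML), RULES))
```

Keeping the rules as data, separate from the code that applies them, mirrors the rule-base idea in the abstract: supporting a new page layout would mean adding rules rather than changing the extractor. A real system would likely need full XPath or XSLT support and a more careful HTML repair step than this sketch provides.
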
Keywords/Search Tags: Information Extraction, Natural Language Comprehension, XML, DOM trees, Knowledge Graphs