Study On Information Extraction Technology Based On Web Described With XML

Posted on: 2008-12-31
Degree: Master
Type: Thesis
Country: China
Candidate: J Chen
Full Text: PDF
GTID: 2178360215974394
Subject: Computer applications
Abstract/Summary:
With its rapid development, the Internet has become an important channel for disseminating and sharing information worldwide. Data on the Web has grown geometrically, and obtaining useful information from it has become increasingly difficult. "Information overload" has become a problem in urgent need of a solution. Ideally, people could query the Web for information just as they query a database. How to access and use the useful information on the Web has therefore become a research problem.

The characteristics of the Internet, such as its massive scale, heterogeneous structure, and dynamic change, make Web information extraction different from traditional information extraction and bring new challenges. As demand has grown, extraction technology has been steadily enriched, and many information extraction methods have emerged in China and abroad in recent years. These methods concentrate on the problems above and achieve good results overall, but in certain areas each has limitations or flaws of varying degree. Further research on Web information extraction is therefore needed to address these problems and shortcomings.

In this thesis, we use standard XML technology to solve the problem of Web page information extraction. The approach is based on standard XSLT, whose power and flexibility make it possible to write simple, robust, and general rules. The target HTML page is retrieved first, and the HTML file is converted into an XHTML file with an XML parser. The data query capabilities of XML are then used to query the XML library. DOM trees are used to store the rules in the rule base. The results are then delivered to the client, which completes the extraction of the data the user requires.
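The extraction step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the thesis uses XSLT rules, while this sketch swaps in the XPath subset of Python's standard `xml.etree.ElementTree` so that it runs without third-party libraries. The sample page, the `RULES` dictionary, and the `extract` function are all hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML page, standing in for an HTML page already
# converted to well-formed XHTML.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="nav"><a href="/">Home</a></div>
    <table class="products">
      <tr><td class="name">Widget</td><td class="price">9.50</td></tr>
      <tr><td class="name">Gadget</td><td class="price">12.00</td></tr>
    </table>
  </body>
</html>"""

NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical rule base: one path selecting record nodes, plus one
# path per field, evaluated relative to each record node.
RULES = {
    "record": ".//x:table[@class='products']/x:tr",
    "fields": {
        "name": "x:td[@class='name']",
        "price": "x:td[@class='price']",
    },
}


def extract(xhtml_text, rules):
    """Apply the path rules to an XHTML string and return a list of dicts."""
    root = ET.fromstring(xhtml_text)
    records = []
    for node in root.findall(rules["record"], NS):
        record = {}
        for field, path in rules["fields"].items():
            cell = node.find(path, NS)
            has_text = cell is not None and cell.text
            record[field] = cell.text.strip() if has_text else None
        records.append(record)
    return records


if __name__ == "__main__":
    for row in extract(XHTML, RULES):
        print(row)
```

Keeping the rules as data, separate from the code that applies them, reflects the rule-base idea in the abstract: when a source page changes, only the paths need to be updated.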
However, this is not a general-purpose extraction system but one tied to particular domains. Effective extraction rules must still be formulated, and once the structure of the source Web page changes, extraction may fail. We therefore improve the approach by dividing a page into several semantic blocks, where the content of each block is likely to relate to a single theme. After segmentation, useless blocks such as navigation and copyright information can be eliminated, and processing operates on semantically related blocks rather than on the entire page, which significantly improves the quality of information retrieval. Here we draw on information entropy theory and construct a DOM semantic tree to make up for the shortcomings of DOM-based segmentation. Finally, we treat the target XML as a tree and use an object-relational mapping language to map objects to a relational database, in which the XML and the extracted information are stored. With our platform, robust and general wrappers can be developed rapidly.
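To make the entropy idea concrete, the sketch below scores page blocks by the Shannon entropy of their word distribution. The abstract does not state the formula or the decision rule the thesis uses, so both are assumptions here: the entropy is computed over words within a single block, blocks at or above a hypothetical `threshold` are kept, and the names `shannon_entropy` and `keep_informative` are invented for illustration.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy in bits of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def keep_informative(blocks, threshold):
    """Return names of blocks whose word entropy is at least threshold.

    Assumption: repetitive blocks (for example a navigation bar) have
    low word entropy and can be dropped. The thesis may use a different
    criterion.
    """
    kept = []
    for name, text in blocks.items():
        if shannon_entropy(text.lower().split()) >= threshold:
            kept.append(name)
    return kept


if __name__ == "__main__":
    # Hypothetical blocks produced by some earlier page segmentation.
    blocks = {
        "nav": "home home home home",
        "content": "xml rules extract data from pages",
    }
    print(keep_informative(blocks, threshold=1.0))
```

In practice the threshold would have to be tuned on real pages, and the block boundaries would come from the DOM tree rather than from a hand-written dictionary.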
Keywords/Search Tags: Information Extraction, mapping, information entropy, DOM tree