
An Application Research Of Information Extraction On Topic Search Engine

Posted on: 2012-06-28
Degree: Master
Type: Thesis
Country: China
Candidate: C Xu
GTID: 2218330344950314
Subject: Management Science and Engineering
Abstract/Summary:
The rapid development of network and information technology has made the Internet a key platform for propagating and sharing information resources worldwide. As the amount of information on the Internet grows exponentially, obtaining useful information from the Web has become increasingly difficult, and "information overload" has become a pressing problem. Users urgently need to be able to search for information as precisely as they would query a database. How to organize and analyze vast amounts of Web information resources effectively and extract useful information from them has therefore become a research hotspot that attracts many researchers. In recent years the topic search engine has emerged: a specialized search engine based on a theme or field that collects only topic-relevant information, helping users quickly access the knowledge and information they need. This is why information on specific topics must be extracted from the network repository.
At present, among the key technologies involved in a topic search engine, structured information extraction is what distinguishes it from a general search engine. Previous information extraction schemes extract structured information from the entire Web page and generate a wrapper from it, and the accuracy of that wrapper is not high. The "non-theme" information included in the extraction process may interfere with the extraction result and reduce the accuracy of the final wrapper when it is applied to similar Web pages.
On the other hand, improving the efficiency and accuracy of information extraction can greatly enhance the general applicability of the topic search engine and provide users with efficient and accurate search results.
To improve the performance and efficiency of extraction, this paper studies techniques for extracting structured information from Web pages, seeks effective and reliable methods to improve the extraction process, and tries to automate that process as far as possible to avoid excessive manual intervention. Overall, this paper tries to use the available resources effectively and to construct the information retrieval system rationally. It covers the following aspects:
Firstly, it studies methods for obtaining Web pages in batches. To get information on a specific field, relevant pages must be downloaded from the Internet as the original data. By analyzing the key technologies of page crawlers and the related module structure, this paper implements a simple Web page crawler.
Secondly, it studies Web page denoising methods. Page noise seriously affects the identification of the page topic and therefore the quality and efficiency of the final search results, so removing noise content from Web pages is an important precondition for the accuracy of the search engine. After studying several existing block segmentation and denoising models and analyzing their advantages and disadvantages, this paper proposes an improved algorithm that combines statistical methods with a DOM-based page segmentation model to improve the effectiveness and efficiency of Web page denoising.
Thirdly, it studies the vector representation model for Chinese text and feature selection methods. The quality of the text feature representation directly affects the selection of feature terms and thus the subsequent text classification.
Previous research results indicate that the number of feature dimensions in the vector space representation of text is closely related to the efficiency of the classification algorithm. This paper studies feature selection methods, analyzes several popular feature selection algorithms based on the vector space model, and, after an in-depth study of the chi-square (CHI) algorithm, makes some improvements to it that raise its performance and quality and lead to better results in the subsequent text classification.
Fourthly, it studies Web page text classification methods. Because Web pages are huge in number and lack reliable label information, using them effectively requires reliable category labels obtained by automatic classification so that they can be processed separately. Considering the semi-structured nature of Web text data, this paper improves the classic Naive Bayes algorithm to extract and use the structural information in Web page text, which improves classification and reduces classification error.
Finally, this paper conducts comparative experiments on Web page classification methods. The experimental results show that the proposed method effectively improves classification accuracy and reduces computation time, and is suitable for information extraction in a topic search engine. The paper then describes the system's top-level architecture design and the workflow charts and class diagrams of its modules, and provides some implementation code for future researchers' reference.
The main innovation of this paper is a set of improved algorithms for Web page denoising, feature selection, and Web page classification, aimed at the information extraction requirements of a topic search engine. Experimental results show that, compared with similar algorithms, they effectively improve accuracy and reduce time complexity.
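For readers unfamiliar with the chi-square (CHI) feature selection mentioned above, the following is a minimal sketch of the textbook statistic only. It does not reproduce the improvements made in the thesis, which the abstract does not specify. The function names `chi_square_scores` and `select_features` are illustrative, and the input is assumed to be documents that have already been segmented into terms.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Return {category: {term: chi-square score}} for tokenized documents.

    Textbook CHI statistic on document frequencies; not the thesis's
    improved variant.
    """
    n = len(docs)
    class_count = Counter(labels)
    term_count = Counter()        # number of documents containing each term
    term_class_count = Counter()  # same, split by category
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_count[term] += 1
            term_class_count[(term, label)] += 1

    scores = {category: {} for category in class_count}
    for category, n_c in class_count.items():
        for term, n_t in term_count.items():
            a = term_class_count[(term, category)]  # in category, has term
            b = n_t - a                             # other categories, has term
            c = n_c - a                             # in category, lacks term
            d = n - n_c - b                         # other categories, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            scores[category][term] = (
                n * (a * d - b * c) ** 2 / denom if denom else 0.0
            )
    return scores


def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score over any category."""
    best = {}
    for per_category in chi_square_scores(docs, labels).values():
        for term, score in per_category.items():
            best[term] = max(best.get(term, 0.0), score)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(best, key=lambda t: (-best[t], t))[:k]
```

Taking the maximum score over categories is one common way to turn per-category scores into a single ranking; averaging weighted by category size is another.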
These algorithms meet the text information extraction requirements of a topic search engine well.
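The abstract does not say how the improved Naive Bayes classifier uses the structural information in a Web page. As one possible illustration only, the sketch below weights term counts by the page field they come from before applying a standard multinomial Naive Bayes with Laplace smoothing. The class name `FieldWeightedNB`, the field names, and the weights in `FIELD_WEIGHTS` are all assumptions and are not taken from the thesis.

```python
import math
from collections import Counter, defaultdict

# Assumed weights: a term in the title counts three times as much as in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def weighted_counts(page):
    """Turn {field: [terms]} into weighted term counts."""
    counts = Counter()
    for field, tokens in page.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for term in tokens:
            counts[term] += weight
    return counts


class FieldWeightedNB:
    """Multinomial Naive Bayes over field-weighted term counts (illustrative)."""

    def fit(self, pages, labels):
        self.n_docs = len(labels)
        self.class_docs = Counter(labels)
        self.term_totals = defaultdict(Counter)
        self.vocab = set()
        for page, label in zip(pages, labels):
            counts = weighted_counts(page)
            self.term_totals[label].update(counts)
            self.vocab.update(counts)
        return self

    def predict(self, page):
        counts = weighted_counts(page)
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_c in self.class_docs.items():
            total = sum(self.term_totals[label].values())
            score = math.log(n_c / self.n_docs)  # log prior
            for term, weight in counts.items():
                if term not in self.vocab:
                    continue  # ignore terms never seen in training
                # Laplace-smoothed conditional probability of the term.
                p = (self.term_totals[label][term] + 1.0) / (total + vocab_size)
                score += weight * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

With these weights, a page whose title terms point to one category and whose body terms point to another tends to be assigned to the title's category, which is the kind of effect structural weighting is meant to produce.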
Keywords/Search Tags:Topic search engine, Information retrieval, Feature selection, Text categorization