
Research And Implementation Of Topic-specific Information Collection Method Based On HMM

Posted on: 2011-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: L Peng
Full Text: PDF
GTID: 2178360302480379
Subject: Computer software and theory
Abstract/Summary:
This thesis addresses the problems of existing Web information collection, focusing on topic-specific information collection for vertical search engines, and completes the following main work.
First, the thesis summarizes the research background of Web information collection methods and systematically analyzes the advantages and disadvantages of general information collection methods and of existing topic-specific information collection methods. It also discusses the key techniques of focused crawling and, based on an analysis of the distribution features of Web pages, sums up some useful rules for focused crawling.
Secondly, to identify topic relevance, we extract domain information with the aid of a general search engine to extend the topic description, and dynamically establish a topic-weight table. Using the topic-weight table, we introduce a topic relevance algorithm that predicts the relevance between on-site information and the topic by combining content similarity with Web metadata analysis. The characteristic vector of the topic, constructed from the topic relevance of pages and links, is used as the metric for finding relevant Web resources. It is the core of the thesis and the precondition for the subsequent modeling.
Thirdly, the thesis focuses on applying the Hidden Markov Model (HMM) to topic-specific information collection. Following the theory and main algorithms of HMM, we combine the topic hierarchies of Web sites with the characteristic vector of the specific topic in order to address the shortcomings of traditional focused crawlers. On this basis, a new HMM-based method of topic-specific information collection is proposed, and related issues are discussed in detail.
Finally, a prototype system is developed, with the innovation of the thesis lying mainly in its use of several open-source projects.
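The topic relevance step described above can be illustrated with a minimal sketch. It is not the thesis's algorithm, whose details are not given in this abstract. It assumes that the topic-weight table is a plain term-to-weight mapping, that content similarity is cosine similarity over term frequencies, and that page body and metadata scores are blended linearly. The names `content_similarity`, `topic_relevance`, and the blending parameter `alpha` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def content_similarity(text, topic_weights):
    """Cosine similarity between the text's term frequencies and the
    topic-weight table (assumed here to map term -> weight)."""
    tf = Counter(tokenize(text))
    dot = sum(tf[term] * weight for term, weight in topic_weights.items())
    norm_text = math.sqrt(sum(count * count for count in tf.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_weights.values()))
    if norm_text == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_text * norm_topic)


def topic_relevance(body, metadata, topic_weights, alpha=0.7):
    """Blend body similarity with metadata similarity (title, anchor text).
    The linear blend and the default alpha are assumptions for illustration."""
    body_score = content_similarity(body, topic_weights)
    meta_score = content_similarity(metadata, topic_weights)
    return alpha * body_score + (1 - alpha) * meta_score


# Hypothetical topic-weight table for a "focused crawling" topic.
weights = {"crawler": 1.0, "hmm": 0.8, "topic": 0.5}
on_topic = topic_relevance("A focused crawler guided by an HMM", "topic crawler", weights)
off_topic = topic_relevance("Recipes for summer salads", "cooking blog", weights)
print(on_topic > off_topic)
```

A crawler built along these lines would score each fetched page and each outgoing link with such a function and collect the scores into the characteristic vector that the later modeling consumes.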
Many experiments were conducted on the Web for different topics. The results demonstrate that the proposed information collection method, which uses a trained Hidden Markov Model and its automatic recognition capability, improves on traditional focused crawlers. Furthermore, the method can significantly improve the precision of topic relevance, effectively prevent the "off-topic" phenomenon, and alleviate the problem of "tunneling" to some extent. It saves users considerable time in filtering Web sites and integrating relevant Web pages, and can therefore meet the needs of users with topic-specific requirements.
From the theoretical analysis and the experimental results, we conclude that research on HMM-based topic-specific information collection has important theoretical value as well as broad application prospects.
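To make the role of a trained HMM more concrete, the following sketch shows standard Viterbi decoding, which recovers the most likely hidden state sequence for a sequence of observations. It is an illustration only. The abstract does not specify the thesis's states, observations, or parameters, so the two states (`relevant`, `irrelevant`), the two observation symbols (`high`, `low`, standing for discretized relevance scores along a crawl path), and all probabilities below are hypothetical.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state path for the observations.
    Works in log space and assumes every probability used is > 0."""
    first = observations[0]
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][first]) for s in states}]
    back = [{}]  # back[t][s] = best predecessor of state s at step t
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((p, best[-1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            scores[s] = score + math.log(emit_p[s][obs])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical model: pages along a crawl path are either relevant or not,
# and we only observe a discretized relevance score for each page.
states = ["relevant", "irrelevant"]
start_p = {"relevant": 0.5, "irrelevant": 0.5}
trans_p = {
    "relevant": {"relevant": 0.7, "irrelevant": 0.3},
    "irrelevant": {"relevant": 0.3, "irrelevant": 0.7},
}
emit_p = {
    "relevant": {"high": 0.9, "low": 0.1},
    "irrelevant": {"high": 0.2, "low": 0.8},
}
print(viterbi(["high", "high", "low", "low"], states, start_p, trans_p, emit_p))
```

In a focused crawler, a decoded state sequence of this kind could be used to decide which links to follow next, including links on low-scoring pages that may still lead to relevant ones (the "tunneling" case).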
Keywords/Search Tags: focused crawler, HMM, information collection, topic relevance