Research And Implementation Of Topic-specific Information Collection Method Based On HMM

Posted on:2011-11-01

Degree:Master

Type:Thesis

Country:China

Candidate:L Peng

Full Text:PDF

GTID:2178360302480379

Subject:Computer software and theory

Abstract/Summary:

This thesis addresses problems in existing Web information collection, focusing on topic-specific information collection for vertical search engines. Its main work is as follows.

First, the thesis summarizes the research background of Web information collection methods and systematically analyzes the advantages and disadvantages of general-purpose collection methods and of existing topic-specific collection methods. It also discusses the key techniques of focused crawling and, from an analysis of the distribution features of Web pages, derives several useful rules for focused crawling.

Second, to identify topic relevance, domain information is extracted with the aid of a general search engine to extend the topic description, and a topic-weight table is established dynamically. Using this table, a topic relevance algorithm predicts the relevance between on-site information and the topic by combining content similarity with Web metadata analysis. The characteristic vector of the topic, constructed from the topic relevance of pages and links, is then used as the measure for finding relevant Web resources. This measure is the core of the thesis and the precondition for the subsequent modeling.

Third, the thesis applies the Hidden Markov Model (HMM) to topic-specific information collection. Following the theory and main algorithms of HMM, the topic hierarchies of Web sites are combined with the characteristic vector of the specific topic to address the shortcomings of traditional focused crawlers. On this basis, a new HMM-based method of topic-specific information collection is proposed, and related issues are discussed in detail.

Finally, the innovation of the thesis is put into practice: a prototype system is developed using several open-source projects.
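
The topic relevance step described in the abstract, which combines content similarity with Web metadata analysis over a topic-weight table, might be sketched as follows. This is only an illustration under stated assumptions: the table entries, the choice of title and anchor text as metadata, the cosine measure, and the `content_share` mixing weight are all hypothetical and are not taken from the thesis.

```python
import math
import re
from collections import Counter

# Hypothetical topic-weight table: term -> weight (illustrative values only).
TOPIC_WEIGHTS = {"crawler": 1.0, "focused": 0.8, "hmm": 0.9, "relevance": 0.6}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_to_topic(text, topic_weights):
    """Cosine similarity between a text's term counts and the topic weights."""
    tf = Counter(tokenize(text))
    dot = sum(tf[term] * weight for term, weight in topic_weights.items())
    norm_text = math.sqrt(sum(count * count for count in tf.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_weights.values()))
    if norm_text == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_text * norm_topic)


def topic_relevance(body, title, anchor_text, topic_weights, content_share=0.6):
    """Blend content similarity with a metadata score (assumed: title + anchor text)."""
    content = cosine_to_topic(body, topic_weights)
    metadata = cosine_to_topic(title + " " + anchor_text, topic_weights)
    return content_share * content + (1 - content_share) * metadata
```

A page whose body, title, and anchor text share no terms with the table scores 0.0, while a page that mentions the weighted terms scores closer to 1. In the thesis the table is built dynamically from search-engine results; here it is fixed by hand for brevity.
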
Experiments conducted on the Web for different topics show that the new collection method, which uses the trained Hidden Markov Model, improves on a traditional focused crawler thanks to its automatic recognition capability. The method significantly improves the precision of topic relevance, effectively prevents the "off-topic" phenomenon, and alleviates the "tunneling" problem to some extent. It saves users considerable time in filtering Web sites and integrating relevant Web pages, and therefore serves users with topic-specific needs well. The theoretical analysis and the experimental results indicate that research on HMM-based topic-specific information collection has important theoretical value as well as broad application prospects.
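
One way an HMM can help with "tunneling" is to score a link by the whole sequence of pages leading to it rather than by the last page alone. The sketch below is an assumption-laden illustration, not the model from the thesis: the hidden states (estimated link distance to an on-topic page), the observation symbols (discretized relevance levels), and every probability are hypothetical and hand-set, whereas a real system would estimate them from training crawls.

```python
# Hypothetical hidden states: how far the current page is from an on-topic page.
STATES = ["target", "one_hop", "far"]

# Hand-set parameters for illustration only; a trained model would learn these.
START = {"target": 0.2, "one_hop": 0.3, "far": 0.5}
TRANS = {
    "target": {"target": 0.6, "one_hop": 0.3, "far": 0.1},
    "one_hop": {"target": 0.5, "one_hop": 0.3, "far": 0.2},
    "far": {"target": 0.1, "one_hop": 0.3, "far": 0.6},
}
# Observations are discretized relevance levels of visited pages.
EMIT = {
    "target": {"high": 0.7, "medium": 0.2, "low": 0.1},
    "one_hop": {"high": 0.2, "medium": 0.5, "low": 0.3},
    "far": {"high": 0.05, "medium": 0.25, "low": 0.7},
}


def normalize(dist):
    """Scale a dict of non-negative scores so the values sum to 1."""
    total = sum(dist.values())
    return {key: value / total for key, value in dist.items()}


def filter_states(observations):
    """Normalized forward algorithm: P(current state | observations so far)."""
    belief = normalize({s: START[s] * EMIT[s][observations[0]] for s in STATES})
    for obs in observations[1:]:
        belief = normalize({
            s: EMIT[s][obs] * sum(belief[p] * TRANS[p][s] for p in STATES)
            for s in STATES
        })
    return belief


def link_priority(observations):
    """Probability that the next page along this path is a target page."""
    belief = filter_states(observations)
    return sum(belief[p] * TRANS[p]["target"] for p in STATES)
```

With these made-up numbers, a path whose pages go from low to medium to high relevance receives a higher priority than a path that starts high and then stays low, so a crawler ordering its frontier by `link_priority` would keep following links through weakly relevant pages when the trend suggests a target is near.
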

Keywords/Search Tags:

focused crawler, HMM, information collection, topic relevance

Related items

1	Research And Implementation Of Focused Crawler Based On Word2Vec
2	Technology Research, Based On Focused Crawling Of Web Information Collection
3	Research On Topic Focused Web Crawler And Related Technologies
4	The Design And Implementation Of The Complex Rules-Driven Focused Crawler System
5	Research On The Topic Crawler Algorithm Based On Vector Space Model
6	A Focused Crawler Based On Statistical Machine Translation And Topic Propagation
7	Research And Implementation Of Focused Crawler Based On Distributed Strategy
8	Design And Implementation Of Focused Crawler For Blogs
9	The Design And Implementation Of The Topic-focused Web Crawler System
10	Research And Implement Of Focused-crawler Relevance Algorithm In Search Engine