Research on Key Technologies of Topic Search Engines

Posted on: 2012-09-24; Degree: Master; Type: Thesis
Country: China; Candidate: F H Wang; Full Text: PDF
GTID: 2218330362458874; Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet, web resources have approached the zettabyte (ZB) scale. How to obtain personalized knowledge and information quickly and efficiently from such massive resources has become an urgent problem. The emergence of search engines is an effective solution to the problem of getting "lost" in information. However, existing general-purpose search engines lack customer-oriented functions for the demands of specific application domains. Recently, topic search engines have appeared, which focus on topic-specific information and timely updating for search within specific domains.

This thesis introduces the development of topic search engine technology both domestically and internationally. It then analyzes the principles of search engines and two topic search technologies: topic distillation and information extraction. The work focuses on how to filter out topic-irrelevant pages by URL and content during web crawling, and how to extract information from the filtered pages. Based on an in-depth study of the PageRank algorithm, combined with the vector space algorithm, a topic filtering method is proposed. Based on research into how to extract information from the filtered pages, a topic extraction module is proposed.

Building on the Nutch search engine architecture, artificial intelligence, information extraction, and data mining are integrated into a topic search engine. With adaptation and optimization of PageRank and the vector space algorithm, and by combining the HTML structure with related information extraction algorithms, a topic distillation and information extraction module is developed, using J2EE technology for secondary development. A topic search engine architecture is then built.

Finally, prospects for future work are discussed: the speed and accuracy of the topic filtering algorithm need to be further optimized.
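
As a rough illustration of content-based topic filtering in a vector space, the minimal Python sketch below scores a page against a topic description by cosine similarity over term-frequency vectors. It is not the thesis's implementation: the function names, the tokenization, and the threshold value are all assumptions made for illustration, and the PageRank-based and URL-based parts of the filtering are not modeled.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a bag-of-words term-frequency vector (hypothetical tokenization)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as unrelated
    return dot / (norm_a * norm_b)


def is_on_topic(page_text, topic_text, threshold=0.2):
    """Keep a page if its similarity to the topic description reaches the
    threshold. The default of 0.2 is an arbitrary illustrative value."""
    score = cosine_similarity(term_vector(page_text), term_vector(topic_text))
    return score >= threshold
```

In a crawler, a check like `is_on_topic(page_text, topic_text)` could be applied to each fetched page before it is passed on to information extraction; a real system would likely use weighted terms (for example TF-IDF) rather than raw counts.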
Keywords/Search Tags:topic distillation, information extraction, PageRank, Nutch, J2EE