
A Topic-Specific Search Engine For Science And Technology Projects Based On Nutch

Posted on: 2012-10-16
Degree: Master
Type: Thesis
Country: China
Candidate: P Hu
Full Text: PDF
GTID: 2248330395462363
Subject: Computer application technology
Abstract/Summary:
As the volume of information on the Internet grows, general-purpose search engines can no longer satisfy users' queries for information in specific domains. Search engines are becoming personalized, topical and intelligent, and topic-specific search engines in particular have become an active research area. At present, information about science and technology projects is obtained mainly through general search engine queries and expert experience, and there has been relatively little research on topic-specific search engines that supply users with relevant project information. To address the inaccuracy of general search engines when querying science and technology project information, this thesis designs and develops a search engine system focused on the topic of science and technology projects, so that users can accurately follow the development of such projects. The research work of this thesis is as follows: (1) After analyzing the key technologies of topic crawlers, we propose a crawler model for the topic of science and technology projects. The model selects the URLs of authoritative pages as initial seed pages, trains a project topic thesaurus from project template documents, uses an improved VSM cosine method to determine the topic relevance of web pages, and adopts a topic crawling strategy based on Shark Search and HITS. The model filters out pages with poor relevance to the topic, so that the spider crawls more topic-relevant pages and the quality of the crawl improves. (2) Because the PageRank algorithm is prone to "topic drift" and favors old web pages, we propose an improved algorithm, TD-PageRank (Time Decay PageRank), based on a time decay factor.
The algorithm represents page content in the vector space model, uses TF-IDF to compute keyword weights, and assigns keywords different weights according to the region of the page in which they appear, which reduces "topic drift". The added time decay factor speeds up the "precipitation" of old web pages, that is, their sinking in the ranking. Experiments show that, compared with PageRank, the improved algorithm ranks new topic-related pages higher, so that more topic-relevant pages appear at the front of the result set. (3) Building on these two results and the open source search engine Nutch, we design a prototype search engine for the topic of science and technology projects. The system improves the Nutch crawling module by adding a topic relevance judgment module and a topic thesaurus training module, integrates the IKAnalyzer Chinese word segmenter, combines the Nutch scoring mechanism with the TD-PageRank algorithm to improve the ordering of query results, and provides a user query interface. Experimental tests verify the feasibility of the prototype system. Because there is little domestic research on topic-specific search engines for science and technology projects, this thesis serves as an initial exploration of search in this field.
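As a rough illustration of steps (1) and (2), the following Python sketch shows a plain cosine relevance score between a page and a topic thesaurus, and a PageRank variant whose scores are damped by page age. The abstract does not give the thesis's formulas, so this is not the author's improved VSM method or TD-PageRank: the function names, the `half_life` parameter, the exponential form of the decay, and the choice to apply the decay after the ordinary PageRank iteration are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_relevance(page_terms, topic_weights):
    """Cosine similarity between a page's raw term counts and a topic vector.

    page_terms: list of tokens from the page (already segmented).
    topic_weights: dict term -> weight, standing in for the topic thesaurus.
    """
    tf = Counter(page_terms)
    dot = sum(tf[t] * w for t, w in topic_weights.items())
    page_norm = math.sqrt(sum(c * c for c in tf.values()))
    topic_norm = math.sqrt(sum(w * w for w in topic_weights.values()))
    if page_norm == 0 or topic_norm == 0:
        return 0.0
    return dot / (page_norm * topic_norm)


def time_decayed_pagerank(links, age_days, damping=0.85,
                          half_life=365.0, iters=50):
    """Ordinary PageRank followed by an exponential age penalty (assumed form).

    links: dict page -> list of pages it links to.
    age_days: dict page -> age of the page in days (missing means 0).
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            targets = outs if outs else pages  # dangling page: spread evenly
            share = damping * rank[p] / len(targets)
            for q in targets:
                new[q] += share
        rank = new
    # Assumed decay: a page's score halves every `half_life` days of age.
    decayed = {p: rank[p] * 0.5 ** (age_days.get(p, 0.0) / half_life)
               for p in pages}
    total = sum(decayed.values())
    return {p: s / total for p, s in decayed.items()}


if __name__ == "__main__":
    topic = {"science": 1.0, "project": 1.0}
    print(cosine_relevance(["science", "project", "funding"], topic))
    print(time_decayed_pagerank({"a": ["b"], "b": ["a"]}, {"a": 0, "b": 365}))
```

In a crawler, a page whose `cosine_relevance` falls below some chosen threshold would be dropped before its links are followed. In the ranking step, two pages with the same link structure are separated by age, with the newer one scoring higher.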
Keywords/Search Tags: Search engine, Topic crawler, Page ranking, Science and technology projects, Nutch