Studies On Domain-Based Information Collection Technologies

Posted on: 2012-01-16
Degree: Master
Type: Thesis
Country: China
Candidate: C S Lu
Full Text: PDF
GTID: 2178330335952729
Subject: Computer Science and Technology
Abstract/Summary:
With its rapid development, the Internet has become the largest carrier of information in the world, and the demand for network information is becoming more and more personalized. How to find the data that users care about in massive data sources has become a hot research topic in Web information retrieval. Information on the Internet is growing explosively: according to the 27th China Internet Development Statistics Report, published by CNNIC on January 19, 2011, the number of web pages reached 60 billion in 2010, an annual growth rate of 78.6 percent. Faced with this volume of information, general search engines encounter enormous challenges in information collection, storage, timeliness, and so on. General search engines are open to all users and try to satisfy every possible query by crawling as many web pages as possible, which greatly reduces the efficiency and accuracy of web crawling.
The theme crawler emerged to improve the efficiency and precision of crawling and querying. A theme crawler downloads only pages that fall within a given subject area, and avoids the massive amount of information irrelevant to the theme during crawling. As a result, theme search significantly improves query accuracy and crawl efficiency. The key issue in deciding whether the crawler stays on theme is the strategy it uses to select which network paths to crawl. Currently, the main theme crawling strategies fall into two kinds: search strategies based on Web link structure and search strategies based on content evaluation. The former determines the importance of web pages and the order in which links are visited by analyzing the link relationships between pages. Because it takes link structure into consideration, it is able to avoid crawling some irrelevant pages. However, it ignores page content and its relevance to the theme, which causes theme drift during the search. The latter originated in the text similarity evaluation of text retrieval and is able to evaluate the relevance between web content and the theme accurately. However, it ignores the structural information in links, and thus still has deficiencies in forecasting the value of linked pages.
Taking the advantages of both kinds of strategies, this thesis evaluates the correlation between web pages and the theme at multiple granularities, according to the circumstances. On the one hand, it forecasts relevance by analyzing the correlation of links. On the other hand, it analyzes the correlation between page content and the theme in cases where the relevance of a link cannot be confirmed. On the basis of the traditional information retrieval model, and combining the concept of ontology, this thesis puts forward a theme network crawling model based on a semantic tree. The model describes a theme with a semantic concept tree. Unlike the traditional keyword-based methods of describing a theme, it can describe simple semantic relations between concepts. On this basis, a method for calculating the correlation between HTML page content and the theme is given.
For URL correlation, the thesis analyzes not only the correlation between a link's anchor text and the theme, but also the correlation between links, by incorporating an improved PageRank algorithm. The page behind a link is downloaded for content analysis only when the link correlation does not reach a given threshold. This URL correlation calculation method not only greatly reduces unnecessary computation, but also makes full use of the importance information carried by the anchor text and the link. Finally, for pages whose relevance to the theme cannot be ensured, the content correlation is calculated to ultimately determine whether the page should be crawled.
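The two-stage decision described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no formulas, so everything concrete here is an assumption. The names `Concept`, `theme_correlation`, `link_correlation` and `should_crawl` are hypothetical, a simple weighted word overlap stands in for the thesis's correlation measure, `importance` is a placeholder for a precomputed score in [0, 1] from the improved PageRank step, and the mixing weight `alpha` and both thresholds are arbitrary. The sketch also assumes that a link reaching the threshold is accepted directly, and that only an uncertain link triggers a download for content analysis.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Tuple


@dataclass
class Concept:
    """One node of a (hypothetical) semantic concept tree describing the theme."""
    name: str
    weight: float = 1.0
    children: List["Concept"] = field(default_factory=list)


def walk(concept: Concept) -> Iterator[Tuple[str, float]]:
    """Yield (lower-cased concept name, weight) for every node in the tree."""
    yield concept.name.lower(), concept.weight
    for child in concept.children:
        yield from walk(child)


def theme_correlation(text: str, theme: Concept) -> float:
    """Assumed measure: share of total concept weight whose name occurs in the text."""
    words = set(text.lower().split())
    pairs = list(walk(theme))
    total = sum(weight for _, weight in pairs)
    matched = sum(weight for name, weight in pairs if name in words)
    return matched / total if total else 0.0


def link_correlation(anchor_text: str, importance: float,
                     theme: Concept, alpha: float = 0.7) -> float:
    """Mix anchor-text relevance with a link-importance score (assumed linear blend)."""
    return alpha * theme_correlation(anchor_text, theme) + (1 - alpha) * importance


def should_crawl(anchor_text: str, importance: float, theme: Concept,
                 fetch_page: Callable[[], str],
                 link_threshold: float = 0.5,
                 content_threshold: float = 0.3) -> bool:
    """Stage 1: cheap link check. Stage 2: download and check content only if unsure."""
    if link_correlation(anchor_text, importance, theme) >= link_threshold:
        return True  # link alone is convincing; no content analysis needed
    page_text = fetch_page()  # the download happens only on this uncertain path
    return theme_correlation(page_text, theme) >= content_threshold
```

Passing `fetch_page` as a callable keeps the download lazy, which mirrors the abstract's point that the link-level check avoids unnecessary computation for links that are already clearly relevant.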
Keywords/Search Tags: ontology, concept tree, theme network, anchor text, theme correlation