Studies On Domain-Based Information Collection Technologies

Posted on:2012-01-16

Degree:Master

Type:Thesis

Country:China

Candidate:C S Lu

Full Text:PDF

GTID:2178330335952729

Subject:Computer Science and Technology

Abstract/Summary:

With its rapid development, the Internet has become the largest carrier of information in the world, and the demand for online information is increasingly personalized. How to find the data a user cares about in massive data sources has become a hot research topic in Web information retrieval. Information on the Internet is growing explosively: according to the 27th China Internet Development Statistics Report, published by CNNIC on January 19, 2011, the number of web pages reached 60 billion in 2010, an annual growth rate of 78.6 percent. With this much information on the Web, general search engines face enormous challenges in information collection, storage, timeliness and so on. General search engines are open to all users and try to satisfy every possible query by crawling as many web pages as possible, which greatly reduces the efficiency and accuracy of crawling.

The theme crawler was created to improve the efficiency and precision of crawling and querying. A theme crawler downloads only the pages that fall within a given subject area, and so avoids the mass of information irrelevant to the theme during crawling. As a result, theme search significantly improves query accuracy and crawl efficiency. The key issue in determining whether a crawler actually stays on its theme is which strategy it uses to choose the paths it follows through the network. Current theme crawling strategies fall into two main kinds: search strategies based on Web link structure and search strategies based on content evaluation. The link-structure strategy determines the importance of web pages and the order in which links are visited by analyzing the link relationships between pages. Because it takes the link structure into account, it can indeed avoid crawling some irrelevant pages.
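As a rough illustration of how a link-structure strategy can order link visits, the sketch below runs a plain PageRank power iteration over a small link graph and sorts pages by score. The toy graph, the damping factor of 0.85 and the iteration count are assumptions made for this example; this is standard PageRank, not the improved variant developed in the thesis.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    graph maps each page to the list of pages it links to.
    """
    pages = set(graph)
    for targets in graph.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = graph.get(page, [])
            if targets:
                # Split this page's rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy link graph, for illustration only.
toy_graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_graph)
# Visit the highest-ranked pages first.
visit_order = sorted(ranks, key=ranks.get, reverse=True)
```

Note that the ordering depends only on who links to whom; nothing in the score looks at what a page says.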
However, the link-structure strategy ignores page content and its relevance to the theme, which causes the search to drift away from the theme. The content-evaluation strategy, which originated in the text-similarity evaluation used in text retrieval, can accurately evaluate the relevance of page content to the theme. However, it ignores the structural information carried by links, so it is still deficient at predicting the value of linked pages.

Combining the advantages of these two kinds of strategies, this thesis evaluates the correlation between web pages and the theme at multiple granularities, depending on the circumstances. On the one hand, it predicts relevance by analyzing the correlation of links. On the other hand, when the relevance of a link cannot be confirmed, it analyzes the correlation between the page content and the theme. Building on the traditional information retrieval model and drawing on the concept of ontology, the thesis puts forward a theme network crawling model based on a semantic tree. Unlike traditional keyword-based methods, the model describes a theme with a semantic concept tree, which can express simple semantic relations between concepts. On this basis, a method for calculating the correlation between HTML page content and the theme is given. For URL correlation, the thesis analyzes not only the correlation between a link's anchor text and the theme, but also the correlation between links, using an improved PageRank algorithm. The page a link points to is downloaded for content analysis only when the link correlation does not reach a given threshold. This URL correlation calculation not only greatly reduces unnecessary computation, but also makes full use of the importance information carried by anchor text and links. Finally, the content correlation is calculated for those pages whose relevance to the theme has not yet been established, and this ultimately determines whether the page should be crawled.
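The two-stage decision described in the abstract can be sketched roughly as follows. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the `Concept` class, the depth-decay weighting, the linear blend controlled by `alpha`, both threshold values, and the `fetch_page` callback are all hypothetical, and single-word concept terms are assumed to keep matching simple.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One node of a hypothetical theme concept tree."""
    term: str
    children: list = field(default_factory=list)


def concept_weights(root, decay=0.5):
    """Give each concept a weight that shrinks with depth (an assumption)."""
    weights = {}
    stack = [(root, 1.0)]
    while stack:
        node, weight = stack.pop()
        weights[node.term.lower()] = weight
        for child in node.children:
            stack.append((child, weight * decay))
    return weights


def text_correlation(text, weights):
    """Share of the tree's total weight whose terms occur in the text."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    matched = sum(w for term, w in weights.items() if term in words)
    return matched / sum(weights.values())


def link_correlation(anchor_text, importance, weights, alpha=0.7):
    """Blend anchor-text relevance with a link importance score in [0, 1]."""
    anchor_score = text_correlation(anchor_text, weights)
    return alpha * anchor_score + (1 - alpha) * importance


def is_relevant(anchor_text, importance, fetch_page, weights,
                link_threshold=0.5, content_threshold=0.3):
    """Two-stage decision: link evidence first, page content only if needed."""
    if link_correlation(anchor_text, importance, weights) >= link_threshold:
        return True  # link evidence suffices; page content is not scored
    # Link evidence is inconclusive: download the page and score its content.
    return text_correlation(fetch_page(), weights) >= content_threshold


# Hypothetical theme tree, for illustration only.
theme = Concept("crawler", [
    Concept("ontology"),
    Concept("search", [Concept("pagerank")]),
])
weights = concept_weights(theme)
```

In this sketch, `importance` stands in for whatever link-importance score the crawler maintains (for example a normalized PageRank value), and `fetch_page` is invoked only for links whose score falls below `link_threshold`, which is where the saving in computation would come from.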

Keywords/Search Tags:

ontology, concept tree, theme network, anchor text, theme correlation

Related items

1	A Study Of Themes On News Reports From Business Week-Based On The Theme Theory
2	Research Of Text Clustering Based On Semanteme And Domain Correlation
3	Study Of Scene Theme In Text-to-Scene Conversion
4	Research On Theme Reptiles Based On Educational Information Resource Ontology
5	Stock Research Engine Based On Theme Crawler
6	Mongolian Theme Book Publishing Research (2008—2018)
7	Research On The Key Technology Of Theme Crawler
8	Research And Implementation Of Multithreading Web Crawler Based On Theme
9	Research And Implement Of The Theme Crawler For Automotive Industry
10	Research On The Blog Community Detection And Its Theme Extraction Technology