
Research On Subject-Based Incremental Parallel Crawling

Posted on: 2014-01-30
Degree: Master
Type: Thesis
Country: China
Candidate: Q Y Huang
Full Text: PDF
GTID: 2248330398959206
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of Internet technology, the Web has become a large, widely distributed information source. To use this information effectively, Web pages are first crawled from the different information sources; useful information is then extracted, fused, and stored in a local database, where it can support applications such as market intelligence analysis. This process is called data integration. However, Web data is large-scale, heterogeneous, autonomous, and dynamic, which makes the automated integration of Web data a challenging research topic. Web crawling is one of the key problems of Web data integration and forms its foundation.

Because Web data grows so rapidly, a single general-purpose crawler cannot gather enough useful information in a reasonable amount of time. Subject-based incremental parallel crawling has therefore attracted wide attention from researchers at home and abroad: it can crawl pages on multiple related subjects simultaneously while ensuring page freshness and acceptable acquisition time. Aimed at the key problems of this topic, the main work and contributions of this thesis are summarized as follows:

1. To solve the problem of query word submission during deep web incremental crawling, a novel approach based on an incremental harvest model is proposed. A set of incremental records is constructed from the results returned by earlier full crawls of the deep web; from these records, an incremental harvest model is obtained through machine learning, which automatically selects query words so as to acquire as many incremental records as possible. A set covering model is introduced to represent the different versions of the Web database, which saves memory effectively. Moreover, deleted records, newly inserted records, and updated records are all treated as incremental records, rather than only the newly inserted records considered in previous work.

2. To solve the problem of predicting page change frequency during surface web incremental crawling, a novel approach based on an update frequency judgment model is proposed. A graph model is constructed from the change history of the Web pages so that pages with similar change frequencies become adjacent. The pages are then grouped by cloud cover theory, the average change frequency of each group is computed from its "heart pages", and the groups are ranked; this yields the update frequency judgment model. Extensive experiments demonstrate that this approach predicts the evolution frequency of Web pages effectively, providing a basis for deciding the re-crawl frequency.

3. To solve the problem of URL assignment during parallel crawling, a novel approach based on a multi-objective decision making method is proposed. The URL assignment model considers various factors synthetically, such as CPU load, subject relevance, and network bandwidth. First, the evaluation factors are quantified; then the weight of each factor, and the evaluation value of each crawler with respect to each factor, are computed by the analytic hierarchy process; finally, the weighted sum for each crawler is computed and the crawlers are ranked, so that the best crawler for a given URL is selected (a sketch of this scoring step follows below). Through this method, repeated downloads and load imbalance are avoided effectively, and the relevance of the crawled pages is raised as well.
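To make the weighted-summation step concrete, here is a minimal sketch in Python. The factor names, the pairwise comparison values, and the per-crawler evaluation scores are all illustrative assumptions, not figures from the thesis, and the AHP weights are approximated with the common row-geometric-mean method rather than a full eigenvector computation.

```python
# Hedged sketch of AHP-style multi-objective URL assignment.
# All numbers and names below are invented for illustration.
from math import prod

FACTORS = ["cpu", "relevance", "bandwidth"]  # assumed evaluation factors

# Pairwise comparison matrix: entry [i][j] says how much more important
# factor i is than factor j (hypothetical values).
COMPARISONS = [
    [1.0, 1 / 3, 2.0],   # cpu vs (cpu, relevance, bandwidth)
    [3.0, 1.0, 4.0],     # relevance weighted highest in this example
    [0.5, 1 / 4, 1.0],
]

def ahp_weights(matrix):
    """Approximate AHP priority weights via the row geometric mean."""
    n = len(matrix)
    means = [prod(row) ** (1.0 / n) for row in matrix]
    total = sum(means)
    return [m / total for m in means]

def best_crawler(evaluations, weights):
    """Rank crawlers by weighted sum of factor scores; return the best.

    `evaluations` maps a crawler id to its normalized scores (0..1),
    one per factor, in FACTORS order.
    """
    scored = {
        cid: sum(w * s for w, s in zip(weights, scores))
        for cid, scores in evaluations.items()
    }
    return max(scored, key=scored.get), scored

if __name__ == "__main__":
    weights = ahp_weights(COMPARISONS)
    # Hypothetical normalized evaluations of three crawlers for one URL.
    evaluations = {
        "crawler-1": [0.9, 0.2, 0.8],  # idle CPU, low subject relevance
        "crawler-2": [0.5, 0.9, 0.6],  # highly relevant to the URL's subject
        "crawler-3": [0.3, 0.4, 0.9],
    }
    chosen, scores = best_crawler(evaluations, weights)
    print("weights:", [round(w, 3) for w in weights])
    print("scores:", {k: round(v, 3) for k, v in scores.items()})
    print("assign URL to:", chosen)
```

With these made-up comparison values, relevance dominates the weighting, so the subject-relevant crawler wins the URL even though another crawler has more idle CPU; changing the comparison matrix shifts that balance.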
4. To solve the problem of communication among the crawlers, a novel parallel crawl architecture based on a second-level master is proposed. A second-level master is added to the parallel crawl architecture; it manages the multiple crawlers of the same subject and takes charge of their mutual communication. This not only reduces page redundancy and ensures page quality, but also greatly reduces network bandwidth cost.
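The following is a minimal sketch of how such a two-level master hierarchy might be organized. All class and method names here are hypothetical, since the abstract does not specify an implementation, and simple round-robin assignment stands in for the AHP-based URL assignment model of contribution 3.

```python
# Hedged sketch of the second-level-master architecture (contribution 4).
# Names and structure are illustrative assumptions, not the thesis design.

class Crawler:
    """A worker crawler; here it just records the URLs it is assigned."""
    def __init__(self, cid):
        self.cid = cid
        self.assigned = []

    def enqueue(self, url):
        self.assigned.append(url)

class SecondLevelMaster:
    """Manages all crawlers of one subject and mediates their communication.

    Keeping the seen-URL set at this level means crawlers of the same
    subject never exchange URLs directly, which is how the design reduces
    page redundancy and inter-crawler bandwidth."""
    def __init__(self, subject, crawlers):
        self.subject = subject
        self.crawlers = crawlers
        self.seen = set()
        self._next = 0

    def submit(self, url):
        if url in self.seen:   # duplicate: dropped locally, no network cost
            return
        self.seen.add(url)
        crawler = self.crawlers[self._next % len(self.crawlers)]
        self._next += 1        # round-robin stand-in for the AHP model above
        crawler.enqueue(url)

class TopLevelMaster:
    """Routes each discovered URL to the second-level master of its subject."""
    def __init__(self):
        self.subjects = {}

    def register(self, master):
        self.subjects[master.subject] = master

    def route(self, subject, url):
        self.subjects[subject].submit(url)

if __name__ == "__main__":
    sports = SecondLevelMaster("sports", [Crawler("s1"), Crawler("s2")])
    finance = SecondLevelMaster("finance", [Crawler("f1")])
    top = TopLevelMaster()
    top.register(sports)
    top.register(finance)

    for subj, url in [("sports", "http://a/1"), ("sports", "http://a/1"),
                      ("finance", "http://b/2"), ("sports", "http://a/3")]:
        top.route(subj, url)

    for c in sports.crawlers + finance.crawlers:
        print(c.cid, c.assigned)   # the duplicate sports URL is assigned once
```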
Keywords/Search Tags: Data Integration, Surface Web Incremental Crawl, Deep Web Incremental Crawl, Parallel Crawl, Focused Crawl