Research Of The Key Technology In Web Content Update Detection

Posted on: 2018-10-02
Degree: Master
Type: Thesis
Country: China
Candidate: B W Sheng
Full Text: PDF
GTID: 2348330542487336
Subject: Software engineering
Abstract/Summary:
The Internet is developing rapidly and its resources are growing ever richer, while users need useful, real-time information at all times. Meanwhile, as dynamic scripting technology matures, Internet resources not only grow exponentially but also change more frequently. It is therefore important to study how to crawl the complete content of dynamic websites built on scripting technology, and how to detect content updates in such large-scale data. This thesis addresses two research problems: how to completely crawl dynamic websites based on scripting technology, and how to effectively track and detect updates to the crawled web content.

For crawling dynamic website content, existing models crawl content through a browser instance or a scripting engine instance and use the results to construct a state diagram so that the site can be crawled completely. However, they do not parse script code in parallel, and they do not account for repeated states when updating the state diagram, which results in low crawling efficiency. This thesis therefore presents a dynamic crawling model based on redundant state elimination. The model builds an engine pool to analyze dynamic code in parallel, constructs the state diagram from the results, and uses a hash table as a state store to eliminate redundant states during this process. Comparative experiments show that the proposed model crawls dynamic web content correctly and completely while also improving crawling efficiency.

For web content update detection, existing models gather statistics on updates over a period of time from the crawled data, classify the crawled content by update frequency according to those statistics, and then track and detect updates at that fixed frequency. Their shortcomings are that they cannot classify newly crawled pages and cannot adapt the detection frequency as a website's update behavior changes. This thesis therefore presents an adaptive update detection model based on web content classification. The model first uses the SVM machine learning algorithm to build a classifier from training data, then uses the classifier to classify crawled pages in order to determine the initial update detection frequency, and finally uses exponential smoothing to adjust the detection frequency adaptively according to how the website actually updates. Comparative experiments show that the proposed model clearly improves web freshness.
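
The two mechanisms named in the abstract, a hash-table state store for eliminating redundant states and exponential smoothing for adapting the detection frequency, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the thesis: the names `StateStore`, `smooth`, and `next_interval` are hypothetical, a state is identified by hashing the raw DOM text, and the mapping from a smoothed change rate to a revisit interval is one simple choice among many. The engine pool and the SVM classifier that sets the initial frequency are not shown.

```python
import hashlib


class StateStore:
    """Hash-table-backed store of page states already seen (hypothetical name)."""

    def __init__(self):
        self._seen = {}  # maps DOM digest -> state id

    def add(self, dom_text):
        """Return (state_id, is_new); a repeated DOM maps to its existing id."""
        digest = hashlib.sha256(dom_text.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return self._seen[digest], False
        state_id = len(self._seen)
        self._seen[digest] = state_id
        return state_id, True


def smooth(previous_estimate, observation, alpha=0.5):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    return alpha * observation + (1 - alpha) * previous_estimate


def next_interval(change_rate, min_hours=1.0, max_hours=168.0):
    """Map a smoothed change rate in [0, 1] to a revisit interval in hours.

    Assumed mapping: the interval is inversely proportional to the rate,
    clamped to [min_hours, max_hours].
    """
    if change_rate <= 0:
        return max_hours
    return max(min_hours, min(max_hours, min_hours / change_rate))


if __name__ == "__main__":
    store = StateStore()
    print(store.add("<div>a</div>"))  # first time this state is seen
    print(store.add("<div>a</div>"))  # duplicate state, not expanded again

    rate = 0.5  # assumed initial estimate, e.g. from a page's class
    for changed in (1, 1, 0, 0, 0):  # 1 = page had changed on this visit
        rate = smooth(rate, changed)
        print(round(rate, 4), round(next_interval(rate), 2))
```

In this sketch a crawler would expand a state only when `add` reports it as new, and the detector would revisit a page sooner after visits that found changes and later after visits that found none. A practical crawler would likely normalize the DOM before hashing so that trivial differences do not produce distinct states.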
Keywords/Search Tags: scripting technology, state redundancy elimination, updating detection, ES algorithm, SVM algorithm