Research On Web-Based Extraction Technology Of Hyperlink And Web Page Content

Posted on: 2007-09-16
Degree: Master
Type: Thesis
Country: China
Candidate: Y D Pu
GTID: 2178360185486022
Subject: Computer Science and Technology
Abstract/Summary:
Web hyperlink extraction aims to extract the href values of tags in a web page's source code according to users' needs. Current techniques extract every hyperlink on a page without classifying them, so the results contain useless hyperlinks and do not satisfy users. In this paper, web hyperlinks are divided into two kinds, efficient hyperlinks and noise hyperlinks, and noise is further divided into inside-hyperlink noise and outside-hyperlink noise with respect to the topic of the news. On this basis, a rule-based machine learning technique is used to obtain the link pattern of a web page. Online algorithms and amortized analysis are used to analyze the web page, sample selection rules are established for choosing hyperlinks, and the learned link pattern is expressed as a regular expression. Finally, the pattern is matched against all hyperlinks on the page to obtain the efficient hyperlinks. The advantage of this method is that run time is greatly shortened when extracting hyperlinks from frequently updated pages, and repeated analysis of similar pages is avoided. The algorithm is also suited to periodic hyperlink extraction on the same website, making extraction fast and efficient.
Web page content extraction aims to extract the text in a web page that is structurally complete and related to the page's subject. Traditional content extraction methods represent the web page as a tree data structure in main memory. Unfortunately, the space and time complexity of building and searching the tree is relatively high. Because nested tags are so common, and a node's ancestors and descendants must be traversed frequently to determine the relations between paragraphs, efficiency is very low.
A novel web page content extraction method based on linear paragraph clustering is proposed in this paper. The method first restructures the web page's source code, which removes web noise in a first pass. Skeleton paragraphs of the page content are then obtained by filtering division and paragraph clustering on the original paragraph sets. Finally the web page content comes into being after...
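The hyperlink filtering idea in the first part of the abstract (a link pattern expressed as a regular expression and matched against every hyperlink on a page) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the pattern `NEWS_PATTERN` is hand-written here rather than learned from samples, and the names `HrefCollector`, `extract_hrefs`, `filter_links`, and `SAMPLE_PAGE` are hypothetical, chosen only for illustration.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all hyperlinks on the page, with no classification."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


def filter_links(hrefs, link_pattern):
    """Keep only the hrefs that fully match the link pattern."""
    compiled = re.compile(link_pattern)
    return [h for h in hrefs if compiled.fullmatch(h)]


# Assumption: a hand-written pattern for news article URLs.
# In the thesis the pattern is learned; here it is fixed for illustration.
NEWS_PATTERN = r"/news/\d{4}-\d{2}-\d{2}/\d+\.html"

# Assumption: a made-up page fragment mixing article links and noise links.
SAMPLE_PAGE = """
<a href="/news/2007-09-16/1234.html">Story</a>
<a href="/about.html">About us</a>
<a href="http://ads.example.com/banner">Ad</a>
<a href="/news/2007-09-15/1200.html">Older story</a>
"""

if __name__ == "__main__":
    print(filter_links(extract_hrefs(SAMPLE_PAGE), NEWS_PATTERN))
```

Once a pattern is available, later visits to the same site only need the cheap matching step in `filter_links`, which is consistent with the run-time benefit the abstract claims for frequently updated pages.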
Keywords/Search Tags:web hyperlink, web page content, extraction, linearization, machine learning