Research On Web-Based Extraction Technology Of Hyperlink And Web Page Content

Posted on:2007-09-16

Degree:Master

Type:Thesis

Country:China

Candidate:Y D Pu

Full Text:PDF

GTID:2178360185486022

Subject:Computer Science and Technology

Abstract/Summary:

Web hyperlink extraction aims to extract the href values of tags in a web page's source code according to users' needs. Current techniques extract every hyperlink on a page without classifying them, so the results contain useless hyperlinks and do not satisfy users. In this paper, web hyperlinks are divided into two kinds, effective hyperlinks and noise hyperlinks, and noise is further divided into inside-hyperlink noise and outside-hyperlink noise with respect to the topic of the news. On this basis, a rule-based machine learning technique is used to obtain the link pattern of a web page. Online algorithms and amortized analysis are used to analyze the web page, and sample selection rules are established to choose hyperlinks. The learned link pattern is then expressed as a regular expression, and finally every hyperlink is tested against the pattern to obtain the effective hyperlinks. The advantage of this method is that run time is greatly shortened when extracting hyperlinks from frequently updated pages, and that it avoids repeated analysis of similar pages from the same site. Moreover, the algorithm is suited to periodic hyperlink extraction on the same website, achieving swift and efficient hyperlink extraction.
Web page content extraction aims to extract the text in a web page that is structurally complete and relevant to the page's subject. Traditional content extraction methods represent the web page as a tree data structure in main memory. Unfortunately, building and searching the tree has relatively high space and time complexity. Because nested tags are so common, and ancestors and descendants must be traversed frequently to determine the relations between paragraphs, efficiency is very low.
A novel web page content extraction method based on linear paragraph clustering is proposed in this paper. The method first restructures the web source code, which removes a first round of web noise. Skeleton paragraphs of the page content are then obtained by filtering, division, and paragraph clustering over the original paragraph sets. Finally the web page content comes into being after...
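As a rough illustration of the hyperlink-filtering idea in the abstract (a learned link pattern expressed as a regular expression and matched against every hyperlink), the following Python sketch may help. It is not the thesis's algorithm: the names HrefCollector, learn_pattern, and effective_links are hypothetical, and the "learning" step is a deliberately simple stand-in that only generalizes digit runs in a few sample URLs.

```python
import re
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def learn_pattern(samples):
    """Build a regex from sample links (assumed, simplified rule).

    Each run of digits in a sample is generalized to \\d+; everything
    else is kept literally. The thesis's rule-based learner is not
    described in the abstract, so this is only a placeholder.
    """
    alternatives = set()
    for url in samples:
        # re.split with a capture group puts digit runs at odd indices.
        parts = re.split(r"(\d+)", url)
        alternatives.add(
            "".join(r"\d+" if i % 2 else re.escape(p) for i, p in enumerate(parts))
        )
    return re.compile("|".join(f"(?:{alt})" for alt in sorted(alternatives)))


def effective_links(html, pattern):
    """Return the hrefs that fully match the learned link pattern."""
    collector = HrefCollector()
    collector.feed(html)
    return [h for h in collector.hrefs if pattern.fullmatch(h)]


if __name__ == "__main__":
    page = (
        '<a href="/news/2007/101.html">Story</a>'
        '<a href="/ads/banner.html">Ad</a>'
        '<a href="/news/2006/7.html">Older story</a>'
    )
    pattern = learn_pattern(["/news/2007/55.html"])
    print(effective_links(page, pattern))
```

Because the pattern is learned once and then reused, later visits to the same site only need the cheap matching step, which is in the spirit of the periodic extraction the abstract describes.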

Keywords/Search Tags:

web hyperlink, web page content, extraction, linearization, machine learning

Related items

1	WEB Extraction And Analysis Based On SVM And LDA
2	Research On Content Extraction In HTML Web Pages Based Multi-Features
3	Research On WEB Page Structure And Data Extraction Technology
4	Machine Learning Based Hidden Hyperlink Detection For Web Pages
5	Research On Web Page Classification And Information Collection
6	Research On Web Hyperlink Analysis And Its Application In Search Engine
7	Semi-supervised Web-page Classification And Its Application In Directory-style Search Engines
8	Chinese Web Page Classification Based On Web Page Features
9	The Research And Implementation On Content Extraction In Web Pages Based Page Segmentation
10	The Designation And Implementation Of Business Insight System Base On Web Content