Research On Extracting Information By Text Density And Structure Of Webpage

Posted on:2016-06-16

Degree:Master

Type:Thesis

Country:China

Candidate:Y Xiao

Full Text:PDF

GTID:2308330473956504

Subject:Software engineering

Abstract/Summary:

In recent years, with the widespread development of the Internet, a growing amount of information has emerged. Mobile devices, with their high portability, immediacy, and other features, are liked by more and more people of all ages. The mobile phone is no longer just a simple communication tool: people get the latest news and information through mobile phones, as well as through tablet computers and other mobile devices. The Internet is the broadest and richest source of information. As the amount of information gradually expands, people follow the latest developments through news pages, but a site displays not only valuable information; it also contains advertisements, distracting content, unrelated images, and so on. According to a study by Tomkins et al., noise accounts for approximately 40% to 50% of a website. Because the screen of a mobile device is smaller than that of a PC, scrolling through such pages takes a long time. How to extract the valuable part of a page designed for personal computers has become a serious problem.

On the one hand, text-density-based web information extraction methods are commonly used in web data mining to extract information; however, traditional density-based methods cannot effectively handle pages that contain a small amount of text and a lot of noise. On the other hand, existing page segmentation techniques fall into two groups. One segments the page by HTML tags using heuristic rules; this approach suits only a small set of tags, and relying on many heuristics for particular tags reduces its general applicability. The other uses visual heuristics; however, this blocking method needs to download and parse style sheets, which significantly reduces efficiency.

To solve these problems, we propose a method that combines text-density extraction with page segmentation. First, the Webpage Block (N) algorithm roughly segments the page. Then, the CalculTextDens (N) algorithm calculates the density of each block-level element from its text character length and tag character length. Finally, noise is eliminated according to a set threshold. The advantage of this algorithm is that even if a page contains a lot of noise, the main content can be extracted accurately, because on a typical web page related information sits together in one complete structure rather than being scattered across the page. Once the density of each segmented block is calculated, the density values can be used to determine the nature of the block-level elements.

The innovations of this dissertation are as follows:

(1) Based on an analysis of web page structure, this thesis presents the Webpage Block (N) blocking algorithm, which determines the properties of page tags and splits the page into block-level elements as its smallest units.

(2) This thesis proposes a method for extracting information by text density and webpage structure. Building on the blocking of the page, the method calculates the density of each block-level element and then extracts information according to a set threshold. The page extraction algorithm, with a precision of about 90%, can extract almost complete information.

Experiments show that the proposed method achieves a precision of 0.903 and a recall of 0.918 in page content extraction.
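
The three steps described above (block the page, compute a density per block-level element, filter by threshold) can be illustrated with a minimal Python sketch. It is not the thesis's Webpage Block (N) or CalculTextDens (N) algorithm, whose details the abstract does not give. The names `BlockCollector`, `text_density`, and `extract_content`, the set `BLOCK_TAGS`, the density formula (text characters divided by text plus tag characters), and the threshold of 0.5 are all assumptions chosen for illustration.

```python
from html.parser import HTMLParser

# Assumed set of block-level tags; the thesis may use a different set.
BLOCK_TAGS = {"div", "p", "td", "li", "ul", "section", "article"}
SKIP_TAGS = {"script", "style"}  # contents never count as text


class BlockCollector(HTMLParser):
    """Splits a page into block-level elements and counts, for each block,
    the characters of text and of markup that belong to it directly."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open blocks, innermost last
        self.blocks = []     # closed blocks, in closing order
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        if tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "parts": [], "tag_len": 0})
        if self.stack:
            # Charge the raw start tag to the innermost open block.
            self.stack[-1]["tag_len"] += len(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.stack:
            self.stack[-1]["tag_len"] += len(tag) + 3  # "</" + tag + ">"
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        if tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            self.blocks.append(self.stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and not self.skip_depth:
            self.stack[-1]["parts"].append(text)


def text_density(block):
    """Assumed density: text characters / (text + tag characters)."""
    text_len = sum(len(part) for part in block["parts"])
    total = text_len + block["tag_len"]
    return text_len / total if total else 0.0


def extract_content(html, threshold=0.5):
    """Returns the text of blocks whose density reaches the threshold."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    return [
        " ".join(block["parts"])
        for block in parser.blocks
        if block["parts"] and text_density(block) >= threshold
    ]


SAMPLE = (
    '<div id="nav"><a href="/a">Home</a><a href="/b">News</a>'
    '<a href="/c">Ads</a></div>\n'
    '<div id="main"><p>Text density separates article text from '
    'navigation and advertising noise.</p></div>'
)

if __name__ == "__main__":
    for paragraph in extract_content(SAMPLE):
        print(paragraph)
```

In this sketch the navigation block consists mostly of link markup, so its density falls well below the threshold, while the paragraph is almost entirely text and is kept. A real page would need a tuned threshold and more careful handling of malformed or deeply nested markup.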

Keywords/Search Tags:

Information Extraction, Page segmentation, Text Density

Related items

1	Research On Multi-page Special Web Page Text Extraction And Merging Technology
2	The Research Of Web Pages Information Extraction Based On Page Structure Analysis Technique
3	Research On Mining Structure Of WEB Page For Information Extraction
4	Research On Web Article Automatic Extraction Method Based On Page Segmentation
5	Research On Web Page Classification And Information Collection
6	Research On Specialty Knowledge Retrieval Method Based On Web Information Extraction
7	Extraction Algorithm, Based On Visual Features Of The Web Page
8	Research On Internet Public Opinion Information Extraction And Classification
9	A Study On Methods Of Web Page Topical Information Extraction
10	Research On Web Page Content Extraction Based On Hadoop