| In recent years, with the widespread development of the Internet, a growing amount of information has emerged. Mobile devices are liked by more and more people of all ages for their high portability, instant access, and other features. The mobile phone is no longer just a simple communication tool; people get the latest news and information through mobile phones, as well as through tablet computers and other mobile devices. The Internet is the broadest and richest source of information. As the amount of information gradually expands, people obtain the latest information from dynamically updated news pages, but these pages display not only valuable information but also advertisements, distracting content, unrelated images, and so on. According to a study by Tomkins et al., noise accounts for approximately 40% to 50% of the content on websites, and because the screens of mobile devices are smaller than those of PCs, users spend a long time operating the scroll bar. How to extract the valuable part of the information from pages designed for personal computers has become a serious problem. On the one hand, text-density-based web information extraction methods are commonly used to extract information by means of web data mining; however, traditional density-based methods cannot effectively handle pages that contain a small amount of text and a lot of noise. On the other hand, existing page segmentation techniques fall into two kinds. One segments the web page by HTML tags using heuristic rules; this approach suits only a small number of tags, and relying on a large number of heuristics tailored to particular tags reduces its general applicability. The other uses visual heuristics; however, this blocking method needs to download and parse the style sheets, which significantly affects efficiency. To solve these problems, we propose a method that combines text density extraction with page segmentation. First, the Webpage Block (N) algorithm roughly segments the web page.
Then, the CalculTextDens (N) algorithm calculates the density of each block-level element from its text character length and tag character length. Finally, noise is eliminated according to a set threshold. The advantage of this algorithm is that even if a page contains a lot of noise, the main content can still be extracted accurately, because in a typical web page the main content forms a complete structure rather than being scattered across the page; once density is calculated on the segmented page, the density values can be used to determine the nature of each block-level element. The innovative points of the dissertation are as follows: (1) Through analysis of web page structure, this paper presents the Webpage Block (N) blocking algorithm, which determines the properties of the page's tags and splits the page into the smallest block-level elements. (2) This paper proposes a method for extracting information based on text density and web page structure. Working on the blocks of the page, the method calculates the density of each block-level element and then extracts information according to the set threshold; the resulting page extraction algorithm has a precision of about 90% and extracts almost complete information. Experiments show that the proposed method achieves a precision of 0.903 and a recall of 0.918 in page content extraction. |
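
The density-and-threshold step described above can be illustrated with a minimal Python sketch. It is not the dissertation's CalculTextDens (N) implementation: the abstract does not give the exact formula, so the sketch assumes density = text length / (text length + tag length), assumes the page has already been split into block-level HTML fragments, and uses a hypothetical threshold of 0.5. The names `text_density` and `keep_content_blocks` are illustrative only.

```python
from html.parser import HTMLParser


class _LengthCounter(HTMLParser):
    """Accumulates visible-text length and tag-markup length for one fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_len = 0
        self.tag_len = 0

    def handle_starttag(self, tag, attrs):
        # Count the raw start tag, attributes included.
        self.tag_len += len(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are counted once.
        self.tag_len += len(self.get_starttag_text())

    def handle_endtag(self, tag):
        # "</" + tag name + ">"
        self.tag_len += len(tag) + 3

    def handle_data(self, data):
        # Ignore leading and trailing whitespace around text runs.
        self.text_len += len(data.strip())


def text_density(fragment: str) -> float:
    """Assumed formula: text chars / (text chars + tag chars)."""
    counter = _LengthCounter()
    counter.feed(fragment)
    counter.close()
    total = counter.text_len + counter.tag_len
    return counter.text_len / total if total else 0.0


def keep_content_blocks(blocks, threshold=0.5):
    """Keep blocks whose density reaches the (hypothetical) threshold."""
    return [block for block in blocks if text_density(block) >= threshold]


if __name__ == "__main__":
    blocks = [
        "<p>Text density separates long article paragraphs from menus.</p>",
        '<div><a href="/a">Ad</a><a href="/b">Buy</a></div>',
    ]
    for block in keep_content_blocks(blocks):
        print(block)
```

In this sketch a link-heavy block scores low because most of its characters are markup, while a paragraph of running text scores high; in practice the threshold would need to be tuned on real pages.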