
Information Extraction Based On Table Area Locating For E-Commerce Websites

Posted on: 2010-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: R Dong
Full Text: PDF
GTID: 2178330338982293
Subject: Software engineering
Abstract/Summary:
With the rapid development of e-commerce, online shopping has become increasingly popular among consumers. However, finding the goods they want most among vast amounts of merchandise is difficult. A direct application of web information extraction is to help people find products on the web quickly and accurately. At present, there is a lack of information extraction technology designed specifically for e-commerce websites, and general-purpose web information extraction techniques have difficulty finding target product information quickly and accurately, so further research on extracting merchandise information from e-commerce websites is necessary.

By analyzing the characteristics of web tables in the HTML pages of shopping websites, this paper proposes a new page model in which a shopping-site page is divided into three regions: the Core Area, the Preparative Core Area, and the Non-Core Area. Based on this page model, the paper proposes the notion of area location and decomposes merchandise information extraction into three key processes: page pre-processing, area location, and area structural analysis.

The page pre-processing module is mainly responsible for HTML tag repair and noise removal. After analyzing the page structure, it parses the HTML document to construct a DOM tree and removes unwanted elements from that tree, mainly advertisement images and script code, thereby minimizing the impact of noise on information extraction.

The area location module is mainly responsible for locating, within the DOM tree, the product information area that interests the user. During area location, product attribute keywords are used to find matching nodes; the Preparative Core Area is then located bottom-up from those nodes, and the Core Area is located according to the expected value of the Preparative Core Area and the proportion of node types in the region.

The area structural analysis module is mainly responsible for analyzing the structure of the Core Area, locating the "attribute - value" information of the goods, and extracting the product's attribute information.

After extraction is complete, the system adds any missing keywords to the collection of product attributes, which improves the keyword collection and, in turn, extraction efficiency and accuracy. Reusing the Core Area path across pages of the same e-commerce website can further increase processing efficiency.
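
The three processes described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the names Node, DomBuilder, locate_core_area, extract_pairs and extract_product_info are hypothetical, the noise-tag list is an assumption, and the Core Area is chosen by a simple keyword-hit count per enclosing table, which stands in for the expected-value and node-type-proportion criteria that the abstract mentions but does not define.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "hr", "input", "meta", "link"}  # dropped (covers ad images)
NOISE = {"script", "style", "iframe"}                # assumed noise elements

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], ""

class DomBuilder(HTMLParser):
    """Pre-processing: build a simplified DOM tree without noise subtrees."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
        self.skip = 0  # nesting depth inside a noise element

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if tag in NOISE or self.skip:
            self.skip += 1
            return
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        self.cur = node

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip:
            self.skip -= 1
            return
        # Close up to the nearest open ancestor with this tag, if any.
        node = self.cur
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.cur = node.parent

    def handle_data(self, data):
        if not self.skip and data.strip():
            self.cur.text = (self.cur.text + " " + data.strip()).strip()

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def text_of(node):
    parts = [node.text] + [text_of(c) for c in node.children]
    return " ".join(p for p in parts if p).strip()

def locate_core_area(root, keywords):
    """Area location: match keywords, climb bottom-up to the enclosing table,
    and pick the table with the most keyword hits (simplified criterion)."""
    hits = {}
    for node in walk(root):
        if any(k in node.text.lower() for k in keywords):
            anc = node
            while anc is not None and anc.tag != "table":
                anc = anc.parent
            if anc is not None:
                hits[anc] = hits.get(anc, 0) + 1
    return max(hits, key=hits.get) if hits else None

def extract_pairs(table):
    """Structural analysis: read each row as an attribute/value pair."""
    pairs = {}
    for row in walk(table):
        if row.tag != "tr":
            continue
        cells = [text_of(c) for c in row.children if c.tag in ("td", "th")]
        if len(cells) >= 2 and cells[0]:
            pairs[cells[0].rstrip(":").strip()] = cells[1]
    return pairs

def extract_product_info(html, keywords):
    builder = DomBuilder()
    builder.feed(html)
    core = locate_core_area(builder.root, keywords)
    pairs = extract_pairs(core) if core else {}
    keywords.update(a.lower() for a in pairs)  # add missing attribute keywords
    return pairs

SAMPLE = """
<html><body>
<script>var ad = "<td>price</td>";</script>
<table><tr><td>Home</td><td>Cart</td></tr></table>
<table>
  <tr><td>Brand:</td><td>Acme</td></tr>
  <tr><td>Price:</td><td>19.90</td></tr>
  <tr><td>Weight:</td><td>2 kg</td></tr>
</table>
</body></html>
"""
known = {"brand", "price"}
print(extract_product_info(SAMPLE, known))
# {'Brand': 'Acme', 'Price': '19.90', 'Weight': '2 kg'}
print(sorted(known))  # ['brand', 'price', 'weight']
```

In this made-up sample the navigation table contains no attribute keywords, so the product table is selected, and the previously unknown attribute "weight" is added to the keyword set for later pages.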
Keywords/Search Tags: Information extraction, Web Tables, DOM tree, Area location