
Applied Research on Web Information Extraction Technology in an Enterprise Competitive Intelligence Platform

Posted on: 2011-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: L Miao
GTID: 2208360308466238
Subject: Computer software and theory
Abstract/Summary:
Internet technologies have developed rapidly since the 1990s and have become an important part of the economy, politics, science, technology, education, culture, and entertainment. As a carrier of global information, the Internet has grown explosively. As society becomes increasingly information-driven, it is more and more necessary to extract useful information accurately from all kinds of web pages, so that people can access information quickly and avoid getting lost in an ocean of information. Web Information Extraction has emerged to meet this need.

Information Extraction creates a structured representation of information selected from text. The core tool of Web Information Extraction is the wrapper, which some studies also call a Web information extraction program. Generating a wrapper manually is not only demanding but also hard to adapt to complex changes in web pages. How to improve the automation of wrapper generation has therefore become a significant research topic in Web Information Extraction.

This thesis analyzes Web Information Extraction and the content of HTML, and proposes and implements a top-down method for extracting information from web pages, based on an HTML parser library on the Java platform. The approach does not target a specific page; that is, it does not depend on a particular web page template, but on numerical features of each node, such as its text length and text link rate. The method builds the HTML nodes into a tree and traverses it top-down, searching step by step from the root toward the leaves according to the confirmed linking nodes, statistics of the data, and HTML structural features, in order to locate the Best Content Sub Tree, from which clear text data is extracted.
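The top-down search described above can be illustrated with a minimal sketch. It is not the thesis implementation: it uses Python's standard `html.parser` instead of the Java HTML parser library, and the names `Node`, `TreeBuilder`, `best_content_subtree`, and the thresholds `min_share` and `max_link_ratio` are assumptions chosen for illustration. The abstract does not give the actual statistics or cut-off values, and the "confirmed linking nodes" step is not modeled here.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
SKIP = {"script", "style"}  # their text is never page content


class Node:
    def __init__(self, tag, parent=None, attrs=()):
        self.tag, self.parent, self.attrs = tag, parent, dict(attrs)
        self.children = []  # child Nodes and text strings, in document order
        self.text_len = 0   # characters of text in the whole subtree
        self.link_len = 0   # characters of that text inside <a> elements


class TreeBuilder(HTMLParser):
    """Builds a simple node tree; tolerant of unclosed or stray tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur, attrs)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur
        while n is not None and n.tag != tag:
            n = n.parent
        if n is not None and n.parent is not None:
            self.cur = n.parent  # close the nearest matching open element

    def handle_data(self, data):
        text = data.strip()
        if text and self.cur.tag not in SKIP:
            self.cur.children.append(text)


def measure(node, in_link=False):
    """Fill in text length and link-text length for every subtree."""
    in_link = in_link or node.tag == "a"
    node.text_len = node.link_len = 0
    for child in node.children:
        if isinstance(child, str):
            node.text_len += len(child)
            node.link_len += len(child) if in_link else 0
        else:
            measure(child, in_link)
            node.text_len += child.text_len
            node.link_len += child.link_len


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    measure(builder.root)
    return builder.root


def best_content_subtree(root, min_share=0.6, max_link_ratio=0.5):
    """Descend while one low-link child holds most of the non-link text."""
    node = root
    while True:
        candidates = [c for c in node.children
                      if isinstance(c, Node) and c.text_len > 0
                      and c.link_len / c.text_len <= max_link_ratio]
        plain = node.text_len - node.link_len
        if not candidates or plain == 0:
            return node
        best = max(candidates, key=lambda c: c.text_len - c.link_len)
        if best.text_len - best.link_len < min_share * plain:
            return node  # text is spread over several children: stop here
        node = best


def extract_text(node):
    parts = [c if isinstance(c, str) else extract_text(c) for c in node.children]
    return " ".join(p for p in parts if p)
```

Calling `extract_text(best_content_subtree(parse(page_html)))` returns the text of the selected subtree. Because the descent relies only on per-node statistics, it needs no page-specific template, which is the property the abstract emphasizes.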
The test results indicate that this method achieves higher accuracy than other methods.

Based on this extraction method and the specific application needs of an Enterprise Competitive Intelligence platform, this thesis also designs and implements a general news information extractor. The extractor not only extracts the body text, headline, publication date, and source of a news item while retaining its attachments and hyperlinks, but also handles paged news by combining the body of a single news item that is spread across multiple pages. The extractor has changed the old plug-in management services, extended the functionality, and greatly reduced the maintenance workload.

On this basis, combined with a news information extraction method for specific websites, this thesis proposes a rule-learning mode: when content is repeatedly extracted from a particular site, extraction rules described by XML Schema are generated automatically. When extraction from the same site is performed again, information is extracted according to these rules, which avoids repeating the statistical computation and dramatically speeds up news information extraction.
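The rule-learning mode can be pictured as a per-site cache of extraction rules, sketched below under stated assumptions. The abstract says only that rules are described by XML Schema; the `RuleStore` class, the `rule` element, its `site` and `contentPath` attributes, and the fallback to relearning when a stored rule no longer matches are all hypothetical. The `locate` and `apply_rule` callables stand in for the statistical search and the rule-based extraction, neither of which is specified in the abstract.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


class RuleStore:
    """Remembers, per site, where the content was found last time."""

    def __init__(self):
        self.rules = {}  # host name -> path of the content node

    def extract(self, url, page, locate, apply_rule):
        host = urlparse(url).netloc
        path = self.rules.get(host)
        if path is not None:
            result = apply_rule(page, path)  # fast path: reuse the learned rule
            if result is not None:
                return result
        # No rule yet, or the rule no longer matches: run the full search
        # and store the path it returns as the new rule for this site.
        path, result = locate(page)
        self.rules[host] = path
        return result

    def to_xml(self):
        root = ET.Element("rules")
        for host, path in sorted(self.rules.items()):
            ET.SubElement(root, "rule", site=host, contentPath=path)
        return ET.tostring(root, encoding="unicode")

    @classmethod
    def from_xml(cls, text):
        store = cls()
        for rule in ET.fromstring(text).iter("rule"):
            store.rules[rule.get("site")] = rule.get("contentPath")
        return store
```

With such a store, the expensive statistical search runs once per site and later pages from the same host go through `apply_rule` only, which is the saving the abstract describes.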
Keywords/Search Tags:Web Information Extraction, HTML, Wrapper, Best Content Sub Tree, Competitive Intelligence