Applied Research On Web Information Extraction Technology In An Enterprise Competitive Intelligence Platform

Posted on:2011-04-29

Degree:Master

Type:Thesis

Country:China

Candidate:L Miao

GTID:2208360308466238

Subject:Computer software and theory

Abstract/Summary:

Internet technologies have developed rapidly since the 1990s and have become an important part of the economy, politics, science, technology, education, culture and entertainment. As a carrier of global information, the Internet has grown explosively. As informationization advances, it is increasingly necessary to extract useful information accurately from all kinds of web pages, so that people can access information quickly and avoid getting lost in the ocean of information. Web Information Extraction emerged to meet this need.

Information Extraction creates a structured representation of information selected from the content of a text. The core tool of Web information extraction is the wrapper, which some studies also refer to as a Web information extraction program. Generating a wrapper manually is demanding, and a hand-written wrapper is difficult to adapt when pages change in complex ways. How to improve the automation of wrapper generation has therefore become a significant research topic in Web Information Extraction.

This paper analyzes Web Information Extraction and the content of HTML, and proposes and implements a top-down method of extracting information from web pages, based on the HTML parser library on the Java platform. The approach is not targeted at a specific page; that is, it does not depend on a particular web page template, but on numerical information such as the characteristics of each node, its text length and its text link rate. The method builds the HTML nodes into a tree structure and traverses it top-down, searching step by step from the root towards the leaves according to the nodes identified as link nodes, the statistics of the data and the structural features of the HTML, so as to locate the Best Content Sub Tree, from which clean text data is extracted. The test results indicate that this method has higher accuracy than other methods.

Based on this extraction method and the specific application of an Enterprise Competitive Intelligence platform, this paper designs and implements a general news information extractor. The extractor can extract the body text, headline, publication date and source of a news item while retaining the attachments and hyperlinks in the news, and it can also process paged news, combining the body of the same news item when it is distributed over multiple pages. The extractor changes the old plug-in management service, extends its functionality, and greatly reduces the maintenance workload.

On this basis, and in combination with a news information extraction method for specific websites, this paper proposes a rule-learning mode: when content is extracted repeatedly from a particular website, extraction rules described in XML Schema are generated automatically. When extraction is performed on the same site again, information is extracted according to these rules, which avoids repeating the statistics and dramatically accelerates news information extraction.
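The top-down search for the Best Content Sub Tree described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the thesis uses an HTML parser library on Java, whereas this sketch uses Python's standard html.parser, and the names Node, parse_html, best_content_subtree, text_of and the min_share threshold are all hypothetical. It also assumes that "text link rate" means the share of a node's text that sits inside <a> elements, and it replaces the thesis's combined criteria with one simple rule: descend into the child that holds most of the non-link text, and stop when no single child dominates.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}  # text inside these is never page content


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # Node or str, in order
    text_len: int = 0  # characters of text in this subtree
    link_len: int = 0  # characters of that text that sit inside <a>

    def non_link_len(self) -> int:
        return self.text_len - self.link_len

    def link_rate(self) -> float:
        return self.link_len / self.text_len if self.text_len else 0.0


class _TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]  # path from the root to the open element

    def handle_starttag(self, tag, attrs):
        node = Node(tag, dict(attrs))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; never pop the root.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text or any(n.tag in SKIP_TAGS for n in self.stack):
            return
        in_link = any(n.tag == "a" for n in self.stack)
        self.stack[-1].children.append(text)
        for n in self.stack:  # update statistics on every ancestor
            n.text_len += len(text)
            if in_link:
                n.link_len += len(text)


def parse_html(html: str) -> Node:
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def best_content_subtree(root: Node, min_share: float = 0.6) -> Node:
    """Walk down from the root while one child holds at least min_share
    of the current node's non-link text (an assumed stopping rule)."""
    node = root
    while True:
        elems = [c for c in node.children if isinstance(c, Node)]
        if not elems:
            return node
        best = max(elems, key=Node.non_link_len)
        own = node.non_link_len()
        if own == 0 or best.non_link_len() < min_share * own:
            return node
        node = best


def text_of(node: Node) -> str:
    """Concatenate the text of a subtree in document order."""
    parts = [c if isinstance(c, str) else text_of(c) for c in node.children]
    return " ".join(p for p in parts if p)
```

Calling text_of(best_content_subtree(parse_html(page))) on a page string returns the text of the selected subtree. Because the rule relies only on text length and link text, navigation bars and footers made up mostly of links contribute almost nothing to the score, which is why no page-specific template is needed in this sketch.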

Keywords/Search Tags:

Web Information Extraction, HTML, Wrapper, Best Content Sub Tree, Competitive Intelligence

Related items

1	Research On Enterprise Competitive Intelligence Acquisition Based On Web Information Extraction
2	Research For Information Extraction Based On Wrapper Model Algorithm
3	Based On The Protection Of The Asymmetric Information Theory Of Competitive Intelligence Research
4	Extraction Technology Research, Based On Ontology Can Be Customized Web Information Intelligence
5	Research On The Technology Of The Web Employment Information Extraction Based On The HTML
6	Research On Web-based Collection Technique For Enterprise Competitive Intelligence
7	Extracting Enterprise Competitive Intelligence From The Web
8	Based On The Html Pages Of Web Information Extraction
9	The Bank Competitive Intelligence Collection System Based On Internet
10	Based On The Information Environment Of The Securities Company Model Of Competitive Intelligence Research