Research On Data Extraction In Web Data Integration Based On Domain

Posted on: 2010-06-05
Degree: Master
Type: Thesis
Country: China
Candidate: L H Wang
Full Text: PDF
GTID: 2178360278472609
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the Internet and related technologies, the Web has become the largest source of information for both business and personal use. Today, 90% of the global top 500 enterprises have established their own well-defined market intelligence analysis systems. Small and medium-sized enterprises, however, lack the financial, technical, and human resources to carry out market intelligence analysis. Effectively integrating Web data to support market intelligence analysis for these enterprises is therefore of considerable significance. Because Web data is dynamic, distributed, and diverse, obtaining valuable commercial information quickly, accurately, and reliably is an enormous challenge.
Because of the sheer number of websites and the resulting proliferation of information, finding useful information is becoming more and more difficult. The traditional means of accessing information, namely search engines and Web querying, can no longer meet this demand. Web data extraction technology can: it finds the relevant Web documents in a document collection and extracts structured data from the documents found, that is, it transforms semi-structured data into structured data. A large amount of information on the Web is stored in hidden databases and is generated dynamically in response to users' queries. The resulting pages are generated from a single template, so the HTML code of the Web data rows is highly similar in structure and, naturally, the corresponding sub-DOM-trees are similar to each other.
This paper takes support for market intelligence analysis as its background.
This paper proposes a Web data extraction system that, oriented toward domain features, extracts target data with semantic descriptions from the Web pages returned by related queries. The main work comprises label recognition, data extraction, and label assignment. Label recognition exploits the relationship between the labels of Web forms and the labels of list pages, using machine learning and pattern matching to recognize the labels of list pages. Data extraction uses a template detection approach based on a DOM tree matching algorithm, derived from analyzing the correspondence between a Web document and its DOM tree. By analyzing the code structure of multiple data records, this approach obtains the template of the data records, which is then used to recognize and extract similar data records. Label assignment uses the Web form query interface, the recognized labels, and a set of heuristic rules to assign labels to the target data.
This paper is an exploratory study of how to effectively extract target data with semantic descriptions, and it describes the experimental analysis in detail. The experimental results show that the proposed method is correct and that its results are satisfactory. The paper offers an effective idea and method for the data extraction problem and, at the same time, provides some help for market intelligence analysis, giving the research both theoretical and practical value.
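The abstract does not say which DOM tree matching algorithm the template detection step uses. As a rough illustration of the idea that data records generated from one template have similar sub-DOM-trees, the sketch below uses Simple Tree Matching as a stand-in. The names `Node`, `TreeBuilder`, `similar_to_first`, the 0.8 threshold, and the sample page are all hypothetical and are not taken from the thesis.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for this sketch).
VOID = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []


class TreeBuilder(HTMLParser):
    """Build a tag-only tree; text and attributes are ignored.

    Assumes properly nested markup with explicit closing tags.
    """

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def size(node):
    """Number of nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a, b):
    """Size of the largest order-preserving match between two subtrees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # Dynamic programming over the two child sequences.
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = table[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            table[i][j] = max(table[i - 1][j], table[i][j - 1], paired)
    return table[m][n] + 1  # +1 for the matched roots


def similarity(a, b):
    """Normalized match score in [0, 1]; 1.0 means identical tag structure."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


def similar_to_first(container, threshold=0.8):
    """Indices of children whose structure resembles the first child's."""
    first = container.children[0]
    return [
        i
        for i, child in enumerate(container.children)
        if similarity(first, child) >= threshold
    ]


# Hypothetical list page: three product rows and one unrelated block.
PAGE = """
<ul>
  <li><a>Item A</a><span>10</span></li>
  <li><a>Item B</a><span>12</span></li>
  <li><a>Item C</a><span>9</span><em>sale</em></li>
  <li><p>Advertisement</p></li>
</ul>
"""

ul = parse(PAGE).children[0]
print(similar_to_first(ul))  # [0, 1, 2]
```

Here the first three `li` subtrees share the `a`/`span` structure and pass the threshold, while the advertisement row does not. A real template detector would also have to locate the container node and merge the matched records into a template, which this sketch does not attempt.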
Keywords/Search Tags:data extraction, data integration, market intelligence, machine learning