Web Information Extraction And Integration

Posted on: 2005-05-31; Degree: Master; Type: Thesis
Country: China; Candidate: H Z Xue; Full Text: PDF
GTID: 2208360152966962; Subject: Computer application technology
Abstract/Summary:
With the development of computer and communication technology, the Internet is becoming more and more important in our life and work. The amount of data on the Web is tremendous, but that data is unstructured or semi-structured. A browser can interpret it for display, but a computer cannot process it automatically, so the information is difficult to put to use. Finding and retrieving useful information from this enormous body of Web data is the goal of web information extraction.

More and more researchers are working on web information extraction and have achieved many results, but each existing technique has its own advantages and disadvantages. In this thesis, we propose a new method for web information extraction. We provide a user-friendly interface that allows users to define the extraction process; a separate program then performs the extraction according to the user's definition.

Once the web information has been extracted, it has to be put to use by integrating it into a target database. Before integration, the extracted data must be cleaned, transformed, and then loaded, so we provide an ETL tool to help users define this process. Its user-friendly interface helps users access heterogeneous data sources, obtain their data models, and define the transformation rules between the source dataset and the target dataset. The tool then stores the description of the ETL process in a script file. An execution program reads the script file, parses it, and performs the ETL process it describes. In this way, we achieve the integration of web data.
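The script-driven ETL step described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the thesis's implementation: the abstract does not give the script format, so the JSON layout, the field names (`target_table`, `columns`, `source`, `target`, `type`), the helper functions, and the sample records are all hypothetical, and an in-memory SQLite database stands in for the target database.

```python
import json
import sqlite3

# Hypothetical ETL script. The thesis does not specify its script format;
# this JSON layout is an assumption made purely for illustration.
ETL_SCRIPT = """
{
  "target_table": "products",
  "columns": [
    {"source": "Name",  "target": "name",  "type": "text"},
    {"source": "Price", "target": "price", "type": "number"}
  ]
}
"""

def clean(value):
    """Cleaning step: trim the value and collapse runs of whitespace."""
    return " ".join(str(value).split())

def transform(value, kind):
    """Transformation step: convert a cleaned string to the target type."""
    if kind == "number":
        return float(value.replace("$", "").replace(",", ""))
    return value

def run_etl(script_text, records, conn):
    """Parse the script, then clean, transform, and load each record."""
    script = json.loads(script_text)
    table = script["target_table"]
    cols = script["columns"]
    names = ", ".join(c["target"] for c in cols)
    marks = ", ".join("?" for _ in cols)
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({names})")
    for rec in records:
        row = [transform(clean(rec[c["source"]]), c["type"]) for c in cols]
        conn.execute(f"INSERT INTO {table} ({names}) VALUES ({marks})", row)
    conn.commit()

# Made-up records standing in for the output of the extraction step.
extracted = [
    {"Name": "  Widget   A ", "Price": "$1,200.50"},
    {"Name": "Widget B", "Price": " 15 "},
]

conn = sqlite3.connect(":memory:")  # stand-in for the target database
run_etl(ETL_SCRIPT, extracted, conn)
rows = conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
print(rows)
```

Keeping the transformation rules in a script rather than in code mirrors the separation the abstract describes: the interface only has to write the script, and the execution program only has to read it.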
Keywords/Search Tags:non-structured, semi-structured, Information Extraction, ETL, data integration