XML-based Web Data Extraction Technology Research

Posted on: 2006-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: X Z Wu
Full Text: PDF
GTID: 2208360182968926
Subject: Computer application technology
Abstract/Summary:
The Web, as a global information space, holds a great deal of potential value. Much research has addressed Web data extraction, yet the current state of the art still falls short of what Web users need. XML has become the standard for representing data on the Web, and it provides a uniform data model for Web data.

This dissertation reviews the state of Web data extraction and presents a fast, practical Web data extraction method based on XML. It further studies several key technologies, namely the search strategy, the transformation algorithm, and the extraction method, with the aim of contributing to data extraction research.

The main contributions of this dissertation are as follows.

1. A modified HITS algorithm for small-Web search. Based on the characteristics of the link structure, we construct implicit links in the small Web and weight links according to how often they are accessed. The resulting link structure of the small Web is close to that of the global Web, and the link weights incorporate users' feedback. Theoretical analysis and experiments indicate that the modified algorithm is correct.
2. A stack-based approach to transforming HTML into XML. It simplifies data extraction and prepares the document for extracting the appropriate data.
3. A criterion for robust data extraction from XML. The criterion is applied to two steps of XML data extraction: locating the special area, and mapping and merging data. A method is provided for each step, and the results show that the methods are effective.
4. A completed prototype. The prototype brings together data extraction, XML, and Java by combining the three techniques above. It provides a fast, general Web data extraction solution based on XML, with good adaptability and portability.
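The abstract does not give the details of the stack-based HTML-to-XML transformation named in contribution 2. The following is only a minimal sketch of how such a transformation might look, not the dissertation's algorithm. It assumes Python's standard `html.parser` module as the tokenizer; the class name `StackHtmlToXml`, the `VOID_TAGS` set, and the helper `is_well_formed` are hypothetical names introduced for illustration. The idea is to push each opened tag onto a stack and, when a closing tag or the end of input arrives, pop and close everything that is still open, so that the output is well-formed XML.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Assumption: a small illustrative set of HTML tags that never have content.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class StackHtmlToXml(HTMLParser):
    """Hypothetical sketch: repair tag nesting with a stack of open tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # tags opened but not yet closed
        self.out = []    # pieces of the XML output

    def _emit_open(self, tag, attrs, self_closing=False):
        # Valueless attributes (e.g. "checked") get an empty string value.
        attr_text = "".join(
            f" {name}={quoteattr(value or '')}" for name, value in attrs
        )
        self.out.append(f"<{tag}{attr_text}{'/' if self_closing else ''}>")

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self._emit_open(tag, attrs, self_closing=True)
        else:
            self._emit_open(tag, attrs)
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        self._emit_open(tag, attrs, self_closing=True)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag with no matching open tag: ignore it
        # Pop and close every tag opened after the matching one, then it.
        while self.stack:
            top = self.stack.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def to_xml(self, html):
        self.feed(html)
        self.close()
        while self.stack:  # close anything still open at end of input
            self.out.append(f"</{self.stack.pop()}>")
        return "".join(self.out)


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single-rooted XML document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

For example, `StackHtmlToXml().to_xml("<div><p>Price: 5 &lt; 10<br><b>bold</div>")` closes the unclosed `b` and `p` elements before `div` and writes `br` as an empty element. The sketch ignores comments and doctype declarations and does not check that attribute names are valid XML names, which a real converter would need to handle.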
Keywords/Search Tags:data extraction, XML, link structure, stack, robustness