Research On Web Data Extraction Technology

Posted on: 2010-08-19
Degree: Master
Type: Thesis
Country: China
Candidate: X J Chen
Full Text: PDF
GTID: 2178360272479393
Subject: Computer application technology
Abstract/Summary:
The Internet has become an important resource for transmitting and sharing information. Web data is semi-structured, heterogeneous, and massive in scale, so traditional data mining techniques cannot be applied directly to Web data sources. Useful information must therefore be extracted from semi-structured Web data before mining, which motivates Web data extraction technology. Because XML is structured and extensible, it is a suitable format for passing the extracted data to a data mining system.

This thesis first introduces the current state of research and the hot topics in information extraction theory, then focuses on analyzing the features of Web data and studying techniques for extracting it. An expanded DOM tree is proposed based on an analysis of the visual features of Web pages. The data structure features of the page are then combined to improve the STM algorithm, and the expanded DOM tree is applied in the extraction system. Next, the thesis uses Web display features to improve the MDR algorithm, building on the automatic Web extraction system proposed by Bing Li. Finally, the thesis introduces the concept of an extraction pattern to improve the algorithm's ability to recognize data objects, presents a tree extraction pattern, and uses it in the extraction process. Experiments show that the tree extraction pattern improves the recall of data extraction. The extracted results are saved as XML documents.

The algorithms are verified through comparative experiments, the results are analyzed, and the advantages and weaknesses of the algorithms are discussed.
Keywords/Search Tags: Web data mining, Web data extraction, expanded DOM tree, automatic extraction, information extraction patterns
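
The abstract refers to the STM algorithm without describing it. For orientation, here is a minimal sketch of the classic Simple Tree Matching baseline as it is commonly described in the Web data extraction literature. It is not the improved variant proposed in the thesis, and it does not model the expanded DOM tree or any visual features. The `Node` class and the function name are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element: a tag plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum order-preserving
    matching between two trees (classic STM, not the thesis's variant)."""
    # Roots with different tags cannot be matched, so neither can their subtrees.
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j] = best score matching the first i subtrees of a
    # against the first j subtrees of b, keeping sibling order.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    # Add 1 for the matched pair of roots.
    return dp[m][n] + 1


# Two table rows that differ by one trailing cell.
row_a = Node("tr", [Node("td"), Node("td", [Node("a")])])
row_b = Node("tr", [Node("td"), Node("td", [Node("a")]), Node("td")])
score = simple_tree_matching(row_a, row_b)
```

A higher score means the two subtrees share more structure, which is why tree matching of this kind is used to decide whether sibling regions of a page are instances of the same data record.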