Research on Web Content Mining Based on XML

Posted on: 2008-10-26
Degree: Master
Type: Thesis
Country: China
Candidate: X Zhou
Full Text: PDF
GTID: 2178360215990939
Subject: Computer application technology
Abstract/Summary:
With the development of the Internet, the amount of information on the WWW is growing rapidly. The WWW offers people a massive body of information, but it also creates a contradiction: on the one hand, people need to acquire information from the WWW quickly and effectively; on the other hand, the volume of information is huge, its structure is complicated, and processing it poses many difficulties. Web mining technology offers one way to resolve this contradiction. Research on Web mining is still at a developing stage, and much work remains in theory, implementation methods, and technology. Web mining is the application of traditional data mining technology to the Web environment: it discovers implicit, previously unknown, potentially valuable, and non-trivial patterns from massive sets of Web documents and from records of users' browsing activity. According to the object of study, Web mining falls into three categories: Web content mining, Web structure mining, and Web usage (log) mining. This thesis focuses on Web content mining. Its aim is to address the problem that current search engines serve only to retrieve information on the Web and cannot discover latent knowledge in it, so that most users feel overwhelmed when working with the results that search engines return.
The thesis analyzes the basic concepts, methods, and technologies of data mining, Web mining, and XML. Based on a study of semi-structured data processing and the key technologies involved, such as extraction methods and transformation algorithms, it proposes a stack-based approach for transforming HTML into XML. The approach converts semi-structured data into structured data and prepares suitable data for mining. The thesis also describes a method for building a multi-layered Web database from the XML data.
Existing decision tree and cluster classification algorithms from data mining are studied and modified to suit the practical application, so that they become applicable to the current Web mining task.
These ideas are realized in an XML-based Web mining prototype system named Web_srm. The prototype mines the content of the Web pages that a search engine returns for the query entered by the user. The system consists of six major components: Web page data acquisition, preprocessor, data converters, mining synthesizers, user interface, and multi-layered database. As a Web mining tool, the system analyzes and mines search results obtained from the Web, helping people find the content they are interested in more quickly and effectively.
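The abstract names a stack-based HTML to XML transformation but does not give its details. The following is therefore only a minimal sketch of the general idea, not the thesis's algorithm. It assumes Python's standard `html.parser` and `xml.etree.ElementTree` modules; the class `StackHtmlToXml`, the function `html_to_xml`, and the wrapper element name `document` are hypothetical names chosen for illustration. A stack holds the currently open elements, so that unclosed or stray tags in loosely structured HTML are repaired into a well-formed XML tree.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag (assumed list, for illustration).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StackHtmlToXml(HTMLParser):
    """Hypothetical converter: builds an XML tree using a stack of open elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # artificial wrapper root
        self.stack = [self.root]            # stack[-1] is the innermost open element

    def handle_starttag(self, tag, attrs):
        # Valueless HTML attributes (e.g. "disabled") get their own name as value.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        node = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_TAGS:
            self.stack.append(node)         # element stays open until closed

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this name; everything
        # opened after it is closed implicitly. Stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        if not data.strip():
            return                          # skip whitespace-only text
        top = self.stack[-1]
        if len(top):                        # text following a child element
            last = top[-1]
            last.tail = (last.tail or "") + data
        else:                               # text before any child element
            top.text = (top.text or "") + data


def html_to_xml(html: str) -> str:
    """Return a well-formed XML string for a (possibly sloppy) HTML fragment."""
    parser = StackHtmlToXml()
    parser.feed(html)
    parser.close()
    return ET.tostring(parser.root, encoding="unicode")


if __name__ == "__main__":
    # The <b> element is never closed in the input; the stack closes it.
    print(html_to_xml('<div class="box"><p>Hello <b>world<br>again</p></div>'))
```

The sketch deliberately omits HTML's implicit-closing rules (for example, a new `<li>` closing the previous one), so such elements end up nested instead of being siblings. A transformation intended for real mining input would need those rules as well.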
Keywords/Search Tags: HTML, XML, Data mining, Web content mining