Web Data Mining And Its Applications, Network News Text Data

Posted on: 2011-12-12
Degree: Master
Type: Thesis
Country: China
Candidate: F Hu
Full Text: PDF
GTID: 2208360308466999
Subject: Computer software and theory
Abstract/Summary:
With the development of computer software, hardware, and network technology, people have become accustomed to the Internet as a main platform for publishing and exchanging information, and the volume of Web information is growing explosively. With over 800 million pages covering most areas of human endeavor, the World Wide Web is fertile ground for data mining research that can make information search more effective. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and for dealing with search results in more structured ways than are available now. Data mining and machine learning have significant roles to play toward this end. Web mining is a new research area that arose in this context; it applies data mining techniques to semi-structured Web data and tailors knowledge discovery to the characteristics of that data.

This thesis uses news pages on the Web as the vehicle for studying Web mining. Web mining can be divided into three categories: Web content mining, Web structure mining, and Web usage mining. This study focuses on Web content mining, namely Web mining applied to the text content of news pages. The work covers the following areas:

1. A systematic study of the basic theory of Web mining and of data mining for hypertext and text.
2. Implementation of the pre-processing techniques that Web content mining of news pages requires, such as data collection, Web content extraction, and Chinese and English word segmentation.
3. A proposed similarity detection method based on the MinApriori measure. The method is inspired by the approach that association rule algorithms use when dealing with numerical data. Applied to document similarity detection, it significantly improves detection speed while maintaining detection accuracy.
4. Classification of news pages to make browsing news friendlier. The thesis systematically studies the classification learning algorithms and dimensionality reduction methods applicable to text and, through systematic experiments, analyzes the performance of the various algorithms in text classification as well as the effect of dimensionality reduction. Finally, an automatic classification system for news pages is implemented based on the Complement Naive Bayes method.
5. An online Web mining services platform that integrates the work above.

Through the online Web mining services platform, duplicate detection, classification, and other Web mining functions can be performed on Web news with little manual intervention. These functions can improve the efficiency with which information is used, and the platform has broad application prospects and potential commercial value.
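
The abstract does not give the formula behind the MinApriori-based similarity measure. As a minimal sketch, assuming the measure resembles the min-based support used for continuous attributes in association rule mining, two documents can be compared by normalizing each term-frequency vector to sum to 1 and then summing the smaller of the two weights for every shared term. The function names `normalize` and `min_similarity` and the sample sentences are hypothetical and are not taken from the thesis.

```python
from collections import Counter


def normalize(tokens):
    """Turn a token list into term frequencies that sum to 1."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()} if total else {}


def min_similarity(doc_a, doc_b):
    """Sum the smaller normalized frequency over shared terms (range 0 to 1).

    Assumption: this min-based overlap only illustrates the idea; it is
    not the thesis's actual measure.
    """
    freq_a, freq_b = normalize(doc_a), normalize(doc_b)
    # Iterate over the smaller dictionary to keep the loop short.
    if len(freq_a) > len(freq_b):
        freq_a, freq_b = freq_b, freq_a
    return sum(min(w, freq_b[t]) for t, w in freq_a.items() if t in freq_b)


# Hypothetical sample documents, already tokenized by whitespace.
doc_a = "web mining applies data mining to web data".split()
doc_b = "web mining applies data mining to news pages".split()
doc_c = "stock prices fell sharply on monday".split()

print(min_similarity(doc_a, doc_b))  # near-duplicates score high
print(min_similarity(doc_a, doc_c))  # unrelated documents score 0
```

Because the score needs only one pass over the shorter term dictionary, a measure of this kind is cheap to compute, which is consistent with the speed advantage the abstract reports, though the thesis may achieve it differently.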
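
The classification step can be illustrated in the same hedged way. The sketch below is a simplified Complement Naive Bayes classifier written from scratch: each class is scored with word weights estimated from the documents of all other classes, and the class whose complement fits the document worst is chosen. It omits class priors and weight normalization, and the names `train_cnb` and `predict_cnb`, the smoothing parameter `alpha`, and the toy training set are assumptions made for illustration, not the thesis's implementation.

```python
import math
from collections import Counter, defaultdict


def train_cnb(docs, labels, alpha=1.0):
    """Return, per class, log word probabilities estimated from the other classes."""
    vocab = sorted({t for tokens in docs for t in tokens})
    class_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        class_counts[label].update(tokens)

    # Word counts over the whole corpus, used to derive complement counts.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    grand_total = sum(total.values())

    weights = {}
    for label, counts in class_counts.items():
        comp_total = grand_total - sum(counts.values())
        denom = comp_total + alpha * len(vocab)
        weights[label] = {
            t: math.log((total[t] - counts[t] + alpha) / denom) for t in vocab
        }
    return weights


def predict_cnb(weights, tokens):
    """Choose the class whose complement explains the tokens worst."""
    def complement_score(label):
        w = weights[label]
        # Words never seen in training are skipped.
        return sum(w[t] for t in tokens if t in w)
    return min(weights, key=complement_score)


# Hypothetical toy training set of tokenized headlines.
train_docs = [
    "team wins football match".split(),
    "coach praises team after match".split(),
    "bank raises interest rates".split(),
    "stock market rates fall".split(),
]
train_labels = ["sports", "sports", "finance", "finance"]

model = train_cnb(train_docs, train_labels)
print(predict_cnb(model, "football team match".split()))
print(predict_cnb(model, "interest rates rise".split()))
```

Estimating weights from the complement of each class tends to be more stable than standard Naive Bayes when class sizes are uneven, which is a common situation in news categories.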
Keywords/Search Tags: Web mining, news, text classification, MinApriori, similarity detection