
Research On Outliers Mining Method To Web Content

Posted on: 2011-11-04
Degree: Master
Type: Thesis
Country: China
Candidate: B Yu
Full Text: PDF
GTID: 2178360305956072
Subject: Management Science and Engineering
Abstract/Summary:
With the development of information technology, companies have accumulated more and more data. For some of them, exceptional data are more valuable than normal data. Outlier mining techniques have been applied in many areas of modern society and serve as a useful tool. Searching for outliers in Web data repositories is even more challenging. Outlier mining consists of two parts: outlier detection and outlier analysis. Outlier analysis depends on background knowledge of the domain. This dissertation discusses the most pivotal question, outlier detection.

Existing outlier mining algorithms designed solely for numeric data cannot be applied directly to Web datasets, which contain data of different types (e.g., text, hypertext, video, audio, and images). The traditional algorithms also have shortcomings of their own. The thesis provides a taxonomy of Web outliers and discusses a general framework for mining them, but concentrates on designing models for mining Web content outliers. It also analyzes the key technologies in detail.

In this thesis, we summarize the theory of outlier mining and analyze outlier detection algorithms in depth. Because Web servers and Web page content are dynamic and changeable, no single algorithm can mine all outliers. We classify Web outlier data and design a general framework for Web outlier mining that takes the characteristics of Web data into account. By analyzing the distribution characteristics of outliers, the thesis proposes an outlier mining algorithm based on a local isolation coefficient, which introduces a new outlier measure. It not only overcomes the shortcomings of distance-based algorithms but also improves efficiency compared with density-based algorithms. In e-business, most datasets contain both numerical and categorical data. To handle this, the thesis proposes an algorithm based on attribute frequency.
At equivalent accuracy, this algorithm is more efficient than others, such as the Greedy algorithm based on frequent items, because it needs only one scan of the dataset. A series of experiments was carried out to test the algorithms and their performance. Finally, we discuss why outlier mining is necessary for e-business and provide application examples on commodity and trade datasets according to practical requirements.
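The abstract does not give the formula behind the attribute-frequency algorithm, so the following is only a minimal sketch of the general idea, assuming a score equal to the mean relative frequency of a record's attribute values (records with rare values get low scores). The function names, the scoring rule, and the toy trade records are all hypothetical and are not taken from the thesis. The sketch handles categorical values only; numerical attributes would first need to be discretized, which is a further assumption.

```python
from collections import Counter


def attribute_frequency_scores(records):
    """Return one score per record: the mean relative frequency of its values.

    Assumed scoring rule (not from the thesis): low scores suggest outliers.
    """
    if not records:
        return []
    n = len(records)
    n_attrs = len(records[0])

    # Single counting pass over the data: how often does each value
    # occur in each attribute position?
    counts = [Counter() for _ in range(n_attrs)]
    for rec in records:
        for j, value in enumerate(rec):
            counts[j][value] += 1

    # Score the records from the counts; rare values pull the score down.
    return [
        sum(counts[j][value] for j, value in enumerate(rec)) / (n_attrs * n)
        for rec in records
    ]


def top_outliers(records, k):
    """Return the indices of the k records with the lowest scores."""
    scores = attribute_frequency_scores(records)
    order = sorted(range(len(records)), key=lambda i: scores[i])
    return order[:k]


# Hypothetical toy trade records: (category, payment method, region).
records = [
    ("book", "card", "north"),
    ("book", "card", "north"),
    ("book", "card", "south"),
    ("book", "cash", "north"),
    ("jewel", "wire", "west"),
]

scores = attribute_frequency_scores(records)
suspects = top_outliers(records, 1)
```

In this sketch the value counts are gathered in one pass over the records, and scoring then reuses those counts, which is one plausible reading of why a frequency-based method can be cheaper than approaches that compare records pairwise.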
Keywords/Search Tags: Data Mining, Web Content, Outliers, Local Isolation Coefficient, Attribute Frequency