Research On Outliers Mining Method To Web Content

Posted on:2011-11-04

Degree:Master

Type:Thesis

Country:China

Candidate:B Yu

Full Text:PDF

GTID:2178360305956072

Subject:Management Science and Engineering

Abstract/Summary:

With the development of information technology, companies have accumulated more and more data. For some of them, exceptional data are more valuable than normal data. Outlier mining techniques have been applied in many areas of modern society and have proved to be a useful tool. Searching for outliers in Web data repositories is even more challenging. Outlier mining consists of two parts: outlier detection and outlier analysis. Outlier analysis depends on background knowledge of the domain. This dissertation addresses the more pivotal problem, outlier detection.

Existing outlier mining algorithms designed solely for numeric data cannot be applied directly to Web datasets, which contain data of different types (e.g., text, hypertext, video, audio, and images). The traditional algorithms also have shortcomings of their own. The thesis provides a taxonomy of Web outliers and discusses a general framework for mining them, but concentrates on designing models for mining Web content outliers. It also analyzes the key technologies in detail.

In this thesis, we summarize the theory of outlier mining and analyze outlier detection algorithms in depth. Because Web servers and Web page content are dynamic and changeable, no single algorithm can mine all outliers. We therefore classify Web outlier data and design a general framework for Web outlier mining that takes the characteristics of Web data into account. By analyzing the distribution characteristics of outliers, the thesis proposes an outlier mining algorithm based on a local isolation coefficient, which introduces a new outlier measure. It not only overcomes the shortcomings of distance-based algorithms but also improves efficiency compared with density-based algorithms.

In e-business, most datasets contain both numerical and categorical data. To handle such data, the thesis proposes an algorithm based on attribute frequency. At equivalent accuracy, this algorithm is more efficient than alternatives such as the Greedy algorithm based on frequent items, because it needs only one scan of the dataset. A series of experiments was carried out to test the algorithms and their performance. Finally, we discuss why outlier mining is necessary for e-business and provide application examples on commodity and trade datasets as required.
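
The abstract does not give the details of the attribute-frequency algorithm, so the following Python sketch is only an illustration of the general idea, not the thesis's method. It assumes purely categorical records of equal length and scores each record by the mean relative frequency of its attribute values, so that records built from rare values receive low scores. The function names `attribute_frequency_scores` and `lowest_scoring` are hypothetical, and the sketch makes one counting pass over the data followed by a scoring pass over the records held in memory.

```python
from collections import Counter


def attribute_frequency_scores(records):
    """Score each record by the mean relative frequency of its attribute values.

    Assumption: every record is a tuple of categorical values of the same length.
    A low score means the record is made up of rare values.
    """
    if not records:
        return []
    n_records = len(records)
    n_attrs = len(records[0])

    # Counting pass: how often does each value occur in each attribute column?
    counts = [Counter() for _ in range(n_attrs)]
    for record in records:
        for j, value in enumerate(record):
            counts[j][value] += 1

    # Scoring pass: average the relative frequencies of a record's values.
    return [
        sum(counts[j][value] for j, value in enumerate(record)) / (n_attrs * n_records)
        for record in records
    ]


def lowest_scoring(records, k):
    """Return the indices of the k records with the lowest frequency scores."""
    scores = attribute_frequency_scores(records)
    order = sorted(range(len(records)), key=lambda i: scores[i])
    return order[:k]


if __name__ == "__main__":
    # Toy trade records: (commodity category, payment method).
    trades = [
        ("book", "card"),
        ("book", "card"),
        ("book", "card"),
        ("jewelry", "wire"),
    ]
    print(attribute_frequency_scores(trades))
    print(lowest_scoring(trades, 1))
```

In this toy dataset the last record combines two values that occur only once, so it receives the lowest score. Numerical attributes, which the abstract says also occur in e-business data, would need separate handling (for example, discretization) that this sketch does not attempt.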

Keywords/Search Tags:

Data Mining, Web Content, Outliers, Local Isolation Coefficient, Attribute Frequency

Related items

1	A Study On Local Outliers Mining Algorithm Based On Weighted-Attribute
2	Local sparsity coefficient-based mining of outliers
3	Optimal Subspace Outlier Mining Algorithm Based On Entropy Increment And Local Attribute Weighting
4	Research On Extended Knowledge Discovery In High-Dimension And Sparse Outliers Set
5	The Local Outlier Mining Algorithm Based-on Conditional Cumulative Holoentropy And Global Neighbourhood
6	Study And Improvement Of Local Outliers Mining Based On Density
7	Research Of Outliers Mining Applied In Snort System Improvement
8	Study Of Mining Outliers Based On Interestingness
9	A Research On Outliers Mining Algorithm Based On Heat Metering Data
10	Mining Association Rules Among Outliers Based On Histogram And FP-growth