Research On Data Dependent Similarity Measure For High Dimensional Data

Posted on: 2019-04-08 | Degree: Master | Type: Thesis
Country: China | Candidate: N J Deng | Full Text: PDF
GTID: 2348330545481041 | Subject: Computer Science and Technology
Abstract/Summary:
Data mining aims to discover previously unknown, valuable knowledge from large amounts of data. With the development of the information age, growing data volumes and increasingly complex data structures pose new challenges for data mining tasks. Many industries have accumulated large amounts of high-dimensional data, and using data mining techniques effectively to uncover the hidden value in such data is a focus of current research. A similarity measure quantifies the similarity between data objects and is a key part of data mining tasks. However, traditional similarity measures show great limitations when applied to high-dimensional data. Data dependent similarity measures take the influence of the data distribution into account and thereby address this problem: they compare the distribution of data around the objects rather than the geometric distance alone. For the same geometric distance, two objects in a sparse area are considered more similar than two objects in a dense area. Existing research on data dependent similarity measures nevertheless has shortcomings, namely a lack of work on high-dimensional data and on the measurement of multiple types of data. To address these problems, this thesis studies data dependent similarity measures and makes two contributions: first, it optimizes the interval partition of the data dependent similarity measure and proposes a data dependent similarity measure algorithm for high-dimensional data; second, it studies the processing of categorical data and proposes an effective data dependent similarity measure for categorical data.

Existing data dependent similarity measures select attributes at random when building intervals, which makes them unsuitable for high-dimensional data. To address this, the thesis proposes a data dependent similarity measure algorithm based on attribute selection. First, the algorithm defines attribute importance using rough set attribute reduction theory and information system theory. Second, an attribute selection formula is defined from attribute importance and applied to the interval partition, so that intervals are built selectively. Finally, a formula for the probability mass is defined, and the probability mass is evaluated comprehensively to obtain the final similarity measure. Experiments show that this algorithm can be used in the kNN outlier detection algorithm to detect outliers in high-dimensional data effectively.

Most similarity algorithms deal with numerical data, and research on similarity measures for categorical data is lacking. To address this, the thesis proposes a data dependent similarity measure algorithm based on the concept hierarchy tree. First, the algorithm uses the concept hierarchy tree to construct a hierarchical tree of the attribute values of categorical data. Second, the attribute tree matching method of the concept hierarchy tree is optimized to take into account the depth of the lowest common parent node of the attribute values. Finally, dependency division is applied to the hierarchy tree, a new partitioning interval is defined, and the interval partition and the probability mass calculation for categorical data are completed. Experiments show that, when applied to the user-based collaborative filtering recommendation algorithm, this algorithm provides more accurate user recommendations.
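The central idea of the abstract, that for the same geometric distance a pair of objects in a sparse area counts as more similar than a pair in a dense area, can be illustrated with a minimal one-dimensional sketch. This is not the thesis's algorithm: it assumes the region covering two points is simply the closed interval between them, and it uses the fraction of data falling in that interval as the probability mass. The function name `mass_dissimilarity` and the sample data are hypothetical, chosen only for illustration.

```python
from bisect import bisect_left, bisect_right


def mass_dissimilarity(data, x, y):
    """Return the fraction of `data` lying in the closed interval spanned by x and y.

    Assumption: the interval [min(x, y), max(x, y)] stands in for the
    partition region that covers both points. A larger probability mass
    means the pair is treated as less similar.
    """
    values = sorted(data)
    lo, hi = min(x, y), max(x, y)
    # Count the values v with lo <= v <= hi using binary search.
    count = bisect_right(values, hi) - bisect_left(values, lo)
    return count / len(values)


# Hypothetical data: a dense cluster in [0, 1] and a sparse pair near 10.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 9.0, 10.0]

dense = mass_dissimilarity(data, 0.0, 1.0)    # pair inside the dense area
sparse = mass_dissimilarity(data, 9.0, 10.0)  # pair inside the sparse area

# Both pairs are 1.0 apart geometrically, yet the sparse pair has the
# smaller mass-based dissimilarity, so it is considered more similar.
print(dense, sparse)
```

The attribute selection step and the concept hierarchy tree described in the abstract would replace the fixed interval used here; the sketch only shows why a distribution-based measure can rank two equally distant pairs differently.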
Keywords/Search Tags: similarity measure, data dependent, rough set, concept hierarchy tree