Research On Data Dependent Similarity Measure For High Dimensional Data

Posted on:2019-04-08

Degree:Master

Type:Thesis

Country:China

Candidate:N J Deng

Full Text:PDF

GTID:2348330545481041

Subject:Computer Science and Technology

Abstract/Summary:

Data mining aims to discover previously unknown, valuable knowledge from large amounts of data. As the information age has developed, growing data volumes and increasingly complex data structures have posed new challenges for data mining tasks. Many industries have accumulated large amounts of high-dimensional data, and using data mining techniques effectively to uncover the hidden value in such data is a focus of current research. A similarity measure quantifies how similar data objects are and is a key part of any data mining task. However, traditional similarity measures show great limitations when faced with high-dimensional data. Data dependent similarity measures take the influence of the data distribution into account and thereby address the problems of traditional measures: they compare the distribution of data around the objects rather than only their geometric distance. For the same geometric distance, such a measure considers two objects in a sparse area to be more similar than two objects in a dense area. Existing research on data dependent similarity measures nevertheless has shortcomings, namely a lack of work on high-dimensional data and on data of multiple types.

In view of these problems, this thesis studies data dependent similarity measures and makes two contributions. First, it optimizes the interval partition of the data dependent similarity measure and proposes a data dependent similarity measure algorithm for high-dimensional data. Second, it studies the processing of categorical data and proposes an effective data dependent similarity measure for categorical data.

Existing data dependent similarity measures select attributes at random when building intervals, which makes them unsuitable for high-dimensional data. To address this, the thesis proposes a data dependent similarity measure algorithm based on attribute selection. First, the algorithm uses rough set attribute reduction theory and information system theory to define attribute importance. Second, an attribute selection formula is defined from attribute importance and applied to the interval partition, so that intervals are built selectively rather than at random. Finally, a formula for the probability mass is defined, and the probability mass is evaluated comprehensively to obtain the final similarity. Experiments show that, when used in a kNN outlier detection algorithm, this algorithm detects outliers in high-dimensional data effectively.

Most similarity algorithms deal with numerical data, and research on similarity measures for categorical data is lacking. To address this, the thesis proposes a data dependent similarity measure algorithm based on a concept hierarchy tree. First, the algorithm uses the concept hierarchy tree to organize the attribute values of categorical data into a hierarchy. Second, the attribute tree matching method of the concept hierarchy tree is optimized by taking into account the depth of the lowest common parent node of two attribute values. Finally, the data dependent partition is applied to the hierarchy tree: a new partitioning interval is defined, and the interval partition and the probability mass calculation for categorical data are completed. Experiments show that, when applied to a user-based collaborative filtering recommendation algorithm, this algorithm provides more accurate user recommendations.
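The central idea of a data dependent measure, that two objects at the same geometric distance count as more similar in a sparse area than in a dense one, can be illustrated with a minimal sketch. The abstract gives no formulas, so the code below is not the thesis's algorithm. It assumes the general random-partition approach: split the data recursively on a randomly chosen attribute (the step the thesis replaces with importance-based attribute selection), then take the share of data points in the smallest region covering both objects as their dissimilarity. All function names, parameters and the synthetic data are hypothetical.

```python
import numpy as np

def build_tree(X, rng, depth=0, max_depth=10):
    """Recursively partition X with random axis-parallel cuts (assumed scheme)."""
    n = len(X)
    if n <= 1 or depth >= max_depth:
        return {"size": n}
    attr = int(rng.integers(X.shape[1]))      # random attribute choice
    lo, hi = X[:, attr].min(), X[:, attr].max()
    if lo == hi:                              # cannot split a constant attribute
        return {"size": n}
    cut = rng.uniform(lo, hi)
    mask = X[:, attr] < cut
    return {
        "size": n, "attr": attr, "cut": cut,
        "left": build_tree(X[mask], rng, depth + 1, max_depth),
        "right": build_tree(X[~mask], rng, depth + 1, max_depth),
    }

def region_mass(tree, x, y):
    """Count the data points in the smallest region that covers both x and y."""
    node = tree
    while "attr" in node:
        x_left = x[node["attr"]] < node["cut"]
        y_left = y[node["attr"]] < node["cut"]
        if x_left != y_left:                  # this cut separates the pair
            break
        node = node["left"] if x_left else node["right"]
    return node["size"]

def mass_dissimilarity(trees, x, y, n):
    """Average region mass over all trees, as a fraction of the data set."""
    return float(np.mean([region_mass(t, x, y) for t in trees])) / n

# Synthetic data: a dense cluster near (0, 0) and a sparse one near (10, 10).
rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.5, size=(900, 2))
sparse = rng.normal(10.0, 2.0, size=(100, 2))
X = np.vstack([dense, sparse])
trees = [build_tree(X, rng) for _ in range(100)]

# Two pairs with the same Euclidean distance (0.5), one in each area.
a, b = np.array([-0.25, 0.0]), np.array([0.25, 0.0])
c, d = np.array([9.75, 10.0]), np.array([10.25, 10.0])
d_dense = mass_dissimilarity(trees, a, b, len(X))
d_sparse = mass_dissimilarity(trees, c, d, len(X))
print(f"dense pair: {d_dense:.3f}  sparse pair: {d_sparse:.3f}")
```

In this sketch the pair in the sparse cluster should receive the lower dissimilarity, because the regions covering it hold far fewer points, even though both pairs are equally far apart in Euclidean terms. The random attribute choice in build_tree is the step that, according to the abstract, degrades on high-dimensional data.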

Related items

1	Study On The Generalized Fuzzy Rough Set Model And Its Application
2	Research On Concept Lattice Updating Construction Algorithms Based On Concept Hierarchy
3	A Method For Building Semantic Web Rough Ontology
4	Automatic Update Of Ontology Concept Hierarchy With Structure-Content Similarity Measurement
5	Research On Data Preprocess And Interactive Visualization For Data Mining
6	Based On Xml And The Concept Of Hierarchical Tree Data Mining Research
7	Research On The Method Of Intelligent Data Analysis Based On Rough Set And Concept Lattice
8	Multi-granulation Rough Sets And Granular Reductions Based On Similarity Measure
9	Similarity Measure And Clustering Based On The Extended Rough Set Models
10	Research On Semantic Similarity Measure Method For RDF Graphs