Clustering Results Explanation and Linear Relation Study of Government Data

Posted on: 2009-11-21
Degree: Master
Type: Thesis
Country: China
Candidate: J M Pang
GTID: 2178360242980504
Subject: Computer application technology
Abstract/Summary:
With the rapid development of computer and communication technology, we have entered the information age. Over the last ten years, as database technology developed and database applications became widespread, huge databases have appeared. Research on knowledge discovery in databases (KDD) arose to extract valuable information and knowledge from these databases, and data mining is the core of KDD. As a branch of data mining, cluster analysis aims to obtain collections of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. Cluster analysis is widely applied in data mining, pattern recognition, content-based image retrieval, and other fields. In this thesis, I study how to explain clustering results, how to classify a 3D model database by clustering, and how to find linear relations in government data.

Clients usually find clustering results very difficult to understand, because they lack prior knowledge and the algorithms are complex. Explaining a result is also very useful for evaluating it, so that the clustering process can be adjusted and the result used efficiently. Creating an efficient, uniform, and user-friendly interface for explaining clustering results is therefore a pressing and meaningful topic.

Some researchers explain clustering results with visualization techniques, but they run into the problem of visualizing high-dimensional data. Dimensionality reduction lowers the number of dimensions, but the reduced data lose their original characteristics. Others explain clustering results by inspecting the clustered data directly, which requires expert knowledge and a complex reasoning process. Nakamura proposed a new clustering result explanation method based on FlexDice, a clustering method that can deal with huge databases and outliers. The method proceeds as follows.
First, the distribution of each attribute in each cluster is obtained; then the distributions are clustered and the outliers are extracted, and these outliers serve as the characteristics of the attribute. The shortcoming of the method is that it depends completely on one specific clustering method. Srinivasa K G proposed a method for distinguishing different clusters. Inspired by it, I found a new way to explain clustering results.

This thesis proposes two methods. The first studies the characteristic clusters of attributes, namely, finding which attribute determines a cluster. The second studies the characteristic attributes of each cluster, namely, finding the attributes that distinguish different clusters.

Correlation analysis of huge datasets has been widely adopted in the commercial field. A typical example is the discovery of the "beer and diaper" rule. However, little effort has been spent on analyzing correlations in government data, although this topic has important application value. For instance, Chinese auditors usually handle large government databases without metadata. In that case, correlation analysis of attributes can help them understand the datasets and discover illegal actions.

A traditional method of correlation analysis is linear regression. Regression analysis originates from Gauss's least squares method. One-variable linear regression is the simplest form, which considers only one independent variable and one dependent variable. However, it is greatly influenced by outliers. In data mining, association rule analysis is adopted to discover relationships among attributes, and it performs quite well for categorical or basket attributes. To apply it to a numeric attribute, the attribute must first be converted to a categorical attribute whose values correspond to different intervals, so the discovered rules cannot reflect the exact relationship between numeric attributes.
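The sensitivity of one-variable linear regression to outliers can be illustrated with a minimal sketch. The data below are hypothetical (they are not from the thesis), and `fit_line` is an illustrative helper that applies the standard least squares formulas:

```python
def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: ten records that follow y = 2x + 1 exactly.
clean = [(x, 2 * x + 1) for x in range(10)]
# The same records plus a single erroneous entry.
with_outlier = clean + [(10, 100)]

clean_slope, clean_intercept = fit_line(clean)
outlier_slope, outlier_intercept = fit_line(with_outlier)
print(clean_slope, outlier_slope)
```

On the clean records the fitted slope is 2; the single added record raises it to roughly 5.6, so the fitted line no longer describes the relation that holds for the other ten records.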
There are many techniques for dealing with huge data, such as sampling, parallel processing, and other special algorithms.

To overcome the drawbacks of regression analysis and association rule analysis, this thesis proposes a new method for discovering linear correlations between numeric attributes. The basic idea is to apply cluster analysis to the government dataset. As an unsupervised technique, cluster analysis can find groupings in the data without prior knowledge. Small clusters can be pruned as outliers, while the remaining clusters reflect different distribution patterns. We can therefore discover attribute correlations within each resulting cluster. This handles some difficult cases, such as when more than one correlation exists between attributes or when a correlation exists in only a small part of the dataset. In addition, because linear correlation analysis is applied within each cluster, linear relations can be found quickly and accurately.

Thanks to 3D scanning and modeling techniques, a number of 3D model databases have appeared. To make the best use of these resources and find the right 3D model, people pay much attention to content-based 3D model retrieval. The key to 3D model retrieval is feature extraction from 3D models.

Because an optimization theory is lacking, comparing different feature extraction methods is an important topic. The current evaluation measures, such as Precision, Recall, and R-Precision, all need classification information for the 3D model database. Model classification information is therefore the key to and the basis of 3D model feature extraction.

To overcome the drawbacks of the manual classification in PSB, we propose a 3D model classification method based on clustering.

The study of clustering result explanation, clustering-based 3D model classification, and linear correlation analysis of numeric attributes for government data still needs improvement.
Because clustering is so widely applied, there are many different clustering result explanation methods; this thesis does only some basic research on attributes and clusters, so the method needs to be improved or new clustering result explanation methods need to be studied. The 3D model classification results are not yet ideal: new clustering methods for 3D model retrieval need to be studied, and more measures, such as purity and entropy, should be used to evaluate the clustering results for 3D models. Linear correlation analysis of numeric attributes for government data should have wide application prospects; future work will find linear correlations among attributes and make full use of outliers.
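The cluster-then-correlate idea summarized in this abstract (cluster the records, prune small clusters as outliers, then look for a linear relation inside each remaining cluster) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the clustering algorithm, so a simple distance-threshold grouping stands in for it, and the function names, the `eps` and `min_size` parameters, and the data are all hypothetical.

```python
from math import dist, sqrt

def threshold_clusters(points, eps):
    """Stand-in clustering: link points closer than eps, return connected groups."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = min(unvisited)
        unvisited.remove(seed)
        members, stack = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited if dist(points[i], points[j]) <= eps]
            unvisited.difference_update(near)
            members.extend(near)
            stack.extend(near)
        clusters.append([points[i] for i in sorted(members)])
    return clusters

def linear_relation(points):
    """Least squares slope and intercept plus Pearson correlation r.

    Assumes both attributes vary within the given points.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / sqrt(sxx * syy)

def relations_by_cluster(points, eps, min_size):
    """Prune clusters smaller than min_size as outliers, then fit each remaining one."""
    return [linear_relation(cluster)
            for cluster in threshold_clusters(points, eps)
            if len(cluster) >= min_size]

# Hypothetical records: two different linear relations plus one isolated record.
group_a = [(x, 2 * x + 1) for x in range(10)]
group_b = [(x, 400 - 3 * x) for x in range(100, 110)]
records = group_a + group_b + [(50, 300)]

relations = relations_by_cluster(records, eps=4.0, min_size=3)
for slope, intercept, r in relations:
    print(f"y = {slope:.2f} * x + {intercept:.2f}  (r = {r:.2f})")
```

In this toy dataset a single correlation computed over all records is weak, while the per-cluster fits recover the two separate relations and the isolated record is discarded as an outlier. A real application would replace `threshold_clusters` with whatever clustering method suits the dataset.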
Keywords/Search Tags: Explanation