Research On The Data Model And The Approaches To Data Mining In The Semi-structured Data

Posted on: 2011-08-05
Degree: Doctor
Type: Dissertation
Country: China
Candidate: T Sun
Full Text: PDF
GTID: 1118360305453689
Subject: Computer application technology
Abstract/Summary:
As society enters the information age and computer networks and computer technology are applied ever more widely, the databases of every industry accumulate large volumes of data. The question of how to use these data and extract useful information or knowledge from them to guide the production and distribution of enterprises gave rise to a new, widely used and highly practical computer technology: data mining. With the popularization of the Internet, network data keep growing, and a great deal of semi-structured data has appeared. Semi-structured data are preferred for data storage and data exchange because they are scalable, self-describing and dynamic. They provide flexibility for system implementation and make resource sharing between corporations convenient.
Because semi-structured data lack a rigid and complete structure, they carry both content and structure information; the structure may be implicit and may even be modified constantly. Therefore, data models must be designed that better describe the characteristics of semi-structured data, based on the requirements of data analysis. Well-designed models lay a stable foundation for data storage, index construction, query optimization and knowledge discovery. In addition, because of the flexibility of semi-structured data, application analysis faces many problems, such as data skewness, obscure clustering boundaries and clustering boundary noise, so semi-structured data mining algorithms must be designed to solve them. The structure and content of semi-structured data may be modified continuously and are highly dynamic. Changes in structure and content clearly reflect the rules of change over time.
How can the dynamic structures be discovered from the history of changes, and how can these dynamic structures and information be used, together with clustering and classification methods, to analyze semi-structured data? Answering these questions is of great significance for making better use of the flexibility and dynamics of semi-structured data.
As data scale expands and analysis requirements increase, many kinds of data analysis tools and data mining systems need to be developed. By mining historical data, decision rules can be built to guide management or development and bring greater economic benefit to corporations. Data mining has been application-oriented from the beginning, and it is precisely its wide use and popularization that in turn promote research on data mining theory.
The main results obtained by this thesis are summarized as follows:
1) We analyze current research on semi-structured data models and data mining. Through an analysis of the relevant literature, we summarize the characteristics of semi-structured data and of the data schemes that have been proposed, and point out where they describe data poorly in applications. From the application of semi-structured data, we identify the problems of data skewness, obscure clustering boundaries, etc. We then summarize research on feature extraction, frequent structure discovery, document clustering and classification, and introduce the characteristics of popular data mining systems. This literature review lays the foundation for the thesis.
2) Based on data mining requirements, we design two semi-structured data models, LTRS and ADAWT. To characterize and handle the vagueness and uncertainty of structured data, as well as the composition and content implied within semi-structured data models, we present a Labeled Tree Rough Set Model (LTRS) by extending the traditional rough set model.
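
As a point of reference, the traditional rough set model that LTRS extends can be sketched as follows. This is a minimal illustration of the classical lower and upper approximations on a flat attribute table, not the tree-based LTRS definitions; the function name `approximations` and the toy document table are assumptions made for the example.

```python
from collections import defaultdict

def approximations(objects, attributes, target):
    """Return the (lower, upper) approximations of `target`.

    objects:    dict mapping object name -> dict of attribute values
    attributes: attributes that define the indiscernibility relation
    target:     set of object names to approximate
    """
    # Objects with identical values on `attributes` are indiscernible
    # and fall into the same equivalence class.
    classes = defaultdict(set)
    for name, values in objects.items():
        key = tuple(values[a] for a in attributes)
        classes[key].add(name)

    lower, upper = set(), set()
    for members in classes.values():
        if members <= target:   # class lies entirely inside the target
            lower |= members
        if members & target:    # class overlaps the target
            upper |= members
    return lower, upper

# Hypothetical table: four documents described by root tag and tree depth.
docs = {
    "d1": {"tag": "book", "depth": 2},
    "d2": {"tag": "book", "depth": 2},
    "d3": {"tag": "article", "depth": 3},
    "d4": {"tag": "article", "depth": 2},
}
lower, upper = approximations(docs, ["tag", "depth"], {"d1", "d3"})
print(sorted(lower), sorted(upper))  # ['d3'] ['d1', 'd2', 'd3']
```

Here d1 and d2 are indiscernible, so d1 alone cannot be placed in the lower approximation; the gap between the two approximations is what the rough set model uses to express uncertainty.
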
Making use of both the structure and the content of semi-structured data, we redefine the information system and the basic concepts of rough sets on the tree structure, such as the equivalence relation, indiscernibility relation, upper approximation and lower approximation. Furthermore, we describe the discernibility matrix and decision rules. By analyzing XML data sets with the LTRS model, we can construct decision rules from structure and content at the same time, and describe the composition relationships between tree nodes and the knowledge reduct of content. Because existing semi-structured data models lack a formal definition of the direction and degree of data change, and lack an explicit description of the dynamic properties of data, we present a tree model, ADAWT, that carries dynamic change information on tree depth and width. The model can integrate the dynamic change information of tree-shaped documents such as XML across N historical versions, and lays the basis for effective dynamic structure discovery.
3) We put forward a data balancing algorithm, SSGP, for the classification of skewed semi-structured data. Skewed data are abundant in Web applications of semi-structured data, and traditional classifiers are not effective on them. A classifier may partly or completely ignore the positive examples, or even predict every example as negative. Therefore, prediction and analysis of under-represented examples is an important branch of data mining, and classification algorithms are needed for the widespread problem of classifying skewed semi-structured data. To balance training sets that have several classes, we introduce an algorithm called SSGP, based on the idea that cases of the same class differ little from one another. SSGP forms new minority-class cases by interpolating between several minority-class cases that lie close together.
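
The idea of forming new minority-class cases by interpolation can be illustrated with a short sketch. This is not the SSGP algorithm itself, whose details are not given here; it is a minimal example of interpolation-based oversampling under stated assumptions. The function name `interpolate_minority`, its parameters, and the use of the vector norm as a cheap stand-in for "lying close together" are all hypothetical choices made for the illustration.

```python
import random

def interpolate_minority(cases, group_size=3, n_new=5, seed=0):
    """Create synthetic minority cases as convex combinations of nearby cases.

    cases: list of equal-length numeric feature vectors of one minority class.
    """
    if len(cases) < group_size:
        raise ValueError("need at least group_size minority cases")
    rng = random.Random(seed)
    # Assumption: ordering by Euclidean norm is used as a cheap proxy for
    # closeness, so no pairwise dissimilarities have to be computed.
    ordered = sorted(cases, key=lambda c: sum(x * x for x in c) ** 0.5)
    dim = len(ordered[0])
    synthetic = []
    for _ in range(n_new):
        # Pick a window of cases that are adjacent in norm order.
        start = rng.randrange(len(ordered) - group_size + 1)
        group = ordered[start:start + group_size]
        # Random positive weights, normalized to sum to 1 (convex combination).
        weights = [rng.uniform(0.1, 1.0) for _ in group]
        total = sum(weights)
        weights = [w / total for w in weights]
        synthetic.append(
            [sum(w * c[i] for w, c in zip(weights, group)) for i in range(dim)]
        )
    return synthetic

minority = [[1.0, 2.0], [1.2, 2.1], [0.9, 1.8], [1.1, 2.2]]
for case in interpolate_minority(minority, group_size=2, n_new=3):
    print(case)
```

Because every synthetic case is a convex combination of existing cases, it always lies inside the region spanned by the cases it was built from.
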
It is proved that SSGP does not add noise to the data set. To improve efficiency, SSGP uses the modulus instead of calculating a large number of dissimilarities between cases. Using a decision tree classifier to test the effect of balancing, the results show that SSGP can improve the predictive accuracy of several minority classes in a single run.
4) We present a clustering algorithm based on vector influence between objects, called VICA, to deal with obscure clustering boundaries and clustering noise. While solving the classification of skewed semi-structured data, we found that both clustering and classification suffer a loss of precision caused by obscure clustering boundaries and clustering noise. We therefore present a density-based clustering algorithm that takes vector influence between objects into account. From the point of view of the law of gravity, the influence between particles has two aspects, namely distance and direction. We define the concept of a Vector Influence Function by introducing a scalar influence function and a direction influence function. Moreover, we propose two methods, the similarity method and the summation method, to compute the direction influence. The VICA algorithm normalizes the object projection of the core point in its neighborhood, inspects the balance of the core point, and then expands the objects that are reachable from balanced core points with balanced density into a cluster. Theoretical analysis and experimental results indicate that the algorithm can discover clusters of arbitrary shape and can effectively eliminate noise such as sparse boundary points. It addresses many problems caused by obscure clustering boundaries in high-dimensional data, uneven density distribution and large numbers of clustering boundary objects.
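
The role of direction in checking whether a core point is balanced can be illustrated with a small sketch. It is one possible reading of a summation-style direction influence, not the VICA definition: unit vectors from a point to its neighbors are summed, and a small resultant means the neighbors surround the point evenly. The function names `direction_imbalance` and `is_balanced` and the threshold value are assumptions made for the example.

```python
import math

def direction_imbalance(point, neighbors):
    """Norm of the mean unit vector from `point` to its neighbors.

    Close to 0 when neighbors surround the point evenly; close to 1 when
    they all lie on one side, as for a sparse boundary point.
    """
    dim = len(point)
    total = [0.0] * dim
    count = 0
    for q in neighbors:
        diff = [qi - pi for qi, pi in zip(q, point)]
        length = math.sqrt(sum(d * d for d in diff))
        if length == 0.0:
            continue  # skip the point itself and exact duplicates
        for i in range(dim):
            total[i] += diff[i] / length
        count += 1
    if count == 0:
        return 0.0
    return math.sqrt(sum(t * t for t in total)) / count

def is_balanced(point, neighbors, threshold=0.5):
    # Hypothetical threshold: only balanced core points would expand a cluster.
    return direction_imbalance(point, neighbors) <= threshold

interior = [(1, 0), (-1, 0), (0, 1), (0, -1)]  # neighbors on all sides
boundary = [(1, 0), (1, 1), (1, -1)]           # neighbors on one side only
print(is_balanced((0, 0), interior))  # True
print(is_balanced((0, 0), boundary))  # False
```

In a density-based clustering loop, a test of this kind would be applied to each candidate core point before its neighborhood is expanded, so that one-sided boundary points do not pull noise into a cluster.
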
The algorithm improves clustering accuracy and gives better clustering results on various data sets.
5) We study dynamic feature extraction and document clustering for XML data. Because traditional static mining algorithms are incapable of knowledge discovery on dynamically changing XML documents, we summarize the basic concepts and definitions of existing work on finding FSC and FS, and design a corresponding structure-finding algorithm based on the temporal data model, which reduces the substantial time consumed by change detection between different versions. Then, we present the ADAWT model from the point of view of the change in scaling space between XML versions. Moreover, we construct a feature space from the various kinds of extracted dynamic structures, convert XML documents into feature vectors, and cluster large-scale XML documents with the VICA algorithm.
6) We design a multi-strategy data mining system, DBIN Miner. The development of database technology and the widespread application of database management systems have led to growing data volumes and increasing analysis requirements. Many kinds of data mining systems and business intelligence software are being developed continuously. We review the development history of data mining systems, analyze the characteristics of typical data mining systems, and design a multi-strategy data mining system. To handle large-scale data, we introduce and design the algorithm groupware idea, buffer processing technology and XML-based configuration files. The system integrates the algorithms designed above and is highly extensible.
The research results of this thesis advance research on semi-structured data models, classification and clustering for semi-structured data analysis, and dynamic feature extraction and document clustering of semi-structured data. Our contributions in theoretical research and prototype design have definite theoretical significance and application value.
Keywords/Search Tags: Data Mining, Semi-structured Data, Labelled Tree, Skew data, neighborhood balance, frequent change structure, Data Mining System