Research On The Clustering Technology Of JSON Semi-structured Document

Posted on: 2018-11-15
Degree: Master
Type: Thesis
Country: China
Candidate: D W Liu
Full Text: PDF
GTID: 2348330542969351
Subject: Management Science and Engineering
Abstract/Summary:
Semi-structured documents account for the vast majority of data on the Internet, and how to process them has become a focus of both business and academic attention. JSON is a typical semi-structured document format widely used on the Internet, yet the clustering of JSON documents has rarely been studied. In this thesis, we study clustering technology for JSON semi-structured documents, propose an improved hybrid clustering algorithm based on K-Means, apply the clustering model to government open data, and finally implement a clustering system. We first introduce the characteristics of semi-structured documents and compare JSON and XML documents both qualitatively and quantitatively. From the model perspective, we give a document vector representation for JSON semi-structured documents. Taking into account feature reduction as well as both the hybrid factor and the path level factor, we propose an improved hybrid clustering algorithm based on K-Means. From the application perspective, we describe the background of government open data and the data set used. We discuss clustering quality evaluation indices and design two experiments: one to evaluate clustering validity and one to determine the number of clusters k. From the system perspective, we implement a clustering system for JSON semi-structured documents and design its flow chart and modules. The concepts of frequent weight and specific weight are proposed to visualize the system's results.

The conclusions of this thesis are as follows: (1) Two factors that influence the ability to differentiate documents are proposed, path level and the hybrid factor, and both are verified experimentally. (2) Experiments show that the two factors must be examined together for their effect on clustering; considering the hybrid factor alone or the path level factor alone is not enough. (3) For clustering JSON semi-structured documents, the SC index is verified to be better than the CHI index. (4) A prototype system for clustering JSON semi-structured documents is developed and implemented. (5) Frequent weight and specific weight are put forward, which present the two parts of a JSON semi-structured document, its content and its structure, from the topic and model perspectives. Tag cloud technology is used for the display, and the presentation is clearly effective.
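As a rough illustration of the ideas summarized above (a path-based vector representation, a path level factor, and K-Means), the following minimal Python sketch may help. It is not the thesis's algorithm: the function names, the `level_decay` weighting (deeper paths count less), and the deterministic farthest-first initialization used in place of random seeding are all assumptions made for this example, and the hybrid factor is not modeled at all.

```python
import json
import math
from collections import Counter


def extract_paths(node, prefix=""):
    """Yield (path, level) for every key found in a parsed JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{prefix}/{key}"
            yield path, path.count("/")
            yield from extract_paths(value, path)
    elif isinstance(node, list):
        # Array items share the path of the key that holds the array.
        for item in node:
            yield from extract_paths(item, prefix)


def path_vector(doc_text, level_decay=0.5):
    """Map a JSON string to {path: weight}.

    Assumption: a path at level L contributes level_decay ** (L - 1),
    so deeper paths weigh less. The thesis may weight levels differently.
    """
    weights = Counter()
    for path, level in extract_paths(json.loads(doc_text)):
        weights[path] += level_decay ** (level - 1)
    return weights


def to_dense(vectors):
    """Turn sparse {path: weight} vectors into equal-length lists."""
    vocab = sorted({path for vec in vectors for path in vec})
    return [[vec.get(path, 0.0) for path in vocab] for vec in vectors]


def kmeans(points, k, iterations=20):
    """Plain K-Means with farthest-first initialization (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    docs = [
        '{"title": "x", "author": {"name": "n"}}',
        '{"title": "y", "author": {"name": "m"}}',
        '{"dataset": {"id": 1, "tags": ["a", "b"]}}',
        '{"dataset": {"id": 2, "tags": ["c"]}}',
    ]
    print(kmeans(to_dense([path_vector(d) for d in docs]), k=2))
```

Because only key paths are used here, two documents with the same structure but different values receive identical vectors; a fuller treatment would also account for content, which this sketch deliberately leaves out.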
Keywords/Search Tags:JSON, XML, K-Means, Mixture Factor, Path Level