Research And Design Of Automatic Clustering Based On Massive Scientific Literature

Posted on: 2019-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: D Zhang
GTID: 2348330542998713
Subject: Computer Science and Technology
Abstract/Summary:
As a carrier of scientific and technical knowledge, scientific literature plays an extremely important role in science and technology. Since the advent of the Internet age, the volume of scientific literature has grown geometrically, and manual screening of information clearly can no longer meet demand. Data mining of scientific and technical documents can help us obtain scientific and technological information more effectively. However, with the rapid development of science and technology and the emergence of large numbers of new disciplines and online terms, the traditional classification of disciplines can no longer meet the current need to divide scientific literature by subject. At the same time, clustering scientific literature places higher demands on the efficiency of the methods and on the supporting software and hardware. In this thesis, text features are extracted using the TF-IDF method combined with the characteristics of scientific literature. In a Hadoop distributed environment, text clustering is carried out with a canopy-improved k-means algorithm, achieving automatic clustering of massive scientific literature.

The main contents of this thesis are as follows.

First, I survey the body of knowledge on text clustering, focusing on text preprocessing, feature extraction, and clustering techniques. Feature extraction methods and clustering algorithms are introduced in detail. The characteristics of scientific literature and of distributed technology are also studied and introduced, and the difficulties of text clustering in a big data environment are analyzed.

Second, building on this basic research, this thesis proposes a feature extraction method that combines the noun features of the text with the TF-IDF method, and then applies dimensionality reduction to build the corresponding vector space model. The canopy-improved k-means algorithm is then used in the distributed environment to cluster the scientific literature.

Finally, I implemented the clustering of massive scientific literature, including the construction of the distributed environment, the implementation of the functional modules, and the analysis of the clustering results. At present, the system has been successfully applied to a project on Chinese Internet text viewpoint extraction, where it has achieved good clustering results.
Keywords/Search Tags: Text clustering, Scientific literature, K-means, Canopy, Hadoop