Research And Design Of Automatic Clustering Based On Massive Scientific Literature

Posted on:2019-03-03

Degree:Master

Type:Thesis

Country:China

Candidate:D Zhang

GTID:2348330542998713

Subject:Computer Science and Technology

Abstract/Summary:

As a carrier of scientific and technical knowledge, scientific literature plays an extremely important role in science and technology. Since the advent of the Internet age, the volume of scientific literature has grown geometrically, and manual screening of information clearly can no longer meet demand. Data mining of scientific and technical documents can help us obtain scientific and technical information more effectively. However, with the rapid development of science and technology and the emergence of large numbers of new disciplines and online vocabulary, the traditional classification of disciplines cannot meet the current need to divide scientific literature by subject. At the same time, clustering scientific literature at this scale places higher demands on the efficiency of the methods and on the supporting software and hardware. In this thesis, text features are extracted by combining the TF-IDF method with the characteristics of scientific literature. In a Hadoop distributed environment, text clustering is carried out with a Canopy-improved k-means algorithm. In this way, I achieved automatic clustering of massive scientific literature.

The main contents of this thesis are as follows. First, I survey the framework of text clustering, focusing on text preprocessing, feature extraction, and clustering techniques. The feature extraction methods and clustering algorithms are introduced in detail. The characteristics of scientific literature and of distributed technology are also studied and introduced, and the difficulties of text clustering in a big data environment are analyzed. Second, building on this basic research, the thesis proposes a feature extraction method that combines the noun features of the text with TF-IDF, and then applies dimensionality reduction to build the corresponding vector space model. Third, in the distributed environment, the Canopy-improved k-means algorithm is used to cluster the scientific literature. Finally, I implemented the clustering of massive scientific literature, which covers the construction of the distributed environment, the implementation of the functional modules, and the analysis of the clustering results. At present, this system has been successfully applied to a project on opinion extraction from Chinese Internet text, and it has achieved good clustering results.
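
The pipeline described in the abstract (TF-IDF features, then Canopy-seeded k-means) can be illustrated with a minimal single-machine sketch. It is not the thesis implementation: it does not use Hadoop, it assumes tokenization and noun filtering have already been done, and it skips dimensionality reduction. The function names, the smoothed IDF formula, the cosine distance, the threshold value `t2=0.6`, and the toy documents are all assumptions chosen for illustration. Because only the canopy centers are needed to seed k-means, the sketch uses just the tight Canopy threshold and omits the loose one.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized sparse TF-IDF dict per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption; other TF-IDF variants exist).
        vec = {t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine_distance(a, b):
    """Cosine distance between two L2-normalized sparse vectors."""
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())

def canopy_centers(vectors, t2):
    """Pick canopy centers; points within the tight threshold t2 are consumed."""
    remaining = list(range(len(vectors)))
    centers = []
    while remaining:
        c = remaining[0]
        centers.append(vectors[c])
        remaining = [i for i in remaining
                     if i != c and cosine_distance(vectors[c], vectors[i]) > t2]
    return centers

def mean_vector(members):
    """L2-normalized mean of sparse vectors (spherical k-means update)."""
    total = Counter()
    for v in members:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else dict(total)

def kmeans(vectors, initial_centers, max_iter=20):
    """k-means with cosine distance, seeded by the given centers."""
    centers = list(initial_centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)),
                          key=lambda k: cosine_distance(v, centers[k]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:
                centers[k] = mean_vector(members)
    return labels

# Toy corpus (hypothetical): two documents per topic.
docs = [
    ["hadoop", "mapreduce", "cluster", "hadoop"],
    ["mapreduce", "hadoop", "distributed"],
    ["protein", "gene", "sequence"],
    ["gene", "protein", "expression"],
]
vectors = tfidf_vectors(docs)
centers = canopy_centers(vectors, t2=0.6)  # the number of canopies fixes k
labels = kmeans(vectors, centers)
print(len(centers), labels)
```

Seeding k-means with canopy centers removes the need to choose k by hand and avoids purely random initial centroids. In a MapReduce setting, the assignment step would typically run in mappers and the centroid update in reducers, but that mapping is not shown here.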

Keywords/Search Tags:

Text clustering, Scientific literature, K-means, Canopy, Hadoop

Related items

1	Research On Parallel Clustering Algorithm Based On Hadoop Cloud Computing Platform
2	The Research Of Clustering Algorithm Based On Hadoop Cloud Computing Platform
3	Research On Multi-Topic Partition Method For Scientific And Technical Literature Set Based On Surface Text Information
4	Research On Hot Topics Discovery In Microblog Based On Distributed K-means Algorithms
5	Research On The Application Of User Behavior Analysis Based On Hadoop
6	Research On Parallelization Of Text Clustering Based On Hadoop
7	The Research Of Parallel Clustering Algorithm Based On Hadoop Platform
8	Research On Clustering Algorithm On Hadoop Platform
9	Bibliometric Analysis Of Output And Collaboration Of China's Scientific Literature
10	User Behavior Analysis In Software Version Management System