Research On Multi-Topic Partition Method For Scientific And Technical Literature Set Based On Surface Text Information

Posted on: 2016-04-15
Degree: Master
Type: Thesis
Country: China
Candidate: Y Xiao
GTID: 2308330470460965
Subject: Computer application technology
Abstract/Summary:
Scientific and technical literature is an important source of information on science and technology. Effective processing methods can reveal the latent information and knowledge it contains and help people find information quickly. Automatic categorization of scientific and technical literature is an important research topic in information retrieval and data mining, and it has become a focus of literature information processing. To categorize and evaluate such literature, its information features need to be studied and analyzed in depth, methods from machine learning, natural language processing, and related fields need to be applied appropriately, and effective ways of implementing literature analysis need to be investigated. Automatic categorization based on short, general text fields (such as the title, abstract, and keywords) is a research topic of considerable practical value.
Because these short text fields contain few feature terms, correlations among documents are hard to find. Moreover, since much scientific and technical literature is interdisciplinary or multidisciplinary, rigidly categorizing it with a general-purpose method is not appropriate. Therefore, based on an analysis of the specific characteristics of scientific and technical literature, this dissertation proposed a multi-label clustering method for multi-topic literature classification that uses the surface text information of the general features of the literature, so that a document may be assigned to different categories according to its different themes. As a result, a subject can be described from multiple perspectives, and the multidisciplinary character of the literature is readily shown.
Furthermore, a richness evaluation method for literature sets was presented, based on the automatic partitioning of the literature set.
First, terms were selected according to term frequency, and the document set was represented with the vector space model (VSM). Latent semantic analysis was then adopted to address the problems of traditional text information processing. The term-document matrix was decomposed and reduced in dimension by singular value decomposition (SVD), yielding a representation of the literature set in a low-dimensional latent semantic space that reveals the semantic relations among documents.
Second, the documents were clustered with a modified K-means algorithm. An adaptive method for determining the clustering granularity was proposed to handle the multi-topic labels of documents, which realized multi-topic clustering analysis of scientific and technical literature.
Finally, to provide evidence for evaluating the richness of the literature set, the diversity of the literature data was described quantitatively with indices including a diversity index and a degree of evenness.
Experimental results show that the proposed multi-label clustering method can partition scientific and technical literature into clusters and handle multi-topic labels effectively, which supports reasonable and accurate categorization of the literature. It can also provide an effective and feasible intelligent means for constructing and using scientific and technical literature libraries.
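The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation. It uses plain K-means rather than the modified variant, and a hypothetical fixed distance-ratio rule (the `ratio` parameter) in place of the adaptive granularity method. It also assumes the Shannon index and Pielou evenness as the diversity measures. None of these details are specified in the abstract, and the toy term-document matrix and all function names are invented for illustration.

```python
import math
import numpy as np

def lsa_embed(term_doc, k):
    """Map documents to a k-dimensional latent space with a truncated SVD."""
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Row i of the result is document i, scaled by the top-k singular values.
    return (np.diag(s[:k]) @ vt[:k]).T

def kmeans(points, k, iters=50):
    """Plain Lloyd's K-means with deterministic farthest-point seeding."""
    centroids = [points[0]]
    while len(centroids) < k:
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centroids], axis=0)
        centroids.append(points[int(np.argmax(nearest))])
    centroids = np.array(centroids)
    for _ in range(iters):
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        hard = dist.argmin(axis=1)
        updated = np.array([
            points[hard == j].mean(axis=0) if np.any(hard == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids

def multi_topic_labels(points, centroids, ratio=2.0):
    """Assumed rule: keep every cluster within `ratio` times the nearest-centroid distance."""
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return [set(np.flatnonzero(row <= ratio * row.min() + 1e-12).tolist()) for row in dist]

def shannon_and_evenness(label_sets, k):
    """Shannon diversity H over label counts, and Pielou evenness H / ln(k)."""
    counts = [sum(1 for s in label_sets if j in s) for j in range(k)]
    total = sum(counts)
    h = -sum((c / total) * math.log(c / total) for c in counts if c > 0)
    return h, (h / math.log(k) if k > 1 else 0.0)

# Toy term-document matrix (rows: terms, columns: documents), term-frequency weights.
# Documents 0-1 use only the first three terms, 3-4 only the last three, 2 uses all.
term_doc = np.array([
    [2, 1, 1, 0, 0],
    [1, 2, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 1, 1, 2],
    [0, 0, 1, 2, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

doc_vecs = lsa_embed(term_doc, k=2)
centroids = kmeans(doc_vecs, k=2)
labels = multi_topic_labels(doc_vecs, centroids)
h, evenness = shannon_and_evenness(labels, k=2)
print(labels)
print(h, evenness)
```

In this toy set, document 2 mixes the vocabulary of both groups, so the ratio rule gives it both topic labels while the other documents keep one each. The threshold value 2.0 is arbitrary; an adaptive scheme such as the one the abstract mentions would choose the granularity from the data.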
Keywords/Search Tags: automatic categorization of scientific and technical literature, latent semantic analysis, multi-topic clustering, richness evaluation of literature set