Domain Topic Clustering Based On Multi-source Corpus

Posted on: 2022-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: Z Zhou
Full Text: PDF
GTID: 2558306347951299
Subject: Computer Science and Technology
Abstract/Summary:
Mining academic topics in a research field from massive text data helps researchers grasp current research trends. At present, domain topics are mostly discovered from a single type of data, yet scientific and technological literature contains text of different structures, such as titles, keywords, and abstracts, and each structure emphasizes a different aspect of the same topic. This thesis studies topic clustering across multiple data sources: titles, keywords, and abstracts. Its main contributions are as follows.

(1) To address the reliance on a single data source when extracting topics from scientific and technological literature, this thesis observes that the differently structured text types in the literature describe topics from different angles, and it approaches the problem through topic clustering and fusion over multiple data sources.

(2) LDA considers only the document as a whole and cannot capture the sequential relationships between words, while the word2vec word vector model considers only the relationships between words within a local window and ignores the overall topic of the document. To address both limitations, this thesis proposes a topic pre-extraction model that uses LDA to obtain topic term weights and combines these weights with word2vec word vectors to build a weighted vector representation. It then improves how the k-means algorithm selects its initial cluster centers in order to raise the quality of topic extraction from short-text data sources. To further optimize the multi-source topics, the thesis proposes a hierarchical clustering topic fusion algorithm that iteratively merges topics according to the similarity between them, which increases the discrimination between topics.

(3) Choosing the number of topics by perplexity alone is unsound because it ignores both the correlation between words within a topic and the correlation between topics. This thesis therefore uses two objective indicators, topic coherence and the average similarity between topics, as the basis for selecting the number of topics.

To verify the proposed multi-source topic discovery method, the thesis builds a corpus of artificial intelligence literature. Comparative experiments show that the improved k-means algorithm achieves higher topic coherence and lower average similarity between topics in short-text topic extraction, and that on the artificial intelligence data set the hierarchical clustering topic fusion method extracts topics better than the LDA model does.
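
The iterative topic fusion and the average inter-topic similarity indicator described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not state the similarity measure, the merge rule, or the stopping criterion, so the cosine similarity, the size-weighted mean used for merging, and the `threshold` parameter are all assumptions, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity between two topic vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_similarity(topics):
    """Mean cosine similarity over all unordered pairs of topics."""
    n = len(topics)
    if n < 2:
        return 0.0
    total = sum(
        cosine(topics[i], topics[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return total / (n * (n - 1) / 2)


def fuse_topics(topics, threshold):
    """Agglomerative fusion: repeatedly merge the most similar pair of topics
    until no pair is at least `threshold` similar (hypothetical stopping rule)."""
    # Each cluster keeps its vector and the number of original topics it holds.
    clusters = [(list(vec), 1) for vec in topics]
    while len(clusters) > 1:
        best_sim, best_pair = -2.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][0], clusters[j][0])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair
        (vec_i, n_i), (vec_j, n_j) = clusters[i], clusters[j]
        # Assumed merge rule: size-weighted mean of the two topic vectors.
        merged = [(a * n_i + b * n_j) / (n_i + n_j) for a, b in zip(vec_i, vec_j)]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((merged, n_i + n_j))
    return [vec for vec, _ in clusters]


# Toy topic vectors, e.g. one per topic drawn from titles, keywords, and abstracts.
topics = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
fused = fuse_topics(topics, threshold=0.8)
print(len(topics), "topics before fusion,", len(fused), "after")
print("average similarity before:", average_similarity(topics))
print("average similarity after:", average_similarity(fused))
```

In this toy input the first two vectors are near duplicates, so they are merged and the average similarity between the remaining topics drops, which is the sense in which fusion "increases the discrimination between topics". The same `average_similarity` value could serve as one of the two indicators for comparing candidate numbers of topics.
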
Keywords/Search Tags:Multiple data sources, Topic clustering, Word vector model, k-means