Domain Topic Clustering Based On Multi-source Corpus

Posted on:2022-02-02

Degree:Master

Type:Thesis

Country:China

Candidate:Z Zhou

Full Text:PDF

GTID:2558306347951299

Subject:Computer Science and Technology

Abstract/Summary:

Mining academic topics in a research field from massive text data helps researchers grasp current research trends. At present, domain topic discovery mainly draws on a single type of data, whereas scientific and technological literature contains text of different structures, such as titles, keywords, and abstracts, each of which emphasizes different aspects of the same topic. This thesis studies topic clustering over multiple data sources: titles, keywords, and abstracts. The main contributions are as follows:

(1) To address the reliance on a single data type when extracting topics from scientific and technological literature, this thesis observes that differently structured text types describe topics from different angles, and approaches the problem through topic clustering and fusion across multiple data sources.

(2) LDA considers only the document as a whole and cannot capture sequential relationships between words, while the word2vec word vector model considers only relationships between words within a local range and ignores the overall topic of the document. To address this, the thesis proposes a topic pre-extraction model that uses LDA to obtain topic term weights and combines them with word2vec word vectors to build a weighted vector representation of topics. It also improves how the k-means algorithm selects its initial cluster centers, in order to improve the quality of topics extracted from short-text data sources. To further optimize the multi-source topics, the thesis proposes a hierarchical clustering topic fusion algorithm that iteratively fuses topics based on the similarity between them, which enhances the discrimination between topics.

(3) Choosing the number of topics by perplexity alone is unreasonable, because it ignores both the correlation among the words within a topic and the correlation between topics. The thesis therefore uses two objective indicators, topic consistency and the average similarity between topics, as the basis for selecting the number of topics.

To verify the proposed multi-source topic discovery method, the thesis builds a base corpus of artificial intelligence literature. Comparative experiments show that the improved k-means algorithm achieves higher topic consistency and lower average similarity between topics in topic extraction, and that the hierarchical clustering topic fusion method extracts topics from the artificial intelligence data set more effectively than the LDA model.
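
As an illustration only, the idea of fusing topics iteratively by similarity, and of using the average similarity between topics as a quality indicator, might be sketched as below. This is not the thesis's algorithm: the abstract gives no formulas, so the cosine similarity measure, the merge-by-averaging rule, the `threshold` parameter, and all function names are assumptions made for this sketch. Topics are assumed to be plain numeric vectors.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_topic_similarity(topics):
    """Mean pairwise cosine similarity; lower values mean better-separated topics."""
    pairs = list(combinations(topics, 2))
    if not pairs:
        return 0.0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def fuse_topics(topics, threshold=0.8):
    """Greedy agglomerative fusion of topic vectors.

    Repeatedly merges the most similar pair of topics until no pair reaches
    `threshold`. Both the threshold and the averaging merge rule are
    hypothetical choices for this sketch.
    """
    topics = [list(t) for t in topics]
    while len(topics) > 1:
        # Find the most similar pair among the current topics.
        (i, j), best = max(
            (((i, j), cosine(topics[i], topics[j]))
             for i, j in combinations(range(len(topics)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # Merge the pair into their element-wise mean and continue.
        merged = [(a + b) / 2 for a, b in zip(topics[i], topics[j])]
        topics = [t for k, t in enumerate(topics) if k not in (i, j)]
        topics.append(merged)
    return topics
```

Under these assumptions, calling `fuse_topics` on topic vectors gathered from the title, keyword, and abstract sources would collapse near-duplicate topics, and `average_topic_similarity` could then be compared before and after fusion, or across candidate topic counts.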

Keywords/Search Tags:

Multiple data sources, Topic clustering, Word vector model, k-means

Related items

1	Design And Implementation Of Topic Analysis System For Web Data In Social Network
2	Text Classification Based On Word Vector And Topic Vector
3	Automatic Discovery Technology Based On The Hot Topic Of Multiple Data Sources
4	Automatic Topic Labelling Based On Word Vectors
5	Network Hot Topic Discovery Based On Topic Model And Clustering Algorithm
6	Topic Quester:An Interactive Visual Exploration Of Topic Information From Multiple Sources
7	Micro-blog Hot Topics Detection Method Based On Hybrid Clustering
8	Research Of Block Data Clustering Algorithms Based On The Bag Of Word Model
9	Research On Deep Web Sources Clustering Based On Dirichlet Process
10	Multiple Documents Automatically Summary Based On Semantic Word Vector