| Mixture model clustering is a probabilistic, generative clustering method: it models text features with probability distributions and estimates the model parameters and the text assignment probabilities through statistical inference. Current research on mixture model based clustering focuses mainly on improving the model structure and the inference algorithm, while relatively little work addresses the text features. Text representation is the most basic part of text clustering: if the text features characterize the text more accurately, the clustering results will be better. This paper therefore starts from text representation to improve mixture model clustering. For academic literature, text features can be extracted from the body text, which contributes the most, and from citations, which reflect the dissemination of knowledge and carry a wealth of information about the relationships between publications. Many empirical studies have shown that citation features can improve text clustering to some degree. However, the citation features in current use are mainly based on traditional citation analysis, whose object of analysis is the reference records, which contain only limited information. With the development of natural language processing and text mining, full-text citation analysis is attracting increasing attention from researchers, who have analyzed citation position, citation frequency, and citation context to extract more of the value that citations carry. Based on full-text citation analysis, this paper proposes applying weighted citation features to text clustering: citation features are weighted according to citation position, citation frequency, and citation context so that they represent the text more precisely. In the empirical study, this paper optimized the traditional citation feature by using
citation frequency (the total number of times a reference is cited in the text) as the weight, and used the weighted feature in mixture model based text clustering to improve the clustering algorithm. The main research contents and conclusions are as follows: 1. The citation frequency was extracted from the text and used to weight the citation features (the reference relationship feature and the reference title feature); the frequency-weighted citation features performed better than the traditional citation features in mixture model based text clustering, which showed that citation frequency can be used to improve the traditional citation feature. 2. Texts were clustered with both the reference relationship feature and the original term feature (excluding reference terms); the results indicated that the reference relationship feature is an important supplement to the term feature. Retaining only high-frequency reference relationships (i.e., removing reference relationships of frequency 1 and 2) led to a better clustering result. 3. Terms from reference titles were added to the original term feature set, and terms in the title, abstract, body, and references were given different weights to analyze the value of terms in different positions of the text; on the data set used in this paper, the importance of the reference title fell between that of the abstract and that of the title, and the optimal weight ratio of the four positions (title : abstract : body : reference) was 4:2:1:3. 4. The result for each cluster was analyzed, and citation features were found to have a positive influence on clusters with overlapping content. Citation features reinforce the discrimination among clusters, which strengthens the ability of the clustering algorithm to identify clusters and thus improves clustering quality. This paper is a preliminary exploration of weighted citation features in mixture model based text clustering. The experiments confirmed that citation features can improve clustering, and the study provides a reference for further in-depth study and application of weighted citation features. |
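
The two weighting ideas summarized above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the function names, the dictionary-based document layout, and the sample tokens are all hypothetical, and only the 4:2:1:3 ratio and the removal of reference relationships of frequency 1 and 2 come from the text.

```python
from collections import Counter

# Position weights title : abstract : body : reference = 4:2:1:3,
# the ratio reported in the text. Everything else here is an assumption.
POSITION_WEIGHTS = {"title": 4.0, "abstract": 2.0, "body": 1.0, "reference": 3.0}


def weighted_term_features(sections):
    """Sum term counts over sections, scaling each count by its position weight.

    `sections` maps a position name to the list of tokens found there
    (a hypothetical input layout chosen for this sketch).
    """
    features = Counter()
    for position, tokens in sections.items():
        weight = POSITION_WEIGHTS[position]
        for term, count in Counter(tokens).items():
            features[term] += weight * count
    return dict(features)


def weighted_citation_features(citation_counts, min_frequency=1):
    """Use the in-text citation frequency as the weight of each reference.

    References cited fewer than `min_frequency` times are dropped, so
    min_frequency=3 removes reference relationships of frequency 1 and 2.
    """
    return {
        ref: float(freq)
        for ref, freq in citation_counts.items()
        if freq >= min_frequency
    }


# Hypothetical toy document, for illustration only.
sections = {
    "title": ["citation", "clustering"],
    "abstract": ["citation", "mixture"],
    "body": ["mixture", "model", "mixture"],
    "reference": ["clustering"],
}
term_features = weighted_term_features(sections)

citation_counts = {"ref_a": 1, "ref_b": 3, "ref_c": 2}
citation_features = weighted_citation_features(citation_counts, min_frequency=3)
```

In this sketch the two resulting dictionaries would be concatenated into one feature vector per document before being passed to a mixture model; how the paper actually combines them is not specified in the text.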