
Research on Chinese Text Similarity Computing Based on Semantic Weighting

Posted on: 2016-09-24    Degree: Master    Type: Thesis
Country: China    Candidate: K Du    Full Text: PDF
GTID: 2348330518999015    Subject: Information Science
Abstract/Summary:
In the Internet age, people's lives are inseparable from the Internet and communication technologies, and the volume of information in the networked society is growing geometrically. People need information, but raw information must be filtered and refined before it becomes useful. Chinese text is an important part of this information, and the effectiveness and efficiency of processing it have received extensive attention in the text processing field. Text similarity is a basic part of text information processing, and its calculation directly affects the results of downstream text mining.

Starting from the widely used vector space model for text representation, this paper studies cosine similarity, the similarity measure most frequently used with that model. When calculating the similarity of two texts, cosine similarity considers only the words the texts share and ignores the correlation between different words and phrases. Chinese text carries rich semantic information, and there are strong semantic relations between words, so making full use of this information to improve text similarity measurement is a natural step. To address the neglect of word semantics, this paper studies the text representation model, the computation of feature weights, and the cosine similarity formula, and analyzes the defects of the feature weight algorithm and of the cosine similarity formula. It proposes a new algorithm for calculating feature weights and a new method for calculating semantically weighted text similarity, which are the innovations of this research. The specific improvements are as follows:

(1) The semantic relations between words and phrases mainly comprise semantic similarity and semantic relatedness. Conceptually, semantic relatedness is the broader notion: it describes words that depend on and influence each other. This is instructive for building a complex network model of a single text. Using the rich semantic knowledge in Wikipedia, this paper calculates the semantic relatedness between words and constructs a complex network model of the text. An evaluation function, CF, is constructed from the structural characteristics of the complex network, and a complex-network-based CF-IDF algorithm is proposed to improve the calculation of feature weights.

(2) Semantic similarity is a special case of semantic relatedness: two words are semantically similar if they can replace each other in different contexts without altering the syntactic and semantic structure of the text. From the perspective of word similarity, two texts that share no words can still be recognized as similar if they contain similar words, which compensates for exactly what cosine similarity lacks. Therefore, this paper uses HowNet to calculate the semantic similarity between words and, taking into account the influence of feature weights on text similarity, incorporates the weighted semantic similarity into the cosine similarity calculation, with feature weights given by the CF-IDF algorithm.

(3) Both algorithms are verified experimentally. First, text classification is used to compare the classification accuracy of the TF-IDF and CF-IDF algorithms; the results show that the proposed CF-IDF algorithm improves classification accuracy. Second, text clustering is used to compare the clustering quality of cosine similarity and the semantically weighted text similarity; the results show that the proposed method improves clustering quality as measured by the F1 value.
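
The gap that the abstract attributes to cosine similarity can be illustrated with a small sketch. The code below is not the formula from the thesis, which the abstract does not give. It assumes one common way of weighting by word similarity (often called soft cosine), in which every pair of terms contributes in proportion to a word-similarity score. The names `soft_cosine`, `toy_sim`, and `TOY_SIM` are hypothetical, the similarity table stands in for a HowNet lookup, and the term weights stand in for CF-IDF values.

```python
import math


def cosine(a, b):
    """Plain cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def soft_cosine(a, b, sim):
    """Cosine similarity in which every term pair contributes, scaled by sim(t1, t2).

    Assumed formulation for illustration only; `sim` must be symmetric and
    return 1.0 for identical terms.
    """
    def inner(x, y):
        return sum(wx * wy * sim(tx, ty)
                   for tx, wx in x.items()
                   for ty, wy in y.items())

    denom = math.sqrt(inner(a, a)) * math.sqrt(inner(b, b))
    return inner(a, b) / denom if denom else 0.0


# Hypothetical word-similarity table standing in for a HowNet lookup.
TOY_SIM = {frozenset(("car", "automobile")): 0.9}


def toy_sim(t1, t2):
    """Return 1.0 for identical terms, a table value for known pairs, else 0.0."""
    if t1 == t2:
        return 1.0
    return TOY_SIM.get(frozenset((t1, t2)), 0.0)


# Hypothetical term weights; in the thesis these would come from CF-IDF.
doc1 = {"car": 1.0, "engine": 1.0}
doc2 = {"automobile": 1.0, "engine": 1.0}

plain = cosine(doc1, doc2)
weighted = soft_cosine(doc1, doc2, toy_sim)
print(f"cosine={plain:.2f} soft_cosine={weighted:.2f}")
```

With these toy inputs the plain cosine score counts only the shared word "engine", while the weighted score also credits the similar pair "car" and "automobile", so two texts with no words in common can still receive a nonzero score.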
Keywords/Search Tags: Complex network, Feature weights, Text similarity, Wikipedia, HowNet