Short Text Correlation Calculation Based On Wikipedia

Posted on: 2018-05-19
Degree: Master
Type: Thesis
Country: China
Candidate: Q Jing
Full Text: PDF
GTID: 2348330536465904
Subject: Software engineering
Abstract/Summary:
With the development of mobile communication technology and social media, information in the form of Chinese short texts has penetrated all areas of society and daily life. This rapidly growing volume of information carries great practical value, and how to mine the deeper value of these texts has become a widely studied question, which has in turn made Natural Language Processing a hot research topic. As a basic task in Natural Language Processing, semantic relatedness (correlation) computation is widely used in query expansion, word sense disambiguation, machine translation, knowledge extraction, automatic error correction, and other areas. However, a short text, as a new kind of information source, contains few words, so it is difficult to extract effective features from it. Because the information carried by a short text is limited, a large amount of background knowledge is needed to extend the sample features. Wikipedia is the world's largest multilingual, open online encyclopedia and is favored by many researchers, so this thesis chooses Chinese Wikipedia as its external corpus. Wikipedia's structural and semantic information also provides a basis for semantics-based analysis.

The work is divided into two parts, words and sentences. It first proposes a word relatedness calculation method based on Wikipedia, which draws on both the structural and the semantic information in Wikipedia. Wikipedia's main structures are the category system, the link structure, and the redirect and disambiguation pages. Using this structural information, the thesis presents a new method that combines category relatedness and link relatedness to compute the degree of relatedness between words. To explore deeper semantic information, it also proposes a method that computes word relatedness using association rules.

On this basis, the thesis proposes a method for calculating the relatedness between sentences from three aspects: sentence structure, clustering-based calculation, and relatedness calculation using clustering weighted by topic words. Sentence structure covers two aspects: word form and word order. Word-form relatedness is obtained mainly by counting how often words co-occur. The word-pair-based relatedness calculation is mainly concerned with the semantic information of the words in the sentence. Clustering groups semantically related words or texts into a class or cluster; here it is applied to the relatedness between sentences in order to improve the accuracy of the sentence relatedness calculation.

With the theoretical method in place, the experimental scheme is designed as follows. First, the Chinese Wikipedia corpus is downloaded. Second, the relatedness between words and between sentences is calculated. Finally, the results are compared with manual annotations. The word relatedness test sets are a manually translated version of WordSimilarity-353 and the Words-240 set compiled by the National University of Defense Technology. The sentence relatedness test set is the short-text semantic relatedness data set provided by a Chinese competition on databases, the World Wide Web, and knowledge extraction. The evaluation measures are the Spearman correlation coefficient and accuracy. In computing word relatedness, the Spearman coefficient of the proposed method is 2.8% higher than that of the traditional sentence correlation algorithm, and the accuracy reaches 73.3%, which is a good result. These results support the rationality and practicability of the method.
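
The abstract does not give the formulas, so the following Python sketch is only an illustration of the general idea, not the thesis's actual method. It assumes that category relatedness and link relatedness are each measured by Jaccard overlap and combined with a hypothetical weight `alpha`, and it then scores a list of predictions against manual annotations with the Spearman rank correlation coefficient. The function names, the weighting, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Overlap of two sets in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relatedness(w1: str, w2: str,
                categories: Dict[str, Set[str]],
                links: Dict[str, Set[str]],
                alpha: float = 0.5) -> float:
    """Weighted mix of category overlap and link overlap.

    Assumption: Jaccard overlap and the weight `alpha` stand in for the
    category and link measures, which the abstract does not specify.
    """
    cat = jaccard(categories.get(w1, set()), categories.get(w2, set()))
    link = jaccard(links.get(w1, set()), links.get(w2, set()))
    return alpha * cat + (1 - alpha) * link


def average_ranks(values: Sequence[float]) -> list:
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("Spearman is undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Toy data (invented for illustration, not taken from Wikipedia).
toy_categories = {"cat": {"Mammals", "Pets"}, "dog": {"Mammals", "Pets"},
                  "car": {"Vehicles"}}
toy_links = {"cat": {"Felidae", "Pet"}, "dog": {"Canidae", "Pet"},
             "car": {"Engine"}}

if __name__ == "__main__":
    pairs = [("cat", "dog"), ("cat", "car"), ("dog", "car")]
    predicted = [relatedness(a, b, toy_categories, toy_links) for a, b in pairs]
    human = [9.0, 1.5, 2.0]  # hypothetical manual annotation scores
    print(predicted)
    print(average_ranks(human))
```

In a real evaluation, the predicted scores for every word pair in a test set such as WordSimilarity-353 would be passed to `spearman` together with the human scores. Rank correlation is used because only the ordering of the pairs needs to agree with the annotators, not the absolute values.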
Keywords/Search Tags: Wikipedia, correlation, short text, semantic association, association rules