The Method Of Chinese Synonym Extraction Based On Large-scale Corpus

Posted on: 2015-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: H C Ma
Full Text: PDF
GTID: 2298330422983998
Subject: Computer software and theory
Abstract/Summary:
With the spread of computers and the rapid development of the Internet, the amount of information online is growing exponentially. At the same time, information resources are shared ever more widely, which brings great convenience to people's daily lives. People now face large volumes of information every day, and extracting valuable information from massive amounts of data has become a hot topic in information technology research. Chinese synonym extraction is a foundation of Chinese information processing research and plays different roles in different application fields. Because synonyms are scattered throughout this sea of information, this thesis takes a large-scale corpus as its research object in order to extract as many synonyms as possible.
With the continuous development of Internet technology and the explosive growth of information, Natural Language Processing and information retrieval play an increasingly important role in processing and extracting information, and synonyms have significant research and application value across Natural Language Processing tasks. Accordingly, this thesis proposes two synonym extraction methods: one that combines literal similarity with PageRank, and one that combines Pointwise Mutual Information (PMI) with Latent Semantic Analysis (LSA).
The first method, based on literal similarity and PageRank, makes full use of the literal similarity between words and of the semantic relations captured by PageRank. It considers both the matching order and compatibility of the two words and the relationship between them.
The second method combines PMI and LSA and builds on the principles of both. PMI uses the mutual information between two words as a simple and effective estimate of the association between words. LSA draws on ideas and techniques from computer science, mathematics, and information science to uncover the latent meaning of vocabulary. Synonyms are then extracted according to the semantic association between two words in the retrieval results.
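To make the PMI step concrete, here is a minimal sketch that computes PMI(x, y) = log2(p(x, y) / (p(x) p(y))) from document-level co-occurrence. The function name `pmi_scores`, the toy corpus, and the choice of counting co-occurrence per document are assumptions made for illustration; the abstract does not specify the thesis's actual counting scheme or corpus. English tokens stand in for segmented Chinese words.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(docs):
    """Return {(w1, w2): PMI} for word pairs that co-occur in at least one document.

    Probabilities are estimated as document frequencies divided by the
    number of documents (an assumption of this sketch).
    """
    n_docs = len(docs)
    word_df = Counter()   # number of documents containing each word
    pair_df = Counter()   # number of documents containing both words of a pair
    for doc in docs:
        vocab = sorted(set(doc))              # sorted so each pair has one canonical order
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    scores = {}
    for (w1, w2), joint in pair_df.items():
        p_xy = joint / n_docs
        p_x = word_df[w1] / n_docs
        p_y = word_df[w2] / n_docs
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Hypothetical toy corpus: each inner list is one tokenized document.
toy_docs = [
    ["computer", "pc", "software"],
    ["computer", "pc", "hardware"],
    ["software", "hardware", "network"],
    ["network", "information"],
]
scores = pmi_scores(toy_docs)
best_pairs = sorted(scores, key=scores.get, reverse=True)
```

Pairs with a high positive PMI co-occur more often than chance would predict, so they can serve as synonym candidates for a later filtering step; a PMI near zero indicates roughly independent words.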
The LSA-based synonym extraction method starts from a large word-document matrix and automatically builds a semantic space in which users can find relevant information. Words that are related to the main subject of a document remain close to that document in the semantic space. The positions of words and documents in the semantic space can therefore serve as a guide, and extracting information amounts to locating a point in that space. Candidates are ranked by the cosine value obtained from the dot product of the word vector and the document vector. The thesis thus presents two feasible similarity-based synonym extraction methods.
Finally, both extraction methods are verified experimentally, and recall, precision, and F-measure all improve.
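The LSA idea described above can be sketched as follows: factor a word-document count matrix with a truncated SVD, then compare vectors in the reduced space by cosine similarity. This is a hedged illustration, not the thesis's implementation. The helper names, the toy counts, the choice of k = 2, and the word-to-word comparison (the abstract mentions word and document vectors) are all assumptions; only `numpy.linalg.svd` and `numpy.linalg.norm` are real library calls.

```python
import numpy as np

def lsa_word_vectors(matrix, k):
    """Project the rows (words) of a word-document matrix into a k-dimensional latent space."""
    u, s, _vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]   # scale each latent axis by its singular value

def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical toy data: rows are words, columns are documents, entries are counts.
words = ["computer", "pc", "software", "network"]
counts = np.array([
    [2, 1, 0, 0],   # computer
    [1, 2, 0, 0],   # pc
    [1, 1, 1, 0],   # software
    [0, 0, 1, 2],   # network
], dtype=float)

vecs = lsa_word_vectors(counts, k=2)
sim_synonyms = cosine(vecs[0], vecs[1])    # computer vs. pc
sim_unrelated = cosine(vecs[0], vecs[3])   # computer vs. network
```

In this toy space, "computer" and "pc" appear in the same documents and end up nearly parallel, while "computer" and "network" share no documents and end up nearly orthogonal. A real system would choose k and the matrix weighting (for example raw counts versus TF-IDF) empirically.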
Keywords/Search Tags: Synonym, Synonym extraction, Literal similarity, Pattern matching, PageRank, Pointwise Mutual Information, Latent Semantic Analysis