
Research And Realization Of Chinese Text Proofing Method For Social Media

Posted on: 2016-09-29 | Degree: Master | Type: Thesis
Country: China | Candidate: X Zhang
GTID: 2208330461987639 | Subject: Software engineering
Abstract/Summary:
With the rapid development of social media, especially the rise of Web 2.0, a growing number of people express their feelings and opinions on social media networks. Compared with regular news text, user-generated social media text contains many wrong words, homophonic words, and other ill-formed input. How to proofread these ill-formed characters has become a basic research topic in natural language processing in recent years.

In this paper, on the basis of analyzing the characteristics of social media, we explore the classification and distribution of homophonic words and misspelled words. To improve the readability and normalization of the text, and to lay a theoretical and practical foundation for subsequent social media research, we approach the social media text proofreading task from the perspective of homophonic word and misspelled word proofreading. Specifically, the research covers the following two aspects:

1. We proofread social media text based on a language model. Misspelled and homophonic words are the most common non-standard diction in social media text. We construct candidates using a phoneme table, a table of similar-shaped words, and homophonic reduction knowledge at different granularities. We then determine the best result from each of the two candidate sets within an n-gram framework. The experimental results show that using this knowledge together with a character-level language model improves proofreading performance.

2. We proofread social media text based on semantic similarity. A test sentence and its standard form carry the same semantic information. Starting from this semantic perspective, we use semantic features to solve the text proofreading problem. First, we train word vectors with word2vec on a large unlabeled corpus. Second, we obtain the word vectors of the candidate words and their contexts. Finally, we select the best candidate by computing the semantic distance, which completes the proofreading. The experimental results show that the method based on semantic similarity is effective for text proofreading.
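
The two strategies summarized in the abstract can be illustrated with a small Python sketch. It is not the thesis implementation, only a toy under stated assumptions: the confusion table, the three-sentence corpus, and the two-dimensional vectors are invented for illustration, the language model is a character bigram model with add-one smoothing, and "semantic distance" is taken to be cosine similarity against the averaged context vector. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical confusion table: each character maps to homophones or
# similar-shaped characters that might be the intended one.
CONFUSION = {"再": ["在"], "在": ["再"]}


def train_bigram(sentences):
    """Count character unigrams and bigrams, with sentence boundary marks."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        chars = ["<s>"] + list(s) + ["</s>"]
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a sentence."""
    vocab = len(unigrams)
    chars = ["<s>"] + list(sentence) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(chars, chars[1:])
    )


def lm_correct(sentence, unigrams, bigrams, confusion=CONFUSION):
    """Try each single-character substitution; keep the best-scoring sentence."""
    best, best_score = sentence, log_prob(sentence, unigrams, bigrams)
    for i, ch in enumerate(sentence):
        for cand in confusion.get(ch, []):
            variant = sentence[:i] + cand + sentence[i + 1:]
            score = log_prob(variant, unigrams, bigrams)
            if score > best_score:
                best, best_score = variant, score
    return best


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def semantic_pick(candidates, context, vectors):
    """Pick the candidate whose vector is closest to the mean context vector."""
    ctx = [vectors[w] for w in context if w in vectors]
    if not ctx:
        return candidates[0]  # no usable context: fall back to the first candidate
    mean = [sum(v[d] for v in ctx) / len(ctx) for d in range(len(ctx[0]))]
    return max(
        candidates,
        key=lambda c: cosine(vectors[c], mean) if c in vectors else -1.0,
    )


# Toy data, invented for illustration only.
corpus = ["我在家", "我在学校", "再见"]
uni, bi = train_bigram(corpus)
fixed = lm_correct("我再家", uni, bi)

toy_vectors = {"家": [1.0, 0.0], "在": [0.9, 0.1], "再": [0.0, 1.0]}
picked = semantic_pick(["再", "在"], ["家"], toy_vectors)
```

In practice the vectors would come from a word2vec model trained on a large unlabeled corpus, as the abstract describes, and the candidate sets would be built from the phoneme and similar-shape tables rather than a hand-written dictionary.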
Keywords/Search Tags:misspelling correction, homophonic word normalization, language models, semantic similarity