Research And Realization Of Chinese Text Proofing Method For Social Media

Posted on:2016-09-29

Degree:Master

Type:Thesis

Country:China

Candidate:X Zhang

Full Text:PDF

GTID:2208330461987639

Subject:Software engineering

Abstract/Summary:


With the rapid development of social media, especially the rise of Web 2.0, a growing number of people express their feelings and opinions on social media networks. Compared with regular news text, user-generated social media text contains many wrong words, homophonic words, and other ill-formed input. How to proofread these ill-formed characters has become a basic research topic in natural language processing in recent years.

In this paper, based on an analysis of the characteristics of social media, we explore the classification and distribution of homophonic words and misspelled words. To improve the readability and normalization of the text, and to lay a theoretical and practical foundation for subsequent social media research, we approach the social media text proofreading task from the perspective of homophonic word and misspelled word proofreading. Specifically, the research covers the following two aspects:

1. We proofread social media text based on a language model. Misspelled and homophonic words are the most common non-standard usage in social media text. We construct candidates at different granularities using a phoneme table, a table of similar-shaped characters, and homophonic reduction knowledge. We then determine the best result from each of the two candidate sets within an n-gram framework. The experimental results show that using this knowledge together with a character-level language model can improve proofreading performance.

2. We proofread social media text based on semantic similarity. A test sentence and its standard sentence carry the same semantic information. Starting from this semantic perspective, we use semantic features to solve the text proofreading problem. First, we train word vectors with word2vec on a large unlabeled corpus. Second, we obtain the word vectors of the candidate words and their contexts. Finally, we select the best candidate by computing the semantic distance and complete the text proofreading. The experimental results show that the method based on semantic similarity is effective for text proofreading.
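The two selection steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the confusion table, the toy corpus, the two-dimensional vectors, and all function names are hypothetical. Add-one smoothing and cosine similarity are assumptions chosen for brevity, since the abstract does not specify the smoothing method or the distance measure. In practice the vectors would come from a word2vec model trained on a large corpus.

```python
import math
from collections import Counter

def train_bigram(corpus):
    """Count character unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        chars = ["<s>"] + list(sentence) + ["</s>"]
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return unigrams, bigrams

def log_prob(sentence, unigrams, bigrams):
    """Bigram log-probability with add-one smoothing (an assumed choice)."""
    vocab = len(unigrams)
    chars = ["<s>"] + list(sentence) + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab))
        for a, b in zip(chars, chars[1:])
    )

def candidates(sentence, confusion):
    """The input plus every variant that swaps one confusable character."""
    result = [sentence]
    for i, ch in enumerate(sentence):
        for alt in confusion.get(ch, []):
            result.append(sentence[:i] + alt + sentence[i + 1:])
    return result

def correct_by_lm(sentence, confusion, unigrams, bigrams):
    """Pick the candidate sentence the language model scores highest."""
    return max(candidates(sentence, confusion),
               key=lambda s: log_prob(s, unigrams, bigrams))

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def correct_by_similarity(context_words, candidate_words, vectors):
    """Pick the candidate closest to the mean vector of the context words."""
    dim = len(next(iter(vectors.values())))
    ctx = [vectors[w] for w in context_words if w in vectors]
    if ctx:
        mean = [sum(v[i] for v in ctx) / len(ctx) for i in range(dim)]
    else:
        mean = [0.0] * dim
    known = (c for c in candidate_words if c in vectors)
    return max(known, key=lambda c: cosine(vectors[c], mean), default=None)

# Toy data, invented purely for illustration.
toy_corpus = ["我们明天再见", "再见朋友", "我在家"]
toy_confusion = {"在": ["再"], "再": ["在"]}  # homophones (zai4)
toy_vectors = {
    "明天": [1.0, 0.2],
    "朋友": [0.9, 0.3],
    "再见": [0.95, 0.25],
    "在": [0.1, 1.0],
}

uni, bi = train_bigram(toy_corpus)
print(correct_by_lm("明天在见", toy_confusion, uni, bi))
print(correct_by_similarity(["明天", "朋友"], ["再见", "在"], toy_vectors))
```

Both functions share the same shape: generate a small candidate set from confusion knowledge, then rank it with a scoring function. Only the scorer changes between the language-model approach and the semantic-similarity approach.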

Keywords/Search Tags:

misspelling correction, homophonic word normalization, language models, semantic similarity


Related items

1	Chinese Word Semantic Similarity Measure And Its Application In Cross-language Information Retrieval
2	Research And Implementation Of Chinese Abbreviations Reduction Methods Based On Statistics
3	An Algorithm For Optimizing Word Similarity In "Knowledge Network"
4	The Research Of HowNet Based Word Similarity Computation And Its Application
5	Chinese-Old Bilingual Text And Sentence Similarity Calculation Research
6	Research On Normalization Of Microblog Text Based On Distributed Semantic Representation
7	The Research Of Semantic Similarity Computing Algorithm Based On HowNet
8	Semantic Similarity Computation And Application For Text Based On HNC Theory
9	Chinese Sentence Similarity Based On Semantic Role Labeling
10	Research And Application Of Word Similarity Based On Context