Chinese Informal Lexical Normalization Based On Q&A Community

Posted on:2019-02-20

Degree:Master

Type:Thesis

Country:China

Candidate:T Tian

GTID:2348330542998689

Subject:Information and Communication Engineering

Abstract/Summary:

There are many informal words in Internet text. These non-canonical words include not only input errors that users make unintentionally, but also common new Internet words that users coin for humor or to evade censorship. Because of these informal words, traditional NLP tools handle Internet text poorly. Replacing informal words with their formal counterparts during preprocessing is therefore an important way to improve the performance of downstream NLP tasks. This thesis studies the normalization of informal words: given an informal word, find the corresponding formal word. The main innovations and research results are as follows:

1. A word normalization technique based on a web knowledge base is proposed, and its key problems are studied. Previous work mainly treats word normalization as a spelling-correction problem from phonetic and lexical perspectives, which makes it difficult to model emerging mechanisms of informal word formation such as transliteration and synonym substitution. This thesis instead approaches word normalization as knowledge extraction. It first collects sentences that explain the meaning of informal words from the web knowledge base, and then uses semantic understanding and classification to extract the target formal words, completing the normalization task. A Q&A community serves as the web knowledge base, and the proposed technical scheme is validated on it.

2. The problem of extracting the target formal word from answers written by Q&A community users is studied, and a target formal word extraction algorithm based on sentence semantics is designed and implemented. Once the user answers that explain the meaning of a given informal word have been retrieved from a Q&A community by searching and crawling, the question becomes how to extract the target formal word accurately from those answers. Based on sentence semantics, this thesis proposes several LSTM-based extraction models, including an extraction model that predicts the start and end positions and a ranking model that encodes phrase chunks. A traditional pattern-matching model is also implemented, and the performance of the different models is compared experimentally.

3. Based on the generation mechanisms of informal words, a method is designed and implemented for deciding candidate formal words from the intrinsic relationship within each word pair. Relying on semantic comprehension alone makes it difficult to obtain high-quality informal-formal word pairs. This thesis therefore also models the association between the two words of a pair in order to further classify the noisy extracted informal-formal word pairs. Features such as pinyin and glyph features are designed for the different categories, the performance of several classifiers on this task is compared experimentally, and good classification results are achieved.
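The extraction step described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the regular-expression templates, the function names `extract_by_pattern` and `best_span`, the `max_len` limit, and the example scores are all assumptions made for illustration. The first function stands in for a pattern-matching baseline over explanatory answers such as "神马是什么的意思" ("神马 means 什么"). The second shows only the decoding step of a start/end position model, where the per-token scores would in practice come from a trained LSTM.

```python
import re

# Hypothetical explanation templates (assumed, not taken from the thesis).
# "X 是 Y 的意思" roughly reads "X means Y".
TEMPLATES = [
    r"{w}(?:就是|是|指的是|指)(?P<formal>.+?)的意思",
    r"{w}(?:就是|是)(?P<formal>[^，。,.!?！？]+)",
]


def extract_by_pattern(informal, answer):
    """Return the first formal-word candidate matched by a template, or None."""
    for template in TEMPLATES:
        pattern = template.format(w=re.escape(informal))
        match = re.search(pattern, answer)
        if match:
            return match.group("formal")
    return None


def best_span(start_scores, end_scores, max_len=4):
    """Pick the token span (i, j), i <= j, that maximizes start + end score.

    The scores are assumed to be per-token outputs of some trained model;
    max_len caps the span length in tokens.
    """
    best, best_score = (0, 0), float("-inf")
    for i, start in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = start + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best


if __name__ == "__main__":
    print(extract_by_pattern("神马", "神马是什么的意思"))

    tokens = ["神马", "是", "什么", "的", "意思"]
    start_scores = [0.1, 0.2, 2.0, 0.1, 0.0]  # made-up scores for illustration
    end_scores = [0.0, 0.1, 1.5, 0.3, 0.2]
    i, j = best_span(start_scores, end_scores)
    print("".join(tokens[i:j + 1]))
```

Fixed templates only cover answers phrased in the expected way, which is one plausible reason to compare them against learned extraction models as the abstract describes.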

Keywords/Search Tags:

Text Normalization, Q&A Community, Information Extraction

Related items

1	Research On Topic Feature Extraction And Text Classification In Social Internet Community
2	On The Normalization And Romanization Of Dai Language Texts For Textual Translation
3	Research On Multi-Feature-based Of Social Network Text Normalization
4	Design And Implementation Of Text Information Extraction On Smart Phone
5	Research On Normalization Of Microblog Text Based On Distributed Semantic Representation
6	Research On Text Normalization And Prosody Structure Prediction In Mandarin Text-to-Speech System
7	Text Information Extraction In Colorful Scene Image
8	Research On Video Text Information Extraction Based On Features Integration
9	Text Information Extraction Based On Domain Rules And Deep Learning
10	Research On Text Normalization And Its Key Technologies For Chinese Microblogs