Research On Multi-Feature-Based Text Normalization For Social Networks

Posted on:2021-04-10

Degree:Master

Type:Thesis

Country:China

Candidate:G Y Yang

GTID:2428330629484455

Subject:Cyberspace security

Abstract/Summary:

With the development of the Internet in recent years, social networks have become an important source of real-time information. According to the 44th China Statistical Report on Internet Development published by the China Internet Network Information Center (CNNIC), there were more than 904 million Internet users in China by March 2020. Moreover, the volume of real-time data generated on the Internet far exceeds the number of netizens. This data contains numerous non-standard phrases such as pronouns, abbreviations, and spelling variants. In this thesis, we analyze these terms and define them as non-standard words (NSW). Such words pose major challenges to downstream natural language processing (NLP) tasks. To address this problem, researchers have proposed many methods for normalizing text that contains non-standard words. However, as the social network environment evolves and keeps producing new NSW, these state-of-the-art methods are no longer sufficient for the task. The main contributions of this thesis are as follows:

1) We introduce and analyze state-of-the-art methods for the text normalization task. By comparing these methods, we identify their respective drawbacks in practice. Based on the characteristics of NSW in social networks, we propose a sequence-labeling method to identify and classify them.

2) We propose a sequence labeling model based on the Transformer. The model's input vectors combine Chinese features with word embeddings, which improves its ability to represent NSW. In addition, we use Stacked Denoising Autoencoders (SDA) to encode the feature vectors in order to improve the model's efficiency. The training process is specially designed around the characteristics of non-standard words so as to optimize the model's recognition performance.

3) We design several comparative experiments for the proposed model. Because no public corpus containing Chinese NSW is available, we crawled a large amount of text from popular social network platforms and constructed a corpus after preprocessing. We then ran experiments comparing our method and its parameter settings with other models. Our method achieves an F1 score of 85.4%. The results show that the proposed model performs well on the recognition and classification of various non-standard words, outperforms state-of-the-art methods, and is suitable for normalizing massive volumes of social network text.
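
To make the sequence-labeling framing and the reported F1 metric concrete, the following is a minimal sketch of how NSW spans might be read off a tagged sentence and scored. It assumes a BIO tagging scheme and exact-match span scoring, neither of which is stated in the abstract; the category names `ABBR` and `VAR` and the helper names `bio_to_spans` and `span_f1` are hypothetical, chosen only for illustration.

```python
from typing import List, Set, Tuple

# (start, end_exclusive, category); the categories are hypothetical labels.
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect labeled spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, category = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and category == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, category))
            start, category = None, None
        if tag.startswith("B-"):
            start, category = i, tag[2:]
    return spans


def span_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Exact-match span F1: boundaries and category must both agree."""
    gold_spans, pred_spans = set(), set()
    for idx, (g_tags, p_tags) in enumerate(zip(gold, pred)):
        gold_spans |= {(idx, *s) for s in bio_to_spans(g_tags)}
        pred_spans |= {(idx, *s) for s in bio_to_spans(p_tags)}
    tp = len(gold_spans & pred_spans)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_spans)
    recall = tp / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Toy example: the prediction finds the abbreviation but misses the variant.
gold = [["O", "B-ABBR", "I-ABBR", "O", "B-VAR"]]
pred = [["O", "B-ABBR", "I-ABBR", "O", "O"]]
score = span_f1(gold, pred)
```

In this toy case precision is 1.0 and recall is 0.5. Whether the thesis scores at the span level or the token level is not specified in the abstract, so this shows only one common convention.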

Keywords/Search Tags:

text normalization, natural language processing, transformer, autoencoder, sequence labelling, word embedding

Related items

1	Unsupervised Extractive Text Summarization Using Sentence Embedding
2	Research On Jointly Learning Word Embeddings And Latent Topics In Text
3	Generative Dialogue System Based On Transformer
4	Deep Contextual Word Embedding In Natural Language Processing
5	Research On Machine Learning For Natural Language Processing And Transmission
6	Sentence Vectorization Modeling And Text Level Application
7	Word Embedding Revision Based On Morphological Information And Semantic Lexicons
8	Improvement And Application Of Text Classification Based On RNN
9	Research On Text Classification Based On Natural Language Processing And Machine Learning
10	Research On Multi-granularity Chinese Word Embedding Based On Glyph Structure