
Research on Multi-Feature-Based Text Normalization for Social Networks

Posted on: 2021-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: G Y Yang
Full Text: PDF
GTID: 2428330629484455
Subject: Cyberspace security
Abstract/Summary:
With the development of the Internet in recent years, social networks have become an important source of real-time information. According to the 44th China Statistical Report on Internet Development published by the China Internet Network Information Center (CNNIC), China had more than 904 million Internet users as of March 2020. Moreover, the volume of real-time data generated on the Internet far exceeds the number of netizens. This data contains numerous non-standard phrases such as pronouns, abbreviations, and spelling variants. In this paper, we analyze these terms in depth and define them as non-standard words (NSW). Such words pose major challenges for downstream natural language processing (NLP) tasks. To address this problem, researchers have proposed many methods for normalizing text that contains non-standard words. However, as the social network environment evolves and new NSW appear frequently, these state-of-the-art methods are no longer sufficient. The main contributions of this paper are as follows:
1) We review and analyze state-of-the-art methods for the text normalization task. By comparing these methods, we identify their respective drawbacks in practice. Based on the characteristics of NSW in social networks, we propose a method based on sequence annotation to identify and classify them.
2) We propose a sequence labeling model based on the Transformer. The model's input vectors combine Chinese features with word embeddings, which strengthens the representation of NSW. In addition, we use Stacked Denoising Autoencoders (SDA) to encode the feature vectors in order to improve the efficiency of the model. The training process is specially designed around the characteristics of non-standard words so as to optimize the model's recognition performance.
3) We design several comparative experiments for the proposed model. Because no public corpus containing Chinese NSW is available, we crawled a large amount of text from popular social network platforms and constructed a corpus after preprocessing. We then ran experiments comparing our method and its parameter settings with other models. Our method achieves an F1 score of 85.4%.
The results show that the proposed model performs well on the recognition and classification of various non-standard words, outperforms the state-of-the-art methods, and is suitable for text normalization of massive social network data.
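As an illustration of how a sequence-labeling formulation of NSW recognition might be scored, the following is a minimal sketch that assumes a BIO tagging scheme and span-level exact matching. The abstract does not state the tagging scheme, the category names, or how the reported F1 score was computed, so the tag labels (`ABBR`, `VAR`) and the function names `bio_to_spans` and `span_f1` are hypothetical.

```python
from typing import List, Set, Tuple

# A span is (start index, end index exclusive, category).
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect labeled spans from a BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    # The trailing "O" is a sentinel that closes a span ending at the last token.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Span-level F1: a predicted span counts only if boundaries and category match."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two gold NSW spans, one of which the model recovers.
gold_tags = ["O", "B-ABBR", "I-ABBR", "O", "B-VAR"]
pred_tags = ["O", "B-ABBR", "I-ABBR", "O", "O"]
score = span_f1(gold_tags, pred_tags)
```

Span-level matching is stricter than token-level accuracy, since a span with a wrong boundary or wrong category earns no credit; whether the thesis uses this or a token-level measure is not stated here.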
Keywords/Search Tags:text normalization, natural language processing, transformer, autoencoder, sequence labelling, word embedding