
Bad Text Recognition Based on Multi-feature Graph Convolutional Embedding

Posted on: 2022-02-09
Degree: Master
Type: Thesis
Country: China
Candidate: S D Du
Full Text: PDF
GTID: 2518306605466314
Subject: Master of Engineering
Abstract/Summary:
According to the 47th "Statistical Report on China's Internet Development Status", issued by the China Internet Network Information Center in February 2021, the number of Internet users in China had reached 989 million by December 2020, and the proportion of Internet users accessing the Internet by mobile phone had increased to 99.7%. An open and inclusive Internet environment has enriched people's daily lives and improved both quality of life and work efficiency. However, the Internet is a double-edged sword: the content circulating on the network is of mixed quality, and the bad information in it can seriously affect ordinary users, infringe on personal safety, corrupt the overall social atmosphere, and even induce others to engage in illegal and criminal activities.

Bad texts usually appear as instant chat messages or comments. They carry strongly negative content, such as pornographic, violent, and illegal material, and are characterized by short text length, changeable language structure, and vague semantics. In addition, this kind of text is highly evasive: common bad words are often replaced by irregular forms and variants, which are difficult to identify effectively with traditional hand-written rules or machine learning algorithms.

Based on a survey of technologies related to text classification and bad information recognition, we propose a short text classification model based on multi-feature graph convolutional embedding, and design a set of efficient bad lexicon expansion and retrieval schemes that incorporate engineering considerations. Finally, a bad text recognition system is designed and implemented. The main research work of the thesis is as follows:

1) A short text classification model based on multi-feature graph convolutional embedding. To address the sparse features, variable language structure, and semantic ambiguity of short text, a graph representation method that combines multiple text features is proposed. It represents text dependencies, documents, and part-of-speech information as a heterogeneous graph, and then uses a Graph Convolutional Network (GCN) to capture the characteristics of the graph structure. Finally, the trained word embeddings are fed into a Transformer model to learn multi-hop information between nodes and to reduce the impact of noisy feature nodes on classification. The experimental results show that, compared with existing text classification baseline models, the MFGCE-Transformer model proposed in the thesis improves accuracy and F1 score on multiple public Chinese short text data sets.

2) A bad word recognition model based on a Trie-tree and term frequency-inverse document frequency (TF-IDF). Bad text is evasive and irregular, which interferes with model recognition. To address this, bad vocabulary expansion schemes are designed from the perspectives of synonyms, heteromorphic words, and variant words: one based on word embedding and locality-sensitive hashing, one based on a hidden Markov model and Pinyin, and one based on a Chinese word-splitting dictionary. The bad vocabulary is then modeled with a Trie-tree, and the TF-IDF value is used as the stop node mark. Experiments show that the bad word recognition model proposed in the thesis is highly efficient.

3) The implementation of the bad text recognition system. Based on the above ideas and on software engineering methods, the bad text recognition system is designed and implemented. Collected and labeled microblog comments and desensitized social network chat data are used as test data. The test results show that the system performs well on the bad text recognition task and also has good scalability.

The topic of the thesis comes from a corporate project. The proposed algorithmic improvements have been applied in actual projects, where they improve recognition quality and efficiency and have strong practical value.
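
The Trie-based lexicon lookup described in item 2) can be illustrated with a minimal sketch. It is not the thesis's implementation: the names `TrieNode`, `LexiconTrie`, and `score`, as well as the example words and weights, are hypothetical, and the number stored at each terminal node simply stands in for a precomputed TF-IDF value used as the stop node mark.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrieNode:
    """One character position in the lexicon trie (hypothetical structure)."""
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    # None for interior nodes. A number marks the end of a lexicon entry and
    # is assumed to be that entry's precomputed TF-IDF weight.
    weight: Optional[float] = None


class LexiconTrie:
    """Stores lexicon entries character by character for fast scanning."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, word: str, weight: float) -> None:
        """Insert a lexicon entry and mark its last node with the weight."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.weight = weight

    def find_all(self, text: str) -> List[Tuple[int, str, float]]:
        """Return (start index, matched entry, weight) for every match."""
        hits: List[Tuple[int, str, float]] = []
        for start in range(len(text)):
            node = self.root
            for end in range(start, len(text)):
                node = node.children.get(text[end])
                if node is None:
                    break  # no lexicon entry continues with this character
                if node.weight is not None:
                    hits.append((start, text[start:end + 1], node.weight))
        return hits


def score(trie: LexiconTrie, text: str) -> float:
    """Sum the weights of all matched entries (an illustrative scoring rule)."""
    return sum(weight for _, _, weight in trie.find_all(text))


# Hypothetical lexicon with made-up weights.
trie = LexiconTrie()
trie.add("spam", 0.8)
trie.add("spammer", 1.2)
trie.add("scam", 0.5)

matches = trie.find_all("a spammer scam")
total = score(trie, "a spammer scam")
```

Because every lookup follows a single path from the root, the cost of scanning a text depends on the text length and the longest lexicon entry rather than on the number of entries, which is one plausible reason for choosing a Trie here. Summing the weights is only one possible way to use the stored values.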
Keywords/Search Tags:Bad text, Short text classification, Graph convolutional network, Text multifeature, Self-attention mechanism