A Study Of Offending Words Mining Based On Semantic Similarity

Posted on: 2024-09-30
Degree: Master
Type: Thesis
Country: China
Candidate: H Feng
Full Text: PDF
GTID: 2568306938990569
Subject: Statistics
Abstract/Summary:
With the rapid development of Internet social media, people are entering an era of rapid change and sharing. The integration of Internet social media with the traditional information industry has produced many new applications closely tied to daily life and has increased users' desire to use them. The growing variety of content on the Internet poses many challenges to the content security of the companies behind these applications. Among content security risks, illegal content is a common one. The main difficulty lies in the continuous evolution of illegal words, which are difficult to discover through rules, difficult to enumerate, easy to miss, and hard to update in a timely manner. This thesis tries to provide a solution to this problem.

At present, topic model technology has made great progress and has become an important method of text information processing. However, because of the large amount of data between words in short text, traditional text mining methods cannot effectively mine the topic information in short text. In addition, this thesis proposes a method that expands the data using the word co-occurrence information in the text set to obtain the topic distribution, an idea that has been improved considerably. At present, research on this kind of topic model rarely involves the semantics of co-occurring words. Based on semantic analysis, this thesis introduces a two-word short-text topic model (the SA-BTM model, based on Word2vec) and applies it to the semantic association between two words. It also discusses how to determine the topic dimension, which is closely related to the effect of topic mining.

The work of this thesis includes:
1) The role of word sense association in topic mining is discussed. Based on training on a large amount of text data, this thesis proposes a vector form of word embedding that represents meaning relations, and analyzes and compares the semantic relations between words using semantic similarity. The study compares the semantic similarity obtained under different semantic similarity range expansions.
2) On the basis of semantic similarity, a two-word short-text topic model is established. By analyzing the semantic relationships between words in the text, this thesis selects appropriate word pairs as the basis for propositional reasoning. On this basis, the topic information in the text can be mined effectively. The results are compared with those of other models.
3) Classification methods that combine text representation with an RNN-based attention mechanism are compared, and the specific Word2vec model is finally optimized.

In traditional topic information mining, the topic dimension has a great influence on the mining results. On this basis, this thesis proposes a model-based topic model and mining analysis method, and verifies through an example that the method can quickly find an appropriate topic dimension. The experimental data are built from a variety of texts, such as the Zhihu question set, and the model and method described in this thesis are used to mine topics effectively and to determine the topic dimension quickly. Comparison with other models verifies the effectiveness of the proposed method.
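The abstract does not give the details of SA-BTM, so the following is only a minimal sketch of the general idea it describes: selecting co-occurring word pairs from a short text by the semantic similarity of their word embeddings. The toy vectors, the function names `cosine_similarity` and `semantic_biterms`, and the 0.8 threshold are all assumptions made for illustration. A real pipeline would use trained Word2vec embeddings and tokenized text rather than these made-up values.

```python
from itertools import combinations
from math import sqrt

# Made-up 3-dimensional vectors standing in for trained Word2vec embeddings.
# These values are illustrative assumptions, not data from the thesis.
TOY_VECTORS = {
    "spam": [0.9, 0.1, 0.0],
    "scam": [0.8, 0.2, 0.1],
    "weather": [0.0, 0.1, 0.9],
    "rain": [0.1, 0.0, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def semantic_biterms(tokens, vectors, threshold):
    """Return the unordered word pairs of one short text whose embedding
    similarity is at least `threshold`. Tokens without a vector are skipped."""
    known = sorted({t for t in tokens if t in vectors})
    return [
        (a, b)
        for a, b in combinations(known, 2)
        if cosine_similarity(vectors[a], vectors[b]) >= threshold
    ]


if __name__ == "__main__":
    text = ["spam", "scam", "weather", "rain"]
    for pair in semantic_biterms(text, TOY_VECTORS, threshold=0.8):
        print(pair)
```

In this sketch, raising the threshold keeps fewer, more closely related pairs, while lowering it admits looser associations. This is one plausible reading of the "semantic similarity range" the abstract compares, not a confirmed description of the thesis method.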
Keywords/Search Tags:Similarity, The banned word, RNN, Classification, Mining