
Research On Word-level Interactive Text Classification Combined With Self-attention Mechanism

Posted on: 2022-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: S Y Wu
Full Text: PDF
GTID: 2518306557477254
Subject: Computer technology
Abstract/Summary:
Ours is an age of data, and people increasingly obtain information from news texts on the Internet. The sheer volume of text and the rapid turnover of news events make it especially important to find useful information efficiently. Text classification is a fundamental task in natural language processing. Classification models based on shallow learning dominated the field for a time, but because they require extensive manual feature extraction and suffer from sparse data, they cannot serve as an efficient classification method. Deep neural network models solve these problems well. However, their classification relies mainly on text-level representations and overlooks fine-grained classification clues, namely the matching signals between words and classes. News texts contain strongly directive words, such as "missile" and "atomic bomb", that point clearly to the military domain, yet such fine-grained clues receive little attention in deep neural network models. In addition, because web-page data are complex and diverse, text classification researchers spend a great deal of time searching for suitable datasets in practice.

To address these problems, this thesis studies text classification from two aspects: word-level interactive text classification and improving the performance of web crawlers.

First, motivated by the strongly directive words, and pairs of words, in news sentences that clearly point to a certain category, we propose a word-level interactive text classification model fused with a self-attention mechanism. In the computation of self-attention, any two words in a sentence are related in a single step, so no matter how far apart two words are, as long as they have a semantic or logical relationship they can be connected and the information between them fully exploited. The word-level interaction layer of the model explicitly computes matching scores between words and classes and incorporates these word-level matching signals into the classification, paying full attention to fine-grained clues such as individual words and making classification more precise and efficient.

Second, we study and analyze web crawling from three perspectives: crawling strategy, program performance optimization, and data-update strategy. Because hot and key news items tend to appear on top-level pages, we adopt a breadth-first traversal crawling strategy. To cope with the many address links left waiting while data is crawled, we use non-blocking IO interfaces so that more CPU time goes to useful work, improving the program's execution efficiency. Finally, after comparing and analyzing existing data-update methods, we update the crawled data incrementally.
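To make the word-level interaction idea concrete, the sketch below pairs a self-attention layer with an explicit word-class matching step. It is a minimal PyTorch sketch; the layer sizes, the use of `nn.MultiheadAttention`, the learnable class embeddings, and the max-pooled aggregation are illustrative assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn

class WordClassInteractionClassifier(nn.Module):
    # Illustrative sketch only: dimensions and the aggregation rule are
    # assumptions, not the thesis's exact design.
    def __init__(self, vocab_size, num_classes, dim=128, heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Self-attention relates any two words in one step, however far apart.
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # One learnable embedding per class for word-class matching.
        self.class_embed = nn.Parameter(torch.randn(num_classes, dim))

    def forward(self, token_ids):                # (batch, seq_len)
        x = self.embed(token_ids)                # (batch, seq, dim)
        x, _ = self.self_attn(x, x, x)           # contextualized words
        # Explicit word-level matching scores between words and classes.
        scores = torch.einsum('bsd,cd->bsc', x, self.class_embed)
        # Keep the strongest word-level clue per class, so a single
        # directive word (e.g. "missile") can dominate the prediction.
        return scores.max(dim=1).values          # (batch, num_classes)

model = WordClassInteractionClassifier(vocab_size=30000, num_classes=10)
logits = model(torch.randint(0, 30000, (2, 16)))  # two 16-token sentences
```

Max-pooling over the word axis is one simple aggregation choice; attention-weighted pooling over the matching scores would be an equally plausible reading of the abstract.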
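The crawling side can be sketched similarly. The snippet below is a hypothetical sketch using Python's asyncio with the aiohttp client (the thesis does not name its implementation): a FIFO frontier yields breadth-first order, so top-level hot news is fetched first, and batched non-blocking fetches let the event loop keep working while requests wait on the network.

```python
import asyncio
import re
from collections import deque

import aiohttp

def extract_links(html):
    # Crude href extraction; a real crawler would use an HTML parser.
    return re.findall(r'href="(https?://[^"]+)"', html)

async def fetch(session, url):
    try:
        async with session.get(url,
                               timeout=aiohttp.ClientTimeout(total=10)) as resp:
            return url, await resp.text()
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return url, None

async def crawl_bfs(seed_urls, max_pages=100, concurrency=8):
    frontier = deque(seed_urls)        # FIFO queue -> breadth-first order
    seen = set(seed_urls)
    pages = {}
    async with aiohttp.ClientSession() as session:
        while frontier and len(pages) < max_pages:
            # Pop a batch and fetch it concurrently: while one request
            # waits on the network, the others make progress.
            batch = [frontier.popleft()
                     for _ in range(min(concurrency, len(frontier)))]
            for url, html in await asyncio.gather(
                    *(fetch(session, u) for u in batch)):
                if html is None:
                    continue
                pages[url] = html
                for link in extract_links(html):
                    if link not in seen:       # visit each page only once
                        seen.add(link)
                        frontier.append(link)
    return pages

# pages = asyncio.run(crawl_bfs(["https://example.com/news"]))
```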
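For the incremental data update, one common realization, assumed here since the abstract does not detail the mechanism, is to keep a content hash per URL and reprocess only pages that are new or have changed:

```python
import hashlib
import json
import os

def incremental_update(fetched_pages, index_path='page_index.json'):
    # Hypothetical incremental-update step: the index file name and the
    # MD5-based change check are illustrative assumptions.
    index = {}
    if os.path.exists(index_path):
        with open(index_path, encoding='utf-8') as f:
            index = json.load(f)
    changed = {}
    for url, html in fetched_pages.items():
        digest = hashlib.md5(html.encode('utf-8')).hexdigest()
        if index.get(url) != digest:   # new page, or its content changed
            index[url] = digest
            changed[url] = html        # only these need reprocessing
    with open(index_path, 'w', encoding='utf-8') as f:
        json.dump(index, f)
    return changed
```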
Keywords/Search Tags: Deep neural network, Text classification, Self-attention, Word-level interaction, Web crawler