
Research On Information Parsing Based On Text Classification

Posted on: 2020-08-12  Degree: Master  Type: Thesis
Country: China  Candidate: L Fu
GTID: 2428330575963024  Subject: Computer Science and Technology
Abstract/Summary:
Information parsing is an important and challenging task in natural language processing (NLP), and it plays a key role in applications such as public opinion monitoring, web search, and intelligent question answering. In recent years, with the continuing development of deep learning, research on information parsing has produced rich results that are widely applied in NLP engineering. Some shortcomings remain, however. Supervised deep learning methods require a large amount of high-quality, manually labeled training data, which is time-consuming and labor-intensive to produce. In Chinese text, word segmentation can be ambiguous, while a single Chinese character on its own carries meaning that is neither precise nor rich; furthermore, the relative importance of words and characters varies from one situation to another. These issues cause problems when information parsing is applied in engineering practice. To address them, this dissertation first proposes a new active learning method and combines it with deep learning. It then proposes a hybrid of character-level and word-level features, concatenated with different weights, so that the model's final result takes both feature levels into account. This dissertation studies information parsing based on text classification. The main work is as follows:

(1) A new active learning method is proposed and combined with deep learning methods to perform information parsing. Supervised deep learning models typically require a large amount of high-quality labeled sample data for training. Obtaining such data by hand is cumbersome and unreliable, as well as time-consuming and labor-intensive. Active learning alleviates this problem by automatically selecting a small number of unlabeled samples for humans to label by hand: it repeatedly selects the samples that most need labeling and then iteratively trains the deep neural network on them until the expected experimental results are achieved. This dissertation proposes an active learning method with three probabilistic sample-selection strategies based on deterministic criteria, which effectively addresses the problem that a supervised deep learning method requires a large amount of manually labeled data. The experimental results show that, compared with using a deep neural network alone, combining active learning with deep neural networks reduces the amount of labeled training data required to reach a given extraction accuracy by 45.79%.

(2) Building on a model that combines a convolutional neural network with a bidirectional long short-term memory network and an attention mechanism, this dissertation proposes a hybrid of character-level and word-level features, concatenated with different weights, to improve the performance of information source analysis. Unlike Western languages, Chinese text has no separators between words, so Chinese word segmentation must be performed first. However, a sentence may admit several semantic readings, which leads to several different segmentation results; that is, Chinese word segmentation is ambiguous. Chinese characters, by contrast, are already discrete units, so splitting text into characters involves no ambiguity, but the meaning of a single character is neither precise nor rich. Moreover, the relative importance of words and characters differs from one situation to another. The proposed hybrid therefore lets the model consider both levels of features at the same time, so that each compensates for the other's shortcomings. The experimental results show that, compared with using word-level features alone and character-level features alone, the proposed method improves results by 1.20% and 1.69% respectively on the THU dataset, and by 2.28% and 5.13% respectively on the Enterprise announcement dataset.
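The sample-selection step of the active learning loop in (1) can be illustrated with a small sketch. The abstract does not say which three selection strategies the dissertation uses, so the three scores below (least confidence, margin, entropy) are standard uncertainty measures chosen here as stand-ins, and the names `least_confidence`, `margin_score`, `entropy_score`, and `select_for_labeling` are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def least_confidence(probs: Sequence[float]) -> float:
    """1 minus the top class probability; higher means less certain."""
    return 1.0 - max(probs)


def margin_score(probs: Sequence[float]) -> float:
    """Negated gap between the two most likely classes; higher means less certain."""
    top = sorted(probs, reverse=True)
    return -(top[0] - top[1])


def entropy_score(probs: Sequence[float]) -> float:
    """Shannon entropy of the predicted distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(
    pool_probs: Sequence[Sequence[float]],
    k: int,
    score: Callable[[Sequence[float]], float] = entropy_score,
) -> List[int]:
    """Return indices of the k pool samples the model is least certain about.

    pool_probs[i] is the model's predicted class distribution for unlabeled
    sample i. Assumption: the selected samples are then labeled by a human
    and added to the training set before the network is retrained.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: score(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical predictions for three unlabeled documents over three classes.
pool = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # very uncertain prediction
    [0.70, 0.20, 0.10],  # moderately uncertain prediction
]
to_label = select_for_labeling(pool, k=2)
```

In a full loop, this selection would alternate with human labeling and retraining until the target accuracy is reached; the stopping rule and batch size `k` are left open here because the abstract does not specify them.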
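The weighted concatenation of character-level and word-level features in (2) can be sketched as follows. This is only an illustration under assumptions: the abstract does not state how the weights are obtained (fixed or learned) or in which order the two vectors are joined, so the fixed scalar weights, the character-first order, and the name `weighted_concat` are all hypothetical.

```python
from typing import List, Sequence


def weighted_concat(
    char_feat: Sequence[float],
    word_feat: Sequence[float],
    char_weight: float,
    word_weight: float,
) -> List[float]:
    """Scale each feature vector by its weight, then join them end to end.

    Assumption: each weight is a single scalar per feature level. A real
    model might instead learn these weights together with the network.
    """
    scaled_char = [char_weight * x for x in char_feat]
    scaled_word = [word_weight * x for x in word_feat]
    return scaled_char + scaled_word


# Hypothetical feature vectors produced by the two encoders.
char_features = [1.0, 2.0]
word_features = [3.0, 4.0, 5.0]
fused = weighted_concat(char_features, word_features, char_weight=0.5, word_weight=2.0)
```

The fused vector keeps both feature levels side by side, so a downstream classifier can draw on either one; the weights control how strongly each level contributes.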
Keywords/Search Tags: Information extraction, Text classification, Active learning, Deep learning, Natural language processing