A Research On Text Analysis And Representation Based On Semantic Information

Posted on:2019-12-02

Degree:Master

Type:Thesis

Country:China

Candidate:Y X Sun

Full Text:PDF

GTID:2428330545985295

Subject:Computer technology

Abstract/Summary:

Text analysis and representation are indispensable steps in natural language processing tasks. Their purpose is to extract task-relevant information from text. Semantic information is central to natural language processing, but traditional analysis and representation methods rarely take the semantic information in texts into account. With the development of neural networks, natural language processing techniques based on deep learning have been widely adopted and have outperformed traditional methods on many tasks, mainly because neural networks make better use of the semantic information in texts. However, neural networks have limitations, and there are situations in which they cannot be applied directly. How to analyze text based on semantic information therefore remains a question worth studying.

Texts in natural language processing vary widely in length, and it would be unreasonable to apply the same analysis method to all of them. This thesis studies how to perform semantics-based text analysis and representation on long texts and short texts, and explores effective analysis and representation methods through two specific tasks: text categorization and abbreviation generation from English paper titles. It also examines some current methods and improves on their deficiencies.

For both long and short texts, existing methods mainly analyze text at the lexical level and make insufficient use of semantic information. In tasks with long texts, such as text categorization, the popular vector space model essentially extracts information related to word frequency and document frequency and ignores semantic information such as word order and word meaning. In tasks with short texts, such as abbreviation generation from English paper titles, current methods are either rule-based or treat the problem as a sequence labeling task, without considering semantic information during text analysis.

In the process of text analysis and representation, this thesis introduces the semantic information of the words in the text to improve the quality of analysis and representation. Structural information about long and short texts is also introduced to strengthen the use of semantic information. This thesis proposes effective semantics-based text analysis and representation methods for long texts and short texts. The main contributions are as follows:

1. For text classification with long texts, the thesis improves the text representation method. Multi-instance learning is introduced to reduce the effect of noise on text categorization. Unlike previous text representation methods, the text is divided into several segments according to certain rules, and a document is then represented as a bag of multiple feature vectors. When dividing a document into segments, the method aims to minimize the interaction between content on different topics, so as to reduce the impact of noise on categorization. Experimental results show that the method effectively improves text categorization performance.

2. For abbreviation generation from English paper titles, a short-text task, the thesis adds syntactic-level and semantic-level analysis to the lexical-level analysis of the given text, which makes the identification of important words more accurate. In addition, an n-gram language model is built from existing words and used to select among the candidate abbreviations generated by the system. Experiments show that the system outperforms previous methods and some online systems in terms of recall, and examples of system-generated candidates show that the abbreviations it produces are closer to those given by the authors.
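
As a rough illustration of the bag representation described in the first contribution above, the following Python sketch splits a document into segments and turns each segment into its own feature vector, so that the document becomes a bag of vectors and not a single vector. The segmentation rule (fixed windows of consecutive sentences), the term-count features, and all function names are assumptions made for this example; the abstract does not state the rules or features actually used in the thesis.

```python
import re
from collections import Counter


def split_into_segments(document, sentences_per_segment=2):
    """Split a document into segments of consecutive sentences.

    Fixed-size sentence windows are an assumption for illustration;
    the thesis only says segments follow "certain rules".
    """
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", document)
        if s.strip()
    ]
    return [
        " ".join(sentences[i:i + sentences_per_segment])
        for i in range(0, len(sentences), sentences_per_segment)
    ]


def segment_to_vector(segment, vocabulary):
    """Map one segment to a term-count vector over a fixed vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", segment.lower()))
    return [counts[word] for word in vocabulary]


def document_to_bag(document, vocabulary, sentences_per_segment=2):
    """Represent a document as a bag (list) of per-segment feature vectors."""
    segments = split_into_segments(document, sentences_per_segment)
    return [segment_to_vector(seg, vocabulary) for seg in segments]


# Hypothetical two-topic document: sports first, finance second.
doc = (
    "The team won the match. The striker scored twice. "
    "Shares fell sharply today. Investors sold bank stock."
)
vocab = ["match", "striker", "shares", "stock"]
bag = document_to_bag(doc, vocab)
print(bag)
```

In this toy input each topic lands in its own instance, so a multi-instance classifier could label the bag from its most informative instance without the unrelated segment diluting the features. A single document-level vector would mix both topics together.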

Keywords/Search Tags:

Text Representation, Text Categorization, Multiple-Instance Learning, Abbreviation Generation, Semantic Information

Related items

1	Text Representation And Algorithms For Chinese Text Classification
2	Research On Semantic Analysis And Generation Technology For Text Sequence Data
3	Research On The Term Weighting Scheme And Text Representation Strategy For Text Categorization
4	The Study Of Chinese Text Representation And Classification Based On Multi-Instance Learning
5	Studies On Some Essential Problems In Automatic Text Categorization
6	The Research And Implementation Of Automatic Text Categorization For Chinese Web Documents
7	A Study On Text Categorization Based On Machine Learning
8	Researching Text Classification Using Semantic And Sequence Information
9	Study On Text Semantic Representation And Key Techniques Of Hierarchical Classification
10	Chinese Text Categorization Based On Multi-Instance Learning