
Representation Learning And Dependency Syntax For Text Summarization

Posted on: 2021-04-21  Degree: Doctor  Type: Dissertation
Country: China  Candidate: W F Liu  Full Text: PDF
GTID: 1368330602966034  Subject: Network and network resource management
Abstract/Summary:
Text summarization is a key technology in natural language processing (NLP). With the explosive growth of text data in recent years, how to quickly grasp the meaning of massive amounts of information has received increasing attention. According to the specific processing method, text summarization has two main paradigms: extractive summarization and abstractive summarization. Extractive summarization selects important sentences from a document as its summary, while the abstractive method obtains a summary mainly by generating and rewriting text, which is closer to the way humans distill knowledge when reading. However, extractive summaries often suffer from low coverage or one-sidedness and cannot express the overall meaning of the document well; meanwhile, the content produced by current abstractive summarization is often subject to problems such as poor readability, redundancy, and semantic deviation, and cannot truly express the semantics of the document. To address these problems, this thesis starts from the study of word embeddings, then studies the syntactic structure of sentences and attention mechanisms over documents, and finally implements a sentence-level summarization method and a hybrid extractive-abstractive summarization method. The main contributions are summarized as follows:

(1) A novel fine-grained word-embedding method for text representation is proposed. Representation learning is one of the basic research topics in natural language processing and related fields. Targeting the characteristics of text summarization, this thesis combines feature information such as part of speech and position to construct a new, fine-grained, more expressive word-embedding representation, and further organizes the embeddings in a two-dimensional lookup table, which reduces the size of the embedding table and improves query efficiency. Experiments show that the proposed method has better semantic representation capability.

(2) A novel method for comparing sentence-level similarity based on word embeddings and dependency syntactic structures is proposed. The sentence is a basic processing unit of summarization. A meaningful sentence must conform to the syntactic structure of its language, so it is of great significance to incorporate syntactic structure when comparing related sentences. This thesis studies the dependency relationships among the words of a sentence, constructs a dependency tree with transition-based dependency parsing, and groups the words into different syntactic components (subject block, predicate block, object block, etc.) according to their dependency relations. After preprocessing steps such as passive-voice flipping and normalization of the syntactic blocks, an attention mechanism is used to construct syntactic block embeddings, which are then concatenated into a sentence-level embedding. Experiments show that the sentence-level embeddings constructed in this way represent sentences well.

(3) A sentence-level summarization method based on dependency syntax and Tree-LSTM is proposed. Building on the previous two parts, an input sentence is divided into syntactic blocks. A "hard alignment" mechanism is used between the input and output blocks, while a "soft alignment" attention mechanism is used within each block; the parameters are learned by training a Tree-LSTM network, yielding a sentence-level summarization model. The dependency tree preserves the syntactic relations and readability of the generated sentence, the "hard alignment" mechanism prevents the syntactic components of long sentences from shifting, and the "soft alignment" mechanism increases the flexibility of generating new words within syntactic blocks. The feasibility of the method is verified experimentally.

(4) A novel hybrid extractive-abstractive document summarization model is proposed. To address the problems of document-level summarization and fully combine the advantages of the two paradigms, this thesis proposes a two-stage model. The first stage uses a sentence-similarity matrix or a "pseudo-title" to extract important sentences from the document; this stage fully considers explicit features (such as sentence position and paragraph position) for coarse-grained extraction, accounts for differences between sentences, and selects the most important ones. The second stage is abstractive: the extracted sentences are recombined and rewritten into new sentences using a beam-search algorithm, and the best result serves as the "pseudo-title" for the next round. The two stages are performed cyclically until the pseudo-title converges, and the final pseudo-title is taken as the summary of the document. Extensive experiments on English and Chinese datasets show that the method obtains better summaries.
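The table-size saving claimed for contribution (1) can be illustrated with a small sketch. This is not the thesis's implementation: all sizes, the factorized two-dimensional lookup, and the additive composition of word, part-of-speech, and position features are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 10_000          # vocabulary size
ROWS, COLS = 100, 100   # ROWS * COLS >= VOCAB: two small tables replace one big one
DIM = 64                # embedding dimension
N_POS = 16              # number of part-of-speech tags
MAX_LEN = 128           # maximum sentence length

row_table = rng.normal(size=(ROWS, DIM))   # 100 x 64 instead of 10000 x 64
col_table = rng.normal(size=(COLS, DIM))
pos_table = rng.normal(size=(N_POS, DIM))
loc_table = rng.normal(size=(MAX_LEN, DIM))

def embed(word_id: int, pos_tag: int, position: int) -> np.ndarray:
    """Compose a fine-grained word vector from the two-dimensional lookup
    table plus POS and position features (here: simple summation)."""
    r, c = divmod(word_id, COLS)   # 2-D index into the two small tables
    return row_table[r] + col_table[c] + pos_table[pos_tag] + loc_table[position]

v = embed(word_id=4242, pos_tag=3, position=7)
```

With these toy sizes the factorized word table holds (100 + 100) x 64 parameters instead of 10000 x 64, which is the kind of lookup-table reduction the abstract describes.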
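The block-then-concatenate construction in contribution (2) can be sketched as attention pooling over pre-grouped syntactic blocks. The grouping itself (via dependency parsing) is assumed done; the dot-product attention and the fixed subject/predicate/object split are hypothetical simplifications.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 64

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def block_embedding(token_vecs, query):
    """Attention-pool the token vectors of one syntactic block."""
    weights = softmax(token_vecs @ query)   # attention weights over tokens
    return weights @ token_vecs             # weighted sum of token vectors

def sentence_embedding(blocks, query):
    """Concatenate the pooled syntactic blocks into one sentence vector."""
    return np.concatenate([block_embedding(b, query) for b in blocks])

# toy sentence: subject (2 tokens), predicate (1 token), object (3 tokens)
blocks = [rng.normal(size=(n, DIM)) for n in (2, 1, 3)]
query = rng.normal(size=DIM)
s = sentence_embedding(blocks, query)   # one vector per sentence
```

Two sentences embedded this way can then be compared with cosine similarity, which is one plausible reading of the sentence-level comparison described above.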
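The Tree-LSTM in contribution (3) composes a word with its dependents in the dependency tree. A minimal child-sum Tree-LSTM node update (per Tai et al.'s formulation, with random untrained weights) might look as follows; the thesis's actual gating and alignment machinery is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8  # toy hidden size; input size kept equal to D for brevity

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Gate parameters (random here; trained in the actual model).
W = {g: rng.normal(scale=0.1, size=(D, D)) for g in "ifou"}
U = {g: rng.normal(scale=0.1, size=(D, D)) for g in "ifou"}
b = {g: np.zeros(D) for g in "ifou"}

def child_sum_node(x, children):
    """One child-sum Tree-LSTM update: combine the (h, c) states of a
    word's dependents with the word's own input vector x."""
    h_sum = sum((h for h, _ in children), np.zeros(D))
    i = sigmoid(W["i"] @ x + U["i"] @ h_sum + b["i"])
    o = sigmoid(W["o"] @ x + U["o"] @ h_sum + b["o"])
    u = np.tanh(W["u"] @ x + U["u"] @ h_sum + b["u"])
    c = i * u
    # one forget gate per child, so each dependent can be kept or dropped
    for h_k, c_k in children:
        f_k = sigmoid(W["f"] @ x + U["f"] @ h_k + b["f"])
        c = c + f_k * c_k
    return o * np.tanh(c), c

# two leaf dependents, then their head word
h1, c1 = child_sum_node(rng.normal(size=D), [])
h2, c2 = child_sum_node(rng.normal(size=D), [])
h_root, c_root = child_sum_node(rng.normal(size=D), [(h1, c1), (h2, c2)])
```

Running such updates bottom-up over the dependency tree yields one hidden state per syntactic block, which the "hard" and "soft" alignment mechanisms can then operate on.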
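The extract-rewrite loop of contribution (4) can be sketched end to end. The lexical-overlap similarity and the trivial `rewrite` stub stand in for the thesis's similarity matrix and beam-search decoder; only the cyclic pseudo-title structure is taken from the abstract.

```python
def overlap(a: str, b: str) -> float:
    """Crude Jaccard word overlap, standing in for the learned
    sentence-similarity matrix of the first (extractive) stage."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def extract(sentences, pseudo_title, k=2):
    """Stage 1: pick the k sentences closest to the current pseudo-title."""
    ranked = sorted(sentences, key=lambda s: overlap(s, pseudo_title), reverse=True)
    return ranked[:k]

def rewrite(selected):
    """Stage 2 placeholder: the thesis rewrites with a trained decoder and
    beam search; here we just join the selection to keep the loop runnable."""
    return " ".join(selected)

def summarize(sentences, pseudo_title, max_rounds=5):
    """Alternate extraction and rewriting until the pseudo-title converges."""
    for _ in range(max_rounds):
        new_title = rewrite(extract(sentences, pseudo_title))
        if new_title == pseudo_title:
            break
        pseudo_title = new_title
    return pseudo_title
```

On a toy document, `summarize(doc_sentences, seed_title)` repeatedly re-extracts against its own output and stops once the pseudo-title stabilizes, mirroring the cyclic first/second-stage procedure described above.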
Keywords/Search Tags: Representation Learning, Text Summarization, Word Embeddings, Dependency Syntax, Attention Mechanism