Word Embedding And Topic Model Based Biomedical Summarization

Posted on:2016-08-31

Degree:Master

Type:Thesis

Country:China

Candidate:H H Hao

GTID:2308330461483532

Subject:Computer application technology

Abstract/Summary:

With the rapid development of technology, the volume of literature available online has grown exponentially. These resources carry a vast amount of information, but they also bring redundant and spam data, so users need much more time to find the material they want. Text summarization quickly extracts the most important information in a corpus and represents the source files with a fixed-length excerpt, which saves users time and increases their work efficiency. In the biomedical domain, a single concept can retrieve thousands of related documents from a database such as MEDLINE. The exploration of text summarization is therefore of great value to biomedical researchers.

Since word2vec was proposed in 2013, the model has been widely used because of its efficiency and flexibility. Moreover, because deep learning has performed so well recently, the study of word embedding has attracted much attention from researchers. The focus of this paper is how to incorporate word embedding into text summarization to produce better summaries. First, the corpus is processed into a set of candidate sentences, and every sentence is treated as a node of a graph with an average weight. Then the semantic similarity between every pair of sentences is computed using word embedding and used as the weight of the edge between them. The weights of the nodes are computed iteratively based on PageRank, so that the weight of a node represents the importance of its corresponding sentence. Finally, the summary is generated with maximal marginal relevance to reduce redundancy among sentences. To find the most suitable way of using word embedding for this task, this paper adopts several ways of representing a sentence, such as the average of the word embeddings of its features, the maximum of each dimension over its features, and a combination with semantic computation, among others. By comparing the results of three groups, we conclude that the method combining semantic similarity contributes most to summarization performance.

Although text summarization helps users browse information quickly, users in different roles may have different information needs. For a given disease, for example, a doctor wants to know its latest research results, whereas a patient may want to learn about its symptoms, diagnosis, or treatment. This paper therefore proposes a user-oriented summarization. Taking doctors and patients as the experimental user groups, it parses the comments of the two classes of users about the disease “HIV Infections” to build two comment sets. It then applies topic modeling to the two sets with pLSA and LDA to find the topic terms that the users are really interested in. Finally, these terms are incorporated into the computation of sentence importance. The method is evaluated from two aspects, summary performance and similarity to the topic terms, and the results demonstrate the effectiveness of generating user-oriented disease summaries.
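
The graph-based pipeline described in the abstract (sentence vectors from word embeddings, similarity-weighted edges, PageRank scoring, then maximal marginal relevance selection) could be sketched roughly as follows. This is a minimal illustration and not the thesis implementation: the function names, the toy two-dimensional embeddings, the uniform initial node weight, the damping factor of 0.85, the cosine similarity measure, and the trade-off parameter `lam` are all assumptions made for the example.

```python
import math

def sentence_vector(words, embeddings, mode="average"):
    """Pool word vectors into one sentence vector (average or per-dimension max)."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        raise ValueError("no known words in sentence")
    dims = range(len(vecs[0]))
    if mode == "max":
        return [max(v[d] for v in vecs) for d in dims]
    return [sum(v[d] for v in vecs) / len(vecs) for d in dims]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity_matrix(vectors):
    """Edge weights: non-negative cosine similarity between distinct sentences."""
    n = len(vectors)
    return [[max(0.0, cosine(vectors[i], vectors[j])) if i != j else 0.0
             for j in range(n)] for i in range(n)]

def pagerank(sim, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration, starting from uniform node weights."""
    n = len(sim)
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in sim]
    for _ in range(iterations):
        scores = [
            (1 - damping) / n + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]
    return scores

def mmr_select(scores, sim, k, lam=0.7):
    """Greedy maximal marginal relevance: trade importance against redundancy."""
    top = max(scores)
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        best = max(candidates, key=lambda i: lam * scores[i] / top
                   - (1 - lam) * max((sim[i][j] for j in selected), default=0.0))
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data (hypothetical): real use would load trained word2vec vectors.
embeddings = {"hiv": [1.0, 0.0], "virus": [0.9, 0.1], "infection": [0.8, 0.2],
              "therapy": [0.1, 0.9], "drug": [0.0, 1.0]}
sentences = [["hiv", "virus"], ["hiv", "infection"],
             ["virus", "infection"], ["therapy", "drug"]]
vectors = [sentence_vector(s, embeddings) for s in sentences]
sim = similarity_matrix(vectors)
scores = pagerank(sim)
print(mmr_select(scores, sim, k=2))
```

Passing `mode="max"` to `sentence_vector` gives the per-dimension maximum representation instead of the average. User-specific topic terms from pLSA or LDA could, under the same assumptions, be folded in by boosting the score of sentences that contain them before the selection step.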

Keywords/Search Tags:

Text Summarization, Biomedical domain, Word Embedding, Topic Model

Related items

1	Research On Text Topic Modeling Based On Word Embedding
2	Unsupervised Extractive Text Summarization Using Sentence Embedding
3	Research On Automatic Text Abstract System Based On Chinese Long Text
4	Research On Short Text Topic Model Based On Semantic Information And Word Triangle
5	Research On Technology Of Automatic Text Summarization Based On Multiple Word Co-occurrence And Mutual Information
6	Research On Topic Model Over Short Texts With Incorporation Of Word Embedding
7	A Study Of Short Text Topic Models Based On Information Of Word Embeddings
8	Research On Short Text Aspect Extraction Base On Topic Model And Word Embedding Mechanism
9	Design And Implementation Of Topic Detection System For Specific Domain
10	Research On Generation Method Of Evolutionary Multi-document Summarization Based On Sub-topic Enhancement