
Word Embedding And Topic Model Based Biomedical Summarization

Posted on: 2016-08-31
Degree: Master
Type: Thesis
Country: China
Candidate: H H Hao
GTID: 2308330461483532
Subject: Computer application technology
Abstract/Summary:
With the rapid development of technology, the volume of literature and other resources available online has been growing exponentially. These resources provide a vast amount of information, but they also bring data redundancy and spam, so users need much more time to find the material they want. Text summarization can quickly extract the most important information in a corpus and represent the source documents with a fixed-length excerpt, saving users time and improving their work efficiency. In the biomedical domain, for example in the MEDLINE database, a search on a single concept can retrieve thousands of related documents. The exploration of text summarization is therefore of great value to biomedical researchers.

Since word2vec was proposed in 2013, the model has been widely used because of its efficiency and flexibility. Moreover, given the outstanding recent performance of deep learning, the study of word embedding has attracted much attention from researchers. How to incorporate word embedding into text summarization to produce better summaries is the focus of this thesis.

First, the corpus is processed into a set of candidate sentences, and each sentence is treated as a node of a graph with an average weight. The semantic similarity between every pair of sentences is then computed using word embedding and used as the weight of the edge between them. The node weights are computed iteratively with PageRank, so that the weight of a node represents the importance of its corresponding sentence. Finally, the summary is generated with maximal marginal relevance to reduce redundancy among sentences. To find the most suitable way of using word embedding for this task, several sentence representations are tried, such as the average of the word embeddings of the features, the maximum over each dimension of the features, and a combination with semantic computation, among others. Comparing the results of the three groups shows that the method combining semantic similarity contributes most to summarization performance.

Although text summarization helps users browse information quickly, users in different roles may have different information needs. For a given disease, for example, a doctor wants to know about the latest research results, whereas a patient may want to learn about its symptoms, diagnosis, or treatments. This thesis therefore proposes user-oriented summarization. Taking doctors and patients as the experimental groups, the comments of the two classes of users on the disease “HIV Infections” are parsed to build two comment sets. Topic modeling with pLSA and LDA is then applied to the two sets to find the topic terms the users are really interested in. Finally, these terms are incorporated into the computation of sentence importance. The method is evaluated from two aspects, the performance of the summary and its similarity to the topic terms, and the results demonstrate the effectiveness of generating user-oriented disease summaries.
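
The graph-based pipeline described in the abstract (sentence vectors from word embeddings, similarity-weighted edges, PageRank scoring, then maximal marginal relevance selection) can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the toy 3-dimensional embeddings, the damping factor 0.85, and the MMR trade-off `lam` are all assumptions made for illustration, and only the averaging representation is shown.

```python
import numpy as np

def sentence_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; zero vector if none are known."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine_matrix(X):
    """Pairwise cosine similarity, clipped to be non-negative, with zero diagonal."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    U = X / norms
    S = np.clip(U @ U.T, 0.0, None)
    np.fill_diagonal(S, 0.0)
    return S

def pagerank(S, damping=0.85, tol=1e-8, max_iter=200):
    """Power iteration on the similarity graph; returns scores that sum to 1."""
    n = S.shape[0]
    col_sums = S.sum(axis=0)
    has_edges = col_sums > 0
    # Column-normalise; a node with no edges spreads its weight uniformly.
    P = np.where(has_edges, S / np.where(has_edges, col_sums, 1.0), 1.0 / n)
    r = np.full(n, 1.0 / n)  # every node starts with the same weight
    for _ in range(max_iter):
        r_new = (1 - damping) / n + damping * (P @ r)
        converged = np.abs(r_new - r).sum() < tol
        r = r_new
        if converged:
            break
    return r

def mmr_select(scores, S, k, lam=0.7):
    """Greedy MMR: trade off sentence importance against similarity to picks."""
    rel = scores / scores.max()
    selected, candidates = [], list(range(len(scores)))
    while candidates and len(selected) < k:
        def mmr(i):
            redundancy = max((S[i, j] for j in selected), default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data (hypothetical): made-up embeddings, not trained word2vec vectors.
toy_embeddings = {
    "hiv": np.array([1.0, 0.1, 0.0]),
    "infection": np.array([0.9, 0.2, 0.1]),
    "treatment": np.array([0.2, 1.0, 0.1]),
    "therapy": np.array([0.3, 0.9, 0.0]),
    "symptom": np.array([0.1, 0.2, 1.0]),
    "fever": np.array([0.0, 0.1, 0.9]),
}
sentences = [
    ["hiv", "infection", "treatment"],
    ["hiv", "infection", "therapy"],
    ["symptom", "fever"],
]
X = np.array([sentence_vector(s, toy_embeddings, 3) for s in sentences])
S = cosine_matrix(X)
ranks = pagerank(S)
summary = mmr_select(ranks, S, k=2)
print("scores:", ranks, "selected sentence indices:", summary)
```

A lower `lam` penalises redundancy more strongly, so near-duplicate sentences are less likely to appear together in the summary. The other representations mentioned above (per-dimension maximum, or combination with semantic computation) would replace `sentence_vector` or `cosine_matrix` in this sketch.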
Keywords/Search Tags: Text Summarization, Biomedical domain, Word Embedding, Topic Model