
Query-oriented Summarization Technologies For Text-central XML Data

Posted on: 2011-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: S H Wu
Full Text: PDF
GTID: 2178330332966439
Subject: Computer application technology
Abstract/Summary:
XML (Extensible Markup Language) is a meta-markup language that lets users mark up text with domain-specific tags that describe the meaning and structure of the content. With the rapid development of the Internet and network technology, online data that conforms to the XML specification has become widely used in today's information society. Query-oriented XML summarization combines XML information retrieval with automatic summarization: given a user's query, it produces from the set of query-related documents a summary that is complete in meaning, readable, and matched to the user's need. With this technology, people can find information in large volumes of XML data quickly and efficiently, which also reduces their reading burden.

This thesis studies query-oriented XML text summarization. The main work can be summarized as follows:

1. Design and construction of a corpus for query-oriented XML text summarization. The thesis describes how the corpus was built, including the selection of topics and XML elements/documents, the construction process, and the features of the resulting corpus. To date, the corpus has 25 English query topics covering 422 elements for summarization, and 32 Chinese topics covering 402 elements.

2. A model for query-oriented XML text summarization is proposed. First, the query-related document set is split into sentences, and an improved ranking method based on density analysis divides these sentences into a query-related sentence set and a query-unrelated sentence set. Second, the user's query keywords are expanded from the query-related sentence set with an improved topic signature method. Using the expanded query keywords, the model computes for each sentence its correlation score with the query topic given by the user and its correlation score with the topic of the query-related document set. It also derives a tag score for the sentence from the probability distribution of tags in the Query-oriented XML Text Summarization corpus, and a sentence level score with an improved version of Z. Szlavik's method. These scores are merged by linear combination into a single score per sentence, and the sentences are ranked by that score. Third, a content-similarity-based method removes duplicate sentences, and the remaining sentences are added to the summary set. Evaluation with ROUGE-1 and manual assessment shows that the proposed model produces summaries of good quality.

3. A sentence ordering strategy based on the random surfer model is proposed for XML summarization. The order relations between two sentences (chronological, positional, and layer) and their topic relation are merged by linear combination to build a directed graph, in which the vertices are sentences and the edge weights are the combined value of the two kinds of relation. Sentence scores are computed with the PageRank algorithm, and the sentences are reordered according to their scores. The resulting sentence sequence is the final summary. Experimental results show that the algorithm significantly improves the logic, coherence, and readability of XML summaries.
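The scoring and redundancy-removal steps of the model in item 2 can be illustrated with a minimal sketch. It assumes that each sentence already carries its four component scores. The weights, the 0.5 similarity threshold, the whitespace tokenizer, the use of cosine similarity over raw term counts, and all function and field names are illustrative assumptions, not values or choices taken from the thesis.

```python
import math
from collections import Counter

# Hypothetical weights for the four component scores; the abstract does not
# state the weights actually used.
WEIGHTS = {"query": 0.4, "topic": 0.3, "tag": 0.2, "level": 0.1}


def fused_score(components, weights=WEIGHTS):
    """Linear combination of the component scores of one sentence."""
    return sum(weights[name] * components[name] for name in weights)


def cosine(a, b):
    """Cosine similarity of two sentences over raw term counts (assumed measure)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def select_sentences(candidates, max_sentences=3, threshold=0.5):
    """Rank by fused score, then skip sentences too similar to ones already kept."""
    ranked = sorted(candidates, key=lambda c: fused_score(c["scores"]), reverse=True)
    summary = []
    for cand in ranked:
        if all(cosine(cand["text"], kept["text"]) < threshold for kept in summary):
            summary.append(cand)
        if len(summary) == max_sentences:
            break
    return summary


# Toy candidates with made-up scores, for illustration only.
candidates = [
    {"text": "XML retrieval returns elements instead of whole documents",
     "scores": {"query": 0.9, "topic": 0.8, "tag": 0.5, "level": 0.6}},
    {"text": "XML retrieval returns elements instead of documents",
     "scores": {"query": 0.8, "topic": 0.8, "tag": 0.5, "level": 0.6}},
    {"text": "The corpus contains English and Chinese query topics",
     "scores": {"query": 0.4, "topic": 0.6, "tag": 0.7, "level": 0.5}},
]
selected = select_sentences(candidates)
for cand in selected:
    print(round(fused_score(cand["scores"]), 3), cand["text"])
```

In this toy input the second candidate is a near-duplicate of the first, so it is dropped even though its fused score is higher than that of the third candidate.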
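The sentence ordering strategy in item 3 can be sketched in the same spirit. The sketch assumes that the order relation and the topic relation are given as two weight matrices, and that an edge from sentence i to sentence j means "j should come before i", so that sentences with higher PageRank scores are placed earlier. The edge direction, the mixing weight `alpha`, the damping factor 0.85, and the example matrices are all assumptions made for illustration; the abstract does not specify them.

```python
def combine(order_rel, topic_rel, alpha=0.6):
    """Edge weights as a linear combination of order and topic relations."""
    n = len(order_rel)
    return [
        [alpha * order_rel[i][j] + (1 - alpha) * topic_rel[i][j] for j in range(n)]
        for i in range(n)
    ]


def pagerank(weights, damping=0.85, iterations=100, tol=1e-10):
    """Weighted PageRank by power iteration on a dense weight matrix."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by nodes without outgoing weight is spread uniformly.
        dangling = sum(rank[i] for i in range(n) if out_sum[i] == 0)
        new_rank = []
        for j in range(n):
            incoming = sum(
                rank[i] * weights[i][j] / out_sum[i]
                for i in range(n)
                if out_sum[i] > 0
            )
            new_rank.append((1 - damping) / n + damping * (incoming + dangling / n))
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy relations for three sentences (made-up values).
# order_rel[i][j] = 1 means sentence j is believed to precede sentence i.
order_rel = [
    [0, 0, 1],
    [1, 0, 1],
    [0, 0, 0],
]
# topic_rel[i][j] is a symmetric topical relatedness value.
topic_rel = [
    [0.0, 0.5, 0.2],
    [0.5, 0.0, 0.2],
    [0.2, 0.2, 0.0],
]

scores = pagerank(combine(order_rel, topic_rel))
# Sentences with higher scores are placed earlier in the summary.
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(order)
```

With the opposite edge convention (an edge from i to j meaning "i precedes j"), the sort direction would need to be reversed; which convention the thesis uses is not stated in the abstract.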
Keywords/Search Tags:Query-oriented XML Text Summarization, corpus, query expansion, linear fusion, random surfing model