
Research On The Key Issues Of Text Segmentation And Its Application In Multi-document Summarization

Posted on: 2009-10-09
Degree: Doctor
Type: Dissertation
Country: China
Candidate: N Ye
GTID: 1118360308478435
Subject: Computer Science and Technology
Abstract/Summary:
With the advent of the web age, the amount of electronic text has grown explosively, making the Internet a huge information resource. This mass of online information brings great convenience to people, but it also poses new challenges to the information industry. In intelligent text processing, finding the information a user wants efficiently and accurately within this volume of text, and presenting it in an appropriate form, has become a hot research topic. Traditional text processing technologies take the whole discourse as the basic processing unit, implicitly assuming that one discourse discusses only one topic. However, a natural language discourse usually covers multiple subtopics, so processing at the level of the whole discourse is too coarse-grained to satisfy users' higher requirements on accuracy. The study of text segmentation arose against this background.

Text segmentation involves two critical problems: automatic boundary detection and automatic determination of the number of segments in a document. Focusing on these problems, this dissertation analyzes the shortcomings of current text segmentation algorithms theoretically. New models are then proposed on the basis of a thorough analysis of text structure characteristics, and they achieve good performance. Finally, the proposed text segmentation model is applied to query-oriented multi-document summarization. Experimental results show that subtopic information provides valuable cues for summarization and consequently improves summary quality. The main work of this dissertation includes:

1. We proposed a text segmentation model based on regional lexical density. We conducted a theoretical analysis of a state-of-the-art text segmentation algorithm, the Dotplotting algorithm. Two problems were found in the density measure that Dotplotting uses to represent topical coherence. First, the density function is asymmetric, which leads to the anomaly that a forward scan and a backward scan can produce different segmentations. Second, when determining the next boundary, the assessment strategy does not adequately take the previously located boundaries into account. On the basis of this analysis, we proposed the MMD model, which remedies both problems. We also use segment length to further improve segmentation performance.

2. We proposed a new statistical text segmentation model based on Multiple Discriminant Analysis (MDA). In the MDA model, four Multiple Discriminant Analysis criterion functions are defined and used to evaluate segmentations globally. The criterion functions focus on three factors: within-segment distance, between-segment distance and segment length. The segmentation with the smallest within-segment distance and the largest between-segment distance is assumed to be the best. The segmentation with the highest score under this evaluation is selected, which determines both the segment boundaries and the optimal number of segments.

3. We proposed a global optimization text segmentation model (the MMS model) based on Dynamic Programming. On the basis of a thorough analysis of the lexical distribution and structural characteristics of texts, we defined a segmentation criterion function and used a dynamic programming algorithm to find the globally optimal segmentation. The number of segments is also determined automatically. The criterion function considers multiple factors, such as within-segment similarity, between-segment similarity, segment lengths and the effect of sentence distance on lexical similarity, so as to identify subtopic changes in texts accurately. Compared with the MDA model, the computational complexity of the MMS model is much lower. The MDA model looks for the best segmentation through full search; it is an unordered model with exponential complexity. In contrast, the MMS model is an ordered model that adopts dynamic programming as its search strategy.

4. On the basis of the text segmentation model proposed in this dissertation, we constructed the SEG_SUM multi-document summarization system, which targets the query-oriented multi-document summarization task. First, the text segmentation model is used to segment each document. The resulting segments are then clustered to gather text fragments from different documents that discuss the same subtopic. This produces multiple segment clusters, each of which represents a subtopic. Next, the subtopics that are not relevant to the query are filtered out, and the remaining subtopics are ranked by importance. Finally, the summary is generated by selecting sentences sequentially from the important subtopics. Because the summary covers multiple query-relevant subtopics and takes their importance into account, it can include more information while staying closely related to the specific query, and it covers as much of the important information as possible.
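
The dynamic programming search mentioned in item 3 can be illustrated with a minimal Python sketch. The abstract does not give the MMS criterion function, so everything below is an assumption for illustration only: the function names `cosine` and `segment`, the bag-of-words cosine similarity, and the criterion itself, under which a segment scores the sum of (similarity - threshold) over all sentence pairs it contains. The sketch only shows how an additive per-segment score lets dynamic programming choose both the boundaries and the number of segments without enumerating every segmentation.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def segment(sentences, threshold=0.2):
    """Return the start index of each segment, found by dynamic programming.

    Assumed criterion (NOT the MMS function from the dissertation): a segment
    scores the sum of (similarity - threshold) over all sentence pairs in it,
    so similar pairs reward merging and dissimilar pairs reward splitting.
    """
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)

    # gain[i][j] is the score of the segment covering sentences i .. j-1.
    gain = [[0.0] * (n + 1) for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            # Extend segment i .. j-2 by sentence j-1.
            added = sum(cosine(bags[k], bags[j - 1]) - threshold
                        for k in range(i, j - 1))
            gain[i][j] = gain[i][j - 1] + added

    # best[j] is the best total score for the first j sentences;
    # back[j] is the start index of the last segment in that solution.
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            candidate = best[i] + gain[i][j]
            if candidate > best[j]:
                best[j] = candidate
                back[j] = i

    # Walk the back-pointers from the end to recover segment starts.
    starts = []
    j = n
    while j > 0:
        starts.append(back[j])
        j = back[j]
    return starts[::-1]
```

For example, `segment(["cats purr cats sleep", "cats chase mice", "stocks fell markets closed", "markets stocks rallied"])` returns `[0, 2]`: two segments, starting at the first and third sentences. The number of segments is never passed in; it falls out of the optimization, which mirrors the property the abstract claims for the MMS model. The hypothetical `threshold` parameter plays the role that segment-length and similarity terms would play in a fuller criterion.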
Keywords/Search Tags: natural language processing, text segmentation, segment, Dotplotting, Multiple Discriminant Analysis, Dynamic Programming, multi-document summarization