Improved Methods Based On Statistical Model For Text Segmentation

Posted on: 2015-01-19
Degree: Master
Type: Thesis
Country: China
Candidate: X J Li
Full Text: PDF
GTID: 2268330431956900
Subject: Computer application technology
Abstract/Summary:
Text segmentation is an important preprocessing step in information retrieval and multi-document summarization. A document usually covers multiple topics. The task of text segmentation is to analyze the document structure, automatically identify the boundaries between topics, and divide the text into parts that each represent one topic.

This thesis focuses on the probabilistic model proposed by Masao Utiyama and Hitoshi Isahara, which builds on existing topic segmentation methods and lexical similarity measures. Using Bayes' formula and word clustering, the model defines the probability that a passage of text describes a single topic and uses this probability as the criterion for choosing segmentation points. To determine segment boundaries, it represents the text as a weighted directed graph and uses dynamic programming to compute the minimum-cost path between two nodes, which obtained better results.

This thesis proposes several improvements to the Utiyama and Isahara model. First, the original model uses only the similarity within paragraphs and ignores dissimilarity, so we define a dissimilarity measure between adjacent sentences. Second, the original model does not account for the length of the resulting segments when calculating the probability of a segmentation, so we determine the range of segment lengths in the preprocessing stage and define a piecewise function for the segmentation probability. Third, words that describe the same topic may be dispersed through the text, which makes boundaries difficult to identify, so we add a weight that amplifies the effect of whether words describe the same topic. The experimental results showed that the improved probability model increases segmentation accuracy to some extent.

Utiyama and Isahara identified boundaries with a dynamic programming algorithm, which does not require prior parameters to be set, so we retain dynamic programming for boundary detection. To evaluate the improved model, we also use particle swarm optimization to identify the boundaries as a contrast experiment. The experimental results showed that our work improves text segmentation performance to some extent.
Keywords/Search Tags: text segmentation, probability statistical model, dynamic programming, particle swarm optimization
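
The dynamic-programming boundary search described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the cost formula, so the sketch assumes a Laplace-smoothed unigram cost per segment plus a log(total words) penalty for each segment, which is one common formulation of this kind of statistical model. The function names `segment_cost` and `segment` are hypothetical, and the improvements proposed in the thesis (sentence dissimilarity, the piecewise length function, and the extra word weights) are not included.

```python
import math
from collections import Counter


def segment_cost(counts, length, vocab_size, total_words):
    """Cost of one candidate segment (an edge weight in the graph).

    Assumed form: negative log-likelihood of the segment's words under a
    Laplace-smoothed unigram model, plus a fixed per-segment penalty.
    """
    nll = sum(c * math.log((length + vocab_size) / (c + 1))
              for c in counts.values())
    return nll + math.log(max(total_words, 1))


def segment(sentences):
    """Split a list of tokenized sentences into topic segments.

    Nodes 0..n are the gaps between sentences; an edge (i, j) stands for a
    segment covering sentences i..j-1. The minimum-cost path from node 0 to
    node n is found by dynamic programming over this directed graph.
    Returns half-open (start, end) sentence ranges.
    """
    n = len(sentences)
    if n == 0:
        return []
    vocab_size = len({w for s in sentences for w in s})
    total_words = sum(len(s) for s in sentences)

    best = [math.inf] * (n + 1)   # best[j]: cheapest cost to reach node j
    back = [0] * (n + 1)          # back[j]: previous node on that path
    best[0] = 0.0
    for i in range(n):
        counts, length = Counter(), 0
        for j in range(i + 1, n + 1):
            # Grow the candidate segment by one sentence at a time.
            counts.update(sentences[j - 1])
            length += len(sentences[j - 1])
            cost = best[i] + segment_cost(counts, length,
                                          vocab_size, total_words)
            if cost < best[j]:
                best[j], back[j] = cost, i

    # Walk the back-pointers from the last node to recover the segments.
    bounds, j = [], n
    while j > 0:
        bounds.append((back[j], j))
        j = back[j]
    bounds.reverse()
    return bounds


if __name__ == "__main__":
    doc = [["cat"] * 3, ["cat"] * 3, ["dog"] * 3, ["dog"] * 3]
    print(segment(doc))  # [(0, 2), (2, 4)]
```

In this toy input the two "cat" sentences and the two "dog" sentences are grouped into separate segments, because merging them would raise the smoothed word cost by more than the per-segment penalty saves. Note that no number of segments has to be chosen in advance, which matches the abstract's remark that dynamic programming needs no prior parameters.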