Improved Methods Based On Statistical Model For Text Segmentation

Posted on:2015-01-19

Degree:Master

Type:Thesis

Country:China

Candidate:X J Li

Full Text:PDF

GTID:2268330431956900

Subject:Computer application technology

Abstract/Summary:

Text segmentation is an important preprocessing step in information retrieval and multi-document summarization. A document usually covers multiple topics. The task of text segmentation is to analyze the structure of the document, automatically identify the boundaries between topics, and divide the text into parts that each represent one topic.

This paper focuses on the probabilistic model proposed by Masao Utiyama and Hitoshi Isahara, which builds on existing topic segmentation methods and lexical similarity measures. Using the Bayes formula and word clustering methods from topic segmentation, the model defines a probabilistic formula for how likely a passage of text is to describe a single topic and uses it as the criterion for choosing segmentation points. To determine segment boundaries, it abstracts the text into a weighted directed graph, computes the minimum-cost path between two nodes of that graph by dynamic programming, and obtained better results.

This paper presents several improvements to the probabilistic model of Utiyama and Isahara. First, the previous model used only the similarity within paragraphs and did not consider dissimilarity, so we define a dissimilarity between adjacent sentences. Second, the previous model did not consider the length of the final segments when calculating the segmentation probability, so we first determine the range of segment lengths in the preprocessing stage and then define a piecewise function for calculating the segmentation probability. Third, words that describe the same topic may be dispersed through the text, which makes boundaries hard to identify, so we add a weight that amplifies the effect of whether words describe the same topic. The experimental results showed that the improved probabilistic model increases segmentation accuracy to some extent.

Utiyama and Isahara used a dynamic programming algorithm to identify boundaries, which does not require parameters to be set in advance, so we also adopt dynamic programming to determine the boundaries. To test the improved model, we additionally use a particle swarm optimization algorithm to identify the boundaries as a contrast experiment. The experimental results showed that our work improved the performance of text segmentation to some extent.
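The minimum-cost-path view of segmentation described in this abstract can be illustrated with a short Python sketch. It is not the thesis implementation and does not include the proposed improvements (dissimilarity, length-dependent probability, topic weights). It assumes sentences are already tokenized into word lists, uses a Laplace-smoothed negative log-likelihood as the segment cost, and charges a fixed penalty of `log(total words)` per segment. The names `segment_cost` and `segment_text` are hypothetical.

```python
import math
from collections import Counter


def segment_cost(sentences, i, j, vocab_size, penalty):
    """Cost of the graph edge i -> j, i.e. of making sentences[i:j] one segment.

    Assumption: the cost is a Laplace-smoothed negative log-likelihood of the
    segment's words plus a constant per-segment penalty.
    """
    words = [w for sentence in sentences[i:j] for w in sentence]
    counts = Counter(words)
    n = len(words)
    nll = sum(math.log((n + vocab_size) / (counts[w] + 1)) for w in words)
    return nll + penalty


def segment_text(sentences):
    """Return the end index of each segment on the minimum-cost path.

    Nodes 0..len(sentences) are the gaps between sentences; dynamic
    programming finds the cheapest path from node 0 to the last node.
    """
    n_sent = len(sentences)
    if n_sent == 0:
        return []
    vocab_size = len({w for sentence in sentences for w in sentence})
    total_words = sum(len(sentence) for sentence in sentences)
    penalty = math.log(max(total_words, 1))  # assumed prior: one penalty per segment

    best = [math.inf] * (n_sent + 1)  # best[j]: cheapest cost to reach node j
    back = [0] * (n_sent + 1)         # back[j]: predecessor of j on that path
    best[0] = 0.0
    for j in range(1, n_sent + 1):
        for i in range(j):
            cost = best[i] + segment_cost(sentences, i, j, vocab_size, penalty)
            if cost < best[j]:
                best[j] = cost
                back[j] = i

    # Walk the back-pointers from the final node to recover the boundaries.
    boundaries = []
    j = n_sent
    while j > 0:
        boundaries.append(j)
        j = back[j]
    return sorted(boundaries)


# Toy document: three sentences on one topic, then three on another.
toy = [["a"] * 5] * 3 + [["b"] * 5] * 3
print(segment_text(toy))
```

On the toy document the cheapest path places a single internal boundary after the third sentence, because splitting a homogeneous run adds a segment penalty without lowering the likelihood cost. The double loop evaluates every candidate segment, so this sketch is quadratic in the number of sentences and recomputes word counts for each edge; it is meant for illustration only.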

Keywords/Search Tags:

text segmentation, probability statistical model, dynamic programming, particle swarm optimization

Related items

1	Improved Particle Swarm Optimization Algorithm Under Dynamic Environment
2	Study On Particle Swarm Optimization Algorithm With Multi-strategy
3	Researches On Bilevel Multiobjective Programming Problem:the Particle Swarm Optimization Algorithm And Applications
4	Design And Implementation Of Particle Filter Based On Evolutionary Programming And Particle Swarm Optimization
5	Research On Power Control In Cognitive Radio Systems
6	Image Segmentation Research Based On Particle Swarm Optimization Algorithm
7	Multi-objective Optimization Of WSN Region Segmentation Based On Improved Particle Swarm Optimization
8	Research On Distributed Characteristics Of Particle Swarm Optimization And Its Applications
9	Niching Particle Swarm Optimization Algorithms For Multi-Modal And Dynamic Problems
10	The Research Of Particle Swarm Optimization Solving Nonlinear Programming Problems