Research On Subsequence-based Text Segmentation And Topic Labeling

Posted on: 2010-02-25
Degree: Doctor
Type: Dissertation
Country: China
Candidate: X Chen
Full Text: PDF
GTID: 1118330332985675
Subject: Software engineering
Abstract/Summary:
Since the end of the 20th century, radio, television, the Internet, electronic devices, and other media have delivered information on a wide range of subjects every day, and a large part of that information takes the form of documents. How to find the information that is actually useful has become a focus of attention.

Intelligent text processing systems generally use whole documents as their basic processing units and implicitly assume that a document focuses mainly on one topic. In fact, a document often involves one or more subtopics. Document-level processing granularity therefore struggles to meet the higher and more precise user requirements of many practical applications. To meet these requirements, technologies based on content understanding have gained wide attention.

Text topic analysis aims to determine the topic structure of a text: to identify the topics discussed, determine the extent of each topic, track shifts between topics, detect the relationships between topics, and so on. As important components of these technologies, text segmentation and topic labeling are widely used in information retrieval, information integration, question answering, and summarization systems.

Text representation is the basis of intelligent text processing systems. Most existing systems represent text with a word-based vector space model. However, this model cannot express word order information. Therefore, this paper discusses text topic segmentation, topic passage extraction, named entity recognition, and text topic labeling, covering the following aspects:

(1) This paper studies the suffix tree document model and introduces a subsequence-based text representation method. This method takes full advantage of the order information between words in a text. This information is important for text segmentation, especially for the segmentation of descriptive text. For Chinese natural language processing, the method can reduce the impact of the uncertainty in how a Chinese word is defined and of the errors made by Chinese word segmentation systems. It allows text segmentation and labeling systems to achieve the same performance without Chinese word segmentation, and it reduces text preprocessing time.

(2) This paper introduces a subsequence-based sentence coherence metric and a text segmentation algorithm based on maximized cut. The algorithm determines the optimal segment boundaries by maximizing the coherence cut of the text, and it determines the segmentation granularity from the gradient of the coherence cut. In topic segmentation experiments on a middle school chemistry e-book, our approach achieves higher accuracy, whether the actual segmentation granularity or the automatically determined granularity is used.

(3) This paper introduces subsequence-based query-sensitive sentence coherence and sentence relevance metrics, and a passage extraction algorithm based on normalized cut. The algorithm simultaneously maximizes the text coherence cut and the relevance cut to extract the most query-relevant passage using multiobjective optimization. We also introduce a corresponding query expansion method that uses the subsequence labels of the passage to expand the query. Passage extraction experiments on the middle school chemistry e-book show that accuracy can be improved by using sentence coherence and relevance simultaneously, together with subsequence-based query expansion.

(4) This paper introduces the k-similar conditional random fields model, discusses its inference and training algorithms, and applies it to named entity recognition. The method calculates word similarity in unlabeled text and labels the current word by taking the features of similar words into account. It reduces manual labeling work. Experiments on a standard named entity dataset show that the accuracy of named entity recognition with conditional random fields can be improved by using word similarity information.

(5) This paper introduces subsequence-based label feature weight and label significance metrics, and a subsequence-based text topic labeling algorithm. The algorithm chooses multiple labels for a text using the maximal marginal relevance criterion and labels multiple documents collaboratively using document similarity. We introduce a subsequence significance metric based on a term glossary. The term weighting finds topic subsequences more accurately and naturally eliminates most ill-structured subsequences. We also study the influence of subsequence position on subsequence significance. Our approach gives satisfactory results in topic labeling experiments on the middle school chemistry e-book.

In brief, this paper investigates models and algorithms related to text topics, mainly using statistical methods. It improves the accuracy of text topic segmentation and labeling and lays a foundation for the implementation of intelligent text processing systems.
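
To make the maximal marginal relevance criterion mentioned in aspect (5) concrete, here is a minimal Python sketch of greedy label selection. It is not the dissertation's algorithm: the relevance scores, the word-overlap (Jaccard) similarity, the trade-off weight `lam`, and the function names are all assumptions chosen for illustration. The abstract does not specify how label significance or label similarity is actually computed.

```python
def jaccard(a, b):
    """Word-overlap similarity between two candidate labels (illustrative assumption)."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def mmr_select(relevance, k, lam=0.7, similarity=jaccard):
    """Greedily pick up to k labels, trading relevance against redundancy.

    relevance: dict mapping candidate label -> relevance score (hypothetical values).
    lam: weight on relevance; (1 - lam) is the weight on the redundancy penalty.
    """
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def mmr_score(label):
            # Redundancy is the highest similarity to any label already chosen.
            redundancy = max((similarity(label, s) for s in selected), default=0.0)
            return lam * relevance[label] - (1 - lam) * redundancy

        # Sorting makes tie-breaking deterministic.
        best = max(sorted(candidates), key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Hypothetical candidate labels with made-up relevance scores.
scores = {
    "chemical reaction": 0.9,
    "chemical reaction rate": 0.85,
    "periodic table": 0.6,
}
labels = mmr_select(scores, k=2)
print(labels)
```

With these made-up scores, the second pick is "periodic table" rather than the more relevant "chemical reaction rate", because the latter overlaps heavily with the label already selected. Avoiding that kind of redundancy is the purpose of the criterion.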
Keywords/Search Tags: text segmentation, topic labeling, subsequence, normalized cut, conditional random fields