
Study On Method To Automatically Analyze The Text Structure Based On The Relevancy Computing Of Text Content

Posted on: 2011-07-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: M S Zhong
GTID: 1118360305956797
Subject: Computer Science and Technology
Abstract/Summary:
Text structure can be regarded as having both a physical and a logical structure. The physical structure of a text is determined by the actual location of its basic components (titles, paragraphs, sentences, words, punctuation, etc.), and the Vector Space Model can be used to represent it. The logical structure of a text is the set of logical relations built by subjects, levels, paragraphs, sentences and keywords, which together, based on conceptual meaning, reflect the topic or main idea of the text; it is usually expressed as a tree diagram or a graph. The automatic analysis of text structure uses the computer to divide a text into a number of disjoint text units (semantic paragraphs), or to parse it into a hierarchical tree based on meaning, so that people can recover the original logical relations or logical structure of the text.

The automatic analysis of text structure is a significant step toward automatic text understanding, since only by grasping the logical organizational structure of an article at the macro level can its topic or main idea be more easily understood from an overall perspective. At the same time, the results of text structure analysis have an important influence on many other natural language processing tasks, such as automatic text summarization, information retrieval, and topic detection and tracking. However, understanding a text, which is beyond the capability of computers, is the basis for text structure analysis. It is therefore difficult for computers to analyze the logical structure of a text as accurately as possible without understanding its context.

Based on the theory of text organizational structure and the characteristics of text structure, this paper divides text structure analysis into two tasks: linear structural analysis and hierarchical structural analysis.
We first study and propose approaches to calculate the degree of semantic relevancy between words in Chinese, to recognize the semantic relation between sentences in a context, and to calculate the degree of semantic relevancy between sentences, in order to analyze the semantic relevancy of the context and calculate its relevancy degrees. Then, based on this relevancy analysis and calculation of the context content, we study in depth the theories and methods of linear structural analysis and hierarchical structural analysis. Specifically, the present paper mainly contributes:

(1) To the abstract description of text structure. The concepts of 'sentence', 'title', 'paragraph', 'article', 'topic or sub-topic', etc. are described formally; new concepts such as 'basic argument structure', 'recursive argument structure', 'text-structure tree', 'text-topic units', etc. are proposed and described formally. In addition, a method to quantitatively describe or calculate 'the level of a topic' and 'the granularity of a topic' is presented. All of these serve as the premise or basis for carrying out structure analysis.

(2) To the semantic relevancy relation and relevancy degree between words. By analyzing the relevancy degree and similarity between words, we propose the concept of the 'broad-sense relevancy degree' of word meanings. To calculate the broad-sense relevancy degree between words, we first propose a corpus-based method that calculates the semantic relevancy degree between words by constructing a bipartite graph of lexical semantic relations; this is also known as the narrow-sense relevancy degree. Second, based on the idea of the Concept Intersional Logical Model of the Chinese Language, we propose a method of calculating the semantic relevancy degree between words from the definitions of a lexical item or its sub-items in a dictionary.
The results of this calculation place more emphasis on the similarity or relevancy between words in their conceptual meanings. Finally, we combine the two results to form a broad-sense relevancy degree. Tests on the data in the Standard M&C Chinese Version show that the first and second approaches complement each other, and that their combination yields a broad-sense relevancy degree close to human cognition or judgment.

(3) To the semantic relevancy relation between sentences in a context. First, based on the inter-sentence semantic relationships and their corresponding word-form tags summarized by linguists, we propose an automatic (qualitative) method to recognize the semantic relation between sentences in a context, including the approach to obtaining the templates of word-form tags, the approach to resolving conflicts between templates, and the algorithm for recognizing inter-sentence semantic relations. The method is then tested for its validity and effectiveness. Second, we propose a (quantitative) method based on the generalized semantic relevancy between words to calculate the relevancy degree between sentences. Tests show that the results of the relevancy degree calculation are closer to human judgment than those of the existing method that calculates similarity between sentences.

(4) To the linear structural analysis of the discourse and its related issues. Based on the above methods for calculating the broad-sense relevancy degree between words and for calculating or analyzing the semantic relevancy between sentences, we carry out the linear structural analysis of texts, study its related issues, and present a linear text segmentation method based on content relevancy analysis in the context.
Tests show that our method segments texts better than the classic TextTiling algorithm, and also better than the existing text segmentation algorithms already reported for Chinese texts.

(5) To the hierarchical structural analysis and its related issues, confirming the idea that texts of the same type should have the same or similar structural mode. Accordingly, we first propose a text hierarchical structural analysis method based on the Naïve Bayes model: the text organizational structural mode is learned from the training corpus with the Naïve Bayes model, and nodes are then recursively merged upward until a text structure tree with a root node is generated. We also propose a text hierarchical structural analysis method based on the bio-sequence alignment algorithm: using sequence alignment, it finds the text in the training corpus whose structure is most similar to that of the test text and acquires its structural mode, so that the structure of the test text can be analyzed automatically in light of that structural mode. The test results show that both methods are workable, but on the current test data set the former performs better than the latter.
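
As an illustration of the alignment-based idea in contribution (5), the following is a minimal sketch and not the dissertation's implementation. It assumes that each text has already been reduced to a sequence of structural labels (the single-letter labels, the corpus entries `news` and `essay`, and the scoring values are all hypothetical), and it uses standard Needleman-Wunsch global alignment, which the abstract does not name, to pick the training text whose structure best matches a test text.

```python
from typing import Dict, Sequence, Tuple


def alignment_score(a: Sequence[str], b: Sequence[str],
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global (Needleman-Wunsch) alignment score of two label sequences.

    The scoring values are illustrative assumptions, not taken from the source.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per label.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + gap,   # gap in b
                              score[i][j - 1] + gap)   # gap in a
    return score[-1][-1]


def most_similar(test_seq: Sequence[str],
                 training: Dict[str, Sequence[str]]) -> Tuple[str, int]:
    """Return (name, score) of the training sequence that aligns best with test_seq."""
    return max(((name, alignment_score(test_seq, seq)) for name, seq in training.items()),
               key=lambda pair: pair[1])


# Hypothetical label sequences: T = title, L = lead, I = introduction,
# B = body paragraph, C = conclusion.
training_corpus = {"news": list("TLBBC"), "essay": list("TIBBBC")}
print(most_similar(list("TIBBC"), training_corpus))  # -> ('essay', 4)
```

In this sketch, the structural mode of the best-matching training text would then be reused as the template for the test text; how that mode is stored and transferred is not described in the abstract and is left out here.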
Keywords/Search Tags: text structure analysis, text understanding, text structure, word semantic relations, inter-sentence semantic relations, text segmentation, text hierarchical structural analysis