
Text Style Analysis Based On Statistical Method

Posted on: 2013-01-22
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J M Zhang
Full Text: PDF
GTID: 1118330374980735
Subject: Computer application technology
Abstract/Summary:
Computational linguistics is an interdisciplinary field spanning computer science, mathematics, and linguistics; it combines formal mathematical models with computer technology to process and analyze natural language. Text analysis is an important research area within it. Much has been achieved in the study of single words, phrases, and sentences, which lays a good foundation for whole-text analysis and motivates the text style research in this paper.

From a methodological point of view, research methods in computational linguistics fall into two categories: rule-based and statistics-based. Approaches to text analysis have mostly evolved from statistical methods built on different kinds of language corpora. This paper creates and updates corpora of different sorts and purposes by retrieving, crawling, and extracting texts from the Web. The corpora in use include public corpora, a course materials corpus, character/word frequency lists, Chinese and English dictionary corpora, standard Chinese vocabulary and phrases, graded English vocabulary, professional English terminology, a mnemonic corpus for English learning, and testing corpora for English language acquisition. The construction and maintenance of these corpora involves four aspects: (a) Web crawling and local processing, as for the standard Chinese vocabulary and phrases; (b) Web-based knowledge discovery, as for the public corpora, dictionary corpora, and similar resources provided by Google; (c) continual updating of materials discovered on the Web, as for the character/word frequency libraries; and (d) course materials collected from the Web and then processed with intelligent algorithms, as for the professional English corpus, which was processed with algorithms such as Conditional Random Fields (CRF) and Hidden Markov Models (HMM) and is dynamically updated through the network.
Building on these corpora, the algorithms for processing course materials, and the preprocessing systems mentioned above, this paper accomplishes a statistical analysis of the writing style of Chinese texts and the teaching style of English texts, and explores network applications in English language teaching.

The work consists mainly of two parts. One is the preparation of language materials and technical support for style analysis, which mainly concerns corpus construction, text preprocessing, and statistical algorithms. The other is the theory of style analysis and its application to Chinese and English texts. The essential contents are as follows.

1. Corpus construction approaches for text style analysis

Establishing corpora is a prerequisite for text analysis with the statistical methods of computational linguistics. Although many practical corpora already exist, this paper presents a standard Chinese corpus of vocabulary and phrases and an English corpus based on graded vocabulary, because text style analysis needs both synchronic and diachronic explanation. For the stylistic analysis of professional English texts, Conditional Random Fields are applied together with Hidden Markov Models. The XML/RDF descriptions recommended by the World Wide Web Consortium (W3C) are adopted to create corpora that meet the needs of different applications.

2. Text preparation and statistical analysis algorithms

To support subsequent text style analysis based on the above corpora, two kinds of algorithms are used: text preprocessing and text statistics. Text preprocessing, in other words text preparation, mainly includes text regularization algorithms, a text truncation algorithm, a word segmentation algorithm for Chinese texts, and so on.
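The preparation steps named above (regularization, splitting into sentences, Chinese word segmentation) can be illustrated with a minimal sketch. This is not the dissertation's implementation: the function names `regularize`, `split_sentences`, and `forward_max_match` and the toy lexicon are hypothetical, and the segmenter is plain forward maximum matching, a common dictionary-based baseline, rather than the adaptive "best-first" method the paper proposes.

```python
import re
import unicodedata


def regularize(text):
    """Normalize full-width/half-width forms (NFKC) and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Split after Chinese or Western sentence-final marks (assumed set: 。！？!?)."""
    parts = re.findall(r"[^。！？!?]+[。！？!?]*", text)
    return [p.strip() for p in parts if p.strip()]


def forward_max_match(sentence, lexicon, max_len=4):
    """Baseline segmenter: at each position take the longest lexicon word.

    Falls back to a single character when no lexicon entry matches.
    """
    tokens = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


if __name__ == "__main__":
    toy_lexicon = {"今天", "天气", "很好", "我们", "公园"}  # hypothetical entries
    for sent in split_sentences(regularize("今天天气很好。  我们去公园。")):
        print(forward_max_match(sent.rstrip("。"), toy_lexicon))
```

Greedy matching of this kind cannot resolve ambiguous boundaries or multi-character idioms that are missing from the lexicon, which is the sort of limitation a multi-pass segmentation strategy is meant to address.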
Statistical analysis algorithms primarily include character statistics, word statistics, distinct-word statistics, frequency statistics of distinct characters or words, sentence counts, and sentence length statistics.

3. Statistical analysis method for Chinese text writing style

On the basis of the above preprocessing and statistics, a style analysis model for Chinese texts is built to reveal, quantitatively, the statistical features of the words, phrases, and sentences of a text. Text popularity, conformity, and text rhythm are defined as properties of the writing style and are measured by the following parameters of a specific text: word frequency entropy, word and sentence clustering degrees, and sentence dispersion degree.

4. Statistical analysis method for the teaching style of English texts

According to the characteristics of English texts, this paper puts forward an analysis framework and establishes a mathematical model for the teaching style of English texts. It also proposes general quantitative parameters for English texts: word rank, rank coverage of new words, text difficulty degree, and, for professional English texts, the average co-occurrence rate. These parameters describe the teaching style of English texts and support its application in network teaching systems.

As a branch of linguistics, stylistic research can be traced back to the 18th century. Based on the methods of computational linguistics, this paper applies computer technology and formal mathematical models to text style analysis as quantitative research. The innovations are as follows:

1. This paper proposes an integrated, adaptive, optimized word segmentation method for Chinese based on a "best-first" strategy. Because writing style analysis involves phrases, idioms, Chinese allegorical sayings, proverbs, maxims, parallel prose, and other complicated language materials, the algorithm adopts an adaptive multi-pass strategy to segment Chinese texts into the best word clusters. Compared with other popular segmentation systems, the algorithm improves the recall rate and disambiguation precision, and enables the next stage of text analysis to scrutinize the writing style characteristics of the materials.

2. This paper constructs a statistical analysis model for the writing style of Chinese texts. Word frequency entropy, word clustering degree, and sentence dispersion degree are defined as evaluation parameters for text style properties such as text popularity, conformity to mass writing, and text rhythm. Four Chinese versions of the French masterpiece Ball-of-Fat by Guy de Maupassant are chosen as experimental samples. The case results verify the statistical analysis model for Chinese writing style and illustrate its practical application.

3. This paper develops an extraction model based on Conditional Random Fields for professional English texts. By improving and implementing CRFs, HMMs, conditional entropy, and maximum entropy, an extraction model for professional terminology was built with natural language grammars embedded. It greatly improves the effectiveness of professional vocabulary recognition and the applicability of word grading. On this basis, a professional English corpus was built.

4. This paper establishes a statistical analysis model for the teaching style of English texts. In this model, word rank, the difficulty coefficient, and new-word coverage are adopted to indicate the level of general English vocabulary, the effectiveness of English reading, and the difficulty of a text, and the average co-occurrence of professional terminology is adopted to indicate the degree of specialization. The application of English teaching style to Web learning is also explained. The experimental results and analysis show that the method is effective and practical.

Using more advanced computer technology to solve more linguistic problems is the goal of computational linguistics research. Further work could proceed as follows:

1. New network language materials emerge day after day, and some idioms may fall out of use, so text style changes over time. Corpora, whether Chinese or English, should adapt to ever-changing network language. Establishing a dynamic evolution model and method that keep pace with changes in network language is therefore a main task for future work.

2. Generally speaking, the text features obtained by statistical extraction methods are related to one another. Establishing a multi-parameter analysis algorithm oriented to text style is therefore another item of future work. A correlation study among statistical, structural, and semantic characteristics could also be an important topic for future research.

3. The Chinese text measurement indices and calculation methods proposed in this paper could be applied to style-based Chinese text classification, information retrieval, author identity authentication, criminal psychological analysis of texts, and other fields. Such applications can be included in further work.

4. The English text statistical analysis method established in this paper could also be used for the style analysis of English examination papers, for writing style analysis of English texts, and for the analysis of English blog texts, for example author identification. Extending its fields of application is also considered part of future work.
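The counting statistics and style parameters named in the abstract can be illustrated with a short sketch. The abstract does not give the dissertation's formulas, so two stand-ins are assumed here and labeled as such: Shannon entropy of the word distribution for "word frequency entropy", and the standard deviation of sentence length for "sentence dispersion degree". The function names `style_statistics` and `vocabulary_coverage` are hypothetical, and the coverage function is only a rough analogue of new-word coverage against a graded vocabulary list.

```python
import math
from collections import Counter


def style_statistics(sentences):
    """Basic counts for a text given as a list of sentences (lists of tokens)."""
    words = [w for sentence in sentences for w in sentence]
    freq = Counter(words)
    total = len(words)

    # Assumed stand-in for "word frequency entropy":
    # Shannon entropy (bits) of the empirical word distribution.
    entropy = 0.0
    for count in freq.values():
        p = count / total
        entropy -= p * math.log2(p)

    # Assumed stand-in for "sentence dispersion degree":
    # population standard deviation of sentence length in tokens.
    lengths = [len(sentence) for sentence in sentences]
    mean_len = sum(lengths) / len(lengths) if lengths else 0.0
    variance = (
        sum((n - mean_len) ** 2 for n in lengths) / len(lengths) if lengths else 0.0
    )

    return {
        "characters": sum(len(w) for w in words),
        "words": total,
        "distinct_words": len(freq),
        "sentences": len(sentences),
        "mean_sentence_length": mean_len,
        "sentence_length_std": math.sqrt(variance),
        "word_entropy_bits": entropy,
        "top_words": freq.most_common(5),
    }


def vocabulary_coverage(words, known_words):
    """Fraction of tokens found in a vocabulary list (lowercased comparison)."""
    if not words:
        return 0.0
    return sum(1 for w in words if w.lower() in known_words) / len(words)
```

Under these assumptions, a text that reuses a small set of common words yields a lower entropy than one with a varied vocabulary, and a text with uniform sentence lengths yields a standard deviation near zero.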
Keywords/Search Tags:Computational Linguistics, Text Style Analysis, Corpus, Chinese Word Segmentation, Information Extraction