
Study Of Language Networks

Posted on: 2011-11-10 | Degree: Doctor | Type: Dissertation
Country: China | Candidate: W Liang | Full Text: PDF
GTID: 1100360305951305 | Subject: Basic mathematics
Abstract/Summary:
Complex networks are abstractions and descriptions of complex systems. A complex network can be regarded as a system formed by nodes and the edges among them: nodes stand for individuals in the real system, and edges represent the relations among those individuals. A large number of real-world complex systems can be regarded as complex networks. They appear widely in social, economic, biological, and other fields; examples include the World Wide Web and the Internet [1,2,3,4,5], biological networks [6,7,8], collaboration networks [9,10], and public transport networks [11,12]. In recent years, network science has developed rapidly both at home and abroad. It has become an interdisciplinary field that provides new ideas and methods for studying complex systems in many areas [13,14].

The graph representation of real networks can be traced back to the study of the "Königsberg bridge problem" by the famous mathematician Euler in the 18th century. This study gave rise to a branch of mathematics, graph theory. Graph theory did not make great progress until the ER random graph was introduced by two Hungarian mathematicians, Erdős and Rényi, in 1960 [15]. In the ER model, there are N nodes, and each pair of nodes is connected with probability p, so there are about pN(N-1)/2 edges. It has been found that some properties of the ER model emerge suddenly: for any given probability p, either almost every graph has a certain property, or almost every graph does not have it. In the last 40 years of the 20th century, random graph theory was the basic theory for studying the structure of complex networks. Because most real complex networks are not random, the ER model has shortcomings as a basic model of real networks. Therefore, improved versions of the ER model have been built to bring it closer to real networks [16].

The "small-world" effect was studied over the same period. A famous experiment was carried out by Milgram [17,18].
The aim of the experiment was to probe the distribution of path lengths in an acquaintance network by asking participants to pass a letter to one of their first-name acquaintances in an attempt to get it to an assigned target individual. Most of the letters in the experiment were lost, but about a quarter reached their targets, passing on average through the hands of only about six people. This is the origin of the popular concept of "six degrees of separation". The result indicates that the "small-world" effect exists in interpersonal relationships. The WS small-world model was proposed by Watts and Strogatz in 1998 [19] to describe the transition from a completely regular network to a completely random network. After that, Newman and Watts improved the WS model and built the NW model [20]. The ideas behind the two small-world models are similar. Both reflect a property of complex networks: most nodes are connected to their neighbors, and at the same time, some nodes can also be linked to other nodes outside their neighborhoods. The degree distributions of the WS and NW models are Poisson: $p(k) = \lambda^{k} e^{-\lambda} / k!$, where λ is a parameter and p(k) is the probability that a randomly chosen node in the network has degree exactly k. However, many studies have shown that the degree distributions of most real networks are not Poisson but power-law: $p(k) \propto k^{-\gamma}$, where γ is a positive constant. Such a network is said to be scale-free. To explain the mechanism that produces a power-law degree distribution, Barabási and Albert built the BA scale-free network model in 1999 [21]. They found that the growth in the number of nodes together with preferential attachment results in a power-law degree distribution: a relatively small number of nodes have a very large number of connections, while most nodes have few connections (that is, the distribution has a "fat tail").
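As a rough illustration of the contrast drawn above between random graphs and growing networks with preferential attachment, the following Python sketch generates the degree sequence of an ER graph and of a BA-style network with the same expected mean degree. The function names, the seed network (a complete graph on m + 1 nodes), and the parameter values are assumptions made for this sketch; they are not taken from the thesis or from [15, 21].

```python
import random


def er_graph_degrees(n, p, rng):
    """Degree sequence of an ER random graph: each pair is linked with probability p."""
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return degree


def ba_graph_degrees(n, m, rng):
    """Degree sequence of a growing network with preferential attachment (BA-style).

    Assumed seed: a complete graph on m + 1 nodes. Each new node links to m
    distinct existing nodes chosen with probability proportional to degree.
    """
    degree = [m] * (m + 1)
    # Every node appears in `stubs` once per incident edge, so a uniform draw
    # from `stubs` is a degree-proportional draw over nodes.
    stubs = [v for v in range(m + 1) for _ in range(m)]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        degree.append(m)
        for t in targets:
            degree[t] += 1
            stubs.extend((t, new))
    return degree


rng = random.Random(0)  # fixed seed so the run is repeatable
n, m = 1000, 2
ba = ba_graph_degrees(n, m, rng)
er = er_graph_degrees(n, 2 * m / (n - 1), rng)  # same expected mean degree as BA
print("largest degree, ER:", max(er))
print("largest degree, BA:", max(ba))
```

With equal mean degree, the BA-style network is expected to contain a few hubs whose degree far exceeds anything in the ER graph, which is the "fat tail" described above. The sketch only compares the largest degrees and does not fit an exponent.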
In addition, it has been shown that hierarchical organization [22], node merging and regeneration [23,24,25], and node copying [26,27] can also give networks the scale-free feature.

Language is a common object of study in linguistics, psychology, and biology. It is a quintessence of human civilization, and it is a complex system that has passed through an odyssey of evolution [28]. Solé argued that the properties of complex networks are embodied in language, including its speech sounds, syntax, and semantics [29]. Studies of co-occurrence, syntactic dependency, and semantic dependency have been carried out both at home and abroad.

Fruitful results have been obtained in research on English language networks. For example, based on the 10^7 words of the British National Corpus, two word co-occurrence networks were constructed by Cancho and Solé in 2001 [30]. Both were found to exhibit the small-world and scale-free features; furthermore, each network has two power-law exponents, 1.5 and 2.7. In 2002, Motter, Moura, et al. used an online English thesaurus with over 3000 entries to build a conceptual network in which two words are connected if they express similar concepts [31]. The network exhibits the small-world and scale-free features. A semantic network was constructed by Sigman et al. in 2002 [32] from the semantic relationships (such as antonymy) among the 66,025 nouns in WordNet. This network was also shown to exhibit the small-world and scale-free features.

Some good results have also been obtained in the study of Chinese language networks. Syntactic dependency networks were constructed from a set of words by Wei et al. [33,34], where two words are connected if they contain the same character, such as "rule of law" and "the arm of the law". The global semantic structures of two large semantic networks, HowNet and WordNet, were analyzed by Tang et al. in 2006 [35].
Four word co-occurrence networks were built by Liu et al. in 2007 [36] from 13,000,000 characters of the People's Daily (first half of 1998) and 50,000,000 characters of the corpus of the Chinese language. Based on PFR1.0, the largest tagged Chinese corpus, two word co-occurrence networks were constructed by Zhou in 2008 [37]. These networks exhibit the small-world and scale-free features.

In recent years, networks for some other languages have also been studied. For example, syntactic dependency networks were built for Czech, German, and Romanian by Cancho et al. [38]. These networks have the small-world and scale-free features. In 2006, Markosova et al. constructed two word co-occurrence networks from Slovak texts on the Internet, which exhibit small-world features [39].

Sentences in Chinese are formed from characters and words, while sentences in English are formed from words. Character co-occurrence networks can be constructed in the same manner as word co-occurrence networks. As yet, no study has been performed on character networks apart from our conference paper [40]. In the existing literature, all studies have focused on a single network constructed from a large number of articles selected from a tagged corpus, WordNet, an online English dictionary, etc. However, a character co-occurrence network and a word co-occurrence network can be constructed from a single article in Chinese, and a word co-occurrence network can be constructed from a single article in English. Do these networks still exhibit small-world and scale-free features? Can useful conclusions be obtained by comparing the network parameters of two or more languages from a complex network perspective? To answer these questions, we constructed 114 networks from a collection of 53 Chinese articles, comprising essays, novels, popular science articles, and news reports, together with 4 concatenated articles, one for each type [40].
We found that these character and word co-occurrence networks are qualitatively equivalent, i.e., they exhibit small-world and scale-free features.

There are at least 6800 different languages in the world [41]. Chinese and English are two of the most widely spoken. Are there commonalities and differences between Chinese and English, and among four types of articles (essays, novels, popular science articles, and news reports) in each language, from a complex network perspective? China has a long history, and its culture is long-standing and well established. Are there commonalities and differences among Chinese articles of different periods from a complex network perspective? As far as we know, no results on these problems have been reported.

As for evolving language network models, Dorogovtsev and Mendes built the DM model in 2001 to analyze the degree distribution found in [30]. The DM model is obtained from the BA model by additionally adding ct edges (c is a constant) to the network at time t. The power-law exponent is 3 in the region of the kernel lexicon, and γ = 1.5 in the region of the rest of the lexicon [42]. To better simulate the degree distribution in [30], Markosova built a new model in 2007 by adding edge rewiring to the DM model [43]. In 2008, Yu et al. constructed networks from the inclusion relationships of Chinese characters or phrases, and built a model with growth and preferential attachment [44]. The Chinese language has gone through more than 5000 years of development. How can a model be built to depict this development? To the best of our knowledge, apart from references [42,43,44], there are no other models of evolving language networks, and in particular none that depict the evolution of the Chinese language.

This thesis is divided into four chapters.
We study the commonalities and differences between the Chinese and English languages, and among Chinese articles of different periods. Furthermore, an evolving language network model based on the development of the Chinese language is built, and the degree distributions of character co-occurrence networks in different periods are obtained by computer simulation.

In Chapter one, we introduce some basic concepts of complex networks, including the average shortest path length, clustering coefficient, degree distribution, etc.

In Chapter two, we study commonalities and differences between the Chinese and English languages, and among four types of articles (essays, novels, popular science articles, and news reports) in each language, from a complex network perspective. Co-occurrence networks of Chinese characters and words, and of English words, are constructed from collections of 200 Chinese and 200 English articles, respectively. We find that the character and word networks of each type of article in Chinese, and the word networks of each type of article in English, all exhibit small-world features, and most of them have scale-free features.
It is shown that expressions in English are, in a certain sense, briefer than those in Chinese; essays and popular science articles in Chinese share some common features, whereas news reports and popular science articles in English share some common features.

In Chapter three, we study commonalities and differences among Chinese articles of different periods from a complex network perspective. A total of 561 co-occurrence networks of Chinese characters are constructed from collections of essays from 11 periods of Chinese history: the Spring and Autumn and Warring States period, the Western and Eastern Han Dynasties, the Three Kingdoms, the Eastern and Western Jin Dynasties, the Southern-Northern Dynasties, Tang, Song, Yuan, Ming, Qing, and modern China. Of these, 550 networks are each built from a single article, and the other 11 are each built from the fifty articles of one period concatenated together. Among the 550 single-article networks, 99.6% have scale-free properties and 95.0% show the small-world effect. This study provides important data for modeling the development of the Chinese language with language networks. In addition, it is a controversial question in linguistics whether the articles of the Wei, Jin, and Southern-Northern Dynasties belong to the ancient Chinese language or to the recent Chinese language. Our study shows that the statistical parameters of networks from the Wei, Jin, and Southern-Northern Dynasties are clearly different from those of networks from the other periods, and it seems more reasonable to assign the articles of these dynasties to the recent Chinese language.

In Chapter four, based on the features of Chinese language development, we build an evolving language network model. In the model, new nodes are added, and edges are added, rewired, and deleted. Furthermore, we calculate the degree distribution of the model.
We find that the degree distribution of the model is a power law in some cases, with the power-law exponent ranging between 1 and +∞, and an exponential distribution in other cases. The parameters of the model are determined from the statistical parameters of the 550 essay character co-occurrence networks from the 11 periods. We find that when a new word or expression is formed during the development of the Chinese language, the selection of characters shows weaker randomness and stronger preference.
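The quantities used throughout this abstract (co-occurrence network, clustering coefficient, average shortest path length, degree distribution) can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's procedure: the abstract does not state how co-occurrence is defined, so linking two tokens when they are adjacent within a sentence is an assumption, as are all function names and the toy input.

```python
from collections import deque


def cooccurrence_network(sentences):
    """Undirected graph linking tokens that are adjacent in a sentence.

    Adjacency (a window of one) is an assumption made for this sketch.
    """
    adj = {}
    for tokens in sentences:
        for tok in tokens:
            adj.setdefault(tok, set())
        for a, b in zip(tokens, tokens[1:]):
            if a != b:  # ignore self-loops from repeated tokens
                adj[a].add(b)
                adj[b].add(a)
    return adj


def average_clustering(adj):
    """Mean over nodes of the fraction of neighbor pairs that are linked."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than two neighbors contribute 0
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean BFS distance over ordered pairs of distinct, connected nodes."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def degree_distribution(adj):
    """p(k): fraction of nodes that have exactly k neighbors."""
    counts = {}
    for nbrs in adj.values():
        counts[len(nbrs)] = counts.get(len(nbrs), 0) + 1
    return {k: c / len(adj) for k, c in sorted(counts.items())}


# Toy input: English sentences split on whitespace. For a Chinese character
# network, each sentence would instead be passed as a list of characters.
text = ["the cat sat on the mat", "the dog sat on the cat"]
network = cooccurrence_network([s.split() for s in text])
print("nodes:", len(network))
print("average clustering:", average_clustering(network))
print("average path length:", average_path_length(network))
print("degree distribution:", degree_distribution(network))
```

A network is usually called small-world when its average path length is short and its clustering coefficient is much larger than that of a random graph of the same size, and scale-free when p(k) follows a power law. A two-sentence toy text is far too small to show either property; real use would need article-length input.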
Keywords/Search Tags: Chinese, English, Co-occurrence network, Small-world, Scale-free