A New Method To Identify The Boundary Between High-Frequency And Low-Frequency Words In Corpus Based On Zipf’s Law And Application

Posted on: 2015-01-20
Degree: Doctor
Type: Dissertation
Country: China
Candidate: F Ye
Full Text: PDF
GTID: 1228330467965576
Subject: Library science
Abstract/Summary:
Scientometric methods are widely used across many disciplines, and a large number of related papers have been published. Identifying high-frequency words is the basic step in using scientometric methods to find research focuses, and the number of high-frequency words directly influences the results. Yet how to determine the dividing point between high-frequency and low-frequency words objectively and effectively is a question that still puzzles researchers, or is simply ignored by most of them. Most researchers set the dividing point from their own experience. Some use the h-index method or the g-index method. All of these approaches lack a theoretical foundation. Pao and Sun proposed a method for determining the dividing point based on Zipf’s second law. The author endorses the work of Pao and Sun but disagrees with their theory of the "same-frequency word", which rests on the researchers’ conjecture and has no theoretical foundation. Moreover, with all of the above methods the number of high-frequency words is either too large or too small and is unstable from year to year, which makes them difficult to use. Zipf’s law has been regarded as one of the fundamental findings in scientometrics since the middle of the last century. Although many researchers have studied the law, its meaning is still far from clear; in particular, the question of what exactly the constant "C" is still puzzles researchers. This study chose two subjects as the analysis corpora: scientometrics and polluted environment remediation. The author knows both subjects well, and the properties of the two corpora are quite different, which makes it possible to test the universality of the new method.
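For reference, the h-index method mentioned above can be sketched as follows. The abstract does not spell out how the h-index is adapted to word frequencies, so this sketch assumes the usual reading: the cut-off h is the largest number such that h words each occur at least h times. The function name `h_index_cutoff` and the toy token list are hypothetical and serve only as illustration.

```python
from collections import Counter

def h_index_cutoff(tokens):
    """Return (h, high_frequency_words) for a list of word tokens.

    Assumption: h is the largest rank at which the word's frequency
    is still greater than or equal to its rank.
    """
    # most_common() sorts words by descending frequency
    ranked = Counter(tokens).most_common()
    h = 0
    for rank, (_, freq) in enumerate(ranked, start=1):
        if freq >= rank:
            h = rank
        else:
            break
    return h, [word for word, _ in ranked[:h]]

# Toy corpus (made up): frequencies 5, 4, 3, 2, 1
tokens = ("citation " * 5 + "impact " * 4 + "index " * 3
          + "journal " * 2 + "zipf").split()
h, words = h_index_cutoff(tokens)
print(h, words)
```

A cut-off of this kind depends only on the shape of the frequency list, which is one way to see why the number of words it selects can vary from one year's corpus to the next.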
A total of 934 papers on scientometrics published from 2002 to 2011 were retrieved as the corpus used to develop the new method. The constant in Zipf’s law was analysed using this corpus. Some regularities in the value of C in Zipf’s law were identified, and a new method to identify the boundary between high-frequency and low-frequency words in a corpus, based on Zipf’s first law, was then proposed. Tests on corpora from both scientometrics and polluted environment remediation showed the advantages and universality of the new method, which is therefore worth applying. The study then applied the new method to scientometrics and polluted environment remediation and traced the development of the two subjects over the past ten years, which further demonstrated the advantages and applicability of the new method. The main conclusions of this study are:
1. The value of C in Zipf’s law is a parameter rather than a constant. It fluctuates with the scale of the corpus. This conclusion agrees with Zipf’s opinion that C is a parameter but disagrees with his opinion that C lies in the range 0<C<0.1.
2. Compared with the other methods, the new method has advantages in both the quantity and the stability of the high-frequency words it identifies, and it is not affected by the scale or character of the corpus. The new method also showed its universality when tested on both the scientometrics corpus, which consists of titles and abstracts, and the polluted environment remediation corpus, which consists only of keywords. The method proposed by Pao and Sun can be applied only to corpora consisting of titles and abstracts. All in all, compared with the other methods, the new method shows an obvious advantage in determining the dividing point between high-frequency and low-frequency words in a corpus and is worth promoting.
3. Applying the new method to the scientometrics corpus, we found that after ten years of development scientometrics had formed some basic research issues, for example impact factors, citation analysis and research performance, and that the field was still developing. Some new research issues, for example co-citation analysis and the h-index, were leading scientometrics to go deeper.
4. Applying the new method to the polluted environment remediation corpus, we found that soil was the main medium and that heavy metals and PAHs were the main contaminants studied by researchers. Phytoremediation, electrokinetic remediation and bioremediation were the main remediation methods. As the economy and detection technology developed, more and more contaminants were detected or attracted researchers’ attention, so remediation methods were continually improved to deal with the pollution. Research on polluted environment remediation showed a trend towards combining different remediation methods and innovating in remediation agents. The emergence of new contaminants and new remediation methods has made research on polluted environment remediation boom. Some countermeasures based on the research results are put forward at the end.
The main innovative points of this study are:
1. This study redefined the characteristics of the group of high-frequency words and provided a new idea for identifying them. It also expanded the application of Zipf’s law and provides a reference for related research.
2. Compared with the other methods, the new method shows an obvious advantage in determining the dividing point between high-frequency and low-frequency words in a corpus. Scientometrics currently needs a scientific and advanced method to standardize its data, and the new method meets this need. If the new method is accepted by most researchers, it will greatly influence scientometrics and can promote the application of scientometrics on a larger scale.
3. The countermeasures based on the research results for polluted environment remediation provide a valuable reference for researchers in that field.
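The quantity C discussed in the conclusions can be illustrated with a minimal sketch. It assumes the common statement of Zipf’s first law in which the rank of a word multiplied by its relative frequency is roughly constant, so that C is estimated per rank as rank × count / total tokens. This normalization, the function name `zipf_products` and the toy data are assumptions made for illustration; the sketch is not the boundary method proposed in the dissertation, which the abstract does not specify.

```python
from collections import Counter

def zipf_products(tokens):
    """Return a list of (rank, word, count, c) tuples.

    c = rank * count / total_tokens is the per-rank estimate of the
    Zipf "constant" under the relative-frequency form of the first law
    (an assumed normalization, see the lead-in).
    """
    ranked = Counter(tokens).most_common()  # descending frequency
    total = sum(count for _, count in ranked)
    return [
        (rank, word, count, rank * count / total)
        for rank, (word, count) in enumerate(ranked, start=1)
    ]

# Toy corpus (made up): word counts 6, 3, 2, 1
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2 + ["d"]
rows = zipf_products(tokens)
for rank, word, count, c in rows:
    print(rank, word, count, round(c, 3))
```

Running the same function on corpora of different sizes, for example one year of papers versus ten years, is one way to observe whether the estimated C stays fixed or drifts with corpus scale.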
Keywords/Search Tags: Zipf’s law, high-frequency words, low-frequency words, dividing point, polluted environment remediation