
Chinese Text Mining

Posted on: 2010-10-07    Degree: Doctor    Type: Dissertation
Country: China    Candidate: L L Hao    Full Text: PDF
GTID: 1118360272497273    Subject: Probability theory and mathematical statistics
Abstract/Summary:
With the rapid development of computer networks and information technology, the rate at which information is generated has grown enormously, and governments, companies, and other institutions have accumulated large volumes of text data. Finding methods that extract valuable information from these data is therefore both urgent and challenging. Text data mining treats text as the object of data mining and combines quantitative computation with qualitative analysis to discover structures and models carrying potentially valuable new information. As text data grow, so does the importance of text mining. Classification and clustering are its two basic tasks and underlie virtually all of its applications.

This thesis starts from concrete practical problems in the mayor's public access line project, and all of its studies are based on real data sets. It uncovers rules hidden in massive text data and provides information to support government decision-making, which gives the work substantial practical significance. As the public telephone service develops, data accumulate further, making it possible to mine more of the periodicities and trends they contain. Hao Lizhu carried out basic research on the mayor's public telephone line in his doctoral thesis; building on his platform system, I make further improvements and explore hot-issue mining, classifier accuracy, Bayesian network construction, and large-scale parallel computing.

Chapter 1 reviews the development of Chinese word segmentation, feature dimensionality reduction, classification algorithms, and performance evaluation. It describes the mayor's public telephone data sets and presents work on thesaurus construction and frequency curve fitting. In thesaurus construction, the original thesaurus was so large that neither the speed nor the quality of segmentation was satisfactory. Guided by the actual needs of the text messages, I removed words that had long fallen out of use and designed a purification algorithm for the thesaurus, yielding a specialized thesaurus. Because the purified thesaurus was still large and considerable word ambiguity remained, I combined automatic segmentation with manual refinement to improve it further. The specialized thesaurus greatly reduced the dimension of the vector space model and improved both the speed and the accuracy of segmentation.
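As a rough sketch of how a specialized thesaurus plugs into segmentation, the example below loads a purified user dictionary before segmenting a complaint text. The jieba library, the dictionary file name, the sample sentence, and the toy stop-word list are all illustrative assumptions; the thesis does not name its segmenter.

```python
# Sketch: segmenting hotline complaint texts with a purified,
# domain-specific thesaurus. Uses the open-source jieba segmenter
# (not named in the thesis); "mayor_hotline_dict.txt" is a
# hypothetical purified thesaurus file.
import jieba

# One word per line, optionally with frequency and POS tag,
# e.g. "供暖 2000 n".
jieba.load_userdict("mayor_hotline_dict.txt")

def segment(text, stop_words):
    """Segment a complaint text and drop stop words."""
    return [w for w in jieba.cut(text) if w.strip() and w not in stop_words]

stop_words = {"的", "了", "是"}  # toy stop-word list
print(segment("小区冬季供暖温度不达标", stop_words))
```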
Chapter 2 focuses on finding hot issues of common concern in massive data. Although individual complaints are random events, statistical regularities emerge when hundreds of thousands of citizens complain about the same issues. Hot issues are problems on which complaints concentrate within a short period of time, such as noise during the college entrance examination, make-up classes during winter and summer vacations, the "three rural issues" in spring, and garbage disposal. Such problems arrive in large numbers and at high speed; left untreated, they can have serious negative consequences such as collective petitions, blocked traffic, or strikes. Effective extraction of hot issues helps the government formulate policies that address the difficulties the public encounters.

Analyzing hot issues by hand is time-consuming and inefficient, and finding them objectively among tens of thousands of complaints is difficult: often the problem is discovered too late, or not at all. Grouping tens of thousands of public complaints by shared theme is in essence a clustering problem. How can hot issues be found in massive text data? Applying a traditional document-clustering method directly yields poor results because of the high dimensionality, so this thesis transforms the extraction of hot issues into first finding hot-issue words and clustering those as variables. Hot words reflecting the same subject are gathered, a clustering tree is drawn, and hot issues are then extracted from the tree. The thesis discusses this extraction procedure, reports the results of practical application, and compares them with manual extraction: the method proves more accurate while saving considerable human and material resources. Finally, a method for determining the number of clusters is given.

The events behind hot issues all involve important key words that are strongly statistically correlated with particular months; we call these hot words. To extract hot words statistically, we need a distinguishing criterion:
1. The word has a strong statistical correlation with the current month.
2. The word's dispersion within the month is low, i.e. the documents containing it concentrate on a few consecutive days.

Stop-word removal must precede hot-word extraction. To keep stop words from disturbing hot-word extraction, we identify them by two conditions:
1. The word has a high document frequency.
2. The word has low statistical correlation with all months.

The statistical correlation between a key word w_r and the months is judged by a hypothesis test:

H_0: w_r is independent of m_i, i = 1, ..., 12;  H_1: there is at least one k such that w_r is not independent of m_k.

Table 1: The 2 × p contingency table of w_r.

The chi-square statistic of the contingency table and the weighted chi-square statistic are

χ²(w_r) = Σ_i Σ_j (n_ij − E_ij)² / E_ij,  where E_ij = n_i· n_·j / n,

χ²_w(w_r) = w(n_r) · χ²(w_r),

where n_r is the document frequency of the word w_r, w(n_r) is the weighting factor, and Σ_r n_r is the normalizing factor. I order the words by the weighted chi-square statistic from small to large and drop the first 800 words.

Before picking out the hot words, we measure the degree of correlation between a given word and the current month by testing the hypothesis on a 2 × 2 contingency table:

H_0: w_r is independent of m_k;  H_1: w_r is not independent of m_k.

Table 2: The 2 × 2 contingency table of w_r and the month m_k.

The chi-square statistic between a feature word and its correlated month m_k is

χ² = n (n_11 n_22 − n_12 n_21)² / (n_1· n_2· n_·1 n_·2).

To measure the dispersion of a word w_j, let the random variable T be the date of a document, with sample space Ω = {1, 2, ..., k}, 1 ≤ k ≤ 31. Its probability distribution, expectation, and variance are

P(T = t) = n_t / N,  E(T) = Σ_t t · P(T = t),  Var(T) = Σ_t (t − E(T))² · P(T = t),

where n_t is the number of the word's documents dated day t and N is its document frequency in the month. The variance reflects the dispersion of the dates.
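To make these statistics concrete, the sketch below computes the 2 × p chi-square, the 2 × 2 chi-square, and the date dispersion from raw counts. The array layouts are assumptions consistent with the definitions above, not the thesis's actual implementation.

```python
import numpy as np
from scipy.stats import chi2_contingency

def month_chi2(with_word, without_word):
    """Pearson chi-square for the 2 x p table of a word vs. months.
    with_word[j]    = documents in month j containing the word,
    without_word[j] = documents in month j not containing it."""
    table = np.array([with_word, without_word])
    chi2, p, dof, expected = chi2_contingency(table)
    return chi2

def word_month_chi2(n11, n12, n21, n22):
    """Chi-square for the 2 x 2 table of one word vs. one month."""
    n = n11 + n12 + n21 + n22
    num = n * (n11 * n22 - n12 * n21) ** 2
    den = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    return num / den

def date_dispersion(day_counts):
    """Standard deviation of the date variable T within one month.
    day_counts[t] = documents containing the word on day t+1."""
    day_counts = np.asarray(day_counts, dtype=float)
    days = np.arange(1, len(day_counts) + 1)
    p = day_counts / day_counts.sum()       # P(T = t)
    mean = (days * p).sum()                 # E(T)
    var = ((days - mean) ** 2 * p).sum()    # Var(T)
    return np.sqrt(var)

# Toy check: a word whose documents concentrate on days 12-14
# has low dispersion.
print(date_dispersion([0] * 11 + [30, 45, 25] + [0] * 17))
```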
The hot-word score is computed as

x_σ = log(N / 10) · χ² / (σ + 1),

where χ² is the chi-square value of the word w_j, σ is the standard deviation of its dates, 1 is a smoothing factor, and N is the word's document frequency in the month; log(N / 10) serves as the weighting coefficient, with the value 10 chosen by expertise. Ordering the x_σ values of each month from large to small yields that month's hot-word list, for example the list for May.

Figure 1: The hot-word list of May 2008.

On 12 May 2008 the devastating Wenchuan earthquake struck, drawing great attention from the citizens of Changchun, who expressed their wish to join the relief campaign through the "12345" hotline. The May data of the Mayor's Public Telephone show that although only about 100 documents contained these words, most of the events they reflected were unexpected incidents or emergencies. The growth in their number from 13 May, the day after the earthquake, to the end of the month shows that the event reflected by these words was indeed a hot issue. The monthly report of the Mayor's Public Telephone records: "Complaints with obvious changes are as follows: (1) there is a great increase in civil administration, and reflections related to earthquake relief stand out." This event is listed first among all complaints with obvious changes, so it is clearly important, and the manual assessment agrees with the automatic one produced by our method.

The hot-word table shows that several words sometimes relate to the same event. Analyzing the intent of all documents related to each hot word separately would be inefficient given the mass of documents, so we cluster the hot words to gather those that reflect the same theme, and then extract hot issues by looking up the related documents. Cluster analysis in multivariate statistics has two major directions: clustering samples (Q-type clustering) and clustering variables (R-type clustering). Since our purpose is to cluster the hot words in the hot-word table, R-type clustering is the appropriate choice. Each hot word can be regarded as a binary 0-1 variable. Let the hot-word table be W = {w_1, ..., w_n}, where n is the size of the table and w_i is the i-th hot word, and let X = {x_1, ..., x_n} be the corresponding variable set: x_i = 1 if hot word w_i appears in a document and x_i = 0 otherwise. We then cluster the variables in X. As the similarity coefficient between two variables we use the Jaccard coefficient, which ignores joint absences. Let d_ij denote the similarity coefficient between x_i and x_j, and let G_1, G_2, ... denote the clusters. The similarity between clusters is defined by the most similar pair of variables:

D_pq = max { d_ij : x_i ∈ G_p, x_j ∈ G_q },

where D_pq is the similarity coefficient between G_p and G_q. Using a hierarchical clustering algorithm we obtain the clustering tree of the hot words of March 2008, shown in Figure 2.
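A minimal sketch of the scoring and R-type clustering steps, assuming a binary document-word matrix; it uses SciPy's Jaccard distance and single-linkage hierarchical clustering, which matches the "most similar pair" merging rule above, though the thesis's own implementation may differ.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

def hot_word_score(chi2, sigma, n_docs):
    """x_sigma score as reconstructed above: strong correlation, low
    dispersion, and high frequency all raise the score."""
    return np.log(n_docs / 10.0) * chi2 / (sigma + 1.0)

def cluster_hot_words(doc_word, n_clusters):
    """R-type clustering of hot words.
    doc_word: binary (n_docs x n_words) matrix, doc_word[d, i] = 1
    iff hot word w_i occurs in document d."""
    # Jaccard distance between word columns = 1 - Jaccard similarity;
    # joint absences are ignored, as required for sparse binary data.
    dist = pdist(doc_word.T.astype(bool), metric="jaccard")
    # Single linkage merges the clusters containing the most similar
    # pair of variables, i.e. D_pq = max d_ij.
    tree = linkage(dist, method="single")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Toy example: 5 documents, 4 hot words.
m = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]])
print(cluster_hot_words(m, 2))  # e.g. words 0-1 and 2-3 group together
```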
Figure 2 shows that "bean oil", "fertilizer", "prices", and "too fast" group together, reflecting the same theme; the corresponding hot sentences, shown in Figure 3, say that oil and fertilizer prices are rising too fast. Likewise "poor", "color TV", and "helping the poor" form one cluster, namely "color TVs for poverty relief have not been delivered". The variable clustering thus performs well and avoids the unnecessary duplicate effort of examining each hot word in isolation.

Figure 3: The hot sentences containing "bean oil" and "fertilizer".

Chapter 3 studies the design of Chinese text classifiers for the Mayor's Public Telephone. The naive Bayes classifier is a simple, effective, and widely used classification algorithm. When texts contain many feature words with complex relationships, introducing new information can improve the results; after introducing time information, I propose two improved Bayesian classifiers.

The first is a weighted naive Bayes classifier whose weight parameters act on the class nodes. After the classifier computes the posterior probability of each class, the posteriors are adjusted by class-specific powers before the final classification decision. The adjustment coefficients are fixed by the distribution of complaint samples across types and times. Table 3 compares the results.

Table 3: Comparison between naive Bayes and weighted naive Bayes.

As Table 3 shows, the weighted method improves on naive Bayes: its accuracy is 1.32% higher than that of the standard naive Bayes classifier.

The second improvement exploits the seasonal character of the data. Testing the independence between the categories and time shows that they are not independent most of the time. Using this important prior information, I construct a naive Bayes classifier for time-series data whose prior parameters are estimated by kernel regression. Compared with the standard naive Bayes classifier on the Mayor's Public Telephone data, the classification accuracy improves greatly.

I also carried out experimental tests and performance analysis of text classification based on support vector machines on the Mayor's Public Telephone data. The results show a higher accuracy rate, especially on the smaller classes, clearly better than naive Bayes. However, the large amount of data makes training and parameter optimization too slow for overnight learning, so I propose a parallel processing strategy for support vector machines. Exploiting their good generalization ability on small sample spaces, I divide the Mayor's Public Telephone data into 12 subsets and train a support vector machine on each. In actual operation, classification is performed by each subset's own support vector machine. In practice, the accuracy on small data sets decreased only slightly while the overnight training time was greatly reduced, so the requirement of classifying in the daytime and learning at night can be met.
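A rough sketch of the weighted naive Bayes idea, raising class posteriors to class-specific powers before choosing the label; the exponent values and the use of scikit-learn's MultinomialNB are illustrative assumptions, not the thesis's exact estimator.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

class WeightedNB:
    """Naive Bayes whose class posteriors are adjusted by per-class
    power weights before classification (a sketch of the idea above)."""

    def __init__(self, class_weights):
        self.nb = MultinomialNB()
        self.w = np.asarray(class_weights)  # one exponent per class

    def fit(self, X, y):
        self.nb.fit(X, y)
        return self

    def predict(self, X):
        post = self.nb.predict_proba(X)      # P(c | d)
        adjusted = post ** self.w            # P(c | d) ** w_c
        adjusted /= adjusted.sum(axis=1, keepdims=True)
        return self.nb.classes_[np.argmax(adjusted, axis=1)]

# The weights would be fitted from the distribution of complaint types
# over time; here they are placeholder values for two toy classes.
X = np.array([[3, 0, 1], [2, 0, 0], [0, 2, 3], [0, 1, 2]])
y = np.array([0, 0, 1, 1])
clf = WeightedNB([1.0, 1.2]).fit(X, y)
print(clf.predict(X))
```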
In addition, I designed a method to extract feature words with a binomial test and feature phrases based on word frequency; using information acquired from decision trees and Bayesian multi-networks, I then propose a rule-based scoring text classifier.

Chapter 4 addresses large-scale classification and learning problems on high-performance computers and proposes an effective task-driven parallel learning algorithm. For large-scale classification and learning, the time and space complexity induced by the size of the sample set limits an algorithm's applicability, so increasing attention is being paid to efficient parallel versions of machine-learning algorithms; the growth of data and the development of high-performance computing also make such parallel algorithms realizable. Current parallel processing algorithms mainly partition the data; their advantage is that many nodes can work in parallel in less time, which suits large-scale databases. The most representative examples are the SLIQ and SPRINT algorithms. In real environments, however, people often share the same cluster and many processes run on a machine simultaneously, so progress and CPU utilization differ across nodes, and fast node processes are forced to wait for slow ones, seriously reducing efficiency. Moreover, partitioning the total sample into small parts means that differences in the distribution of each part can degrade the classifier, and the strategy used to combine the sub-classifiers can seriously affect classification accuracy. For these reasons I propose a task-driven parallel program design: cut large tasks into small tasks and assign them to the nodes, so that whichever node finishes first receives new tasks. A master process divides the tasks and realizes the message passing and data exchange of the computation in task-driven form, achieving coordination among multiple computers and effective load balancing in complex environments, as sketched below. Based on this idea I provide a task-driven parallel decision-tree learning algorithm and a Bayesian network learning algorithm.

As the Mayor's Public Telephone data grow, the traditional decision-tree algorithm, which must scan and sort the data many times while building the tree, becomes inefficient. To meet the needs of large-scale data sets, I designed a task-driven decision tree. While maintaining the accuracy of the original serial version, tests on the Mayor's Public Telephone data show that the running speed increases significantly: on a high-performance computer with 210 processes started simultaneously, each using one CPU core, a decision tree with 50,000 nodes was completed in only 50 minutes.
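The task-driven scheme can be pictured as a master process feeding a shared task queue that fast workers drain more quickly, which yields the load balancing described above. This toy version uses Python's multiprocessing module and a dummy computation, since the thesis's actual message-passing implementation is not shown.

```python
import multiprocessing as mp

def worker(task_queue, result_queue):
    """Each worker repeatedly pulls the next small task; fast nodes
    therefore receive more tasks, balancing the load automatically."""
    while True:
        task = task_queue.get()
        if task is None:                        # sentinel: no more work
            break
        result_queue.put((task, task * task))   # dummy computation

if __name__ == "__main__":
    tasks, n_workers = list(range(100)), 4
    task_q, result_q = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(task_q, result_q))
             for _ in range(n_workers)]
    for p in procs:
        p.start()
    for t in tasks:                # master: cut work into small tasks
        task_q.put(t)
    for _ in procs:                # one sentinel per worker
        task_q.put(None)
    results = [result_q.get() for _ in tasks]
    for p in procs:
        p.join()
    print(len(results), "tasks completed")
```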
The thesis also presents a parallel algorithm, based on sorting chi-square values, that constructs Bayesian networks with the K2 algorithm. First, the chi-square value of each feature word in each class is computed; the words are then ordered by their chi-square values, and this ordering serves as the prior node order of the Bayesian network, whose structure is learned with a parallel K2 algorithm. The parallel K2 algorithm, derived from the serial procedure, uses a master control process to assign tasks; the computation processes handle the tasks sent by the master and return the results to it. Their work includes scoring each node and scoring nodes with one or two parent nodes. Learning a Bayesian multi-network with 99 sub-networks took 15 days, with 277 processes each using one CPU core on the high-performance computers.

Furthermore, testing decision trees and support vector machines on data from Harbin and Changchun shows that on these noisy data the decision tree does not perform as well as the support vector machine. On this basis, I suggest using the information acquired from the decision trees and the Bayesian multi-networks, together with the feature words and polysemous words extracted by the binomial test, to establish rules for document input. These rules have been applied in Changchun and Harbin and lay a good foundation for improving the precision of text classification.

In the research on neural networks, to simulate weight regulation in biological neurons, the thesis proposes a new idea: measuring the weights of biological neurons by relying on the consequences of the transmission process. Computer simulation experiments are also carried out, providing a reference for future text classification in which the simulation of the human brain may be realized.
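Returning to the structure-learning step, here is a compact serial sketch of the K2 search under a fixed chi-square ordering, assuming binary variables and the standard Cooper-Herskovits score; the master-worker parallelization and the thesis's exact scoring details are omitted.

```python
import numpy as np
from math import lgamma

def k2_score(data, child, parents):
    """Log Cooper-Herskovits K2 score of one node given a candidate
    parent set; data is a binary (n_samples x n_vars) array."""
    r = 2                                       # binary variables
    if parents:
        configs = {}                            # counts per parent config
        for row in data:
            key = tuple(row[parents])
            counts = configs.setdefault(key, np.zeros(r))
            counts[row[child]] += 1
    else:
        counts = np.zeros(r)
        for row in data:
            counts[row[child]] += 1
        configs = {(): counts}
    score = 0.0
    for counts in configs.values():
        n_ij = counts.sum()
        score += lgamma(r) - lgamma(n_ij + r)   # (r-1)! / (N_ij+r-1)!
        score += sum(lgamma(c + 1) for c in counts)  # prod N_ijk!
    return score

def k2(data, order, max_parents=2):
    """Greedy K2 structure search under a fixed node ordering (here
    the chi-square ordering described above)."""
    parents = {v: [] for v in order}
    for i, v in enumerate(order):
        best = k2_score(data, v, parents[v])
        improved = True
        while improved and len(parents[v]) < max_parents:
            improved = False
            for cand in order[:i]:              # only earlier nodes
                if cand in parents[v]:
                    continue
                s = k2_score(data, v, parents[v] + [cand])
                if s > best:
                    best, best_cand, improved = s, cand, True
            if improved:
                parents[v].append(best_cand)
    return parents

# Toy data: variable 1 copies variable 0, variable 2 is mixed,
# so K2 should give node 1 the parent 0.
data = np.array([[0, 0, 0], [1, 1, 0], [1, 1, 1], [0, 0, 1]] * 25)
print(k2(data, order=[0, 1, 2]))
```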
Keywords/Search Tags: text categorization, text clustering, Mayor's public access line project