| A collocation is a combination of words that occurs repeatedly and follows a certain syntactic structure, yet is arbitrary rather than formed by analogy. Collocation extraction is the task of extracting collocations from a corpus automatically, using computing power and programming languages. With the rapid development of computer technology, automatic collocation extraction has attracted more and more attention. On the one hand, collocation extraction plays an important role in many natural language processing applications, such as machine translation, word sense disambiguation, language generation, and information retrieval; it also plays a very important role in language teaching and second language acquisition. On the other hand, as Internet data and massive corpora become an important source for research in computational linguistics, the rapid growth of Internet data and the expansion of corpus scale make it necessary to develop effective methods for extracting collocations automatically. In this paper, we used the key technologies of the Hadoop distributed computing platform, combined them with Chinese linguistics, and drew on statistical methods to extract typical Chinese collocations automatically from the 3-gram data of the n-gram corpus from the Google Research Institute. We studied a distributed collocation extraction system based on Java Web and Hadoop, which provides a new way for users to obtain collocations and their related information intelligently and conveniently. The main research contents are as follows. First, we reviewed the existing statistical methods of collocation extraction and the key technologies of the Hadoop distributed platform, and compared and analysed the advantages and disadvantages of these methods. We introduced the evaluation measures for collocation extraction: precision rate, recall rate, and F value. Additionally, drawing on Chinese linguistics and the contents of the corpus, we chose the typical Chinese collocation types and described their part-of-speech composition by
analysing the rules of word composition between the collocates. Finally, we gave the specific procedure used in the experiment to extract typical Chinese collocations from the n-gram corpus. The main research achievements are as follows. First, we extracted collocations automatically through a specific procedure that draws on statistical extraction methods and the related technologies of the Hadoop distributed platform, and that incorporates the part-of-speech composition rules for collocations from Chinese linguistics. We removed sparse data and non-Chinese data, used NLPIR for word segmentation and POS tagging to preprocess the corpus, selected a span to extract the candidate collocation set, filtered the candidates with the part-of-speech composition rules, and calculated co-occurrence frequency, mutual information, and the chi-square test statistic under the MapReduce model. We stored the intermediate and final results in a distributed database and constructed a Chinese collocation dictionary for users. Second, we developed the front end of the Hadoop-based Chinese collocation extraction system so that users can access collocations effectively. We used the Bootstrap framework to design the front page, which lets users set search conditions in the word search area and displays the retrieved collocations in the result display area. Third, we summarized a method for extracting typical collocations and applied a combination of big data technology, Chinese linguistics, and statistical methods to extract noun, verb, adjective, and adverb collocations in the experiments. Quantitative comparison and analysis showed that co-occurrence frequency gives the best collocation extraction results. The precision rate of noun collocation extraction is 86%, its recall rate is 59.72%, and its F value is 70.49%. The precision rate of verb collocation extraction is 80%, its recall rate is 65.57%, and its F value is 72.07%. The precision rate of
adjective collocation extraction is 82%, its recall rate is 78.85%, and its F value is 80.39%. The precision rate of adverb collocation extraction is 88%, its recall rate is 43.56%, and its F value is 58.28%. The precision rates for noun and adverb collocations are 2 to 4 percent higher than those of existing extraction software. This shows that the method has practical value for extracting Chinese collocations automatically. |
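
The statistics named in the abstract (mutual information, the chi-square test, and the F value) can be illustrated with a minimal single-machine sketch. It assumes the standard textbook definitions: pointwise mutual information in base 2, the Pearson chi-square statistic on a 2x2 contingency table, and the balanced F value as the harmonic mean of precision and recall. The function names and the dictionary layout are hypothetical and are not taken from the system described here, which computes these values under MapReduce. The only data used are the precision and recall figures reported above.

```python
import math


def f_measure(precision, recall):
    """Balanced F value: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pmi(pair_count, left_count, right_count, total):
    """Pointwise mutual information (base 2) of a word pair.

    pair_count  : how often the two words co-occur within the span
    left_count  : how often the first word occurs
    right_count : how often the second word occurs
    total       : number of observed pairs in the corpus
    """
    return math.log2(pair_count * total / (left_count * right_count))


def chi_square(pair_count, left_count, right_count, total):
    """Pearson chi-square statistic for the 2x2 table of a word pair."""
    o11 = pair_count                                      # both words present
    o12 = left_count - pair_count                         # first word only
    o21 = right_count - pair_count                        # second word only
    o22 = total - left_count - right_count + pair_count   # neither word
    numerator = total * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator if denominator else 0.0


# Precision and recall (in percent) as reported in the abstract.
REPORTED = {
    "noun": (86.0, 59.72),
    "verb": (80.0, 65.57),
    "adjective": (82.0, 78.85),
    "adverb": (88.0, 43.56),
}

if __name__ == "__main__":
    for pos, (precision, recall) in REPORTED.items():
        print(f"{pos}: F = {f_measure(precision, recall):.2f}%")
```

Recomputing the F values this way agrees with the reported figures to within rounding. In a Hadoop setting, the four counts passed to `pmi` and `chi_square` would presumably come from separate counting jobs over the candidate set; that mapping is an assumption and is not described in the abstract.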