
Extraction Of Multiword Expressions Based On Web Text

Posted on: 2019-03-10 | Degree: Master | Type: Thesis
Country: China | Candidate: S S Gong | Full Text: PDF
GTID: 2348330542487568 | Subject: Computer Science and Technology
Abstract/Summary:
A multiword expression (MWE) is a fixed or semi-fixed collocation unit in natural language. MWEs appear especially often in network texts and usually lack annotation, which poses great challenges for word segmentation and for subsequent text comprehension. At the same time, extracting MWEs from network text plays a vital role in hot-topic tracking and in information retrieval on social networks. Research on MWE extraction has made some progress, but there is still considerable room for improvement. The main problems are: few MWE extraction methods have been applied to Web text; extraction using purely statistical or purely rule-based methods performs poorly; and the calculation of lexical relations within MWE structures often relies on hand-written rules and templates, which does not suit network text, whose content is all-encompassing. This thesis therefore focuses on MWE extraction from Web text. Its purpose is to explore how to combine rule-based and statistical methods to extract MWEs, and how to reduce the dependence on manual work and achieve automatic extraction.
The thesis first analyzes and summarizes the detailed structure and linguistic features of Chinese MWEs in network text. On this basis, it designs an extraction method that combines rules with statistics: regular-expression templates are set according to the combination rules for the parts of speech of MWE constituent words, and the NC-value statistical model is improved by combining it with mutual information (MI/NC). In experiments on a microblog corpus of 10 thousand posts, the rule-and-statistics method reached an F value of 85.85% for MWE extraction, a large improvement over the baseline system.
Further, to reduce the dependence on hand-written rules and improve extraction accuracy, the thesis proposes an MWE extraction method based on a double-deck strategy. At the first level, entropy is combined with an enhanced mutual information algorithm to extract candidate MWEs. At the second level, a support vector machine (SVM) classifies the candidate list produced by the first level, using context features and word-vector features of the MWEs to filter it further. Experiments show that the double-deck strategy reaches an F value of 89.58%, a further improvement over both the baseline system and the rule-and-statistics method.
In summary, the thesis makes contributions in three areas: the method combining rules with statistics, the improved calculation over MWE constituent words, and the use of word vectors to filter MWEs in candidate lists. The experimental results show that both the rule-and-statistics method and the double-deck strategy extract MWEs well. In addition, experiments that combine MWE extraction with word segmentation show that segmentation improves once the extracted MWEs are added.
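The abstract does not give the formulas for the "enhanced" mutual information or for the entropy measure. As a rough illustration only, the sketch below scores adjacent word pairs with plain pointwise mutual information and with left/right boundary entropy, two standard statistics of the kind the first level describes. The function name `score_bigrams`, the `<s>`/`</s>` boundary markers, and the toy sentences are assumptions made for this example; they are not taken from the thesis, and this is not the thesis's algorithm.

```python
import math
from collections import Counter, defaultdict


def score_bigrams(sentences):
    """Map each adjacent word pair to (pmi, left_entropy, right_entropy).

    `sentences` is a list of already-segmented token lists.
    Plain PMI stands in for the thesis's enhanced mutual information.
    """
    unigrams = Counter()
    bigrams = Counter()
    left_ctx = defaultdict(Counter)   # tokens seen just before each pair
    right_ctx = defaultdict(Counter)  # tokens seen just after each pair

    for tokens in sentences:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            bigrams[pair] += 1
            left = tokens[i - 1] if i > 0 else "<s>"
            right = tokens[i + 2] if i + 2 < len(tokens) else "</s>"
            left_ctx[pair][left] += 1
            right_ctx[pair][right] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def entropy(counter):
        # Shannon entropy (bits) of the neighbouring-token distribution.
        total = sum(counter.values())
        return sum(-(c / total) * math.log2(c / total) for c in counter.values())

    scores = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[pair] = (pmi, entropy(left_ctx[pair]), entropy(right_ctx[pair]))
    return scores


toy_sentences = [
    ["new", "york", "city"],
    ["new", "york", "times"],
    ["in", "new", "york"],
    ["the", "city"],
]
toy_scores = score_bigrams(toy_sentences)
```

A pair with high PMI (its words co-occur more than chance predicts) and high boundary entropy (it appears in varied surrounding contexts) is a plausible MWE candidate. In a two-level design such as the one described above, candidates passing thresholds on these scores would then go to a classifier for further filtering.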
Keywords/Search Tags: MWE, Network text, Rule and statistics fusion, Mutual information, Entropy combined with enhanced mutual information, Support vector machine, Word segmentation