
A Study On Chinese Terms Extraction And Their Application

Posted on: 2018-08-03
Degree: Master
Type: Thesis
Country: China
Candidate: M P Wang
GTID: 2335330512492743
Subject: Library and Information Science
Abstract/Summary:
A term is a basic unit that represents a theme in a particular domain. This dissertation focuses on automatic term extraction, that is, extracting semantic units of text from a set of professional documents. Term extraction has been studied in many areas, including text classification, sentence analysis, natural language generation, corpus linguistics, statistical machine translation, information retrieval, and question answering systems. For consistency, "term extraction" below refers to Chinese term extraction. To find a more efficient way of extracting Chinese terms, the author discusses methods for the case of little or even no training corpus. The approach relies on conditional random fields (CRFs); because their parameters and the prepared training documents strongly affect the results, both are analyzed in detail in Chapter 3. An application based on C# Windows Forms is then developed, consisting of data preprocessing, machine learning, result analysis, and open testing. To show that the extracted terms can be applied in practice, an example of patent semantic retrieval is presented at the end. The main ideas of the dissertation are as follows.

(1) A domain term extraction model is built. The steps of the procedure and the reasons behind them are explained, including the word segmentation algorithms and the feature templates.

(2) The results obtained with different CRF parameters are analyzed using evaluation metrics such as precision and recall. Take the observation sequence parameter as an example: at first it contained a single column holding only the word; after features from the Chinese metallurgy field were added, it grew to 7 columns. The evaluation scores, however, are not proportional to the number of columns, so the order of the columns in the sequence may also account for a better result.

(3) A term extraction application based on the model is implemented. It contains data preprocessing, machine learning, result analysis, testing of term combination rules, and open testing. The open testing part is a valuable new component that the model itself lacks; it assists testing and compensates for the model's gaps.

(4) The extracted terms are applied in a patent information semantic retrieval system. The structure of the retrieval system is introduced, with emphasis on the role that terms play in it.

The results show that the method proposed in the dissertation, which uses a core glossary in place of a manually annotated corpus, is effective to a certain degree. In brief, the process prepares CRF parameters according to domain features, which yields term extraction with higher precision. Admittedly, because the corpus that serves as the basic input to the CRFs is inaccurate, the extracted terms are less precise and it is hard to extract all of them. Although imperfect, the method and conclusions of the dissertation can serve as a reference for others.
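
The precision and recall mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function name `precision_recall` and the small metallurgy-style term sets are hypothetical, and exact string matching against a gold term list is an assumption about how correctness is judged.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted terms against a gold term set.

    Assumption: a term counts as correct only if it matches a gold term exactly.
    """
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    # Precision: share of extracted terms that are correct.
    precision = correct / len(extracted) if extracted else 0.0
    # Recall: share of gold terms that were found.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example data (not from the thesis).
gold_terms = {"高炉", "转炉炼钢", "连铸", "烧结矿"}
extracted_terms = {"高炉", "连铸", "炼钢"}

p, r = precision_recall(extracted_terms, gold_terms)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case two of the three extracted terms are in the gold set, so precision is 2/3 and recall is 2/4. Comparing these two numbers across runs with different observation-sequence columns is one way to see that adding columns does not automatically raise the scores.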
Keywords/Search Tags:Term Extraction, Patent Term, CRFs, Term Combination