Research On Chinese New Words And Expressions Identification

Posted on: 2005-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: G Zou
GTID: 2178360185495542
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the economy and society, more and more new words and expressions are entering everyday use. They enrich the language, but identifying them poses new challenges for dictionary compilation and natural language processing. There is currently no clear, commonly accepted definition of new words and expressions. Drawing on the definitions used in word segmentation and in linguistics, this thesis divides them into three categories: named entities, words or expressions with new morphology, and words or expressions with a new meaning or a new usage. This thesis focuses on identifying words and expressions with new morphology.

While named entity identification has attracted much attention, little research addresses the identification of words or expressions with new morphology, and almost none addresses those with a new meaning or a new usage. One deficiency of current research on words and expressions with new morphology is that the items found are restricted in length or in domain.

This thesis proposes a method for finding new words and expressions that appear after a given date in webpages crawled from the Internet, without restrictions on length or domain. The implementation has three parts: webpage crawling, webpage analysis, and new word and expression finding. In the webpage analysis part, the date and the content are extracted from each webpage. After word segmentation, an algorithm is run on the content to find repeated strings. Finally, the words and the repeated strings found are stored, together with their dates, in a database called the original information database. In the new word and expression finding part, the original information database is divided into a backup database and a filtering database based on the given date.
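The repeated-string step of the webpage analysis part could look roughly like the sketch below. The abstract does not specify the algorithm, so this is a naive n-gram count over segmented tokens, not the thesis's method; the function name `find_repeated_strings` and the `min_count` threshold are assumptions made for illustration. No maximum length is imposed: the loop stops only when no sequence of the current length repeats, since a longer repeated sequence would always contain a shorter repeated one.

```python
from collections import Counter


def find_repeated_strings(tokens, min_count=2):
    """Return {string: count} for runs of 2+ adjacent tokens seen min_count+ times.

    Hypothetical helper: a naive stand-in for the repeated-string
    algorithm mentioned in the abstract, which is not described there.
    """
    repeated = {}
    n = len(tokens)
    for length in range(2, n + 1):
        # Count every contiguous run of `length` tokens.
        counts = Counter(
            tuple(tokens[i:i + length]) for i in range(n - length + 1)
        )
        frequent = {gram: c for gram, c in counts.items() if c >= min_count}
        if not frequent:
            # No run of this length repeats, so no longer run can repeat.
            break
        for gram, c in frequent.items():
            text = "".join(gram)
            repeated[text] = repeated.get(text, 0) + c
    return repeated


# Example: an over-segmented word shows up as a repeated token run.
tokens = ["黑", "客", "攻击", "网站", "黑", "客", "入侵"]
print(find_repeated_strings(tokens))  # {'黑客': 2}
```

Each string returned here would then be stored with the date of the webpage it came from, alongside the ordinary segmented words.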
After every word and string in the filtering database has been evaluated, a candidate set is built. After automatic filtering and part-of-speech (POS) estimation on the candidate set, the final results are output.

In experiments on Jiangnan Times and East China News, the precision rate is between 30% and […], and the recall rate is about 90%. The system has already been applied to assist the compilation of the Modern Chinese New Words And Expressions Information (Electronic) Dictionary.
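As a rough illustration of the date-based split and the candidate set, the sketch below treats each database entry as a `(string, date)` observation. The evaluation criteria are not given in the abstract, so the rule used here (a string is a candidate if it never occurs before the given date and occurs at least `min_count` times on or after it) is an assumption, as are the names `build_candidates`, `cutoff`, and `min_count`. Automatic filtering and POS estimation are not modeled.

```python
from collections import Counter
from datetime import date


def build_candidates(records, cutoff, min_count=2):
    """Split (string, date) records at `cutoff` and return candidate strings.

    Assumed rule, not taken from the thesis: keep strings that are absent
    from the backup set and frequent enough in the filtering set.
    """
    records = list(records)
    # Backup database: everything observed before the given date.
    backup = {s for s, d in records if d < cutoff}
    # Filtering database: observations on or after the given date.
    filtering = Counter(s for s, d in records if d >= cutoff)
    return sorted(
        s for s, c in filtering.items() if s not in backup and c >= min_count
    )


records = [
    ("经济", date(2004, 1, 5)),
    ("博客", date(2004, 6, 1)),
    ("博客", date(2004, 6, 2)),
    ("经济", date(2004, 6, 3)),
    ("经济", date(2004, 6, 4)),
    ("网购", date(2004, 6, 5)),
]
print(build_candidates(records, cutoff=date(2004, 3, 1)))  # ['博客']
```

In this toy data, "经济" is rejected because it already appears in the backup set, and "网购" is rejected because it occurs only once after the cutoff.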
Keywords/Search Tags: Chinese New Words And Expressions, Automatic Identification, Statistics of String Frequency, Repeated String Finding