Research On Chinese New Words And Expressions Identification

Posted on:2005-03-26

Degree:Master

Type:Thesis

Country:China

Candidate:G Zou

Full Text:PDF

GTID:2178360185495542

Subject:Computer software and theory

Abstract/Summary:

With the rapid development of our economy and society, more and more new words and expressions are entering everyday life. They make our language more colorful, but identifying them poses new challenges for dictionary compilation and natural language processing. There is currently no clear, commonly accepted definition of new words and expressions. Drawing on the definitions used in word segmentation and in linguistics, this thesis divides them into three categories: named entities, words or expressions with new morphology, and words or expressions with a new meaning or new usage. This thesis focuses on identifying words and expressions with new morphology.

While named entity identification has attracted much attention, little research addresses the identification of words or expressions with new morphology, and almost none addresses those with a new meaning or new usage. One deficiency of current research on words and expressions with new morphology is that the new words and expressions found are restricted in length or in domain.

This thesis proposes a method for finding new words and expressions, without restrictions on length or domain, that appear after a given date in webpages crawled from the Internet. Our implementation consists of three parts: webpage crawling, webpage analysis, and new word and expression finding. In the webpage analysis part, the date and the content are extracted from each webpage. After segmentation, an algorithm is run on the content to find repeated strings. Finally, the repeated strings found and the words are stored, together with their dates, in a database called the original information database. In the new word and expression finding part, the original information database is divided into a backup database and a filtering database based on the given date. After every word and string in the filtering database has been evaluated, a candidate set is built. After automatic filtering and POS estimation on the candidate set, the final results are output.

In our experiments on Jiangnan Times and East China News, the precision rate is between 30% and [upper bound missing], and the recall rate is about 90%. The system has already been applied to assist the compilation of the Modern Chinese New Words And Expressions Information (Electronic) Dictionary.
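
The pipeline described above can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the abstract does not specify how repeated strings are found or how entries are evaluated, so the brute-force substring counting, the function names `repeated_strings` and `new_string_candidates`, the `min_count` threshold, and the reading of "backup" as entries dated before the given date and "filtering" as entries dated on or after it are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def repeated_strings(text, min_len=2, max_len=None, min_count=2):
    """Count substrings of `text` that occur at least `min_count` times.

    Brute-force illustration only (hypothetical helper, not the thesis's
    algorithm). `max_len=None` places no upper bound on string length.
    """
    upper = len(text) if max_len is None else max_len
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, upper + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


def new_string_candidates(records, cutoff):
    """Return strings seen on or after `cutoff` but never before it.

    `records` is an iterable of (date, string) pairs. Assumption: entries
    dated before `cutoff` play the role of the backup database, and the
    remaining entries play the role of the filtering database.
    """
    backup, filtering = set(), set()
    for seen_on, string in records:
        (backup if seen_on < cutoff else filtering).add(string)
    return filtering - backup


if __name__ == "__main__":
    print(repeated_strings("abcabc"))
    records = [
        (date(2004, 1, 1), "ab"),
        (date(2005, 1, 1), "ab"),
        (date(2005, 1, 2), "xy"),
    ]
    print(new_string_candidates(records, date(2004, 6, 1)))
```

In this sketch a string becomes a candidate purely because it is absent before the cutoff date. The automatic filtering and POS estimation steps mentioned in the abstract are not modeled here.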

Keywords/Search Tags:

Chinese New Words And Expressions, Automatic Identification, Statistics of String Frequency, Repeated String Finding

Related items

1	Approximate String Matching For Chinese Characters By Combining Filtering And Bit-parallelism
2	Bank Cheque In Handwritten Application Domain String Recognition
3	Research And Application Of String Approximate Matching Algorithm Based On Multivariate Information
4	Design And Development Of Video Reading System
5	Discriminatively Train Classifiers Embedding On Synthetic String Samples For Chinese Handwritten String Recognition
6	Researches Into New Chinese Words Identification Based On Large-Scale Corpus
7	Research And Application Of Statistical Language Model
8	Algorithms On Motif Finding And Closest String Problems
9	The Computer Automatic Recognition Character Of Numeral Instrument Dynamic Displayed
10	Approximate Chinese String Matching Techniques Based On Pinyin Input Method