
New Words Discovery Research For Specific Areas

Posted on: 2013-06-07
Degree: Master
Type: Thesis
Country: China
Candidate: M Li
Full Text: PDF
GTID: 2248330362971176
Subject: Management Science and Engineering
Abstract/Summary:
With the development of information technology, electronic documents in every field have become increasingly abundant, and the information they contain is difficult to process. Chinese has its own organizational structure, and Chinese information processing depends heavily on word segmentation technology, so processing Chinese text is more difficult than processing English. Finding new professional terms in light of these grammatical features is therefore very important. This thesis addresses a single specific area rather than multiple areas. The area can be changed as needed: it may be finance, IT, or another field.

Once the subject area has been chosen and documents from that area are available, the processing in this thesis is divided into two steps. The first step is word segmentation, which uses the N-Gram method to find new words in the documents.

The second step is extracting new professional terms, which are found with reference to the professional terms already in the dictionary. This step uses the Apriori method: it first finds the frequent itemsets and then generates association rules to extract professional terms. One problem in this step is the filtering of noise words. Because low-frequency noise words can be filtered out during Apriori processing, the main problem is high-frequency noise words. For these words the thesis uses classification: a large area is cut into several subdivision areas, and words that belong to several of these areas are treated as high-frequency noise words.

Based on the study of these algorithms, a prototype system was designed to test the effectiveness of the algorithm. The system includes preprocessing, word segmentation, document cutting, high-frequency noise word filtering, frequent itemset finding, new professional term extraction, and so on.
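The segmentation and noise-filtering steps described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the thresholds `min_count` and `min_areas`, and the choice to split text on punctuation before counting are all assumptions made for illustration. It counts character n-grams, keeps frequent ones that are missing from the dictionary as candidate new words, and flags candidates that occur in several subdivision areas as high-frequency noise.

```python
import re
from collections import Counter


def ngram_counts(text, n_min=2, n_max=4):
    """Count character n-grams of length n_min..n_max inside each run of word characters."""
    counts = Counter()
    # Split on punctuation and whitespace so no n-gram spans a clause boundary
    # (an assumption of this sketch, not something the abstract specifies).
    for run in re.findall(r"\w+", text):
        for n in range(n_min, n_max + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1
    return counts


def candidate_new_words(docs, dictionary, min_count=2, n_max=4):
    """Return {ngram: count} for frequent n-grams that the dictionary does not contain."""
    total = Counter()
    for doc in docs:
        total.update(ngram_counts(doc, n_max=n_max))
    return {g: c for g, c in total.items()
            if c >= min_count and g not in dictionary}


def high_frequency_noise(candidates_by_subarea, min_areas=2):
    """Return candidates that occur in at least min_areas subdivision areas.

    candidates_by_subarea maps a (hypothetical) subarea name to the collection
    of candidate words found in that subarea's documents.
    """
    area_hits = Counter()
    for candidates in candidates_by_subarea.values():
        area_hits.update(set(candidates))  # count each word once per subarea
    return {word for word, hits in area_hits.items() if hits >= min_areas}
```

Note that raw n-gram counting also keeps fragments of longer words (for example a two-character prefix of a three-character term), so a real system would need an additional step to choose among overlapping candidates; the abstract does not say how the thesis handles this.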
The test results show that the algorithm can find new professional terms effectively.

The creative works of this thesis are:

(1) Combining the N-Gram and Apriori methods and improving the algorithm. The N-Gram method and a dictionary are used to segment Chinese documents, and the Apriori method is used to extract the professional terms of the area. Together the two algorithms form a complete process for handling document information, which is helpful for practical application.

(2) Designing an intelligent professional term discovery system that uses the algorithm of the thesis. The system not only adapts to the sample data but can also find new words according to the user's needs. This function could be used in a search engine so that the engine captures new words from search results. If the default professional terms dictionary is replaced, the system adapts to another area, so it has good scalability.
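The Apriori step named in the abstract can be sketched as follows. This is a textbook version of frequent itemset mining and rule generation, not the thesis's improved algorithm. It assumes that each transaction is the set of words in one sentence or document, and the final `extract_terms` selection rule (keep non-dictionary words implied by known dictionary terms) is a hypothetical reading of "find the new professional terms according to the professional terms in the dictionary".

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset occurring in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, confidence) for rules lhs -> rhs meeting min_confidence."""
    rules = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                confidence = count / frequent[lhs]  # support(itemset) / support(lhs)
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, confidence))
    return rules


def extract_terms(rules, dictionary):
    """Hypothetical selection: non-dictionary words implied by known dictionary terms."""
    found = set()
    for lhs, rhs, _confidence in rules:
        if lhs <= dictionary:
            found |= rhs - dictionary
    return found
```

In this sketch, low-frequency words drop out at the `min_support` filter, which matches the abstract's remark that low-frequency noise is removed during Apriori processing. High-frequency noise would have to be removed from the transactions beforehand.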
Keywords/Search Tags: Chinese word segmentation, N-Gram, Apriori, New words discovery technology