
Technology Of Network Information Collection And Study Of Chinese Unknown Word Algorithm

Posted on: 2013-10-17
Degree: Master
Type: Thesis
Country: China
Candidate: H Chen
Full Text: PDF
GTID: 2248330371966305
Subject: Signal and Information Processing
Abstract/Summary:
With the rapid growth of the Internet, the volume of network information has grown explosively. As the network has become a huge information database, how to process Internet information has become a hot issue. The first step in network information processing is selecting information, that is, collecting and filtering data from various network information resources. The network is now a primary medium for creating and disseminating information, and every day a large number of new words emerge on the Internet. These new network words are called unknown words. How to collect them promptly and precisely is a key research issue in network information processing; recognizing unknown words makes text information processing more precise.

This paper focuses on two issues. The first is collecting information from various network media and extracting it precisely; the media covered include online forums, news, blogs, micro-blogs, LANs and so on. The second is an efficient recognition method for new network words.

The main work of the paper is as follows:

1. Designed and implemented data collection and structuring methods for online forums, news portals and blogs. These methods automatically collect forum posts, replies, network news and blog information, and store them in a fixed format that follows Chinese conventions.
2. Designed and implemented a data collection method specifically for micro-blogs, a new medium. This method automatically collects information on specific topics from various blog networks.
3. Designed and implemented a spider for the whole network. It automatically discovers and collects all networks within the LANs and, to some extent, achieves structured data collection.
4. Studied unknown word recognition based on the concept of max cliques and proposed a method specifically for new network words, which can efficiently collect newly emerged network words.

The methods described in this paper have been applied to the spider systems of various projects and have achieved encouraging results. However, some technical details remain relatively simple. For example, the accurate information extraction method is mainly based on fixed rules. Although this technique is effective, it consumes a large amount of manpower and breaks easily when a forum's structure changes even slightly. It is hoped that more research will be devoted to this field so as to realize more efficient and user-friendly collection methods.
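The abstract does not say how the max-clique concept is applied, so the following is only a minimal sketch of one possible reading, not the thesis's method. It assumes that characters are graph nodes, that two characters are linked when they appear within a small window of each other often enough, and that maximal cliques in this graph are treated as candidate new words. The function names `build_graph` and `maximal_cliques`, the `window` and `min_count` parameters, and the sample fragments are all hypothetical. The standard Bron-Kerbosch algorithm with pivoting is used to enumerate cliques.

```python
from collections import Counter


def build_graph(fragments, window=2, min_count=2):
    """Link two distinct characters that occur within `window` positions
    of each other at least `min_count` times across all fragments.
    (Assumed construction; the abstract does not specify one.)"""
    pair_counts = Counter()
    for text in fragments:
        for i, a in enumerate(text):
            for b in text[i + 1:i + window]:
                if a != b:
                    pair_counts[tuple(sorted((a, b)))] += 1
    graph = {}
    for (a, b), n in pair_counts.items():
        if n >= min_count:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate all maximal cliques with Bron-Kerbosch (pivoting variant)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(r)  # r cannot be extended: it is maximal
            return
        # Pick the pivot with the most neighbours in p to prune branches.
        pivot = max(p | x, key=lambda v: len(graph[v] & p))
        for v in list(p - graph[pivot]):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(graph), set())
    return cliques


# Hypothetical sample fragments; only one character pair recurs often enough.
fragments = ["给力啊", "真给力", "不给力", "好的"]
candidates = maximal_cliques(build_graph(fragments))
print(candidates)
```

A clique is an unordered set, so a real system would still have to recover the character order from the source text and filter candidates against an existing dictionary; both steps are left out here.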
Keywords/Search Tags:data collection, information extraction, unknown word recognition, max cliques