Technology Of Network Information Collection And Study Of Chinese Unknown Word Algorithm

Posted on:2013-10-17

Degree:Master

Type:Thesis

Country:China

Candidate:H Chen

Full Text:PDF

GTID:2248330371966305

Subject:Signal and Information Processing

Abstract/Summary:

With the rapid growth of the Internet, the volume of network information has grown explosively. As the network has become a huge information database, how to process Internet information has become a hot research issue. The first step in network information processing is selecting information, that is, collecting and filtering data from various network information resources. The network is now a primary medium for creating and disseminating information, and a large number of new words emerge on the Internet every day. These new network words are called unknown words. How to collect them promptly and precisely is a key research issue in network information processing, because recognizing unknown words makes text information processing more precise.

This paper focuses on two issues. The first is the collection of information from various network media and its precise extraction; the media covered include network forums, news, blogs, micro-blogs, LANs and so on. The second is an efficient method for recognizing new network words.

The main work of the paper is as follows.

First, data collection and structuring methods were designed and implemented for online forums, news portals and blogs. These methods automatically collect forum posts, replies, network news and blog information, and store it in a fixed format according to Chinese conventions.

Second, a data collection method was designed and implemented specifically for micro-blogs, a new medium. It automatically collects information about specific topics from various blog networks.

Third, a spider procedure for the whole network was designed and implemented. It automatically discovers and collects all networks of the LANs and achieves structured data collection to some extent.

Fourth, based on the concept of max cliques, unknown word recognition technology was studied and a method aimed specifically at new network words was proposed, which can efficiently collect newly emerged network words.

The methods described in this paper have been applied to the spider systems of various projects and have achieved encouraging results. However, some technical details are relatively simple. For example, the precise information extraction method is mainly based on fixed rules. Although this technique is effective, it consumes a large amount of manpower and easily becomes invalid when the structure of a forum changes slightly. It is hoped that more research will be devoted to this field so as to realise more efficient and user-friendly collection methods.
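
The abstract does not say how max cliques are applied to unknown word recognition, so the following Python sketch is only one possible reading and not the author's method. It assumes that a segmenter has split an unknown word into single characters, links tokens that co-occur within a small window at least a minimum number of times, and enumerates the maximal cliques of that graph with the standard Bron-Kerbosch algorithm (with pivoting). The function names, the `window` and `min_count` parameters, and the sample sentences are all hypothetical.

```python
from collections import Counter


def cooccurrence_graph(sentences, window=3, min_count=2):
    """Link tokens that appear within `window` consecutive positions
    of each other at least `min_count` times (assumed heuristic)."""
    counts = Counter()
    for tokens in sentences:
        for i, a in enumerate(tokens):
            for b in tokens[i + 1:i + window]:
                if a != b:
                    counts[frozenset((a, b))] += 1
    graph = {}
    for pair, n in counts.items():
        if n >= min_count:
            a, b = pair
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)
    return graph


def maximal_cliques(graph):
    """Enumerate all maximal cliques with Bron-Kerbosch and pivoting."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            if r:  # skip the empty clique of an empty graph
                cliques.append(r)
            return
        # Pick the pivot with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(graph[v] & p))
        for v in list(p - graph[pivot]):
            expand(r | {v}, p & graph[v], x & graph[v])
            p = p - {v}
            x = x | {v}

    expand(frozenset(), set(graph), set())
    return cliques


# Hypothetical over-segmented sentences containing one new word.
sentences = [
    ["他", "是", "高", "富", "帅"],
    ["高", "富", "帅", "来", "了"],
]
graph = cooccurrence_graph(sentences, window=3, min_count=2)
candidates = maximal_cliques(graph)
print(candidates)
```

A clique is an unordered set of tokens, so a real system would still have to restore the character order from the corpus and filter the candidates, for example against a dictionary of known words.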

Keywords/Search Tags:

data collection, information extraction, unknown word recognition, max cliques

Related items

1	The Research Of Unknown Chinese Word Recognition And Its Application To Chinese Input Method
2	Hybrid models for Chinese unknown word resolution
3	Extended Information Retrieval Model Based On Markov Cliques
4	Research On Unknown Words Recognition And Word Meaning Discovery Based On Short Text Of Micro-blog
5	TF-IDF And Rules Based Automatic Extraction Of Chinese Keywords
6	Based On Dictionary And Word Frequency Analysis Of The Unknown Words From The Bbs Of Corpus Recognition Research
7	Study On Extension Of Unknown Words Based On Cyber Source
8	Research And Application On Chinese Automatic Word Segmentation In Full Text Retrieval
9	The Research Of Open Information Extraction System
10	Research And Implementation Of Chinese Word Segmentation Algorithm