On Extraction Domain-based Topic And Feature Relations From Textual UGC

Posted on: 2017-01-07
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H L Xu
GTID: 1108330485488437
Subject: Management Science and Engineering
Abstract/Summary:
In the Web 2.0 era, social media makes users both consumers and publishers of information. New data is generated on the network every minute, and as these resources have accumulated, society has entered the era of big data. This has two sides: on the one hand, the data holds enormous value; on the other hand, its sheer volume and complex structure pose a great challenge to information processing.

Text is one of the oldest ways of storing information. Among network data resources, UGC text accounts for a large proportion and contains a wealth of information, especially domain information. In recent years, text mining technology (TMT) has been introduced into natural language processing research as an effective tool for extracting useful information from text. However, UGC text is free in content and its writing conventions vary from writer to writer, which makes finding information in massive UGC text very difficult. Moreover, in an age of information explosion, the information mined by TMT has to meet users' needs and be easy to understand and remember. It is therefore crucial to establish an information management and discovery system based on the needs of users.

In view of the above, this paper presents an in-depth study of information extraction from massive UGC text and related problems. The specific research and conclusions are as follows:

(1) Discovery of new compound words based on the dependency relations between words

The quality of word segmentation determines the quality of the final text mining results. Because traditional word segmentation software cannot handle new compound words in UGC text well, this paper proposes a new statistics-based discovery method, FPS&MC, which requires neither a dictionary nor prior corpus training.
This method first uses sequential frequent pattern mining to find candidate new compound words, then screens the candidates by repeatedly calculating their sequential max confidence, and finally obtains the compound words present in the text.

The experimental results indicate that FPS&MC is effective at extracting new compound words from UGC text and, compared with other algorithms, is much better at extracting named entities such as person names, place names, organization names, proper nouns, and time expressions. In general, most named entities are topic words in UGC text, so FPS&MC can reveal the behavioral preferences that users express in UGC text and lay a good foundation for subsequent topic identification, feature extraction, and business application analysis.

(2) Division of domain text topics and extraction of their features

Topics are an important information element hidden in UGC text, so reorganizing UGC text by topic makes it more convenient for users to obtain thorough information. However, the topics extracted by traditional text mining technology suffer interference from popular words in general use. Furthermore, the extracted information related to a given topic is coarse, and the features are too general. To address these problems, this paper proposes a new analysis method based on document data association, which identifies hot topic words and their related local feature words in massive UGC text.

The experiments show that TVS can eliminate the interference of high-frequency words and extract hot topic words and related local feature words from massive network data.
Meanwhile, adaptability and expansion experiments show that this calculation method is applicable to different types of text data, can be implemented with parallel computing, and also performs well on a single computer.

(3) Application study of the relations among multiple topics of UGC text and the extraction of their features

Traditional topic discovery and extraction can hardly distinguish and sort out the relations among topics in UGC text, yet these relations carry information and can effectively help users obtain and understand it. Based on tourism blog text data and using text mining technology, this paper extracts popular tourist attractions, the local features of scenic spots, and the relations among scenic spots, and builds a tourism information extraction and management system around the needs of tourists. Starting from three basic user demands (where to go, what to do, and how to do it), the system comprises four modules: preprocessing of tourism blogs, TOI extraction of popular scenic spots, regionalization of popular scenic spots, and discovery and recommendation of tourism routes. Using a Beijing tourism blog data set, this paper carries out a sample experiment based on the four modules and presents the results with visualization techniques.

The experimental results show that the system can effectively extract the tourism information tourists need from massive tourism blog text data and can better help tourists plan their routes.
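
As a rough illustration of the two-step idea described under contribution (1), mining frequent contiguous word sequences and then screening them by confidence, the following minimal Python sketch may help. It is not the FPS&MC algorithm itself: the abstract does not define "sequential max confidence", so the formula used here (sequence count divided by the smaller of the counts of its prefix and suffix) is an assumption, as are the function names, thresholds, and sample data. The sketch also screens in a single pass, whereas the abstract describes repeated screening.

```python
from collections import Counter


def ngram_counts(sentences, max_len):
    """Count every contiguous token sequence of length 1..max_len."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def find_compounds(sentences, min_support=2, min_conf=0.8, max_len=3):
    """Return {sequence: confidence} for candidate compound words.

    Step 1 keeps sequences that occur at least min_support times.
    Step 2 keeps those whose assumed "max confidence" reaches min_conf:
        count(seq) / min(count(seq without last token),
                         count(seq without first token))
    This formula is an assumption made for illustration only.
    """
    counts = ngram_counts(sentences, max_len)
    found = {}
    for seq, c in counts.items():
        if len(seq) < 2 or c < min_support:
            continue
        conf = c / min(counts[seq[:-1]], counts[seq[1:]])
        if conf >= min_conf:
            found[seq] = conf
    return found


if __name__ == "__main__":
    # Hypothetical, already tokenized sample sentences.
    sample = [
        ["the", "summer", "palace", "is", "big"],
        ["we", "saw", "summer", "palace", "today"],
        ["summer", "palace", "lake"],
        ["the", "lake", "is", "big"],
    ]
    for seq, conf in find_compounds(sample).items():
        print(" ".join(seq), round(conf, 2))
```

Because the sketch relies only on counts within the input text, it needs no dictionary or training corpus, which matches the property the abstract claims for FPS&MC. For unsegmented text the tokens could be single characters instead of words.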
Keywords/Search Tags:domain text, UGC, information extraction, user decision, Intelligent Tourism