
Research And Implementation On Public Opinion Analysis And Attribute Discovery Oriented Internet Text Mining

Posted on: 2012-05-10
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J M Huang
GTID: 1118330362460198
Subject: Computer Science and Technology
Abstract/Summary:
Social media such as micro-blogs, instant messaging tools, BBS and blogs, together with entity-database-based Internet applications such as collaborative tagging, online stores and digital libraries, have become deeply woven into daily life and serve as an important platform for people to publish and share information and to acquire knowledge. Network text is the main carrier of Internet information. Text mining is therefore of great value to industry and academia in public opinion analysis, attribute discovery for object tracking, and other network security applications.

Text messages and entity information are the two main types of Internet text data. Text messages on social media are usually quite short and are organized as text streams according to their time stamps. These text streams contain plenty of opinions and sentiments of netizens. However, the incompleteness, singularity, massiveness and dynamic nature of text messages make it difficult to discover topics, analyze sentiments and mine hotspots in text streams.

Entity-database-based websites, on the other hand, contain a great number of social tags, digital books, and attribute information about entities such as houses, cars, goods, entertainment and people. The attribute information of an entity is distributed among pages and is often buried in massive page content. In particular, on websites that support exploratory search, much of the attribute information is used for interactive search. Such attributes are hidden in the interaction process by which users search for entities and do not appear in the final pages that describe the entities in detail. We call this latent attribute information in this dissertation. Currently, there is no research on mining latent attribute information.

Our research is based on the features of text streams and entity information, and is applied to public opinion analysis and entity attribute discovery.
We focus on four techniques of network text data mining: short text conversation discovery, hot phrase mining, latent attribute information discovery, and focused crawling of massive attributes. The main contributions of this dissertation are as follows:

1. Proposed an algorithm for discovering text conversations based on the generation rate of messages and their context correlation. The goal of text conversation discovery is to divide the messages into several different conversations; it is the foundation of topic discovery, sentiment analysis and social network analysis. Because the message generation rate can potentially reflect the boundaries of conversations, we smooth the generation rate using an n-order moving average and then detect the troughs of the resulting curve. The moments at which troughs occur are taken as conversation boundaries. Next, we cluster the fine-grained conversation segments found in this way by their content correlation to obtain more complete conversations. To do this, we introduce the concept of a conversational context correlation degree among messages based on adjacency, since adjacent messages in the message stream are more likely to form a conversational context. We compute the correlation degree among similar messages across the massive historical message stream so as to measure the context correlation of any two messages objectively. Finally, we obtain the correlation degree between fine-grained segments by integrating the degrees among the messages in those segments. Our experiments show that performance improves by 30% compared with clustering algorithms based purely on the content similarity of text.

2. Proposed a hot phrase mining technique based on an AC-Trie tree to handle the Chinese text message stream in micro-blogs. Hot phrases are defined as substrings that suddenly appear frequently and persist for some time in the message stream.
They can reflect the hot topics and sudden events hidden in the message stream. Based on sample data gathered from the text message stream during some typical period, we first construct an AC-Trie prefix tree with a finite automaton structure. Then, based on this sample tree, we record the occurrence frequency of phrases in the corresponding nodes by scanning the subsequent stream in a single pass. Three classic measures (frequency, amplification ratio and acceleration) are used to quantify how hot each phrase is, so as to mine the hot phrases. Note that a shift in hot topics changes the hot phrases, so the AC-Trie needs to be reconstructed from new samples in the text stream in order to discover new hot phrases. We trigger the reconstruction dynamically according to the occurrence frequency of missed phrases, which is recorded on the nodes of the trie. Experiments on the text message stream from the Sina micro-blog show that our mining technique is fast and mines hot phrases efficiently, at the cost of higher space consumption.

3. Proposed a method to discover latent attribute information based on the semantics of hyperlinks in exploratory sites. Exploratory sites contain not only entity pages that describe an entity in detail, but also a great number of list pages that appear during the exploratory search process. The list pages contain hyperlinks used for exploratory search. We first find all list pages based on some significant features of the websites. Then, according to the semantics of "roll-up" and "drill-down" hyperlinks in the list pages, we find all "drill-down" links by comparing the relations between the entity sets hidden in the list pages. Finally, the anchor text of each "drill-down" link is mapped to the entities hidden in the list page that the link points to, and thus becomes part of the attribute set of those entities. Latent attribute discovery is very important for deeply mining the features of public opinion and hot topics.
Although dynamic updates of a website can potentially introduce errors into latent attribute discovery, the experiments show that our method tolerates the effect of such updates, is therefore practical, and achieves 98% precision and 97% recall on average.

4. Proposed an optimization method for latent attribute discovery based on dynamic pruning of a query tree. Since different list pages in an exploratory website may contain the same entity set, we design a pruning mechanism for the query tree to avoid unnecessary duplication in attribute discovery. Each node in the query tree represents a list page, and the edge from a parent node to a child node represents the "drill-down" relation between the corresponding list pages, while the value of an edge is the corresponding latent attribute. All latent attributes on the path from the root to a node compose the attribute set of that node. The query tree is constructed dynamically. Construction begins from the root list page of the website and proceeds depth-first, creating child nodes according to the semantics of "drill-down" hyperlinks. Each newly generated child node is compared with all existing nodes and is pruned if it is the same as an existing node. We call this dynamic construction of the query tree, when it includes the pruning mechanism, attribute focused crawling. When the focused crawling is finished, we obtain all non-duplicated entity pages (child nodes). Finally, all explicit attributes can be gathered using traditional crawling and extraction techniques, and the explicit and latent attributes together compose the complete attribute information of an entity. The experimental results show that our refined method for discovering latent attributes adapts better to dynamic changes of websites and achieves both precision and recall of 99%, because it significantly speeds up latent attribute discovery.

5. Implemented a mining system for text message streams and entity information based on UIMA.
UIMA is a distributed open-source platform for mining massive unstructured data based on middleware. Built on UIMA, our system implements an Internet text information mining system that covers the four research aspects of this dissertation, using the chain-of-responsibility design pattern. The system contains four parts: a web crawler component, a pre-processing subsystem, a natural language processing subsystem, and a mining subsystem that includes all key points of the research in this dissertation. The web crawler component crawls the pages of specified websites and stores them in the Hadoop file system. The pre-processing subsystem filters out useless information, performs a simple segmentation of the page content according to configured rules, and extracts metadata such as author, time, title and hyperlinks, wrapping them as UIMA CAS data packages. The natural language processing subsystem performs word segmentation, NER and POS tagging on the text content in the data package and puts the results into the CAS data package. The mining subsystem first gets text messages and hyperlinks from the CAS package. Then, following the techniques proposed in this dissertation, it sorts the messages in the stream into a list of conversations, mines hot phrases, and discovers entity attribute information in exploratory websites. Finally, it adds the results to the database. Meanwhile, the entity attribute information stored in the database is fed back to the pre-processing subsystem and the natural language processing subsystem to assist in identifying entities and attributes. In addition, we implement a concise visualization interface to show the mining results. The system has been successfully applied in the Yin He Bo Shi public opinion system developed by NUDT.
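The boundary-detection step of contribution 1 (smooth the message generation rate with an n-order moving average, then treat troughs as conversation boundaries) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names, the trailing-window form of the moving average, the trough rule (strictly below the left neighbour, not above the right one), and the use of per-bucket message counts as the "generation rate" are all assumptions made for illustration.

```python
from typing import List, Tuple


def moving_average(counts: List[float], n: int) -> List[float]:
    """Trailing n-point moving average (assumed form of the n-order smoothing).

    Early positions average over however many points are available.
    """
    smoothed = []
    for i in range(len(counts)):
        window = counts[max(0, i - n + 1): i + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def find_troughs(curve: List[float]) -> List[int]:
    """Indices of local minima: strictly below the left neighbour and
    not above the right neighbour (assumed trough rule)."""
    return [
        i
        for i in range(1, len(curve) - 1)
        if curve[i] < curve[i - 1] and curve[i] <= curve[i + 1]
    ]


def conversation_segments(counts: List[float], n: int) -> List[Tuple[int, int]]:
    """Split time buckets into half-open segments [start, end) at each trough."""
    boundaries = find_troughs(moving_average(counts, n))
    segments = []
    start = 0
    for b in boundaries:
        segments.append((start, b))
        start = b
    segments.append((start, len(counts)))
    return segments


# Hypothetical data: number of messages observed in consecutive time buckets.
counts = [5, 6, 1, 7, 8, 0, 9]
print(conversation_segments(counts, n=2))
```

The segments produced here are only the fine-grained pieces; the abstract's second step, merging them by conversational context correlation, is not covered by this sketch.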
Keywords/Search Tags: text mining, text message stream, conversation detection, hot short phrase, entity attribute, hidden attribute, exploratory search