
Blog Information Extraction Technology Based on Knowledge Discovery

Posted on: 2008-09-24
Degree: Master
Type: Thesis
Country: China
Candidate: S Zhang
Full Text: PDF
GTID: 2208360215974896
Subject: Computer applications and technology
Abstract/Summary:
As a representative Web 2.0 application, the blog has changed how mass media circulates since it first appeared, and it continues to influence and reshape the Internet. A growing number of researchers have turned their attention to discovering hot topics on blogs and identifying the people associated with them. Hot topic discovery and relevant-person identification can serve as the foundation for further applications. A search engine can run topic-based searches and present users with information related to hot topics. A data mining system can mine more deeply on the basis of hot topics. By analysing hot topics over time, we can detect or predict trends on the network. Identifying relevant persons is even more significant and has broader applications. With person name recognition, a search engine can avoid name ambiguity, raise recall and precision for name searches, and increase user satisfaction. Once person names are recognized, a social network can be constructed from the relations between topics and persons. Person names can also serve as a clustering criterion, so name recognition can further improve clustering accuracy. Furthermore, both hot topic discovery and relevant-person name identification can serve as the foundation of a question-answering system. This work therefore has important academic value. In addition, as the blogosphere grows, hot topic and relevant-person discovery can stand alone as a knowledge system or act as a subsystem of other recommendation systems.
The study of these problems therefore also has important commercial value. Accordingly, this thesis investigates hot topic discovery and relevant-person name identification. The main work of the thesis covers the following aspects:

(1) Topic identification. We study the characteristics and shortcomings of existing topic identification algorithms and propose a new topic identification algorithm based on document semantics. The new algorithm applies the principle of word co-occurrence: it constructs a co-occurrence matrix and extracts keywords by computing the chi-square (χ²) value. Experimental results show that the new algorithm achieves better recall and precision than term frequency (TF) and can therefore judge the topic of an article effectively.

(2) Hot topic finding. By analysing blog topics, we construct a topic distribution model and define an overall topic change factor and an individual topic change factor between two time ranges. These allow us to determine the appropriate way of judging hot topics in different situations. Experiments show that the new algorithm finds hot topics in blogs effectively.

(3) Person name extraction. We combine rule-based and statistics-based methods to extract person names from blogs. The new algorithm improves the recall and precision of name extraction, at the cost of increased time complexity. We also use a supervised method to generate the extraction rules semi-automatically.

(4) Person relation network construction. We propose an algorithm for building person relations from the person names extracted from blogs. We also propose an algorithm for judging the credibility of person relations, and finally construct an authoritative person relation network.
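The co-occurrence and chi-square step described in item (1) can be sketched roughly as follows. This is only an illustration, not the thesis's implementation: the abstract does not give the exact formulation, so the sketch assumes that each sentence is one co-occurrence window, that the most frequent terms form the reference set, and that each term is scored with a goodness-of-fit chi-square against its expected co-occurrence counts. The function name chi_square_keywords and the parameters num_frequent and top_k are hypothetical.

```python
from collections import Counter
from itertools import combinations


def chi_square_keywords(sentences, num_frequent=10, top_k=5):
    """Rank terms by how far their co-occurrence with frequent terms
    deviates from what overall co-occurrence rates would predict.

    sentences: list of token lists (assumption: one sentence = one window).
    Returns up to top_k (term, chi_square_score) pairs, highest score first.
    """
    # Reference set: the most frequent terms in the document (assumption).
    term_freq = Counter(tok for sent in sentences for tok in sent)
    frequent = [term for term, _ in term_freq.most_common(num_frequent)]

    # Sparse co-occurrence matrix: cooc[(a, b)] is the number of
    # sentences in which the distinct terms a and b both appear.
    cooc = Counter()
    for sent in sentences:
        for a, b in combinations(sorted(set(sent)), 2):
            cooc[(a, b)] += 1
            cooc[(b, a)] += 1

    # Row sums of the matrix and the grand total.
    row_total = Counter()
    for (a, _), count in cooc.items():
        row_total[a] += count
    grand_total = sum(row_total.values())

    scores = {}
    for w in term_freq:
        n_w = row_total[w]
        if n_w == 0:
            continue  # term never co-occurs with anything
        chi2 = 0.0
        for g in frequent:
            if g == w:
                continue
            # Expected count if w co-occurred with g at g's overall rate.
            expected = n_w * row_total[g] / grand_total
            if expected > 0:
                observed = cooc[(w, g)]
                chi2 += (observed - expected) ** 2 / expected
        scores[w] = chi2

    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


if __name__ == "__main__":
    toy = [
        ["blog", "topic", "hot"],
        ["blog", "topic", "person"],
        ["blog", "name", "person"],
        ["blog", "topic", "trend"],
    ]
    for term, score in chi_square_keywords(toy, num_frequent=2, top_k=3):
        print(term, round(score, 3))
```

A term whose co-occurrence pattern with the frequent terms differs strongly from the expected pattern receives a high score, which is the intuition behind preferring this measure over raw term frequency. Tokenization, stop-word removal, and Chinese word segmentation are left out of the sketch.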
Keywords/Search Tags: information extraction, keyword extraction, hot topic identification, person name extraction, social network, blog