Research On The Fingerprint Of Short Texts

Posted on: 2013-01-08; Degree: Master; Type: Thesis
Country: China; Candidate: X Q Zhao; Full Text: PDF
GTID: 2218330371957341; Subject: Computer software and theory
Abstract/Summary:
The rapid development of information technology in the 21st century has led to a global information revolution. The popularization and sharing of the global information network bring a great deal of convenience to people's daily lives. Beyond this, the Internet, as the main body of the information highway, has penetrated every field of society and provides a modern channel for information sharing and communication. Mobile phone short messages, instant messages, Internet relay chat logs, blog comments, news comments, BBS titles and so on all produce tens of thousands of texts that are short in length and broad in scope. These texts are called short texts.

The fingerprint of a short text, as a unique identifier used to verify that text, plays a decisive role in the short text field. Identical short texts have identical fingerprints, and vice versa. Once this one-to-one relation between fingerprint and short text is established, the fingerprint can be used in various data mining tasks, such as clustering, de-duplication, and redundancy removal.

Inspired by HowNet and WordNet, the paper analyzes the relations between concepts. After comparing JSON and XML, JSON is chosen to build the concept dictionary, and an algorithm for building it is given. To improve retrieval efficiency and allow efficient reduction of concepts, an index table is designed that keeps the concept coding consistent with the index structure.

Next, ICTCLAS word segmentation is applied, so that each short text is gradually reduced to a sequence of usable segments. More importantly, we add processing for some special cases, such as splitting and merging numbers, times, dates, and reduplicated words. After this processing, the segmentation results have high precision. Because feature words capture the core and nature of a field, the paper gives a definition of feature words and a specific way to extract them. On the basis of this feature extraction, the paper introduces the concept of the fingerprint of a short text and the STF method.

Finally, the paper describes the runtime environment of the experiment and the functional modules, and analyzes the feasibility of computing semantic similarity between short texts. In addition, it compares the STF method with four existing short text similarity methods to show that the STF method can effectively analyze the uniqueness of a short text. Together, these results can improve the accuracy and effectiveness of subsequent short text mining.
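The idea of mapping a short text to a fingerprint through its feature words, and then using the fingerprint for de-duplication, can be illustrated with a minimal Python sketch. It is not the thesis's STF method, which the abstract does not specify. It assumes whitespace tokenization in place of ICTCLAS segmentation, a hypothetical stop-word list in place of the thesis's feature-word definition, and a SHA-1 hash over the sorted feature words; the names `STOP_WORDS`, `feature_words`, `fingerprint`, and `deduplicate` are all made up for illustration.

```python
import hashlib

# Hypothetical stop-word list; the thesis defines feature words differently.
STOP_WORDS = {"the", "a", "an", "is", "at", "to", "of", "on"}


def feature_words(text):
    """Stand-in for segmentation plus feature extraction.

    Lowercases the text, splits on whitespace, drops stop words,
    and returns the remaining unique words in sorted order.
    """
    tokens = text.lower().split()
    return sorted({t for t in tokens if t not in STOP_WORDS})


def fingerprint(text):
    """Hash the sorted feature words into a fixed-length hex string."""
    joined = "\x1f".join(feature_words(text))  # unit separator avoids word collisions
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


def deduplicate(texts):
    """Keep the first text seen for each distinct fingerprint."""
    seen = {}
    for t in texts:
        seen.setdefault(fingerprint(t), t)
    return list(seen.values())


messages = ["Meet at the station", "meet at station", "Call me later"]
unique = deduplicate(messages)
```

In this sketch the first two messages reduce to the same feature words and therefore share a fingerprint, so `deduplicate` keeps only one of them. An exact hash like this only groups texts whose feature words match exactly; measuring semantic similarity between different short texts, as the thesis aims to do, needs more than this.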
Keywords/Search Tags: fingerprint of short text, domain dictionary, ICTCLAS word segmentation, feature extraction, semantic similarity