
Research On Short Text Language Computing

Posted on: 2009-05-15
Degree: Doctor
Type: Dissertation
Country: China
Candidate: C C Gong
Full Text: PDF
GTID: 1118360242997500
Subject: Computer software and theory
Abstract/Summary:
Rapid advances in Internet and telecommunication technology have led to an explosion of digital data. A large proportion of this data consists of short texts, such as mobile phone short messages, instant messages, Internet relay chat logs, BBS titles, news comments, and blog comments. Short texts have become an important communication channel used by billions of people, and short text databases are usually extremely large. They cover all kinds of topics and opinions, including political, economic, military, and entertainment affairs as well as private life. Short text language computing technologies may be widely used in topic tracking and detection, catchword analysis, public opinion forecasting, and other applications.

Short texts have attracted the attention of numerous researchers since Web 2.0 came into being in 2005. The unique language characteristics of short texts make short text language computing quite different from traditional natural language processing. First, a single short text is usually very short, so its feature space is sparse and effective language features are notoriously difficult to extract. Second, short texts are extremely numerous, so short text language computing algorithms must be efficient. Third, short texts tend to be much more informal than traditional texts: informal abbreviations, transliterations, and Internet slang are prevalent in short text databases.

The main contributions of this dissertation are as follows.

First, the concept of a short text network is presented, along with algorithms to construct two common short text networks: the short text fingerprint network and the short text coexistence network. These two networks are used to detect and eliminate duplicated short texts in large-scale databases. In a short text fingerprint network, short texts are mapped to vertices, and there is an edge between any two short texts with the same fingerprint. The exact duplicate detection problem is thereby transformed into the problem of mining the connected components of the short text fingerprint network. In a short text coexistence network, short texts are likewise mapped to vertices, and there is an edge between any two texts that share some common language units. The near-duplicate detection problem may be transformed into the problem of mining the connected components of the short text coexistence network.

Second, Crusher, a frequent pattern mining algorithm for large-scale corpora, is presented. Crusher uses a logical partition strategy to divide the original corpus into a number of sub-corpora, such that the union of the frequent patterns found in all sub-corpora is exactly the set of frequent patterns of the original corpus. Low-frequency patterns can be pruned, which makes Crusher quite efficient.

Meaningful string mining is another language computing problem in short text databases. Locality is observed in the distribution of meaningful strings, including temporal locality, spatial locality, regional locality, speaker locality, and session locality. Exploiting locality in meaningful string mining improves both precision and recall.

Last, a pilot study is conducted on the recognition of humorous mobile phone short messages. Humorous messages are divided into two categories, formal humour and content humour. Features are extracted, and a humour index is assigned to each mobile phone short message. Experiments indicate that the humour index reflects the degree of humour of short texts well, and that the precision is promising for practical applications.
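
The reduction of duplicate detection to connected component mining, described above for the two short text networks, can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The abstract does not define the fingerprint function or the "language units", so the SHA-256 hash of whitespace- and case-normalized text, the character trigrams, and the min_shared threshold used here are all assumptions chosen for illustration, as are all function names.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


class UnionFind:
    """Disjoint-set structure used to collect connected components."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def components(n, edges):
    """Return the connected components of a graph on vertices 0..n-1."""
    uf = UnionFind(n)
    for a, b in edges:
        uf.union(a, b)
    groups = defaultdict(list)
    for i in range(n):
        groups[uf.find(i)].append(i)
    return sorted(groups.values())


def normalize(text):
    # Assumption: case and whitespace differences are ignored.
    return " ".join(text.lower().split())


def fingerprint_edges(texts):
    """Edges of the fingerprint network: same fingerprint => edge.

    Linking each text to the first text seen with the same fingerprint
    yields the same components as linking every matching pair.
    """
    first_seen = {}
    for i, text in enumerate(texts):
        # Assumption: the fingerprint is a hash of the normalized text.
        fp = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if fp in first_seen:
            yield first_seen[fp], i
        else:
            first_seen[fp] = i


def units(text, n=3):
    # Assumption: "language units" are character n-grams.
    s = normalize(text)
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def coexistence_edges(texts, min_shared=5):
    """Edges of the coexistence network: enough shared units => edge.

    Pairs are enumerated per unit via an inverted index; this is
    quadratic in the number of texts containing a very common unit.
    """
    index = defaultdict(list)
    for i, text in enumerate(texts):
        for u in units(text):
            index[u].append(i)
    shared = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(ids, 2):
            shared[(a, b)] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]
```

Calling components(len(texts), fingerprint_edges(texts)) groups exact duplicates, and components(len(texts), coexistence_edges(texts)) groups near-duplicates. Because connected components are transitive, a low min_shared value can chain loosely related texts into one group, so the threshold would need tuning on real data.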
Keywords/Search Tags: Short Text, Language Computing, Duplicate Detection, Frequent Pattern, Meaningful String, Humour Recognition, Humorous Mobile Phone Message