Multi-label text classification is an important task in natural language processing. The explosive growth of text data and high computational cost are well-recognized challenges in the field. The number of class labels in multi-label text classification has gradually grown to thousands or even tens of thousands. Multi-label text classification with more than 1000 class labels is called extreme multi-label text classification (XMTC). The key problem of the XMTC task is the long-tail problem. Label knowledge, an important source of external knowledge for the task, is a potential means of alleviating it. Existing techniques do not scale easily to XMTC datasets whose labels follow a severe power-law distribution. They focus on label cluster structure knowledge and make balanced predictions through the co-occurrence of head labels and tail labels in the same label cluster. However, these methods fix the label cluster structure, so the information gained from label knowledge does not carry over to dynamic, semantically rich real-world scenarios, and the classification results fall short of the ideal. To address these problems, and in view of the important role of label knowledge, this paper explores feasible ways of using label knowledge to mitigate the long-tail problem in XMTC. The research is as follows.

1) To address the limitations of using label cluster structure knowledge to alleviate the long-tail problem, an XMTC promotion strategy based on label knowledge is presented to improve the poor performance that results from fixed label cluster structure knowledge. Teacher knowledge generated by text modeling optimizes the text representation and improves prediction performance on tail labels. Experimental results show that the promotion strategy effectively improves the prediction performance of existing XMTC methods on both tail labels and the full label set.

2) To address the problem that the methods introducing the teacher knowledge strategy in 1) have a simple structure and insufficient capacity for network expression and feature extraction, an XMTC algorithm based on the teacher knowledge strategy, TReader XML, is proposed. TReader XML uses a dual-stream collaborative network framework that naturally embeds teacher knowledge and text features into a shared semantic space to achieve feature interaction. Experimental results show that TReader XML achieves state-of-the-art results on both the full label set and tail labels.

3) To address the cost and risk of deploying academic XMTC results in enterprises, and building on the results of 1) and 2), LKRoad, an XMTC toolkit based on label knowledge, is proposed. LKRoad defines a data standard and implements tools for data analysis, data preprocessing, label-knowledge-based classification algorithms, and result evaluation. Experimental results support the soundness of the framework design and its value for bringing academic methods into industrial use.