In Chinese documents (especially Chinese newspapers), non-text regions and text regions are interleaved, and the non-text regions interfere with the extraction of the text regions. To address this characteristic, we propose a Chinese layout analysis method that handles the non-text regions first. We first extract the non-text regions and remove them so that they do not interfere with the extraction of the text regions. We then process the text regions with a method based on run-length smoothing and minimum spanning tree clustering, treating horizontally aligned and vertically aligned text regions differently. Finally, the text regions obtained from the clustering are segmented according to the positions of the non-text regions. Experiments indicate that the method is well suited to segmenting documents in which horizontally and vertically aligned text regions are mixed and text and non-text regions are intermingled.
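The run-length smoothing step mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a binary image stored as nested lists (1 = ink, 0 = background), and the function names and the `threshold` parameter are hypothetical. Short background runs between two ink pixels are filled, so that nearby characters merge into blocks along the reading direction.

```python
from typing import List

Image = List[List[int]]  # assumed encoding: 1 = ink, 0 = background


def smooth_runs(row: List[int], threshold: int) -> List[int]:
    """Fill background runs of length <= threshold that lie between two ink pixels."""
    out = list(row)
    last_ink = None  # index of the most recent ink pixel seen so far
    for i, pixel in enumerate(row):
        if pixel == 1:
            if last_ink is not None:
                gap = i - last_ink - 1
                if 0 < gap <= threshold:
                    for j in range(last_ink + 1, i):
                        out[j] = 1
            last_ink = i
    return out


def smooth_horizontal(image: Image, threshold: int) -> Image:
    """Smooth each row; merges characters in horizontally aligned text."""
    return [smooth_runs(row, threshold) for row in image]


def smooth_vertical(image: Image, threshold: int) -> Image:
    """Smooth each column; merges characters in vertically aligned text."""
    columns = [list(col) for col in zip(*image)]
    smoothed = [smooth_runs(col, threshold) for col in columns]
    return [list(row) for row in zip(*smoothed)]
```

Under these assumptions, horizontally aligned regions would be passed through `smooth_horizontal` and vertically aligned regions through `smooth_vertical`; the resulting blocks could then serve as the nodes for a clustering step such as the minimum spanning tree clustering named in the abstract.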