Exploring Users' Natural Input Annotations and Their Role in Building Word Segmentation Corpora
Zhang Dakui (张大奎); Yin Dechun (尹德春); Tang Shiping (汤世平); Mao Yu (毛煜); Fan Xiaozhong (樊孝忠)
[Journal] Journal of Chinese Information Processing (《中文信息学报》) [Year (Volume), Issue] 2018(032)002
[Abstract] As Chinese word segmentation algorithms have been optimized, the performance of a word segmenter depends increasingly on the coverage and completeness of its training corpus. How to build a word segmentation corpus quickly, effectively, and automatically has therefore become a pressing issue. This paper explores the valuable natural word segmentation information that is produced when users type Chinese text. This information offers a new perspective on building Chinese segmentation training corpora, one that has received little attention in the literature. We show that user-produced word segmentation information can be used to build a segmentation corpus and that its performance is acceptable. Moreover, some texts carrying this information from excellent users are very close to the gold-standard segmentation. In this study, we use a classification model and a voting mechanism to identify three such excellent users and obtain their texts with natural word segmentation information. Experimental results show that these texts can be used to build a segmentation training corpus, which greatly improves the accuracy of the segmenter.
[Abstract, translated from the Chinese original] When segmentation algorithms have been optimized nearly to their limit, a segmenter's performance depends largely on the training corpus's …
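The abstract does not specify the classification model or the voting mechanism, so the following Python sketch is illustrative only and is not the authors' method. It assumes that each commit event in an input method yields one candidate word, scores a user's natural segmentation against a gold-standard segmentation with word-level F1, and combines several hypothetical judgments about a user with a simple majority vote. The names `spans`, `segmentation_f1`, and `majority_vote`, as well as the sample sentence, are assumptions introduced for illustration.

```python
from typing import List, Set, Tuple


def spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    result: Set[Tuple[int, int]] = set()
    pos = 0
    for word in words:
        result.add((pos, pos + len(word)))
        pos += len(word)
    return result


def segmentation_f1(predicted: List[str], gold: List[str]) -> float:
    """Word-level F1 between two segmentations of the same text."""
    if "".join(predicted) != "".join(gold):
        raise ValueError("both segmentations must cover the same text")
    pred_spans, gold_spans = spans(predicted), spans(gold)
    correct = len(pred_spans & gold_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


def majority_vote(votes: List[bool]) -> bool:
    """Return True only when strictly more than half of the votes are True."""
    return sum(votes) * 2 > len(votes)


# Hypothetical data: each list element is one commit event while typing.
user_commits = ["我们", "喜欢", "自然", "语言", "处理"]
gold_words = ["我们", "喜欢", "自然语言", "处理"]

score = segmentation_f1(user_commits, gold_words)
print(f"F1 of the user's natural segmentation: {score:.3f}")

# Hypothetical judgments from three classifiers on whether the user is "excellent".
print("Selected as excellent user:", majority_vote([True, True, False]))
```

Under these assumptions, a user whose commit boundaries consistently score high against a small gold sample would be a candidate source of training text; the thresholds and features that the paper actually uses are not given in the abstract.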