Research on the Application of the IG Feature Selection Algorithm Based on Word Frequency Information Improvement in Text Classification
NIU Yuxia
(Nantong Science and Technology Academy, Nantong 226007, China)
Abstract: The IG (information gain) algorithm is an effective feature selection algorithm that has been widely used in text classification research. To address the shortcomings of the IG algorithm, this paper proposes an improved method based on word frequency information. The improvements cover three aspects: intra-class word frequency information, the intra-class word frequency location distribution, and inter-class word frequency information. The improved algorithm was tested experimentally, and the results show that it is more effective than the traditional algorithm.
Keywords: word frequency information; IG algorithm; feature selection; text classification
CLC number: TP391.1  Document code: A
1 Introduction
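With the rapid development of information technology, information resources on the Internet are growing explosively. Faced with this mass of information, how to manage resources sensibly so that people can obtain useful information quickly and accurately has become one of the research hotspots in the IT industry [1].
Text classification is one of the key technologies of text information processing and can address this problem well. In text classification, the vector space model is commonly used to represent text in a structured form, and the high dimensionality of the text features and the sparsity of the feature weights directly affect classification accuracy. Designing a sound feature dimensionality reduction method can therefore improve the efficiency of automatic text classification. Feature selection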