Word Segmentation for Chinese Electronic Medical Records Based on Unsupervised Learning
Zhang Libang; Guan Yi; Yang Jinfeng
【Journal】Intelligent Computer and Applications (《智能计算机与应用》) 【Year (Volume), Issue】2014(000)002
【Abstract】Electronic medical records (EMRs) contain a large amount of useful medical knowledge. Extracting this knowledge is important for building clinical decision support systems and personalized healthcare information services. Automatic word segmentation is a key prerequisite for analyzing and mining Chinese EMRs. To overcome the difficulty of obtaining labeled corpora, the paper proposes an unsupervised approach to word segmentation in Chinese EMRs. First, a general-domain lexicon is used to produce an initial segmentation of the records; to better resolve ambiguity, a probabilistic model is introduced, and word occurrence probabilities are estimated from the raw corpus with the EM algorithm. Then, a goodness measure is built from the left and right branching entropy of character strings, unknown-word recognition is cast as an optimization problem, and that problem is solved with dynamic programming. Finally, experiments on 3,000 Chinese EMRs from a neurology department demonstrate the effectiveness of the method.
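The second stage described in the abstract (branching entropy as a goodness measure, decoded by dynamic programming) can be illustrated with a minimal sketch. The abstract does not give the exact goodness formula, so the code below makes its own assumptions: goodness is taken to be the string length times the smaller of the left and right branching entropies, single characters score zero, and unseen multi-character strings are not allowed as words. The function names `branching_entropy`, `goodness_table`, and `segment` are hypothetical, and the lexicon and EM stage are not covered.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(corpus, max_len=4):
    """Map each substring (up to max_len chars) to (left_entropy, right_entropy)."""
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for sent in corpus:
        padded = "^" + sent + "$"  # sentence boundaries count as neighbours
        n = len(padded)
        for i in range(1, n - 1):
            for j in range(i + 1, min(i + max_len, n - 1) + 1):
                w = padded[i:j]
                left[w][padded[i - 1]] += 1   # character just before w
                right[w][padded[j]] += 1      # character just after w

    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log(c / total) for c in counter.values())

    return {w: (entropy(left[w]), entropy(right[w])) for w in left}


def goodness_table(entropies):
    """Assumed goodness: len(w) * min(left, right entropy), multi-char strings only."""
    return {w: len(w) * min(hl, hr)
            for w, (hl, hr) in entropies.items() if len(w) > 1}


def segment(sentence, goodness, max_len=4):
    """Dynamic programming: choose the split that maximises total goodness."""
    n = len(sentence)
    best = [0.0] + [-math.inf] * n   # best[j]: best score for sentence[:j]
    back = [0] * (n + 1)             # back[j]: start index of the last word
    for j in range(1, n + 1):
        # Try shorter words first so that ties favour the finer split.
        for i in range(j - 1, max(0, j - max_len) - 1, -1):
            w = sentence[i:j]
            if j - i == 1:
                gain = 0.0           # single characters are always allowed
            elif w in goodness:
                gain = goodness[w]
            else:
                continue             # unseen multi-character string
            if best[i] + gain > best[j]:
                best[j] = best[i] + gain
                back[j] = i
    words, j = [], n
    while j > 0:
        words.append(sentence[back[j]:j])
        j = back[j]
    return words[::-1]


toy_corpus = ["xab", "aby", "zabw"]  # toy data, not from the paper
table = goodness_table(branching_entropy(toy_corpus))
print(segment("xaby", table))  # ['x', 'ab', 'y']
```

In the toy corpus the string "ab" occurs with three different left neighbours and three different right neighbours, so it has high branching entropy on both sides and is kept as a word, while strings seen in only one context score zero and are split.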