Keyword-Guided Chinese Image Caption Generation


史秀聪

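This metric computes a weight for each n-gram and measures the consistency of image annotations as the cosine similarity between the reference sentences and the model-generated sentence.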
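
The n-gram weighting and cosine comparison described here can be illustrated with a minimal sketch. It assumes TF-IDF weights and a single n-gram order. The function names and the toy corpus layout (one list of tokenized reference captions per image) are hypothetical. It is a simplification for illustration, not the official CIDEr implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vector(tokens, n, doc_freq, num_docs):
    """Weight each n-gram by term frequency times inverse document frequency."""
    counts = Counter(ngrams(tokens, n))
    total = sum(counts.values())
    vec = {}
    for gram, count in counts.items():
        # Assumption: n-grams unseen in the corpus get document frequency 1.
        idf = math.log(num_docs / max(1, doc_freq.get(gram, 0)))
        vec[gram] = (count / total) * idf
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def ngram_consensus(candidate, references, corpus, n=1):
    """Average TF-IDF cosine similarity of a candidate against its references."""
    # Document frequency: number of images whose references contain the n-gram.
    doc_freq = Counter()
    for refs in corpus:
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref, n))
        doc_freq.update(seen)
    num_docs = len(corpus)
    cand_vec = tfidf_vector(candidate, n, doc_freq, num_docs)
    scores = [
        cosine(cand_vec, tfidf_vector(ref, n, doc_freq, num_docs))
        for ref in references
    ]
    return sum(scores) / len(scores)
```

A candidate identical to its only reference scores 1, and a candidate sharing no n-gram with the references scores 0.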

4.2. Dataset

We evaluated the proposed method on a dataset of car images. We wrote a crawler to collect the image data; the raw image format is shown in Figure 4.

Figure 4. Raw data format

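Each raw image carries a text description of its content. We rewrote these descriptions to avoid overly long or colloquial wording, and then cropped the images so that they no longer contain the embedded text. After this processing we obtained a dataset of "image-description sentence" pairs. To extract keyword information, we merged all description sentences in the dataset into a single document and applied the keyword extraction interface of the Jieba [24] library to it. The extracted keywords are ranked by weight from high to low. We then traversed the words of each description sentence and checked whether each word is a keyword. In this way every image is assigned the keywords of its description, as a set of 1 to 4 words.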
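
The per-sentence keyword matching step can be sketched as follows. The sketch assumes the caption has already been segmented into words and that the corpus-level keywords are available as a list ordered by descending weight. The function name and the rule of keeping the highest-weight matches when more than four are found are assumptions, since the text does not state how the upper limit of four is enforced.

```python
def select_keywords(tokens, ranked_keywords, max_keywords=4):
    """Pick the keywords of one caption.

    tokens: the caption after word segmentation.
    ranked_keywords: corpus-level keywords, highest weight first.
    """
    present = set(tokens)
    selected = []
    for keyword in ranked_keywords:
        if keyword in present:
            selected.append(keyword)
            # Assumption: stop once the upper limit of keywords is reached.
            if len(selected) == max_keywords:
                break
    return selected
```

Iterating over the ranked keyword list rather than over the caption keeps the result ordered by keyword weight, which makes the truncation to four words deterministic.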

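After processing, we built a dataset of "image-keyword-description sentence" mappings. The dataset contains 2100 records; we randomly selected 2000 as the training set and 100 as the test set. The mapping from an image to its description is weak: even a fluent, high-quality generated description scores poorly if it differs greatly from the reference description of the test image, so the model's performance is hard to reflect in the evaluation results. We therefore randomly selected 5 of the description sentences of the 20 images most similar to each test image as its reference descriptions. This information is written to a JSON file with the following structure:

    {
      images: [
        {[keywords of image]}, ......
      ],
      annotations: [
        {"description"}, ......
      ]
    }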
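
The split and the reference sampling described in this section can be sketched with the standard library. This is an illustrative sketch only: the field names `id`, `image_id`, `keywords` and `description`, the function names, and the fixed random seed are assumptions, because the source shows the file structure only in outline.

```python
import json
import random


def split_dataset(records, num_test=100, seed=0):
    """Randomly split records into a training list and a test list."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return shuffled[num_test:], shuffled[:num_test]


def sample_references(similar_captions, num_refs=5, seed=0):
    """Draw reference captions from those of the most similar images."""
    return random.Random(seed).sample(similar_captions, num_refs)


def to_json(records):
    """Serialize records into an images list and an annotations list."""
    payload = {
        # Hypothetical field names; the source does not list them.
        "images": [
            {"id": r["id"], "keywords": r["keywords"]} for r in records
        ],
        "annotations": [
            {"image_id": r["id"], "description": r["description"]}
            for r in records
        ],
    }
    # ensure_ascii=False keeps Chinese text readable in the file.
    return json.dumps(payload, ensure_ascii=False, indent=2)
```

With 2100 records and `num_test=100`, `split_dataset` returns 2000 training records and 100 test records with no overlap.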

4.3. Image Encoder

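In this paper we use VGG-16 pre-trained on the ImageNet [25] image dataset as the image encoder. We take the output of the last convolutional layer as the image feature. Its dimensions are 14 × 14 × 512, where 512 is the number of channels of the feature map and 14 × 14 is its spatial size, i.e. its height and width.

DOI: 10.12677/csa.2020.106113

Computer Science and Application (计算机科学与应用)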

4.4. Keyword Prediction

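Keywords for a test image are predicted by image retrieval. We use the VGG-16 model to extract features of the test image and of the images in the training set; this VGG-16 is the same model as the image encoder. The features are then converted into vectors, and the cosine similarity between the feature vector of the test image and that of each training image is computed to find the 10 training images most similar to the test image. The keyword information of these 10 images serves as the candidate keyword information for the test image. The cosine similarity is computed as in Equation (10).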

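$$\cos(\theta) = \frac{I_1 \cdot I_2}{\lVert I_1 \rVert \, \lVert I_2 \rVert} = \frac{\sum_{k=1}^{n} i_{1k}\, i_{2k}}{\sqrt{\sum_{k=1}^{n} i_{1k}^{2}}\,\sqrt{\sum_{k=1}^{n} i_{2k}^{2}}} \qquad (10)$$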

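where $I_1$ and $I_2$ are the vector representations of the image features, written $I_1 = \{i_{11}, \ldots, i_{1n}\}$ and $I_2 = \{i_{21}, \ldots, i_{2n}\}$.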
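
The retrieval step built on Equation (10) can be sketched in plain Python. The function names and the list-based data layout are hypothetical, and the image features are assumed to be already flattened into equal-length vectors; how the paper flattens the 14 × 14 × 512 feature map is not stated.

```python
import math


def cosine_similarity(a, b):
    """Equation (10): dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def candidate_keywords(query, train_features, train_keywords, top_k=10):
    """Collect the keyword sets of the top_k training images most similar to query."""
    ranked = sorted(
        range(len(train_features)),
        key=lambda idx: cosine_similarity(query, train_features[idx]),
        reverse=True,
    )
    return [train_keywords[idx] for idx in ranked[:top_k]]
```

For a training set of 2000 images a full sort is cheap; for much larger collections a partial selection such as `heapq.nlargest` would avoid sorting every score.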

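Figure 5 shows the retrieval results for images #(183) and #(174). The image at the top is the test image, the 10 images in the middle are the retrieved images most similar to it, and the keyword information of these 10 images is listed at the bottom. The results indicate that the VGG-16 model extracts image features accurately, and that retrieval by cosine similarity effectively finds similar images from which the keyword information of the test image is determined.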

Figure 5. Retrieval results of images #(183) and #(174)

4.5. Text Encoder

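An effective way to represent the semantics of words is to map them to high-dimensional word vectors. Together these vectors form a word vector space in which the cosine distance between two vectors reflects the semantic similarity of the corresponding words. The Continuous Bag-of-Words (CBOW) model [26] meets this requirement. Training the word vector model on a large corpus lets it capture rich semantic information, so that it can adequately represent text.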
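
To make the CBOW idea concrete, the sketch below builds the (context, target) training pairs that a CBOW-style model learns from: the surrounding words within a window are used to predict the centre word. The function name and the small default window are illustrative assumptions; production implementations add details such as sub-sampling that are omitted here.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a CBOW-style model."""
    pairs = []
    for i, target in enumerate(tokens):
        # Up to `window` words on each side of the target form its context.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs
```

Words that occur in similar contexts receive similar training signals, which is why their learned vectors end up close in cosine distance.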

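We deployed a word vector model as the text encoder using the Gensim [27] library, with a window size of 5 and a word vector dimension of 512, and trained it on the Chinese Wikipedia dataset. The trained model represents each word as a 512-dimensional vector. To inspect its quality, we queried the words most similar to "seat" (座椅), as shown in Figure 6.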

Figure 6. Distribution of words semantically similar to "seat" (座椅)

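In addition, to observe how the word vectors are distributed, we randomly selected the vectors of 100 words and visualized them after dimensionality reduction with t-SNE, as shown in Figure 7.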

Figure 7. Word vector visualization

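Figure 6 shows that the word vector model effectively finds words similar to "seat" (座椅). In Figure 7, semantically related words such as "direction" (方向), "pointing" (指向) and "steering" (转向) cluster close together. These results indicate that the word vector model represents words well.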

4.6. Image Caption Generation

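After the encoders are trained, they are trained jointly with the decoder. The core of the decoder is an LSTM, whose purpose is to decode the context vector produced by the encoders into a natural-language sentence. To train the LSTM we proceeded as follows. First, all description sentences were segmented into words, yielding a dictionary of 2561 words; each word is mapped to an integer giving its position in the dictionary. Inside the model, words are represented by a 2561 × 512 embedding matrix, so each word corresponds to a 512-dimensional vector. The embedding matrix is initialized with a uniform-distribution initializer and optimized during training. The LSTM has a hidden dimension of 512 and a single hidden layer. Model parameters are updated with the Adam optimizer: we ran 5000 iterations at the initial learning rate of 0.001, followed by 4000 iterations at a learning rate of 0.0005.

We compared our model with the NIC and Soft Attention methods. NIC is the basic encoder-decoder model, and Soft Attention adds a soft attention mechanism to NIC. Our model takes each of 10 groups of keywords together with the image as input. All models were evaluated with BLEU, ROUGE-L and CIDEr; the results are listed in Table 1.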
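
The vocabulary construction, embedding initialization and two-stage learning-rate schedule described in this section can be sketched as follows. The function names, the range of the uniform distribution (`scale=0.1`) and the absence of special tokens such as start or end markers are assumptions, as the text does not specify them; the sketch also uses plain Python lists instead of a deep learning framework.

```python
import random


def build_vocabulary(tokenized_captions):
    """Map each distinct word to an integer index, in order of first appearance."""
    vocab = {}
    for tokens in tokenized_captions:
        for word in tokens:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def init_embeddings(vocab_size, dim=512, scale=0.1, seed=0):
    """Create a vocab_size x dim matrix drawn from U(-scale, scale)."""
    # Assumption: the bounds of the uniform initializer are not given in the text.
    rng = random.Random(seed)
    return [
        [rng.uniform(-scale, scale) for _ in range(dim)]
        for _ in range(vocab_size)
    ]


def learning_rate(step):
    """Two-stage schedule: 0.001 for the first 5000 iterations, then 0.0005."""
    return 0.001 if step < 5000 else 0.0005
```

With the 2561-word dictionary reported here, `init_embeddings(2561)` would yield the 2561 × 512 matrix, and `learning_rate` would be queried for steps 0 to 8999.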

Table 1. Evaluation results on BLEU-n (n = 1, 2, 3, 4), ROUGE-L and CIDEr

| Metric | NIC | Soft Attention | Our model |
|---|---|---|---|
| B@1 | 0.311 | 0.330 | 0.418 |
| B@2 | 0.212 | 0.214 | 0.287 |
| B@3 | 0.161 | 0.153 | 0.165 |
| B@4 | 0.075 | 0.079 | 0.107 |
| ROUGE-L | 0.076 | 0.092 | 0.153 |
| CIDEr | 0.337 | 0.342 | 0.394 |

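Table 1 shows that our model outperforms NIC and Soft Attention on every evaluation metric; for example, B@1 rises from 0.330 (Soft Attention) to 0.418 and CIDEr from 0.342 to 0.394. Introducing keyword information strengthens the mapping from an image to its description. Figure 8 shows the descriptions generated for the same image under different keyword information; the results indicate that different keywords shift the focus of the description to some extent.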

Figure 8. Effect of different keyword information on image description


5. Conclusion

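This paper proposed a new method that generates image description sentences from an image together with keyword information. The experimental results show that our model outperforms the NIC and Soft Attention models and generates fluent description sentences, and that combining the same image with different keyword information controls the focus of the description, which increases the diversity of image descriptions to some extent. Despite this progress, some problems remain: the dataset is not large enough, and the generated description sentences tend to be short. In future work we will expand the dataset and optimize the model for better performance to meet the needs of practical applications.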

References

[1] Johnson, M., Schuster, M., Le, Q.V., Krikun, M., Wu, Y., Chen, Z., et al. (2017) Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Transactions of the Association for Computational Linguistics, 5, 339-351. https://doi.org/10.1162/tacl_a_00065

[2] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014) Microsoft COCO: Common Objects in Context. In: Fleet, D., Pajdla, T., Schiele, B. and Tuytelaars, T., Eds., European Conference on Computer Vision, Springer, Cham, 740-755. https://doi.org/10.1007/978-3-319-10602-1_48

[3] Flickr Image Dataset. Kaggle.com. https://www.kaggle.com/hsankesara/flickr-image-dataset

[4] Vinyals, O., Toshev, A., Bengio, S. and Erhan, D. (2015) Show and Tell: A Neural Image Caption Generator. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, 7-12 June 2015, 3156-3164. https://doi.org/10.1109/CVPR.2015.7298935

[5] Karpathy, A. and Li, F.-F. (2015) Deep Visual-Semantic Alignments for Generating Image Descriptions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, 7-12 June 2015, 3128-3137. https://doi.org/10.1109/CVPR.2015.7298932

[6] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., et al. (2017) Attention Is All You Need. In: Advances in Neural Information Processing Systems, 5998-6008.

[7] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., et al. (2015) Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. International Conference on Machine Learning, June 2015, 2048-2057.

[8] You, Q., Jin, H., Wang, Z., Fang, C. and Luo, J. (2016) Image Captioning with Semantic Attention. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, 27-30 June 2016, 4651-4659. https://doi.org/10.1109/CVPR.2016.503

[9] Lu, J., Xiong, C., Parikh, D. and Socher, R. (2017) Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 21-26 July 2017, 375-383. https://doi.org/10.1109/CVPR.2017.345

[10] Chen, L., Zhang, H., Xiao, J., Nie, L., Shao, J., Liu, W. and Chua, T.S. (2017) SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 21-26 July 2017, 5659-5667. https://doi.org/10.1109/CVPR.2017.667

[11] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S. and Zhang, L. (2018) Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 18-23 June 2018, 6077-6086. https://doi.org/10.1109/CVPR.2018.00636

[12] He, C. and Hu, H. (2019) Image Captioning with Text-Based Visual Attention. Neural Processing Letters, 49, 177-185. https://doi.org/10.1007/s11063-018-9807-7

[13] He, X., Yang, Y., Shi, B. and Bai, X. (2019) VD-SAN: Visual-Densely Semantic Attention Network for Image Caption Generation. Neurocomputing, 328, 48-55. https://doi.org/10.1016/j.neucom.2018.02.106

[14] Fang, H., Gupta, S., Iandola, F., Srivastava, R.K., Deng, L., Dollár, P., et al. (2015) From Captions to Visual Concepts and Back. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, 7-12 June 2015, 1473-1482. https://doi.org/10.1109/CVPR.2015.7298754

[15] Li, N. and Chen, Z. (2018) Image Captioning with Visual-Semantic LSTM. IJCAI, July 2018, 793-799. https://doi.org/10.24963/ijcai.2018/110

[16] Wang, Y., Lin, Z., Shen, X., Cohen, S. and Cottrell, G.W. (2017) Skeleton Key: Image Captioning by Skeleton-Attribute Decomposition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 21-26 July 2017, 7272-7281. https://doi.org/10.1109/CVPR.2017.780


[17] Ren, Z., Wang, X., Zhang, N., Lv, X. and Li, L.J. (2017) Deep Reinforcement Learning-Based Image Captioning with Embedding Reward. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, 21-26 July 2017, 290-298. https://doi.org/10.1109/CVPR.2017.128

[18] Zhang, L., Sung, F., Liu, F., Xiang, T., Gong, S., Yang, Y. and Hospedales, T.M. (2017) Actor-Critic Sequence Training for Image Captioning. arXiv preprint arXiv:1706.09601

[19] Simonyan, K. and Zisserman, A. (2014) Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556

[20] Hochreiter, S. and Schmidhuber, J. (1997) Long Short-Term Memory. Neural Computation, 9, 1735-1780. https://doi.org/10.1162/neco.1997.9.8.1735

[21] Papineni, K., Roukos, S., Ward, T. and Zhu, W.J. (2002) BLEU: A Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, July 2002, 311-318. https://doi.org/10.3115/1073083.1073135

[22] Lin, C.Y. and Och, F.J. (2004) Looking for a Few Good Metrics: ROUGE and Its Evaluation. NTCIR Workshop, Tokyo, 2-4 June 2004.

[23] Vedantam, R., Lawrence Zitnick, C. and Parikh, D. (2015) CIDEr: Consensus-Based Image Description Evaluation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, 7-12 June 2015, 4566-4575. https://doi.org/10.1109/CVPR.2015.7299087

[24] Sun, J. (2012) Jieba Chinese Word Segmentation Tool. https://github.com/fxsjy/jieba

[25] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K. and Li, F.-F. (2009) ImageNet: A Large-Scale Hierarchical Image Database. 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 20-25 June 2009, 248-255. https://doi.org/10.1109/CVPR.2009.5206848

[26] Ling, W., Dyer, C., Black, A.W. and Trancoso, I. (2015) Two/Too Simple Adaptations of Word2Vec for Syntax Problems. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, May-June 2015, 1299-1304. https://doi.org/10.3115/v1/N15-1142

[27] gensim: Topic Modelling for Humans. Radimrehurek.com. https://radimrehurek.com/gensim/models/word2vec.html
