A Multimodal Deep Fusion Method for Underwater Acoustic and Optical Information Towards End-to-End Object Detection

刘禄; 陈燕斌; 张祥越; 朱红全; 石廷超; 范慧丽; 陶浩; 孔诗涵

doi:10.19693/j.issn.1673-3185.04920

Abstract

Objective: Advances in multimodal fusion have moved underwater object detection beyond methods that rely on sonar images alone or optical images alone. However, the alignment and fusion algorithms used in existing multimodal object detection techniques are not suited to underwater scenes. They struggle with the spatial misalignment and fusion degradation caused by imaging differences between optical and acoustic devices.

Methods: To address these challenges, this paper proposes an end-to-end underwater object detection framework that fuses multimodal information from sonar and optical images. The framework consists of an input layer, a feature extraction layer, a multimodal fusion layer, a target perception layer, and an output layer. It introduces a spatial position alignment module and a dynamic weight fusion module, which together address the spatial misalignment of underwater objects across imaging systems and the degradation of fusion quality.

Results: Comparative experiments were conducted on a paired acoustic-optical dataset collected in real underwater environments, against mainstream unimodal underwater object detection algorithms and advanced multimodal fusion algorithms from other fields. The proposed method achieves 95.6% on the mAP_50 metric, an improvement of 24.4%, 6.8%, and 7.9% over the unimodal detectors YOLO12X, YOLO13X, and RTDETR, respectively, and of 3.9%, 2.9%, and 5.2% over the multimodal fusion detection methods DenseFusion, U2Fusion, and SwinFusion, respectively. On the mAP_(50-95) metric it achieves 50.7%, an improvement of 18.6%, 7.4%, and 4.5% over YOLO12X, YOLO13X, and RTDETR, respectively, and of 7.9%, 7.0%, and 8.2% over DenseFusion, U2Fusion, and SwinFusion, respectively.

Conclusions: The experimental results validate the effectiveness of the proposed end-to-end underwater object detection framework based on acoustic-optical fusion. The spatial position alignment module and the dynamic weight fusion module improve detection accuracy in complex scenes and address the spatial misalignment and fusion degradation that arise between sonar and optical images. This work provides an effective alignment and fusion paradigm for underwater perception tasks based on multimodal data fusion.
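The abstract does not specify how the alignment and fusion modules are implemented, so the following is only a minimal illustrative sketch of the general idea of "align, then fuse with input-dependent weights". Everything in it is an assumption made for illustration: the function names `shift_map`, `modality_weights`, and `fuse` are hypothetical, alignment is reduced to a fixed integer translation, and the weights come from a softmax over mean activations in place of whatever learned gating the paper actually uses.

```python
import math


def shift_map(feat, dy, dx, fill=0.0):
    """Translate a 2D feature map by (dy, dx) cells, padding with `fill`.

    Hypothetical stand-in for a spatial alignment step: a real module
    would estimate the offset (or a full warp) from the data.
    """
    h, w = len(feat), len(feat[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = feat[sy][sx]
    return out


def modality_weights(optical, sonar):
    """Softmax over each modality's mean activation.

    Assumed scoring rule, used here only so that the weights depend on
    the inputs; it is not taken from the paper.
    """
    scores = [sum(map(sum, m)) / (len(m) * len(m[0])) for m in (optical, sonar)]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(optical, sonar, offset=(0, 0)):
    """Align the sonar map to the optical map, then blend the two."""
    aligned = shift_map(sonar, *offset)
    w_opt, w_son = modality_weights(optical, aligned)
    fused = [
        [w_opt * o + w_son * s for o, s in zip(row_o, row_s)]
        for row_o, row_s in zip(optical, aligned)
    ]
    return fused, (w_opt, w_son)


# Toy example: the same object appears one cell to the right in the sonar map.
optical = [[1.0, 0.0], [0.0, 0.0]]
sonar = [[0.0, 1.0], [0.0, 0.0]]
fused, weights = fuse(optical, sonar, offset=(0, -1))
```

In the toy example the sonar response is shifted one cell left before blending, so both modalities agree on the object's location and the fused map keeps a single peak rather than two half-strength ones.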
