Abstract:
Objective Multimodal fusion has moved underwater target detection beyond reliance on a single sonar image or a single optical image. However, existing alignment and fusion algorithms for multimodal target detection are poorly suited to underwater scenes: they do not resolve the spatial misalignment and degraded fusion performance caused by imaging differences between optical and acoustic devices. Methods To address these challenges, this paper proposes an underwater target detection framework that fuses multimodal information from sonar and optical images. The framework comprises five layers (Input Layer, Feature Extraction Layer, Multimodal Fusion Layer, Target Perception Layer, and Output Layer) and incorporates a spatial registration module and a dynamic weight fusion module, which together address the spatial misalignment of objects across imaging systems and low scene adaptability. Results Comparative experiments were conducted on an acoustic-optical paired dataset collected in real underwater environments, against mainstream unimodal underwater target detection algorithms and advanced multimodal fusion algorithms from other fields. 
The proposed method achieves 95.6% on the mAP_50 metric, with improvements of 24.4%, 6.8%, and 7.9% over the unimodal target detection algorithms YOLO12X, YOLO13X, and RTDETR, respectively, and of 3.9%, 2.9%, and 5.2% over the multimodal fusion target detection algorithms DenseFusion, U2Fusion, and SwinFusion, respectively. On the mAP_(50-95) metric it reaches 50.7%, with improvements of 18.6%, 7.4%, and 4.5% over YOLO12X, YOLO13X, and RTDETR, respectively, and of 7.9%, 7.0%, and 8.2% over DenseFusion, U2Fusion, and SwinFusion, respectively. Conclusions The experimental results validate the effectiveness of the proposed end-to-end underwater object detection framework based on acoustic-optical fusion. The Spatial Alignment Module and Dynamic Weight Fusion Module improve detection accuracy in complex scenarios and address the spatial misalignment and fusion degradation that arise between sonar and optical images. This work provides an effective alignment and fusion paradigm for underwater perception tasks based on multimodal data fusion.
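
The abstract does not specify how the dynamic weight fusion module is built, so the following is only an illustrative sketch of the general idea of dynamically weighted fusion, not the authors' implementation. It assumes the two feature vectors are already spatially aligned and that each modality comes with a scalar quality score; the function names `softmax` and `dynamic_weight_fusion` and the score inputs are hypothetical. In a trained network the scores would more likely be predicted by a learned sub-network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_weight_fusion(optical_feat, sonar_feat, optical_score, sonar_score):
    """Fuse two aligned feature vectors with score-dependent weights.

    Assumption (not from the paper): each modality has a scalar quality
    score, and the fusion weights are the softmax of those two scores.
    """
    if len(optical_feat) != len(sonar_feat):
        raise ValueError("feature vectors must have the same length")
    w_opt, w_son = softmax([optical_score, sonar_score])
    # Weighted element-wise sum; the weights are non-negative and sum to 1.
    fused = [w_opt * o + w_son * s for o, s in zip(optical_feat, sonar_feat)]
    return fused, (w_opt, w_son)


if __name__ == "__main__":
    # Turbid water: the hypothetical optical score is low, so sonar dominates.
    fused, weights = dynamic_weight_fusion([1.0, 3.0], [3.0, 1.0], -1.0, 1.0)
    print(fused, weights)
```

With equal scores the sketch reduces to an element-wise average of the two modalities. As one score grows relative to the other, the fused features shift toward that modality, which is the scene-adaptive behavior the abstract attributes to dynamic weighting.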