Abstract
Tomato is a globally important economic crop that requires precise and timely disease management to secure yield and quality. Yet the robustness of leaf disease segmentation is often limited by weak semantic understanding from single-modality images, the narrow receptive fields of convolutional structures, and discontinuous boundary predictions. To address these issues, we propose the Multi-scale Linear Cross-modal Fusion Architecture for Tomato Leaf Disease Segmentation (MS-LCFNet). We construct a real-world field dataset covering five major tomato leaf diseases, annotated by experts with detailed textual descriptions to enable multimodal learning. MS-LCFNet strengthens semantic representation via cross-modal fusion, captures local and global context through an Adaptive Long-short Distance Perception module, and improves boundary continuity with a Physics-informed Smoothness-constrained Loss. Experiments show that MS-LCFNet achieves 87.13% mIoU on our dataset and 90.78% mIoU on PlantVillage, improving over previous state-of-the-art methods by 4.62% and 4.48%, respectively, and demonstrating superior accuracy and robustness in complex agricultural scenarios.
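For readers unfamiliar with the reported metric, mean intersection-over-union (mIoU) can be illustrated with a minimal sketch. The function name `mean_iou`, the toy label maps, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration; the exact averaging protocol used in the experiments is not specified here.

```python
from typing import Sequence


def mean_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
    num_classes: int,
) -> float:
    """Average per-class IoU over classes that appear in pred or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel counts once toward both intersection and union.
                intersection[p] += 1
                union[p] += 1
            else:
                # Mismatched pixel enlarges the union of both classes.
                union[p] += 1
                union[t] += 1
    # Assumption: classes absent from both maps are excluded from the mean.
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical 2x3 label maps with three classes (0, 1, 2).
pred = [[0, 0, 1], [1, 1, 2]]
target = [[0, 1, 1], [1, 1, 2]]
score = mean_iou(pred, target, num_classes=3)
print(f"mIoU = {score:.4f}")
```

In this toy case the per-class IoUs are 1/2, 3/4, and 1, so a single mislabeled pixel already lowers the mean noticeably, which is why boundary errors weigh heavily on this metric.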