Construction of a multiscale feature fusion model for indoor scene recognition and semantic segmentation
Abstract:
Fast and accurate extraction of multi-scale semantic information is crucial for indoor applications such as navigation and personalized services. This paper focuses on large-scale indoor scene categories and small-scale indoor element semantics, proposing an RGB-based dual-task model for scene recognition and semantic segmentation. The model employs a pyramid pooling module to create a shared feature layer for both tasks. The scene recognition branch incorporates an SE attention mechanism and a Transformer module to enhance scene understanding, while the semantic segmentation branch fuses low-level features from ResNet50 to improve the learning of shapes and textures. This paper investigates fusion methods for scene-level and element-level features and explores the optimal training strategy for the dual-task model. It also analyzes the impact of different weight combinations of the two task losses on the model's performance. The experimental results demonstrate that the overall accuracy (OA) of scene recognition is 98.4% and that of semantic segmentation is 82.6%. The average time for processing a single scene is approximately 0.037 s, which is superior to the latest models compared in this paper. The results show that the optimal training strategy for a single task also helps to enhance the accuracy of both tasks, and that the model achieves its best overall performance when the loss weights of the two tasks are both set to 1.
Keywords:
Multi-scale semantics, Semantic information extraction, Scene recognition, Semantic segmentation, Dual-task model
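
To make the loss-weighting scheme in the abstract concrete, the sketch below shows one plausible way to combine the two task losses as a weighted sum, with both weights defaulting to 1 as reported. This is an illustrative assumption, not the paper's implementation; the class name `DualTaskLoss`, the parameters `w_scene` and `w_seg`, and the choice of cross-entropy for both tasks are hypothetical.

```python
import torch
import torch.nn as nn

class DualTaskLoss(nn.Module):
    """Hypothetical sketch: weighted sum of the scene-recognition and
    semantic-segmentation losses, as described in the abstract."""

    def __init__(self, w_scene: float = 1.0, w_seg: float = 1.0):
        super().__init__()
        self.w_scene = w_scene  # weight on the scene-recognition loss
        self.w_seg = w_seg      # weight on the semantic-segmentation loss
        # Cross-entropy handles both cases: (B, C) logits with (B,) labels
        # for scenes, and (B, C, H, W) logits with (B, H, W) labels for pixels.
        self.scene_ce = nn.CrossEntropyLoss()
        self.seg_ce = nn.CrossEntropyLoss(ignore_index=255)

    def forward(self, scene_logits, scene_labels, seg_logits, seg_labels):
        loss_scene = self.scene_ce(scene_logits, scene_labels)
        loss_seg = self.seg_ce(seg_logits, seg_labels)
        # The paper reports the best overall results with both weights at 1.
        return self.w_scene * loss_scene + self.w_seg * loss_seg

# Usage example with dummy tensors (batch of 2, 10 scene classes,
# 21 segmentation classes, 64x64 resolution):
criterion = DualTaskLoss(w_scene=1.0, w_seg=1.0)
loss = criterion(
    torch.randn(2, 10), torch.randint(0, 10, (2,)),
    torch.randn(2, 21, 64, 64), torch.randint(0, 21, (2, 64, 64)),
)
```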