Description

With the availability of enormous numbers of remote sensing images produced by satellites and airborne sensors, high-resolution remote sensing image analysis has stimulated a flood of interest in the remote sensing and computer vision communities (Toth and Jóźków, 2016), in tasks such as image classification or land cover mapping (Cheng et al., 2017; Gómez et al., 2016; You and Dong, 2020; Zhao et al., 2016), image retrieval (Wang et al., 2016), and object detection (Cheng et al., 2014; Han et al., 2014). The observation capability offered by these platforms has great potential, but it also poses great challenges for semantic scene understanding (Bazi, 2019). For instance, as these data are obtained from different locations, at different times and even with different satellites or image semantic segmentation (Xia et al., 2015), or migrate the model of multi-label data training to other visual tasks (e.g., image object recognition) (Gong et al., 2019). Therefore, multi-label datasets now attract increasing attention in the remote sensing community because they are inexpensive yet offer considerable research potential. For these reasons, multi-label annotation of an image is necessary to present more details of the image and improve the performance of scene understanding. In addition, multi-label annotation can reveal correlations among the labels: for example, “road” and “car” tend to co-occur in a remote sensing image, and “grass” and “water” often accompany “golf course”. These correlations provide a better understanding of scene images, which is not possible with single-label annotation. Annotating images with multiple labels is therefore a vital step for semantic scene understanding in remote sensing. Moreover, previous studies have shown that traditional machine learning methods cannot adequately mine ground object scene information (Cordts et al., 2016; Jeong et al., 2019; Kendall et al., 2015; Zhu et al., 2019).
Recently, deep learning approaches have shown great potential for providing solutions to problems related to semantic s