Wenxiao Cai, Wankou Yang
Southeast University
Corresponding author: wkyang@seu.edu.cn
Abstract
Image stitching with globally natural structures is of paramount significance, with two main goals: pixel-level alignment and distortion prevention. Existing approaches align well yet fall short in maintaining object structures. In this paper, building on strong alignment performance, we endeavour to safeguard the overall OBJect-level structures within images based on a Global Similarity Prior (OBJ-GSP). Our approach leverages semantic segmentation models, such as the Segment Anything Model family, to extract the contours of any object in a scene. Triangular meshes are employed in image transformation to protect the overall shapes of objects. The balance between alignment and distortion prevention is achieved by allowing the object meshes to strike a balance between similarity and projective transformation. We also demonstrate that object-level semantic information is necessary in low-altitude aerial image stitching. Additionally, we propose StitchBench, the largest image stitching benchmark with the most diverse scenarios. Extensive experimental results demonstrate that OBJ-GSP outperforms existing methods in both pixel alignment and shape preservation. Code and dataset are publicly available at https://github.com/RussRobin/OBJ-GSP.
1 Introduction
Image stitching aims to align multiple images and create a composite image with a larger field of view. It is widely used across diverse domains, including smartphone panoramic photography [40], robotic navigation [8], and virtual reality [1, 19]. In recent years, the problem of alignment has largely been addressed: methods such as APAP [45] and GSP [6] divide images into multiple grids, compute local transformation matrices within each grid, and combine them with global transformation information to achieve precise alignment in overlapping regions. Thus, the main concern of image stitching today is to prevent distortion while retaining good alignment performance.
Existing works extract lines in images and preserve them during image transformation. LPC [17] extracts and matches lines for alignment. Building on the strong alignment performance of GSP, GES-GSP [11] adds the similarity transformation of line structures into consideration. However, (a) these methods preserve only line structures, ignoring overall, object-level structures; (b) focusing only on individual lines can be quite chaotic and mislead the model (Fig. 2); and (c) straight or curved structures simply do not exist in some scenes.
Since an important criterion by which humans judge whether an image looks natural is the naturalness of the object structures within it, our key insight is to extract these structures and preserve them during stitching. State-of-the-art segmentation models can now identify almost any object with superior performance. We use them to obtain object shapes, which represent the image structure, and then use triangle meshes to preserve these segmented shapes during stitching. We generate triangle meshes within each object; during image transformation, these meshes tend toward a balance between projective and similarity transformations, effectively preserving the structure of the objects. As demonstrated in Fig. 1 (f), our method excels at maintaining the overall structure of images by preventing distortion of prominent object shapes. OBJ-GSP capitalizes on object-level preservation, leveraging segmentation models to extract geometric information. As shown in Fig. 1 (e), segmentation models treat objects as cohesive entities, transcending the individual lines and curves adopted in previous works [11, 17]. This allows a more nuanced understanding of the relationships between individual geometric structures and, unlike previous work, it functions even when there are no prominent linear structures in the images.
Previous works often used their own collected images without testing on datasets from other papers. We unified the datasets from previous works and incorporated our own hand-held camera and aerial images to create StitchBench, the most complete benchmark to date. We also demonstrate that semantic segmentation in the OBJ-GSP pipeline is necessary for low-altitude aerial image stitching. When a drone flies at low altitude, the camera moves significantly, and there is a considerable difference in distance to the camera between rooftops and the ground. These conditions violate the standard assumptions of image stitching [2], which require a fixed camera optical center or distant scenes, making direct stitching infeasible. In this case, it is necessary to use a semantic segmentation model to identify the houses and then perform orthorectification to project them onto the ground before stitching.
To summarize, the main contributions of the proposed OBJ-GSP include:
- We propose to preserve object contours before and after image transformation to maintain the overall structure of the image. Object-level shapes are available even in images without obvious linear structures and are not misled by excessively noisy line detections.
- We introduce segmentation models into image stitching, facilitating the extraction of any object in the scene. Furthermore, we demonstrate that segmentation and OBJ-GSP are crucial for low-altitude aerial image stitching.
- We collect StitchBench, by far the largest and most diverse image stitching benchmark.
2 Related work
2.1 Grid-based image stitching
Autostitch [2], a pioneering work in image stitching, matches feature points and aligns them by homography transformation. Building upon this foundation, numerous stitching algorithms partition images into grids, compute geometric transformations for each grid, and combine them into a global transformation to align overlapping regions and smoothly transition into non-overlapping areas. APAP [45], AANAP [23], and GSP [6] have evolved over time, essentially addressing most alignment problems in images. However, their grid deformation methods have no knowledge of object shapes; they pay too much attention to alignment and thus cause geometric distortion. To address this, LPC [17] and GES-GSP [11] propose to preserve line structures. However, (a) they preserve only line structures, ignoring the overall structure of objects; (b) an excessive number of lines without object structure information can mislead the model; and (c) some scenes contain no straight or curved structures. We find that large segmentation models such as SAM [20] can segment all types of objects and provide their contours, which helps image stitching maintain shape consistency, so we incorporate the SAM family into our method. We use triangular meshes to protect the overall object-level geometric structure and establish connections between dispersed geometric transformations, achieving superior results.
2.2 Geometric structure extraction
Previous works employ the Line Segment Detector [37] to detect straight lines in images, and edge detection methods such as Canny [5] and HED [39] to identify edges. However, these methods require line structures to be present in the image. When textures are unclear or lighting is poor, conventional methods cannot extract lines effectively, whereas large segmentation models still operate successfully in these scenarios. We employ the SAM family and EfficientSAM [41] to extract object-level structures and preserve them during stitching. Notably, segmentation models are not limited by line structures and can segment almost any object. As the accuracy and speed of SAM-type methods improve [41, 46], the quality and speed of our stitching technique will improve with them.
2.3 Deep-learning based stitching
In recent years, several methods [18] such as UDIS [30] have modeled certain image stitching steps as unsupervised deep learning problems, leading to notable advances in this field. UDIS++ [31] also addresses the distortion problem on the basis of good alignment performance, which aligns closely with our goals. We adhere to the traditional grid-transformation approach of the stitching literature, while UDIS++ provides an entirely new deep-learning-based pipeline, although its current performance does not match ours.
3 The proposed method
OBJ-GSP introduces SAM to segment objects and obtain their structural contours, preserving object-level structures while aligning feature points in stitching. Locally, our approach retains the original perspective of each image. On a global scale, it seeks to preserve the overall structure [6]. Moreover, at the object level, we ensure the integrity of objects within the images, preventing distortion. To this end, we take four aspects into consideration: alignment, global similarity, local similarity, and object-level shape preservation. A grid mesh is adopted to guide the image deformation, where $V$ and $E$ represent the sets of vertices and edges within the grid mesh, as shown in Fig. 3. Image stitching methods aim to find a set of deformed vertex positions, denoted as $\hat{V}$, that minimizes the energy function $\Psi(\hat{V})$.
The alignment term extracts feature points with an extractor (e.g., SIFT) and matches them into pairs with a matcher. For each feature point pair $(p_j, \hat{p}_j)$ in the set $\mathcal{P}$ of all matched pairs, $\tilde{v}(p_j)$ represents the position of $p_j$ as a linear combination of four vertex positions: the algorithm combines the coordinates of the four vertices of the enclosing grid cell through bilinear interpolation. By optimizing the positions of grid vertices after geometric transformation, it aims to bring $\tilde{v}(p_j)$ as close as possible to $\tilde{v}(\hat{p}_j)$. Therefore, the energy term is defined as:
$$E_a(\hat{V}) = \sum_{(p_j,\hat{p}_j)\in\mathcal{P}} \left\| \tilde{v}(p_j) - \tilde{v}(\hat{p}_j) \right\|^2 \qquad (1)$$
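The bilinear representation in Eq. 1 is straightforward to make concrete. Below is a minimal NumPy sketch (the helper names are ours, not from the released code) of expressing a feature point as a fixed linear combination of the four vertices of its grid cell, so that the same weights can be reused once the vertices are deformed:

```python
import numpy as np

def bilinear_weights(p, cell_origin, cell_size):
    """Bilinear coefficients of point p w.r.t. the four vertices of its
    grid cell, ordered (top-left, top-right, bottom-left, bottom-right)."""
    u = (p[0] - cell_origin[0]) / cell_size[0]
    v = (p[1] - cell_origin[1]) / cell_size[1]
    return np.array([(1 - u) * (1 - v), u * (1 - v), (1 - u) * v, u * v])

def warped_position(weights, cell_vertices):
    """Position of the feature point as a linear combination of the
    (possibly deformed) cell vertices; cell_vertices is a 4x2 array."""
    return weights @ cell_vertices

# One matched pair contributes || v~(p_j) - v~(p_hat_j) ||^2 to E_a:
w1 = bilinear_weights([12.5, 7.0], cell_origin=[10.0, 5.0], cell_size=[20.0, 20.0])
w2 = bilinear_weights([3.0, 4.0], cell_origin=[0.0, 0.0], cell_size=[20.0, 20.0])
V1 = np.array([[10.0, 5.0], [30.0, 5.0], [10.0, 25.0], [30.0, 25.0]])
V2 = np.array([[0.0, 0.0], [20.0, 0.0], [0.0, 20.0], [20.0, 20.0]])
residual = warped_position(w1, V1) - warped_position(w2, V2)
```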
The local similarity term aims to ensure that the transition from overlapping to non-overlapping regions is natural. Each grid cell undergoes a similarity transformation to minimize shape distortion. For an edge $(j, k) \in E$, let $e_{jk} = v_k - v_j$, let $S_{jk}$ denote its similarity transformation, and suppose $e_{jk}$ transforms to $\hat{e}_{jk} = \hat{v}_k - \hat{v}_j$ after deformation. The energy term is defined as:
$$E_l(\hat{V}) = \sum_{(j,k)\in E} \left\| \hat{e}_{jk} - S_{jk}\, e_{jk} \right\|^2 \qquad (2)$$
The global similarity term operates on a global scale to ensure the entire image undergoes a similarity transformation. The GSP algorithm evaluates the scale and rotation within the global image transformation and computes a scale $s_i$ and rotation angle $\theta_i$ for each image $i$. Thus, the energy term is defined as:
$$E_g(\hat{V}) = \sum_{(j,k)\in E} \left\| \hat{e}_{jk} - s_i R(\theta_i)\, e_{jk} \right\|^2 \qquad (3)$$

where $R(\theta_i)$ is the 2D rotation matrix by angle $\theta_i$.
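The object-level term defined next requires object contours, which we obtain from SAM. The following sketch uses the official segment-anything package and OpenCV; the checkpoint path and the minimum-area threshold are our assumptions (cf. Sec. 10.2, where extremely small masks are excluded):

```python
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint and run the automatic mask generator.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("input.jpg"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts with a boolean "segmentation"

contours = []
for m in masks:
    if m["area"] < 500:  # drop extremely small masks (assumed threshold)
        continue
    seg = m["segmentation"].astype(np.uint8)
    cs, _ = cv2.findContours(seg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contours.extend(cs)
```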
After obtaining the contours, we generate a triangular mesh for each semantic object, preserving the shape of the object through similarity transformations within the triangle mesh. Unlike the As-Rigid-As-Possible (ARAP) [15] method, we reduce computational complexity by directly locating the center of the object and connecting it to sampling points on the object's semantic boundary to form a triangular mesh. In Fig. 3, $c$ represents the object's center, while $b_1$ and $b_2$ are sampling points on the semantic boundary of the object, forming a triangle with vertices $(v_1, v_2, v_3) = (c, b_1, b_2)$. $(u, v)$ are the known coordinates of $v_1$ in the local coordinate frame spanned by the edge $(v_2, v_3)$ and the orthogonal axis obtained by rotating this edge counterclockwise by 90 degrees, so that:
$$v_1 = v_2 + u\,(v_3 - v_2) + v\, R_{90}\,(v_3 - v_2), \qquad R_{90} = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \qquad (4)$$
After the mesh deformation, $v_2$ and $v_3$ are transformed into $\hat{v}_2$ and $\hat{v}_3$. To preserve the shape of the segmentation result, we want the triangle to undergo a similarity transformation, keeping $u$ and $v$ unchanged. Therefore, we desire to transform $v_1$ into:
$$\hat{v}_1^{\,d} = \hat{v}_2 + u\,(\hat{v}_3 - \hat{v}_2) + v\, R_{90}\,(\hat{v}_3 - \hat{v}_2) \qquad (5)$$
The corresponding energy term for the transformed $\hat{v}_1$ is calculated as:
$$E_{\hat{v}_1} = \left\| \hat{v}_1 - \hat{v}_1^{\,d} \right\|^2 \qquad (6)$$
Similar energy terms are defined for $\hat{v}_2$ and $\hat{v}_3$, resulting in the error sum for a triangle:
$$E_{\triangle} = E_{\hat{v}_1} + E_{\hat{v}_2} + E_{\hat{v}_3} \qquad (7)$$
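Eqs. 4-7 amount to measuring how far each deformed triangle is from a similarity transform of its original. A self-contained NumPy sketch of this computation (helper names are our own):

```python
import numpy as np

R90 = np.array([[0.0, 1.0], [-1.0, 0.0]])  # the 90-degree rotation in Eq. (4)

def local_coords(v1, v2, v3):
    """(u, v) of v1 in the frame spanned by edge (v2 -> v3) and its rotation."""
    e = v3 - v2
    basis = np.stack([e, R90 @ e], axis=1)  # 2x2 matrix with columns [e, R90 e]
    u, v = np.linalg.solve(basis, v1 - v2)
    return u, v

def desired_vertex(v2_hat, v3_hat, u, v):
    """Similarity-transformed target for v1 (Eq. 5); u and v stay fixed."""
    e_hat = v3_hat - v2_hat
    return v2_hat + u * e_hat + v * (R90 @ e_hat)

def triangle_energy(tri, tri_hat):
    """Sum of squared deviations from the similarity targets over all three
    vertices of one triangle (Eqs. 6-7). tri, tri_hat: lists of 2D points."""
    energy = 0.0
    for i, j, k in [(0, 1, 2), (1, 2, 0), (2, 0, 1)]:
        u, v = local_coords(tri[i], tri[j], tri[k])
        target = desired_vertex(tri_hat[j], tri_hat[k], u, v)
        energy += float(np.sum((tri_hat[i] - target) ** 2))
    return energy
```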
Initially, our approach constructs the triangular mesh by selecting sampling points and the object's center. Unlike ARAP [15], we do not employ equilateral triangular meshes, as objects segmented from the image would often produce very small equilateral triangles. Experimental results demonstrate that this approximation not only has no adverse impact on the final outcome but also reduces computational complexity. With $b_{i1}, \dots, b_{iN_i}$ denoting the boundary sampling points of object $i$, the center is taken as:
$$c_i = \frac{1}{N_i} \sum_{j=1}^{N_i} b_{ij} \qquad (8)$$
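The fan-shaped mesh construction can be sketched directly from a SAM mask; the sampling count and function names below are our choices, not from the released implementation:

```python
import cv2
import numpy as np

def fan_triangulation(mask, n_samples=30):
    """Connect the object's center to sampled boundary points, producing
    the fan of triangles used by the object-level term (cf. Eq. 8)."""
    seg = mask.astype(np.uint8)
    cs, _ = cv2.findContours(seg, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    boundary = max(cs, key=cv2.contourArea).reshape(-1, 2).astype(float)
    step = max(1, len(boundary) // n_samples)
    samples = boundary[::step]
    center = samples.mean(axis=0)  # approximate object center (Eq. 8)
    triangles = [(center, samples[i], samples[(i + 1) % len(samples)])
                 for i in range(len(samples))]
    return center, samples, triangles
```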
We extract semantic object structures from a single image using semantic segmentation, and $N_i$ denotes the total number of sampling points within geometric structure $i$. Similar to GES-GSP [11], $w_{ij}$ is a coefficient calculated from the positions of the sampling points. Consequently, the total error equation is as follows:
$$E_o(\hat{V}) = \sum_{i} \frac{1}{N_i} \sum_{j=1}^{N_i} w_{ij}\, E_{\triangle_{ij}} \qquad (9)$$
To conclude, our objective function is given by:
$$\Psi(\hat{V}) = E_a(\hat{V}) + \lambda_l E_l(\hat{V}) + \lambda_g E_g(\hat{V}) + \lambda_o E_o(\hat{V}) \qquad (10)$$
Eq. 10 can be solved with linear least-squares optimization. For a fair comparison, our weighting parameters are kept identical to those of GES-GSP, with our object-level weight $\lambda_o$ taking the role of the geometric-structure term weight in GES-GSP.
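Since every term in Eq. 10 is quadratic in the unknown vertex coordinates, minimization reduces to a sparse linear least-squares problem. A schematic sketch of this reduction (the row-building helper is ours; real alignment and similarity rows would be appended where indicated):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import lsqr

# Unknowns: x = (x_1, y_1, x_2, y_2, ...), the deformed vertex coordinates.
rows, cols, vals, rhs = [], [], [], []

def add_row(indices, coeffs, target, weight=1.0):
    """One quadratic term: weight * (sum_c coeffs[c] * x[indices[c]] - target)^2."""
    r = len(rhs)
    for c, a in zip(indices, coeffs):
        rows.append(r); cols.append(c); vals.append(weight * a)
    rhs.append(weight * target)

# Rows for Eqs. (1)-(3) and (9) are appended here; as a toy example,
# ask unknown 0 to stay near 10.0 with weight 2.0:
add_row(indices=[0], coeffs=[1.0], target=10.0, weight=2.0)

A = sparse.csr_matrix((vals, (rows, cols)))
x = lsqr(A, np.asarray(rhs))[0]  # deformed vertex positions
```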
| Models / Datasets | OBJ-GSP | AANAP | APAP | CAVE | DFW | DHW |
|---|---|---|---|---|---|---|
| **Mean Distorted Residuals (MDR ↓)** | | | | | | |
| GSP (ECCV16) | 1.15296 | 1.06183 | 1.25495 | 0.90884 | 0.98457 | 1.08755 |
| GES-GSP (CVPR22) | 1.14366 | 1.06213 | 1.24249 | 0.90821 | 0.98034 | 1.05619 |
| OBJ-GSP (ours) | 1.12229 | 1.05930 | 1.20123 | 0.89731 | 0.97259 | 1.00496 |
| Improvement (%) | 1.9 | 0.3 | 3.3 | 1.2 | 0.8 | 4.9 |
| **Naturalness Image Quality Evaluator (NIQE ↓)** | | | | | | |
| UDIS (TIP21) | 3.69421 | 3.01517 | 3.69421 | - | 5.74137 | 3.28645 |
| UDIS++ (ICCV23) | 3.34003 | 2.95493 | 3.56812 | 4.07702 | 5.09680 | 3.23392 |
| GSP (ECCV16) | 2.66597 | 2.84241 | 3.4356 | 4.04708 | 5.61905 | 2.75485 |
| GES-GSP (CVPR22) | 2.64986 | 2.77220 | 3.48713 | 4.03835 | 5.71544 | 2.70838 |
| OBJ-GSP (ours) | 2.54906 | 2.74965 | 3.39280 | 4.01565 | 5.69104 | 2.60825 |
| Improvement (%) | 3.8 | 0.8 | 2.7 | 0.6 | 0.4 | 3.7 |

| Models / Datasets | GES-GSP | LPC | REW | SEAGULL | SVA | SPHP |
|---|---|---|---|---|---|---|
| **Mean Distorted Residuals (MDR ↓)** | | | | | | |
| GSP (ECCV16) | 1.06986 | 1.30562 | 1.16192 | 1.14467 | 1.51158 | 1.17784 |
| GES-GSP (CVPR22) | 1.15462 | 1.11993 | 1.47197 | 1.14340 | 1.04473 | 1.22256 |
| OBJ-GSP (ours) | 0.98288 | 1.10622 | 1.08635 | 1.08296 | 1.47813 | 1.07699 |
| Improvement (%) | 5.9 | 9.5 | 5.9 | 3.3 | -0.4 | 5.8 |
| **Naturalness Image Quality Evaluator (NIQE ↓)** | | | | | | |
| UDIS (TIP21) | 5.02442 | 3.76994 | 3.61888 | 4.67437 | 8.02090 | 4.35149 |
| UDIS++ (ICCV23) | 4.93279 | 3.66565 | 3.67661 | 4.38520 | 7.5419 | 4.1248 |
| GSP (ECCV16) | 3.84897 | 4.28546 | 3.18549 | 5.10784 | 7.00495 | 3.04781 |
| GES-GSP (CVPR22) | 3.79240 | 3.35315 | 2.75234 | 4.69390 | 6.96670 | 2.97173 |
| OBJ-GSP (ours) | 3.70041 | 3.23057 | 2.81480 | 4.08903 | 6.96149 | 2.49712 |
| Improvement (%) | 2.4 | 3.7 | -2.2 | 12.9 | 0.5 | 16.0 |
4 Experiments
4.1 StitchBench
Previous works often collected a small number of images themselves and performed only qualitative tests. Moreover, they have different focuses, such as parallax between the foreground and background, sparse features in natural scenery, precise alignment with no distinct structures to preserve, or distinct line structures, without comprehensively evaluating performance across a wide range of scenarios. To address this, we present the most extensive image stitching benchmark to date, StitchBench, which includes 122 pairs of images drawn from 12 works. We collect 18 pairs of images captured by hand-held cameras, in which the preservation of object structures is crucial. StitchBench also includes 7 sets of urban scenes captured by low-altitude drones, featuring tall buildings and requiring the assistance of segmentation models. To overcome our subjective preferences and the limited locations where we collected images, we also include test images used in previous state-of-the-art works, namely AANAP [23], APAP [45], CAVE [32], DFW [22], DHW [12], GES-GSP [11], LPC [17], SEAGULL [25], REW [21], SVA [26], and SPHP [24]. StitchBench is currently the most comprehensive stitching test dataset; an algorithm must be generally applicable to perform well on all its subsets, aligning well while naturally preventing distortion.
Evaluation metrics. We quantitatively assess stitching quality from two perspectives: distortion prevention and alignment. First, we employ the Mean Distorted Residuals (MDR) metric to measure the degree of image distortion. Intuitively, if mesh points that were originally collinear remain collinear after stitching, the result has minimal distortion. Second, we employ the Naturalness Image Quality Evaluator (NIQE) [29] metric to evaluate alignment performance. We argue that NIQE is a more intuitive and better indicator of alignment than RMSE, SSIM, and PSNR, as it measures image clarity: stitching results with misalignment produce blurry areas, leading to worse NIQE scores.
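The paper describes MDR only informally here, so the following is a sketch of the collinearity idea rather than the official implementation: for each set of mesh points that were collinear before stitching, fit a line to their warped positions and average the perpendicular residuals.

```python
import numpy as np

def mean_distorted_residual(point_rows):
    """point_rows: iterable of Nx2 arrays holding the warped positions of
    points that were collinear before stitching. Lower = less distortion."""
    residuals = []
    for pts in point_rows:
        pts = np.asarray(pts, dtype=float)
        centered = pts - pts.mean(axis=0)
        # Best-fit line direction = leading singular vector;
        # the trailing singular vector is the line's unit normal.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        normal = vt[-1]
        residuals.append(np.abs(centered @ normal).mean())
    return float(np.mean(residuals))
```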
4.2 Baselines
We compare with GSP [6] and GES-GSP [11]. UDIS [30] and UDIS++ [31] are well-known works applying deep learning to image stitching. Since they do not explicitly use feature points, we are unable to measure their quality with MDR. We provide a detailed comparison between OBJ-GSP and UDIS++ in the supplementary materials.
4.3 Results
Quantitative results. Table 1 reports MDR and NIQE on datasets used in other stitching algorithms as well as our own. We outperform GSP and GES-GSP in both alignment and shape preservation. UDIS++ [31] is a strong attempt at deep-learning-based image stitching, but its performance still does not surpass ours.
Qualitative results.
Figs. 1 and 2 elucidate the reasons behind the superior performance of OBJ-GSP. With the assistance of semantic segmentation, we place greater emphasis on preserving critical structures and ensure holistic, object-level protection of objects within the images. Figs. 4 and 5 illustrate the stitching outcomes of six different methods, where we use straight lines and boxes to highlight alignment and distortion. Please refer to our supplementary material for more qualitative results.
4.4 Low-Altitude Aerial Image Stitching
Image stitching requires meeting one of two conditions [2]: either the camera's optical center remains stationary while the camera rotates, or the scene consists only of objects far from the camera. Existing stitching algorithms mainly address cases where these conditions are slightly violated. For low-altitude aerial images, where the aircraft flies at around 100 meters but buildings are no less than 20-40 meters tall, the camera's optical center moves significantly during drone shooting, completely failing both assumptions. Moreover, if the left and right walls of a building are captured in two separate shots, it would be a logical error to include both walls in a panorama (for a cube, at most three faces can be seen at a time, and it is impossible to see two opposing faces simultaneously). For stitching low-altitude aerial images, we therefore first use a semantic segmentation model to segment the roofs and walls, then calculate the height of the roofs and orthographically project the buildings onto the ground plane before stitching. In this scenario, the semantic segmentation model is essential to the stitching process [3]. The stitching pipeline and results are shown in Fig. 6.
5 Ablation Studies and Discussions
5.1 Lightweight SAM backbones
To assess the influence of semantic segmentation quality on stitching performance, we conduct a comparative analysis across three backbones of the SAM [20] model, namely ViT-B, ViT-L, and ViT-H [10], in order of increasing size. Larger models are inherently capable of capturing more fine-grained semantic details. The corresponding stitching results are depicted in Fig. 7; results based on ViT-B and ViT-L exhibit some blurriness and minor distortions. In terms of MDR, our model with the three backbones achieves improvements of 1.9%, 2.7%, and 3.6% over the baseline GES-GSP [11], respectively. The application of EfficientSAM [41], the time consumption of OBJ-GSP, and its real-world applications are covered in the supplementary materials.
5.2 Sampling strategies
Incorporating SAM [20] with the ViT-H [10] backbone, triangular mesh sampling yields superior results compared to the triangle sampling proposed in GES-GSP [11]. Triangle sampling preserves the shape of individual lines but fails to maintain the overall geometric structure of the image. As illustrated in Fig. 8, we outline the structural elements in the original image with red dashed lines and superimpose them onto the results of triangle sampling and triangular mesh sampling. Triangular mesh sampling retains the positional relationships between the lines present in the original image.
6 Conclusion
In this paper, we propose OBJ-GSP, an OBJect-level Geometric Structure Preserving algorithm for natural image stitching, a novel approach to achieving natural and visually pleasing composite images. OBJ-GSP protects object shapes by first segmenting them out and then preserving their structures with triangle meshes. We also demonstrate that semantic segmentation is necessary for low-altitude aerial image stitching. We collect new test image pairs in common scenes and aerial imaging, and combine them with images from previous works, to establish the most comprehensive image stitching benchmark to date: StitchBench. Detailed experiments against comprehensive baselines on StitchBench demonstrate the effectiveness of OBJ-GSP.
7 Extensive Discussions
7.1 Limitations
While OBJ-GSP achieves state-of-the-art image stitching performance by extracting object-level geometric structures with semantic segmentation and preserving them with triangle mesh sampling, it introduces a large semantic segmentation model into the pipeline, resulting in higher computational costs. However, as semantic segmentation techniques develop, lighter versions of SAM are emerging [41, 46], which will improve the speed of our method. According to our analysis, the quality of geometric structure extraction significantly impacts the final results, so our method is constrained by the quality of SAM's output: smaller variants such as SAM [20] with ViT-B/L [10] do not perform as well as SAM with ViT-H. To stitch a pair of 800×600 images, SAM spends about 25 s on an RTX2090 with 8-24 GB of GPU memory, depending on the backbone; the C++ ONNX implementation of SAM ViT-B takes about 1.5 min on CPU. Mesh optimization and image processing cost less than 4 s on an Intel i5 CPU, almost the same as GES-GSP [11]. OBJ-GSP needs more computational resources and time than GES-GSP, but the stitching quality is also better: the time cost of SAM exceeds that of the line detection methods in GES-GSP, while the time for triangle mesh optimization is almost identical.
7.2 Applications of OBJ-GSP
In many fields, there is a need for high-quality stitched images, even at the expense of long running times and significant computational resources. In medical image processing [36, 34, 48], for instance, stitching multiple pathological slice images to reconstruct an entire tissue structure or a three-dimensional organ model demands high precision and quality; such tasks typically require precise alignment and seamless fusion and can tolerate longer computation to ensure accuracy. Similarly, photography [27] and virtual tourism [35, 38] applications stitch numerous high-resolution images to generate high-quality panoramas. The same holds in remote sensing [44, 7, 47] and the movie industry [28, 14]. The emergence of faster and more accurate segmentation models, such as EfficientSAM [41], makes our method even more promising; we provide detailed comparisons in the supplementary materials.
7.3 Post-processing in image stitching
Our method primarily addresses computing transformation matrices to achieve alignment and shape preservation. Post-processing techniques can be combined with our approach for a more natural stitching result: blending [2, 42, 43] and seam-driven [13] methods can further reduce blurring, while global straightening [12] can decrease distortion.
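As a simple illustration of the blending step, the sketch below feather-blends two warped images by weighting each pixel with its distance to the image boundary; this is a lightweight stand-in for the multi-band blending of [2], and the function is our own, not part of the OBJ-GSP release:

```python
import cv2
import numpy as np

def feather_blend(img1, img2, mask1, mask2):
    """Distance-weighted blending of two warped images on a shared canvas.
    mask1/mask2 are binary coverage masks of the two warped images."""
    w1 = cv2.distanceTransform(mask1.astype(np.uint8), cv2.DIST_L2, 3)
    w2 = cv2.distanceTransform(mask2.astype(np.uint8), cv2.DIST_L2, 3)
    total = w1 + w2
    total[total == 0] = 1.0          # avoid division by zero off-canvas
    alpha = (w1 / total)[..., None]  # per-pixel weight for img1
    return (alpha * img1 + (1.0 - alpha) * img2).astype(img1.dtype)
```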
7.4 Broader impacts
It is important to acknowledge that we do not explicitly discuss broader impacts of the proposed OBJ-GSP image stitching algorithm, such as fairness or bias. The Segment Anything Model (SAM) [20] has discussed its broader impacts regarding geographic and income representation, as well as fairness in segmenting people. Further research into how our algorithm may interact with other aspects of image stitching is encouraged.
8 Ablation of EfficientSAM and mesh sampling
We propose the utilization of SAM-based segmentation and mesh sampling to address distortion and misalignment during stitching; both components are indispensable for object-level shape preservation. Fig. 9 illustrates the distinct results achieved under the same semantic segmentation output using line-based triangle sampling and object-level mesh sampling: mesh sampling recognizes object structures and effectively protects objects from distortion. Furthermore, with advances in semantic segmentation, SAM-based methods such as EfficientSAM [41] are becoming faster, which will greatly expedite our stitching approach. In Fig. 9, there is no significant difference between the results obtained with SAM + mesh sampling and EfficientSAM + mesh sampling, yet EfficientSAM consumes only a fraction of SAM's runtime. As even faster and more accurate SAM-based methods emerge, our stitching method will become correspondingly faster and more precise.
9 StitchBench Metadata
We employed handheld smartphones to capture the OBJ-GSP subset of StitchBench. During acquisition, we took care to minimize translational movement of the smartphone, relying primarily on rotation to adjust the framing. This keeps the disparity between images manageable and prevents situations where the occlusion relationships between two images differ too much for successful stitching. We amassed a total of 18 image pairs, encompassing diverse scenes such as rooms, culinary creations, sculptures, gardens, rivers, ponds, industrial facilities, roads, and building exteriors.
We also collect 7 sets of aerial images, each consisting of about 9 pairs of images of urban scenes. We fly drones at 100-120 meters over urban areas containing buildings, roads, trees, etc. [3]. Images are collected with a DJI Mavic Air 2 at a resolution of 3000 × 4000 pixels, with the drone camera kept perpendicular to the ground (bird's-eye view).
Additionally, we curate existing image stitching datasets [23, 45, 32, 22, 12, 11, 17, 25, 21, 26, 24] to supplement our data collection. Ultimately, we constructed a dataset of 122 groups of images, the largest currently employed in image stitching, and release it to the public for further research and development.
10 Reestablishing baselines
10.1 Implementation details
We evaluated the results of GSP [6] and GES-GSP [11] using their publicly available C++ code. Our stitching code is also implemented in C++. Our implementation of SAM includes two approaches. Drawing from the Segment Anything C++ Wrapper [9], we exported the SAM [20] encoder and decoder into Open Neural Network Exchange (ONNX) format and subsequently replicated SAM’s automatic mode within C++. To achieve the best possible stitching results, we also directly implemented the semantic segmentation component using the SAM’s publicly available Python code and utilized their semantic segmentation results. The stitching part of our experiments ran on the CPU, while the SAM modules were capable of running on both CPU and GPU.
10.2 Baselines
We replicated the results reported in the literature for GSP [6], GES-GSP [11], APAP [45], and SPHP [24] by running their publicly available codebases with default parameter settings. For the structural alignment component, we employed the executable provided by Autostitch [2]. The GES-GSP release includes experimental data for both GES-GSP and GSP. In our method's experiments, we maintained consistent parameter settings across all trials. For the structure extraction stage, we utilized the official SAM code, excluding masks with extremely small areas.
11 Algorithm and Results of Low-Altitude Aerial Image Stitching
11.1 Why is Semantic Segmentation Necessary
For low-altitude drone aerial photography in urban areas, the drone's flight altitude is low while the buildings are tall. Due to the significant distance difference between the drone camera and the buildings, the transformation matrices for buildings and rooftops differ from those for the ground across images, so direct stitching leads to heavy ghosting and unnaturally distorted building structures. Additionally, information must be selectively discarded during low-altitude aerial stitching in urban areas: if buildings are approximated as rectangular prisms and the left and right walls are captured in two separate shots, it is impossible to retain both walls in the stitched image (a person cannot see both opposing sides of a rectangular prism simultaneously). Therefore, this paper proposes first using a segmentation model to identify buildings and walls in the scene. The walls are removed from the images, the ratio of building height to drone height is calculated from the different transformation relationships for the ground and the buildings, and the buildings are then projected onto the ground plane before stitching. The importance of segmentation in aerial image stitching is shown in Fig. 10.
11.2 OBJ-GSP in Aerial Image Stitching
In low-altitude aerial images, there are two kinds of planes of interest: the ground $G$ and the roofs $R_i$, $i = 1, \dots, n$, where $n$ is the number of roofs. Semantic segmentation models are adopted to segment roofs and walls, with the remaining pixels regarded as ground. We mask out walls and then orthographically project roofs onto the ground. For correctly matched feature points $p_1$ and $p_2$ on $G$, and $q_1$ and $q_2$ on $R_i$ (subscripts indicating the two images), we aim to find a transformation matrix $T_i$ that projects roof $R_i$ to the ground before stitching. After projection, the images can be stitched with a global transformation matrix $H_G$, which is also the transformation between the grounds:
$$T_i\, q_2 \simeq H_G\, T_i\, q_1 \qquad (11)$$
where $q$ is in homogeneous coordinates $(x, y, 1)^\top$. Let the height of roof $R_i$ above the ground be $h_i$ and the flight height of the drone be $H_d$. For each pixel $(x, y)$ on the roof, the orthographic projection transformation is
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \frac{H_d - h_i}{H_d} \begin{bmatrix} x - w/2 \\ y - h/2 \end{bmatrix} + \begin{bmatrix} w/2 \\ h/2 \end{bmatrix} \qquad (12)$$
where $w/2$ and $h/2$ are half the pixel width and height of the image. If $T_i$ takes the form of a homography matrix and feature point matches are available on $R_i$, Eq. 11 expands to mutually independent quadratic equations in the unknown entries of $T_i$. This overdetermined system can be solved by methods such as Newton iteration. After solving for $T_i$, the orthographic projection map is generated by transforming roof pixels to the ground one by one. The global matrix $H_G$ is an affine transformation matrix and can be solved similarly to the orthographic projection matrix.
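A sketch of the per-pixel projection of Eq. 12, under our assumption that the nadir point coincides with the image center:

```python
import numpy as np

def project_roof_to_ground(xy, image_size, h_roof, h_drone):
    """Orthographically project roof pixels onto the ground plane (Eq. 12):
    scale coordinates toward the image center by (H_d - h_i) / H_d."""
    w, h = image_size
    center = np.array([w / 2.0, h / 2.0])
    scale = (h_drone - h_roof) / h_drone
    return center + scale * (np.asarray(xy, dtype=float) - center)

# Example: a roof pixel in a 3000x4000 image, 30 m roof, drone at 110 m;
# the pixel moves toward the nadir point.
print(project_roof_to_ground([2200.0, 900.0], (3000, 4000), 30.0, 110.0))
```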
11.3 Segmentation Models and Aerial Segmentation Datasets Used
We finetune Grounded SAM [33] on low-altitude drone datasets in which roofs and walls are annotated (the Varied Drone Dataset [4] and the ICG Drone Dataset [16]).
12 More qualitative results
In this section we provide more qualitative results. We mark images with boxes to indicate misalignment and use lines (and their intersections) to show distortion. Please refer to Figs. 12, 13, 5, 14, and 15. The results show that our object-level structure preservation prevents distortion and misalignment at the same time.
13 Successful cases because of Segmentation
In this section, we demonstrate the superiority of the Segment Anything Model [20] with qualitative results. Please refer to Fig. 16: SAM extracts object-level, complete structures of the ground and mountain, so OBJ-GSP preserves their structures better than GES-GSP, whose HED [39] edge detector extracts only fragmented edge information. In Fig. 17, we show that OBJ-GSP can even stitch images with poor lighting conditions while protecting their object-level structures.
14 Failure cases
As shown in Fig. 11, OBJ-GSP fails in cases of significant parallax. In the perspective of the left image, the corner of the building appears to be obtuse. However, based on the inference from the right image, the corner of the building should be a right angle. Consequently, the stitching algorithm becomes perplexed, unsure of how to preserve the shape of the house.
Only in rare instances does OBJ-GSP fail due to a malfunction of SAM [20], such as under inadequate illumination or with an exceedingly sparse set of features. Nevertheless, it is important to discuss the situations where the Segment Anything Model [20] fails, causing the proposed OBJ-GSP stitching algorithm to fail as well; please refer to Figs. 18 and 19. In our approach, SAM need not achieve precise segmentation of semantically meaningful objects; it only needs to recognize key object contours and boundaries. Therefore, OBJ-GSP is susceptible to SAM failure only in exceedingly rare instances where features are exceptionally sparse and objects are highly indistinct. In such cases, SAM's failure nullifies only our structure-preservation term, causing OBJ-GSP to degrade to the performance level of GSP [6]. For remaining cases of distortion and misalignment, mitigation strategies such as global straightening and multi-band blending, as employed in Autostitch [2], can be leveraged to alleviate these issues.
15 Comparison with UDIS++
UDIS [30] (TIP 2021) and UDIS++ [31] (ICCV 2023) are a family of attempts to address image stitching with deep learning frameworks. We compare with the revised version (UDIS++) as it performs better than UDIS. Like us, UDIS++ also aims to solve distortion issues on top of alignment. In the main text, we compute UDIS++'s NIQE to assess its alignment performance; since UDIS++ is not feature-point based, MDR, a metric for measuring distortion, cannot be computed for it. Therefore, in the supplementary material we provide results on four scenes with multiple sets of images to compare distortion levels intuitively. Our method outperforms UDIS++ in both distortion resilience and alignment. Please refer to Figs. 20, 21, and 22.
References
- Anderson et al. [2016] Robert Anderson, David Gallup, Jonathan T. Barron, Janne Kontkanen, Noah Snavely, Carlos Hernández, Sameer Agarwal, and Steven M. Seitz. Jump: virtual reality video. ACM Trans. Graph., 35:198:1–198:13, 2016.
- Brown and Lowe [2007] Matthew A. Brown and David G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74:59–73, 2007.
- Cai et al. [2023a] Wenxiao Cai, Songlin Du, and Wankou Yang. UAV image stitching by estimating orthograph with RGB cameras. J. Vis. Commun. Image Represent., 94:103835, 2023a.
- Cai et al. [2023b] Wenxiao Cai, Ke Jin, Jinyan Hou, Cong Guo, Letian Wu, and Wankou Yang. VDD: Varied drone dataset for semantic segmentation. ArXiv, abs/2305.13608, 2023b.
- Canny [1986] John F. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8:679–698, 1986.
- Chen and Chuang [2016] Yu-Sheng Chen and Yung-Yu Chuang. Natural image stitching with the global similarity prior. In European Conference on Computer Vision, 2016.
- Cui et al. [2020] Jiguang Cui, Man Liu, Zhitao Zhang, Shuqin Yang, and Jifeng Ning. Robust UAV thermal infrared remote sensing images stitching via overlap-prior-based global similarity prior model. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:270–282, 2020.
- Dewangan et al. [2014] Abhishek Kumar Dewangan, Rohit Raja, and Reetika Singh. An implementation of multi sensor based mobile robot with image stitching application. 2014.
- dinglufe [2023] dinglufe. Segment anything cpp wrapper. GitHub Repository, 2023. Accessed on 2023-09-20.
- Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2020.
- Du et al. [2022] Peng Du, Jifeng Ning, Jiguang Cui, Shaoli Huang, Xinchao Wang, and Jiaxin Wang. Geometric structure preserving warp for natural image stitching. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3678–3686, 2022.
- Gao et al. [2011] Junhong Gao, Seon Joo Kim, and M. S. Brown. Constructing image panoramas using dual-homography warping. CVPR 2011, pages 49–56, 2011.
- Gao et al. [2013] Junhong Gao, Yu Li, Tat-Jun Chin, and M. S. Brown. Seam-driven image stitching. In Eurographics, 2013.
- Guo et al. [2016] Heng Guo, Shuaicheng Liu, Tong He, Shuyuan Zhu, Bing Zeng, and M. Gabbouj. Joint video stitching and stabilization from moving cameras. IEEE Transactions on Image Processing, 25:5491–5503, 2016.
- Igarashi et al. [2005] Takeo Igarashi, Tomer Moscovich, and John F. Hughes. As-rigid-as-possible shape manipulation. ACM SIGGRAPH 2005 Papers, 2005.
- Institute of Computer Graphics and Vision, Graz University of Technology [2024] Institute of Computer Graphics and Vision, Graz University of Technology. ICG drone dataset. http://dronedataset.icg.tugraz.at, 2024. Accessed: 2024-07-21.
- Jia et al. [2021] Qi Jia, Zheng Li, Xin Fan, Haotian Zhao, Shiyu Teng, Xinchen Ye, and Longin Jan Latecki. Leveraging line-point consistence to preserve structures for wide parallax image stitching. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12181–12190, 2021.
- Jia et al. [2023] Qi Jia, Xiaomei Feng, Yu Liu, Xin Fan, and Longin Jan Latecki. Learning pixel-wise alignment for unsupervised image stitching. Network, 1(1):1, 2023.
- Kim et al. [2020] Hak Gu Kim, Heoun-taek Lim, and Yong Man Ro. Deep virtual reality image quality assessment with human perception guider for omnidirectional image. IEEE Transactions on Circuits and Systems for Video Technology, 30:917–928, 2020.
- Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. ArXiv, abs/2304.02643, 2023.
- Li et al. [2018] Jing Li, Zhengming Wang, Shiming Lai, Yongping Zhai, and Maojun Zhang. Parallax-tolerant image stitching based on robust elastic warping. IEEE Transactions on Multimedia, 20:1672–1687, 2018.
- Li et al. [2015] Shiwei Li, Lu Yuan, Jian Sun, and Long Quan. Dual-feature warping-based motion model estimation. 2015 IEEE International Conference on Computer Vision (ICCV), pages 4283–4291, 2015.
- Lin et al. [2015a] Chung-Ching Lin, Sharath Pankanti, Karthikeyan Natesan Ramamurthy, and Aleksandr Y. Aravkin. Adaptive as-natural-as-possible image stitching. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1155–1163, 2015a.
- Lin et al. [2015b] Chung-Ching Lin, Sharath Pankanti, Karthikeyan Natesan Ramamurthy, and Aleksandr Y. Aravkin. Adaptive as-natural-as-possible image stitching. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1155–1163, 2015b.
- Lin et al. [2016] Kaimo Lin, Nianjuan Jiang, Loong Fah Cheong, Minh N. Do, and Jiangbo Lu. SEAGULL: Seam-guided local alignment for parallax-tolerant image stitching. In European Conference on Computer Vision, 2016.
- Lin et al. [2011] Wen-Yan Lin, Siying Liu, Yasuyuki Matsushita, Tian-Tsong Ng, and Loong Fah Cheong. Smoothly varying affine stitching. CVPR 2011, pages 345–352, 2011.
- Lo et al. [2018] I-Chan Lo, Kuang-Tsu Shih, and Homer H. Chen. Image stitching for dual fisheye cameras. 2018 25th IEEE International Conference on Image Processing (ICIP), pages 3164–3168, 2018.
- Lyu et al. [2019] Wei Lyu, Zhong Zhou, Lang Chen, and Yi Zhou. A survey on image and video stitching. Virtual Real. Intell. Hardw., 1:55–83, 2019.
- Mittal et al. [2013] Anish Mittal, Rajiv Soundararajan, and Alan Conrad Bovik. Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters, 20:209–212, 2013.
- Nie et al. [2021] Lang Nie, Chunyu Lin, Kang Liao, Shuaicheng Liu, and Yao Zhao. Unsupervised deep image stitching: Reconstructing stitched features to images. IEEE Transactions on Image Processing, 30:6184–6197, 2021.
- Nie et al. [2023] Lang Nie, Chunyu Lin, Kang Liao, Shuaicheng Liu, and Yao Zhao. Parallax-tolerant unsupervised deep image stitching. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7365–7374, 2023.
- Nomura et al. [2007] Yoshikuni Nomura, Li Zhang, and Shree K. Nayar. Scene collages and flexible camera arrays. In Proceedings of the 18th Eurographics Conference on Rendering Techniques, pages 127–138, 2007.
- Ren et al. [2024] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: Assembling open-world models for diverse visual tasks. ArXiv, abs/2401.14159, 2024.
- Sakharkar and Gupta [2013] Vrushali S. Sakharkar and S. R. Gupta. Image stitching techniques - an overview. Int. J. Comput. Sci. Appl., 6(2):324–330, 2013.
- Setiawan et al. [2023] Muhammad Reza Setiawan, Muhamad Azrino Gustalika, and Muhammad Lulu Latif Usman. Implementation of virtual tour using image stitching as an introduction media of SMPN 1 Karangkobar to new students. Jurnal Teknik Informatika (Jutif), 4(5):1089–1098, 2023.
- Singla and Sharma [2014] Savita Singla and Reecha Sharma. Medical image stitching using hybrid of SIFT & SURF techniques. 2014.
- von Gioi et al. [2012] Rafael Grompone von Gioi, Jérémie Jakubowicz, Jean-Michel Morel, and Gregory Randall. LSD: a line segment detector. Image Process. Line, 2:35–55, 2012.
- Widiyaningtyas et al. [2018] Triyanna Widiyaningtyas, Didik Dwi Prasetya, and Aji Prasetya Wibawa. Web-based campus virtual tour application using ORB image stitching. 2018 5th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), pages 46–49, 2018.
- Xie and Tu [2015] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. International Journal of Computer Vision, 125:3–18, 2015.
- Xiong and Pulli [2009] Yingen Xiong and Kari Pulli. Sequential image stitching for mobile panoramas. 2009 7th International Conference on Information, Communications and Signal Processing (ICICS), pages 1–5, 2009.
- Xiong et al. [2023] Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest N. Iandola, Raghuraman Krishnamoorthi, and Vikas Chandra. EfficientSAM: Leveraged masked image pretraining for efficient segment anything. ArXiv, abs/2312.00863, 2023.
- Xu et al. [2020] Han Xu, Jiayi Ma, Junjun Jiang, Xiaojie Guo, and Haibin Ling. U2Fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:502–518, 2020.
- Xu et al. [2023] Han Xu, Jiteng Yuan, and Jiayi Ma. MURF: Mutually reinforcing multi-modal image registration and fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:12148–12166, 2023.
- Xue et al. [2021] Wanli Xue, Zhe Zhang, and Shengyong Chen. Ghost elimination via multi-component collaboration for unmanned aerial vehicle remote sensing image stitching. Remote. Sens., 13:1388, 2021.
- Zaragoza et al. [2013] Julio Zaragoza, Tat-Jun Chin, Quoc-Huy Tran, M. S. Brown, and David Suter. As-projective-as-possible image stitching with moving DLT. 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2339–2346, 2013.
- Zhang et al. [2023] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong-Seon Hong. Faster segment anything: Towards lightweight SAM for mobile applications. ArXiv, abs/2306.14289, 2023.
- Zhang et al. [2022] Yujie Zhang, Xiaoguang Mei, Yong Ma, Xingyu Jiang, Zongyi Peng, and Jun Huang. Hyperspectral panoramic image stitching using robust matching and adaptive bundle adjustment. Remote. Sens., 14:4038, 2022.
- Zhao et al. [2010] Xiu Ying Zhao, Hongyu Wang, and Yongxue Wang. Medical image seamlessly stitching by SIFT and GIST. 2010 International Conference on E-Product E-Service and E-Entertainment, pages 1–4, 2010.