Object-level Geometric Structure Preserving for Natural Image Stitching (2025)

Wenxiao Cai, Wankou Yang
Southeast University
Corresponding author: wkyang@seu.edu.cn

Abstract

The topic of stitching images with globally natural structures holds paramount significance, with two main goals: pixel-level alignment and distortion prevention. Existing approaches align well, yet fall short in maintaining object structures. In this paper, we endeavour to safeguard the overall OBJect-level structures within images based on the Global Similarity Prior (OBJ-GSP), on top of good alignment performance. Our approach leverages semantic segmentation models, such as the Segment Anything Model family, to extract the contours of any objects in a scene. Triangular meshes are employed in image transformation to protect the overall shapes of objects within images. The balance between alignment and distortion prevention is achieved by allowing the object meshes to strike a balance between similarity and projective transformation. We also demonstrate that object-level semantic information is necessary in low-altitude aerial image stitching. Additionally, we propose StitchBench, the largest image stitching benchmark with the most diverse scenarios. Extensive experimental results demonstrate that OBJ-GSP outperforms existing methods in both pixel alignment and shape preservation. Code and dataset are publicly available at https://github.com/RussRobin/OBJ-GSP.

1 Introduction

Image stitching aims to align multiple images and create a composite image with a larger field of view. It is widely utilized across diverse domains, including smartphone panoramic photography[40], robotic navigation[8], and virtual reality[1, 19]. In recent years, the problem of alignment has largely been addressed. Methods such as APAP[45] and GSP[6] divide the images into multiple grids, compute local transformation matrices within each grid, and combine them with global transformation information to achieve precise alignment in overlapping regions. Thus, the main concern of image stitching nowadays is to prevent distortion while maintaining good alignment performance.


Existing works extract lines in images and preserve them during image transformation. LPC[17] extracts and matches lines in alignment. Building on the strong alignment performance of GSP, GES-GSP[11] adds the similarity transformation of line structures into consideration. However, (a) they only preserve line structures, ignoring overall and object-level structures, (b) focusing only on individual lines can be quite chaotic and mislead the model (Fig.2), and (c) straight or curved structures do not exist in some scenes.


Since an important criterion for humans to judge whether an image looks natural is the naturalness of the object structures within it, our key insight is to extract these structures and preserve them during stitching. Nowadays, state-of-the-art segmentation models can identify almost any object with superior performance. We use them to obtain object shapes, which represent the image structure, and then use triangular meshes to preserve these segmented object shapes during stitching. We generate triangular meshes within each object. During image transformation, these triangular meshes tend to reach a balance between projective and similarity transformations, effectively preserving the structure of the objects. As demonstrated in Fig.1 (f), our method excels in maintaining the overall structure of images by preventing distortion of prominent object shapes. OBJ-GSP capitalizes on object-level preservation, and we leverage segmentation models to extract geometric information. As shown in Fig.1 (e), segmentation models treat objects as cohesive entities, transcending the segmentation of individual lines and curves adopted in previous works[11, 17]. This allows for a more nuanced understanding of the relationships between individual geometric structures, and unlike previous work, it works even when there are no prominent linear structures in the images.

Previous works often used their own collected images without testing on datasets from other papers. We unified the datasets from previous works and incorporated our own hand-held camera and aerial images to create StitchBench, the most complete benchmark to date. We also demonstrate that semantic segmentation in the OBJ-GSP pipeline is necessary for low-altitude aerial image stitching. When a drone flies at a low altitude, the camera moves significantly, and there is a considerable difference in distance to the camera between the roofs and the ground. These conditions do not satisfy the assumptions of image stitching[2], which require a fixed camera optical center or distant scenes, making direct stitching infeasible. In this case, it is necessary to use a semantic segmentation model to identify the houses and then perform orthorectification to project them onto the ground before stitching.

To summarize, the main contributions of the proposed OBJ-GSP include:

  • We propose to preserve object contours before and after image transformation to maintain the overall structure of the image. Unlike line-based approaches, object shapes are not limited to images with obvious linear structures and are not misled by excessively noisy line structures.

  • We introduce segmentation models into image stitching, facilitating the extraction of any object in the scene. Furthermore, we demonstrate that segmentation and OBJ-GSP are crucial for low-altitude aerial image stitching.

  • We collect StitchBench, which is by far the largest and most diverse image stitching benchmark.

2 Related work

2.1 Grid-based image stitching

Autostitch [2], a pioneering work in image stitching, matches feature points and aligns them with a homography transformation. Building upon this foundation, numerous stitching algorithms partition images into grids, compute geometric transformation relationships for each grid, and combine them into a global transformation to align overlapping regions and smoothly transition the transformation to non-overlapping areas. APAP[45], AANAP[23], and GSP[6] have evolved over time, essentially addressing most alignment problems in images. However, their grid deformation methods have no knowledge of object shapes. They pay too much attention to alignment and thus cause geometric distortion. To address this, LPC[17] and GES-GSP[11] propose to preserve line structures. However, (a) their methods only preserve line structures, ignoring the overall structure of objects, (b) an excessive number of lines without object structure information can mislead the model, and (c) some scenes do not contain straight or curved structures. We find that large segmentation models like SAM[20] can segment all types of objects and provide their contours. This helps image stitching maintain shape consistency, so we incorporate the SAM family into our method. We use triangular meshes to protect the overall object-level geometric structure and establish connections between dispersed geometric transformations, achieving superior results.

2.2 Geometric structure extraction

Previous works employ the Line Segment Detector[37] to detect straight lines in images, and edge detection methods like Canny[5] and HED[39] to identify edges. However, these methods require line structures to be present in the image. In cases where textures are unclear or lighting is poor, conventional methods cannot extract lines effectively, whereas large models can still operate successfully. We employ the SAM family and EfficientSAM[41] to extract object-level structures and preserve them during stitching. Notably, segmentation models are not limited by line structures and can segment almost any object. In the future, the accuracy and speed of SAM-type methods will both improve[41, 46], further enhancing the quality and speed of our image stitching technique.
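To make the extraction step concrete, the sketch below obtains object contours from SAM's automatic masks and keeps the largest external contour per object. The checkpoint path, minimum-area threshold, and OpenCV-based contour tracing are illustrative assumptions, not a description of our exact implementation.

import cv2
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def extract_object_contours(image_bgr, checkpoint="sam_vit_h.pth", min_area=500):
    # Run SAM in automatic mode to get one binary mask per detected object.
    image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    masks = SamAutomaticMaskGenerator(sam).generate(image_rgb)

    contours = []
    for m in masks:
        if m["area"] < min_area:  # discard extremely small masks
            continue
        binary = m["segmentation"].astype(np.uint8)
        cs, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
        if cs:
            # keep the largest external contour as this object's structure
            contours.append(max(cs, key=cv2.contourArea).squeeze(1))
    return contours  # list of (K, 2) arrays of boundary pixel coordinates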

2.3 Deep-learning based stitching

In recent years, several methods[18] like UDIS[30] have attempted to model certain image stitching steps as unsupervised deep learning problems, leading to notable advances in this field. UDIS++[31] also addresses the distortion problem on top of good alignment performance, which aligns closely with our goals. We adhere to the traditional approach in the stitching domain by preserving structures through grid transformation, while UDIS++ provides a completely new deep learning-based pipeline, although its current performance is not as good as ours.

3 The proposed method


OBJ-GSP introduces SAM to segment objects, obtains their structural contours, and preserves object-level structures while aligning feature points during stitching. Locally, our approach retains the original perspective of each image. On a global scale, it seeks to preserve the overall structure[6]. Moreover, at the object level, we ensure the integrity of objects within the images, preventing distortion. To this end, we take four aspects into consideration: alignment, global similarity, local similarity, and object-level shape preservation. A grid mesh is adopted to guide the image deformation, where $V$ and $E$ represent the sets of vertices and edges within the grid mesh, as shown in Fig.3. Image stitching methods aim to find a set of deformed vertex positions, denoted as $\widetilde{V}$, that minimizes the energy function $\psi(V)$.

Alignment term extracts feature points $p$ with an extractor (e.g., SIFT) and matches feature point pairs with a matcher $\Phi$. For each feature point pair $(p, \Phi(p))$, $\widetilde{v}(p)$ represents the position of $p$ as a linear combination of four vertex positions, and $M$ represents the set of all feature point pairs. The algorithm linearly combines the coordinates of the four vertices of the enclosing grid cell to represent the position of $p$ through bilinear interpolation. By optimizing the positions of the grid vertices after geometric transformation, it brings $p$ as close as possible to $\Phi(p)$. Therefore, the energy term is defined as:

$$\psi_a(V) = \sum_{p_k \in M} \left\| \widetilde{v}(p_k) - \widetilde{v}(\Phi(p_k)) \right\|^2. \qquad (1)$$
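As a minimal sketch of this term, assuming each feature point's enclosing cell and bilinear weights have already been computed, the alignment energy of Eq. 1 can be evaluated as follows (the grid layout and indexing scheme are illustrative assumptions):

import numpy as np

def bilinear_weights(p, cell_origin, cell_size):
    # Weights of the four cell corners (tl, tr, bl, br) that reproduce point p.
    u = (p[0] - cell_origin[0]) / cell_size
    v = (p[1] - cell_origin[1]) / cell_size
    return np.array([(1 - u) * (1 - v), u * (1 - v), (1 - u) * v, u * v])

def alignment_energy(matches, V_tilde):
    # matches: list of ((idx4_a, w4_a), (idx4_b, w4_b)) for each pair (p, Phi(p)),
    # where idx4 are the four enclosing vertex indices and w4 their bilinear weights.
    energy = 0.0
    for (idx_a, w_a), (idx_b, w_b) in matches:
        v_p    = (w_a[:, None] * V_tilde[idx_a]).sum(axis=0)  # ~v(p_k)
        v_phip = (w_b[:, None] * V_tilde[idx_b]).sum(axis=0)  # ~v(Phi(p_k))
        energy += np.sum((v_p - v_phip) ** 2)
    return energy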

Local similarity term aims to ensure that the transition from overlapping to non-overlapping regions is natural. Each grid undergoes a similarity transformation to minimize shape distortion. For an edge $(j,k)$, $S_{jk}$ represents its similarity transformation. Suppose $v_j$ transforms to $\widetilde{v}_j$ after deformation; the energy function is defined as:

$$\psi_l(V) = \sum_{(j,k) \in E_i} \left\| (\widetilde{v}_k - \widetilde{v}_j) - S_{jk}(v_k - v_j) \right\|^2. \qquad (2)$$
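A minimal sketch of the local similarity residual follows, assuming each $S_{jk}$ is a pre-estimated 2x2 similarity matrix as in GSP (how it is estimated is not shown here):

import numpy as np

def local_similarity_energy(edges, S, V, V_tilde):
    # edges: list of (j, k); S: dict mapping (j, k) -> 2x2 similarity matrix;
    # V and V_tilde: (N, 2) arrays of original and deformed vertex positions.
    energy = 0.0
    for (j, k) in edges:
        residual = (V_tilde[k] - V_tilde[j]) - S[(j, k)] @ (V[k] - V[j])
        energy += np.sum(residual ** 2)
    return energy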

Global similarity term operates on a global scale to ensure the entire image undergoes a similarity transformation. The GSP algorithm estimates the scale $s$ and rotation $\theta$ of the global image transformation and computes similarity parameters $c(e)$ and $s(e)$ for each edge. Thus, the energy function is defined as:

$$\psi_g(V) = \sum_{e_j \in E} w(e_j)^2 \left[ \left(c(e_j) - s\cos\theta\right)^2 + \left(s(e_j) - s\sin\theta\right)^2 \right]. \qquad (3)$$
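The sketch below illustrates one way to realize this term, assuming $c(e)$ and $s(e)$ are the similarity coefficients that best map an original edge to its deformed counterpart and $w(e)$ is a per-edge weight; the exact definitions and sign conventions follow GSP and are assumptions here.

import numpy as np

def edge_similarity_coeffs(e_orig, e_def):
    # Least-squares fit of e_def = [[c, s], [-s, c]] @ e_orig.
    denom = float(np.dot(e_orig, e_orig))
    c = float(np.dot(e_orig, e_def)) / denom
    s = (e_def[0] * e_orig[1] - e_def[1] * e_orig[0]) / denom
    return c, s

def global_similarity_energy(edges, weights, V, V_tilde, scale, theta):
    # Penalize each edge's similarity coefficients for deviating from the
    # global scale/rotation pair (s*cos(theta), s*sin(theta)) of Eq. 3.
    energy = 0.0
    for (j, k), w in zip(edges, weights):
        c, s = edge_similarity_coeffs(V[k] - V[j], V_tilde[k] - V_tilde[j])
        energy += w ** 2 * ((c - scale * np.cos(theta)) ** 2 +
                            (s - scale * np.sin(theta)) ** 2)
    return energy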

After obtaining the contours, we generate a triangular mesh for each semantic object, preserving the shape of the object through similarity transformations within the triangular mesh. Unlike the As-Rigid-As-Possible (ARAP) [15] method, we simplify the computation by directly locating the center of the object and connecting it to sampling points on the object's semantic boundary to form a triangular mesh. In Fig.3, $V_0$ represents the object's center, while $V_1$ and $V_2$ are sampling points on the semantic boundary of the object, forming a triangle with these three points. $(x_{01}, y_{01})$ are the known coordinates of $V_2$ in the local coordinate frame defined by the edge $\overrightarrow{V_0 V_1}$. The vertex $V_2$ of the triangle can thus be represented using the edge $\overrightarrow{V_0 V_1}$ and an orthogonal axis obtained by rotating this edge counterclockwise by 90 degrees:

$$V_2 = V_0 + x_{01}\,\overrightarrow{V_0 V_1} + y_{01}\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}\overrightarrow{V_0 V_1}. \qquad (4)$$
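The following sketch computes $(x_{01}, y_{01})$ for a triangle and verifies Eq. 4; the same reconstruction, applied to the deformed $\widehat{V_0}$ and $\widehat{V_1}$, yields the desired position of Eq. 5.

import numpy as np

R90 = np.array([[0.0, 1.0],
                [-1.0, 0.0]])  # the rotation matrix used in Eq. 4

def local_coords(V0, V1, V2):
    # Coordinates (x01, y01) of V2 in the frame {V0V1, R90 @ V0V1} anchored at V0.
    e = V1 - V0
    basis = np.column_stack([e, R90 @ e])
    x01, y01 = np.linalg.solve(basis, V2 - V0)
    return x01, y01

def reconstruct_V2(V0, V1, x01, y01):
    # Right-hand side of Eq. 4; with deformed V0, V1 it gives V2_desired of Eq. 5.
    e = V1 - V0
    return V0 + x01 * e + y01 * (R90 @ e)

# Local coordinates are invariant under similarity transforms of the triangle.
V0, V1, V2 = np.array([0., 0.]), np.array([2., 0.]), np.array([1., 1.5])
x01, y01 = local_coords(V0, V1, V2)
assert np.allclose(reconstruct_V2(V0, V1, x01, y01), V2)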

After the mesh deformation, $V_0$ and $V_1$ are transformed into $\widehat{V_0}$ and $\widehat{V_1}$. To preserve the shape of the segmentation result, we want the triangle to undergo a similarity transformation, keeping $x_{01}$ and $y_{01}$ unchanged. Therefore, we desire $V_2$ to transform into:

$$\widehat{V_2^{desired}} = \widehat{V_0} + x_{01}\,\overrightarrow{\widehat{V_0}\widehat{V_1}} + y_{01}\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}\overrightarrow{\widehat{V_0}\widehat{V_1}}. \qquad (5)$$

The corresponding energy term for the transformed $\widehat{V_2}$ is calculated as:

$$E_{V_2} = \left\| \widehat{V_2^{desired}} - \widehat{V_2} \right\|^2. \qquad (6)$$
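A minimal self-contained sketch of Eqs. 4-6 for a single boundary vertex is given below; variable names are illustrative.

import numpy as np

R90 = np.array([[0.0, 1.0], [-1.0, 0.0]])  # rotation from Eq. 4

def vertex_shape_energy(V0, V1, V2, V0_hat, V1_hat, V2_hat):
    # (x01, y01): local coordinates of V2 in the original triangle (Eq. 4).
    e = V1 - V0
    x01, y01 = np.linalg.solve(np.column_stack([e, R90 @ e]), V2 - V0)
    # Desired position keeps (x01, y01) fixed under the deformed edge (Eq. 5).
    e_hat = V1_hat - V0_hat
    V2_desired = V0_hat + x01 * e_hat + y01 * (R90 @ e_hat)
    return float(np.sum((V2_desired - V2_hat) ** 2))   # Eq. 6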

Similar definitions of energy terms are applied to $\widehat{V_0}$ and $\widehat{V_1}$, resulting in the error sum for a triangle:

$$E_{\{V_0, V_1, V_2\}} = \sum_{i=0,1,2} \left\| \widehat{V_i^{desired}} - \widehat{V_i} \right\|^2. \qquad (7)$$

Initially, our approach constructs the triangular mesh by selecting sampling points and the object's center. Unlike ARAP [15], we do not employ equilateral triangular meshes, as objects segmented from the image often lead to very small equilateral triangles. Experimental results demonstrate that this approximation not only has no adverse impact on the final outcome but also reduces computational complexity; the per-triangle energy then simplifies to:

$$E_{\{V_0, V_1, V_2\}} = \sum_{i=1,2} \left\| \widehat{V_i^{desired}} - \widehat{V_i} \right\|^2. \qquad (8)$$

We extract $N_c$ semantic object structures from a single image using semantic segmentation, and $N_s$ represents the total number of sampling points within a geometric structure. Similar to GES-GSP [11], $\omega$ is a coefficient calculated based on the positions of the sampling points. Consequently, the total error equation is as follows:

$$\psi_{obj}(V) = \sum_{\beta=1}^{N_c} \sum_{\alpha=1}^{N_s} \omega_\alpha^\beta E_\alpha^\beta. \qquad (9)$$

To conclude, our objective function is given by:

$$\widetilde{V} = \underset{\widetilde{V}}{\arg\min}\left( \psi_a(\widetilde{V}) + \lambda_l \psi_l(\widetilde{V}) + \psi_g(\widetilde{V}) + \lambda_{obj} \psi_{obj}(\widetilde{V}) \right). \qquad (10)$$

Eq.10 can be solved with linear optimization. For a fair comparison, our parameters are identical to those of GES-GSP: $\lambda_l = 0.75$, $\lambda_{obj} = 1.5$. Our $\lambda_{obj}$ corresponds to $\lambda_{ges}$ in GES-GSP.
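Since every energy term above is linear (or linearized) in the deformed vertex coordinates, Eq. 10 reduces to a sparse linear least-squares problem. The sketch below shows only the solve pattern; assembling the residual rows for each term follows GSP/GES-GSP and is omitted, and the triplet encoding and solver choice are illustrative assumptions.

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

def solve_mesh(rows, cols, vals, b, num_vertices):
    # rows/cols/vals: COO triplets of the stacked, weighted residual equations;
    # unknowns are the flattened deformed vertex coordinates (x0, y0, x1, y1, ...).
    A = sp.csr_matrix((vals, (rows, cols)), shape=(len(b), 2 * num_vertices))
    x = lsqr(A, np.asarray(b, dtype=float))[0]   # least-squares solution of A x ~= b
    return x.reshape(num_vertices, 2)            # deformed vertex positions V~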

Table 1: MDR and NIQE results on each StitchBench subset (lower is better).

Mean Distorted Residuals (MDR ↓)

Models / Datasets | OBJ-GSP | AANAP   | APAP    | CAVE    | DFW     | DHW
GSP ECCV16        | 1.15296 | 1.06183 | 1.25495 | 0.90884 | 0.98457 | 1.08755
GES-GSP CVPR22    | 1.14366 | 1.06213 | 1.24249 | 0.90821 | 0.98034 | 1.05619
OBJ-GSP (ours)    | 1.12229 | 1.05930 | 1.20123 | 0.89731 | 0.97259 | 1.00496
Improvement (%)   | 1.9     | 0.3     | 3.3     | 1.2     | 0.8     | 4.9

Naturalness Image Quality Evaluator (NIQE ↓)

Models / Datasets | OBJ-GSP | AANAP   | APAP    | CAVE    | DFW     | DHW
UDIS TIP21        | 3.69421 | 3.01517 | 3.69421 | -       | 5.74137 | 3.28645
UDIS++ ICCV23     | 3.34003 | 2.95493 | 3.56812 | 4.07702 | 5.09680 | 3.23392
GSP ECCV16        | 2.66597 | 2.84241 | 3.4356  | 4.04708 | 5.61905 | 2.75485
GES-GSP CVPR22    | 2.64986 | 2.77220 | 3.48713 | 4.03835 | 5.71544 | 2.70838
OBJ-GSP (ours)    | 2.54906 | 2.74965 | 3.39280 | 4.01565 | 5.69104 | 2.60825
Improvement (%)   | 3.8     | 0.8     | 2.7     | 0.6     | 0.4     | 3.7

Mean Distorted Residuals (MDR ↓)

Models / Datasets | GES-GSP | LPC     | REW     | SEAGULL | SVA     | SPHP
GSP ECCV16        | 1.06986 | 1.30562 | 1.16192 | 1.14467 | 1.51158 | 1.17784
GES-GSP CVPR22    | 1.15462 | 1.11993 | 1.47197 | 1.14340 | 1.04473 | 1.22256
OBJ-GSP (ours)    | 0.98288 | 1.10622 | 1.08635 | 1.08296 | 1.47813 | 1.07699
Improvement (%)   | 5.9     | 9.5     | 5.9     | 3.3     | -0.4    | 5.8

Naturalness Image Quality Evaluator (NIQE ↓)

Models / Datasets | GES-GSP | LPC     | REW     | SEAGULL | SVA     | SPHP
UDIS TIP21        | 5.02442 | 3.76994 | 3.61888 | 4.67437 | 8.02090 | 4.35149
UDIS++ ICCV23     | 4.93279 | 3.66565 | 3.67661 | 4.38520 | 7.5419  | 4.1248
GSP ECCV16        | 3.84897 | 4.28546 | 3.18549 | 5.10784 | 7.00495 | 3.04781
GES-GSP CVPR22    | 3.79240 | 3.35315 | 2.75234 | 4.69390 | 6.96670 | 2.97173
OBJ-GSP (ours)    | 3.70041 | 3.23057 | 2.81480 | 4.08903 | 6.96149 | 2.49712
Improvement (%)   | 2.4     | 3.7     | -2.2    | 12.9    | 0.5     | 16.0

4 Experiments

4.1 StitchBench

Previous work often collected a small number of images and performed only qualitative tests. Moreover, different works have different focuses, such as parallax between the foreground and background, sparse features in natural scenery, precise alignment with no distinct structures to preserve, or distinct line structures, without comprehensively evaluating models' performance across a wide range of scenarios. To address this issue, we present the most extensive image stitching benchmark to date: StitchBench, which includes 122 pairs of images drawn from 12 works. We collect 18 pairs of images captured with hand-held cameras, in which the preservation of object structures is crucial. StitchBench also includes 7 sets of urban scenes captured by low-altitude drones, featuring tall buildings and requiring the assistance of segmentation models. To overcome our subjective preferences and the limited locations where we collected images, we also include test images used in previous state-of-the-art works, namely AANAP[23], APAP[45], CAVE[32], DFW[22], DHW[12], GES-GSP[11], LPC[17], SEAGULL[25], REW[21], SVA[26] and SPHP[24]. StitchBench is currently the most comprehensive stitching test dataset. An algorithm must demonstrate general applicability to perform well on all subsets of StitchBench: aligning well while preventing distortion.

Evaluation metrics. We quantitatively assess the quality of our stitching results from two perspectives: distortion prevention and alignment. First, we employ the Mean Distorted Residuals (MDR) metric to measure the degree of image distortion. Intuitively, if points along the same side of the mesh were originally collinear and remain collinear after stitching, the stitching result has minimal distortion. Furthermore, we employ the Naturalness Image Quality Evaluator (NIQE)[29] metric to evaluate alignment performance. We argue that NIQE is a more intuitive and better indicator of alignment than RMSE, SSIM, and PSNR, as it measures image clarity, and stitching results with misalignment produce blurry areas, leading to worse NIQE scores.
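To make the collinearity intuition concrete, the sketch below averages the perpendicular residuals of originally-collinear mesh points after warping. This reflects our reading of the MDR idea and is not necessarily the exact formula used to produce the numbers in Table 1.

import numpy as np

def line_residual(points):
    # Mean perpendicular distance of points to their total-least-squares line.
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    normal = vt[-1]                     # unit normal of the best-fit line
    return float(np.mean(np.abs(centered @ normal)))

def mean_distorted_residual(collinear_groups_after_warp):
    # Each group holds the warped positions of mesh points that were collinear
    # before stitching; lower values indicate less distortion.
    return float(np.mean([line_residual(g) for g in collinear_groups_after_warp]))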

4.2 Baselines

We compare with GSP[6] and GES-GSP[11]. UDIS[30] and UDIS++[31] are well-known works applying deep learning to image stitching. Since they do not explicitly use feature points, we are unable to measure their quality with MDR. We provide a detailed comparison between OBJ-GSP and UDIS++ in the supplementary materials.

4.3 Results

Quantitative results. Table 1 shows MDR and NIQE results on datasets used in other stitching algorithms and on our own data. We outperform GSP and GES-GSP in both alignment and shape preservation. UDIS++[31] is a strong attempt at deep-learning-based image stitching, but its performance is still not better than ours.

Qualitative results.

Fig.1 and 2 elucidate the reasons behind the superior performance of our OBJ-GSP method. With the assistance of semantic segmentation techniques, we place greater emphasis on preserving critical structures and ensuring holistic, object-level protection of objects within the images. Fig.4 and 5 illustrate the stitching outcomes of six different methods, where we use straight lines and boxes to demonstrate the effects of alignment and distortion. Please refer to our supplementary material for more qualitative results.


4.4 Low-Altitude Aerial Image Stitching

Image stitching requires meeting one of two conditions[2]: either the camera's optical center remains stationary while the camera rotates, or the scene consists only of objects that are far from the camera. Existing stitching algorithms mainly address the case where these conditions are slightly violated. For low-altitude aerial images, where the flight height of the aircraft is around 100 meters but the buildings are no less than 20-40 meters tall, the camera's optical center moves significantly during drone shooting, completely failing to meet the two assumptions for stitching. Moreover, if the left and right walls of a building are captured in two separate shots, it would be a logical error to include both walls in a panorama (for a cube, at most three faces can be seen at a time, and it is impossible to see two opposing faces simultaneously). For stitching low-altitude aerial images, we first use a semantic segmentation model to segment the roofs and walls. We then calculate the height of the roofs and orthographically project the buildings onto the ground plane before stitching. In this scenario, the semantic segmentation model is essential for the stitching process[3]. The stitching pipeline and results are shown in Fig.6.

5 Ablation Studies and Discussions

5.1 Lightweight SAM backbones

To assess the influence of semantic segmentation results on the stitching performance, we conduct a comparative analysis across three different backbones of the SAM[20] model, namely ViT-B, ViT-L, and ViT-H[10], progressing from smaller to larger models. Larger models are inherently capable of capturing more fine-grained semantic details. The corresponding stitching results are depicted in Fig.7. It is worth noting that the results based on ViT-B and ViT-L exhibit some blurriness and minor distortions. In terms of MDR, our model under the three aforementioned backbone configurations achieves improvements of 1.9%, 2.7%, and 3.6% over the baseline model GES-GSP[11], respectively. The application of EfficientSAM[41], the time consumption of OBJ-GSP, and its real-world applications are included in the supplementary materials.

5.2 Sampling strategies

Incorporating SAM[20] with the ViT-H [10] backbone, triangular mesh sampling yields superior results compared to the triangle sampling proposed in GES-GSP[11]. The latter preserves the shapes of individual lines but fails to maintain the overall geometric structure of the image. As illustrated in Fig.8, we outline the structural elements in the original image using red dashed lines and then superimpose them onto the combined results of triangle sampling and triangular mesh sampling. Triangular mesh sampling retains the positional relationships between the lines present in the original image.

6 Conclusion

In this paper, we propose the OBJect-level Geometric Structure Preserving for natural image stitching (OBJ-GSP) algorithm, a novel approach to achieving natural and visually pleasing composite images. OBJ-GSP protects object shapes by first segmenting them out and then preserving their structures with triangular meshes. We also demonstrate that semantic segmentation is necessary for low-altitude aerial image stitching. We collect new test image pairs in common scenes and aerial imaging, and combine them with images from previous works, to establish the most comprehensive image stitching benchmark to date: StitchBench. Detailed experiments against comprehensive baselines on StitchBench demonstrate the effectiveness of OBJ-GSP.

7 Extensive Discussions

7.1 Limitations

While OBJ-GSP has demonstrated state-of-the-art performance in image stitching by extracting object-level geometric structures with semantic segmentation and preserving them with triangular mesh sampling, it introduces a large semantic segmentation model into image stitching, resulting in higher computational costs. However, with the development of semantic segmentation techniques, lighter versions of SAM will emerge[41, 46], enhancing the speed of our method. According to our analysis, the effectiveness of geometric structure extraction significantly impacts the final results, so our method is constrained by the quality of SAM's output. Smaller models, i.e., SAM[20] with ViT-B/L[10], do not perform as well as SAM with ViT-H. To stitch a pair of 800×600 images, SAM spends 25 s on an RTX2090 with 8-24 GB of GPU memory, depending on the backbone. For the C++ ONNX implementation, SAM ViT-B spends 1.5 min on a CPU. Mesh optimization and image processing cost less than 4 s on an Intel i5 CPU, almost the same as GES-GSP[11]. OBJ-GSP needs more computational resources and time than GES-GSP, but the stitching quality is also better: the time cost of SAM is larger than that of the line detection methods in GES-GSP, while the time used for triangle mesh optimization is almost the same.

7.2 Applications of OBJ-GSP

In many fields, there is a need for high-quality stitched images, even at the expense of long processing times and significant computational resources. In medical image processing[36, 34, 48], for instance, stitching multiple pathological slice images to reconstruct the entire tissue structure or an organ's three-dimensional model demands high precision and quality. These tasks typically necessitate precise alignment and seamless fusion, and can tolerate longer computation times to ensure accuracy. Similarly, in photography[27] or virtual tourism[35, 38] applications, stitching numerous high-resolution images is necessary to generate high-quality panoramic images. This holds true in the fields of remote sensing[44, 7, 47] and the movie industry[28, 14] as well. Currently, the emergence of faster and more accurate segmentation models, such as EfficientSAM[41], makes our method even more promising. We provide detailed comparisons in the supplementary materials.

7.3 Post-processing in image stitching

Our method primarily addresses computing transformation matrices to achieve alignment and shape preservation. Post-processing techniques can be combined with our approach to achieve a more natural stitching effect. Blending[2, 42, 43] and seam-driven[13] methods can be used to further reduce blurring, while global straightening[12] can decrease distortion.

7.4 Broader impacts

It is important to acknowledge that we do not explicitly discuss broader impacts of the proposed OBJ-GSP image stitching algorithm, such as fairness or bias. The Segment Anything Model (SAM) [20] has discussed its broader impacts regarding geographic and income representation, as well as fairness in segmenting people. Further research into how our algorithm may interact with other aspects of image stitching is encouraged.

8 Ablation of EfficientSAM and mesh sampling

We propose the utilization of SAM-based methods and mesh sampling to address distortion and misalignment during stitching. It is important to emphasize that both components are indispensable for object-level shape preservation. Fig.9 illustrates the distinct results achieved from the same semantic segmentation output using line-based triangle sampling and object-level mesh sampling. Mesh sampling recognizes object structures and effectively protects objects from distortion. Furthermore, with the advancement of the semantic segmentation field, the speed of SAM-based methods is accelerating, EfficientSAM[41] being one example, which will greatly expedite our image stitching approach. In Fig.9, there is no significant difference between the results obtained using SAM + mesh sampling and EfficientSAM + mesh sampling. However, the time consumption of EfficientSAM is only 5% of that of SAM, which is expected. With the further development of SAM-based methods, even faster and more accurate approaches are expected to emerge, making our stitching method faster and more precise.


9 StitchBench Metadata

We employed handheld smartphones to capture the OBJ-GSP subset of images. During the image acquisition process, we took care to minimize translational movement of the smartphone, relying primarily on rotation to adjust the framing. This was done to ensure that the disparity between images remained manageable, preventing situations where the occlusion relationships between two images would be too dissimilar for successful stitching. We amassed a total of 18 pairs of images, encompassing diverse scenes such as rooms, culinary creations, sculptures, gardens, rivers, ponds, industrial facilities, roads, and exteriors of buildings, among others.

We also collect 7 sets of aerial images, each consisting of about 9 pairs of images of urban scenes. We fly drones at 100-120 meters in urban areas containing buildings, roads, trees, etc.[3]. We collect images with a DJI Mavic Air 2, and the image size is set to 3000 × 4000 pixels. The camera on the drone is kept pointing vertically at the ground (bird's-eye view).

Additionally, we curate existing image stitching datasets [23, 45, 32, 22, 12, 11, 17, 25, 21, 26, 24] to supplement our data collection efforts. Ultimately, we constructed a dataset consisting of 122 groups of images, making it the largest dataset currently employed in image stitching. We release this dataset to the public for further research and development.

10 Reestablishing baselines

10.1 Implementation details

We evaluated the results of GSP [6] and GES-GSP [11] using their publicly available C++ code. Our stitching code is also implemented in C++. Our implementation of SAM includes two approaches. Drawing from the Segment Anything C++ Wrapper [9], we exported the SAM [20] encoder and decoder into Open Neural Network Exchange (ONNX) format and subsequently replicated SAM’s automatic mode within C++. To achieve the best possible stitching results, we also directly implemented the semantic segmentation component using the SAM’s publicly available Python code and utilized their semantic segmentation results. The stitching part of our experiments ran on the CPU, while the SAM modules were capable of running on both CPU and GPU.

10.2 Baselines

We replicated the results reported in the literature for GSP [6], GES-GSP [11], APAP [45], and SPHP [24] by running their publicly available codebases with their default parameter settings. For the structural alignment component, we employed the executable provided by Autostitch [2]. GES-GSP includes experimental data for both GES-GSP and GSP. In our method's experiments, we maintained consistent parameter settings across all trials. Furthermore, for the structure extraction stage, we utilized the official code provided by SAM, but excluded masks with extremely small areas.

11 Algorithm and Results of Low-Altitude Aerial Image Stitching

11.1 Why is Semantic Segmentation Necessary

For low-altitude drone aerial photography in urban areas, the drone's flight altitude is low while the buildings are tall. Due to the significant distance difference between the drone camera and the buildings, the transformation matrices for buildings and rooftops differ from those for the ground across images. Direct stitching can therefore lead to severe ghosting and unnatural, distorted building structures. Additionally, some information must be discarded during low-altitude aerial stitching in urban areas. If buildings are simply considered rectangular prisms, and the left and right walls are captured in two separate shots, it is impossible to retain both sides in the stitched image (as a person cannot see both opposing sides of a rectangular prism simultaneously). Therefore, we propose first using a segmentation model to identify buildings and walls in the scene. The walls are removed from the images, and the ratio of the building height to the drone height is calculated based on the different transformation relationships for the ground and buildings. The buildings are then projected onto the ground plane before stitching. The importance of segmentation in aerial image stitching is shown in Fig.10.


11.2 OBJ-GSP in Aerial Image Stitching

In low-altitude aerial images, there are two planes of interest: the ground $P^g$ and roofs $P^r_i$, where $i = 1, 2, \dots, N$ and $N$ is the number of roofs. Semantic segmentation models are adopted to segment roofs and walls, with the remaining pixels regarded as ground. We mask out walls and then orthographically project roofs onto the ground. For correctly matched feature points $f_{r1}$ and $f_{r2}$ on $P^r$, and $f_{g1}$ and $f_{g2}$ on $P^g$, we aim to find a transformation matrix $H_o$ that projects the roof to the ground before stitching. After projection, the images can be stitched with a global transformation matrix $H_g$, which is also the transformation between the ground planes:

$$\begin{cases} f_{g2} = H_g f_{g1} \\ H_{oi} f_{r2} = H_g H_{oi} f_{r1} \end{cases} \quad \text{for } i = 1, \dots, N, \qquad (11)$$

where $f$ is in homogeneous coordinates $[x \;\; y \;\; 1]^T$. Let the height of roof $P^r_i$ above the ground be $h_{ri}$ and the height of the drone be $h_d$. For each pixel $[x, y, 1]$ on the roof, the orthographic projection transformation is

$$\begin{bmatrix} x_g \\ y_g \\ 1 \end{bmatrix} = \begin{bmatrix} 1 - \frac{h_{ri}}{h_d} & 0 & \frac{h_{ri}}{h_d} a \\ 0 & 1 - \frac{h_{ri}}{h_d} & \frac{h_{ri}}{h_d} b \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_r \\ y_r \\ 1 \end{bmatrix} \quad \text{for } i = 1, \dots, N, \qquad (12)$$

where $a$ and $b$ are half of the pixel width and height of the image. If the transformation matrix takes the form of a homography matrix and the number of feature point matches on $P^r_i$ is $M_i$, Eq.11 expands to $2M_i$ mutually independent quadratic equations in the unknown $\frac{h_{ri}}{h_d}$. This overdetermined system can be solved by methods such as Newton iteration. After solving for $\frac{h_{ri}}{h_d}$, the orthographic projection map can be generated by simply transforming each $P^r_i$ to the ground one by one. The case where the transformation matrix is an affine matrix can be solved in the same way.
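A minimal sketch of this step follows, parameterizing $H_o(r)$ by $r = h_{ri}/h_d$ as in Eq. 12 and fitting $r$ to the matched roof points under Eq. 11; using scipy's least_squares instead of Newton iteration is an illustrative choice, not our exact solver.

import numpy as np
from scipy.optimize import least_squares

def H_o(r, a, b):
    # Orthographic projection of a roof onto the ground plane (Eq. 12).
    return np.array([[1.0 - r, 0.0, r * a],
                     [0.0, 1.0 - r, r * b],
                     [0.0, 0.0, 1.0]])

def solve_roof_ratio(fr1, fr2, Hg, a, b):
    # fr1, fr2: (M, 3) homogeneous roof feature points matched across two images;
    # Hg: 3x3 ground homography. Returns the estimated ratio r = h_ri / h_d.
    def residuals(params):
        Ho = H_o(params[0], a, b)
        lhs = (Ho @ fr2.T).T
        rhs = (Hg @ Ho @ fr1.T).T
        lhs = lhs[:, :2] / lhs[:, 2:3]   # normalize homogeneous coordinates
        rhs = rhs[:, :2] / rhs[:, 2:3]
        return (lhs - rhs).ravel()       # 2 * M_i residuals, cf. Eq. 11
    return least_squares(residuals, x0=[0.2], bounds=(0.0, 1.0)).x[0]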

11.3 Segmentation Models and Aerial Segmentation Datasets Used

We finetune Grounded SAM[33] on low-altitude drone datasets in which roofs and walls are annotated (the Varied Drone Dataset[4] and the ICG Drone Dataset[16]).

12 More qualitative results

In this section we provide more qualitative results. We mark images with boxes to indicate misalignment, and use lines (and intersections of lines) to show distortion. Please refer to Fig. 12, 13, 5, 14 and 15. They show that our object-level preservation of structures can prevent distortion and misalignment at the same time.

13 Successful cases because of Segmentation

In this section, we demonstrate the superiority of the Segment Anything Model [20] with qualitative results. Please refer to Fig. 16. SAM extracted object-level and complete structures of the ground and mountain, so OBJ-GSP preserved their structures better than GES-GSP, in which HED [39] only extracted fragmented edge information. In Fig. 17, we demonstrate that the proposed OBJ-GSP can even stitch images with poor lighting conditions and protect their object-level structures.

14 Failure cases

As shown in Fig. 11, OBJ-GSP fails in cases of significant parallax. In the perspective of the left image, the corner of the building appears to be obtuse. However, based on the inference from the right image, the corner of the building should be a right angle. Consequently, the stitching algorithm becomes perplexed, unsure of how to preserve the shape of the house.


Only in rare instances does OBJ-GSP fail due to the malfunction of SAM [20], such as under inadequate illumination or with an exceedingly sparse set of features. Nevertheless, it is important to discuss the situations where the Segment Anything Model [20] fails, leading to the failure of the proposed OBJ-GSP image stitching algorithm. Please refer to Fig. 18 and 19. In our approach, it is not imperative for SAM [20] to achieve precise segmentation of objects containing semantic information; SAM need only recognize key object contours and boundaries. Therefore, the proposed OBJ-GSP is susceptible to SAM failure only in exceedingly rare instances where features are exceptionally sparse and objects are highly indistinct. In such cases, SAM's failure nullifies our structure-preservation term, and OBJ-GSP degrades to the performance level of GSP [6]. For cases involving distortion and misalignment, mitigation strategies such as global straightening and multi-band blending, as employed in Autostitch [2], can be leveraged to alleviate these issues.

15 Comparison with UDIS++

UDIS[30] (TIP 2021) and UDIS++[31] (ICCV 2023) are a family of attempts to address image stitching with deep learning frameworks. We compare with the revised version (UDIS++) as it performs better than UDIS. Like us, UDIS++ also aims to solve distortion issues on top of alignment. In the main text, we compute UDIS++'s NIQE to assess its alignment performance. Since UDIS++ is not feature-point based, MDR, a metric for measuring distortion, cannot be calculated for it. Therefore, in the supplementary material, we provide results for four scenes with multiple sets of images to intuitively compare distortion levels. Our method performs better in both distortion prevention and alignment. Please refer to Fig. 20, 21 and 22.


References

  • Anderson et al. [2016] Robert Anderson, David Gallup, Jonathan T. Barron, Janne Kontkanen, Noah Snavely, Carlos Hernández, Sameer Agarwal, and Steven M. Seitz. Jump: virtual reality video. ACM Trans. Graph., 35:198:1–198:13, 2016.
  • Brown and Lowe [2007] Matthew A. Brown and David G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74:59–73, 2007.
  • Cai et al. [2023a] Wenxiao Cai, Songlin Du, and Wankou Yang. UAV image stitching by estimating orthograph with RGB cameras. J. Vis. Commun. Image Represent., 94:103835, 2023a.
  • Cai et al. [2023b] Wenxiao Cai, Ke Jin, Jinyan Hou, Cong Guo, Letian Wu, and Wankou Yang. VDD: Varied drone dataset for semantic segmentation. ArXiv, abs/2305.13608, 2023b.
  • Canny [1986] John F. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-8:679–698, 1986.
  • Chen and Chuang [2016] Yu-Sheng Chen and Yung-Yu Chuang. Natural image stitching with the global similarity prior. In European Conference on Computer Vision, 2016.
  • Cui et al. [2020] Jiguang Cui, Man Liu, Zhitao Zhang, Shuqin Yang, and Jifeng Ning. Robust UAV thermal infrared remote sensing images stitching via overlap-prior-based global similarity prior model. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:270–282, 2020.
  • Dewangan et al. [2014] Abhishek Kumar Dewangan, Rohit Raja, and Reetika Singh. An implementation of multi sensor based mobile robot with image stitching application. 2014.
  • dinglufe [2023] dinglufe. Segment anything cpp wrapper. GitHub Repository, 2023. Accessed on 2023-09-20.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2020.
  • Du et al. [2022] Peng Du, Jifeng Ning, Jiguang Cui, Shaoli Huang, Xinchao Wang, and Jiaxin Wang. Geometric structure preserving warp for natural image stitching. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3678–3686, 2022.
  • Gao et al. [2011] Junhong Gao, Seon Joo Kim, and M.S. Brown. Constructing image panoramas using dual-homography warping. CVPR 2011, pages 49–56, 2011.
  • Gao et al. [2013] Junhong Gao, Yu Li, Tat-Jun Chin, and M.S. Brown. Seam-driven image stitching. In Eurographics, 2013.
  • Guo et al. [2016] Heng Guo, Shuaicheng Liu, Tong He, Shuyuan Zhu, Bing Zeng, and M. Gabbouj. Joint video stitching and stabilization from moving cameras. IEEE Transactions on Image Processing, 25:5491–5503, 2016.
  • Igarashi et al. [2005] Takeo Igarashi, Tomer Moscovich, and John F. Hughes. As-rigid-as-possible shape manipulation. ACM SIGGRAPH 2005 Papers, 2005.
  • Institute of Computer Graphics and Vision, Graz University of Technology [2024] ICG drone dataset. http://dronedataset.icg.tugraz.at, 2024. Accessed: 2024-07-21.
  • Jia et al. [2021] Qi Jia, Zheng Li, Xin Fan, Haotian Zhao, Shiyu Teng, Xinchen Ye, and Longin Jan Latecki. Leveraging line-point consistence to preserve structures for wide parallax image stitching. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12181–12190, 2021.
  • Jia et al. [2023] Qi Jia, Xiaomei Feng, Yu Liu, Xin Fan, and Longin Jan Latecki. Learning pixel-wise alignment for unsupervised image stitching. Network, 1(1):1, 2023.
  • Kim et al. [2020] Hak Gu Kim, Heoun-taek Lim, and Yong Man Ro. Deep virtual reality image quality assessment with human perception guider for omnidirectional image. IEEE Transactions on Circuits and Systems for Video Technology, 30:917–928, 2020.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross B. Girshick. Segment anything. ArXiv, abs/2304.02643, 2023.
  • Li et al. [2018] Jing Li, Zhengming Wang, Shiming Lai, Yongping Zhai, and Maojun Zhang. Parallax-tolerant image stitching based on robust elastic warping. IEEE Transactions on Multimedia, 20:1672–1687, 2018.
  • Li et al. [2015] Shiwei Li, Lu Yuan, Jian Sun, and Long Quan. Dual-feature warping-based motion model estimation. 2015 IEEE International Conference on Computer Vision (ICCV), pages 4283–4291, 2015.
  • Lin et al. [2015a] Chung-Ching Lin, Sharath Pankanti, Karthikeyan Natesan Ramamurthy, and Aleksandr Y. Aravkin. Adaptive as-natural-as-possible image stitching. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1155–1163, 2015a.
  • Lin et al. [2015b] Chung-Ching Lin, Sharath Pankanti, Karthikeyan Natesan Ramamurthy, and Aleksandr Y. Aravkin. Adaptive as-natural-as-possible image stitching. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1155–1163, 2015b.
  • Lin et al. [2016] Kaimo Lin, Nianjuan Jiang, Loong Fah Cheong, Minh N. Do, and Jiangbo Lu. SEAGULL: Seam-guided local alignment for parallax-tolerant image stitching. In European Conference on Computer Vision, 2016.
  • Lin et al. [2011] Wen-Yan Lin, Siying Liu, Yasuyuki Matsushita, Tian-Tsong Ng, and Loong Fah Cheong. Smoothly varying affine stitching. CVPR 2011, pages 345–352, 2011.
  • Lo et al. [2018] I-Chan Lo, Kuang-Tsu Shih, and Homer H. Chen. Image stitching for dual fisheye cameras. 2018 25th IEEE International Conference on Image Processing (ICIP), pages 3164–3168, 2018.
  • Lyu et al. [2019] Wei Lyu, Zhong Zhou, Lang Chen, and Yi Zhou. A survey on image and video stitching. Virtual Real. Intell. Hardw., 1:55–83, 2019.
  • Mittal et al. [2013] Anish Mittal, Rajiv Soundararajan, and Alan Conrad Bovik. Making a "completely blind" image quality analyzer. IEEE Signal Processing Letters, 20:209–212, 2013.
  • Nie et al. [2021] Lang Nie, Chunyu Lin, Kang Liao, Shuaicheng Liu, and Yao Zhao. Unsupervised deep image stitching: Reconstructing stitched features to images. IEEE Transactions on Image Processing, 30:6184–6197, 2021.
  • Nie et al. [2023] Lang Nie, Chunyu Lin, Kang Liao, Shuaicheng Liu, and Yao Zhao. Parallax-tolerant unsupervised deep image stitching. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 7365–7374, 2023.
  • Nomura et al. [2007] Yoshikuni Nomura, Li Zhang, and Shree K. Nayar. Scene collages and flexible camera arrays. In Proceedings of the 18th Eurographics Conference on Rendering Techniques, pages 127–138, 2007.
  • Ren et al. [2024] Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: Assembling open-world models for diverse visual tasks. ArXiv, abs/2401.14159, 2024.
  • Sakharkar and Gupta [2013] Vrushali S. Sakharkar and S.R. Gupta. Image stitching techniques - an overview. Int. J. Comput. Sci. Appl., 6(2):324–330, 2013.
  • Setiawan et al. [2023] Muhammad Reza Setiawan, Muhamad Azrino Gustalika, and Muhammad Lulu Latif Usman. Implementation of virtual tour using image stitching as an introduction media of SMPN 1 Karangkobar to new students. Jurnal Teknik Informatika (Jutif), 4(5):1089–1098, 2023.
  • Singla and Sharma [2014] Savita Singla and Reecha Sharma. Medical image stitching using hybrid of SIFT & SURF techniques. 2014.
  • von Gioi et al. [2012] Rafael Grompone von Gioi, Jérémie Jakubowicz, Jean-Michel Morel, and Gregory Randall. LSD: a line segment detector. Image Process. Line, 2:35–55, 2012.
  • Widiyaningtyas et al. [2018] Triyanna Widiyaningtyas, Didik Dwi Prasetya, and Aji Prasetya Wibawa. Web-based campus virtual tour application using ORB image stitching. 2018 5th International Conference on Electrical Engineering, Computer Science and Informatics (EECSI), pages 46–49, 2018.
  • Xie and Tu [2015] Saining Xie and Zhuowen Tu. Holistically-nested edge detection. International Journal of Computer Vision, 125:3–18, 2015.
  • Xiong and Pulli [2009] Yingen Xiong and Kari Pulli. Sequential image stitching for mobile panoramas. 2009 7th International Conference on Information, Communications and Signal Processing (ICICS), pages 1–5, 2009.
  • Xiong et al. [2023] Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest N. Iandola, Raghuraman Krishnamoorthi, and Vikas Chandra. EfficientSAM: Leveraged masked image pretraining for efficient segment anything. ArXiv, abs/2312.00863, 2023.
  • Xu et al. [2020] Han Xu, Jiayi Ma, Junjun Jiang, Xiaojie Guo, and Haibin Ling. U2Fusion: A unified unsupervised image fusion network. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44:502–518, 2020.
  • Xu et al. [2023] Han Xu, Jiteng Yuan, and Jiayi Ma. MURF: Mutually reinforcing multi-modal image registration and fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45:12148–12166, 2023.
  • Xue et al. [2021] Wanli Xue, Zhe Zhang, and Shengyong Chen. Ghost elimination via multi-component collaboration for unmanned aerial vehicle remote sensing image stitching. Remote. Sens., 13:1388, 2021.
  • Zaragoza et al. [2013] Julio Zaragoza, Tat-Jun Chin, Quoc-Huy Tran, M.S. Brown, and David Suter. As-projective-as-possible image stitching with moving DLT. 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 2339–2346, 2013.
  • Zhang et al. [2023] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong-Seon Hong. Faster segment anything: Towards lightweight SAM for mobile applications. ArXiv, abs/2306.14289, 2023.
  • Zhang et al. [2022] Yujie Zhang, Xiaoguang Mei, Yong Ma, Xingyu Jiang, Zongyi Peng, and Jun Huang. Hyperspectral panoramic image stitching using robust matching and adaptive bundle adjustment. Remote. Sens., 14:4038, 2022.
  • Zhao et al. [2010] Xiu Ying Zhao, Hongyu Wang, and Yongxue Wang. Medical image seamlessly stitching by SIFT and GIST. 2010 International Conference on E-Product E-Service and E-Entertainment, pages 1–4, 2010.