We provide the complete image sets used in Figs. 1 and 8, along with additional results.
Match-and-Fuse generates consistent content for rigid and non-rigid shared elements, in single- and multi-subject settings, with shared or varying backgrounds, while preserving fine-grained consistency in textures, small details, and typography. Notably, it can generate consistent long sequences.
We provide the complete image sets used in Fig. 10a, along with additional
results.
Match-and-Fuse generalizes to sketched inputs, enabling controlled storyboard generation.
We provide the complete image sets used in Fig. 10b, along with additional
results.
Match-and-Fuse enables consistent, localized editing without \(\mathcal{P}^{theme}\), achieved
through integration with FlowEdit [1] (see App. A.1).
All results are produced using default integration parameters.
However, due to the preservation–editability trade-off inherited from FlowEdit, some cases
(e.g., the sheep example)
may exhibit slight structural deviations and benefit from mild per-edit hyperparameter tuning.
Automating this tuning is left for future work.
We provide the complete image sets used in Fig. 9, along with additional comparisons.
FLUX [2] produces inconsistencies in both coarse and
fine details.
FLUX Kontext [3] has low prompt adherence (e.g., the
dog and cat edits remain animal-like in some images), and often distorts object structure (e.g.,
vase, dog).
IC-LoRA [4] achieves partial consistency, with
coherence often restricted to subsets of the images (vase, dog, cat). Its realism and fidelity are notably
lower than those of FLUX.
Edicho [5] performs best among the baselines but still
shows noticeable inconsistencies (dog), as its one-to-all warping enforces consistency only with
the first image, making the choice of an anchor ambiguous and causing diverging appearances
across views.
These results were provided by the authors for qualitative comparison.
Edicho-Inpaint* [5] uses the same underlying
approach, but generates content only within the shared-region masks, omitting
\(\mathcal{P}^{theme}\).
Since we evaluate only the shared regions, this setup is comparable and is used in the reported
numerical evaluation, as the code for the ControlNet pipeline was unavailable at publication time.
The additional results reveal that Edicho’s attention warping mechanism lacks robustness under
viewpoint variation, producing severe artifacts.
In contrast, Match-and-Fuse (Ours) maintains high image quality while
substantially improving structural and fine-grained consistency.
We provide additional comparisons to the closed-source image generation and editing model Nano Banana [6]. The results were obtained through the API by sending a 2×3 image grid with the prompt: "Change the <object> in all images into \(\mathcal{P}^{shared}\). Keep the <object> poses as in the original images. Make the backgrounds look like a \(\mathcal{P}^{theme}\) setting."
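For reference, a minimal sketch of this querying procedure is given below, assuming the google-generativeai Python SDK; the model identifier, file names, and the <object>, <P_shared>, and <P_theme> placeholders (standing for \(\mathcal{P}^{shared}\) and \(\mathcal{P}^{theme}\)) are illustrative and not part of an official recipe.
\begin{verbatim}
# Illustrative sketch only (not the exact evaluation script). Assumes the
# google-generativeai SDK; model name and file paths are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

def make_grid(paths, rows=2, cols=3):
    """Tile the input images into a single rows x cols grid."""
    imgs = [Image.open(p).convert("RGB") for p in paths]
    w, h = imgs[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, im in enumerate(imgs):
        grid.paste(im.resize((w, h)), ((i % cols) * w, (i // cols) * h))
    return grid

grid = make_grid([f"input_{i}.png" for i in range(6)])  # hypothetical paths

prompt = ("Change the <object> in all images into <P_shared>. "
          "Keep the <object> poses as in the original images. "
          "Make the backgrounds look like a <P_theme> setting.")

# The model identifier below is an assumption; substitute whichever
# image-editing model the API exposes. The edited grid is returned in
# the response parts.
model = genai.GenerativeModel("gemini-2.5-flash-image")
response = model.generate_content([prompt, grid])
\end{verbatim}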
Nano Banana fails to preserve the original layouts and tends to copy-paste elements across images: foreground objects (dog, cat) as well as backgrounds (dog, cat, shoes, vase). Moreover, it often fails to apply the requested edit altogether (dog, cat). In cases where the results do differ across the set, they lack consistency (vase).
We provide the complete image sets used in Fig. 7, along with additional
ablation comparisons.
W/o Pairwise Consistency Graph, correspondences alone cannot align appearances,
leading to identity drift.
W/o MFF and w/o Feature Guidance (both disabled) corresponds
to a configuration that does not use the source 2D matches.
W/o Multiview Feature Fusion, the latent versions that would otherwise be aggregated across graph edges diverge
more easily, reducing consistency (see the illustrative sketch below).
Omitting Feature Guidance at each step leads to misaligned fine-grained
details.
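As a purely illustrative sketch, and not the actual implementation, the snippet below shows one way the latents of two images connected by a graph edge could be averaged at matched 2D locations, i.e., the kind of edge-wise aggregation the MFF ablation removes; the tensor shapes and the correspondence format are assumptions.
\begin{verbatim}
# Illustrative sketch only, not the paper's implementation. Assumes latents
# of shape (C, H, W) and per-edge correspondences as (y, x) pixel indices.
import torch

def fuse_edge(lat_i, lat_j, matches_i, matches_j):
    """Average the latents of two images at matched spatial locations.

    lat_i, lat_j: (C, H, W) tensors at the same latent resolution.
    matches_i, matches_j: (N, 2) long tensors of corresponding (y, x)
    positions along one edge of the pairwise consistency graph.
    """
    yi, xi = matches_i[:, 0], matches_i[:, 1]
    yj, xj = matches_j[:, 0], matches_j[:, 1]
    fused = 0.5 * (lat_i[:, yi, xi] + lat_j[:, yj, xj])  # (C, N)
    lat_i, lat_j = lat_i.clone(), lat_j.clone()
    lat_i[:, yi, xi] = fused
    lat_j[:, yj, xj] = fused
    return lat_i, lat_j
\end{verbatim}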
[1] Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. FlowEdit: Inversion-free text-based editing using pre-trained flow models. arXiv preprint arXiv:2412.08629, 2024.
[2] Black Forest Labs. FLUX: Diffusion models for layered image generation. https://github.com/black-forest-labs/flux, 2024. Accessed: 2024-09-24.
[3] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space, 2025.
[4] Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context LoRA for diffusion transformers. arXiv preprint, 2024.
[5] Qingyan Bai, Hao Ouyang, Yinghao Xu, Qiuyu Wang, Ceyuan Yang, Ka Leong Cheng, Yujun Shen, and Qifeng Chen. Edicho: Consistent image editing in the wild. arXiv preprint arXiv:2412.21079, 2024.
[6] Google DeepMind. Gemini 2.5 Flash Image (“Nano Banana”) model/API, 2025. Accessible via the Google Gemini API.