Match-and-Fuse: Consistent Generation from Unstructured Image Sets

Supplementary Materials



Results

We provide the complete image sets used in Figs. 1 and 8, along with additional results.

Match-and-Fuse generates consistent content for rigid and non-rigid shared elements, in single- and multi-subject settings, with shared or varying backgrounds, preserving fine-grained consistency in textures, small details, and typography. Notably, it can generate consistent long sequences.


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "A dog-shaped balloon" \(\mathcal{P}^{theme}\): "winter"

Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "A dog sculpture" \(\mathcal{P}^{theme}\): "autumn"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Blue-white bag" \(\mathcal{P}^{theme}\): "luxury"

Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Vintage bag" \(\mathcal{P}^{theme}\): "retro"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "A cat-shaped balloon" \(\mathcal{P}^{theme}\): "winter"

Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Extraterrestrial alien pet" \(\mathcal{P}^{theme}\): "autumn"

Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "A cat knitted from a two-colored yarn" \(\mathcal{P}^{theme}\): "autumn"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Black-white knitted dog" \(\mathcal{P}^{theme}\): "autumn"


Source Set
image_0 image_1 image_2 image_3
Generations
result_0 result_1 result_2 result_3

\(\mathcal{P}^{shared}\): "Cartoon characters" \(\mathcal{P}^{theme}\): "cartoon"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5
Generations
result_0 result_1 result_2 result_3 result_4 result_5

\(\mathcal{P}^{shared}\): "A 'Summer Potion' can" \(\mathcal{P}^{theme}\): "dreamy summer"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "A man making a cocktail" \(\mathcal{P}^{theme}\): "inter-galactic kitchen"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5 image_6 image_7 image_8
Generations
result_0 result_1 result_2 result_3 result_4 result_5 result_6 result_7 result_8

\(\mathcal{P}^{shared}\): "70s car" \(\mathcal{P}^{theme}\): "retro"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5 image_6 image_7 image_8 image_9 image_10 image_11 image_12 image_13 image_14
Generations
result_0 result_1 result_2 result_3 result_4 result_5 result_6 result_7 result_8 result_9 result_10 result_11 result_12 result_13 result_14

\(\mathcal{P}^{shared}\): "Claymation character in glasses" \(\mathcal{P}^{theme}\): "claymation"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5 image_6 image_7 image_8 image_9
Generations
result_0 result_1 result_2 result_3 result_4 result_5 result_6 result_7 result_8 result_9

\(\mathcal{P}^{shared}\): "Pixar character" \(\mathcal{P}^{theme}\): "Pixar"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5 image_6 image_7 image_8 image_9
Generations
result_0 result_1 result_2 result_3 result_4 result_5 result_6 result_7 result_8 result_9

\(\mathcal{P}^{shared}\): "A real panda in a two-colored shirt" \(\mathcal{P}^{theme}\): "kindergarten"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5 image_6 image_7
Generations
result_0 result_1 result_2 result_3 result_4 result_5 result_6 result_7

\(\mathcal{P}^{shared}\): "A hamster in a cute costume" \(\mathcal{P}^{theme}\): "winter"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5 image_6
Generations
result_0 result_1 result_2 result_3 result_4 result_5 result_6

\(\mathcal{P}^{shared}\): "An exotic flower" \(\mathcal{P}^{theme}\): "jungle"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5 image_6
Generations
result_0 result_1 result_2 result_3 result_4 result_5 result_6

\(\mathcal{P}^{shared}\): "A flower toy" \(\mathcal{P}^{theme}\): "kids room"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Pixar character" \(\mathcal{P}^{theme}\): "Pixar"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5 image_6
Generations
result_0 result_1 result_2 result_3 result_4 result_5 result_6

\(\mathcal{P}^{shared}\): "Gingerbread house" \(\mathcal{P}^{theme}\): "gingerbread world"


Source Set
image_0 image_1 image_2 image_3
Generations
result_0 result_1 result_2 result_3

\(\mathcal{P}^{shared}\): "Fine-dining dessert" \(\mathcal{P}^{theme}\): "luxury restaurant"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Graffiti-styled monster toy" \(\mathcal{P}^{theme}\): "graffiti"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Rusty car" \(\mathcal{P}^{theme}\): "retro"


Source Set
image_0 image_1 image_2 image_3
Generations
result_0 result_1 result_2 result_3

\(\mathcal{P}^{shared}\): "Space capsule" \(\mathcal{P}^{theme}\): "space"


Source Set
image_0 image_1 image_2
Generations
result_0 result_1 result_2

\(\mathcal{P}^{shared}\): "Loft room" \(\mathcal{P}^{theme}\): "Loft"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Magical waterfall at sunset" \(\mathcal{P}^{theme}\): "sunset"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "warm woolen slipper" \(\mathcal{P}^{theme}\): "winter"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5
Generations
result_0 result_1 result_2 result_3 result_4 result_5

\(\mathcal{P}^{shared}\): "Indiana Jones-themed satchel" \(\mathcal{P}^{theme}\): "jungle"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5
Generations
result_0 result_1 result_2 result_3 result_4 result_5

\(\mathcal{P}^{shared}\): "Figurine of a Matrix character" \(\mathcal{P}^{theme}\): "Matrix movie"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5 image_6
Generations
result_0 result_1 result_2 result_3 result_4 result_5 result_6

\(\mathcal{P}^{shared}\): "Rabbit from Alice in Wonderland" \(\mathcal{P}^{theme}\): "Alice in Wonderland"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Gummy bear with 'Cheer Up' text" \(\mathcal{P}^{theme}\): "jelly world"

Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Horror teddy bear with 'Eternal Rest' text" \(\mathcal{P}^{theme}\): "gothic horror"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Two-colored metal statue" \(\mathcal{P}^{theme}\): "winter"

Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Horror-style animal" \(\mathcal{P}^{theme}\): "night"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Cat robot" \(\mathcal{P}^{theme}\): "winter"

Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "An animal knitted from a two-colored yarn" \(\mathcal{P}^{theme}\): "winter"

Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Wooden cat statue" \(\mathcal{P}^{theme}\): "winter"

Storyboards

We provide the complete image sets used in Fig. 10a, along with additional results.
Match-and-Fuse generalizes to sketched inputs, enabling controlled storyboard generation.


Source Set
image_0 image_1 image_2 image_3 image_4 image_5
Generations
result_0 result_1 result_2 result_3 result_4 result_5

\(\mathcal{P}^{shared}\): "A young man" \(\mathcal{P}^{theme}\): "simple life"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5
Generations
result_0 result_1 result_2 result_3 result_4 result_5

\(\mathcal{P}^{shared}\): "A pink octopus with yellow spots" \(\mathcal{P}^{theme}\): "underwater"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5
Generations
result_0 result_1 result_2 result_3 result_4 result_5

\(\mathcal{P}^{shared}\): "Vibrant sneakers" \(\mathcal{P}^{theme}\): "gloomy day"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5
Generations
result_0 result_1 result_2 result_3 result_4 result_5

\(\mathcal{P}^{shared}\): "Striped dress" \(\mathcal{P}^{theme}\): "simple"


Source Set
image_0 image_1 image_2 image_3
Generations
result_0 result_1 result_2 result_3

\(\mathcal{P}^{shared}\): "A couple's life" \(\mathcal{P}^{theme}\): "sunny day"

Localized Editing

We provide the complete image sets used in Fig. 10b, along with additional results.
Match-and-Fuse enables consistent, localized editing without \(\mathcal{P}^{theme}\), achieved through integration with FlowEdit [1] (see Ap. A.1). All results are produced using default integration parameters. However, due to the preservation–editability trade-off inherited from FlowEdit, some cases (e.g., the sheep example) may exhibit slight structural deviations and benefit from mild per-edit hyperparameter tuning. Automating this selection is left for future work.


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "A princess keychain"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5
Generations
result_0 result_1 result_2 result_3 result_4 result_5

\(\mathcal{P}^{shared}\): "A book titled 'Timetravel Manual'"


Source Set
image_0 image_1 image_2 image_3 image_4 image_5 image_6 image_7
Generations
result_0 result_1 result_2 result_3 result_4 result_5 result_6 result_7

\(\mathcal{P}^{shared}\): "A sheep in a red t-shirt"


Source Set
image_0 image_1 image_2 image_3 image_4
Generations
result_0 result_1 result_2 result_3 result_4

\(\mathcal{P}^{shared}\): "Extraterrestrial alien pet"


Source Set
image_0 image_1 image_2 image_3
Generations
result_0 result_1 result_2 result_3

\(\mathcal{P}^{shared}\): "A statue of panda in clothes"


Source Set
image_0 image_1 image_2 image_3
Generations
result_0 result_1 result_2 result_3

\(\mathcal{P}^{shared}\): "A hippy car"


Source Set
image_0 image_1 image_2 image_3
Generations
result_0 result_1 result_2 result_3

\(\mathcal{P}^{shared}\): "Plush Stitch"


Source Set
image_0 image_1 image_2
Generations
result_0 result_1 result_2

\(\mathcal{P}^{shared}\): "Disney-styled bike"


Source Set
image_0 image_1 image_2
Generations
result_0 result_1 result_2

\(\mathcal{P}^{shared}\): "A cat-shaped candy"

Comparisons

We provide the complete image sets used in Fig. 9, along with additional comparisons.

FLUX [2] produces inconsistencies in both coarse and fine details.
FLUX.1 Kontext [3] shows low prompt adherence (e.g., the dog and cat edits remain animal-like in some images) and often distorts object structure (e.g., vase, dog).
IC-LoRA [4] achieves partial consistency, with coherence often restricted to subsets (vase, dog, cat). Its realism and fidelity are notably lower than FLUX's.
Edicho [5] performs best among the baselines but still shows noticeable inconsistencies (dog): its one-to-all warping enforces consistency only with the first image, making the choice of anchor ambiguous and causing diverging appearances across views. These results were provided by the authors for qualitative comparison.
Edicho-Inpaint* [5] uses the same underlying approach but generates content only within the shared-region masks, omitting \(\mathcal{P}^{theme}\). Since we evaluate only the shared regions, this setup is comparable and is used in the reported numeric evaluation, as the code for the ControlNet pipeline was unavailable at publication time. The additional results reveal that Edicho's attention-warping mechanism lacks robustness under viewpoint variation, producing severe artifacts.
In contrast, Match-and-Fuse (Ours) maintains high image quality while substantially improving structural and fine-grained consistency.


Source Set
image_0 image_1 image_2 image_3 image_4 image_5

\(\mathcal{P}^{shared}\): "A dog toy with two-colored yarn" \(\mathcal{P}^{theme}\): "winter"

FLUX.1-dev
result_0 result_1 result_2 result_3 result_4 result_5
FLUX.1 Kontext
result_0 result_1 result_2 result_3 result_4 result_5
IC-LoRA
result_0 result_1 result_2 result_3 result_4 result_5
Edicho-Inpaint*
result_0 result_1 result_2 result_3 result_4 result_5
Edicho
result_0 result_1 result_2 result_3 result_4 result_5
Ours
result_0 result_1 result_2 result_3 result_4 result_5

Source Set
image_0 image_1 image_2 image_3 image_4

\(\mathcal{P}^{shared}\): "A marble cat statue" \(\mathcal{P}^{theme}\): "autumn"

FLUX.1-dev
result_0 result_1 result_2 result_3 result_4
FLUX.1 Kontext
result_0 result_1 result_2 result_3 result_4
IC-LoRA
result_0 result_1 result_2 result_3 result_4
Edicho-Inpaint*
result_0 result_1 result_2 result_3 result_4
Edicho
result_0 result_1 result_2 result_3 result_4
Ours
result_0 result_1 result_2 result_3 result_4

Source Set
image_0 image_1 image_3 image_4

\(\mathcal{P}^{shared}\): "A plush slipper " \(\mathcal{P}^{theme}\): "modern"

FLUX.1-dev
result_0 result_1 result_3 result_4
FLUX.1 Kontext
result_0 result_1 result_3 result_4
IC-LoRA
result_0 result_1 result_3 result_4
Edicho-Inpaint*
result_0 result_1 result_3 result_4
Edicho
result_0 result_1 result_3 result_4
Ours
result_0 result_1 result_3 result_4

Source Set
image_1 image_2 image_3

\(\mathcal{P}^{shared}\): "A cute disney-styled vase" \(\mathcal{P}^{theme}\): "rural"

FLUX.1-dev
result_1 result_2 result_3
FLUX.1 Kontext
result_1 result_2 result_3
IC-LoRA
result_1 result_2 result_3
Edicho-Inpaint*
result_1 result_2 result_3
Edicho
result_1 result_2 result_3
Ours
result_1 result_2 result_3

Source Set
image_0 image_2 image_3 image_4 image_6 image_7 image_8

\(\mathcal{P}^{shared}\): "two-colored vintage car" \(\mathcal{P}^{theme}\): "vintage"

FLUX.1-dev
result_0 result_2 result_3 result_4 result_6 result_7 result_8
FLUX.1 Kontext
result_0 result_2 result_3 result_4 result_6 result_7 result_8
IC-LoRA
result_0 result_2 result_3 result_4 result_6 result_7 result_8
Edicho-Inpaint*
result_0 result_2 result_3 result_4 result_6 result_7 result_8
Ours
result_0 result_2 result_3 result_4 result_6 result_7 result_8

Source Set
image_0 image_1 image_2 image_3 image_4

\(\mathcal{P}^{shared}\): "Pixar characters" \(\mathcal{P}^{theme}\): "plain background"

FLUX.1-dev
result_0 result_1 result_2 result_3 result_4
FLUX.1 Kontext
result_0 result_1 result_2 result_3 result_4
IC-LoRA
result_0 result_1 result_2 result_3 result_4
Edicho-Inpaint*
result_0 result_1 result_2 result_3 result_4
Ours
result_0 result_1 result_2 result_3 result_4

Extended Comparisons

We provide additional comparisons to Nano Banana [6], a closed-source image generation and editing model. The results were obtained through the API by sending a 2x3 image grid with the prompt: "Change the <object> in all images into \(\mathcal{P}^{shared}\). Keep the <object> poses as in the original images. Make the backgrounds look like a \(\mathcal{P}^{theme}\) setting."
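For clarity, the 2x3 grid construction used to query Nano Banana can be sketched as follows. This is a minimal sketch under our own assumptions (numpy arrays as images, equal tile sizes, row-major tiling); the helper name `make_grid` and the tile sizes are illustrative, not part of any API:

```python
import numpy as np

def make_grid(images, rows=2, cols=3):
    """Tile equally sized HxWxC image arrays into a rows x cols grid (row-major)."""
    assert len(images) == rows * cols, "expected exactly rows*cols images"
    row_strips = [np.concatenate(images[r * cols:(r + 1) * cols], axis=1)
                  for r in range(rows)]
    return np.concatenate(row_strips, axis=0)

# Example: six 512x512 placeholders -> one 1024x1536 grid
tiles = [np.zeros((512, 512, 3), dtype=np.uint8) for _ in range(6)]
grid = make_grid(tiles)
print(grid.shape)  # (1024, 1536, 3)
```

The resulting grid image is then sent to the model together with the prompt above; the API call itself is omitted here.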

Nano Banana fails to preserve the original layouts and tends to copy-paste duplicated elements across images, both foreground objects (dog, cat) and backgrounds (dog, cat, shoes, vase). Moreover, it often fails to apply the requested edit altogether (dog, cat). In cases where results do differ across the set, they lack consistency (vase).


Source Set
image_0 image_1 image_2 image_3 image_4 image_5

\(\mathcal{P}^{shared}\): "A dog toy with two-colored yarn" \(\mathcal{P}^{theme}\): "winter"

Nano Banana
result_0 result_1 result_2 result_3 result_4 result_5
Ours
result_0 result_1 result_2 result_3 result_4 result_5

Source Set
image_0 image_1 image_2 image_3 image_4

\(\mathcal{P}^{shared}\): "A marble cat statue" \(\mathcal{P}^{theme}\): "autumn"

Nano Banana
result_0 result_1 result_2 result_3 result_4
Ours
result_0 result_1 result_2 result_3 result_4

Source Set
image_0 image_1 image_2 image_3 image_4

\(\mathcal{P}^{shared}\): "A plush slipper " \(\mathcal{P}^{theme}\): "modern"

Nano Banana
result_0 result_1 result_2 result_3 result_4
Ours
result_0 result_1 result_2 result_3 result_4

Source Set
image_0 image_1 image_2 image_3 image_4

\(\mathcal{P}^{shared}\): "A cute disney-styled vase" \(\mathcal{P}^{theme}\): "rural"

Nano Banana
result_0 result_1 result_2 result_3 result_4
Ours
result_0 result_1 result_2 result_3 result_4

Ablations

We provide the complete image sets used in Fig. 7, along with additional ablation comparisons.
W/o Pairwise Consistency Graph, correspondences alone cannot align appearances, leading to identity drift.
W/o MFF and w/o Feature Guidance (both disabled) corresponds to a configuration that makes no use of the source 2D matches.
W/o Multiview Feature Fusion, the latent versions aggregated across graph edges diverge more easily, reducing consistency.
Omitting Feature Guidance at each step leads to misaligned fine-grained details.
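The fusion ablated above can be pictured, very roughly, as averaging latents over the edges of the pairwise graph. The sketch below is an illustration under our own assumptions (edges given explicitly as (i, j) keys, precomputed token-level matches, plain mean fusion), not the paper's implementation:

```python
import numpy as np

def fuse_latents(latents, matches):
    """Average each token's latent with matched tokens from all graph neighbors.

    latents: dict node_id -> (n_tokens, dim) array
    matches: dict (i, j)  -> list of (token_in_i, token_in_j) index pairs;
             every (i, j) key is an edge of the pairwise consistency graph.
    """
    sums = {k: v.copy() for k, v in latents.items()}
    counts = {k: np.ones(len(v)) for k, v in latents.items()}
    for (i, j), pairs in matches.items():
        for ti, tj in pairs:
            # Accumulate the neighbor's original latent on both endpoints.
            sums[i][ti] += latents[j][tj]
            counts[i][ti] += 1
            sums[j][tj] += latents[i][ti]
            counts[j][tj] += 1
    return {k: sums[k] / counts[k][:, None] for k in sums}

# Two images, one matched token each: both fused latents become the mean.
lat = {0: np.array([[0.0, 0.0]]), 1: np.array([[2.0, 2.0]])}
out = fuse_latents(lat, {(0, 1): [(0, 0)]})
print(out[0])  # [[1. 1.]]
```

Removing the edges (w/o Pairwise Graph) or the averaging itself (w/o MFF) leaves each latent to evolve independently, which is what the ablations visualize.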


Source Set
image_0 image_1 image_2 image_3 image_4

\(\mathcal{P}^{shared}\): "An orange dog-styled backpack" \(\mathcal{P}^{theme}\): "autumn"

w/o Pairwise Graph
result_0 result_1 result_2 result_3 result_4
w/o MFF & w/o Guidance
result_0 result_1 result_2 result_3 result_4
w/o Multiview Feature Fusion
result_0 result_1 result_2 result_3 result_4
w/o Feature Guidance
result_0 result_1 result_2 result_3 result_4
Full Method (Ours)
result_0 result_1 result_2 result_3 result_4

References


[1] Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. FlowEdit: Inversion-free text-based editing using pre-trained flow models. arXiv preprint arXiv:2412.08629, 2024.
[2] Black Forest Labs. FLUX. https://github.com/black-forest-labs/flux, 2024. Accessed: 2024-09-24.
[3] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space, 2025.
[4] Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context LoRA for diffusion transformers. arXiv preprint, 2024.
[5] Qingyan Bai, Hao Ouyang, Yinghao Xu, Qiuyu Wang, Ceyuan Yang, Ka Leong Cheng, Yujun Shen, and Qifeng Chen. Edicho: Consistent image editing in the wild. arXiv preprint arXiv:2412.21079, 2024.
[6] Google DeepMind. Gemini 2.5 Flash Image ("Nano Banana") model/API, 2025. Accessible via the Google Gemini API.