We provide the complete image sets used in Figs. 1 and 8, along with additional results.
Match-and-Fuse generates consistent content for rigid and non-rigid shared elements, in single- and multi-subject settings, with shared or varying backgrounds, while preserving fine-grained consistency in textures, small details, and typography. Notably, it can generate consistent long sequences.
We provide the complete image sets used in Fig. 10a, along with additional
results.
Match-and-Fuse generalizes to sketched inputs, enabling controlled storyboard generation.
We provide the complete image sets used in Fig. 10b, along with additional
results.
Match-and-Fuse enables consistent, localized editing without \(\mathcal{P}^{theme}\), achieved
through integration with FlowEdit [1] (see App. A.1).
All results are produced using default integration parameters.
However, due to the preservation–editability trade-off inherited from FlowEdit, some cases
(e.g., the sheep example)
may exhibit slight structural deviations and benefit from mild per-edit hyperparameter tuning.
Automating this tuning is left for future work.
We provide the complete image sets used in Fig. 9, along with additional comparisons.
FLUX [2] produces inconsistencies in both coarse and
fine details.
FLUX Kontext [3] has low prompt adherence (e.g., the
dog and cat edits remain animal-like in some images), and often distorts object structure (e.g.,
vase, dog).
IC-LoRA [4] achieves partial consistency, with
coherence often restricted to subsets of the images (vase, dog, cat). Its realism and fidelity are notably
lower than those of FLUX.
Edicho [5] performs best among the baselines but still
shows noticeable inconsistencies (dog), as its one-to-all warping enforces consistency only with
the first image, making the choice of an anchor ambiguous and causing diverging appearances
across views.
These results were provided by the authors for qualitative comparison.
Edicho-Inpaint* [5] uses the same underlying
approach, but generates content only within the shared-region masks, omitting
\(\mathcal{P}^{theme}\).
Since we evaluate only the shared regions, this setup is comparable and is used in the reported
numerical evaluation, as the code for the ControlNet pipeline was unavailable at publication time.
The additional results reveal that Edicho’s attention warping mechanism lacks robustness under
viewpoint variation, producing severe artifacts.
In contrast, Match-and-Fuse (Ours) maintains high image quality while
substantially improving structural and fine-grained consistency.
We provide additional comparisons to the closed-source image generation and editing model Nano Banana [6]. The results were obtained through the API by sending a 2×3 image grid with the prompt: "Change the <object> in all images into \(\mathcal{P}^{shared}\). Keep the <object> poses as in the original images. Make the backgrounds look like a \(\mathcal{P}^{theme}\) setting."
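For reference, a minimal sketch of this querying procedure is given below, assuming the google-generativeai Python SDK; the model identifier, file names, and the <object>, <P_shared>, and <P_theme> placeholders (standing for \(\mathcal{P}^{shared}\) and \(\mathcal{P}^{theme}\)) are illustrative and not part of an official recipe.
\begin{verbatim}
# Illustrative sketch only (not the exact evaluation script). Assumes the
# google-generativeai SDK; model name and file paths are placeholders.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

def make_grid(paths, rows=2, cols=3):
    """Tile the input images into a single rows x cols grid."""
    imgs = [Image.open(p).convert("RGB") for p in paths]
    w, h = imgs[0].size
    grid = Image.new("RGB", (cols * w, rows * h))
    for i, im in enumerate(imgs):
        grid.paste(im.resize((w, h)), ((i % cols) * w, (i // cols) * h))
    return grid

grid = make_grid([f"input_{i}.png" for i in range(6)])  # hypothetical paths

prompt = ("Change the <object> in all images into <P_shared>. "
          "Keep the <object> poses as in the original images. "
          "Make the backgrounds look like a <P_theme> setting.")

# The model identifier below is an assumption; substitute whichever
# image-editing model the API exposes. The edited grid is returned in
# the response parts.
model = genai.GenerativeModel("gemini-2.5-flash-image")
response = model.generate_content([prompt, grid])
\end{verbatim}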
Nano Banana fails to preserve the original layouts and tends to copy-paste elements across images: foreground objects (dog, cat) as well as backgrounds (dog, cat, shoes, vase). Moreover, it often fails to apply the requested edit altogether (dog, cat). In cases where the results do differ across the set, they lack consistency (vase).
We provide the complete image sets used in Fig. 7, along with additional
ablation comparisons.
W/o Pairwise Consistency Graph, correspondences alone cannot align appearances,
leading to identity drift.
W/o MFF and w/o Feature Guidance (both disabled) corresponds
to a configuration that does not use the source 2D matches.
W/o Multiview Feature Fusion, the latent versions that would otherwise be aggregated across graph edges diverge
more easily, reducing consistency (see the illustrative sketch below).
Omitting Feature Guidance at each step leads to misaligned fine-grained
details.
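As a purely illustrative sketch, and not the actual implementation, the snippet below shows one way the latents of two images connected by a graph edge could be averaged at matched 2D locations, i.e., the kind of edge-wise aggregation the MFF ablation removes; the tensor shapes and the correspondence format are assumptions.
\begin{verbatim}
# Illustrative sketch only, not the paper's implementation. Assumes latents
# of shape (C, H, W) and per-edge correspondences as (y, x) pixel indices.
import torch

def fuse_edge(lat_i, lat_j, matches_i, matches_j):
    """Average the latents of two images at matched spatial locations.

    lat_i, lat_j: (C, H, W) tensors at the same latent resolution.
    matches_i, matches_j: (N, 2) long tensors of corresponding (y, x)
    positions along one edge of the pairwise consistency graph.
    """
    yi, xi = matches_i[:, 0], matches_i[:, 1]
    yj, xj = matches_j[:, 0], matches_j[:, 1]
    fused = 0.5 * (lat_i[:, yi, xi] + lat_j[:, yj, xj])  # (C, N)
    lat_i, lat_j = lat_i.clone(), lat_j.clone()
    lat_i[:, yi, xi] = fused
    lat_j[:, yj, xj] = fused
    return lat_i, lat_j
\end{verbatim}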
[1] Vladimir Kulikov, Matan Kleiner, Inbar Huberman-Spiegelglas, and Tomer Michaeli. FlowEdit: Inversion-free text-based editing using pre-trained flow models. arXiv preprint arXiv:2412.08629, 2024.
[2] Black Forest Labs. FLUX: Diffusion models for layered image generation. https://github.com/black-forest-labs/flux, 2024. Accessed: 2024-09-24.
[3] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space, 2025.
[4] Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, and Jingren Zhou. In-context LoRA for diffusion transformers. arXiv preprint, 2024.
[5] Qingyan Bai, Hao Ouyang, Yinghao Xu, Qiuyu Wang, Ceyuan Yang, Ka Leong Cheng, Yujun Shen, and Qifeng Chen. Edicho: Consistent image editing in the wild. arXiv preprint arXiv:2412.21079, 2024.
[6] Google DeepMind. Gemini 2.5 Flash Image (“Nano Banana”) model/API, 2025. Accessible via the Google Gemini API.