Grounding text-to-image diffusion models for controlled high-quality image generation
Abstract
Large-scale text-to-image (T2I) diffusion models have emerged as the new state-of-the-art image generative models, demonstrating outstanding performance in synthesizing diverse, high-quality visuals from natural language text captions. Image generative models have a wide range of applications, including content creation, image editing, and medical imaging. Although simple and powerful, a text prompt alone is insufficient for tailoring the generation process to produce customized outputs. To address this limitation, multiple layout-to-image models have been developed to control the generation process by utilizing a broad array of layouts such as segmentation maps, edges, and human keypoints. In this thesis, we propose ObjectDiffusion, a model that builds on top of cutting-edge image generative frameworks to seamlessly extend T2I models with object names and their corresponding bounding boxes. Specifically, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the condition processing and injection techniques proposed in GLIGEN. ObjectDiffusion is initialized with pre-trained parameters to leverage the generation knowledge obtained from training on large-scale datasets. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model achieves an AP score of 27.4, an AP50 of 46.6, an AP75 of 28.2, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets in the AP50, AR, and FID metrics. ObjectDiffusion demonstrates a distinctive capability to synthesize diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control inputs. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding abilities in closed-set and open-set settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to integrate multiple grounding entities of different sizes and locations. The results of the ablation studies highlight the efficacy of our proposed weight initialization in harnessing the pre-training knowledge to enhance the performance of the conditional model.
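To make the grounding mechanism concrete, the sketch below illustrates the condition processing and injection scheme that GLIGEN describes publicly and that the abstract says ObjectDiffusion adopts: each (object name, bounding box) pair is fused into a grounding token, and visual tokens attend over these tokens through a gated self-attention layer whose gate starts at zero, so the pre-trained model is undisturbed at initialization. This is a minimal, hypothetical sketch, not ObjectDiffusion's actual implementation; the names fourier_embed, GroundingTokenizer, and GatedSelfAttention, and all dimensions and hyperparameters, are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

def fourier_embed(x: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    # Map each normalized box coordinate to sin/cos features at geometric frequencies.
    freqs = 2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device) * math.pi
    angles = x.unsqueeze(-1) * freqs                      # (..., 4, num_freqs)
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (..., 4, 2 * num_freqs)
    return emb.flatten(-2)                                 # (..., 4 * 2 * num_freqs)

class GroundingTokenizer(nn.Module):
    """Fuse an object-name text embedding with its box embedding into one grounding token."""
    def __init__(self, text_dim: int = 768, num_freqs: int = 8, out_dim: int = 768):
        super().__init__()
        box_dim = 4 * 2 * num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + box_dim, out_dim),
            nn.SiLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, name_emb: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # name_emb: (B, N, text_dim) pooled text features of the N object names
        # boxes:    (B, N, 4) normalized (x0, y0, x1, y1) coordinates in [0, 1]
        return self.mlp(torch.cat([name_emb, fourier_embed(boxes)], dim=-1))

class GatedSelfAttention(nn.Module):
    """Visual tokens attend over [visual; grounding] tokens; the result is added back
    through a tanh gate on a zero-initialized scalar, so the layer is a no-op at init."""
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # tanh(0) = 0 -> pre-trained behavior preserved

    def forward(self, visual: torch.Tensor, grounding: torch.Tensor) -> torch.Tensor:
        x = self.norm(torch.cat([visual, grounding], dim=1))
        out, _ = self.attn(x, x, x)
        n = visual.shape[1]
        return visual + torch.tanh(self.gate) * out[:, :n]

# Illustrative usage: two grounded objects in a batch of one.
tok = GroundingTokenizer()
gsa = GatedSelfAttention()
names = torch.randn(1, 2, 768)                   # stand-in for CLIP name embeddings
boxes = torch.tensor([[[0.10, 0.20, 0.50, 0.90],
                       [0.60, 0.10, 0.95, 0.40]]])
visual = torch.randn(1, 64, 768)                 # stand-in for U-Net feature tokens
out = gsa(visual, tok(names, boxes))             # (1, 64, 768)
```

The zero-initialized gate serves the same purpose as the weight initialization highlighted in the ablation studies: the fine-tuned grounding pathway is blended in gradually, so the generation knowledge of the pre-trained backbone is preserved at the start of training.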