FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning

¹South China University of Technology, ²SCUT-Zhuhai Institute of Modern Industrial Innovation, ³Alibaba Group

Abstract

Automatic font generation is an imitation task, which aims to create a font library that mimics the style of reference images while preserving the content from source images. Although existing font generation methods have achieved satisfactory performance, they still struggle with complex characters and large style variations. To address these issues, we propose FontDiffuser, a diffusion-based image-to-image one-shot font generation method, which innovatively models the font imitation task as a noise-to-denoise paradigm. In our method, we introduce a Multi-scale Content Aggregation (MCA) block, which effectively combines global and local content cues across different scales, leading to enhanced preservation of intricate strokes of complex characters. Moreover, to better manage the large variations in style transfer, we propose a Style Contrastive Refinement (SCR) module, which is a novel structure for style representation learning. It utilizes a style extractor to disentangle styles from images, subsequently supervising the diffusion model via a meticulously designed style contrastive loss. Extensive experiments demonstrate FontDiffuser’s state-of-the-art performance in generating diverse characters and styles. It consistently excels on complex characters and large style changes compared to previous methods.

Motivation

  • Task Definition: Font generation is an imitation task, which aims to create a font library that mimics the style of reference images while preserving the content from source images.
  • Existing Problems: Existing font generation methods have achieved satisfactory performance, but they still struggle with complex characters and large style variations, often resulting in missing strokes, artifacts, blurriness, layout errors, and style inconsistency, as shown in the figure above.
  • Reason Analysis: (1) Most methods adopt a GAN-based framework, which potentially suffers from unstable training due to its adversarial nature. (2) Most of these methods perceive content information only through single-scale high-level features, omitting the fine-grained details that are crucial for preserving the source content, especially for complex characters. (3) Many methods employ prior knowledge, such as the stroke or component composition of characters, to facilitate font generation; however, this information is costly to annotate for complex characters. (4) The target style is commonly represented by a simple classifier or discriminator in previous literature, which struggles to learn an appropriate style representation and hinders style transfer with large variations.
  • Strategy: FontDiffuser is a diffusion-based image-to-image one-shot font generation method, which models font generation as a noise-to-denoise paradigm and is capable of generating unseen characters and styles. (1) We introduce a Multi-scale Content Aggregation (MCA) block, which leverages global and local content features across various scales. (2) We introduce a novel style representation learning strategy, applying a Style Contrastive Refinement (SCR) module to enhance the generator's capability to mimic styles.
  • Framework


    Overview of our proposed method. (a) The conditional diffusion model is a UNet-based network composed of a content encoder Ec and a style encoder Es. The reference image Xs is passed through the style encoder Es and the content encoder Ec, yielding a style embedding e and structure maps Fs, respectively. The source image is encoded by the content encoder Ec; to obtain multi-scale features Fc, we take the outputs of different layers of Ec and inject each of them into the UNet through our proposed MCA block. The RSI block performs spatial deformation based on the reference structural features Fs. (b) The Style Contrastive Refinement (SCR) module disentangles different styles from images and provides guidance to the diffusion model.
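
    To make part (a) concrete, below is a minimal PyTorch-style sketch, not the released implementation, of how multi-scale content features from Ec could be injected into a UNet-style denoiser through MCA-like fusion blocks, with the style embedding e applied as a simple per-channel bias. All module names, channel sizes, and conditioning choices are illustrative assumptions; the timestep embedding, the RSI block, and the decoder half of the UNet are omitted for brevity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ContentEncoder(nn.Module):
        # Encodes the source glyph and returns features at 1/2, 1/4, and 1/8 scale.
        def __init__(self, ch=32):
            super().__init__()
            self.stages = nn.ModuleList([
                nn.Conv2d(1, ch, 3, 2, 1),
                nn.Conv2d(ch, 2 * ch, 3, 2, 1),
                nn.Conv2d(2 * ch, 4 * ch, 3, 2, 1),
            ])

        def forward(self, x):
            feats = []
            for stage in self.stages:
                x = F.relu(stage(x))
                feats.append(x)
            return feats  # multi-scale content features Fc

    class MCABlock(nn.Module):
        # Fuses one content feature map into the UNet feature map of the same scale.
        def __init__(self, unet_ch, content_ch):
            super().__init__()
            self.fuse = nn.Conv2d(unet_ch + content_ch, unet_ch, 1)

        def forward(self, h, f_c):
            f_c = F.interpolate(f_c, size=h.shape[-2:], mode="bilinear",
                                align_corners=False)
            return self.fuse(torch.cat([h, f_c], dim=1))

    class TinyDenoiser(nn.Module):
        # Toy stand-in for the UNet encoder: each strided stage is followed by an
        # MCA fusion with the matching-scale content feature, and the style
        # embedding is added as a per-channel bias.
        def __init__(self, ch=32, style_dim=128):
            super().__init__()
            self.inp = nn.Conv2d(1, ch, 3, 1, 1)
            self.downs = nn.ModuleList([
                nn.Conv2d(ch, ch, 3, 2, 1),
                nn.Conv2d(ch, 2 * ch, 3, 2, 1),
                nn.Conv2d(2 * ch, 4 * ch, 3, 2, 1),
            ])
            self.mcas = nn.ModuleList([
                MCABlock(ch, ch), MCABlock(2 * ch, 2 * ch), MCABlock(4 * ch, 4 * ch),
            ])
            self.style_proj = nn.ModuleList([
                nn.Linear(style_dim, ch),
                nn.Linear(style_dim, 2 * ch),
                nn.Linear(style_dim, 4 * ch),
            ])
            self.out = nn.Conv2d(4 * ch, 1, 3, 1, 1)  # coarse noise prediction

        def forward(self, x_t, content_feats, style_emb):
            # Timestep embedding omitted for brevity.
            h = F.relu(self.inp(x_t))
            for down, mca, proj, f_c in zip(self.downs, self.mcas,
                                            self.style_proj, content_feats):
                h = F.relu(down(h))
                h = mca(h, f_c)                            # content injection (MCA)
                h = h + proj(style_emb)[:, :, None, None]  # style conditioning
            return F.interpolate(self.out(h), size=x_t.shape[-2:],
                                 mode="bilinear", align_corners=False)

    # Example with random 64x64 glyphs: predict the noise added to the target.
    x_t = torch.randn(2, 1, 64, 64)   # noisy target glyph at some timestep
    src = torch.rand(2, 1, 64, 64)    # source (content) glyph
    e = torch.randn(2, 128)           # style embedding e from Es
    eps_pred = TinyDenoiser()(x_t, ContentEncoder()(src), e)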
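
    For part (b), one plausible reading of the SCR supervision is an InfoNCE-style contrastive loss: a style extractor embeds the glyph produced by the diffusion model, the reference glyph in the target style acts as the positive, and glyphs sampled from other fonts act as negatives. The extractor architecture, tensor shapes, and temperature below are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class StyleExtractor(nn.Module):
        # Maps a glyph image to an L2-normalised style embedding.
        def __init__(self, dim=128):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(1, 64, 3, 2, 1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(128, dim),
            )

        def forward(self, x):
            return F.normalize(self.backbone(x), dim=-1)

    def style_contrastive_loss(extractor, generated, positive, negatives, tau=0.07):
        # generated: (B, 1, H, W)    glyphs predicted by the diffusion model
        # positive:  (B, 1, H, W)    reference glyphs in the target style
        # negatives: (B, K, 1, H, W) glyphs sampled from other styles
        B, K = negatives.shape[:2]
        z_gen = extractor(generated)                                # (B, D)
        z_pos = extractor(positive)                                 # (B, D)
        z_neg = extractor(negatives.flatten(0, 1)).view(B, K, -1)   # (B, K, D)

        pos_sim = (z_gen * z_pos).sum(-1, keepdim=True) / tau       # (B, 1)
        neg_sim = torch.einsum("bd,bkd->bk", z_gen, z_neg) / tau    # (B, K)
        logits = torch.cat([pos_sim, neg_sim], dim=1)               # (B, 1 + K)
        labels = torch.zeros(B, dtype=torch.long, device=logits.device)
        # Cross-entropy with the positive in column 0 pulls the generated style
        # towards the reference and pushes it away from the other fonts.
        return F.cross_entropy(logits, labels)

    # Example with a batch of 8 glyphs and 4 negative styles per sample.
    ext = StyleExtractor()
    gen, ref = torch.rand(8, 1, 64, 64), torch.rand(8, 1, 64, 64)
    neg = torch.rand(8, 4, 1, 64, 64)
    loss = style_contrastive_loss(ext, gen, ref, neg)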

Experiments

Quantitative Results

Qualitative Comparison

More Visualization

Characters of hard level of complexity

Characters of medium level of complexity

Characters of easy level of complexity

Visualization of cross-lingual generation (Chinese to Korean)

Contact

For issues when using FontDiffuser, please contact Zhenhua Yang at eezhyang@gmail.com. For commercial use, please contact Prof. Lianwen Jin at eelwjin@scut.edu.cn.

BibTeX

@inproceedings{yang2024fontdiffuser,
  title={FontDiffuser: One-Shot Font Generation via Denoising Diffusion with Multi-Scale Content Aggregation and Style Contrastive Learning},
  author={Yang, Zhenhua and Peng, Dezhi and Kong, Yuxin and Zhang, Yuyi and Yao, Cong and Jin, Lianwen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2024}
}