Predicting the Original Appearance of Damaged Historical Documents

1South China University of Technology, 2INTSIG-SCUT Joint Lab on Document Analysis and Recognition

Abstract

Historical documents encompass a wealth of cultural treasures but suffer severe damage over time, including missing characters, paper damage, and ink erosion. However, existing document processing methods primarily focus on binarization, enhancement, etc., neglecting the repair of these damages. To this end, we present a new task, termed Historical Document Repair (HDR), which aims to predict the original appearance of damaged historical documents. To fill the gap in this field, we propose a large-scale dataset, HDR28K, and a diffusion-based network, DiffHDR, for historical document repair. Specifically, HDR28K contains 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations. Moreover, DiffHDR augments the vanilla diffusion framework with semantic and spatial information and a meticulously designed character perceptual loss for contextual and visual coherence. Experimental results demonstrate that DiffHDR, trained on HDR28K, significantly surpasses existing approaches and performs remarkably well on real damaged documents. Notably, DiffHDR can also be extended to document editing and text block generation, showcasing its high flexibility and generalization capacity. We believe this study could pioneer a new direction in document processing and contribute to the preservation of invaluable cultures and civilizations. The dataset and code are available at https://github.com/yeungchenwa/HDR.

Motivation

  • New Task - HDR: Damaged historical document images are fed into the HDR model to repair the damaged regions. The outputs of the HDR model, termed repaired images, should not only capture precise character content and style but also harmonize with the surrounding background of the repaired region.
  • New Dataset - HDR28K: We contribute a large-scale dataset, named HDR28K, which comprises 28,552 damaged-repaired image pairs with character-level annotations and multi-style degradations.
  • New Solution - DiffHDR: We propose a Diffusion-based Historical Document Repair method (DiffHDR), which augments the DDPM framework with semantic and spatial information and incorporates a meticulously designed character perceptual loss to enhance contextual and visual coherence.
HDR28K Dataset

DiffHDR Framework

Overview of our proposed method. DiffHDR comprises a condition-parsing stage and a diffusion pipeline. In condition parsing, the user provides the content and locations of the damaged characters, from which the content image and mask image are obtained. In the diffusion pipeline, our denoiser, a UNet-based network, outputs the repaired image conditioned on the noised image, damaged image, mask image, and content image. During training, in addition to the diffusion loss, we introduce a character perceptual loss to enhance the content preservation of the repaired characters.
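As a rough illustration, the conditioning and training objective described above can be sketched as follows. This is a minimal NumPy sketch, not the actual implementation: the function names, the channel-wise concatenation layout, and the loss weight `lam` are assumptions, and the UNet denoiser and the character recognizer that supplies perceptual features are omitted.

```python
import numpy as np

def build_denoiser_input(noised, damaged, mask, content):
    """Stack the four conditions channel-wise into one denoiser input.

    Each argument is a (C, H, W) array; the DiffHDR denoiser is described
    as conditioned on exactly these four images. The concatenation layout
    is an assumption for illustration.
    """
    return np.concatenate([noised, damaged, mask, content], axis=0)

def diffusion_loss(pred_noise, true_noise):
    # Standard DDPM objective: MSE between predicted and true noise.
    return float(np.mean((pred_noise - true_noise) ** 2))

def character_perceptual_loss(feat_repaired, feat_target):
    # Distance between character-recognizer features of the repaired and
    # ground-truth character regions (recognizer not shown; L1 is assumed).
    return float(np.mean(np.abs(feat_repaired - feat_target)))

def total_loss(pred_noise, true_noise, feat_repaired, feat_target, lam=1.0):
    # `lam` weights the perceptual term; its value is a training
    # hyperparameter not specified on this page.
    return diffusion_loss(pred_noise, true_noise) + \
        lam * character_perceptual_loss(feat_repaired, feat_target)
```

The point of the sketch is the two-part objective: the diffusion loss supervises denoising as in a vanilla DDPM, while the character perceptual loss adds recognizer-level supervision so that repaired characters stay readable as the intended glyphs.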

Experiments

Quantitative Results

Qualitative Comparison

More Visualization

Real damaged historical documents repaired by DiffHDR

Document editing and text block font generation

Historical document repair on the character-missing damage type

Historical document repair on the paper-damage type

Historical document repair on the ink-erosion type

Contact

For issues when using DiffHDR, please contact Zhenhua Yang (eezhyang@gmail.com). For commercial use, please contact Prof. Lianwen Jin (eelwjin@scut.edu.cn).

BibTeX

@inproceedings{yang2025diffhdr,
  title={Predicting the Original Appearance of Damaged Historical Documents},
  author={Yang, Zhenhua and Peng, Dezhi and Shi, Yongxin and Zhang, Yuyi and Liu, Congyu and Jin, Lianwen},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2025}
}