29 September 2020

PDF manipulation with Rust and considerations

by Dario Cancelliere

The Portable Document Format (PDF) is a file format developed by Adobe in 1993 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems.

Rust is a multi-paradigm programming language focused on performance and safety, especially safe concurrency. Rust is syntactically similar to C++, and provides memory safety without using garbage collection.

How can I manipulate PDF programmatically in Rust?

Let’s introduce the lopdf crate, a Rust library that lets you manipulate PDF files following Adobe’s PDF reference.

I recently contributed to this project introducing features like:

With this library you can do basically anything with the document elements: altering the Root, the Catalog and all other kinds of Contents. It’s a low-level library, which makes it a good foundation for building wrappers on top of it.

An example of a library built on top of lopdf is the pdf_form crate, which lets you manipulate the AcroForm (the PDF form fields) easily, without having to discover the elements inside the page tree yourself.

Load and modify a PDF

The lopdf library loads the whole PDF into memory. Quoting the author on this choice:

Normally a PDF document won’t be very large, ranging form tens of KB to hundreds of MB. Memory size is not a bottle neck for today’s computer. By keep the whole document in memory, stream length can be pre-calculated, no need to use a reference object for the Length entry, the resulting PDF file is smaller for distribution and faster for PDF consumers to process.
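
The point about the Length entry can be illustrated with a small sketch. It does not use lopdf: write_stream_object is a hypothetical helper that serializes a PDF stream object by hand, assuming the whole stream content is already in memory, so its length is known before anything is written.

```rust
// Hypothetical helper (not part of lopdf): serialize one PDF stream object
// whose content is already fully in memory.
fn write_stream_object(id: u32, content: &[u8]) -> Vec<u8> {
    let mut out = Vec::new();
    // The length is known up front, so it is written directly in the
    // dictionary instead of through an indirect reference such as "/Length 8 0 R".
    let header = format!("{} 0 obj\n<< /Length {} >>\nstream\n", id, content.len());
    out.extend_from_slice(header.as_bytes());
    out.extend_from_slice(content);
    out.extend_from_slice(b"\nendstream\nendobj\n");
    out
}

fn main() {
    // A tiny content stream that draws a string (illustrative only)
    let content = b"BT /F1 12 Tf (Hello World!) Tj ET";
    let object = write_stream_object(4, content);
    print!("{}", String::from_utf8_lossy(&object));
}
```

A writer that emits the stream before knowing its size would instead have to point Length at a separate object written afterwards, which is the extra object the quote refers to.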

Here is an example of code for replacing text:

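use lopdf::Document;

// main returns a Result so that the `?` operator can be used
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the whole document into memory
    let mut document = Document::load("example.pdf")?;
    document.version = "1.4".to_string();

    // Replace the text on page 1
    document.replace_text(1, "Hello World!", "Modified text!")?;

    document.save("modified.pdf")?;

    Ok(())
}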

With this piece of code, the library replaces the string "Hello World!" in the text objects of page 1 (the first argument is the page number) and saves the result as a new file.

Merging more PDF documents

This is an example that I wrote for the repository; it shows step by step how the files are merged:

use std::collections::BTreeMap;

use lopdf::{Document, Object, ObjectId};

fn main() {
    // Generate a stack of Documents to merge
    let documents = vec![
        Document::load("example.pdf").unwrap(),
        Document::load("example2.pdf").unwrap(),
        Document::load("example3.pdf").unwrap(),
        Document::load("example4.pdf").unwrap(),
    ];

    // Define a starting max_id (will be used as start index for object_ids)
    let mut max_id = 1;

    // Collect all Documents Objects grouped by a map
    let mut documents_pages = BTreeMap::new();
    let mut documents_objects = BTreeMap::new();

    for mut document in documents {
        document.renumber_objects_with(max_id);

        max_id = document.max_id + 1;

        documents_pages.extend(
            document
                .get_pages()
                .into_iter()
                .map(|(_, object_id)| {
                    (
                        object_id,
                        document.get_object(object_id).unwrap().to_owned(),
                    )
                })
                .collect::<BTreeMap<ObjectId, Object>>(),
        );
        documents_objects.extend(document.objects);
    }

    // Initialize a new empty document
    let mut document = Document::with_version("1.5");

    // Catalog and Pages are mandatory
    let mut catalog_object: Option<(ObjectId, Object)> = None;
    let mut pages_object: Option<(ObjectId, Object)> = None;

    // Process all objects except "Page" type
    for (object_id, object) in documents_objects.iter() {
        // We have to ignore "Page" (as are processed later), "Outlines" and "Outline" objects
        // All other objects should be collected and inserted into the main Document
        match object.type_name().unwrap_or("") {
            "Catalog" => {
                // Collect a first "Catalog" object and use it for the future "Pages"
                catalog_object = Some((
                    if let Some((id, _)) = catalog_object {
                        id
                    } else {
                        *object_id
                    },
                    object.clone(),
                ));
            }
            "Pages" => {
                // Collect and update a first "Pages" object and use it for the future "Catalog"
                // We have also to merge all dictionaries of the old and the new "Pages" object
                if let Ok(dictionary) = object.as_dict() {
                    let mut dictionary = dictionary.clone();
                    if let Some((_, ref object)) = pages_object {
                        if let Ok(old_dictionary) = object.as_dict() {
                            dictionary.extend(old_dictionary);
                        }
                    }

                    pages_object = Some((
                        if let Some((id, _)) = pages_object {
                            id
                        } else {
                            *object_id
                        },
                        Object::Dictionary(dictionary),
                    ));
                }
            }
            "Page" => {}     // Ignored, processed later and separately
            "Outlines" => {} // Ignored, not supported yet
            "Outline" => {}  // Ignored, not supported yet
            _ => {
                document.objects.insert(*object_id, object.clone());
            }
        }
    }

    // If no "Pages" found abort
    if pages_object.is_none() {
        println!("Pages root not found.");

        return;
    }

    // Iter over all "Page" and collect with the parent "Pages" created before
    for (object_id, object) in documents_pages.iter() {
        if let Ok(dictionary) = object.as_dict() {
            let mut dictionary = dictionary.clone();
            dictionary.set("Parent", pages_object.as_ref().unwrap().0);

            document
                .objects
                .insert(*object_id, Object::Dictionary(dictionary));
        }
    }

    // If no "Catalog" found abort
    if catalog_object.is_none() {
        println!("Catalog root not found.");

        return;
    }

    let catalog_object = catalog_object.unwrap();
    let pages_object = pages_object.unwrap();

    // Build a new "Pages" with updated fields
    if let Ok(dictionary) = pages_object.1.as_dict() {
        let mut dictionary = dictionary.clone();

        // Set new pages count
        dictionary.set("Count", documents_pages.len() as u32);

        // Set new "Kids" list (collected from documents pages) for "Pages"
        dictionary.set(
            "Kids",
            documents_pages
                .into_iter()
                .map(|(object_id, _)| Object::Reference(object_id))
                .collect::<Vec<_>>(),
        );

        document
            .objects
            .insert(pages_object.0, Object::Dictionary(dictionary));
    }

    // Build a new "Catalog" with updated fields
    if let Ok(dictionary) = catalog_object.1.as_dict() {
        let mut dictionary = dictionary.clone();
        dictionary.set("Pages", pages_object.0);
        dictionary.remove(b"Outlines"); // Outlines not supported in merged PDFs

        document
            .objects
            .insert(catalog_object.0, Object::Dictionary(dictionary));
    }

    document.trailer.set("Root", catalog_object.0);

    // Update the max internal ID as wasn't updated before due to direct objects insertion
    document.max_id = document.objects.len() as u32;

    // Reorder all new Document objects
    document.renumber_objects();
    document.compress();

    // Save the merged PDF
    document.save("merged.pdf").unwrap();
}

The important thing here is the individual object ids. Every object in a PDF must have a unique id in order to work, much like a key in a hash map. Rendering will fail if more than one object shares the same id.
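
To see why the merge example renumbers every document before collecting its objects, here is a minimal sketch that uses only the standard library. It is not lopdf code: the ObjectId alias mirrors lopdf's (u32, u16) pair, while sample, renumber_from, merge_naive, merge_renumbered and the string payloads are hypothetical stand-ins for illustration.

```rust
use std::collections::BTreeMap;

// Same shape as lopdf's ObjectId: (object number, generation number)
type ObjectId = (u32, u16);
type Objects = BTreeMap<ObjectId, String>;

// Hypothetical stand-in for a loaded document: three objects numbered from 1
fn sample(name: &str) -> Objects {
    let mut objects = BTreeMap::new();
    objects.insert((1, 0), format!("{}: Catalog", name));
    objects.insert((2, 0), format!("{}: Pages", name));
    objects.insert((3, 0), format!("{}: Page", name));
    objects
}

// Hypothetical helper: renumber objects so they start at `start`,
// returning the new map and the next free object number
fn renumber_from(objects: &Objects, start: u32) -> (Objects, u32) {
    let mut next = start;
    let mut renumbered = BTreeMap::new();
    for ((_, generation), value) in objects {
        renumbered.insert((next, *generation), value.clone());
        next += 1;
    }
    (renumbered, next)
}

// Merging without renumbering: later documents overwrite earlier ids
fn merge_naive(documents: &[Objects]) -> Objects {
    let mut merged = BTreeMap::new();
    for objects in documents {
        merged.extend(objects.clone());
    }
    merged
}

// Merging after renumbering: every object keeps its own unique id
fn merge_renumbered(documents: &[Objects]) -> Objects {
    let mut merged = BTreeMap::new();
    let mut next_id = 1;
    for objects in documents {
        let (renumbered, next) = renumber_from(objects, next_id);
        next_id = next;
        merged.extend(renumbered);
    }
    merged
}

fn main() {
    let documents = vec![sample("a"), sample("b")];
    println!("naive: {} objects", merge_naive(&documents).len());
    println!("renumbered: {} objects", merge_renumbered(&documents).len());
}
```

With two three-object documents, the naive merge ends up with only 3 objects because the second document silently replaces the first, while the renumbered merge keeps all 6.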

Filling form fields

As mentioned earlier, pdf_form, another crate that I contributed to, helps with filling PDF form fields. Let’s see an example of code here:

use pdf_form::Form;

fn main() {
    // Load the PDF into a Form from a path
    let mut form = Form::load("path/to/pdf").unwrap();

    // Set the text of the first field (you can use the "form.get_all_types()"
    // function to iterate over all fields); fails if the field is not a text field
    form.set_text(0, String::from("filling the field")).unwrap();

    // Save the new document, without silently ignoring a write error
    form.save("new.pdf").unwrap();
}

At the moment, all form field types are supported: Text, Button, Radio, CheckBox, ListBox and ComboBox.

Insert an image

You can add an image using the optional lopdf Cargo feature called embed_image, with something like:

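use lopdf::Document;
use lopdf::xobject;

// main returns a Result so that the `?` operator can be used
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut document = Document::load("example.pdf")?;
    document.version = "1.4".to_string();

    // Pick the page that will receive the image (here, the first page)
    let page_id = *document
        .get_pages()
        .get(&1)
        .ok_or("the document has no pages")?;

    // Placeholder position and size, adjust them to your layout
    let (x, y) = (100.0, 100.0);
    let (width, height) = (200.0, 150.0);

    // If the stream is loaded correctly
    if let Ok(stream) = xobject::image("image.png") {
        // we need a "page_id", the position coordinates and the size of the image
        document.insert_image(page_id, stream, (x, y), (width, height))?;
    }

    document.save("image.pdf")?;

    Ok(())
}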

Considerations

Of course, the library is not 100% complete. It’s a low-level PDF library, so more crates like pdf_form are needed; I can imagine a dedicated crate for each specific kind of PDF manipulation.

I will continue to contribute to this library and try to bring comfortable PDF manipulation to the Rust world, as other languages already have.

tags: rust - pdf - manipulation - library - tutorial