Bulk reading PDF date of creation or other XMP metadata

I had about 2,828 PDF files and wanted to read the creation date from each one's metadata in bulk.

My first thought was to use PDF.js with a structure like this (I am using the legacy build since I am on Node.js):

import { readFile } from "fs/promises";
import { getDocument } from "pdfjs-dist/legacy/build/pdf.mjs";

const path = "path/to/your/file.pdf";

// Read the whole PDF into memory and let PDF.js parse it
const file = await readFile(path);
const doc = await getDocument(new Uint8Array(file)).promise;
const metadata = await doc.getMetadata();
// Raw value from the PDF's Info dictionary
const creationDate = metadata.info["CreationDate"];

but it was annoyingly slow. Some of that was probably PDF.js having to process the whole PDF, but I suspect part of it was readFile, which reads the entire file into memory. Since I only care about the metadata, which I believe is usually near the beginning of the file, I shouldn't need to read the entire PDF.
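One detail worth knowing about the PDF.js route: the `CreationDate` value in the Info dictionary is a raw PDF date string such as `D:20240102030405+02'00'`, not a JavaScript `Date`. Below is a minimal sketch of converting it. `parsePdfDate` is a hypothetical helper, not part of PDF.js, and it assumes that a string with no time zone can be treated as UTC.

```typescript
// Hypothetical helper (not part of PDF.js): converts a PDF date string such as
// "D:20240102030405+02'00'" into a Date. Every field after the year is optional.
function parsePdfDate(raw: string): Date | null {
  const match =
    /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?(Z|[+-])?(\d{2})?'?(\d{2})?'?$/.exec(
      raw.trim(),
    );
  if (match === null) return null;

  // Missing optional fields fall back to the earliest sensible value.
  const [
    ,
    year,
    month = "01",
    day = "01",
    hour = "00",
    minute = "00",
    second = "00",
    sign = "Z", // Assumption: no time zone given means UTC
    offsetHours = "00",
    offsetMinutes = "00",
  ] = match;

  const asUtc = Date.UTC(
    Number(year),
    Number(month) - 1,
    Number(day),
    Number(hour),
    Number(minute),
    Number(second),
  );
  if (sign === "Z") return new Date(asUtc);

  // "+02'00'" means local time is ahead of UTC, so subtract the offset.
  const offsetMs = (Number(offsetHours) * 60 + Number(offsetMinutes)) * 60000;
  return new Date(sign === "+" ? asUtc - offsetMs : asUtc + offsetMs);
}

console.log(parsePdfDate("D:20240102030405+02'00'")?.toISOString());
```

The XMP `xmp:CreateDate` value, by contrast, is an ISO 8601 style string, so it can usually be passed to `new Date(...)` directly.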

I am not really certain what the PDF structure looks like, since the ISO 32000 standard costs 221 Swiss francs (273.68 USD). In hindsight, the smarter way would probably have been to read the file incrementally until reaching the creation date tag <xmp:CreateDate>, then keep reading until that tag is closed by </xmp:CreateDate>. The way I actually did it was with a regex:

import type { PathLike } from "fs";
import { open } from "fs/promises";

// Read only the first `numberOfBytes` bytes of the file
async function readBytes(path: PathLike, numberOfBytes = 2500) {
  const fileHandle = await open(path, "r");
  const buffer = Buffer.alloc(numberOfBytes);
  try {
    const { bytesRead } = await fileHandle.read(buffer);
    // The file may be shorter than the buffer, so trim to what was read
    return buffer.subarray(0, bytesRead);
  } finally {
    await fileHandle.close();
  }
}

const main = async () => {
  const path = "path/to/your/file.pdf";
  try {
    const file = (await readBytes(path)).toString("utf-8");
    const result = /<xmp:CreateDate>(.*?)<\/xmp:CreateDate>/.exec(file);
    if (result === null || result[1] === undefined) {
      console.error("No CreateDate found in the file.");
      // Read with PDF.js instead
      return;
    }
    const creationDate = result[1];
    console.log(creationDate);
  } catch (error) {
    if (error instanceof Error) {
      console.error(`Error reading file: ${error.message}`);
    }
  }
};

main();
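To see what the pattern in the snippet above does and does not match, here is a small self-contained check. The sample strings are made up for illustration, and the greedy and `[\s\S]` variants are shown only for comparison; they are not used in the script above.

```typescript
// The pattern used above: lazy capture, "." does not match line breaks
const lazy = /<xmp:CreateDate>(.*?)<\/xmp:CreateDate>/;
// Greedy variant, for comparison only
const greedy = /<xmp:CreateDate>(.*)<\/xmp:CreateDate>/;
// Variant that also matches across line breaks, for comparison only
const multiline = /<xmp:CreateDate>([\s\S]*?)<\/xmp:CreateDate>/;

// Made-up inputs: two tags on one line, and one tag wrapped over several lines
const twoTags =
  "<xmp:CreateDate>2020-01-01T00:00:00Z</xmp:CreateDate>" +
  "<xmp:CreateDate>2021-01-01T00:00:00Z</xmp:CreateDate>";
const wrapped = "<xmp:CreateDate>\n2020-01-01T00:00:00Z\n</xmp:CreateDate>";

// Lazy stops at the first closing tag
const lazyResult = lazy.exec(twoTags)?.[1];
// Greedy runs on to the last closing tag and swallows the markup in between
const greedyResult = greedy.exec(twoTags)?.[1];
// "." cannot cross the line breaks, so the original pattern finds nothing here
const wrappedWithDot = lazy.exec(wrapped);
// [\s\S] can, but the surrounding whitespace then has to be trimmed
const wrappedWithClass = multiline.exec(wrapped)?.[1]?.trim();

console.log({ lazyResult, greedyResult, wrappedWithDot, wrappedWithClass });
```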

(.*?) is a lazy capture group: it matches as few characters as possible (any character except a line break) up to the first closing tag. 2500 bytes is an arbitrary number I chose that seemed to work fine. When it didn't work (for example, one PDF had the tag on line 46484 of 60558), I just fell back to reading that file with PDF.js.
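The incremental approach mentioned earlier could be sketched roughly as follows. This is a minimal sketch, not what the script above does: `createTagScanner` and `findCreateDate` are hypothetical names, the 64 KiB chunk size is arbitrary, and it assumes the date is stored uncompressed as element text between the two tags. A date written as an XML attribute, or one inside a compressed stream, would still need the PDF.js fallback.

```typescript
import { createReadStream } from "node:fs";

// Hypothetical helper: feed it decoded chunks in order; it returns the text
// between the tags once both have been seen, and null until then.
function createTagScanner(
  openTag = "<xmp:CreateDate>",
  closeTag = "</xmp:CreateDate>",
) {
  let pending = "";
  let capturing = false;

  return (chunk: string): string | null => {
    pending += chunk;
    if (!capturing) {
      const start = pending.indexOf(openTag);
      if (start === -1) {
        // Keep only a tail long enough to hold an opening tag split across chunks
        const keep = openTag.length - 1;
        pending = keep > 0 ? pending.slice(-keep) : "";
        return null;
      }
      pending = pending.slice(start + openTag.length);
      capturing = true;
    }
    const end = pending.indexOf(closeTag);
    return end === -1 ? null : pending.slice(0, end);
  };
}

// Hypothetical wrapper: stream the file and stop as soon as the date is found.
async function findCreateDate(path: string): Promise<string | null> {
  const scan = createTagScanner();
  // latin1 maps every byte to exactly one character, so a chunk boundary
  // can never fall in the middle of a character
  const stream = createReadStream(path, {
    encoding: "latin1",
    highWaterMark: 64 * 1024,
  });
  for await (const chunk of stream) {
    const value = scan(String(chunk));
    // Returning from the loop also closes the stream, so the rest is never read
    if (value !== null) return value;
  }
  return null;
}
```

Compared with a fixed 2500-byte read, this reads only as far as the closing tag, so a file with the tag near the start stays cheap and a file with the tag deep inside still gets found without loading everything at once. Calling `await findCreateDate("path/to/your/file.pdf")` returns the date string, or null if the tags never appear.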

I thought this was interesting.

