PDF

PDFJS: Read PDF from memory Buffer in NodeJS

Note: This post uses async/await and therefore requires NodeJS 8+.

This is how to read a PDF from a file on disk, e.g. mypdf.pdf:

pdfjs.getDocument('mypdf.pdf');

Full example:

const pdfjs = require('pdfjs-dist');

async function readPDF() {
    // getDocument() returns a loading task; its promise resolves to the PDF document
    const pdf = await pdfjs.getDocument('mypdf.pdf').promise;
    // ...
}

Here’s how you can read the PDF from a memory buffer:

pdfjs.getDocument({data: buffer});

Full example:

const fs = require('mz/fs');
const pdfjs = require('pdfjs-dist');

async function readPDF() {
    // Read file into buffer
    const buffer = await fs.readFile('mypdf.pdf');
    // Parse PDF from buffer (the loading task's promise resolves to the document)
    const pdf = await pdfjs.getDocument({data: buffer}).promise;
    // ...
}

Using mz/fs is not required; it is only used as a utility library so that file operations can be used with await.
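If you want to avoid the extra dependency, here is a minimal sketch that uses only NodeJS built-ins (util.promisify is available in NodeJS 8+). The helper name readFileAsUint8Array is made up for this example. It also converts the Buffer into a plain Uint8Array, on the assumption that some newer pdfjs-dist releases prefer that over a NodeJS Buffer; I have not checked this against every version.

```javascript
const fs = require('fs');
const { promisify } = require('util');

// util.promisify wraps the callback-based fs.readFile so it can be awaited
const readFile = promisify(fs.readFile);

// Hypothetical helper: reads a file and returns its bytes as a plain Uint8Array
async function readFileAsUint8Array(filename) {
    const buffer = await readFile(filename);
    // A Buffer is a Uint8Array subclass; this creates a plain Uint8Array
    // view on the same memory without copying the data
    return new Uint8Array(buffer.buffer, buffer.byteOffset, buffer.byteLength);
}

// Usage with pdfjs inside an async function (not executed here):
// const data = await readFileAsUint8Array('mypdf.pdf');
// const pdf = await pdfjs.getDocument({data: data}).promise;
```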

 

Posted by Uli Köhler in Javascript, PDF

Convert pt (postscript/PDF unit) to inch or mm in Javascript

Here are some simple utility functions to convert the typographic unit pt (the PostScript point, defined as 1/72 inch) into inches or mm.

function convertPtToInch(pt) { return pt / 72; }
function convertInchToMM(inch) { return inch * 25.4; }
function convertPtToMM(pt) {
  return convertInchToMM(convertPtToInch(pt));
}

// Example usage
console.log(convertPtToMM(595)) // Prints 209.90277777777777
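The reverse direction can be handy as well, e.g. to find out which pt values to expect for a paper size given in mm. The following is a minimal sketch; convertMMToInch, convertInchToPt and convertMMToPt are names I made up here to mirror the functions above.

```javascript
// Inverse helpers (hypothetical names, mirroring the pt-to-mm functions)
function convertMMToInch(mm) { return mm / 25.4; }
function convertInchToPt(inch) { return inch * 72; }
function convertMMToPt(mm) {
  return convertInchToPt(convertMMToInch(mm));
}

// Example usage: ISO A4 is 210x297 mm
console.log(convertMMToPt(210).toFixed(2)); // Prints 595.28
console.log(convertMMToPt(297).toFixed(2)); // Prints 841.89
```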

Note that while the conversion itself is exact, paper sizes given in pt are usually rounded to whole numbers, so some tolerance is required when comparing sizes:
An ISO A4 paper is defined as 210x297 mm, which is commonly given as 595x842 pt.

However, converting 595×842 pt into mm results in 209.902777 mm and 297.038888 mm respectively. Watch out for those tolerances if you try to compare paper sizes. I recommend a tolerance of at least 0.25 mm.
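A tolerance-based comparison could look like the following sketch. isPaperSize is a hypothetical helper, and the default of 0.25 mm is just the tolerance recommended above, not a value from any standard.

```javascript
function convertPtToMM(pt) { return pt / 72 * 25.4; }

// Hypothetical helper: checks whether a size in pt matches a paper size in mm
// within the given tolerance (in mm)
function isPaperSize(widthPt, heightPt, widthMM, heightMM, toleranceMM = 0.25) {
  return Math.abs(convertPtToMM(widthPt) - widthMM) <= toleranceMM
      && Math.abs(convertPtToMM(heightPt) - heightMM) <= toleranceMM;
}

// Example usage: compare against ISO A4 (210x297 mm)
console.log(isPaperSize(595, 842, 210, 297)); // Prints true
console.log(isPaperSize(612, 792, 210, 297)); // Prints false (612x792 pt is US Letter)
```

Note that this only checks one orientation. For a landscape page you would swap width and height before comparing.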

Posted by Uli Köhler in Javascript, PDF

Extract PDF page sizes using PDFJS & NodeJS

Although most PDFs use a single page size for all pages (e.g. DIN A4 or Letter in portrait orientation), PDFs sometimes also contain pages that have a different size or orientation (which is treated just like another size) than the other pages in the same document.

This post provides an easy-to-reuse example of how to use PDFJS in NodeJS (though it will be just as easy to do in the browser) to extract the PDF page sizes.

It is based on this previous post on how to read all pages from a PDF document using PDFJS, so be sure to check that out first.

First install the required dependencies:

npm install bereich pdfjs-dist

Then you can use this source code to read the page sizes of mypdf.pdf:

const pdfjs = require('pdfjs-dist');
const bereich = require('bereich');

class PageSize {
  constructor(width, height) {
    this.width = width;
    this.height = height;
  }
}

function getPageSize (page) {
    // page.view is the page rectangle [x1, y1, x2, y2] in pt
    const [x1, y1, x2, y2] = page.view;
    const width = x2 - x1;
    const height = y2 - y1;
    const rotate = page.rotate;
    // Consider rotation: for 90° and 270°, width and height are swapped
    return (rotate === 90 || rotate === 270)
        ? new PageSize(height, width) : new PageSize(width, height);
}

async function readPDFPageSizes() {
  // getDocument() returns a loading task; its promise resolves to the PDF document
  const pdf = await pdfjs.getDocument('mypdf.pdf').promise;
  const numPages = pdf.numPages;

  const pageNumbers = Array.from(bereich(1, numPages));
  // Start reading all pages 1...numPages
  const promises = pageNumbers.map(pageNo => pdf.getPage(pageNo));
  // Wait until all pages have been read
  const pages = await Promise.all(promises);
  // You can do something with pages here.
  return pages.map(getPageSize);
}

readPDFPageSizes()
    .then(pageSizes => {console.log(pageSizes)})
    .catch(err => {console.error(`Error while reading PDF: ${err}`)})

Running this with a document having a single A4 page will result in

[ PageSize { width: 595, height: 842 } ]

Note that the unit of width and height is pt (points). One pt is defined as 1/72 inch. A DIN A4 page (portrait) is 595x842 pt, which is why you see those values here.
See this TechOverflow post for code to convert pt to mm and inches.
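Since the point of this post is to find documents with mixed page sizes, it can help to summarize the result. Here's a minimal sketch that counts how often each size occurs. countPageSizes is a hypothetical helper, the example sizes are made up, and rounding to whole pt is my assumption about what counts as "the same size".

```javascript
// Hypothetical helper: counts how many pages share each (rounded) size.
// pageSizes is an array of objects with width and height in pt.
function countPageSizes(pageSizes) {
  const counts = new Map();
  for (const {width, height} of pageSizes) {
    // Round to whole pt so that e.g. 595.28 and 595 end up in the same group
    const key = `${Math.round(width)}x${Math.round(height)}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Example usage with made-up sizes: two A4 portrait pages, one A4 landscape page
const exampleSizes = [
  {width: 595, height: 842},
  {width: 842, height: 595},
  {width: 595.28, height: 841.89}
];
for (const [size, count] of countPageSizes(exampleSizes)) {
  console.log(`${size} pt: ${count} page(s)`);
}
```

If the resulting map has more than one entry, the document contains pages with different sizes or orientations.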

Posted by Uli Köhler in Javascript, PDF

PDFJS: Read all pages using async/await in NodeJS

PDFJS has an official example that, among other things, reads all pages from a PDF document.
However, their promise-based method is rather complex to understand and to write. Luckily, there is an easier way using async/await (which is supported starting from NodeJS 8.x).

I’m using the bereich library (Bereich is German for range) in order to generate an array of page numbers (1..numPages).
Install the required libraries using

npm install pdfjs-dist bereich

Here’s the source code example:

const pdfjs = require('pdfjs-dist');
const bereich = require('bereich');

async function readPDFPages() {
  // getDocument() returns a loading task; its promise resolves to the PDF document
  const pdf = await pdfjs.getDocument('mypdf.pdf').promise;
  const numPages = pdf.numPages;

  const pageNumbers = Array.from(bereich(1, numPages));
  // Start reading all pages 1...numPages
  const promises = pageNumbers.map(pageNo => pdf.getPage(pageNo));
  // Wait until all pages have been read
  const pages = await Promise.all(promises);
  // You can do something with pages here.
  return pages;
}

readPDFPages().then(pages => {
    console.log(pages)
}).catch(err => {
    console.error(`Error while reading PDF: ${err}`)
})
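If you would rather not install a library just for the page number range, the same array can be generated with built-in functions only. This is a small sketch; pageNumberRange is a name I made up for it.

```javascript
// Generates [1, 2, ..., numPages] using only built-in functions
function pageNumberRange(numPages) {
  // Array.from calls the mapping function once per index 0..numPages-1
  return Array.from({length: numPages}, (_, index) => index + 1);
}

// Example usage
console.log(pageNumberRange(5)); // Prints [ 1, 2, 3, 4, 5 ]
```

In readPDFPages(), pageNumberRange(numPages) would take the place of Array.from(bereich(1, numPages)).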

 

Posted by Uli Köhler in Javascript, PDF

How to read PDF creation & modification date in NodeJS

Problem:

You have a PDF file for which you want to know the creation and modification date: not the timestamps the filesystem stores for the file, but the dates from the PDF metadata.

Solution:

This solution assumes you use NodeJS version 8+ which supports async/await.
You can use pdfjs to read these dates. First install it using

npm install pdfjs-dist

Then use this code to extract the dates.

const pdfjs = require('pdfjs-dist');

async function readPDFDates() {
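  // getDocument() returns a loading task; its promise resolves to the PDF document
  const pdf = await pdfjs.getDocument('mypdf.pdf').promise;
  const metadata = await pdf.getMetadata();
  // metadata.metadata holds the XMP metadata and may be missing for some PDFs
  const xmp = metadata.metadata;
  const toDate = key => {
    const value = xmp ? xmp.get(key) : null;
    return value ? new Date(value) : null;
  };

  const modDate = toDate('xmp:modifydate');
  const createDate = toDate('xmp:createdate');
  return [modDate, createDate];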
}

readPDFDates().then(([modDate, createDate]) => {
    console.log(`Creation date: ${createDate}`)
    console.log(`Modification date: ${modDate}`)
}).catch(err => {
    console.error(`Error while reading PDF: ${err}`)
})

 

The PDF files I’ve seen use ISO8601-style formatting, in some cases without a timezone specification. In that case, new Date() interprets the time as being in the local timezone.
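To illustrate the timezone handling, here's a short sketch using the timestamp from the example metadata in this post. It only uses the built-in Date class; the variable names are arbitrary.

```javascript
// With an explicit offset, the result does not depend on the local timezone
const withOffset = new Date('2010-02-09T10:09:24+01:00');
console.log(withOffset.toISOString()); // Prints 2010-02-09T09:09:24.000Z

// Without an offset, a date-time string is interpreted as local time,
// i.e. it is equivalent to passing the individual components
const withoutOffset = new Date('2010-02-09T10:09:24');
const localComponents = new Date(2010, 1, 9, 10, 9, 24); // month is zero-based
console.log(withoutOffset.getTime() === localComponents.getTime()); // Prints true
```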

Note: metadata is e.g. the following object (not all attributes are present for all PDFs):

{ info: 
   { PDFFormatVersion: '1.5',
     IsAcroFormPresent: false,
     IsXFAPresent: false,
     Title: 'Microsoft Word - mypdf',
     Author: 'uli',
     Creator: 'PScript5.dll Version 5.2.2',
     Producer: 'Acrobat Distiller 9.3.0 (Windows)',
     CreationDate: 'D:20100209100924+01\'00\'',
     ModDate: 'D:20100209100924+01\'00\'' },
  metadata: 
   Metadata {
     _metadata: 
      { 'dc:format': 'application/pdf',
        'dc:creator': 'peter',
        'dc:title': 'Microsoft Word - mypdf',
        'xmp:createdate': '2010-02-09T10:09:24+01:00',
        'xmp:creatortool': 'PScript5.dll Version 5.2.2',
        'xmp:modifydate': '2010-02-09T10:09:24+01:00',
        'pdf:producer': 'Acrobat Distiller 9.3.0 (Windows)',
        'xmpmm:documentid': 'uuid:2fd66f45-5f2a-4dd6-8cb0-297ce85ee9e1',
        'xmpmm:instanceid': 'uuid:f6e62218-4b40-47c7-837b-6cb1e6e90995' } },
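If a PDF has no XMP metadata, the info object may still contain CreationDate and ModDate in the PDF date format shown above (D:YYYYMMDDHHmmSS followed by a timezone offset such as +01'00'). new Date() can't parse that format directly. Here's a minimal sketch of a parser. parsePDFDate is a hypothetical helper and not part of pdfjs, and treating strings without an offset as UTC is my assumption.

```javascript
// Hypothetical helper: parses a PDF date string like "D:20100209100924+01'00'"
// into a Date. Returns null if the string does not match the expected format.
function parsePDFDate(str) {
  const match = /^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?([Zz+-])?(\d{2})?'?(\d{2})?'?$/.exec(str);
  if (!match) return null;
  // All fields after the year are optional, so fall back to defaults
  const [, year, month = '01', day = '01', hour = '00', minute = '00',
         second = '00', sign = 'Z', offHour = '00', offMinute = '00'] = match;
  // Interpret the components as UTC first ...
  const utcMillis = Date.UTC(+year, +month - 1, +day, +hour, +minute, +second);
  // ... then subtract the offset (assumption: no offset means UTC)
  const direction = sign === '+' ? 1 : (sign === '-' ? -1 : 0);
  const offsetMinutes = direction * (+offHour * 60 + +offMinute);
  return new Date(utcMillis - offsetMinutes * 60000);
}

// Example usage with the CreationDate from the metadata shown above
console.log(parsePDFDate("D:20100209100924+01'00'").toISOString());
// Prints 2010-02-09T09:09:24.000Z
```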

 

Posted by Uli Köhler in Javascript, PDF