PDF text extraction and QR decoding
Nodes to aid in performing these tasks
extract text from PDF
- I really like node-red-contrib-pdfjs to perform this task. It's very simple to use, and just works. I use the "Order text from top to bottom" check, the "Order text from left to right" check, and "Merge text if both are in the same row" check.
My wish list for changes to this node:
- ability to extract text from just a single page or a range of pages. Most of the time I only need to work with the first page
- ability to define a tolerance for the row position (y dimension), so that texts that are slightly misaligned (up to the tolerance, above or below) get joined into a single row
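The two wishes above can be sketched as a post-processing step in a function node. This is only an illustration, not how node-red-contrib-pdfjs works internally: the item shape `{ text, x, y, page }`, the function name `mergeRows`, and its parameters are all assumptions made for the example.

```javascript
// Hypothetical helper: group text items into rows, allowing a y tolerance.
// Assumed item shape (NOT the actual output of node-red-contrib-pdfjs):
//   { text: string, x: number, y: number, page: number }
// `pages` is an optional array of page numbers to keep (null keeps all pages).
function mergeRows(items, tolerance = 2, pages = null) {
  const selected = pages ? items.filter((it) => pages.includes(it.page)) : items;

  // PDF coordinates grow upward, so a larger y is higher on the page.
  // Sort by page, then top to bottom.
  const sorted = [...selected].sort((a, b) => a.page - b.page || b.y - a.y);

  const rows = [];
  for (const item of sorted) {
    const row = rows[rows.length - 1];
    // An item joins the current row when it is on the same page and its y is
    // within `tolerance` of the first item that opened the row.
    if (row && row.page === item.page && Math.abs(row.y - item.y) <= tolerance) {
      row.items.push(item);
    } else {
      rows.push({ page: item.page, y: item.y, items: [item] });
    }
  }

  // Inside each row, order left to right and join the texts with a space.
  return rows.map((row) =>
    row.items
      .sort((a, b) => a.x - b.x)
      .map((it) => it.text)
      .join(" ")
  );
}
```

With a tolerance of 0 this falls back to exact row matching; passing `[1]` as `pages` keeps only the first page.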
convert from PDF to image
- I use @martip/node-red-pdf-to-png for this task. For some unknown reason, the generated image does not include text; it includes only graphics (lines + embedded images representing QR codes + QR codes made using fonts). I tried to fix this but couldn't find the root cause. As it is, it still fits my purpose, because it includes all the QR-related images
extract images embedded in PDF
- I authored a soon-to-be-published custom node for this task. I've found that some QR codes and PDF417 codes are represented as images in PNG or JPG format, embedded within the PDF.
- I tried to avoid building this node, but found it necessary: the "barcode decoder" node in node-red-contrib-image-tools correctly recognizes QR codes and PDF417 codes when you use the extracted image as input, but fails when you feed it a bigger image (like the one representing the first page of the PDF)
decode QR in image embedded in PDF
- using as input the extracted images embedded in PDF, the "barcode decoder" node in node-red-contrib-image-tools works fine
decode QR from jpg and png images
- using as input a PNG or JPG image of the pages of the PDF, "barcode decoder" is not able to recognize the QR codes. I think this is a limitation of the zxing library it depends on
- this use case is very important, as some QR codes are not embedded in the PDF as images, but are generated with custom fonts in the PDF. So it's not possible to just extract this region as a separate image
- I tried node-red-contrib-qrdecode, but it doesn't recognize the QR codes either. It depends on qrcode reader, but that project is no longer supported, and its author recommends moving to cozmo/jsQR. I couldn't find any Node-RED node that works with cozmo/jsQR
- So I built a custom node that uses cozmo/jsQR for this task. Using a PNG as input (generated by the "pdf to image" node), jsQR correctly decodes the QR codes. For this to work, I had to configure the "pdf to image" node for a 200% zoom.
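As a rough idea of what the jsQR step involves (a sketch under assumptions, not the code of the custom node mentioned above): jsQR takes raw RGBA pixels plus the image width and height, and returns either an object whose `data` field holds the decoded text, or `null`. The helper name `decodeQrFromRgba` is hypothetical, and the decoder is passed in as a parameter so the sketch does not depend on how the package is loaded.

```javascript
// Hypothetical wrapper around a jsQR-compatible decoder.
// `decoder` is assumed to be the function exported by the jsqr package:
//   jsQR(data: Uint8ClampedArray, width, height) -> { data: string, ... } | null
function decodeQrFromRgba(rgba, width, height, decoder) {
  const expected = width * height * 4; // 4 bytes per pixel: R, G, B, A
  if (rgba.length !== expected) {
    throw new Error(`Expected ${expected} RGBA bytes, got ${rgba.length}`);
  }

  // jsQR expects a Uint8ClampedArray; in Node-RED the pixels usually arrive
  // as a Buffer, so copy them into the expected type.
  const pixels = Uint8ClampedArray.from(rgba);

  const result = decoder(pixels, width, height);
  return result ? result.data : null; // null when no QR code was found
}

// Possible wiring inside a custom node (assumes the jsqr and pngjs packages
// are installed; pngjs is just one way to turn a PNG buffer into RGBA pixels):
//   const jsQR = require("jsqr");
//   const { PNG } = require("pngjs");
//   const png = PNG.sync.read(msg.payload);
//   msg.payload = decodeQrFromRgba(png.data, png.width, png.height, jsQR);
```

The buffer length check is there because a mismatch between the pixel buffer and the declared dimensions is an easy mistake to make when the PNG has been resized or zoomed upstream.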
QR decode PDF417
Base64 decoding and encoding