Optical Character Recognition software notes

Sometimes you want to extract text from an image, for example when you’re scanning receipts (also see my image scanning notes).

I’ve mostly only used Tesseract, and only from the command line.

Tesseract

Homepage: https://tesseract-ocr.github.io/
Source code: https://github.com/tesseract-ocr/tesseract
User Manual: https://tesseract-ocr.github.io/tessdoc/
nixpkg: tesseract

Command line example

tesseract receipt-2023NOV02.webp output --oem 1 -l eng

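By default, Tesseract appends “.txt” to the output base, so this command writes output.txt. Here --oem 1 selects the LSTM engine and -l eng selects the English language data.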
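If you have a whole folder of receipts, the same command can be wrapped in a small loop. This is only a rough sketch: the function names ocr_base and ocr_dir are made up for illustration, and it assumes the images are .webp files sitting in one directory and that tesseract is on your PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tesseract): derive the output base
# from an image path by dropping the directory and the last extension.
ocr_base() {
    name=$(basename "$1")
    printf '%s\n' "${name%.*}"
}

# Hypothetical helper: OCR every .webp file in the directory given as $1.
# Each text file is written next to its image, e.g. receipt.webp -> receipt.txt.
ocr_dir() {
    dir=$1
    for img in "$dir"/*.webp; do
        # If the glob matched nothing, $img is the literal pattern; skip it.
        [ -e "$img" ] || continue
        tesseract "$img" "$dir/$(ocr_base "$img")" --oem 1 -l eng
    done
}
```

With those defined, something like ocr_dir ./receipts would OCR each image in that (assumed) folder.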

Here’s how you’d output a searchable PDF (written to output.pdf):

tesseract receipt.webp output --oem 1 -l eng pdf
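As far as I know, the output format at the end of the command is a config file name, and more than one can be listed (for example txt pdf) to get several outputs in one run. Below is a minimal sketch of a wrapper for that. The function name ocr and the DRY_RUN variable are my own inventions, not Tesseract features; setting DRY_RUN just prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Tesseract).
# $1 = input image, $2 = output base, remaining args = output formats
# such as txt or pdf.
ocr() {
    img=$1
    base=$2
    shift 2
    if [ -n "${DRY_RUN:-}" ]; then
        # DRY_RUN is an assumed variable for this sketch: show the command only.
        echo tesseract "$img" "$base" --oem 1 -l eng "$@"
    else
        tesseract "$img" "$base" --oem 1 -l eng "$@"
    fi
}
```

Calling ocr receipt.webp output txt pdf should then leave both output.txt and output.pdf behind, assuming your Tesseract build accepts multiple output configs.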

Also see the example of an OCR’d PDF in my image scanning notes.