JMM’s notes on using Optical Character Recognition software

Sometimes you want to get text out of an image, like when you’re scanning receipts (also see my image scanning notes).

So far I’ve mostly just used Tesseract from the command line.

Tesseract

tesseract receipt-2023NOV02.webp output --oem 1 -l eng

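By default it adds “.txt” to the base name “output”, so the text ends up in output.txt. The “--oem 1” picks the LSTM (neural net) engine and “-l eng” sets the language to English.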

Here’s how you’d output a PDF:

tesseract receipt.webp output --oem 1 -l eng pdf
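
If you have a whole folder of receipts, the two commands above can be wrapped in a little script. This is just a rough sketch: the function names (tesseract_cmd, ocr_folder) and the "*.webp" pattern are made up for illustration, and it assumes the tesseract binary is on your PATH. It only uses the same arguments shown above.

```python
import subprocess
from pathlib import Path


def tesseract_cmd(image, outbase, lang="eng", oem=1, pdf=False):
    """Build the same command line as in the notes above, as an argument list."""
    cmd = ["tesseract", str(image), str(outbase), "--oem", str(oem), "-l", lang]
    if pdf:
        # Trailing "pdf" asks for PDF output instead of plain text.
        cmd.append("pdf")
    return cmd


def ocr_folder(folder, pattern="*.webp", pdf=False):
    """Run tesseract on every matching image in a folder (hypothetical helper)."""
    for image in sorted(Path(folder).glob(pattern)):
        # receipt.webp -> receipt; tesseract appends .txt or .pdf itself.
        outbase = image.with_suffix("")
        subprocess.run(tesseract_cmd(image, outbase, pdf=pdf), check=True)
```

Calling ocr_folder("receipts", pdf=True) would then write one PDF next to each image, assuming a folder named "receipts" exists.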

“Compressing and enhancing hand-written notes”, a blog post by Matt Zucker, originally linked at “Improving the quality of the output” in tessdoc
- Uses k-means clustering to compress colors and denoise background. Pretty cool.
  This might be fun to implement in Stan. Like, I could have a prior on some clusters being exactly white or black.
  
  My old 2013 blog post on color palettes with k-means might be useful here.
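
To make the k-means idea concrete, here is a minimal sketch of color quantization with plain Lloyd's k-means in NumPy. It is not Zucker's actual code (I'm not reproducing his background handling or pixel sampling), and the function names and defaults (kmeans_palette, quantize, k=8) are my own assumptions. It expects an image already loaded as an (height, width, 3) uint8 array.

```python
import numpy as np


def kmeans_palette(pixels, k=8, iters=20, seed=0):
    """Cluster an (N, 3) array of RGB pixels into k colors (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    pixels = np.asarray(pixels, dtype=float)
    # Start from k pixels picked at random.
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)].copy()

    def assign(centers):
        # Squared distance from every pixel to every center, shape (N, k).
        # This is memory-hungry; for a big scan, fit on a random subsample.
        d2 = ((pixels[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(iters):
        labels = assign(centers)
        for j in range(k):
            members = pixels[labels == j]
            if len(members):  # an empty cluster keeps its old center
                centers[j] = members.mean(axis=0)
    return centers, assign(centers)


def quantize(image, k=8):
    """Replace every pixel with its cluster's mean color."""
    h, w, c = image.shape
    centers, labels = kmeans_palette(image.reshape(-1, c), k=k)
    return centers[labels].reshape(h, w, c).round().astype(np.uint8)
```

With a small k the noisy paper background should collapse into one flat color, which is the denoising effect described in the post. Snapping a cluster to exactly white or black would be an extra step on top of this.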