Sometimes you want to get text out of an image, for example when you’re scanning receipts (also see my image scanning notes).
So far I’ve mostly used Tesseract from the command line.
Tesseract
- Homepage
- https://tesseract-ocr.github.io/
- Source code
- https://github.com/tesseract-ocr/tesseract
- User Manual
- https://tesseract-ocr.github.io/tessdoc/
- nixpkg
- tesseract
Command line example
tesseract receipt-2023NOV02.webp output --oem 1 -l eng
By default it adds “.txt” to the base “output”, so this writes output.txt. Here --oem 1 selects the LSTM engine and -l eng the English language data.
Here’s how you’d output a PDF:
tesseract receipt.webp output --oem 1 -l eng pdf
Also see my image scanning note for an example of an OCR’d PDF.
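If there's a whole folder of receipts, the same commands can be scripted. Here's a minimal Python sketch that only shells out to the tesseract binary with the flags used above. The function names (build_tesseract_command, ocr_directory) and the "*.webp" pattern are my own placeholders, not part of Tesseract.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="eng", oem=1, pdf=False):
    """Return the argument list for one tesseract run (nothing is executed)."""
    command = ["tesseract", str(image), str(output_base),
               "--oem", str(oem), "-l", lang]
    if pdf:
        # The trailing config name "pdf" switches the output from .txt to .pdf.
        command.append("pdf")
    return command


def ocr_directory(directory, pattern="*.webp", pdf=False):
    """Run tesseract on every matching image, writing the output next to it."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not on PATH")
    for image in sorted(Path(directory).glob(pattern)):
        # with_suffix("") drops ".webp", so receipt.webp becomes receipt.txt or receipt.pdf.
        command = build_tesseract_command(image, image.with_suffix(""), pdf=pdf)
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Show the command for the PDF example without running it.
    print(build_tesseract_command("receipt.webp", "output", pdf=True))
```

Keeping the command builder separate from the loop makes it easy to print and check a command before actually running it on a pile of scans.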
Links to check out
- Compressing and enhancing hand-written notes, a blog post by Matt Zucker, originally linked from Improving the quality of the output in tessdoc
  - Uses k-means clustering to compress colors and denoise the background. Pretty cool.
  - This might be fun to implement in Stan. For example, I could put a prior on some clusters being exactly white or black. My old 2013 blog post on color palettes with k-means might be useful here.
 
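To make the k-means idea concrete, here's a rough pure-Python sketch of color compression with plain Lloyd's k-means. It is not the code from the blog post and not the Stan version (there's no prior on white or black clusters). It assumes the pixels are already loaded as a list of (R, G, B) tuples, and the function names are mine.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(pixel, centers):
    """Index of the center closest to pixel."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(pixel, centers[i]))


def kmeans_palette(pixels, k, iterations=20, seed=0):
    """Plain Lloyd's k-means over RGB tuples; returns k palette colors.

    Needs at least k distinct colors in pixels.
    """
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(pixels)), k)  # k distinct starting colors
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for pixel in pixels:
            groups[nearest(pixel, centers)].append(pixel)
        new_centers = []
        for center, group in zip(centers, groups):
            if group:
                n = len(group)
                new_centers.append(tuple(sum(ch) / n for ch in zip(*group)))
            else:
                new_centers.append(center)  # empty cluster: keep its old center
        if new_centers == centers:
            break  # assignments have stopped changing
        centers = new_centers
    return centers


def quantize(pixels, centers):
    """Replace every pixel with its nearest palette color, rounded to ints."""
    palette = [tuple(round(c) for c in center) for center in centers]
    return [palette[nearest(pixel, centers)] for pixel in pixels]


if __name__ == "__main__":
    # Noisy "paper" and "ink" pixels collapse to two flat colors.
    scan = [(250, 250, 250), (255, 255, 255), (245, 250, 252),
            (10, 10, 10), (0, 0, 0), (5, 8, 3)]
    print(quantize(scan, kmeans_palette(scan, k=2)))
```

Collapsing the noisy near-white background into a single flat color is the denoising part. With fewer distinct colors the image should also compress better.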