feinberj
I have a fairly large PDF document archive.  Some of the files are image-only PDFs (mostly scans of printed documents), and some of the PDF files are document scans with underlying text data inserted after the fact with Acrobat Pro's OCR feature.  The document scans look exactly like the originals, but I can use the text tool to copy text from "underneath" the scanned image.

Is there a way to use HoudahSpot to flag PDF files that either have or don't have text data "behind" the image scan?

Thanks,

John Feinberg
New York, NY

houdah
Hi!

Spotlight relies on importer plug-ins to index files. What attributes get indexed depends on the importer in charge of the given file type.

Please use the Inspector window in HoudahSpot to see what attributes were indexed for your PDF files. There might be one to flag files which have been processed by OCR.
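The same check can be done outside HoudahSpot. Here is a rough Python sketch that asks macOS's `mdls` command-line tool which values Spotlight indexed for one attribute of a file. It assumes the relevant attribute is `kMDItemEncodingApplications`, and that `mdls` prints values as quoted strings; the helper names are made up for illustration.

```python
import re
import subprocess

# Assumption: the attribute of interest is kMDItemEncodingApplications.
ENCODING_ATTR = "kMDItemEncodingApplications"


def parse_mdls_values(text):
    """Pull every double-quoted value out of mdls output.

    Returns an empty list when nothing is quoted, e.g. when mdls
    reports "(null)" for an attribute that was not indexed.
    """
    return re.findall(r'"((?:[^"\\]|\\.)*)"', text)


def indexed_values(path, attribute=ENCODING_ATTR):
    """Run mdls on one file and return the values of one attribute (macOS only)."""
    result = subprocess.run(
        ["mdls", "-name", attribute, path],
        capture_output=True, text=True, check=True,
    )
    return parse_mdls_values(result.stdout)
```

Calling `indexed_values` on a PDF that has been through OCR and on one that has not should show whether the importer recorded anything that tells them apart.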

Best,
Pierre Bernard
Houdah Software s.à r.l.

Houdah Software s. à r. l.
https://www.houdah.com

HoudahGeo: One-stop photo geocoding
HoudahSpot: Advanced file search utility
Tembo: Easy and effective file search
feinberj
Thanks for the suggestion - with your post as a starting point I've come up with a workable query.

I'm using the "encoding software" attribute to search for items that have or have not had OCR performed.  

I select "Other", then select the Encoding Software attribute, and then drag a file that has been through OCR onto that criterion line (but not onto the line's text entry field). HoudahSpot then fills in the text field with the file's encoding software tag, which for my file was "Adobe Acrobat 9.11 Paper Capture Plug-in". For simplicity, I changed the criterion to: Encoding Software contains "Paper Capture Plug-in". Negating the same criterion finds the files that have not had OCR performed.
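For anyone who wants to script the same search, here is a minimal Python sketch built on macOS's `mdfind` command. It assumes that the "Encoding Software" attribute corresponds to the Spotlight key `kMDItemEncodingApplications`, and that the marker text contains no double quotes; the function names are hypothetical. The "no OCR" set is computed as a difference rather than with a negated query, since files with no encoding software value at all may not match a negated comparison.

```python
import subprocess

# Assumption: "Encoding Software" maps to this Spotlight key.
ENCODING_ATTR = "kMDItemEncodingApplications"
PDF_TYPE = 'kMDItemContentType == "com.adobe.pdf"'


def build_query(marker=None):
    """Build a Spotlight query for PDFs, optionally filtered by encoding software.

    The trailing "c" makes the wildcard match case-insensitive.
    """
    if marker is None:
        return PDF_TYPE
    return f'{PDF_TYPE} && {ENCODING_ATTR} == "*{marker}*"c'


def find_pdfs(folder, marker=None):
    """Return the set of PDF paths under folder that match the query (macOS only)."""
    result = subprocess.run(
        ["mdfind", "-onlyin", folder, build_query(marker)],
        capture_output=True, text=True, check=True,
    )
    return set(result.stdout.splitlines())


def split_by_ocr(folder, marker="Paper Capture Plug-in"):
    """Split the indexed PDFs under folder into (with_ocr, without_ocr)."""
    with_ocr = find_pdfs(folder, marker)
    without_ocr = find_pdfs(folder) - with_ocr
    return with_ocr, without_ocr
```

Other OCR tools will write a different encoding software tag, so the marker string would need adjusting for files not processed by Acrobat's Paper Capture plug-in.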

John F
NY, NY

houdah
Great!

Thank you for sharing.

Best,
Pierre Bernard
Houdah Software s.à r.l.
