Houdah Software Forums
stevenb

Registered:
Posts: 4
 #1 
I am attempting to find the word anki (it is the name of software) in the content of my thousands of references. I use the following in Houdah:
Text Content>Contains phrase>anki

I expect to get results that contain only the word (phrase) anki. However, the results return references that contain anki within words such as: ranking, banking, flanking, etc. This should not happen if I am searching for an exact phrase. A search using 'contains words' returns similar results.

A look at the raw data window shows the query (Text Content>Contains phrase>anki) as: 
kMDItemTextContent == "*anki*"c

If I read this correctly it seems that Houdah puts wildcards as prefix and suffix to the phrase anki. Of course I do not want this to happen. Why does Houdah seem to place wildcards in a search for an exact phrase and how can this be avoided?
houdah

Moderator
Registered:
Posts: 2,906
 #2 
Hi!

The “contains phrase” operator matches the text/phrase you typed against text in the file. The wildcards are needed to allow for text before and after the search phrase. I.e. you expect to find “Anki” somewhere within a document that contains more text.

You could search for the phrase “ anki ”, i.e. add a space before and after the word. But you will miss occurrences like “foo bar anki. more foo bar”, where the word is followed by punctuation rather than a space.

To search for the word “Anki”, I recommend you use the “contains words” operator.
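
To make the difference between these approaches concrete, here is a minimal Python sketch. It only emulates the matching behaviour described above with plain string and regex operations. It is not how Spotlight or HoudahSpot work internally, and the function names `contains_phrase` and `contains_word` are made up for illustration.

```python
import re


def contains_phrase(text, phrase):
    # Emulates a query like "*phrase*"c: case-insensitive substring match.
    return phrase.lower() in text.lower()


def contains_word(text, word):
    # Emulates a whole-word match using regex word boundaries (an assumption,
    # not Spotlight's real tokenizer).
    return re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE) is not None


# Substring matching finds "anki" inside "ranking".
print(contains_phrase("The ranking was final.", "anki"))        # True

# Padding with spaces misses the word when punctuation follows it.
print(contains_phrase("foo bar anki. more foo bar", " anki "))  # False

# Whole-word matching finds "anki." but not "ranking".
print(contains_word("foo bar anki. more foo bar", "anki"))      # True
print(contains_word("The ranking was final.", "anki"))          # False
```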


Best,

Pierre Bernard
Houdah Software s.à r.l.


__________________
Houdah Software s. à r. l.
https://www.houdah.com

HoudahGeo: One-stop photo geocoding
HoudahSpot: Advanced file search utility
Tembo: Easy and effective file search
stevenb

Registered:
Posts: 4
 #3 
Thanks for the reply, but as I previously mentioned, using the 'contains words' operator still returned documents with words such as flanking and rankings, not just the word anki. The raw query was as follows:

kMDItemTextContent == "anki"cdwt

Admittedly, the 'contains words' operator returned far fewer documents than the 'contains phrase' operator. Using the search term " anki " as you suggest also reduced the number of returned documents. In this case, documents that contained words such as rankings were still returned. However, it seemed that those documents had justified spacing, hyphenation, or incorrect spacing which essentially 'isolated' anki from the rest of the word. This was not the case for the 'contains words' or 'contains phrase' operators.

Nevertheless, there seems to be something 'not quite right' with the search algorithm. 

 
houdah

Moderator
Registered:
Posts: 2,906
 #4 
Hi!

The “contains words” operator should not match “anki” to “flanking”. It will, however, match camelCase words like “FlAnking”.

Best,

Pierre Bernard
Houdah Software s.à r.l.


stevenb

Registered:
Posts: 4
 #5 
Maybe it shouldn't but it does. Please refer to the images at the following URLs:

https://ibb.co/hSff3Q
https://ibb.co/cXWWA5

 

gilby

Registered:
Posts: 10
 #6 
When I search with "contains words" for "anki", I also get a few PDF documents that have words like 'yanking' and not 'anki'.   This is what is going on:

The raw text inside a PDF is not the same as displayed by the Preview app - the words are sometimes/frequently* broken up in the raw text.   You can see this in the two examples you have provided where HS is showing you some raw text.   I am guessing (but fairly certain) that if you go further down the HS raw text previews you will find what looks like the word 'anki' even though the Preview app would display it as part of a word like 'yanking'.

In my case searches (in GofT PDFs) for the words "anki" or "yanki" find PDFs whilst "ankin" finds none.  And when I look at the text preview I can see places where 'yanking' is broken up into 'y anki ng' and 'yanki ng', but never 'y ankin g'.

So you can say that Spotlight (and hence HS) is correctly(?) finding the word 'anki' in the raw text of the PDF. HS shows this in its text preview. This doesn't, of course, solve the problem, but I hope it explains why it is happening.

*Some documents are worse than others. My GofT PDFs (like your two examples) are particularly bad, but in most documents it is only where you might have explicit hyphenation. As an example, I searched for 'lect', which showed many documents containing 'electricity', but when I searched the raw text preview I found 'col -lect' (I assume this is marking how to hyphenate at a line break).
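
The effect described in this post can be reproduced with a small Python sketch. It assumes a very simple tokenizer (runs of letters count as words), which is only a stand-in for whatever Spotlight really does when it indexes text. The sample strings are invented, modelled on the 'y anki ng' and 'col -lect' fragments quoted above.

```python
import re


def words(raw_text):
    # Hypothetical tokenizer: every run of ASCII letters is one word.
    return [w.lower() for w in re.findall(r"[A-Za-z]+", raw_text)]


displayed = "He kept yanking the rope."   # what a PDF viewer shows
raw = "He kept y anki ng the rope."       # what the extracted text may contain

print("anki" in words(displayed))  # False: "yanking" is a single word
print("anki" in words(raw))        # True: the break isolates "anki"
print("ankin" in words(raw))       # False: no break produces "ankin"

# A hyphenation marker has the same effect as a stray space.
print("lect" in words("col -lect"))    # True
print("lect" in words("electricity"))  # False
```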
 
stevenb

Registered:
Posts: 4
 #7 
Hello Gilby

Thanks for your contribution although I do wonder about your choice of 'yanking' as an example. In any case, you are correct in that further down the raw text in the pdf document there were words broken up such that the word 'anki' was physically separated by at least one space either side. Consequently, it seems that HS does find anki as a true individual word, but it also highlights the word 'anki' in the non-broken words such as 'flanking' 

As a small experiment, I edited one of the PDF documents and deleted all words containing 'anki' except one, and then ran HS with the text content>contains word>anki search term on the folder containing the document. The word remaining in the document was 'flanking', and it was an unbroken word. HS did not return any results. This supports the contention that HS is capable of searching for only a specific word (in this case anki), but for some unknown reason, once it finds the word, it also highlights other words that contain anki.
gilby

Registered:
Posts: 10
 #8 
Hi stevenb, 

When I searched for 'anki', the first document was the first Game of Thrones book. In my particular copy it has breaks within words such that HS/Spotlight find 8 'anki', six of which were in 'yanking'. Honest! No offence was meant, but I couldn't resist using that as my example.

Your initial postings seemed unlikely, but just possible, so I was intrigued to see if I could reproduce them. Much to my surprise, I could. So trying to work out what was happening was a learning experience for me. Thanks for that.

Searching within Preview finds all occurrences of the letter string 'anki'. So it, just as you found for HS's text preview, is searching for the string 'anki' and not the word. This shows the difference between searching within a file as a long string of text and searching for occurrences of words and phrases in an index.
houdah

Moderator
Registered:
Posts: 2,906
 #9 
Hi!

The Text Preview in HoudahSpot does not try to replicate the word splitting / search performed by the Spotlight engine.

Text Preview simply highlights the words you used as “text content” or “any text” search criteria.
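
As a rough illustration of why the preview can highlight 'flanking' even though the search itself matched on an isolated 'anki', here is a minimal Python sketch of a plain substring highlighter. The `highlight` function and the bracket markers are hypothetical; this is not HoudahSpot's code, just a sketch of highlighting that ignores word boundaries.

```python
import re


def highlight(text, term):
    # Wrap every case-insensitive occurrence of the term in brackets,
    # regardless of whether it is a whole word or part of a longer one.
    return re.sub(
        re.escape(term),
        lambda m: f"[{m.group(0)}]",
        text,
        flags=re.IGNORECASE,
    )


print(highlight("flanking y anki ng", "anki"))
# fl[anki]ng y [anki] ng
```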

Best,

Pierre Bernard
Houdah Software s.à r.l.

