Phrase Search in 1.6 does not work

Posted: Sat Aug 26, 2017 3:51 am
by t-p
Hi Rich,

I realized yesterday that "phrase search" does not work. I tried different browsers. I also tried to find the cause but could not.

It did work fine in 1.5.4 and earlier versions.

Thanks for your help.

--tara

Re: Phrase Search in 1.6 does not work

Posted: Sat Aug 26, 2017 4:27 am
by captquirk
I'll look into it. Are you using regular or PDO?

Re: Phrase Search in 1.6 does not work

Posted: Sat Aug 26, 2017 5:31 am
by t-p
captquirk wrote:
Sat Aug 26, 2017 4:27 am
Are you using regular or PDO?
Regular

Re: Phrase Search in 1.6 does not work

Posted: Sat Aug 26, 2017 6:05 am
by captquirk
I visited your site and search page. What I found is that:
1) Phrase search DOES work, but...
2) It appears the entire blog is not indexed!

When I search for phrases in later blog posts, I get no results. Searching for phrases in older posts DOES return expected results.
When I do an AND search using a phrase (as individual words) from a later post, I get results BUT the later posts aren't among them, which tells me they aren't indexed.

This doesn't mean there isn't a problem. If you HAVE re-indexed and later posts are not being picked up, that is a problem. One thing I have found when re-indexing is that it is a good idea to check the database tab and see if there are any items in the temp table. I have had leftover items there interfere with a site index. You can also check this (and clean the temp table if necessary) from the Clean tables tab. If indexing is in an Incomplete or Suspended state, do NOT clear the temp table, but continue the indexing to completion.

If you have re-indexing set up as a cron job (or in task manager in Windows), check your logs to see if the job has actually been run.

Let me know and we can pursue this to resolution. I am also running a scan of the blog and will check the results in the A.M.

Re: Phrase Search in 1.6 does not work

Posted: Sat Aug 26, 2017 5:23 pm
by captquirk
I indexed your blog. I did exclude some items, such as "/tag", "/category/", "/author", "/feed", "/page", "/?", and "/wp-json". Anyway, I ended up with 190 pages in the index. I was able to do a search on phrases, although I did find ONE phrase that would not return a result!

From the latest post, "Eclipse and the Gurbani", the phrase "memorable event in the U.S." yielded no results. However, "memorable event in the U." DID! (In fact, searching for simply "U.S." in ALL the sites I have indexed for test purposes, there were zero results even though it does in fact appear numerous times on multiple pages on multiple sites. It even suggested that I search for "U.S." instead of "U.S."!!!)

In a word search, the periods are stripped and the result is too short to be of any value, and as a phrase it just isn't found. After further examination, it appears that in a phrase search, a single period at the end is ignored, but a period appearing anywhere else in the phrase will cause the search to fail. Thus, "memorable event in the U." and "memorable event in the U" are equivalent. However, add an "S" after the period following the "U" and you get nothing. Neither "U S" nor "US" is a solution.

I will investigate that, but seeing that periods are a somewhat special case, trying to fix one anomaly may cause other problems.

At any rate, testing locally and on your site does seem to indicate an incomplete indexing of the entire blog. Check your logs, and if such is NOT the case, let me know and we will proceed from there.

Re: Phrase Search in 1.6 does not work

Posted: Sat Aug 26, 2017 11:42 pm
by captquirk
Okay. I found the reason the phrase "a memorable event in the U.S." returns no results, at least in my setup. During indexing, I have the minimum word length set at 3. "U.S." is 4 characters. So the phrase search is including it in the list of words to look for and not the list of words which are too short or too common.

However, in the keywords table, the word "U.S." will never appear because the periods are stripped and you end up with "U" and "S", which won't be indexed. So the phrase search, not finding "U.S." as a keyword, will negate the search and no results are returned.

I then added "u.s." to the file include/common.txt right after "too" and before "under". It has to be lowercase "u.s.". Now "U.S." is considered a common word and not a keyword and the phrase search returns expected results. With my index, that is two rows (pages) containing the phrase "a memorable event in the U.S.".
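To illustrate the mismatch described above, here is a minimal Python sketch. It is not Sphider's actual code (Sphider is PHP); the function names, the tiny common-word set standing in for include/common.txt, and the sample page text are all hypothetical. It only assumes what the post states: the indexer strips periods, the minimum word length is 3, and the phrase search gives up if a required word is missing from the keywords.

```python
MIN_WORD_LENGTH = 3  # the setting mentioned in the post
# Hypothetical stand-in for a few entries of include/common.txt
COMMON_WORDS = {"a", "in", "the", "too", "under"}

# Hypothetical page text used for the demonstration
PAGE = "It was a memorable event in the U.S. for many."


def index_words(text):
    """Indexer side: periods are stripped, then short/common words are dropped."""
    words = text.lower().replace(".", " ").split()
    return {w for w in words if len(w) >= MIN_WORD_LENGTH and w not in COMMON_WORDS}


def required_keywords(phrase, common):
    """Search side: periods are kept, so 'u.s.' counts as a 4-character word."""
    words = phrase.lower().split()
    return [w for w in words if len(w) >= MIN_WORD_LENGTH and w not in common]


def phrase_search(phrase, page_text, common=COMMON_WORDS):
    """Return True if the phrase is found, False if any required keyword is missing."""
    keywords = index_words(page_text)
    if any(w not in keywords for w in required_keywords(phrase, common)):
        return False  # a required word was never indexed, so the search is negated
    return phrase.lower() in page_text.lower()


if __name__ == "__main__":
    print(phrase_search("memorable event in the U.S.", PAGE))
    print(phrase_search("memorable event in the U.S.", PAGE, COMMON_WORDS | {"u.s."}))
```

In this sketch, adding "u.s." to the common-word set moves it out of the required keywords, which mirrors the effect of the common.txt workaround.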

Now as it stands, the file common.txt is very English (latin-1 charset) in its content, so it very well may be that there are short words using extended ascii characters, such as ü or é or any of many other non-ascii but utf-8 charset characters, that are going to affect phrase search results. These would be words that are not indexed for one reason or another (the presence of periods is the only reason I can think of offhand), don't qualify as "too short", and don't appear in common.txt.

This aberration may have nothing to do with your issue, but it is still interesting. You may want to add "u.s." to common.txt just in case.

Re: Phrase Search in 1.6 does not work

Posted: Sat Aug 26, 2017 11:59 pm
by t-p
Hi Rich,

I have done everything you suggested, several times.

I have re-indexed several times, after emptying ALL tables.

Also, please note my sphider search is totally exclusive of the blog at my site. I use it only for the main site, EXCLUDING the blog.

Here are the results:

INDEXING:
(1) It did not complete if I use the url
(2) However, it did complete indexing using an XML sitemap. I have an XML sitemap in the root of my site and sphider used that one and completed indexing the site.

SEARCH - with "find all words"
(1) search works for ENGLISH as well as NON-English terms
(2) However, the search fails if the term is followed by a , ; : and so on, as at the end of a sentence.

SEARCH - PHRASE
(1) English terms - No problem
(2) NON-English terms - Does NOT work (In addition to English, I also use an Indic language)
(3) Also, same problem with , ; : and so on as noted above.

Re: Phrase Search in 1.6 does not work

Posted: Sun Aug 27, 2017 1:11 am
by captquirk
Search - with "find all words"

Confirmed that there is a problem with punctuation marks making words un-findable. The problem is that punctuation marks (apostrophe excluded) are not indexed as a part of the word, but the search has not stripped those marks. Thus, the search fails. Definite problem, which I will address.

For English phrases, the same problem occurs. Initially, the internal logic WILL find the phrase, but during the weighting process the string is broken down into component words, which are including the punctuation marks. It's a little trickier to address, but certainly not an impossible task. That, too, will be solved.
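The underlying idea of the fix can be sketched as follows: run the search terms through the same normalization the indexer applies before comparing them. This is a hedged Python illustration, not the change that went into Sphider; the function names are hypothetical, and the use of Unicode punctuation categories (rather than a fixed list of ASCII marks) is an assumption chosen so that letters and combining marks in non-English text are left alone.

```python
import unicodedata


def strip_punctuation(text):
    """Replace every punctuation character except the apostrophe with a space."""
    return "".join(
        " " if unicodedata.category(ch).startswith("P") and ch != "'" else ch
        for ch in text
    )


def normalize(text):
    """Shared normalization: used for page text at index time and for queries."""
    return strip_punctuation(text).lower().split()


def find_all_words(query, indexed_words):
    """'Find all words' search: every normalized query term must be indexed."""
    terms = normalize(query)
    return bool(terms) and all(term in indexed_words for term in terms)


if __name__ == "__main__":
    indexed = set(normalize("Hello, world; it's fine: really."))
    print(sorted(indexed))
    print(find_all_words("world; fine,", indexed))
```

Because both sides go through normalize(), a query such as "world;" is compared as "world" and matches. The same normalized word list could also be used in the weighting step for phrases.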

I will do an index of your articles for testing purposes. If you can provide a short list of non-English phrases (I only need four or five, even three would suffice), I can trace what is happening and hopefully resolve that also.

Now as far as indexing not completing ...
I presume that sphider and the articles are on a web server and you are accessing it remotely through a browser to do the indexing. I do so also. What happens to me, all too frequently, is that all those accesses overwhelm the server and I end up getting a 500 error at some point (sometimes it takes a while, and the point is entirely random). At this point, indexing fails. Using a sitemap.xml is a lot quicker and reduces the risk. Now one idea occurred to me and I will give it a try the next time I do a re-index, and that is to go into Settings, Spider Settings, "Minimal delay between page downloads", and change the number from the default 0 to something else, maybe 5. This will slow the pace down and MIGHT reduce the chances of a 500 error. I haven't tried it on my site, but intend to.

Another alternative, which DOES work for me, is to log onto my server from a command prompt and run a re-index from the command line. The User Guide advises how to do that.

I'll be in touch when I find more information.... AND solutions! :)

------------------------------------
UPDATE!!!!
I see your problem with indexing and needing to use a sitemap. I do run into this on occasion. What is happening is that the links (href's) aren't being picked up by Sphider. I ran a test script using an alternate method of finding href's and it DOES find them. I did find the cause for one of the sites on which I encountered the problem. Yours may be a different cause, but at least I know WHERE to start looking! I'll let you know what I find there. One thing at a time...

Re: Phrase Search in 1.6 does not work

Posted: Sun Aug 27, 2017 11:33 pm
by captquirk
Final resolution....

Sphider 1.6 encodes the full text of a web page to utf-8 prior to storing. This is fine unless the page is ALREADY utf-8 encoded. Encoding to utf-8 text that is already utf-8 produces garbage.

For anyone having the same issue with Sphider 1.6, go into sphider.php and comment out line 345. (For PDO users, that would be line 359.) You will need to clear the indexing for the site and re-index. This will remove the garbled text and replace it with clean text.

Web pages come in many charsets. Sphider ORIGINALLY was a latin-1 creature and has been transitioning to utf-8. Clearly, this was a pitfall. The next version of Sphider will make a strong effort to determine the character set of the page being indexed and convert it to utf-8 ONLY if necessary.
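For readers who want to see the double-encoding effect, here is a small Python sketch. It is not Sphider's code, and the function names are hypothetical. It assumes the unconditional step behaves like a latin-1 to utf-8 conversion, and it shows one simple way to convert "only if necessary": leave the bytes alone when they already decode as valid utf-8. This check is a heuristic (a latin-1 page can occasionally also be valid utf-8), so a real implementation would probably also look at the charset declared by the page.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Unconditional conversion: treats every input byte as a latin-1 character."""
    return raw.decode("latin-1").encode("utf-8")


def to_utf8_if_needed(raw: bytes) -> bytes:
    """Convert only when the bytes are not already valid utf-8 (heuristic)."""
    try:
        raw.decode("utf-8")
        return raw  # already utf-8, leave untouched
    except UnicodeDecodeError:
        return raw.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    utf8_page = "caf\u00e9".encode("utf-8")      # page that is already utf-8
    latin1_page = "caf\u00e9".encode("latin-1")  # page in latin-1

    # Double encoding turns the accented letter into two garbage characters
    print(latin1_to_utf8(utf8_page).decode("utf-8"))
    # The conditional version keeps both pages readable
    print(to_utf8_if_needed(utf8_page).decode("utf-8"))
    print(to_utf8_if_needed(latin1_page).decode("utf-8"))
```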