indexed with the message "Page contains less than 10 word BUG???

Posted: Wed Dec 06, 2023 7:42 am
by chef-olaf
Hello,

I also had this problem in version 5.41.
But I have probably found the reason.

If I activate "Index decimal numbers" in the settings ("Index numbers" must be checked for this to have any effect),
the error occurs.
When deactivated, all pages are indexed normally.

Greetings Olaf

Re: indexed with the message "Page contains less than 10 word BUG???

Posted: Thu Dec 07, 2023 3:49 pm
by captquirk
A BUG! in MY code? Heaven forbid!

Well, to be honest it is possible (because it has happened so many times before). :lol:

And considering your report on a "less than 10 words" solution, this is DEFINITELY something I will be looking into.

Thanks for the tip... and the tip-off.

Re: indexed with the message "Page contains less than 10 word BUG???

Posted: Mon Dec 11, 2023 7:34 pm
by captquirk
Confirmed! Checking "Index decimals" DOES indeed cause Sphider to report "Page contains less than 10 words." In fact, in my test case, "less than 1 word!"

An initial look at the code does not reveal any smoking guns, so this needs more study to find a fix. This may be a blessing in disguise, as Sphider just assumes the decimal separator to be a period or full stop, when in fact that is only true in English-speaking countries and a few selected others. A large percentage of the world uses a decimal comma. That will be taken into consideration when a fix is made.
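
To illustrate the separator issue, here is a rough sketch (in Python for brevity, and NOT Sphider's actual code) of a matcher that accepts either a period or a comma as the decimal separator. The names DECIMAL_RE, find_decimals, and normalize_decimal are made up for this example. Note that it cannot tell a decimal comma from a thousands separator, so "1,000" would be picked up as well.

```python
import re

# Hypothetical pattern: digits, one period or comma, then more digits.
DECIMAL_RE = re.compile(r"\d+[.,]\d+")

def find_decimals(text):
    """Return every decimal-looking token, whichever separator it uses."""
    return DECIMAL_RE.findall(text)

def normalize_decimal(token):
    """Map a decimal comma to a period so both spellings share one index key."""
    return token.replace(",", ".")

if __name__ == "__main__":
    for tok in find_decimals("Version 1.06, sold for 1,06 EUR, build 106"):
        print(tok, "->", normalize_decimal(tok))
```

Normalizing at index time would let "1,06" and "1.06" share one entry, but whether that is desirable depends on the site's content.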

In the meantime, I recommend NOT checking "Index decimal numbers". Indexing numbers is safe, but NOT decimals.

Re: indexed with the message "Page contains less than 10 word BUG???

Posted: Fri Dec 15, 2023 4:36 pm
by captquirk
First off, the bug got fixed.
But on reflection and testing, the ability to index decimals may be pointless!
Why? You can't search ("and", "or") for a word containing a period (.) or comma (,)! For example, a search for 1.06 (or 1,06) will translate into a search for "1 06". Both the period and comma are HTML entity characters and get replaced by a space. Now one COULD change that behavior, but that opens a big can of security concerns.
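
The effect described above can be sketched roughly as follows. This is a Python illustration only, not Sphider's cleaning routine; the function name clean_query and the exact rule (anything that is not a letter, digit, underscore, or whitespace becomes a space) are assumptions made for the example.

```python
import re

def clean_query(query):
    """Replace punctuation with spaces, then split into search terms.

    Assumption: the cleaner keeps only word characters and whitespace.
    """
    cleaned = re.sub(r"[^\w\s]", " ", query)
    return cleaned.split()

if __name__ == "__main__":
    print(clean_query("1.06"))
    print(clean_query("1,06"))
```

Under that assumption, both spellings end up as the two separate terms "1" and "06", so an "and" or "or" search never sees the decimal as a single word.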

Meanwhile, even without indexing decimal numbers, one could still find any occurrence of a decimal number by using a "phrase" search. This type of search is very different than the "and" and "or" searches, which is why it works.
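
One way to picture the difference, as a sketch under stated assumptions rather than a description of Sphider's internals: suppose keyword searches look terms up in an index of stored words, while a phrase search scans the stored page text for the literal string. The PAGE_TEXT sample, the KEYWORD_INDEX set, and both function names below are hypothetical.

```python
# Hypothetical stored page text and keyword index (decimals not indexed).
PAGE_TEXT = "Sphider version 1.06 was released today"
KEYWORD_INDEX = {"sphider", "version", "was", "released", "today"}

def and_search(terms):
    """Keyword search: every term must exist in the index."""
    return all(term.lower() in KEYWORD_INDEX for term in terms)

def phrase_search(phrase):
    """Phrase search: look for the literal string in the stored text."""
    return phrase.lower() in PAGE_TEXT.lower()

if __name__ == "__main__":
    print(and_search(["1", "06"]))   # terms left after punctuation is stripped
    print(phrase_search("1.06"))     # literal match against the page text
```

In this toy model the keyword lookup misses because neither "1" nor "06" is an indexed word, while the literal scan still finds "1.06" in the text.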

Therefore, the "Index decimals" option may be removed from the next release. More pondering and testing before that happens...

----------------------------------
[UPDATE]: I just may have found a way to crack open the security can of worms, extract only one worm, reseal the can, and put tight controls on that one worm! We just might be able to search for decimals without blowing security out of the water after all!