indexed with the message "Page contains less than 10 words" BUG???

Come here for help or to post comments on Sphider
chef-olaf
Posts: 13
Joined: Wed Dec 06, 2023 7:38 am

Post by chef-olaf »

Hello,

I also had this problem in version 5.41.
But I have probably found the reason.

If I activate in the settings - Index decimal numbers (Index numbers must be checked for this to have any effect) -
the error occurs.
When deactivated, all pages are indexed normally.

Greetings Olaf
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Post by captquirk »

A BUG! in MY code? Heaven forbid!

Well, to be honest it is possible (because it has happened so many times before). :lol:

And considering your report on a "less than 10 words" solution, this is DEFINITELY something I will be looking into.

Thanks for the tip... and the tip-off.

Post by captquirk »

Confirmed! Checking "Index decimals" DOES indeed cause Sphider to report "Page contains less than 10 words." In fact, in my test case, "less than 1 word!"

An initial look at the code does not reveal any smoking guns, so this needs more study to find a fix. This may be a blessing in disguise, as Sphider just assumes the decimal separator to be a period (full stop), when in fact that is only true in English-speaking countries and a few others. A large percentage of the world uses a decimal comma. That will be taken into consideration when a fix is made.
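
To illustrate the separator issue, here is a minimal sketch of a tokenizer that keeps decimal numbers in one piece whether they use a period or a comma. This is NOT Sphider's code (Sphider is written in PHP); the `tokenize` function and the `TOKEN` pattern are hypothetical names made up for this example.

```python
import re

# Hypothetical pattern, not taken from Sphider: a decimal number written with
# either a period or a comma, or otherwise a plain run of word characters.
TOKEN = re.compile(r"\d+[.,]\d+|\w+")

def tokenize(text):
    """Split text into lowercase tokens, keeping decimals such as 1.06 or 1,06 whole."""
    return TOKEN.findall(text.lower())

print(tokenize("Price 1.06 EUR or 1,06 EUR"))
# ['price', '1.06', 'eur', 'or', '1,06', 'eur']
```

A number followed by sentence punctuation (for example "Version 3.") is still treated as a plain number, because the pattern requires digits on both sides of the separator.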

In the meantime, I recommend NOT checking "Index decimal numbers". Indexing numbers is safe, but NOT decimals.

Post by captquirk »

First off, the bug got fixed.
But on reflection and testing, the ability to index decimals may be pointless!
Why? You can't search ("and", "or") for a word containing a period (.) or comma (,)! For example, a search for 1.06 (or 1,06) will be translated into a search for "1 06". Both the period and the comma are treated as HTML entity characters and get replaced by a space. Now one COULD change that behavior, but that opens a big can of security concerns.

Meanwhile, even without indexing decimal numbers, one could still find any occurrence of a decimal number by using a "phrase" search. This type of search is very different than the "and" and "or" searches, which is why it works.
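
A rough sketch of the difference, in Python rather than Sphider's PHP. Everything here is an assumption made for illustration: the function names are hypothetical, the character replacement is simplified to "any punctuation becomes a space", and the phrase search is modeled as a plain substring match against the stored page text, which may not be how Sphider does it internally.

```python
import re

def query_terms(query):
    """Model the "and"/"or" path: punctuation becomes a space, then split into terms.
    (Assumed behavior, simplified from the description above.)"""
    cleaned = re.sub(r"[^\w\s]", " ", query)
    return cleaned.split()

def phrase_search(phrase, page_text):
    """Model the "phrase" path as a literal substring match (an assumption)."""
    return phrase in page_text

page_text = "The fee is 1.06 percent"

print(query_terms("1.06"))                # ['1', '06']
print(query_terms("1,06"))                # ['1', '06']
print(phrase_search("1.06", page_text))   # True
```

In this model the "and"/"or" search never sees the decimal as a single word, while the phrase search still matches it because it looks at the text itself.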

Therefore, the "Index decimals" option may be removed from the next release. More pondering and testing before that happens...

----------------------------------
[UPDATE]: I just may have found a way to crack open the security can of worms, extract only one worm, reseal the can, and put tight controls on that one worm! We just might be able to search for decimals without blowing security out of the water after all!