On the matter of stemming

Post by captquirk » Sat Nov 30, 2019 12:14 am

There have been a couple (two, actually) of reports of Sphider failing when stemming has been enabled. I have not been able to test those instances, not knowing the URLs affected. I can test sites at random, but had been unable to get a failure... until now. More in a moment.

First, let me tell you a bit about the stemmer in Sphider. The original Sphider used what is called the Porter Stemmer. It was English only and wasn't at all picky about encoding. In fact, Sphider wasn't picky about encoding, either. That actually limited what Sphider could do. Sphider has been updated to index using utf-8 encoding. If a site is not encoded in utf-8 (which is true of a LOT of sites), Sphider works to determine the encoding and convert it to utf-8. Even the database is utf-8 (FULL utf-8, not the partial utf-8 MySQL defaults to). The stemmer has also been updated.

Now, the new stemmer employed by Sphider is an amalgamation of functions developed by the folks at https://snowballstem.org. There are several language variations. The English variation is Porter2, an improvement over the original Porter stemmer. These functions are coded for utf-8... very STRINGENTLY coded for utf-8!

I tried indexing one of my own sites with stemming enabled. It ran for a while, then abruptly stopped. It stopped on the same page every time. The page was encoded in utf-8 (or so I thought), and looked proper in every browser I tried. Well, it turns out that even though everything LOOKED good, there was a single character (displaying properly) that was NOT proper utf-8!

The lesson is, if you want to use stemming, your pages MUST be 100% utf-8. And despite the efforts of Sphider to ensure utf-8, and despite the best efforts on our part to ensure utf-8, every once in a while an innocent character is going to throw a monkey wrench into the mix.

If your indexing, using stemming, always fails on a certain page, most likely that page has a non-conforming character on it.

To confirm that this is the case, edit the spider.php file, located in the admin folder. In 3.3.0-MB, look at lines 34 and 35 and find:
//error_reporting(E_ALL ^ E_NOTICE ^ E_WARNING); //Development only
error_reporting(0);
Change these to read:
error_reporting(E_ALL ^ E_NOTICE ^ E_WARNING); //Development only
//error_reporting(0);
This will enable error reporting. If the spider stops and you see the error
Fatal error: Uncaught Exception: Word must be in UTF-8
then character encoding is the issue. The current stemmer isn't very forgiving. Find the offending character, make it utf-8, and try again. Let's hope you don't have a lot of offending characters!
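Finding the offending character by eye can be hard, since it may display properly in a browser. The following is a minimal sketch of one way to hunt for it; it is not part of Sphider. The function name find_invalid_utf8 and the constant VALID_UTF8_PREFIX are made up for this illustration. It matches the longest valid utf-8 prefix of each line with a byte-level regular expression and reports the first byte that does not fit.

```php
<?php
// Hypothetical helper, not part of Sphider.
// Byte-level pattern (no /u flag) for the longest run of well-formed UTF-8
// at the start of a string. The possessive *+ avoids needless backtracking.
const VALID_UTF8_PREFIX = '/^(?:
      [\x00-\x7F]
    | [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
)*+/x';

// Returns one entry per line that contains invalid UTF-8, describing only
// the FIRST bad byte on that line (line number, byte offset, byte value).
function find_invalid_utf8(string $text): array
{
    $problems = [];
    foreach (explode("\n", $text) as $index => $line) {
        if (preg_match(VALID_UTF8_PREFIX, $line, $match) !== 1) {
            continue; // regex engine error; ignored in this sketch
        }
        $offset = strlen($match[0]); // length of the valid prefix, in bytes
        if ($offset < strlen($line)) {
            $problems[] = [
                'line'   => $index + 1,
                'offset' => $offset,
                'byte'   => sprintf('0x%02X', ord($line[$offset])),
            ];
        }
    }
    return $problems;
}
```

You could call it on a saved copy of the failing page, for example print_r(find_invalid_utf8(file_get_contents('page.html'))); where page.html is just a placeholder file name. Each entry gives a place to start looking in your editor.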
---------------------------
[EDIT: 8 December 2019 - The Sphider 3.4.0 and SphiderLite 1.1.0 releases try to make the stemmer a bit more forgiving by converting a word to UTF-8 rather than giving a fatal error. If there are an excessive number of non-UTF-8 characters in a site, this may not be much help.]
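
To give a rough idea of what "more forgiving" can mean, here is a minimal sketch of the general technique. This is NOT the actual Sphider code; the function name to_utf8_word and the Windows-1252 fallback are assumptions made for illustration, and it needs PHP's mbstring extension.

```php
<?php
// Hypothetical helper, not Sphider's actual implementation.
// Returns $word unchanged when it is already valid UTF-8; otherwise
// re-encodes it on the ASSUMPTION that it came from a legacy
// single-byte encoding (Windows-1252 unless told otherwise).
function to_utf8_word(string $word, string $fallback = 'Windows-1252'): string
{
    if (mb_check_encoding($word, 'UTF-8')) {
        return $word; // already well-formed, leave it alone
    }
    // Caveat: if the word mixes valid multi-byte UTF-8 with a stray byte,
    // the valid part is re-encoded too and comes out garbled.
    return mb_convert_encoding($word, 'UTF-8', $fallback);
}
```

A guess like this keeps the indexer running, but it is still only a guess at the original encoding, which is one reason a site with many bad characters may not index cleanly.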
