
First impressions - Am I missing something?

Posted: Fri Jul 23, 2021 3:12 pm
by rakone
Years ago I had a registered copy of Sphider Pro. Glad you've taken this project on and kept it alive.

I've been working on a WordPress crawler/search engine over the past few months. It incorporates ideas from both Sphider and PHPCrawl (which isn't working any more). I have it pretty well along using the same keyword database structure as Sphider.

So, I installed Sphider yesterday and got it working, but I ran into problems on the first site I tried to index. Here are the issues I either had problems with or found lacking.

Not handling redirects - A 301 redirect brought it to a screeching halt on the first page I tried to index. PHPCrawl had a few settings to handle this: Follow redirects (t/f), Follow to content (t/f), Maximum redirects (int).
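A minimal sketch of how "Follow redirects" and "Maximum redirects" style settings could work, written in Python for illustration. It is not Sphider's or PHPCrawl's code. The names `resolve_redirects` and `fetch` are hypothetical; `fetch` stands in for whatever HTTP call the crawler uses and is assumed to return a status code and the `Location` header (or `None`). The "Follow to content" setting is left out.

```python
from urllib.parse import urljoin

# HTTP status codes treated as redirects in this sketch.
REDIRECT_CODES = {301, 302, 303, 307, 308}


def resolve_redirects(url, fetch, follow_redirects=True, max_redirects=5):
    """Return (final_url, status) after following redirects.

    `fetch` is a hypothetical callable: fetch(url) -> (status, location),
    where `location` is the Location header value or None.
    """
    hops = 0
    while True:
        status, location = fetch(url)
        # Stop when the response is not a redirect, when following is
        # switched off, or when the server sent no Location header.
        if status not in REDIRECT_CODES or not follow_redirects or not location:
            return url, status
        if hops >= max_redirects:
            raise RuntimeError(f"more than {max_redirects} redirects for {url}")
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
        hops += 1
```

With a limit in place, a redirect loop ends in an error the crawler can log and skip, rather than stalling the whole crawl.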

Various errors causing Sphider to stop crawling - I tried to index another page from the same site and it reported a 404 error. When I tried the same page with my own crawler, it reported an invalid SSL certificate. I use cURL to handle the connections, and after adding CURLOPT_SSL_VERIFYPEER => FALSE it crawled and indexed the page with no issues.

Multiple databases - Noticed the multiple database feature from the Pro version was missing. This was a nice feature as you could point Sphider to one db and have it work on indexing and point the search to another db on another server and have it provide the user's search front end.

New links - Noticed that links from different domains seem to vanish into the ether. In the old Pro version you could create a database of links from different domains and index those. This limits Sphider to indexing one site at a time. Once upon a time I ran Sphider for over a month telling it to just keep indexing any new sites it found.

Maximum number of pages to index - Did not see a setting for this and this is scary. I once had Sphider index the old about.com site, which easily had hundreds of thousands of pages. Without any 'abort' feature this is a problem if you're like me and convince your host to set the PHP execution timeout to 0 so you can just keep running scripts. I have to submit a support ticket with my web host if I need to kill a process because cPanel doesn't have that feature. I may have missed this but I didn't see this setting. Having either a page limit or a set_time_limit as a user setting would be nice.
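As a rough illustration of the page limit and time limit idea, here is a small Python sketch of a crawl loop that stops at whichever limit is hit first. The names `crawl`, `get_links`, `max_pages` and `max_seconds` are assumptions made for this example and are not Sphider settings; `get_links` stands in for fetching a page and extracting its links.

```python
import time
from collections import deque


def crawl(start_url, get_links, max_pages=100, max_seconds=60.0,
          clock=time.monotonic):
    """Breadth-first crawl that stops at a page cap or a time budget.

    `get_links` is a hypothetical callable: get_links(url) -> iterable of URLs.
    Returns the list of URLs that were visited, in order.
    """
    deadline = clock() + max_seconds
    queue = deque([start_url])
    seen = {start_url}
    indexed = []
    # Both limits are checked before each page, so a runaway site
    # cannot keep the script going indefinitely.
    while queue and len(indexed) < max_pages and clock() < deadline:
        url = queue.popleft()
        indexed.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return indexed
```

The limits are checked between pages, so a single slow request can still overrun the time budget slightly.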

Didn't see an option to ignore the robots.txt file. While obeying robots.txt is the polite thing to do, forcing the end user to obey this could prevent them from indexing certain sites that they may want to index.
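One way such an option could look, sketched in Python with the standard library's `urllib.robotparser`. The function name `may_crawl`, the `obey_robots` flag and the `ExampleCrawler` user agent are made up for this example; the robots.txt content is passed in as lines so the sketch does not need network access.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(url, robots_lines, obey_robots=True, user_agent="ExampleCrawler"):
    """Return True if `url` may be fetched.

    `robots_lines` is the robots.txt content as a list of lines.
    `obey_robots` is a hypothetical user setting; when it is False the
    robots.txt rules are ignored entirely.
    """
    if not obey_robots:
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

Leaving `obey_robots` on by default keeps the polite behaviour unless the user deliberately turns it off.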

Seems REALLY, REALLY slow. As a test I had both my crawler and Sphider index the old DMOZ site here: https://dmoz-odp.org. I had Sphider set up on an older Core i3 with Ubuntu server running Apache with a 100mb internet connection. My crawler is on a shared host that runs a LOT slower with WordPress overhead but has a faster connection. It uses the same method of storing and weighing keywords as Sphider. My crawler indexed 50 pages in about a minute and a half. 14 minutes in, Sphider crashed on page 49 with "Execution failed: Data too long for column 'link' at row 1", which completely stopped it. Since this isn't an error I can 'fix' and since Sphider crashes at that point, I couldn't just skip over it and keep indexing the site.
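A sketch, in Python, of the "skip it and keep going" behaviour described above: over-long links are set aside before the insert, and a failed insert is recorded instead of ending the crawl. This is not how Sphider stores links. The `store_links` and `insert_row` names are hypothetical, and the 255-character limit is an assumed column size, not Sphider's actual schema.

```python
def store_links(links, insert_row, max_link_length=255):
    """Store each link, skipping ones that cannot be saved.

    `insert_row` is a hypothetical callable that writes one link to the
    database and raises an exception on failure. `max_link_length` is an
    assumed column size. Returns (stored, skipped), where `skipped` holds
    (link, reason) pairs.
    """
    stored, skipped = [], []
    for link in links:
        # Reject over-long links up front instead of letting the
        # database raise a "Data too long" error.
        if len(link) > max_link_length:
            skipped.append((link, "too long"))
            continue
        try:
            insert_row(link)
        except Exception as exc:
            # One bad row is logged and skipped; the loop carries on.
            skipped.append((link, str(exc)))
            continue
        stored.append(link)
    return stored, skipped
```

The `skipped` list can be written to the log so problem links can be reviewed after the crawl.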

And just something to be aware of that I've run into. If your log file becomes really, really large, Chrome will eventually crash with a SBOX FATAL MEMORY EXCEEDED error. It takes quite a bit to reach this point but you may want to consider checking how many lines are in the log and either trim it or start a new one.
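A minimal Python sketch of the "start a new one" approach: count the lines in the log, and once a threshold is passed, move the old file aside and begin a fresh one. The function name `rotate_log_if_needed`, the `.1` suffix and the 5000-line default are arbitrary choices for this example.

```python
import os


def rotate_log_if_needed(path, max_lines=5000):
    """Start a new log file once `path` has more than `max_lines` lines.

    The old log is renamed to `path + ".1"` (replacing any previous
    backup) and an empty file is created in its place.
    Returns True if the log was rotated.
    """
    if not os.path.exists(path):
        return False
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        line_count = sum(1 for _ in fh)
    if line_count <= max_lines:
        return False
    os.replace(path, path + ".1")
    # Create the new, empty log file.
    open(path, "w", encoding="utf-8").close()
    return True
```

Calling this before each crawl keeps the file shown in the browser to a bounded size.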

Thanks for keeping it going and I'll keep checking in to see how it's progressing.

Re: First impressions - Am I missing something?

Posted: Mon Jul 26, 2021 6:43 pm
by captquirk
Sphider Pro I have no experience with. Development seems to have stopped. Sphider Plus I have looked at. It is much more full featured than the original Sphider.

Since development on the original Sphider was suspended, I took over to keep it running in a modern environment. Sphider was intended as a lightweight indexer for personal use and I have strived to keep it that way.

Additional databases add complexity and are not deemed necessary for a lightweight crawler. I believe Sphider Plus uses them.

While I understand your desire to crawl sites, we must also respect the desire of certain web developers NOT to have portions of their sites crawled! If you are the owner of a site, you have the ability to add Sphider as an exception to the robots.txt. Sphider does have the ability to override the robots.txt when indexing images, but its use is not recommended. It is possible to change the coding on your copy of Sphider to bypass robots.txt and this is not prohibited. Sphider is open source.

I know Sphider can have difficulty with 301 redirects at times. I have found this is because some sites mix http links with https links. Within Sphider, these do not play well together.

Sphider has an option to index foreign links, but this is off by default. It is found in the settings page for each individual site.

Limiting the number of pages crawled --- this can be somewhat controlled by how deep a crawl will go. The default is 2 levels. You can either increase this number or go unlimited.

What you have to remember is this: Sphider is, and is intended to be, a LIGHT WEIGHT indexer. If you are trying to do the equivalent of indexing Wikipedia, you have the wrong indexer! Sphider is a Ford Focus, not a Ferrari. Sphider Plus tries to be a Ferrari. I have heard varying opinions about how well it has succeeded... some love it, others not so much. You might want to give it a look see.