Optimizing for speed

Come here for help or to post comments on Sphider
usabilitest
Posts: 6
Joined: Thu Aug 31, 2023 9:17 pm

Optimizing for speed

Post by usabilitest »

I have a large site. I had to break the sphidering process into smaller chunks and have been running it for days. Once the site is crawled, it is still a very long process to return the results. I was wondering if there's a way to optimize the schema further to improve its performance.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA
Contact:

Re: Optimizing for speed

Post by captquirk »

Sphider, as originally envisioned by Ando Saabas, was for small personal websites. It is, however, capable of handling some pretty large stuff! My own website (https://www.worldspaceflight.com) runs about 2000 pages. I have also, during testing, had as many as 15 sites indexed at once, some of them pretty good sized.
I have found the fastest way to do an index run is to index using a sitemap. This speeds things up because Sphider doesn't have to search for links; it is given a list of what to index and simply follows that list. Of course, the sitemap is not of much use during a re-index run, but then unchanged pages don't need to be indexed again, so that helps.
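Just to give a rough idea of why that helps, here is a throwaway sketch (NOT Sphider's actual code) of pulling the URL list out of a standard sitemap; the sitemap address is only a placeholder:

```php
<?php
// Illustrative sketch only -- not Sphider's code. With a sitemap, the crawler
// gets its full URL list up front instead of discovering links page by page.
$sitemap = simplexml_load_file('https://www.example.com/sitemap.xml'); // placeholder URL

$urls = [];
foreach ($sitemap->url as $entry) {
    $urls[] = (string) $entry->loc;   // <loc> holds each page address
}

echo count($urls) . " pages queued for indexing\n";
```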
Another aid to speeding things up, especially if the site is powered by WordPress or another system that offers many different ways of accessing the same content, is to use the "Must not include" feature. For example, with WordPress you can exclude "/?=", "/feed", "/wp-", "/?replyto", and "/xmlrpc" without losing any relevant content, shortening the time to index.
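Just to illustrate the idea (this is NOT Sphider's actual code, only a sketch of that kind of substring filtering, using the WordPress patterns above):

```php
<?php
// Sketch only -- not Sphider's code. A URL is skipped if it contains any of
// the "Must not include" strings.
function should_skip(string $url, array $mustNotInclude): bool
{
    foreach ($mustNotInclude as $pattern) {
        if ($pattern !== '' && strpos($url, $pattern) !== false) {
            return true;   // matches an excluded pattern, so don't index it
        }
    }
    return false;
}

$exclude = ['/?=', '/feed', '/wp-', '/?replyto', '/xmlrpc'];

var_dump(should_skip('https://example.com/2023/10/some-post/feed/', $exclude)); // true
var_dump(should_skip('https://example.com/2023/10/some-post/', $exclude));      // false
```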
Enabling the option to "Index images" is also going to be a great consumer of run time.
Once a mega-site is indexed, there isn't much you can do when performing a search. It takes longer to look through "War and Peace" than through "Green Eggs and Ham!" More pages, more words.
usabilitest
Posts: 6
Joined: Thu Aug 31, 2023 9:17 pm

Re: Optimizing for speed

Post by usabilitest »

Yes, the Links table is huge. It takes quite some time even when I search directly in the table using phpMyAdmin...
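For what it's worth, this is the kind of quick check I run to see what I'm up against (plain PDO, nothing Sphider-specific; the links table name is from my install, and the connection details are placeholders):

```php
<?php
// Quick size/index check on the links table. Credentials are placeholders.
$pdo = new PDO('mysql:host=localhost;dbname=sphider;charset=utf8mb4', 'user', 'password');

// Approximate row count and on-disk size (InnoDB estimates)
$stmt = $pdo->query(
    "SELECT TABLE_ROWS,
            ROUND((DATA_LENGTH + INDEX_LENGTH) / 1048576, 1) AS size_mb
       FROM information_schema.TABLES
      WHERE TABLE_SCHEMA = DATABASE() AND TABLE_NAME = 'links'"
);
print_r($stmt->fetch(PDO::FETCH_ASSOC));

// List the indexes on the table -- a search that can't use one scans every row
foreach ($pdo->query('SHOW INDEX FROM links') as $row) {
    echo $row['Key_name'] . ' on ' . $row['Column_name'] . PHP_EOL;
}
```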
usabilitest
Posts: 6
Joined: Thu Aug 31, 2023 9:17 pm

Re: Optimizing for speed

Post by usabilitest »

Even though I broke the site into sections and crawl them separately, the script sometimes times out after around 3K links. When that happens during the initial indexing I can use the "Continue Indexing" option, but if the script times out during re-indexing there's no way to continue.

Would you recommend, in such cases, doing a "Clear site" and then a new index? The alert for that option says: "Are you sure you want to clear? Index data will be lost." Will it just create new indexes, or will it mess up the rest of the db?
Last edited by usabilitest on Tue Oct 24, 2023 2:27 pm, edited 1 time in total.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA
Contact:

Re: Optimizing for speed

Post by captquirk »

Clearing the site, then doing a new index, will not harm anything. My only recommendation is that AFTER clearing the site you run "clean keywords" from the Clean Tables tab. Also "clean temp" if there is any data in there.
Cleaning keywords MIGHT take a while...

Also, when a timeout occurs during a re-index, you can just "clean temp" from Clean Tables and start over. Of course, odds are it will just happen again and you will never get to the end. SUPER FRUSTRATING!!!

What works for me during a re-index run is to do it from a command prompt! The timeouts are typically 500 and 504 errors, which, for me, only occur when using a browser. I'm not saying you will never time out from a command prompt, but if you do, it will be a PHP timeout and not an HTML/browser one.
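Something like this shows the difference (just a sketch; the actual script name and location depend on your Sphider install, so the path in the comment is only an example):

```php
<?php
// Sketch of why command-line runs behave differently. From the CLI, PHP's
// max_execution_time defaults to 0 (no limit) and there is no web server or
// gateway in front of the script to return a 500/504.
if (php_sapi_name() === 'cli') {
    set_time_limit(0);                // make sure no PHP time limit applies
    ini_set('memory_limit', '512M');  // long runs can need extra memory too
    echo "CLI run, max_execution_time = " . ini_get('max_execution_time') . PHP_EOL;
} else {
    echo "Browser run -- subject to web server / gateway timeouts.\n";
}

// Typical invocation (path is only an example -- check your own install):
//   php /path/to/sphider/admin/spider.php
```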