Trouble with indexing

harrymcc
Posts: 2
Joined: Sat Feb 17, 2024 8:42 am

Trouble with indexing

Post by harrymcc »

I've installed Sphider on my personal site (to try it out before putting it on a larger site). It's up and running, and I'm attempting to index using a sitemap, but it's stalling on certain pages, specifically these two:

https://harrymccracken.com/blog/2005/03 ... ir-murals/

https://harrymccracken.com/blog/2021/09 ... le-statue/

It starts indexing them, but never finishes.

I created the sitemap after not succeeding in getting Sphider to index this WordPress site without one. I thought it might be able to find everything from this archive page, but it didn't work:

https://harrymccracken.com/blog/harry-g ... g-archive/

Any advice would be appreciated. I'm excited about using this software.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Trouble with indexing

Post by captquirk »

I tried indexing your blog using your sitemap. The results were a partial success.
By this I mean that I made two runs and both ran to completion. No stalls. So in this regard I suspect you are encountering the dreaded 500 error! The best workaround for this is to index from a command prompt.

BUT --- even though indexing ran to completion, and MOST of the pages listed in the sitemap indexed, some pages did not even get referenced in the run! For example, any URL containing "bix-animation" was totally ignored. There were also a couple other pages totally ignored.

Additionally, some URLs NOT in the sitemap were indexed! I was able to eliminate the majority of those with a simple filter on the second run. Tweaking the filter can eliminate the remaining few.
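To illustrate the idea of filtering out unwanted URLs, here is a minimal sketch in Python. It is not Sphider's code, and the function name, the example.com URLs, and the substrings "/tag/" and "?replytocom=" are all assumptions chosen for illustration. The real filter strings depend on which stray URLs show up in your own run.

```python
def drop_unwanted(urls, must_not_include):
    """Return only the URLs that contain none of the given substrings."""
    return [u for u in urls if not any(s in u for s in must_not_include)]


# Hypothetical crawl results: one real post plus two WordPress-style
# alternate routes that we do not want in the index.
crawled = [
    "https://example.com/blog/2021/09/some-post/",
    "https://example.com/blog/tag/statues/",
    "https://example.com/blog/2021/09/some-post/?replytocom=12",
]

kept = drop_unwanted(crawled, ["/tag/", "?replytocom="])
print(kept)
```

Plain substring matching is crude, but it is usually enough to knock out whole families of unwanted URLs at once.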

I will need to do additional troubleshooting to see what is happening with the ignored pages.

Your sitemap is split into different sections. This is fine, but Sphider (which is pretty basic) is not reading anything except the base URL in the second two sections, and those URLs are just duplicates of the first section. I did not attempt to index images, although Sphider can find images embedded on a web page and would not need a special image sitemap. Also, Sphider does not index video files. I am not saying your sitemap is bad. It is not! It is just more advanced than Sphider. And the first part of the sitemap IS being read, as it should be.
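For anyone curious what "reading only the base URL" looks like in practice, here is a rough Python sketch. It is not how Sphider itself is written. It assumes the extra sections are sitemap extensions such as image entries, and the sample XML and example.com URLs are made up. It pulls out only the page-level <loc> entries and skips anything in an extension namespace.

```python
import xml.etree.ElementTree as ET

# Namespace of the standard sitemap protocol (sitemaps.org).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Made-up sample with one image extension entry, for illustration only.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:image="http://www.google.com/schemas/sitemap-image/1.1">
  <url>
    <loc>https://example.com/blog/post-one/</loc>
    <image:image>
      <image:loc>https://example.com/img/one.jpg</image:loc>
    </image:image>
  </url>
  <url>
    <loc>https://example.com/blog/post-two/</loc>
  </url>
</urlset>
"""


def page_urls(sitemap_xml: bytes) -> list[str]:
    """Return the page-level <loc> values, ignoring extension entries."""
    root = ET.fromstring(sitemap_xml)
    path = f"{{{SITEMAP_NS}}}url/{{{SITEMAP_NS}}}loc"
    return [loc.text.strip() for loc in root.findall(path) if loc.text]


print(page_urls(SAMPLE))
```

The image URL is never returned because its <image:loc> element lives in a different namespace and sits one level deeper than the page's own <loc>.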

In my experience, indexing WordPress blogs is a nightmare. WordPress is laid out so that there are many different ways to reach the same page --- several different URLs pointing to a single thing. Using a sitemap is probably the most efficient way of indexing WordPress blogs. The fact that I was able to complete TWO runs in a very reasonable amount of time is amazing!

Back to your original issue: when indexing hangs on some page, does Sphider then give you the "Continue indexing" option to resume? If so, then that is a pretty good indication of a 500 error. Try indexing from a command prompt.

Meanwhile, I'll try to see why some pages in a sitemap are ignored.

*******************************************************************
UPDATE: I realized there are TWO sitemaps! The one at
https://harrymccracken.com/sitemap.xml
is being used by Sphider. The one at
https://harrymccracken.com/blog/sitemap.xml
is not. Sitemaps have to be in the domain root.
The sitemap IS indeed being indexed correctly.
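
As a quick sanity check on the "domain root" rule, here is a small Python sketch. The function name is my own invention and this is not part of Sphider. It simply tests whether a sitemap URL's path is exactly /sitemap.xml.

```python
from urllib.parse import urlsplit


def is_root_sitemap(url: str) -> bool:
    """True when the URL points at /sitemap.xml directly under the domain."""
    return urlsplit(url).path == "/sitemap.xml"


print(is_root_sitemap("https://harrymccracken.com/sitemap.xml"))
print(is_root_sitemap("https://harrymccracken.com/blog/sitemap.xml"))
```

The first URL passes and the second does not, which matches which of the two sitemaps is actually being picked up.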
harrymcc
Posts: 2
Joined: Sat Feb 17, 2024 8:42 am

Re: Trouble with indexing

Post by harrymcc »

Thank you for all this. It's an odd site consisting of a WordPress blog, plus older flat HTML pages, and not everything links to everything else. So I'd expect a crawler to be somewhat erratic in finding everything. I'll see if I can figure out doing this from the command line.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Trouble with indexing

Post by captquirk »

Presuming the main issue is 500 errors stalling the indexing, the command line should do the trick. Then it is just a matter of getting the right stuff in the sitemap.xml.

Command line, run from the sphider/admin directory:
php spider.php -u https://harrymccracken.com/blog/ -f -s