Relocation: http 301 error in terminal

kas
Posts: 11
Joined: Fri Dec 17, 2021 3:36 pm

Post by kas »

Hi geeks, I'm new to Sphider and am getting this error when trying to index from the terminal, with almost every website I try:

localhost@website.com# php spider.php -u https://sugermint.com/ -d 0

1. Retrieving: https://sugermint.com/ at 13:36:44.
Relocation: http 301
Legit links found: 0. New links found: 0
Completed at 13:36:47.

I tried both http and https, and the same error occurs.
What is the solution? How can I index a website when the "Relocation: http 301" error shows up? Any expert help is appreciated.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Relocation: http 301 error in terminal

Post by captquirk »

Let me investigate. I will get back to you.

Post by captquirk »

I get the same result --- 301 error. I have seen this happen before and it is not your installation. It is something funky with the web page! I will keep looking to see if I can determine the EXACT issue.

In one other case where I saw this, the issue was how the https was set up on the site's server. The site successfully went to https, but how it did so was bouncing between http and https. Sphider is really kind of dumb! A browser knows how to handle this (redirects), but Sphider sees the redirect and quits! I can't say for sure this is the case with sugermint.com, so I will keep poking around.

Post by kas »

Yes, I have been looking into this through various channels too. Most of the websites I tried give this error, for example https://yourstory.com. It seems Sphider is not able to handle the redirect. I am curious how Google handles this kind of issue, since the website still gets crawled.

Post by captquirk »

Sphider is definitely a lightweight search engine. Anyone trying to use it as their own personal replacement for Google, Bing, etc., is going to be sorely disappointed!

Google, Bing, et al., use much more sophisticated algorithms than Sphider and, like any decent browser, follow redirects. Sphider retrieves a page and simply reports what it finds --- 200 (gets indexed), 404 (not found), 403 (denied), 500 (server error), 301/302 (redirect)... then it moves on to the next page, if any.

To index a redirect, Sphider would need to not only see the redirect code, it would then need to find the redirect target and then try to retrieve that for processing. Sounds simple, but a bit harder in practice. If anyone out there knows a simple way to get Sphider to do the same, that would be great! I've played around, but nothing practical SO FAR --- but I'm no genius programmer, either! LOL!
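
As a rough illustration of those steps (see the redirect code, find the target, retrieve it), here is a minimal standalone sketch in Python. It is not Sphider code and not how Sphider works internally; the names `fetch_once` and `follow_redirects`, the user agent string, and the hop limit are all assumptions made for the example.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    """Stop urllib from following redirects so each hop is visible."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_once(url, user_agent="example-crawler/0.1"):
    """Fetch one URL without following redirects.

    Returns (status, location, body); location is None unless the
    response carried a Location header. The user agent is hypothetical.
    """
    opener = urllib.request.build_opener(_NoRedirect)
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with opener.open(request, timeout=10) as response:
            return response.status, None, response.read()
    except urllib.error.HTTPError as err:
        # With redirects disabled, 3xx responses arrive here as HTTPError.
        return err.code, err.headers.get("Location"), b""


def follow_redirects(url, fetch=fetch_once, max_hops=5):
    """Follow redirects by hand and return (final_url, status, body)."""
    seen = set()
    for _ in range(max_hops + 1):
        if url in seen:
            raise RuntimeError(f"redirect loop at {url}")
        seen.add(url)
        status, location, body = fetch(url)
        if status not in REDIRECT_CODES or not location:
            return url, status, body
        # Location may be relative, so resolve it against the current URL.
        url = urljoin(url, location)
    raise RuntimeError(f"more than {max_hops} redirects")
```

The hop limit and the `seen` set are there because of the http/https bouncing described earlier in the thread: without them, a site that redirects back and forth would keep the crawler busy forever.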

Post by captquirk »

I have determined that sugermint.com is built using WordPress. I also know Sphider has difficulties with WordPress sites. A lot has to do with how robots.txt is set up. Sphider CAN index some WordPress sites, but it is an iffy proposition. (From experience using Sphider on my own WordPress blog.)

In the case of sugermint.com, if I use snippets of code from Sphider, I can get the content of https://sugermint.com/, but from within Sphider I can't. Tentative conclusion: The 301 may not be real, but the website configuration is detecting a crawler and throwing a 301??? (Other tests independent of Sphider do not indicate there being a valid 301 error.)
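
One way to test that tentative conclusion outside of Sphider is to request the same URL with different User-Agent headers and compare the raw status codes. The following is a minimal Python sketch under stated assumptions: the function names and both user agent strings are made up for illustration, and Sphider's real user agent is not shown in this thread.

```python
import http.client
from urllib.parse import urlsplit


def probe(url, user_agent, timeout=10):
    """Send one GET and return (status, Location header or None).

    http.client never follows redirects, so a 301 shows up as a 301.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("GET", target, headers={"User-Agent": user_agent})
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def compare_user_agents(url, agents, probe_fn=probe):
    """Probe the URL once per user agent; agents maps label -> UA string."""
    return {label: probe_fn(url, ua) for label, ua in agents.items()}


if __name__ == "__main__":
    # Both strings are placeholders, not the agents any real tool sends.
    agents = {
        "browser-like": "Mozilla/5.0 (X11; Linux x86_64)",
        "crawler-like": "example-crawler/0.1",
    }
    for label, result in compare_user_agents("https://sugermint.com/", agents).items():
        print(label, result)
```

If the two labels come back with different status codes, that would support the idea that the server treats crawlers differently. If both return the same code, the cause is more likely elsewhere.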

I am no guru. If anyone out there can give an explanation, it would be appreciated.

Post by kas »

https://yourstory.com/ is another website with this same 301 error. Also, Sphider is crawling only sitemaps.xml and not really indexing the webpages under the URL. How do we crawl an XML sitemap index that has several other sitemaps in it?

Post by captquirk »

I took a quick look at the source code for https://yourstory.com/, and ... WOW!
Definitely not your traditional HTML! I'll have to look deeper, but at this moment I'm not sure Sphider is sophisticated enough to digest it.

I'll check deeper and elaborate.

EDIT/UPDATE: This site is proving difficult to index. The connection keeps dropping, but I can s-l-o-w-l-y progress. I did manage to index 41 pages and a total of 2386 keywords before I grew tired of "continuing" a suspended index.

I do see that the sitemap.xml for this site is a listing of 753 other sitemaps (which is totally valid for large search engines such as Bing, Google, Yahoo, etc.). Unfortunately, Sphider (being the lightweight that it is) does not have that capability. (This may actually be an opportunity for improvement!)

SOME pages return a 301 code, but as I said, I did get 41 pages indexed and could continue on, but getting dropped after every few pages is rather frustrating. MAYBE increasing the delay between calls (in settings) would help.

Since you seem to be getting a 301 right out of the gate, it might be useful to see if you are getting any php errors. In spider.php, lines 34 and 35 (33 and 34 in the lite version), swap the comments to enable error reporting.

Post by kas »

Yes, I see a similar thing with another website, https://e27.co/. It follows the same pattern: it has sitemaps, but the webpages are not crawled. The search results show text links such as sitemap.html instead of the actual title, URL, or description of each article.

Post by captquirk »

The method of having a sitemap.xml consist of a list of additional sitemaps is perfectly valid. However, it is used primarily by very large sites in which a single sitemap would be HUGE! The maps are reduced to a manageable size, then referenced by a single master.
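
To make the master-plus-children layout concrete, here is a minimal Python sketch of reading such a sitemap. It is an illustration only, not a Sphider feature: `parse_sitemap`, `collect_page_urls`, the caller-supplied `fetch_text` function, and the `max_sitemaps` cap are all names and choices assumed for the example. The XML namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text):
    """Return (kind, urls), where kind is 'index' or 'urlset'."""
    root = ET.fromstring(xml_text)
    if root.tag == SITEMAP_NS + "sitemapindex":
        kind, entry = "index", "sitemap"
    elif root.tag == SITEMAP_NS + "urlset":
        kind, entry = "urlset", "url"
    else:
        raise ValueError(f"not a sitemap: {root.tag}")
    # Each <sitemap> or <url> entry carries its address in a <loc> child.
    path = f"{SITEMAP_NS}{entry}/{SITEMAP_NS}loc"
    urls = [loc.text.strip() for loc in root.iterfind(path) if loc.text]
    return kind, urls


def collect_page_urls(sitemap_url, fetch_text, max_sitemaps=50):
    """Walk a sitemap or sitemap index and return the page URLs listed.

    fetch_text(url) must return the XML as a string; it is supplied by
    the caller. max_sitemaps caps how many sitemap files get read.
    """
    pending, pages, visited = [sitemap_url], [], set()
    while pending and len(visited) < max_sitemaps:
        current = pending.pop(0)
        if current in visited:
            continue
        visited.add(current)
        kind, urls = parse_sitemap(fetch_text(current))
        if kind == "index":
            pending.extend(urls)  # child sitemaps still need reading
        else:
            pages.extend(urls)    # actual pages to hand to the indexer
    return pages
```

The cap matters for a site like the one above with 753 child sitemaps: a small indexer would probably want to read only a handful of them rather than the whole set.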

I MAY look into a procedure to read and parse such a sitemap, but, realistically, sites like that are equally extensive. Sphider was originally intended to be a tool for indexing small, personal websites. As such, Sphider will reach a point that it chokes. (I actually had a person a few years back who tried to index Wikipedia with Sphider!)

Why you are getting 301 errors right out of the gate, whereas I tend to have SOME success before getting choked off, is still a mystery to me, and one whose cause I am determined to eventually find.

If ANYONE out there has thoughts on this, I'd love to hear from you.