Re: Relocation: http 301 error in terminal

Posted: Tue Feb 22, 2022 10:40 am
by kas
Yeah, some sites are being crawled and some are not. But some websites show as crawled and still won't show up in search results. What could be the reason? Is the indexing not being done properly?

Re: Relocation: http 301 error in terminal

Posted: Sat Feb 26, 2022 4:17 pm
by captquirk
I have been looking into this. Basically, web pages are evolving in a manner which Sphider is incapable of understanding. I have a blog post addressing this and the ramifications.

https://www.blog.worldspaceflight.com/2 ... -obsolete/

Re: Relocation: http 301 error in terminal

Posted: Tue Mar 01, 2022 2:32 am
by captquirk
In the case of https://yourstory.com/, the Sphider mod found in the MODS section of this forum and titled "Index from sitemaps when sitemap is a list of sitemaps" will allow you to index using their sitemap. BUT be forewarned: the sitemap references 759 other sitemaps with a total of 61,309 (!!) URLs. Even then, many of these are STILL 301 redirects, which ARE LEGITIMATE redirects. It takes a L-O-N-G time just to get the sitemaps processed and indexing to begin.

For https://sugermint.com/, the very first page returns a 301. For some reason, that first 301 aborts the entire operation, even if you use the sitemap option (and the mod mentioned above). The 301 IS not a valid 301. There are a lot of redirect checkers to be found, and they all report the url to be a 200, no redirection found. Using a homemade tool based on code taken from Sphider, I can get a full text (no html) list from the home page. No idea where Sphider is getting this 301 from. My SUSPICION is that this MAY be a filter of some sort in their .htaccess file???

Anyway, the mod for processing sitemaps may be useful to you in some cases. I will be testing it for a while and, if all goes well, will incorporate it into an official release.
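
For readers who want to see what "a sitemap that is a list of sitemaps" involves, here is a minimal sketch. It is written in Python rather than Sphider's PHP and is NOT the code of the mod. The function name collect_urls and the caller-supplied fetch helper are hypothetical, and the sketch assumes the sitemaps use the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_urls(sitemap_url, fetch, seen=None):
    """Return page URLs from a sitemap, descending into sitemap indexes.

    fetch is a hypothetical caller-supplied function: URL -> XML text.
    """
    if seen is None:
        seen = set()
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    urls = []
    if root.tag == SITEMAP_NS + "sitemapindex":
        # A list of sitemaps: each <loc> points at another sitemap file.
        for loc in root.iter(SITEMAP_NS + "loc"):
            urls.extend(collect_urls(loc.text.strip(), fetch, seen))
    elif root.tag == SITEMAP_NS + "urlset":
        # An ordinary sitemap: each <loc> is a page URL.
        for loc in root.iter(SITEMAP_NS + "loc"):
            urls.append(loc.text.strip())
    return urls
```

With an index that references hundreds of child sitemaps, every child costs one more fetch before any page is indexed, which fits the long processing time described above.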

Re: Relocation: http 301 error in terminal

Posted: Thu Mar 17, 2022 6:35 am
by kas
I am trying to keep the landing page and the results page separate, to get a clean design as a project. Is Sphider capable of crawling ebooks and listing them in a gallery view? What I have found from experimenting is that crawling news websites fails when there are sitemaps with sub-sitemaps holding several thousand article URLs.

How much RAM does Sphider need to handle the crawl? Does that affect the 301 error in any way?

I also tried to understand how the Gigablast.com search can scrape anything without following robots.txt. How come it can index any website?

Re: Relocation: http 301 error in terminal

Posted: Thu Mar 17, 2022 4:06 pm
by captquirk
Hopefully the sitemap.xml mod will help in indexing sites. If testing doesn't show any serious problems, the mod will be incorporated into the next release.

What I HAVE noticed, however, is that even when indexing with a sitemap, the very first (landing) page has to be able to be indexed. If that ends up a 301, we are dead in the water. But getting a 301 is fuzzy! I have a few tools outside of Sphider that I use, and for selected sites these tools all agree with Sphider that there is a 301. For some sites the tools disagree. Then another tool will seem to prove there isn't a real 301 for any of them!

So... it gets really interesting. A subsection of Sphider code applied to one website says the landing page is a 301. The full Sphider will not index it. For another website (MY OWN BLOG!) the subsection of code says it returns a 301, YET SPHIDER INDEXES IT! What in the world is going on? I am trying to track this down to see what the difference is.

This is one of those issues that is driving me nuts. I work on it for a while and get a bit burned out, set it aside and take a break. Then back to trying new approaches. I am SURE there is a solution... I just haven't hit on it yet.

Re: Relocation: http 301 error in terminal

Posted: Thu Apr 14, 2022 2:47 am
by captquirk
Sorry for the long delay since the last post, but here is what I have found...

Many times, the 301 error which has been our nemesis is "real" in that it IS what is reported to Sphider. I can duplicate this by other methods. HOWEVER, in some cases there IS NO relocation!

Wild guess... webmaster is sending a 301 to stop applications like Sphider.

I found a workaround. First off, this workaround basically IGNORES 301 errors, EVEN IF THEY ARE LEGITIMATE. So I definitely would not want to make this a part of a published version of Sphider.

Now to the "fix" --- more of a HACK, actually!

In spider.php (talking about the full version, not the lite), see this on line 334 (line 311 in lite):
$url_status['state'] == "redirected";
That line is totally useless and does nothing. It was that way W-A-Y back in Sphider 1.3.6, and is non-functional. (Yes, it took me this long to find that out and I am removing it from the next release.) But we can use it as the basis for our hack. Replace that with this:
$url_status['state'] = "ok";
$url_status['content'] = "text";
If there really is content on those (formerly) 301 pages, it will now index.

As I said, I wouldn't recommend this to be a formal part of Sphider, just a hacker's approach to index the un-indexable.
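
A narrower variant of the hack is conceivable: ignore a 301 only when it carries no actual relocation target. The sketch below is in Python rather than Sphider's PHP, and is_real_relocation is a hypothetical name, not a Sphider function. It also rests on an untested assumption: that the "fake" 301s either omit the Location header or point back at the same URL. Nothing in this thread confirms that.

```python
from urllib.parse import urldefrag, urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def is_real_relocation(request_url, status, location):
    """Return True only if the response points somewhere other than request_url.

    location is the value of the Location header, or None if it was absent.
    """
    if status not in REDIRECT_CODES:
        return False
    if not location:  # a redirect status with no target: nowhere to go
        return False
    # Resolve relative targets and drop fragments before comparing.
    target = urldefrag(urljoin(request_url, location)).url
    return target != urldefrag(request_url).url
```

A crawler using this check would still follow a redirect from https://example.com/ to https://www.example.com/, but would treat a 301 that points back at the same page as ordinary content.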

Re: Relocation: http 301 error in terminal

Posted: Fri May 06, 2022 1:55 am
by kas
https://sifted.eu/ I am going to check this site again, along with a few other news and blog sites, to see whether it crawls or stops. Okay, let me check and also play around with pixabay.com and some random sites.

Also, categories can be created in admin, but is there a way for a user to add the title, website, and summary themselves from the frontend, via an "Add a site to directory" kind of link, so that the admin can approve it in the backend?

Re: Relocation: http 301 error in terminal

Posted: Fri Jul 15, 2022 7:13 pm
by captquirk
I just had an interesting experience concerning a 301 error!
I have been successfully indexing my own blog, which is WordPress driven.
I have also been working on the next version of Sphider, and created a new database instance for it. When testing, I tried indexing my blog.
301 errors!!! Several times I tried, same result. The old version still indexed without issue. I compared code to see where the issue might be. No luck.

Then in an act of desperation, I tried one thing in Settings. I changed the User Agent string. The blog now indexes fine!

It COULD be that some sites are looking at the User Agent and, if they don't like what they see, block the crawl with a 301.
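
One way to test that suspicion outside of Sphider is to request the same page with different User Agent strings and compare the status codes, without following redirects. This is a minimal sketch in Python (not Sphider's PHP). The function names are hypothetical, and the agent strings in the usage note are made-up examples.

```python
import http.client
from urllib.parse import urlsplit


def status_for_agent(url, user_agent, timeout=10):
    """Return (status, Location header) for one GET, without following redirects."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path, headers={"User-Agent": user_agent})
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def compare_agents(url, agents, probe=status_for_agent):
    """Map each User Agent string to the (status, Location) it receives."""
    return {agent: probe(url, agent) for agent in agents}


def agent_sensitive(results):
    """True if the status code differs between User Agent strings."""
    return len({status for status, _ in results.values()}) > 1
```

Calling compare_agents(url, ["Sphider", "examplebot"]) against a site and passing the result to agent_sensitive would show whether the 301 depends on the name the crawler announces.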

Re: Relocation: http 301 error in terminal

Posted: Wed Sep 07, 2022 4:02 am
by kas
So what did you enter in the User Agent box in the backend settings? Should I leave it blank, or type something such as xyzbot?

Yeah, some news media sites like sifted.eu are not being picked up. When they are, the results show the sitemaps, not the actual articles.

Re: Relocation: http 301 error in terminal

Posted: Wed Sep 14, 2022 2:03 am
by captquirk
Sorry for the delay in responding.

You can try different things for User Agent. I do know SOME administrators consider "Sphider" to be a rogue bot, which is a shame. It is not a rogue bot, but either the app or the name can be used by ROGUE PEOPLE!

You can try leaving User Agent blank. Or perhaps make up your own name.

I know some people (hopefully NOT Sphider users!) use a name like "google.com" or some other legit crawler name, but I personally feel that is dishonest and discourage doing so.

Ultimately, if some organization does not wish to be crawled, we should respect their wishes.
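
The usual way a site states that wish is robots.txt. As an illustration only, here is a small Python sketch (not Sphider's PHP, and not how Sphider does it) that checks a URL against robots.txt rules using the standard library. The function name allowed_by_robots and the bot name in the usage note are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with rules that contain "User-agent: examplebot" followed by "Disallow: /", a crawler announcing itself as examplebot would be refused for every page on that site.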