Robots.txt - would "Allow" support be useful for many?

Come here for help or to post comments on Sphider
wiringmaze
Posts: 4
Joined: Wed Sep 06, 2023 5:34 pm

Robots.txt - would "Allow" support be useful for many?

Post by wiringmaze »

I'm not sure of the best way to handle this -
  • I want to spider parts of my local server, while generally disallowing external robots.
    There is a second challenge in that I would prefer to use the secure path, like https://local.server.com/path/
Allow: With a bit of digging, I see that the code supports "Disallow", but I do not see support for "Allow".
Also, since spiders come and go and their names change, supporting only Disallow seems a bit limiting - at least for me.

https: My server has a Let's Encrypt cert, but I don't know if there is a way to grant Sphider the capability to make that https connection.

thanks - I'm a fairly new user, so still learning. Apologies if the answer is already available.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Robots.txt - would "Allow" support be useful for many?

Post by captquirk »

First off, Sphider will index https sites. It follows robots.txt. To allow Sphider, but disallow all other bots, try something like this:
User-agent: *
Disallow: /

User-agent: Sphider (sphidersearch.com)
Allow: /
You could also go into the Settings tab and change the User Agent string to something you are sure is unique, then in robots.txt use that as the name of the allowed bot.
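
For example, if you set the User Agent string to something made-up like "MySphider-7x3" (just a placeholder here - use whatever you actually put in Settings), the robots.txt would read:

Code: Select all

  User-agent: *
  Disallow: /

  User-agent: MySphider-7x3
  Allow: /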

Be aware that there are bots out there (Baidu from China) who don't give a rat's behind about robots.txt and index whatever the heck they want.

One sure-fire way to block a bot is to password protect a directory. Unfortunately, that will give even Sphider a 403 (access denied) error, and there is no way around that.

Hope this helps.
wiringmaze
Posts: 4
Joined: Wed Sep 06, 2023 5:34 pm

Re: Robots.txt - would "Allow" support be useful for many?

Post by wiringmaze »

Thanks for the follow-up, but I'm not quite with you yet.

For the robots.txt, I had exactly what you recommended, but it isn't working for me. This is what led me to look at the "checkRobotTxt($url)" function in spiderfuncs.php. I observed that it only has a check for "Disallow:", not "Allow:".

I downloaded the latest version to make sure. I then sprinkled in a few diagnostic print statements. It reads the robots.txt:

Code: Select all

  User-agent: *
  Disallow: /
After these lines, it will insert "/" into the $omit array.

Then it reads these -

User-agent: Sphider (sphidersearch.com)
Allow: /

This one does not add anything to the omit array, but that array is already 'tainted' from the first set.
Then it returns the omit array with the "/" still in it. What I think it needs to do is remove the "/" from the $omit array.

I think I understand it better now. It has to 'return null' to permit indexing, but it can only get there if the preg_match is true, and that is never true for the Allow record.

There may be a better way, but I've attached my modified version of that function.
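
For anyone who doesn't want to open the attachment, here is a rough sketch of the idea (my own illustration only - not the exact code in the attachment, and not Sphider's real function; the buildOmitList name and its parameters are just made up for this post). It remembers which user-agent block applies, adds Disallow paths to $omit, and takes a path back out again when a matching Allow shows up:

Code: Select all

  <?php
  // Rough sketch only. $lines is robots.txt split into lines; $myAgent is
  // whatever user agent name Sphider is configured to use.
  function buildOmitList(array $lines, string $myAgent): array
  {
      $omit = array();
      $applies = false; // does the current User-agent block apply to us?

      foreach ($lines as $line) {
          $line = trim(preg_replace('/#.*$/', '', $line)); // strip comments
          if ($line === '') {
              continue;
          }
          if (preg_match('/^user-agent:\s*(.+)$/i', $line, $m)) {
              $agent = trim($m[1]);
              $applies = ($agent === '*' || stripos($myAgent, $agent) !== false);
          } elseif ($applies && preg_match('/^disallow:\s*(.*)$/i', $line, $m)) {
              $path = trim($m[1]);
              if ($path !== '' && !in_array($path, $omit, true)) {
                  $omit[] = $path;
              }
          } elseif ($applies && preg_match('/^allow:\s*(.*)$/i', $line, $m)) {
              // The missing piece: an Allow pulls the path back out of $omit.
              $path = trim($m[1]);
              $omit = array_values(array_diff($omit, array($path)));
          }
      }
      return $omit;
  }

It glosses over things like longest-match precedence, but it shows why simply removing the "/" from $omit was enough for my case.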

For the https, I'll turn to that as a separate item - I should not have combined two things into one thread.
wiringmaze
Posts: 4
Joined: Wed Sep 06, 2023 5:34 pm

Re: Robots.txt - would "Allow" support be useful for many?

Post by wiringmaze »

CheckRobot_Function.7z
Proposed improvement to checkRobotTxt function.
(1.89 KiB) Downloaded 693 times
Attachment for the prior message.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Robots.txt - would "Allow" support be useful for many?

Post by captquirk »

You are correct. Looking closer, you can see the existing function ONLY looks for "disallows" and not "allows".

There is definitely room for improvement here. Also, thanks for your proposed improvement. I will look it over and may incorporate it, or some version of it, in a future (next?) release of Sphider.

There is always room for improvement. Part of me has been saying, with each release, "This is it! It just can't get any better." (I've been saying that since Sphider 2.0.0!!! :lol: Then I come up with an improvement...)

In the meantime, you might want to post your code in "Sphider MODS"... This would give people a chance to play around with it until I get things sorted out here. I've been doing research on EXACTLY how a robots.txt is supposed to work... what takes precedence over what... essentially what is "best practice".
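
From what I've read so far, the current spec (RFC 9309) says the longest matching rule wins, and when an Allow and a Disallow match a URL equally, the Allow is supposed to win. So, as I read it, something like this (placeholder paths, obviously) would leave a single page crawlable inside an otherwise blocked directory:

Code: Select all

  User-agent: Sphider
  Disallow: /private/
  Allow: /private/readme.html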
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Robots.txt - would "Allow" support be useful for many?

Post by captquirk »

I tried your mod out and there are issues.
Using the 5.3.0 checkRobotsTxt():
Indexing omits the flagged items.

Using the mod, I get:
Everything got indexed.

Still, you are correct, checkRobotsTxt() HAS to be modified/updated to take the Allow directives into consideration. There is another issue as well. As it is, checkRobotsTxt() treats everything as case insensitive. For the directives themselves (user-agent, disallow, and allow) that would be fine, even beneficial (since one person may use "User-agent:" and another "user-agent:"). But the targets should be case sensitive!

In other words, a "disallow: /files" should NOT disallow "/Files"!
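
Something along these lines is what I have in mind (just a sketch to show the point - the isBlocked name and parameters are made up, this is not the actual Sphider code):

Code: Select all

  <?php
  // Sketch only: directive names get matched case-insensitively when the file
  // is parsed, but the path targets are compared case-sensitively here.
  function isBlocked(string $urlPath, array $disallowed): bool
  {
      foreach ($disallowed as $path) {
          // strpos() is case sensitive, so "/files" will not block "/Files".
          if ($path !== '' && strpos($urlPath, $path) === 0) {
              return true;
          }
      }
      return false;
  }

  var_dump(isBlocked('/files/doc.html', array('/files'))); // bool(true)
  var_dump(isBlocked('/Files/doc.html', array('/files'))); // bool(false)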

Anyway, don't be discouraged. I am going to be working on an improvement. You do the same. One of us is bound to come up with a good solution.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Robots.txt - would "Allow" support be useful for many?

Post by captquirk »

Check this: viewtopic.php?p=547#p547

While not a FINAL solution,

user-agent: Sphider
Allow: /

should now allow access, disregarding all the disallows in user-agent: *.
Any desired disallows need to be added to user-agent: Sphider, even if they are duplicates of some disallows in user-agent: *.
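
So, as an example (the /private/ path is just a placeholder), a robots.txt for this setup might look like:

Code: Select all

  User-agent: *
  Disallow: /

  User-agent: Sphider
  Disallow: /private/
  Allow: /

Other bots stay blocked from everything, Sphider gets in everywhere except /private/, and that /private/ line has to be repeated in Sphider's own group even though the * group already covers it.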

Feedback is appreciated.