Improvements to Sphider handling of robots.txt

captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Improvements to Sphider handling of robots.txt

Post by captquirk »

The checkRobotsTxt() function in Sphider is deficient. It is not case sensitive, but that is a minor problem easily corrected.
Of greater concern is the lack of support for the Allow directive.

I have gathered some thoughts on what needs to be done and would appreciate any comments or suggestions as to how to proceed.

Here is what I have---
__________________________________
Only two user-agents matter to Sphider.
One is the user-agent specified on the Settings tab. Call it Sphider-agent.
The other is user-agent: *. Call it Star-agent.

Three directives need to be recognized (a parsing sketch follows this list):
1) User-agent:
2) Allow:
3) Disallow:
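A quick sketch of the line parsing I have in mind (parseRobotsLine() is a name I made up for illustration, not existing Sphider code; directive names are matched case insensitively while the path value keeps its case):

<?php
// A minimal sketch, not Sphider's actual code: split one robots.txt line into
// its directive name and value. The directive name is matched case
// insensitively, but the path value is left untouched so that path matching
// can stay case sensitive.
function parseRobotsLine($line) {
    $line = preg_replace('/#.*$/', '', $line);   // strip comments
    if (!preg_match('/^\s*(user-agent|allow|disallow)\s*:\s*(.*)$/i', $line, $m)) {
        return null;                             // not one of the three directives
    }
    return array(strtolower($m[1]), trim($m[2]));
}

// Both spellings are recognized; the path keeps its original case.
print_r(parseRobotsLine('Disallow: /Files/'));   // [0] => disallow, [1] => /Files/
print_r(parseRobotsLine('ALLOW: /public/'));     // [0] => allow, [1] => /public/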

Directives need to be prioritized; a small comparison sketch follows this list.

1) More specific directives override less specific directives
a) Sphider-agent is more specific than Star-agent.
b) Longer rules are more specific than shorter rules.
2) More permissive rules override more restrictive rules.
a) Allow is more permissive than Disallow.
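For rules 1b and 2a, a tie-breaking helper might look something like this (allowWins() is a hypothetical name, not part of Sphider):

<?php
// Sketch only: given an Allow rule and a Disallow rule that both match a URL,
// decide which one wins. The longer (more specific) rule wins; on equal
// lengths, the more permissive Allow wins.
function allowWins($allowRule, $disallowRule) {
    if (strlen($allowRule) != strlen($disallowRule)) {
        return strlen($allowRule) > strlen($disallowRule);  // longer rule is more specific
    }
    return true;                                            // tie: Allow is more permissive
}

// Example: "Allow: /files/public/" overrides "Disallow: /files/".
var_dump(allowWins('/files/public/', '/files/'));  // bool(true)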

An initial scan of robots.txt should produce 4 categories (sketched in code after this list):
1) Sphider-agent permits
2) Sphider-agent denies
3) Star-agent permits
4) Star-agent denies
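In code, that first pass could look roughly like this (a sketch under my own naming; $sphiderAgent would be the user-agent string from the Settings tab):

<?php
// Sketch: walk robots.txt and sort each Allow/Disallow into one of four
// buckets, depending on whether the current User-agent group is the
// Sphider agent or *.
function scanRobotsTxt($robotsTxt, $sphiderAgent) {
    $buckets = array(
        'sphider_allow' => array(), 'sphider_deny' => array(),
        'star_allow'    => array(), 'star_deny'    => array(),
    );
    $current = null;  // 'sphider', 'star', or null for a group we ignore
    foreach (preg_split('/\r\n|\r|\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '') {
            continue;
        }
        if (preg_match('/^user-agent\s*:\s*(.*)$/i', $line, $m)) {
            $agent = trim($m[1]);
            if (strcasecmp($agent, $sphiderAgent) == 0) {
                $current = 'sphider';
            } elseif ($agent === '*') {
                $current = 'star';
            } else {
                $current = null;
            }
        } elseif ($current !== null && preg_match('/^(allow|disallow)\s*:\s*(.*)$/i', $line, $m)) {
            $rule = trim($m[2]);
            if ($rule === '') {
                continue;   // an empty rule matches nothing
            }
            $key = $current . (strtolower($m[1]) === 'allow' ? '_allow' : '_deny');
            $buckets[$key][] = $rule;   // value kept as-is, so matching stays case sensitive
        }
    }
    return $buckets;
}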


Steps to eliminate conflicts, resulting in only 2 categories, permit and deny (a code sketch follows these steps):
1) Sphider-agent permits vs Sphider-agent denies:
On an exact match, drop the deny (Allow is more permissive).
2) Star-agent permits vs Star-agent denies:
On an exact match, drop the deny (Allow is more permissive).
3) Sphider-agent permits vs Star-agent denies:
On an exact match, drop the Star-agent deny (Sphider-agent is more specific).
Special case: a Sphider-agent "Allow: /" negates ALL Star-agent denies!
4) Sphider-agent denies vs Star-agent permits:
On an exact match, drop the Star-agent permit (Sphider-agent is more specific).
Special case: a Sphider-agent "Disallow: /" negates ALL Star-agent permits!
At this point, we combine the Sphider-agent permits with the Star-agent permits for a single permit list.
Then we combine the Sphider-agent denies with the Star-agent denies for a single deny list.
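Those four steps, in rough code form (a sketch only; $b is the four-bucket array from the scan sketch above, and resolveRules() is my name for it):

<?php
// Sketch of the conflict resolution described in steps 1) through 4) above.
function resolveRules($b) {
    // 1) and 2): within the same agent, an exact Allow/Disallow match keeps the Allow.
    $b['sphider_deny'] = array_diff($b['sphider_deny'], $b['sphider_allow']);
    $b['star_deny']    = array_diff($b['star_deny'],    $b['star_allow']);

    // 3): a Sphider-agent Allow overrides the matching Star-agent deny;
    //     "Allow: /" wipes out ALL Star-agent denies.
    if (in_array('/', $b['sphider_allow'])) {
        $b['star_deny'] = array();
    }
    $b['star_deny'] = array_diff($b['star_deny'], $b['sphider_allow']);

    // 4): a Sphider-agent Disallow overrides the matching Star-agent permit;
    //     "Disallow: /" wipes out ALL Star-agent permits.
    if (in_array('/', $b['sphider_deny'])) {
        $b['star_allow'] = array();
    }
    $b['star_allow'] = array_diff($b['star_allow'], $b['sphider_deny']);

    // Finally, merge into a single permit list and a single deny list.
    return array(
        'permit' => array_values(array_unique(array_merge($b['sphider_allow'], $b['star_allow']))),
        'deny'   => array_values(array_unique(array_merge($b['sphider_deny'],  $b['star_deny']))),
    );
}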

We are now finished editing the checkRobotsTxt() function, and focus shifts to Sphider itself.

When examining a URL, as things stand now, if the URL matches an entry in the deny list ($omit), it does not get indexed and we move on to the next URL.

The new procedure will be (in pseudocode, with a PHP sketch after it):

if (URL matches deny list) {
    if (URL matches permit list) {
        if (length of matching allow rule > length of matching deny rule) {
            INDEX
        } else {
            DO NOT INDEX
        }
    } else {
        DO NOT INDEX
    }
} else {
    INDEX
}
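In PHP, that decision might be sketched as follows (shouldIndex(), urlMatchesRule(), and the $permit/$deny arrays are illustrative stand-ins, not existing Sphider code; a rule is assumed to match when it is a case sensitive prefix of the URL path):

<?php
// Sketch of the decision above.
function urlMatchesRule($path, $rule) {
    return strpos($path, $rule) === 0;   // rule is a case sensitive prefix of the path
}

// Return the longest rule in $rules that matches $path, or null if none do.
function longestMatch($path, $rules) {
    $best = null;
    foreach ($rules as $rule) {
        if (urlMatchesRule($path, $rule) && ($best === null || strlen($rule) > strlen($best))) {
            $best = $rule;
        }
    }
    return $best;
}

function shouldIndex($path, $permit, $deny) {
    $d = longestMatch($path, $deny);
    if ($d === null) {
        return true;                   // no deny matches: INDEX
    }
    $a = longestMatch($path, $permit);
    if ($a === null) {
        return false;                  // deny matches, no permit: DO NOT INDEX
    }
    return strlen($a) > strlen($d);    // INDEX only if the allow rule is longer
}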
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Improvements to Sphider handling of robots.txt

Post by captquirk »

This is a partial solution to fixing the checkRobotsTxt() function. It reads the robots.txt file, considering both the * user-agent and the Sphider user-agent, and both allows and disallows. It produces a master array of denies and allows, compiled according to the rules laid out in the previous post.

This solution is "partial" in that only the deny strings are passed back to Sphider. Ideally, Sphider should be able to receive and process both the deny and allow lists, but for now it only considers deny. BUT, this does NOT mean that allows are not being considered! They are, in the composition of the array.
For example, if in robots.txt, for user-agent Sphider you have "Allow: /", ALL of the Disallows from user-agent: * are eliminated!
Also, strings are now case sensitive. A "Disallow: /files/" will no longer affect "/Files/".
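To illustrate with a made-up robots.txt (assuming "Sphider" is the user-agent name set on the Settings tab):

User-agent: *
Disallow: /private/
Disallow: /files/

User-agent: Sphider
Allow: /
Disallow: /Files/

Here the Sphider-agent "Allow: /" eliminates both Disallows inherited from user-agent: *, so /private/ and /files/ will be crawled. Only /Files/ (capital F) remains blocked, since its Disallow is longer, and therefore more specific, than "Allow: /".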

Make a backup copy of spiderfuncs.php. Then, in spiderfuncs.php, replace the entire checkRobotsTxt() function with the code attached below.
checkRobotsTxt.zip
(1.66 KiB) Downloaded 303 times