Improvements to Sphider handling of robots.txt
Posted: Fri Sep 08, 2023 5:44 pm
The checkRobotsTxt() function in Sphider is deficient. It is not case sensitive, but that is a minor problem, easily corrected.
Of greater concern is the lack of support for the Allow directive.
I have gathered some thoughts on what needs to be done and would appreciate any comments or suggestions on how to proceed.
Here is what I have---
__________________________________
Only two user-agents matter to Sphider.
One is the user-agent specified on the Settings tab. Call it Sphider-agent.
The other is user-agent: *. Call it Star-agent.
Three directives need to be recognized:
1) User-agent:
2) Allow:
3) Disallow:
Directives need to be prioritized.
1) More specific directives override less specific directives
a) Sphider-agent is more specific than Star-agent.
b) Longer rules are more specific than shorter rules.
2) More permissive rules override more restrictive rules.
a) Allow is more permissive than Disallow.
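To make the priority rules concrete, here is a minimal sketch (in Python rather than Sphider's PHP, and not actual Sphider code) of picking a winner among the rules that match a given path: longer rules win, and Allow beats Disallow on a length tie.

```python
# Illustrative only: pick the winning rule from those that match a path.
# Longer (more specific) rules win; on equal length, Allow beats Disallow.

def pick_rule(matching_rules):
    """matching_rules: list of (directive, path) tuples, e.g.
    [("Disallow", "/private/"), ("Allow", "/private/docs/")]."""
    # Sort by rule length (longest first); break ties so Allow sorts first.
    return sorted(matching_rules,
                  key=lambda r: (len(r[1]), r[0] == "Allow"),
                  reverse=True)[0]

print(pick_rule([("Disallow", "/private/"),
                 ("Allow", "/private/docs/")]))
# The Allow rule wins here because it is longer (more specific).
```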
The initial scan of robots.txt should produce 4 categories:
1) Sphider-agent permits
2) Sphider-agent denies
3) Star-agent permits
4) Star-agent denies
Steps to eliminate conflicts, reducing the four categories to two (permit and deny):
1) Sphider-agent permits vs. Sphider-agent denies:
On an exact match, drop the deny (Allow is more permissive).
2) Star-agent permits vs. Star-agent denies:
On an exact match, drop the deny (Allow is more permissive).
3) Sphider-agent permits vs. Star-agent denies:
On an exact match, drop the Star-agent deny (Sphider-agent is more specific).
Special case: a Sphider-agent "Allow: /" negates ALL Star-agent denies!
4) Sphider-agent denies vs. Star-agent permits:
On an exact match, drop the Star-agent permit (Sphider-agent is more specific).
Special case: a Sphider-agent "Disallow: /" negates ALL Star-agent permits!
At this point, we combine the Sphider-agent permits with the Star-agent permits into a single permit list.
Then we combine the Sphider-agent denies with the Star-agent denies into a single deny list.
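The four-step merge can be sketched like this (Python for illustration, not Sphider's code; the function name and set-based representation are my own assumptions, treating each category as a set of rule paths and conflicts as exact matches):

```python
# Illustrative merge of the four rule categories into a permit and a deny
# list, following the four conflict-resolution steps described above.

def merge_rules(sph_allow, sph_deny, star_allow, star_deny):
    # Step 1: exact Sphider-agent Allow/Disallow conflicts -> keep the Allow.
    sph_deny = sph_deny - sph_allow
    # Step 2: same for the Star-agent rules.
    star_deny = star_deny - star_allow
    # Step 3: a Sphider-agent Allow beats a Star-agent Disallow.
    #         Special case: Sphider-agent "Allow: /" wipes ALL Star denies.
    star_deny = set() if "/" in sph_allow else star_deny - sph_allow
    # Step 4: a Sphider-agent Disallow beats a Star-agent Allow.
    #         Special case: "Disallow: /" wipes ALL Star allows.
    star_allow = set() if "/" in sph_deny else star_allow - sph_deny
    # Combine into the final permit and deny lists.
    return sph_allow | star_allow, sph_deny | star_deny

permit, deny = merge_rules({"/docs/"}, {"/tmp/"}, {"/pub/"}, {"/docs/", "/x/"})
# permit is {"/docs/", "/pub/"}; deny is {"/tmp/", "/x/"} -- the Star-agent
# deny on "/docs/" was dropped because the Sphider-agent explicitly allows it.
```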
At that point we are finished editing the checkRobotsTxt() function, and focus shifts to Sphider itself.
As it stands now, when examining a URL, if the URL has a match in the deny list ($omit), it does not get indexed and we move on to the next URL.
The new procedure will be:
if (URL matches deny list) {
    if (URL matches permit list) {
        if (length of allow rule > length of deny rule) {
            INDEX
        } else {
            DO NOT INDEX
        }
    } else {
        DO NOT INDEX
    }
} else {
    INDEX
}
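The decision logic above can be sketched as follows (again in Python as an illustration; the function name and prefix-matching helper are assumptions, not Sphider's actual code):

```python
# Illustrative decision logic: prefix-match the URL path against the merged
# permit and deny lists, and let the longer matching rule decide.

def should_index(path, permit, deny):
    # Longest deny rule that is a prefix of the path, if any.
    longest_deny = max((r for r in deny if path.startswith(r)),
                       key=len, default=None)
    if longest_deny is None:
        return True                      # no deny matches: INDEX
    # Longest permit rule that is a prefix of the path, if any.
    longest_permit = max((r for r in permit if path.startswith(r)),
                         key=len, default=None)
    if longest_permit is None:
        return False                     # deny only: DO NOT INDEX
    # Both match: index only if the allow rule is longer (more specific).
    return len(longest_permit) > len(longest_deny)

print(should_index("/private/docs/readme.html",
                   permit={"/private/docs/"}, deny={"/private/"}))  # True
```

Note that this follows the pseudocode strictly: when the matching allow and deny rules have equal length, the URL is not indexed, since exact-match conflicts were already resolved in the earlier merge step.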