Search found 309 matches
- Sat Sep 09, 2023 10:53 pm
- Forum: Sphider Help
- Topic: Improvements to Sphider handling of robots.txt
- Replies: 1
- Views: 6219
Re: Improvements to Sphider handling of robots.txt
This is a partial solution for fixing the checkRobotsTxt() function. It reads the robots.txt file, considering both the * user-agent and the Sphider user-agent, and both allows and disallows. It produces a master array of denys and allows, compiled according to the rules mentioned in a previous post. Th...
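The approach described above (gather allows and disallows from both the * group and the Sphider group into master lists) can be sketched roughly as follows. This is only an illustration in Python, not the actual checkRobotsTxt() code, which is PHP. The function names parse_robots() and is_allowed() are made up for the example, and the precedence rule (longest matching path wins, a tie goes to Allow) is an assumption, since the rules from the previous post are not shown here.

```python
def parse_robots(text, agent="sphider"):
    """Collect Allow/Disallow paths from groups that name `agent` or '*'."""
    allows, denys = [], []
    applies = False     # does the current group apply to us?
    in_agents = False   # still reading consecutive User-agent lines?
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()                 # directive names: any case
        if field == "user-agent":
            if not in_agents:                 # a new group starts here
                applies = False
                in_agents = True
            name = value.lower()
            if name == "*" or agent in name:
                applies = True
        else:
            in_agents = False
            if applies and value:
                if field == "allow":
                    allows.append(value)
                elif field == "disallow":
                    denys.append(value)
    return allows, denys


def is_allowed(path, allows, denys):
    """Assumed rule: longest matching prefix wins; a tie goes to Allow."""
    best_allow = max((len(p) for p in allows if path.startswith(p)), default=-1)
    best_deny = max((len(p) for p in denys if path.startswith(p)), default=-1)
    return best_allow >= best_deny
```

With a "Disallow: /private/" and an "Allow: /private/public/" in the applicable groups, a page under /private/public/ would be crawlable while the rest of /private/ would not.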
- Sat Sep 09, 2023 3:55 pm
- Forum: Sphider Help
- Topic: unicode indexing
- Replies: 18
- Views: 12963
Re: unicode indexing
It is possible for a particular database to have a character set and collation DIFFERENT from the defaults. Before giving up completely, let's be sure that character set really is the issue. Both the "install.php" and manual "tables.sql" provided make every effort to be utf8mb4! Go t...
- Fri Sep 08, 2023 5:44 pm
- Forum: Sphider Help
- Topic: Improvements to Sphider handling of robots.txt
- Replies: 1
- Views: 6219
Improvements to Sphider handling of robots.txt
The checkRobotsTxt() function in Sphider is deficient. It is not case sensitive, but that is a minor problem easily corrected. Of greater concern is the lack of support for the Allow directive. I have gathered some thoughts on what needs to be done and would appreciate any comment or suggestion a...
- Fri Sep 08, 2023 5:33 pm
- Forum: Sphider Help
- Topic: How to 'safely' access pdftotext from the http process
- Replies: 1
- Views: 6128
Re: How to 'safely' access pdftotext from the http process
Yes, Sphider can index PDF files. However, the translation of PDF to text is not native to Sphider. That is done by a utility, pdftotext. This is pretty much a standard executable in Linux based systems, typically residing at /usr/bin/pdftotext. On a Windows system, pdftotext.exe is NOT present by d...
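To illustrate the idea of handing the PDF to an external pdftotext utility, here is a rough Python sketch (Sphider itself is PHP, so this is not its code). The helper name pdf_to_text() and the 30 second timeout are assumptions for the example. It looks the executable up on the PATH rather than assuming /usr/bin/pdftotext, and passes the arguments as a list so no shell is involved. The "-" output argument tells pdftotext to write the text to stdout.

```python
import shutil
import subprocess


def pdf_to_text(pdf_path, exe="pdftotext", timeout=30):
    """Convert a PDF to text by calling the external pdftotext utility."""
    found = shutil.which(exe)          # locate the executable on the PATH
    if found is None:
        raise FileNotFoundError(f"{exe} not found on PATH")
    result = subprocess.run(
        [found, pdf_path, "-"],        # argument list, no shell; "-" = stdout
        capture_output=True,
        timeout=timeout,               # don't let a bad PDF hang the caller
        check=True,                    # raise if pdftotext reports an error
    )
    return result.stdout.decode("utf-8", errors="replace")
```

If the utility is missing (the default situation on Windows), the call fails up front with a clear error instead of silently indexing nothing.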
- Fri Sep 08, 2023 5:17 pm
- Forum: Sphider Help
- Topic: unicode indexing
- Replies: 18
- Views: 12963
Re: unicode indexing
Yes, utf8mb4 is the correct encoding. As for collation, I tend to use utf8mb4_general_ci, BUT it really doesn't matter that much. Different collations may present different sorting, but the IMPORTANT thing is that it be utf8mb4!!! Why MySQL doesn't have full 4-byte UTF-8 encoding as a default...
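For anyone wondering what "full 4 byte" means in practice: MySQL's older utf8 character set (utf8mb3) stores at most 3 bytes per character, while some characters need 4 bytes in UTF-8. The small Python sketch below just measures that. The function names are made up for the example and it does not touch the database at all.

```python
def max_utf8_bytes(text):
    """Largest number of bytes any single character of `text` needs in UTF-8."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def needs_utf8mb4(text):
    """True if `text` has a character that 3-byte utf8 (utf8mb3) cannot hold."""
    return max_utf8_bytes(text) > 3
```

Running needs_utf8mb4() over a sample of the text being indexed gives a quick hint as to whether 4-byte characters are involved.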
- Fri Sep 08, 2023 3:49 pm
- Forum: Sphider Help
- Topic: unicode indexing
- Replies: 18
- Views: 12963
Re: unicode indexing
Sphider is fully Unicode capable. Example: If you index a Russian web site, then want to search for "грузовик", are you actually searching for "грузовик", or some numeric Unicode equivalent? Also, check your database. Is the default character set utf8mb4, or just utf8? MySQL def...
- Fri Sep 08, 2023 4:19 am
- Forum: Sphider Help
- Topic: Robots.txt - would allow support be useful for many
- Replies: 6
- Views: 7359
Re: Robots.txt - would allow support be useful for many
I tried your mod out and there are issues. Using the 5.3.0 checkRobotsTxt(): Disallowed files and directories in robots.txt: https://sphider.worldspaceflight.com/contact/ https://sphider.worldspaceflight.com/include/ https://sphider.worldspaceflight.com/download/ https://sphider.worldspaceflight.com/...
- Fri Sep 08, 2023 12:09 am
- Forum: Sphider Help
- Topic: Robots.txt - would allow support be useful for many
- Replies: 6
- Views: 7359
Re: Robots.txt - would allow support be useful for many
You are correct. Looking closer, you can see the existing function ONLY looks for "disallows" and not "allows". There is definitely room for improvement here. Also, thanks for your proposed improvements. I will look it over and may incorporate it, or some version of it, in a future (nex...
- Thu Sep 07, 2023 3:45 pm
- Forum: Sphider Help
- Topic: An issue with PDF indexing.
- Replies: 16
- Views: 13067
Re: An issue with PDF indexing.
This is crazy. I tried the URL and got that page, and NO LINKS FOUND! By email, let's see the full log (looks like it started but then hung...?), the setting screen, and the site advanced edit screen. Also see if there is anything relevant in the error logs. (As shipped, Sphider has logging turned on...
- Thu Sep 07, 2023 2:43 am
- Forum: Sphider Help
- Topic: Robots.txt - would allow support be useful for many
- Replies: 6
- Views: 7359
Re: Robots.txt - would allow support be useful for many
First off, Sphider will index https sites. It follows robots.txt. To allow Sphider, but disallow all other bots, try something like this: User-agent: * Disallow: / User-agent: Sphider (sphidersearch.com) Allow: / You could also go into the Settings tab and change the User Agent string to something y...
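As a quick sanity check of the robots.txt rules quoted above, here they are laid out one directive per line and run through the parser from Python's standard library. Note that urllib.robotparser is only a stand-in here: it is not Sphider's checkRobotsTxt(), and the two may not agree in every case. The can_fetch() wrapper and the example.com URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Sphider (sphidersearch.com)
Allow: /
"""


def can_fetch(agent, url, robots_txt=ROBOTS_TXT):
    """Ask the stdlib parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "https://example.com/page.html"
    print(can_fetch("Sphider (sphidersearch.com)", page))
    print(can_fetch("SomeOtherBot", page))
```

With this parser, the Sphider user agent is let in and any other bot falls through to the * group and is refused.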