Search found 302 matches

by captquirk
Tue Sep 26, 2023 7:23 pm
Forum: Sphider Help
Topic: unicode indexing
Replies: 18
Views: 8291

Re: unicode indexing

Your database does not seem to be the issue. With web pages, it is possible for a Unicode character to appear correctly in a browser window, BUT the source code behind that page is not the Unicode character, but a replacement. For example: Browser displays: Д Source code is: &#1044; Another possible issue is ...
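
To see the difference between a literal character and a replacement in page source, here is a minimal Python sketch. It is not Sphider code, and it assumes the replacement is a decimal numeric character reference (one common form); the function names are made up for illustration.

```python
import html


def to_numeric_reference(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference such as &#1044;."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_numeric_reference(markup: str) -> str:
    """Decode character references back to the characters a browser would show."""
    return html.unescape(markup)


literal = "Д"                              # what the browser displays
reference = to_numeric_reference(literal)  # what the page source may contain
print(reference)
print(from_numeric_reference(reference) == literal)
```

An indexer that stores the page source without decoding would hold the reference text rather than the character, so a search for the literal character would not match it.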
by captquirk
Wed Sep 13, 2023 1:33 am
Forum: Sphider Help
Topic: An issue with PDF indexing.
Replies: 16
Views: 8550

Re: An issue with PDF indexing.

This issue has been resolved. It turns out that Sphider uses the PHP exec() function to run the pdftotext converter. Some Linux installations, for security reasons, block some functions such as exec(). A slight change to php.ini allowed exec() to function and PDFs were then indexed. As I said, the d...
by captquirk
Sun Sep 10, 2023 5:00 am
Forum: Sphider Help
Topic: Robots.txt - would allow support be useful for many
Replies: 6
Views: 4486

Re: Robots.txt - would allow support be useful for many

Check this: viewtopic.php?p=547#p547 While not a FINAL solution, user-agent: Sphider Allow: / should now allow access, disregarding all the disallows in user-agent: *. Any desired disallows need to be added to user-agent: Sphider, even if they are duplicates of some disallows in user-agent: *. Feedb...
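
The behavior described above (a user-agent: Sphider group, when present, is used instead of user-agent: *) can be sketched roughly as follows. This is a Python illustration under simplifying assumptions, not the actual checkRobotsTxt() code; parse_groups, rules_for and the EXAMPLE file are hypothetical.

```python
def parse_groups(robots_txt):
    """Return {lowercased user-agent: [(directive, path), ...]}."""
    groups = {}
    current_agents = []
    last_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()                 # field names are matched case-insensitively
        if field == "user-agent":
            if not last_was_agent:            # a new group starts here
                current_agents = []
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
            last_was_agent = True
        elif field in ("allow", "disallow"):
            for agent in current_agents:
                groups[agent].append((field, value))
            last_was_agent = False
        else:
            last_was_agent = False
    return groups


def rules_for(robots_txt, agent="sphider"):
    """Use the agent's own group if it exists; otherwise fall back to '*'."""
    groups = parse_groups(robots_txt)
    return groups.get(agent.lower(), groups.get("*", []))


EXAMPLE = """\
User-agent: *
Disallow: /private/
Disallow: /tmp/

User-agent: Sphider
Allow: /
Disallow: /tmp/
"""

print(rules_for(EXAMPLE))
```

In the example, /private/ is no longer blocked for Sphider because only the Sphider group is consulted, while /tmp/ stays blocked because it was repeated there.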
by captquirk
Sat Sep 09, 2023 10:53 pm
Forum: Sphider Help
Topic: Improvements to Sphider handling of robots.txt
Replies: 1
Views: 3337

Re: Improvements to Sphider handling of robots.txt

This is a partial solution for fixing the checkRobotsTxt() function. It reads the robots.txt file, considering both the * user-agent and the Sphider user-agent, both allows and disallows. It produces a master array of denys and allows which are compiled based on the rules mentioned in a previous post. Th...
by captquirk
Sat Sep 09, 2023 3:55 pm
Forum: Sphider Help
Topic: unicode indexing
Replies: 18
Views: 8291

Re: unicode indexing

It is possible for a particular database to have a character set and collation DIFFERENT than defaults. Before giving up completely, let's be sure that character set really is the issue. Both the "install.php" and manual "tables.sql" provided make every effort to be utf8mb4! Go t...
by captquirk
Fri Sep 08, 2023 5:44 pm
Forum: Sphider Help
Topic: Improvements to Sphider handling of robots.txt
Replies: 1
Views: 3337

Improvements to Sphider handling of robots.txt

The checkRobotsTxt() function in Sphider is deficient. It is not case sensitive, but that is a minor problem easily corrected. Of greater concern is the lack of support for the Allow directive. I have gathered some thoughts on what needs to be done and would appreciate any comment or suggestion a...
by captquirk
Fri Sep 08, 2023 5:33 pm
Forum: Sphider Help
Topic: How to 'safely' access pdftotext from the http process
Replies: 1
Views: 3285

Re: How to 'safely' access pdftotext from the http process

Yes, Sphider can index PDF files. However, the translation of PDF to text is not native to Sphider. That is done by a utility, pdftotext. This is pretty much a standard executable in Linux based systems, typically residing at /usr/bin/pdftotext. On a Windows system, pdftotext.exe is NOT present by d...
by captquirk
Fri Sep 08, 2023 5:17 pm
Forum: Sphider Help
Topic: unicode indexing
Replies: 18
Views: 8291

Re: unicode indexing

Yes, utf8mb4 is the correct encoding. As for collation, I tend to use utf8mb4_general_ci, BUT --- it really doesn't matter that much. Different collations may present different sorting, but the IMPORTANT thing is that it be utf8mb4!!! Why MySQL doesn't have full 4-byte UTF8 encoding as a default...
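
As a rough illustration of why the 4-byte point matters: MySQL's older utf8 character set (an alias for utf8mb3) stores at most 3 bytes per character, while some Unicode characters need 4 bytes in UTF-8. The Python sketch below only measures byte lengths; the helper names are hypothetical and nothing here touches a database.

```python
def max_utf8_bytes(text: str) -> int:
    """Largest number of bytes any single character of text needs in UTF-8."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)


def needs_utf8mb4(text: str) -> bool:
    """True if text contains a character that does not fit in 3 bytes."""
    return max_utf8_bytes(text) == 4


for sample in ("truck", "грузовик", "🚚"):
    print(sample, max_utf8_bytes(sample), needs_utf8mb4(sample))
```

Cyrillic text fits in 2 bytes per character, so it is the 4-byte characters (emoji and other supplementary-plane characters) that a 3-byte column cannot hold.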
by captquirk
Fri Sep 08, 2023 3:49 pm
Forum: Sphider Help
Topic: unicode indexing
Replies: 18
Views: 8291

Re: unicode indexing

Sphider is fully Unicode capable. Example: If you index a Russian web site, then want to search for "грузовик", are you actually searching for "грузовик", or some numeric Unicode equivalent? Also, check your database. Is the default character set utf8mb4, or just utf8? MySQL def...
by captquirk
Fri Sep 08, 2023 4:19 am
Forum: Sphider Help
Topic: Robots.txt - would allow support be useful for many
Replies: 6
Views: 4486

Re: Robots.txt - would allow support be useful for many

I tried your mod out and there are issues. Using the 5.3.0 checkRobotsTxt(): Disallowed files and directories in robots.txt: https://sphider.worldspaceflight.com/contact/ https://sphider.worldspaceflight.com/include/ https://sphider.worldspaceflight.com/download/ https://sphider.worldspaceflight.com/...