Thank you for your answers; they have helped me make big progress in using and understanding Sphider, an excellent tool with great potential.
More Questions/Remarks/Suggestions:
- Remark: the spider gets error 404 on URIs containing special characters (spaces, accented characters...). I mentioned this in a previous post, but it may have been overlooked.
Example of URL/URI causing problem:
https://www.example.com/facts-about-André-and-others.html
Example of entry in Web server log file (a.b.c.d is the IP):
www.example.com:443 a.b.c.d - - [02/Feb/2025:05:30:13 +0100] "GET /facts-about-Andr HTTP/1.1" 404 35991 "-" "Sphider (sphidersearch.com)"
So Sphider stops parsing the URI at the first unorthodox character it finds (here the "é"), and of course it then gets error 404 from the server.
It appears that Sphider follows the old-school way and accepts only alphanumeric characters (upper case, lower case) and the hyphen: [A-Za-z0-9-].
Modern browsers, bots, search bots and link checkers all tolerate such URLs, but since I am an old-school person too, I agree with the way Sphider does it.
So I corrected all the URLs with errors revealed by Sphider: thank you Sphider! If your URIs satisfy Sphider's criterion, they will satisfy any other application.
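For anyone who prefers to keep accented names rather than rename the files, the usual alternative is to percent-encode the path in the links. Here is a minimal sketch of that, assuming the page source is UTF-8; encode_path() is a hypothetical helper of mine, not something from Sphider, and I have not checked how Sphider itself treats percent-encoded links.

```php
<?php
// Hypothetical helper (not part of Sphider): percent-encode each segment of a
// URL path while keeping the "/" separators intact.
function encode_path(string $path): string
{
    // rawurlencode() leaves A-Z a-z 0-9 - _ . ~ untouched and encodes the rest.
    return implode('/', array_map('rawurlencode', explode('/', $path)));
}

// Assumes this file is saved as UTF-8, so "é" is the two bytes C3 A9.
echo encode_path('/facts-about-André-and-others.html'), PHP_EOL;
// /facts-about-Andr%C3%A9-and-others.html
```

Note that a path that is already percent-encoded would be encoded twice by this sketch, so it should only be applied to raw paths.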
- Suggestion: add a check for the availability of the exec() function to requirements_check.php.
On my hardened server the exec() function is disabled, so the tables backup produces a blank page without any information or message.
I had to look at the Web server access.log to discover that the admin.php request got error 500, so I edited admin.php to switch to error_reporting(E_ERROR | E_DEPRECATED); //Development only
But the problem remained, still without any message. So I had to edit all *.php files and set them to the development error level.
This time I got the right message in the PHP error log:
[05-Feb-2025 13:32:36 Europe/Paris] PHP Fatal error: Uncaught Error: Call to undefined function exec() in /home/XXX/admin/db_backup.php:166
So the fix is easy: enable the exec() function in php.ini (remove it from the disable_functions directive).
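To illustrate the kind of check I have in mind, here is a minimal sketch. The helper name function_is_usable() and the "NOT AVAILABLE" wording are my own assumptions, not existing Sphider code; only the "CHECK!" style is copied from the current output of requirements_check.php.

```php
<?php
// Hypothetical helper (not existing Sphider code): report whether a function
// can really be called. PHP 8 treats functions listed in disable_functions as
// undefined; older versions may still report them as existing, so both the
// function table and the ini directive are checked.
function function_is_usable(string $name, ?string $disabledList = null): bool
{
    $disabledList = $disabledList ?? (string) ini_get('disable_functions');
    $disabled = array_filter(array_map('trim', explode(',', strtolower($disabledList))));

    return function_exists($name) && !in_array(strtolower($name), $disabled, true);
}

echo 'Exec - ', function_is_usable('exec') ? 'CHECK!' : 'NOT AVAILABLE', PHP_EOL;
```

The second parameter exists only so that the logic can be tried with an arbitrary list; in normal use it is left out and the value comes from php.ini.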
This problem leads back to my suggestion (not approved) to use the development setting error_reporting(E_ERROR | E_DEPRECATED) as the default, since:
- E_ERROR and E_DEPRECATED should never be hit by users, thanks to your non-regression tests before each release. So in practice it is equivalent to error_reporting(0).
- In case of "should not occur" or "unpredictable" errors like the one above, the plain user is helpless, because error_reporting(0) produces no message at all.
With the development setting, at least he/she can copy/paste the error message and ask for help. The plain user would not even be aware that there is a server error 500, since he/she would not know about or look at the Web server access.log. Without the message (hidden by error_reporting(0)), it is very difficult to investigate and help the user, who would also have to reproduce the problem to collect the messages when guided: a waste of time.
The annoying manual editing of all *.php files to the development level (and resetting them later) leads me to the next suggestion:
- Suggestion: use a single error_reporting.php, included (require_once) by all other *.php files, to make it easy to change the error_reporting level.
Note: this would break the granularity of the error_reporting() setting if the intention was to have a different level for each PHP file.
Content for error_reporting.php:
<?php
//error_reporting(E_ERROR | E_DEPRECATED); // Development only
error_reporting(0); // Production
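A slightly more convenient variant would use one switch instead of commenting lines in and out. This is only a sketch: the constant SPHIDER_DEBUG and the function sphider_error_level() are names I made up for illustration.

```php
<?php
// Hypothetical error_reporting.php with a single switch.
// SPHIDER_DEBUG and sphider_error_level() are illustrative names only.
const SPHIDER_DEBUG = false; // set to true while investigating a problem

function sphider_error_level(bool $debug): int
{
    // Development: fatal errors and deprecations. Production: nothing.
    return $debug ? (E_ERROR | E_DEPRECATED) : 0;
}

error_reporting(sphider_error_level(SPHIDER_DEBUG));
```

Each script would then start with require_once __DIR__ . '/error_reporting.php'; (adjusting the path for files in subdirectories such as /admin).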
- Tip: if the site uses a Content Security Policy (CSP), verify that your CSP directives allow the JavaScript (script-src) and styles (style-src) used by Sphider:
script-src 'self' https://ajax.googleapis.com
and
style-src 'self' 'unsafe-inline'
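If the CSP is sent from PHP rather than from the Web server configuration, the two directives above could be assembled as in the following sketch. build_csp() is a hypothetical helper, and sending the header from PHP is an assumption about the setup; on my server the policy is set elsewhere.

```php
<?php
// Hypothetical helper: join CSP directives into a single header value.
function build_csp(array $directives): string
{
    $parts = [];
    foreach ($directives as $name => $sources) {
        $parts[] = $name . ' ' . implode(' ', $sources);
    }
    return implode('; ', $parts);
}

$csp = build_csp([
    'script-src' => ["'self'", 'https://ajax.googleapis.com'],
    'style-src'  => ["'self'", "'unsafe-inline'"],
]);

// Only send the header in Web mode, and only if no output has started yet.
if (PHP_SAPI !== 'cli' && !headers_sent()) {
    header('Content-Security-Policy: ' . $csp);
}
```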
Note: my site has no JavaScript and only external stylesheets (.css), so I have a very strict CSP.
But in order to run Sphider, I now have to relax my CSP. This raises my next question:
- Question: From Sphider main page:
"Sphider is a lightweight web spider and search engine written in PHP. It can be implemented using a MySQL (or MariaDB) database using the MySQLi and MySQLnd PHP components. PHP 8.x is supported.".
There is no mention of JavaScript, and it seems that JavaScript is indeed not mandatory, at the cost of losing the spelling suggestion/autocompletion capability ("Did you mean...").
Am I correct? Is it easy to get rid of JavaScript? How? Just by removing the <script> tags? Or is there some interaction between JavaScript and PHP to adapt (in /js_suggest/)?
I found some answers by myself through practice; please correct me if I am wrong.
Without JavaScript:
- admin.php: no more /admin/dbmain.js, which is used in the Database tab of the admin panel. So no more tables backup/restore, and no more check boxes for table selection.
This is not a problem if you can use either phpMyAdmin or the command line to invoke mysqldump.
- search.php: no more jquery.min.js from https://ajax.googleapis.com, /calendar/calendar.js, or /js_suggest/autocomplete.js. The search still works, but without suggestion/autocompletion.
- Remark: about indexing from the command line:
I just have to be careful to also run requirements_check.php from the command line, since there is a different php.ini for command line mode (cli) and Web mode (fpm).
/etc/php/8.2/cli/php.ini
/etc/php/8.2/fpm/php.ini
php requirements_check.php
Fsockopen - CHECK!<br>Mysqlnd - CHECK!<br>PHP 7 or greater - CHECK!<br>Curl - CHECK!<br>Iconv - CHECK!<br>Mbstring - CHECK!<br>Imagick - CHECK!<br>(Imagick is not needed for Sphiderlite.)<br><br><strong>Congratulations! You can use either Sphider or Sphiderlite.</strong><br>
Big advantage for me: during a lengthy indexing run, I can play with search.php and with admin.php (to change some settings),
or even follow the progress of the indexing (the "Sites" tab shows the number of links and keywords), statistics on tables, etc.
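To confirm which php.ini is actually in effect in each mode, a tiny script like the following sketch can be run once with "php" on the command line and once through the Web server. describe_php_environment() is a hypothetical name; the PHP functions it calls are standard.

```php
<?php
// Hypothetical helper: summarise the settings that differ between cli and fpm.
function describe_php_environment(): array
{
    return [
        'sapi'              => PHP_SAPI,                             // e.g. "cli" or "fpm-fcgi"
        'ini_file'          => php_ini_loaded_file() ?: '(none)',    // php.ini actually loaded
        'disable_functions' => (string) ini_get('disable_functions'),
    ];
}

foreach (describe_php_environment() as $key => $value) {
    echo $key, ': ', $value, PHP_EOL;
}
```

Since the output may reveal server details, such a script should not stay in a publicly reachable directory.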
- Remark: always read the documentation (RTFM)
The SphiderUserGuide.pdf contains this interesting information, which solves one of the problems I raised in my earlier post:
"Ignoring parts of a page
Sphider includes an option to exclude parts of pages from being indexed. This can, for example,
be used to prevent search result flooding when certain keywords appear on certain part in most
pages (like a header, footer or a menu). Any part of a page between
<!--sphider_noindex--> and <!--/sphider_noindex--> tags is not indexed, however links in it are followed."
I discovered this, and in the meantime I saw that you had already answered my question in post
viewtopic.php?p=682#p682 . Thank you!
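To make the effect of these tags concrete, here is a small sketch that mimics the text exclusion described in the guide. It is only my illustration of the documented behaviour, not Sphider's actual implementation, and it does not model the link following; strip_noindex_sections() and the sample page are hypothetical.

```php
<?php
// Illustration only (not Sphider's code): drop everything between the
// sphider_noindex comment tags before the text would be indexed.
function strip_noindex_sections(string $html): string
{
    $pattern = '/<!--sphider_noindex-->.*?<!--\/sphider_noindex-->/s';
    return preg_replace($pattern, '', $html) ?? $html;
}

$page = <<<HTML
<!--sphider_noindex-->
<nav><a href="/about.html">Menu: About</a></nav>
<!--/sphider_noindex-->
<p>Article text that should be indexed.</p>
HTML;

echo strip_noindex_sections($page), PHP_EOL;
```

In this sample the menu wording disappears from the text to be indexed while the article paragraph remains.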
Other interesting information about the search capabilities mentioned in the documentation:
Wildcard search (*)
"-" to exclude a word from the search
AND/OR/Phrase search
- Remark: SphiderUserGuide.pdf, page 5/52: minor formatting error. Extra bullet introduced.
(bullet) Supports excluding words (by putting a '-' in front of a word, any page including that word
(bullet) will be omitted from the results).
Should be:
(bullet) Supports excluding words (by putting a '-' in front of a word, any page including that word will be omitted from the results).
- Remark: the mobile CSS might not be fully responsive when the browser window is narrowed dynamically on desktop (see screenshots).

[Screenshots: Sphider_2025-02-04_215123.jpg, Sphider_2025-02-04_215248.jpg]
- Remark about security:
I would like to share some security tools and practices I use.
I recommend the excellent and free "ConfigServer Security and Firewall" (csf/lfd), which in my experience is much better than fail2ban and similar tools.
https://configserver.com/configserver-s ... -firewall/
/admin and /settings are protected by .htaccess, but with csf/lfd, after more than 3 failed login attempts (the number is customizable) the IP is blocked
and a notification is sent to you by e-mail.
To complement this, I always insert a small PHP code snippet (as a front-end to admin.php, for example) that sends an e-mail on the first successful login (one e-mail per day per IP).
The same code snippet is added, for example, to the phpMyAdmin entry point. So although my phpMyAdmin is also protected by .htaccess and csf/lfd notifies me of failed attempts, I also receive an e-mail for the first successful login per day to phpMyAdmin.
The code snippet is really simple: it takes the IP of the visitor and searches a text file to see whether this IP is already logged for the day. If yes, there is nothing to do.
If no, it appends the IP and the current date (YYYY-MM-DD) to the text file, and sends an e-mail to the webmaster/technician.
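Here is a minimal sketch of such a snippet, following the steps described above. All names (record_first_login(), notify_first_login(), the log file path and the recipient address) are hypothetical placeholders, and it assumes PHP's mail() function is configured on the server.

```php
<?php
// Hypothetical sketch: returns true only the first time an IP is seen on a
// given day, and records it in a plain text file ("YYYY-MM-DD IP" per line).
function record_first_login(string $ip, string $logFile, string $today): bool
{
    $entry = $today . ' ' . $ip;
    $lines = is_file($logFile)
        ? (file($logFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) ?: [])
        : [];

    if (in_array($entry, $lines, true)) {
        return false; // already logged today: nothing to do
    }
    file_put_contents($logFile, $entry . "\n", FILE_APPEND | LOCK_EX);
    return true;
}

// Hypothetical wrapper: to be called at the top of the protected entry point.
function notify_first_login(string $logFile, string $mailTo): void
{
    $ip = $_SERVER['REMOTE_ADDR'] ?? 'unknown';
    if (record_first_login($ip, $logFile, date('Y-m-d'))) {
        mail($mailTo, 'First successful login today', "IP: $ip\nDate: " . date('c'));
    }
}
```

Usage would be a single call such as notify_first_login('/path/outside/webroot/logins.txt', 'webmaster@example.com'); where both arguments are placeholders. The log file should live outside the Web root so that it cannot be downloaded.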
- Remark: Sphider front-end/back-end
Thanks to your answer, I think I now understand that admin.php and spider.php do not need to run on the server being indexed.
This is great, since I can then run admin.php/spider.php on, for example, a virtual machine, without the restrictions of the production target server,
which could be hardened to the point of not satisfying all the checks of requirements_check.php (fsockopen, allow_url_fopen, exec...).
Afterwards, it is just a matter of backing up all tables and importing them on the target server, and uploading all files and directories (except /admin) to the target server.
Note: we still need to protect /settings (with .htaccess) since it contains sensitive files: my.cnf and database.php.
This also leads to my next 2 suggestions:
- Suggestion: get rid of my.cnf (its content is redundant with database.php) so that the user no longer needs to fill it in.
Sphider could build the parameters on the fly from database.php and provide them when executing either the "mysql" or the "mysqldump" command line.
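One possible way to do this without putting the password on the command line is to generate a temporary option file and pass it with the mysqldump option --defaults-extra-file. The sketch below is only an idea: the helper names and the sample values are hypothetical, I do not know the variable names used in database.php, and it assumes the values contain no quotes or backslashes.

```php
<?php
// Hypothetical helper: option file content built from the connection
// parameters. Assumes the values contain no quotes or backslashes.
function build_client_cnf(string $host, string $user, string $password): string
{
    return "[client]\nhost=\"$host\"\nuser=\"$user\"\npassword=\"$password\"\n";
}

// Hypothetical helper: the command line. --defaults-extra-file has to be the
// first option given to mysqldump.
function build_dump_command(string $cnfPath, string $database): string
{
    return 'mysqldump --defaults-extra-file=' . escapeshellarg($cnfPath)
        . ' ' . escapeshellarg($database);
}

// Placeholder values; in Sphider they would come from database.php.
$cnfPath = tempnam(sys_get_temp_dir(), 'cnf'); // created readable by owner only
file_put_contents($cnfPath, build_client_cnf('localhost', 'sphider_user', 'secret'));
echo build_dump_command($cnfPath, 'sphider_db'), PHP_EOL;
unlink($cnfPath); // in real use: delete only after the command has run
```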
- Suggestion: to access the database, search.php could use another file, for example /settings/search_database.php, with a different MySQL user and password. This user would have only the strictly needed permissions (for example read only, no create database or table, no drop table...), in any case far fewer permissions than the user declared in /settings/database.php, and only /settings/search_database.php would be uploaded. This way the security would be a little better.
- Question: I would like to embed Sphider's search.php inside an existing HTML page. The provided templates (CSS and code) would be a good start for this.
It seems easier to start from the source code as seen by a browser (instead of starting from the search.php source code), then
take just the <form> code and adapt it to the existing look and feel (CSS), without using the provided CSS. Then, after a click on the Search button,
progressively adapt the output to the existing look and feel. Is this the right way to do it?