Character corruption

Post by **captquirk** » Tue May 07, 2019 5:42 pm

Sphider operates using the UTF-8 character set. When a web page is loaded for indexing, Sphider checks the HTTP headers to determine the character set in use. If it is not UTF-8, the data is converted to UTF-8.
In its zeal to ensure everything is UTF-8, Sphider then performs a SECOND check. This check is redundant, and normally it is harmless. BUT there are a small number of UTF-8 characters whose bytes can be misinterpreted as ISO-8859-1/Windows-1252.
The most egregious example is the e-acute (é). When the second check misfires, Sphider tries to convert text that is already UTF-8 to UTF-8 again, as if it were ISO-8859-1/Windows-1252. The "é" becomes "Ã©"!
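
For anyone who wants to see the corruption happen outside of Sphider, here is a minimal sketch in Python (not Sphider's own PHP code). The function names double_encode and is_valid_utf8 are made up for this illustration. It only reproduces the effect described above: UTF-8 bytes read as ISO-8859-1 and then treated as text again.

```python
def double_encode(text: str, wrong_charset: str = "iso-8859-1") -> str:
    """Simulate the redundant conversion: encode to UTF-8, then
    misread those bytes as a single-byte charset."""
    utf8_bytes = text.encode("utf-8")        # "é" -> b'\xc3\xa9'
    return utf8_bytes.decode(wrong_charset)  # each byte becomes one character


def is_valid_utf8(data: bytes) -> bool:
    """Strict validity test: True only if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(double_encode("é"))                 # Ã©
print(is_valid_utf8("é".encode("utf-8")))  # True: already UTF-8, no conversion needed
print(is_valid_utf8(b"\xe9"))              # False: a lone ISO-8859-1 "é" byte
```

The two bytes of the UTF-8 "é" (C3 A9) happen to be the ISO-8859-1 codes for "Ã" and "©", which is exactly the garbage that shows up in the index.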

To fix this in Sphider 2.4.1, 2.4.1-PDO, and 3.0.0-MB, three files need to be edited to comment out these redundant conversions.
Placing a double slash (//) at the beginning of each line listed below will comment it out. What these spots have in common is a call to the function "isUtf8", so you can recognize the lines.
In Sphider 2.4.1: spider.php, lines 484, 486, 487, and 1553; spiderfuncs.php, line 641; search_results.php, line 97.
In Sphider 2.4.1-PDO: spider.php, lines 483, 485, 488, and 1598; spiderfuncs.php, line 638; search_results.php, line 97.
In Sphider 3.0.0-MB: spider.php, lines 485, 487, 488, and 1555; spiderfuncs.php, line 641; search_results.php, line 97.
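
If you would rather not edit by hand, the same change can be scripted. This is only a rough sketch in Python, and comment_out_lines is a hypothetical helper, not part of Sphider. It assumes the line numbers above match your copy of the file exactly, so check the returned text for the "isUtf8" call before trusting the result. It works on raw bytes so the file's own encoding and line endings are left alone, and it keeps a .bak copy of the untouched file.

```python
import shutil
from pathlib import Path


def comment_out_lines(path, line_numbers):
    """Prefix the given 1-based lines of a file with // and return the
    original text of each line that was changed, keyed by line number."""
    source = Path(path)
    backup = source.with_name(source.name + ".bak")
    if not backup.exists():
        shutil.copy2(source, backup)  # keep the untouched original

    lines = source.read_bytes().splitlines(keepends=True)
    originals = {}
    for number in sorted(set(line_numbers)):
        if not 1 <= number <= len(lines):
            raise ValueError(f"{source} has no line {number}")
        line = lines[number - 1]
        if line.lstrip().startswith(b"//"):
            continue  # already commented out, leave it alone
        originals[number] = line.decode("latin-1").rstrip("\r\n")
        lines[number - 1] = b"//" + line

    # Nothing is written unless every requested line number was valid.
    source.write_bytes(b"".join(lines))
    return originals
```

For Sphider 2.4.1 that would be called as comment_out_lines("spider.php", [484, 486, 487, 1553]), then once more each for spiderfuncs.php and search_results.php with the line numbers listed above. If the returned lines do not look like the isUtf8 check and its conversion, restore the .bak file and edit by hand instead.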

The issue will be fixed in the next release of each of these versions.