Always problems with accented characters

Equinoxe58
Posts: 14
Joined: Fri Oct 06, 2023 8:02 pm

Always problems with accented characters

Post by Equinoxe58 »

Good morning,
I have a problem with accented characters in the search results on my remote site with Sphider 5.5.
The problem does not occur locally with WampServer.
Below is an example using the same search on both.
Note that I spidered my remote site from my local installation (the remote site has a timeout problem), and once the crawl finished, I imported the data into the remote site. I hope that makes sense.
Thank you in advance for your help.
Capture d'écran 2024-02-19 172607.png (21.6 KiB): search results on the remote site
Capture d'écran 2024-02-19 172538.png (21.03 KiB): search results on the local site

The tables are strictly identical, since one is an export/import of the other.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Always problems with accented characters

Post by captquirk »

I understand the issue. The database is properly formatted on your local machine. Unfortunately, during the export/import operation, Unicode characters get corrupted. That is why your remote site gives improper results.

I have yet to find a definitive solution to this issue. This has happened to me also.

What I have found does help is making sure the local and remote versions of MySQL (or MariaDB) match. Also, every charset option needs to be utf8mb4. Latin1 seems to keep popping up, and this can lead to corruption on import.

Also, running the indexing on the remote server itself ensures correct database encoding. If this proves impossible due to frequent stalls (500 errors), try indexing from the command prompt if you have such access. I know this is an extra step, but correct data on the remote site is important, as that is what your viewers will be seeing.

UPDATE: Here are three links which may help with your problem.
https://stackoverflow.com/questions/524 ... a-sql-file
You are getting what is called "mojibake".
https://en.wikipedia.org/wiki/Mojibake
There is a fix, depending on the exact cause. The issue may be either double-encoding or Latin1 got involved somehow. The fix for each is different but described here.
https://mysql.rjweb.org/doc.php/charcol ... ious_cases
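To make the mojibake idea concrete, here is a minimal Python sketch (not Sphider or MySQL code) of what happens when UTF-8 bytes are read as cp1252, which is roughly what MySQL calls latin1. The function names are hypothetical, and the sample word is taken from this thread. It assumes the text was garbled by exactly this kind of mis-decoding; other causes need a different repair.

```python
# Sketch only: simulate mojibake and its reversal with Python's codecs.
# to_mojibake / repair_mojibake are hypothetical helper names.

def to_mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode those bytes as cp1252."""
    return text.encode("utf-8").decode("cp1252")

def repair_mojibake(garbled: str) -> str:
    """Undo one round of mis-decoding (raises if the text was not garbled this way)."""
    return garbled.encode("cp1252").decode("utf-8")

original = "Nièvre"
once = to_mojibake(original)    # single mis-decoding
twice = to_mojibake(once)       # double-encoding: the same mistake applied again

print(once)                     # NiÃ¨vre
print(repair_mojibake(once))    # Nièvre
print(repair_mojibake(twice) == once)  # True: each repair removes one layer
```

If the search results look like the first printed line, one layer of mis-decoding is a likely suspect. If they look even longer and stranger, it may be double-encoding, which needs the repair applied twice.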

Hope this helps.
Equinoxe58
Posts: 14
Joined: Fri Oct 06, 2023 8:02 pm

Re: Always problems with accented characters

Post by Equinoxe58 »

Good morning,
I did some tests and comparisons on the accent problem in French (among other languages). I have 2 laptops, one old and one new.
I ran 2 crawl tests against my remote site with the same settings.
With the old PC, the entries in my database appear with garbled characters. Example in the description field: Ville de La Machine (Nièvre). Accueil du site, numéros d'urgence, informations locales, résultats sportifs de la ville, petites histoires machinoises
But in the search results, those characters display normally, with their accents.

With the new PC, on the other hand, it is the opposite: the strange characters do not appear in the database, but they do appear in the search results.

I did find one difference between my 2 computers in their MySQL setups:

On the new PC:
Server: MySQL (127.0.0.1 via TCP/IP)
Server type: MySQL
Server connection: SSL is not being used
Server version: 8.2.0 - MySQL Community Server - GPL
Protocol version: 10
User: root@localhost
Server charset: UTF-8 Unicode (utf8mb4)

On the old PC:
Server: MySQL (127.0.0.1 via TCP/IP)
Server type: MySQL
Server connection: SSL is not being used
Server version: 5.7.36 - MySQL Community Server (GPL)
Protocol version: 10
User: root@localhost
Server charset: cp1252 West European (latin1)

On the remote server:
Server: Localhost via UNIX socket
Server type: MySQL
Server version: 5.6.33-0ubuntu0.14.04.1 - (Ubuntu)
Protocol version: 10
User: ville-la-machine@localhost
Server charset: UTF-8 Unicode (utf8)

So in the end, the search results are correct when the character set of the database is "cp1252 West European (latin1)".
However, I can't change this value on the remote site.
Do you have any ideas?
Thanks in advance.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Always problems with accented characters

Post by captquirk »

The server character set may or may not be the database character set!

If you go into MySQL and run:
SELECT * FROM INFORMATION_SCHEMA.SCHEMATA;
you will see what the database character set is. The output should show all the databases (there are several MySQL-specific databases), the character set for each, and the collation used in each.

It is important that the sphider database character set be "utf8mb4". If the character set is shown simply as "utf8", then it is actually "utf8mb3"! MySQL has historically used "utf8" as an alias for "utf8mb3", and this has been the default. The latest versions are supposedly changing the alias for "utf8" to "utf8mb4", but I am unsure if that is in effect yet. Besides, your newest PC has the newest MySQL and it is already "utf8mb4", or at least the server character set is. You will need to confirm it is also the same for the database.
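As a rough illustration of the difference between the two character sets, the Python sketch below counts how many UTF-8 bytes each character needs. The helper names are hypothetical. It relies only on the fact that utf8mb3 stores at most 3 bytes per character, while utf8mb4 stores up to 4.

```python
# Sketch only: which strings would fit in a utf8mb3 column?
# max_utf8_bytes / fits_utf8mb3 are hypothetical helper names.

def max_utf8_bytes(text: str) -> int:
    """Largest number of UTF-8 bytes used by any single character in text."""
    return max((len(ch.encode("utf-8")) for ch in text), default=0)

def fits_utf8mb3(text: str) -> bool:
    """True if no character needs more than 3 bytes."""
    return max_utf8_bytes(text) <= 3

for sample in ["Nièvre", "€", "😀"]:
    print(sample, max_utf8_bytes(sample), fits_utf8mb3(sample))
# Nièvre 2 True
# € 3 True
# 😀 4 False
```

Accented French letters need only 2 bytes, so they fit in either character set. The 4-byte limit matters for characters such as emoji. The garbling seen in this thread therefore more likely comes from bytes being read in the wrong charset somewhere along the way than from the utf8mb3 limit itself.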

Once we know the actual database character sets for all three setups, we can proceed.

Does the web server only contain the site itself, with the indexing done from the two PCs, or is there also an index on the server itself?
Equinoxe58
Posts: 14
Joined: Fri Oct 06, 2023 8:02 pm

Re: Always problems with accented characters

Post by Equinoxe58 »

Good morning,
I finally found the solution to my problem with accented characters (in French, among other languages).
I tried your solution of switching all my databases to utf8mb4.
It worked locally, but when I exported the database to my remote site, it no longer worked. The accents were missing again.
In the end, the solution was very simple (even though I searched for a long time). I exported my file (the database dump) in iso8859-1 instead of utf8 and it worked. The accents reappeared.
I hope this can help some people who have encountered this problem.
Thank you again for your valuable help.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Always problems with accented characters

Post by captquirk »

Having it work by exporting as iso8859-1 sounds rather weird, BUT ---
If the import is also iso8859-1 (and you are unable to change it), then it does make sense in a weird way. Exporting as utf8 is the correct way, but if the import is iso8859-1, there WILL be corruption. An iso8859-1 export of a utf8mb4 database will be gibberish, but the iso8859-1 import accepts it as the same gibberish! The utf8mb4 database, however, sees it correctly.
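One way to picture this is the small Python simulation below. It is only a sketch: it assumes the remote import reads the dump file as iso8859-1 (something this thread suggests but does not confirm), and the function names are hypothetical stand-ins for the export and import steps.

```python
# Sketch only: an export/import round trip where the import side is fixed.
# export_dump / import_dump are hypothetical stand-ins, not MySQL calls.

def export_dump(text: str, file_charset: str) -> bytes:
    """Write the text into a dump file using the chosen charset."""
    return text.encode(file_charset)

def import_dump(dump: bytes, connection_charset: str) -> str:
    """The importing side decodes the file bytes with its own charset."""
    return dump.decode(connection_charset)

REMOTE_CHARSET = "iso-8859-1"  # assumption: the remote import cannot be changed
text = "Nièvre"

print(import_dump(export_dump(text, "utf-8"), REMOTE_CHARSET))       # NiÃ¨vre
print(import_dump(export_dump(text, "iso-8859-1"), REMOTE_CHARSET))  # Nièvre
```

Under that assumption, the result depends on the export and import charsets agreeing, not on which charset is "correct" on its own. Where the import side can be controlled, keeping both ends on utf8mb4 remains the cleaner option.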

Yes, thinking about it makes me want to bang my head on the wall, but you can't argue with results! LOL!