unicode indexing

Posted: Fri Sep 08, 2023 11:35 am
by denver
I have installed Sphider 5.3.0. Everything is fine, but the search does not return words in non-Latin characters. What's the problem?

Re: unicode indexing

Posted: Fri Sep 08, 2023 3:49 pm
by captquirk
Sphider is fully Unicode capable.
Example: If you index a Russian web site, then want to search for "грузовик", are you actually searching for "грузовик", or some numeric Unicode equivalent?

Also, check your database. Is the default character set utf8mb4, or just utf8? In MySQL, "utf8" is an alias for utf8mb3, which stores at most 3 bytes per character, while utf8mb4 is the full 4-byte UTF-8.
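To make the 3-byte versus 4-byte distinction concrete, here is a minimal Python sketch that counts how many bytes a character needs in UTF-8. Python is used purely for illustration and is not part of Sphider; the helper names `utf8_width` and `max_utf8_width` and the sample strings are made up for this example.

```python
def utf8_width(ch):
    """Return the number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def max_utf8_width(text):
    """Return the widest UTF-8 encoding among the characters of text."""
    return max(utf8_width(ch) for ch in text)


# Hypothetical sample strings, chosen only to cover different scripts.
samples = {
    "Latin": "truck",
    "Cyrillic": "\u0433\u0440\u0443\u0437\u043e\u0432\u0438\u043a",  # грузовик
    "Gurmukhi": "\u0a17\u0a41\u0a30\u0a2c\u0a3e\u0a23\u0a40",
    "Emoji": "\U0001F69A",
}

for label, text in samples.items():
    width = max_utf8_width(text)
    verdict = "fits in 3-byte utf8" if width <= 3 else "needs utf8mb4"
    print(f"{label}: up to {width} byte(s) per character, {verdict}")
```

Characters outside the Basic Multilingual Plane, such as emoji, are the ones that need the fourth byte. Cyrillic itself fits within 3 bytes, so a missing Cyrillic word points more toward a latin1 column or connection than toward the utf8 versus utf8mb4 difference.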

We have one user who indexes and searches in a combination of English and Gurbani.

After indexing a site, if you look in the database, links table, fulltxt column, you will see if Unicode characters are present.

Also check the site being indexed. Does it actually contain Unicode characters, or are they a numeric Unicode replacement?

Hope this helps.

Re: unicode indexing

Posted: Fri Sep 08, 2023 5:00 pm
by denver
Hi, thank you for your answer.

Is that correct?
Image

Re: unicode indexing

Posted: Fri Sep 08, 2023 5:17 pm
by captquirk
Yes, utf8mb4 is the correct encoding. As for collation, I tend to use utf8mb4_general_ci, BUT it really doesn't matter that much. Different collations may sort results differently, but the IMPORTANT thing is that it be utf8mb4!!!

Why MySQL doesn't have full 4 byte UTF8 encoding as a default is beyond me!

Re: unicode indexing

Posted: Sat Sep 09, 2023 12:21 pm
by denver
OK, I have identified the problem in MySQL server settings:

show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

show variables like 'collation%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+

Obviously, utf8mb4 should be used instead of latin1, but my hosting plan (shared hosting) does not allow me to change these settings :cry:

I will have to find an alternative to Sphider and MySQL-based search engines.

Re: unicode indexing

Posted: Sat Sep 09, 2023 3:55 pm
by captquirk
It is possible for a particular database to have a character set and collation DIFFERENT from the server defaults. Before giving up completely, let's be sure that the character set really is the issue. Both the "install.php" and the manual "tables.sql" provided make every effort to be utf8mb4!

Go to the MySQL prompt as you did before and run these commands:
USE (your db_name);
SELECT @@character_set_database, @@collation_database;
If the character set is NOT utf8mb4, we are kind of shafted!

If the character set is correct, then we need to look for other places for the issue... like the pages being indexed. I can explain that in more detail, but let's first be sure it isn't the database itself.

Re: unicode indexing

Posted: Mon Sep 25, 2023 10:59 am
by denver
Sorry for the delay.
Image

The Unicode characters are still missing in the DB.

Re: unicode indexing

Posted: Tue Sep 26, 2023 7:23 pm
by captquirk
Your database does not seem to be the issue.
With web pages, it is possible for a Unicode character to appear correctly in a browser window even though the source code behind that page does not contain the character itself, only a numeric replacement for it. For example:
Browser displays: Д
Source code is: &#1044;
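One way to check a page for this is to look for numeric character references in its raw source. The sketch below is an illustration in Python, not something Sphider provides; the regular expression, the function name `find_numeric_references`, and the two sample strings are assumptions made for the example.

```python
import html
import re

# Matches decimal (&#1044;) and hexadecimal (&#x414;) character references.
NUMERIC_REF = re.compile(r"&#(?:\d+|[xX][0-9a-fA-F]+);")


def find_numeric_references(source):
    """Return every numeric character reference found in the raw source."""
    return NUMERIC_REF.findall(source)


# Hypothetical page fragments: both render as the same letter in a browser.
entity_source = "<p>&#1044;</p>"
literal_source = "<p>\u0414</p>"  # the actual character Д

print(find_numeric_references(entity_source))
print(find_numeric_references(literal_source))

# Decoding the reference yields the same text as the literal version.
print(html.unescape(entity_source) == literal_source)
```

If the first kind of source is what the crawler fetches, the text it sees is the reference, not the letter, unless the indexer decodes references before storing the words.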

Another possible issue is that the source code is indeed UTF-8, but the code states that the character set is Latin-1. Sphider then tries to convert a UTF-8 character to UTF-8, believing it to be a Latin-1 character. That never ends well!
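The failure described above can be reproduced in a few lines. This is a sketch of the general double-encoding problem in Python, not Sphider's actual conversion code (which is not shown in this thread); the function names `double_encode` and `repair` are made up for the example.

```python
def double_encode(text):
    """Simulate a converter that wrongly believes UTF-8 bytes are Latin-1.

    The text is encoded as UTF-8, misread byte by byte as Latin-1, and the
    resulting characters are encoded as UTF-8 a second time.
    """
    utf8_bytes = text.encode("utf-8")
    misread = utf8_bytes.decode("latin-1")  # every byte becomes one character
    return misread.encode("utf-8")


def repair(garbled):
    """Undo one round of the mistaken Latin-1 to UTF-8 conversion."""
    return garbled.decode("utf-8").encode("latin-1").decode("utf-8")


word = "\u0433\u0440\u0443\u0437\u043e\u0432\u0438\u043a"  # грузовик
garbled = double_encode(word)

# The stored bytes no longer decode to the original word, so an exact-match
# search for the word finds nothing.
print(garbled.decode("utf-8") == word)
print(repair(garbled) == word)
```

The round trip in `repair` only works if nothing else altered the bytes along the way, so it is a diagnostic aid more than a fix; the real remedy is declaring the correct character set.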

It is important that the source code character set be declared as the ACTUAL character set being used in the source. Then the content must consist of the actual characters in that character set.

Getting the source code character set correct can also be frustrating. I found the best way is to set it in the HTTP headers, before any HTML is output. Here is how I do that in PHP:
header('Content-Type: text/html; charset=UTF-8');

If you are not specifying a page character set, or not doing it in a header, your host may be setting the character set for you!!! In other words, their default header may be overriding the character set you specified in a meta tag like this:
<meta charset="UTF-8">
Headers come before meta tags and often override them. If you can set your own headers, you win. Like I said, frustrating.
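The precedence just described can be sketched as a small function: take the charset from the Content-Type header if it names one, and fall back to the meta tag only otherwise. This is an illustrative Python sketch of that rule, not Sphider's detection logic; the function names and the simplified meta-tag regular expression are assumptions.

```python
import re
from email.message import Message

# Simplified pattern covering <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="...; charset=..."> form.
META_CHARSET = re.compile(r"<meta[^>]+charset\s*=\s*[\"']?([\w-]+)", re.IGNORECASE)


def charset_from_header(content_type):
    """Return the lowercased charset of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(html_text):
    """Return the lowercased charset declared in a meta tag, or None."""
    match = META_CHARSET.search(html_text)
    return match.group(1).lower() if match else None


def effective_charset(content_type, html_text):
    """The HTTP header wins; the meta tag is only a fallback."""
    return charset_from_header(content_type) or charset_from_meta(html_text)


page = '<!doctype html><html><head><meta charset="UTF-8"></head></html>'

# A host-supplied header overrides the page's own meta tag.
print(effective_charset("text/html; charset=ISO-8859-1", page))
# Without a charset in the header, the meta tag is used.
print(effective_charset("text/html", page))
```

Comparing the header your server actually sends with what the page declares is a quick way to tell whether the host is overriding you.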

Hopefully this helps, but I'm still here if it doesn't....

Re: unicode indexing

Posted: Wed Sep 27, 2023 11:23 am
by denver
I created a dummy file with

<?php header('Content-Type: text/html; charset=utf-8'); ?>
<!doctype html>
containing Unicode characters, but after indexing, the search still returns only Latin words :(

Does it mean that I cannot override the server-set headers?

Re: unicode indexing

Posted: Wed Sep 27, 2023 5:17 pm
by captquirk
I have to think you should be able to override whatever your host has as a default...

I would like to see a portion of source code for a page containing Unicode characters...

Even better would be to see the site itself. If you are not comfortable sharing that publicly, you may do so by private message.