unicode indexing
I have installed Sphider 5.3.0. Everything is fine, but the search does not return words in non-Latin characters. What's the problem?
Re: unicode indexing
Sphider is fully Unicode capable.
Example: If you index a Russian web site, then want to search for "грузовик", are you actually searching for "грузовик", or some numeric Unicode equivalent?
Also, check your database. Is the default character set utf8mb4, or just utf8? MySQL's utf8 is only 3-byte, while utf8mb4 is the full 4-byte UTF-8.
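To make the 3-byte versus 4-byte distinction concrete, here is a minimal Python sketch that counts the UTF-8 bytes needed per character. The sample characters and the helper names (`utf8_len`, `fits_in_three_bytes`) are only illustrations and are not part of Sphider.

```python
# Count how many bytes UTF-8 needs for a few sample characters.
# Helper names and samples are hypothetical, chosen for illustration.

def utf8_len(ch: str) -> int:
    """Number of bytes the given text occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text: str) -> bool:
    """True if every character needs at most 3 bytes in UTF-8."""
    return all(utf8_len(ch) <= 3 for ch in text)


samples = {
    "Latin": "A",
    "Cyrillic": "Д",
    "Gurmukhi": "ਗ",
    "Emoji": "😀",
}

for label, ch in samples.items():
    print(label, ch, utf8_len(ch), "byte(s)")

print(fits_in_three_bytes("грузовик"))
```

Cyrillic letters need 2 bytes each, so a word like "грузовик" fits in a 3-byte column as well. Only characters outside the Basic Multilingual Plane, such as most emoji, require the fourth byte.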
We have one user who indexes and searches in a combination of English and Gurbani.
After indexing a site, if you look in the database, links table, fulltxt column, you will see if Unicode characters are present.
Also check the site being indexed. Does it actually contain Unicode characters, or are they a numeric Unicode replacement?
Hope this helps.
Re: unicode indexing
Hi, thank you for your answer.
Is that correct?

Re: unicode indexing
Yes, utf8mb4 is the correct encoding. As for collation, I tend to use utf8mb4_general_ci, but it really doesn't matter that much. Different collations may sort results differently, but the IMPORTANT thing is that it be utf8mb4!
Why MySQL doesn't have full 4 byte UTF8 encoding as a default is beyond me!
Re: unicode indexing
OK, I have identified the problem in MySQL server settings:
show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
show variables like 'collation%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+
Obviously, utf8mb4 should be used instead of latin1, but my hosting plan (shared hosting) does not allow me to change these settings.
I will have to find an alternative to Sphider and MySQL-based search engines.
Re: unicode indexing
It is possible for a particular database to have a character set and collation DIFFERENT from the server defaults. Before giving up completely, let's be sure that the character set really is the issue. Both the "install.php" and the manual "tables.sql" provided make every effort to be utf8mb4!
Go to the MySQL prompt as you did before and run these commands:
USE (your db_name);
SELECT @@character_set_database, @@collation_database;
If the character set is NOT utf8mb4, we are kind of shafted!
If the character set is correct, then we need to look for other places for the issue... like the pages being indexed. I can explain that in more detail, but let's first be sure it isn't the database itself.
Re: unicode indexing
Sorry for the delay.

The Unicode characters are still missing in the DB.
Re: unicode indexing
Your database does not seem to be the issue.
With web pages, it is possible for a Unicode character to appear correctly in a browser window, BUT the source code behind that page is not Unicode, but a replacement. For example:
Browser displays: Д
Source code is: &#1044;
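One way to tell whether a page's source holds real characters or numeric replacements is to scan the raw source for numeric character references. This is a minimal Python sketch. The sample string and the function names are made up for illustration, and nothing here reflects how Sphider itself handles such pages.

```python
import html
import re

# Matches decimal (&#1044;) and hexadecimal (&#x414;) character references.
NUMERIC_REF = re.compile(r"&#(?:\d+|[xX][0-9a-fA-F]+);")


def has_numeric_refs(source: str) -> bool:
    """True if the raw source contains at least one numeric reference."""
    return NUMERIC_REF.search(source) is not None


def decode_refs(source: str) -> str:
    """Turn character references into the actual characters."""
    return html.unescape(source)


raw = "&#1044;&#1072;"  # hypothetical page fragment
print(has_numeric_refs(raw))
print(decode_refs(raw))
```

If `has_numeric_refs` reports True for the downloaded source, the browser is showing decoded characters while the text that gets indexed may still be the numeric form.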
Another possible issue is that the source code is indeed UTF-8, but the code states that the character set is Latin-1. Sphider then tries to convert a UTF-8 character to UTF-8, believing it to be a Latin-1 character. That never ends well!
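The effect of a wrong declaration can be reproduced in a few lines of Python. This sketch only imitates a generic "Latin-1 to UTF-8" conversion applied to bytes that were already UTF-8. Sphider's real conversion code is not shown in this thread, so treat the function as an assumption about the general mechanism.

```python
def misread_as_latin1(utf8_bytes: bytes) -> str:
    """Interpret UTF-8 bytes as if each byte were one Latin-1 character."""
    return utf8_bytes.decode("latin-1")


original = "Д"
page_bytes = original.encode("utf-8")    # what the page really sends
garbled = misread_as_latin1(page_bytes)  # what a converter trusting "Latin-1" sees
stored = garbled.encode("utf-8")         # what would end up in the database

print(len(page_bytes), "bytes on the wire")
print(len(stored), "bytes stored, no longer the original character")

# The damage is reversible only if you know what happened:
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)
```

One 2-byte character becomes two unrelated characters, so a search for the original word can no longer match the stored text.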
It is important that the source code character set be declared as the ACTUAL character set being used in the source. Then the content must consist of the actual characters in that character set.
Getting the source code character set correct can also be frustrating. I found the best way is to set that in the headers before anything is displayed HTML-wise. Here is how I do that in PHP:
header('Content-Type: text/html; charset=UTF-8');
If you are not specifying a page character set, or not doing it in a header, your host may be setting the character set for you!!! In other words, their default header may be overriding your set specified in a meta tag like this:
<meta charset="UTF-8">
Headers come before meta tags and often override them. If you can set your own headers, you win. Like I said, frustrating.
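The precedence described here can be sketched in Python as follows. The function names are hypothetical, the regular expressions are deliberately simple, and the "header wins" rule mirrors how browsers generally behave. Whether Sphider resolves the charset in exactly this order is an assumption.

```python
import re
from typing import Optional


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    if not content_type:
        return None
    m = re.search(r'charset\s*=\s*"?([\w.:-]+)', content_type, re.I)
    return m.group(1).lower() if m else None


def charset_from_meta(html_text: str) -> Optional[str]:
    """Extract the charset declared in a <meta> tag, if any."""
    m = re.search(r"""<meta[^>]+charset\s*=\s*["']?([\w.:-]+)""", html_text, re.I)
    return m.group(1).lower() if m else None


def effective_charset(content_type: Optional[str], html_text: str) -> Optional[str]:
    """The header takes precedence; the meta tag is only a fallback."""
    return charset_from_header(content_type) or charset_from_meta(html_text)


page = '<!doctype html><meta charset="UTF-8"><p>Д</p>'
print(effective_charset("text/html; charset=ISO-8859-1", page))
print(effective_charset("text/html", page))
```

In the first call the host's header overrides the page's own meta tag, which is the situation described above. In the second call no charset is sent in the header, so the meta tag is used.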
Hopefully this helps, but I'm still here if it doesn't....
Re: unicode indexing
I created a dummy file with
<?php header('Content-Type: text/html; charset=utf-8'); ?>
<!doctype html>
containing Unicode characters, but after indexing the search returns only Latin characters.
Does it mean that I cannot override the server-set headers?
Re: unicode indexing
I have to think you should be able to override whatever your host has as a default...
I would like to see a portion of source code for a page containing Unicode characters...
Even better would be to see the site itself. If you are not comfortable sharing that publicly, you may do so by private message.