unicode indexing

denver
Posts: 9
Joined: Fri Sep 08, 2023 11:32 am

unicode indexing

Post by denver »

I have installed Sphider 5.3.0. Everything is fine, but searches do not return words in non-Latin characters. What's the problem?
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: unicode indexing

Post by captquirk »

Sphider is fully Unicode capable.
Example: If you index a Russian web site, then want to search for "грузовик", are you actually searching for "грузовик", or some numeric Unicode equivalent?

Also, check your database. Is the default character set utf8mb4, or just utf8? MySQL's utf8 is an alias for utf8mb3, which stores at most 3 bytes per character, while utf8mb4 is the full 4-byte Unicode.
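
To make the 3-byte versus 4-byte distinction concrete, here is a minimal Python sketch (Python is only a convenient way to check this from any machine; the names `utf8_width` and `needs_utf8mb4` are made up for illustration). It counts how many bytes each character takes in UTF-8. Note that Cyrillic and Gurmukhi letters fit within 3 bytes, so the 3-byte limit mostly affects emoji and other characters outside the Basic Multilingual Plane.

```python
def utf8_width(ch: str) -> int:
    """Return the number of bytes one character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """Return True if any character needs 4 bytes and so cannot fit in 3-byte utf8."""
    return any(utf8_width(ch) == 4 for ch in text)


if __name__ == "__main__":
    # Sample strings are illustrative only.
    for sample in ["truck", "грузовик", "ਗੁਰਬਾਣੀ", "😀"]:
        widths = [utf8_width(ch) for ch in sample]
        print(sample, widths, "needs utf8mb4:", needs_utf8mb4(sample))
```

If `needs_utf8mb4` comes back False for your text and the characters are still missing, the 3-byte limit is probably not what is dropping them.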

We have one user who indexes and searches in a combination of English and Gurbani.

After indexing a site, if you look in the database, links table, fulltxt column, you will see if Unicode characters are present.

Also check the site being indexed. Does it actually contain Unicode characters, or are they a numeric Unicode replacement?

Hope this helps.
denver
Posts: 9
Joined: Fri Sep 08, 2023 11:32 am

Re: unicode indexing

Post by denver »

Hi, thank you for your answer.

Is that correct?
Image
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: unicode indexing

Post by captquirk »

Yes, utf8mb4 is the correct encoding. As for collation, I tend to use utf8mb4_general_ci, but it really doesn't matter that much. Different collations may sort results differently, but the IMPORTANT thing is that it be utf8mb4!

Why MySQL doesn't have full 4 byte UTF8 encoding as a default is beyond me!
denver
Posts: 9
Joined: Fri Sep 08, 2023 11:32 am

Re: unicode indexing

Post by denver »

OK, I have identified the problem in MySQL server settings:

show variables like 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | utf8                       |
| character_set_connection | utf8                       |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | utf8                       |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

show variables like 'collation%';
+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | utf8_general_ci   |
| collation_database   | latin1_swedish_ci |
| collation_server     | latin1_swedish_ci |
+----------------------+-------------------+

Obviously, utf8mb4 should be used instead of latin1, but my hosting plan (shared hosting) does not allow me to change these settings :cry:

I will have to find an alternative to Sphider and MySQL-based search engines.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: unicode indexing

Post by captquirk »

It is possible for a particular database to have a character set and collation DIFFERENT from the server defaults. Before giving up completely, let's be sure that character set really is the issue. Both the "install.php" and the manual "tables.sql" provided make every effort to use utf8mb4!

Go to the MySQL prompt as you did before and run these commands:
USE (your db_name);
SELECT @@character_set_database, @@collation_database;
If the character set is NOT utf8mb4, we are kind of shafted!

If the character set is correct, then we need to look for other places for the issue... like the pages being indexed. I can explain that in more detail, but let's first be sure it isn't the database itself.
denver
Posts: 9
Joined: Fri Sep 08, 2023 11:32 am

Re: unicode indexing

Post by denver »

Sorry for the delay.
Image

The Unicode characters are still missing in the DB.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: unicode indexing

Post by captquirk »

Your database does not seem to be the issue.
With web pages, it is possible for a Unicode character to appear correctly in a browser window even though the source code behind that page contains not the character itself but a numeric replacement. For example:
Browser displays: Д
Source code is: &#1044;
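
One way to check a page for this is to scan its raw source for numeric character references. The sketch below is only an illustration in Python (it is not part of Sphider, and the helper names `find_numeric_refs` and `decode_refs` are hypothetical). It lists any decimal or hexadecimal references it finds and shows what they decode to.

```python
import html
import re

# Matches decimal (&#1044;) and hexadecimal (&#x414;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[0-9]+|[xX][0-9a-fA-F]+);")


def find_numeric_refs(source: str) -> list[str]:
    """Return every numeric character reference found in the page source."""
    return NUMERIC_REF.findall(source)


def decode_refs(source: str) -> str:
    """Replace character references with the characters they stand for."""
    return html.unescape(source)


if __name__ == "__main__":
    page = "<p>&#1044;&#x430;</p>"  # illustrative fragment, not from a real site
    print(find_numeric_refs(page))
    print(decode_refs(page))
```

An empty list from `find_numeric_refs` means the page source contains the actual characters, so the problem lies elsewhere.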

Another possible issue is that the source code is indeed UTF-8, but the code states that the character set is Latin-1. Sphider then tries to convert a UTF-8 character to UTF-8, believing it to be a Latin-1 character. That never ends well!
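
Here is roughly what that kind of double conversion does to the text. This is a small Python illustration under the assumption that the UTF-8 bytes get read as Latin-1 somewhere along the way; it is not Sphider's actual conversion code, and the function names are hypothetical.

```python
def misread_as_latin1(text: str) -> str:
    """Simulate correct UTF-8 bytes being decoded as if they were Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo the misreading; works only if no bytes were lost in between."""
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    original = "грузовик"
    garbled = misread_as_latin1(original)
    # Each 2-byte Cyrillic letter turns into two unrelated Latin-1 characters.
    print(len(original), len(garbled))
    print(repair(garbled) == original)
```

Text stored in the garbled form will never match a search for the real word, which would look exactly like "the search returns only Latin".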

It is important that the source code character set be declared as the ACTUAL character set being used in the source. Then the content must consist of the actual characters in that character set. Getting the source code character set correct can also be frustrating. I found the best way is to set it in the headers before any HTML is output. Here is how I do that in PHP:
header('Content-Type: text/html; charset=UTF-8');

If you are not specifying a page character set, or not doing it in a header, your host may be setting the character set for you!!! In other words, their default header may be overriding the character set you specified in a meta tag like this:
<meta charset="UTF-8">
Headers come before meta tags and often override them. If you can set your own headers, you win. Like I said, frustrating.
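
To see which character set a page really announces, you can compare the HTTP header with the meta tag. The following is a rough Python sketch using only the standard library; the helper names are hypothetical, the regular expression is a simplification of how browsers sniff the meta tag, and the URL in the comment is a placeholder, not a real test site.

```python
import re
import urllib.request
from email.message import Message

# Simplified pattern: finds charset=... inside a <meta> tag (both common forms).
META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?\s*([A-Za-z0-9_-]+)""", re.IGNORECASE
)


def header_charset(content_type: str):
    """Extract the charset parameter from a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(body: bytes):
    """Extract the charset declared in a <meta> tag near the top of the page, or None."""
    match = META_CHARSET.search(body[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def check(url: str):
    """Fetch a page and return (header charset, meta charset)."""
    with urllib.request.urlopen(url) as resp:
        content_type = resp.headers.get("Content-Type", "")
        body = resp.read()
    return header_charset(content_type), meta_charset(body)


# Usage (placeholder URL): print(check("https://example.com/"))
```

If the two values differ, or the header value is not utf-8, the host's default header is the likely culprit.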

Hopefully this helps, but I'm still here if it doesn't....
denver
Posts: 9
Joined: Fri Sep 08, 2023 11:32 am

Re: unicode indexing

Post by denver »

I created a dummy file with

<?php header('Content-Type: text/html; charset=utf-8'); ?>
<!doctype html>
containing Unicode characters, but after indexing, the search still returns only Latin characters :(

Does it mean that I cannot override the server-set headers?
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: unicode indexing

Post by captquirk »

I have to think you should be able to override whatever your host has as a default...

I would like to see a portion of source code for a page containing Unicode characters...

Even better would be to see the site itself. If you are not comfortable sharing that publicly, you may do so by private message.