unicode indexing

Come here for help or to post comments on Sphider
denver
Posts: 9
Joined: Fri Sep 08, 2023 11:32 am

Re: unicode indexing

Post by denver »

Sending PM failed. "Internal Server Error"
User avatar
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA
Contact:

Re: unicode indexing

Post by captquirk »

I sent you a PM with instructions.
User avatar
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA
Contact:

Re: unicode indexing

Post by captquirk »

Your code looks perfect!
We have already established your tables are utf8mb4.

Is there some way the fields in the table are NOT utf8mb4??? This does not seem likely, but I suppose anything is possible.

Go into mysql, connect to the database you are using, and run a 'show create table' for links and keywords.

You should get (i condensed) the following:
show create table links;

| links | CREATE TABLE `links` (
`link_id` int NOT NULL AUTO_INCREMENT,
`site_id` int DEFAULT NULL,
`url` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
`title` varchar(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
`description` text CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
`fulltxt` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
`indexdate` date DEFAULT NULL,
`size` float DEFAULT NULL,
`md5sum` varchar(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
`visible` int DEFAULT '0',
`level` int DEFAULT NULL,
PRIMARY KEY (`link_id`),
KEY `url` (`url`),
KEY `md5key` (`md5sum`),
KEY `site_id` (`site_id`),
CONSTRAINT `links_ibfk_1` FOREIGN KEY (`site_id`) REFERENCES `sites` (`site_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=22027 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci |


show create table keywords;

| keywords | CREATE TABLE `keywords` (
`keyword_id` int NOT NULL AUTO_INCREMENT,
`keyword` varchar(35) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
PRIMARY KEY (`keyword_id`),
UNIQUE KEY `kw` (`keyword`),
KEY `keyword` (`keyword`(10))
) ENGINE=InnoDB AUTO_INCREMENT=172089 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci |
Any field of type 'varchar' should have utf8mb4.

Would also be interested to see what is actually in the database. Try these queries:
select keyword from keywords where keyword like 'esp%';
and
select title, fulltxt from links where url like '%test%'
denver
Posts: 9
Joined: Fri Sep 08, 2023 11:32 am

Re: unicode indexing

Post by denver »

Sorry for delay again )

Code: Select all

| links | CREATE TABLE `links` (
  `link_id` int(11) NOT NULL AUTO_INCREMENT,
  `site_id` int(11) DEFAULT NULL,
  `url` varchar(255) NOT NULL,
  `title` varchar(200) DEFAULT NULL,
  `description` text,
  `fulltxt` longtext,
  `indexdate` date DEFAULT NULL,
  `size` float DEFAULT NULL,
  `md5sum` varchar(32) DEFAULT NULL,
  `visible` int(11) DEFAULT '0',
  `level` int(11) DEFAULT NULL,
  PRIMARY KEY (`link_id`),
  KEY `url` (`url`),
  KEY `md5key` (`md5sum`),
  KEY `site_id` (`site_id`),
  CONSTRAINT `links_ibfk_1` FOREIGN KEY (`site_id`) REFERENCES `sites` (`site_id`) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=6057 DEFAULT CHARSET=utf8mb4 |

Code: Select all

mysql> select keyword from keywords where keyword like 'esp%';
+---------+
| keyword |
+---------+
| espaa   |
+---------+

Code: Select all

| Espaa |  Espaa Espaa Espaa Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.  |
It looks like 'utf8mb4' is missing in 'varchar'.
User avatar
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA
Contact:

Re: unicode indexing

Post by captquirk »

Not all tables have varchar() fileds, and of those that do, only TWO are critical for this situation. Keywords and links are the critical tables. Go into the site setting and "Clear site", now go into the Clean tables tab and clean keywords. Your links and keywords tables should now be empty.
Run the following from MySQL and then reindex. Fingers crossed, all will be fine.
SET FOREIGN_KEY_CHECKS = 0;
DROP TABLE links;
DROP TABLE keywords;
SET FOREIGN_KEY_CHECKS = 1;

CREATE TABLE links (
link_id int NOT NULL AUTO_INCREMENT,
site_id int DEFAULT NULL,
url varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
title varchar(200) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
description text CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
fulltxt longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
indexdate date DEFAULT NULL,
size float DEFAULT NULL,
md5sum varchar(32) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci DEFAULT NULL,
visible int DEFAULT '0',
level int DEFAULT NULL,
PRIMARY KEY (link_id),
KEY url (url),
KEY md5key (md5sum),
KEY site_id (site_id),
CONSTRAINT links_ibfk_1 FOREIGN KEY (site_id) REFERENCES sites (site_id) ON DELETE CASCADE ON UPDATE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;

CREATE TABLE keywords (
keyword_id int NOT NULL AUTO_INCREMENT,
keyword varchar(35) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL,
PRIMARY KEY (keyword_id),
UNIQUE KEY kw (keyword),
KEY keyword (keyword(10))
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci;
denver
Posts: 9
Joined: Fri Sep 08, 2023 11:32 am

Re: unicode indexing

Post by denver »

I followed the steps as above. No changes (
mysql> select keyword from keywords where keyword like 'esp%';
+---------+
| keyword |
+---------+
| espaa |
+---------+
1 row in set (0,00 sec)
1. "Query OK, 0 rows affected" was a response to each query. Is that OK?
2. In phpMyAdmin I can manually enter a non-Latin1 Unicode character into a cell.
User avatar
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA
Contact:

Re: unicode indexing

Post by captquirk »

"0 rows affected" is expected as there was no operation affecting actual data. Just a table creation which has no data.

Can you run a "show create table keywords;" again just to confirm the table does have utf8mb4 varchar.

The fact you can manually enter a unicode character into 'keywords' is a good sign. Please confirm that the manual operation sticks and the manual change is retained.

Then we have to figure out why Sphider is'nt writing unicode into the table ...

This one is turning into a real head scratcher!
denver
Posts: 9
Joined: Fri Sep 08, 2023 11:32 am

Re: unicode indexing

Post by denver »

1. utf8mb4 varchar? I can't see it...

Image

2. "españa" was entered manually:

Image
User avatar
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA
Contact:

Re: unicode indexing

Post by captquirk »

I am puzzled...
Obviously, I am no guru and am missing something.
The 'keyword' field is capable of holding Unicode characters.

What version of MySQL (or possibly MariaDB) is running? MySQL, from what I read, has finally made utf8 and utf8mb4 the same instead of utf8 == utf8mb3. If the DB is MariaDB, there may be subtle differences in the output.

I KNOW the code, in a typical installation, writes Unicode into the tables... So why isn't that happening here?

One more thing to check, other than the DB version, is the contents of the 'fulltxt' field in the links table for this page. SELECT fulltxt from links WHERE url like (whatever the url is for that page). Does the full text have the same issue?
Post Reply