Problem of HTML entities and quotes in search result

Entropie
Posts: 2
Joined: Fri Apr 05, 2019 9:39 am

Post by Entropie »

Hello,

When I run a search, the results are correct (in terms of occurrences), but they are not displayed correctly:
- HTML entities (such as "&nbsp;") are displayed without the "&".
- All text after an apostrophe (') disappears (truncated result).

Here is an example:
==================
6. [56.00%] EquiTerre » Qui sommes-nousnbsp;? » Les acteurs » Elise <== "&" disappeared
Les acteurs. Elise.
http://www.equiterre.fr/elise - 37.8kb

7. [54.00%] EquiTerre » C\est vous qui le dites ! <== the apostrophe is gone
C <== truncated result
http://www.equiterre.fr/temoignages - 75.8kb

8. [52.00%] EquiTerre » Nos activités » Pour enfants » Camps d\été <== the apostrophe is gone
Le programme de nos camps d <== truncated result
http://www.equiterre.fr/activites-enfants-camps-ete - 38.0kb
==================

Do you know whether there is a setting I got wrong, either in the Sphider settings or in the PHP settings?
I use PHP 7.3, and all my pages are in UTF-8 (without BOM).

Thank you very much for your help, and congratulations on continuing to maintain this wonderful tool, Sphider!
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Problem of HTML entities and quotes in search result

Post by captquirk »

Probably a filter somewhere. I'll take a look.
--------------------------------------------
UPDATE 4/6: There seems to be a twofold issue. First, with Sphider v2.3, none of the full text of pages from http://www.equiterre.fr is being stored. This is probably related to the truncation issue. A different French-language site, https://www.gouvernement.fr, does not show this problem. However, even with https://www.gouvernement.fr, some French words are not being stored properly. That is the second issue.

Now, going back to Sphider v1.6, French words from https://www.gouvernement.fr DO indeed store properly. However, http://www.equiterre.fr throws all kinds of errors and will not index at all in v1.6! It doesn't index in the even older v1.42 either.

I will first need to address the storage issue discovered by comparing v1.6 and v2.3 results from https://www.gouvernement.fr. Then I can revisit your specific problem(s).

I'll get to the bottom of it, but it may take a bit longer than I thought!
------------------------------------------
UPDATE 4/7: I believe the secondary issue I encountered has been resolved, though it needs a little more testing to be sure. The fix will appear in the next release. It also fixed your issue with the "&" disappearing from "&nbsp;". However, for some reason, your site will not store the full text of a page, and I believe this to be related to the truncation issue. As the problem that got me sidetracked winds down, I can concentrate more on getting this truncation issue resolved.

Thank you for your patience.
captquirk

Re: Problem of HTML entities and quotes in search result

Post by captquirk »

The problem of the "&" in the title disappearing and leaving "nbsp;" is resolved in Sphider 2.4. Also in the titles, the missing apostrophes are restored.

The second issue is the truncation occurring in the description. This occurs when the apostrophe (&apos;) character is encountered. Sphider gets the description from the meta tags in the 'getHeadData' function, where the simple apostrophe (&apos;) is interpreted as a delimiter of the description. While this COULD be "fixed" by removing the character from the delimiters, that could cause problems on many other sites. A suggested approach is to replace the apostrophe '&apos;' in your meta descriptions with '&rsquo;', the right single quotation mark (a curved apostrophe). This is not a delimiter, so the full description would be stored, not truncated.
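
A minimal sketch of that workaround on the site side, assuming the meta description is generated by PHP. The helper name metaDescriptionTag() is hypothetical and is not part of Sphider; it swaps the straight apostrophe (and its entity forms) for '&rsquo;' before the tag is written:

```php
<?php
// Hypothetical site-side helper, NOT part of Sphider.
function metaDescriptionTag(string $description): string
{
    // Replace the straight apostrophe and its entity forms with &rsquo;,
    // and escape double quotes so the content attribute stays well-formed.
    $safe = str_replace(
        ["'", '&apos;', '&#39;', '"'],
        ['&rsquo;', '&rsquo;', '&rsquo;', '&quot;'],
        $description
    );
    return '<meta name="description" content="' . $safe . '">';
}

echo metaDescriptionTag("Le programme de nos camps d'été"), "\n";
```

The visible text is unchanged for readers, since browsers render '&rsquo;' as a curved apostrophe.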

Although you may not have noticed, the full text of your pages is not being stored. I am in the process of tracking this down. The full text IS present early in the process, that is, it IS being picked up, but is being "lost" somewhere further in. I will advise when I find the cause.

-----
UPDATE: The full text is being lost when the Sphider function 'removeEmoji' executes. Emojis have presented indexing problems in the past, throwing SQL errors. This function removes emojis from the text prior to storage. While removing this call could have adverse effects on SOME sites, most sites do not have emojis... unless you are getting user comments, as is common for WordPress articles. I haven't seen emojis on your site, so a workaround would be to not execute that function.
For the regular edition of Sphider 2.3.1, in spider.php, line 479, comment out the function call thus:

Code:

// $fulltxt = removeEmoji($fulltxt);
In the PDO edition, this is line 478.
Entropie

Re: Problem of HTML entities and quotes in search result

Post by Entropie »

Hello CaptQuirk!

Thank you so much for your quick response!

I will wait for version 2.4 and, meanwhile, apply the "&apos;" replacement trick in the meta tags of my page headers.
I will also comment out the removeEmoji call, since I have no emojis on my website.

Once again, thanks a lot for your time! :-)

E.
captquirk

Re: Problem of HTML entities and quotes in search result

Post by captquirk »

Now that Sphider 2.4.0 is released, I've gone back to your site.
I've found that the function removeEmoji() in and of itself was not the problem; its placement was. So, in future releases this:

Code:

                $host = $data['host'];
                $path = $data['path'];
                $fulltxt = $data['fulltext'];
                $fulltxt = removeEmoji($fulltxt); // Remove emojis
                if ($charset != "" && $charset != "utf-8") {
                    $fulltxt = iconv($charset, "UTF-8", $fulltxt);
                    $title = iconv($charset, "UTF-8", $title);
                    $desc = iconv($charset, "UTF-8", $desc);
                }
                $fulltxt = isUtf8($fulltxt)?$fulltxt:utf8_encode($fulltxt);
                $title = isUtf8($title)?$title:utf8_encode($title);
                $desc = isUtf8($desc)?$desc:utf8_encode($desc);
                $url_parts = parse_url($url);
                $domain_for_db = $url_parts['host'];
will be replaced with this:

Code:

                $host = $data['host'];
                $path = $data['path'];
                $fulltxt = $data['fulltext'];
                if ($charset != "" && $charset != "utf-8") {
                    $fulltxt = iconv($charset, "UTF-8", $fulltxt);
                    $title = iconv($charset, "UTF-8", $title);
                    $desc = iconv($charset, "UTF-8", $desc);
                }
                $fulltxt = isUtf8($fulltxt)?$fulltxt:utf8_encode($fulltxt);
                $fulltxt = removeEmoji($fulltxt); // Remove emojis
                $title = isUtf8($title)?$title:utf8_encode($title);
                $desc = isUtf8($desc)?$desc:utf8_encode($desc);
                $url_parts = parse_url($url);
                $domain_for_db = $url_parts['host'];
The function will be moved from BEFORE all the utf8 checks to AFTER.
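
To illustrate why the order can matter, here is a minimal sketch. It ASSUMES that removeEmoji() is built on a regular expression with the /u (UTF-8) modifier; the real function body is not shown in this thread, so stripEmojiSketch() below is a hypothetical stand-in. With /u, preg_replace() returns null when the input is not valid UTF-8, which would make the whole text disappear if the call runs before the conversion:

```php
<?php
// Hypothetical stand-in for Sphider's removeEmoji(); the real function body
// is not shown in this thread.
function stripEmojiSketch(string $text): ?string
{
    // With the /u modifier, PCRE requires valid UTF-8 input. On invalid
    // input, preg_replace() returns null instead of the text.
    return preg_replace('/[\x{1F300}-\x{1FAFF}\x{2600}-\x{27BF}]/u', '', $text);
}

// "célèbre" encoded as ISO-8859-1: these bytes are not valid UTF-8.
$latin1 = "c\xE9l\xE8bre";

// Old order: strip emojis first, convert afterwards. The text is lost.
$oldOrder = stripEmojiSketch($latin1);
var_dump($oldOrder === null);

// New order: convert to UTF-8 first, then strip emojis. The text survives.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);
$newOrder = stripEmojiSketch($utf8);
var_dump($newOrder === "c\xC3\xA9l\xC3\xA8bre");
```

If the real function works differently, the details will differ, but the sketch shows one plausible way that running the emoji filter before the utf8 checks can lose the text.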

(SEE UPDATE BELOW)
Additionally, I have noted that two characters used in your pages, "…" and "œ", are initially picked up and crudely represented by the generic black diamond with a question mark (indicating an inability to properly recognize the character), and then become non-printing characters after the utf8 checks. Actually, there are MANY black diamonds in the initial pass, but the utf8 check process properly gets them sorted out.

The utf8 checks seem to sort out everything except "…" and "œ"! The good news is that the "…" really isn't very critical. The database shows it as a non-printing "NEL". The "œ" is also non-printing, showing in the database as "ST". The problem is that a word such as "cœur" will display as "cur" but not be indexed as such. I will try to figure this one out.

Depending on just how meticulous you want to be with your site, a quick workaround might be to represent any occurrence of "œ" with "&oelig;". So "cœur" would be "c&oelig;ur". The "…" can be replaced with "&hellip;". Meanwhile, I will continue to look for a more proper solution.
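
A minimal sketch of that replacement, assuming the page text passes through PHP as UTF-8 before it is sent out. The helper name encodeProblemChars() is hypothetical and is not part of Sphider:

```php
<?php
// Hypothetical site-side helper, NOT part of Sphider.
// Assumes $html is UTF-8 encoded.
function encodeProblemChars(string $html): string
{
    // Map the two characters to their named HTML entities.
    return strtr($html, [
        "\u{0153}" => '&oelig;',  // œ
        "\u{2026}" => '&hellip;', // …
    ]);
}

echo encodeProblemChars("c\u{0153}ur\u{2026}"), "\n";
```

The "\u{...}" escapes keep the sketch independent of the encoding the PHP source file itself is saved in.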

UPDATE: The problem isn't Sphider. While Firefox displays the pages correctly, it doesn't show Sphider page extracts properly. Google Chrome, on the other hand, displays BOTH. Viewing a page extract from a Sphider search using Google Chrome properly shows both "…", and "œ".

It's a browser issue.