Sphider-5.0.0 - Indexing Keyword issues

scorney
Posts: 8
Joined: Wed Jul 19, 2023 3:44 pm

Post by scorney »

All,

I'm pretty new to this environment. I really like Sphider-5.0.0 and everything seems to work, but I am seeing a limitation in keyword indexing.
I have over 1000 words on the website. Images work fine, but I get only about 170 rows in the database tab.

When I search for a word that appears on any page, it comes back with no results.

I re-created the MySQL database with all permissions, for example:
mysql> GRANT ALL PRIVILEGES ON sphider_db.* TO 'scorney'@'localhost';

Ran the install.php script, no issues shown on the screen.

I also tried the CLI: php spider.php -all
I got an error which could be related to the issue, but I'm not sure:

weberver:/var/www/PmGuide/PmGuide/sphider-5.0.0/admin$ php spider.php -all
PHP Fatal error: Uncaught ArgumentCountError: Too few arguments to function indexSite(), 10 passed in /var/www/PmGuide/PmGuide/sphider-5.0.0/admin/spider.php on line 1168 and exactly 11 expected in /var/www/PmGuide/PmGuide/sphider-5.0.0/admin/spider.php:690
Stack trace:
#0 /var/www/PmGuide/PmGuide/sphider-5.0.0/admin/spider.php(1168): indexSite()
#1 /var/www/PmGuide/PmGuide/sphider-5.0.0/admin/spider.php(136): indexAll()
#2 {main}
thrown in /var/www/PmGuide/PmGuide/sphider-5.0.0/admin/spider.php on line 690
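
For readers unfamiliar with this error: PHP 7.1 and later throw an ArgumentCountError when a user-defined function is called with fewer arguments than it requires, which is what the trace above reports for indexSite(). A minimal sketch of the same failure follows. The function indexSiteDemo and its three parameters are hypothetical stand-ins, not Sphider code.

```php
<?php
// Hypothetical stand-in with three required parameters (not Sphider's indexSite()).
function indexSiteDemo($url, $depth, $linklog)
{
    return "$url depth=$depth linklog=$linklog";
}

function callWithTooFewArguments(): string
{
    try {
        // Only two of the three required arguments are passed.
        return indexSiteDemo('http://example.com/', 2);
    } catch (ArgumentCountError $e) {
        // PHP raises this before the function body runs.
        return 'caught: ' . get_class($e);
    }
}

echo callWithTooFewArguments(), PHP_EOL; // prints "caught: ArgumentCountError"
```

Because the exception is uncaught in spider.php, the CLI run stops there as a fatal error.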

I ran the requirements_check.php:
Mysqlnd - CHECK!
PHP 7 or greater - CHECK!
Curl - CHECK!
Iconv - CHECK!
Mbstring - CHECK!
Imagick - CHECK!
(Imagick is not needed for Sphiderlite.)

Congratulations! You can use either Sphider or Sphiderlite.

Any support would be great!
Thanks in advance.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Sphider-5.0.0 - Indexing Keyword issues

Post by captquirk »

HORROR!!!
Could I possibly have an error in the code?

That certainly does seem like a possibility, and your posting the error message will be a great aid in finding it --- and correcting it.

I'll be back --- with a solution in hand.

Re: Sphider-5.0.0 - Indexing Keyword issues

Post by captquirk »

Okay. Definitely some code missing from the indexAll function. :oops:

To correct, in spider.php line 1128, find:
    $stmt = $db->prepare(
        "SELECT url, spider_depth, required, disallowed, "
        ."usesitemap, ignore_robots, can_leave_domain, foreign_images FROM "
        .$mysql_table_prefix."sites"
    );
Replace with:
    $stmt = $db->prepare(
        "SELECT url, spider_depth, required, disallowed, "
        ."usesitemap, ignore_robots, can_leave_domain, foreign_images, "
        ."link_log FROM $mysql_table_prefix."sites"
    );
Then further down in spider.php, line 1162, find:
    $foreignimgs = $row[7];
    if ($foreignimgs=='') {
        $foreignimgs = 0;
    }
    indexSite(
        $url, 1, $depth, $soption, $include, $not_include, $usesitemap,
        $ignore_robots, $can_leave_domain, $foreignimgs
    );
And replace with:
    $foreignimgs = $row[7];
    if ($foreignimgs=='') {
        $foreignimgs = 0;
    }
    $linklog = $row[8];
    if ($linklog == '') {
        $linklog = 0;
    }
    indexSite(
        $url, 1, $depth, $soption, $include, $not_include, $usesitemap,
        $linklog, $ignore_robots, $can_leave_domain, $foreignimgs
    );
This necessitates a new release, 5.1.0, which will be out in a few days. This will eliminate the error you reported. Release 5.1.0 will have no other changes except for an update of the settings table.
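
The patch above repeats one pattern twice: read a column from the row and fall back to 0 when it is empty. A minimal sketch of that pattern as a helper follows. The name columnOrZero and the sample row are assumptions for illustration only and are not part of Sphider.

```php
<?php
// Hypothetical helper (not in Sphider): returns 0 when the column is
// missing, NULL, or an empty string, otherwise the value cast to int.
function columnOrZero(array $row, int $index): int
{
    if (!isset($row[$index]) || $row[$index] === '') {
        return 0;
    }
    return (int) $row[$index];
}

// Made-up row in the column order of the corrected SELECT:
// index 7 is foreign_images, index 8 is link_log.
$row = ['http://example.com/', '2', '', '', '1', '0', '0', '', null];

$foreignimgs = columnOrZero($row, 7);
$linklog = columnOrZero($row, 8);

echo "foreignimgs=$foreignimgs linklog=$linklog", PHP_EOL; // prints "foreignimgs=0 linklog=0"
```

Whether a helper like this is preferable to the inline checks is a style choice; the behavior for empty columns is the same.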

Now, if for some reason keywords are still not being indexed, report back and we will address that as a separate issue.

This is a bit embarrassing, but such is life! Thanks for catching my goof.

Re: Sphider-5.0.0 - Indexing Keyword issues

Post by scorney »

I had to remove an extra " before sites on line 1131 to get the code running.

weberver:/var/www/PmGuide/PmGuide/sphider-5.0.0/admin$ php spider.php -f
PHP Parse error: syntax error, unexpected 'sites' (T_STRING), expecting ')' in /var/www/PmGuide/PmGuide/sphider-5.0.0/admin/spider.php on line 1131

However, it doesn't capture any additional keywords. I am still at about 171 words, and everything runs without errors.
Maybe something on my web page is blocking the spider from collecting words?

Does the spider read through tables and lists (ul, li)?
I have a huge acronym table. I set "Minimum word length in order to be indexed" to 2 and even tried 3, but the result is the same: the number of collected keywords does not change.
As I said before, I am really new to this and don't want to waste your time on a newbie like me.

I do appreciate your reply and support! THANKS

Re: Sphider-5.0.0 - Indexing Keyword issues

Post by captquirk »

Actually, the correct line 1131 is:
."link_log FROM ".$mysql_table_prefix."sites"
Instead of removing that one quote, you needed to add a quote and a dot on the same line, before $mysql_table_prefix. You had the right idea, though! There were unmatched quotes. Another oversight on my part. (Yes, I AM ashamed of myself for that!)
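
To see why the two fixes differ, here is a minimal sketch that builds the query string both ways. It assumes a table prefix of 'sphider_' and assumes the quote that was removed was the one directly before sites; neither detail is confirmed in this thread.

```php
<?php
// Returns the query text for the corrected line and for the variant
// where the quote before "sites" is removed instead.
function buildQueries(string $mysql_table_prefix): array
{
    // Corrected line: close the string, then concatenate prefix and "sites".
    $fixed = "SELECT url, spider_depth, required, disallowed, "
        ."usesitemap, ignore_robots, can_leave_domain, foreign_images, "
        ."link_log FROM ".$mysql_table_prefix."sites";

    // Quote removed: the variable is interpolated inside the string and
    // the dot becomes a literal character in the table name.
    $quoteRemoved = "SELECT url, spider_depth, required, disallowed, "
        ."usesitemap, ignore_robots, can_leave_domain, foreign_images, "
        ."link_log FROM $mysql_table_prefix.sites";

    return [$fixed, $quoteRemoved];
}

// 'sphider_' is an assumed prefix used only for this illustration.
[$fixed, $quoteRemoved] = buildQueries('sphider_');
echo substr($fixed, -22), PHP_EOL;        // prints "FROM sphider_sites"... tail of the query
echo substr($quoteRemoved, -23), PHP_EOL; // tail of the variant, with a literal dot
```

Both versions parse, so PHP reports no error, but under these assumptions the second one asks MySQL for a table named sphider_.sites rather than sphider_sites. Running php -l spider.php checks the file for syntax errors without executing it, although it would not catch this difference.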

Now, PERHAPS, things will work better for you. As an added check, let me know the URL of the website you are spidering and I'll see what happens on my end. If you do not want to make that available for all the world to see, send it to me as a private message or through email as before.

Re: Sphider-5.0.0 - Indexing Keyword issues

Post by scorney »

OK, I applied the change you described. There are no errors in the file and spider.php runs.

However, it did not resolve the issue of capturing the keywords. I read somewhere about a potential bug when an index.html file is present while indexing.
So I removed index.html while indexing and, surprise, I got over 1k keywords. Then I put index.html back in the root directory.

Now that this is resolved, I am still working on getting 3-character words indexed. It seems nothing was captured from my Acronyms.html page. The words are in a table: <table><tr><th>...

I'll keep on investigating...

Thanks again for your support!

Re: Sphider-5.0.0 - Indexing Keyword issues

Post by captquirk »

As things slowly calm down here, I can spend more time looking at this, although it seems you have made good progress.

Contents of tables DO get indexed.

Some words do not get indexed because they are listed in the file "include/common.txt". This probably is not the case for your acronyms, but there might be a couple? You can change this file to your liking. The idea of the file is to keep the number of results down in case someone does an OR search on "the next big thing" and EVERY occurrence of "the" comes back!
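
As a rough illustration of how a common-words list and the "Minimum word length in order to be indexed" setting can each drop words, here is a minimal sketch. It is not Sphider's actual indexing code: the function filterKeywords, the ASCII-only word splitting, and the sample word list are all assumptions.

```php
<?php
// Hypothetical filter (not Sphider's code): lowercases the text, splits it
// on anything that is not an ASCII letter or digit, then drops words that
// are too short or that appear in the common-words list.
function filterKeywords(string $text, array $commonWords, int $minLength): array
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $common = array_map('strtolower', $commonWords);

    $kept = [];
    foreach ($words as $word) {
        if (strlen($word) < $minLength) {
            continue; // shorter than the minimum word length
        }
        if (in_array($word, $common, true)) {
            continue; // listed as a common word
        }
        if (!in_array($word, $kept, true)) {
            $kept[] = $word; // keep each keyword once
        }
    }
    return $kept;
}

// Made-up common-words list for the example.
$common = ['the', 'and', 'for'];
$text = 'The next big thing for the API and an SQL DB';

echo implode(' ', filterKeywords($text, $common, 3)), PHP_EOL; // prints "next big thing api sql"
echo implode(' ', filterKeywords($text, $common, 2)), PHP_EOL; // prints "next big thing api an sql db"
```

In this sketch a three-letter acronym survives a minimum length of 3 unless it also appears in the common-words list, so both settings are worth checking when an acronym is missing.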

Re: Sphider-5.0.0 - Indexing Keyword issues

Post by scorney »

Thanks for pointing me to include/common.txt. I'll have a look at it and make the necessary adjustments. Very good direction.

Re: Sphider-5.0.0 - Indexing Keyword issues

Post by captquirk »

A way of circumventing the issue with index.html is to create a sitemap.xml, then do an index using that. Re-indexing with a sitemap does not work so well, but a clean index will.

Re-index actually looks at existing pages for changes and adds new pages based upon any new references in those changed pages.

A clean index using a sitemap does NOT pull references from the indexed pages but indexes pages based upon what is in the sitemap.
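
For anyone who has not written a sitemap before, here is a minimal sketch that generates one in the standard sitemaps.org format from a list of page URLs. The function buildSitemap and the example.com URLs are hypothetical, and where Sphider expects the file to live is not stated in this thread.

```php
<?php
// Hypothetical generator: builds a sitemaps.org-style document with one
// <url><loc> entry per page. URLs are XML-escaped so that "&" stays valid.
function buildSitemap(array $urls): string
{
    $lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ];
    foreach ($urls as $url) {
        $loc = htmlspecialchars($url, ENT_XML1 | ENT_QUOTES, 'UTF-8');
        $lines[] = "  <url><loc>$loc</loc></url>";
    }
    $lines[] = '</urlset>';

    return implode("\n", $lines) . "\n";
}

// Placeholder URLs; replace with the real pages of the site.
echo buildSitemap([
    'http://example.com/Acronyms.html',
    'http://example.com/page.php?a=1&b=2',
]);
```

The output could be saved with file_put_contents('sitemap.xml', ...) before running a clean index. Since a clean index from a sitemap does not follow links, every page that should be searchable needs its own entry.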

Re: Sphider-5.0.0 - Indexing Keyword issues

Post by scorney »

I created more content for my site and did a clean index as follows:
1. "Clear site"
2. Remove index.html from the root directory
3. Perform Clean Index (Full)
4. Put index.html back in the root directory

Then I tried searching for the new keywords, and a lot of them returned no results:
"The search "Lost" did not match any documents"

I checked, and nothing similar is in include/common.txt.

I'm not sure whether the settings have anything to do with this or whether it is something in the code.
I'll keep trying different ways to get all words indexed.