Initially, using Sphider 2.4.1 (which is as yet unreleased but MUCH like 2.4.0), I tried to index http://www.grangewoodplastics.co.uk. It immediately failed, giving me a NOHOST error.
After installing Sphider 2.2.0 and creating a database for it, I was able to successfully index the site. I upgraded to Sphider 2.3.0, reindexed successfully, then deleted all data and did a fresh index. Success. Upgraded to Sphider 2.3.1 and repeated. Success each way. Upgraded again to Sphider 2.4.0. Repeated both the reindex and the index from scratch. Both successes.
Returning to Sphider 2.4.1 to see what was wrong, I had to do a facepalm! I had entered "http://ww.grangewoodplastics.co.uk"! Yup! I didn't have enough "w's"! So I deleted, recreated... and indexed. Mind you, this is all from within a browser.
I then went over onto my Ubuntu machine where I had Sphider 2.4.0 installed but was only using the RSS indexing. The database looked like this:
- databasebefore.png
Next, I added a site, "http://www.grangewoodplastics.co.uk". When I saved, Sphider checked for redirects, converted the URL to https, and added a trailing slash:
- addgrangewood.png
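For anyone curious about what that save-time check is doing, the idea can be sketched roughly as follows. This is an illustrative Python sketch, not Sphider's actual (PHP) code: resolve the redirect's Location header against the URL the user entered, then normalize the trailing slash.

```python
from urllib.parse import urljoin

def canonicalize(entered_url: str, location_header: str) -> str:
    """Resolve a redirect's Location header against the URL the user
    entered, then make sure the stored URL ends with a trailing slash.
    (A sketch of the idea only, not Sphider's implementation.)"""
    final = urljoin(entered_url, location_header)
    if not final.endswith("/"):
        final += "/"
    return final

# The http site redirects to https, so the stored URL becomes:
print(canonicalize("http://www.grangewoodplastics.co.uk",
                   "https://www.grangewoodplastics.co.uk/"))
# https://www.grangewoodplastics.co.uk/
```

Storing the canonical form up front is what lets later link comparisons work; we will see below what happens when the stored URL and the real site disagree.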
These are my settings:
- settings.png
This time, I indexed from the command prompt:
- startindex.png
It ran for a while:
- finishindex.png
And the database looks like this:
- databaseafter.png
Basically, I cannot reproduce the no-index issue. What I CAN tell you is that not EVERY page is stored. I know the initial (home) page is not stored. It IS parsed (indexed), keywords are stored, and links are followed, but the actual page (link) is NOT. The reason is a non-fatal SQL error when attempting to store the page. Why? The page has a title that is 930 characters long, and the database field for the title is only 200 characters! Many versions ago, I had issues, and rather than have Sphider grind to a halt, it seemed better to skip over a bad page and keep on chugging along. The issues causing the original problems were found and accounted for... but now we have a NEW "what if": what if the title is crazy long? Well, we will adjust for this in the next version. (Build a better mousetrap, nature builds a better mouse.)

Now, simply not storing and skipping over a page has other ramifications (also non-fatal). When a link is not stored, there is no link_id. Keywords are still stored and get a keyword_id. When the database calculates the link-keyword association, an entry is made in one of 16 different tables with a link-keyword id pair --- except for the "missing" page, THERE IS NO LINK in link-keyword! Again, these errors are non-fatal and transparent to the user. When the next version fixes the too-long-a-title issue, the link (page) will store and the other errors will disappear as well.
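The planned fix boils down to clipping the title before the INSERT instead of letting the INSERT fail. A minimal Python sketch of that idea (the 200-character limit comes from the column width mentioned above; the function name is mine, not Sphider's):

```python
TITLE_MAX = 200  # width of the title column in the links table

def fit_title(title: str, limit: int = TITLE_MAX) -> str:
    """Clip an over-long page title so the INSERT cannot fail.
    A 930-character title (like the one on this site's home page)
    comes back trimmed to the column width, with an ellipsis."""
    if len(title) <= limit:
        return title
    return title[:limit - 3] + "..."

print(len(fit_title("x" * 930)))  # 200
```

With the title guaranteed to fit, the page stores, a link_id exists, and the downstream link-keyword entries get made normally.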
Now, all that is well and good, but it STILL doesn't explain why you aren't getting ANYTHING. I am consistently getting something like 448 pages, 777 images, and in one instance 3017 keywords.
I would like to see a screenshot of your settings page, as well as the sites tab. The correct scenario should be for Sphider to see the site as "https": when you initially enter and save the site, it should notice the redirection and change the stored URL to "https" even if you put in "http". If you MANUALLY change (edit) it back to "http" after it has been stored as "https", that may cause an issue.
I guess what I am saying is that for this particular site, if Sphider is looking for "http", it will attempt to index the first page, but since that is a redirect (although Sphider SUPPOSEDLY follows redirects), there may be nothing to index. Even if it does follow the redirect, the l-o-n-g title keeps the page from being stored, and if it finds links, their "https" won't match what Sphider thinks should be "http", so it considers them foreign. Producing nothing.
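The "considers them foreign" step is just a scheme-and-host comparison against the stored site URL. A hedged Python sketch of that idea (again, not Sphider's actual code; the function name is mine) shows how a stored "http" site makes every "https" link on the page look foreign:

```python
from urllib.parse import urlparse

def is_foreign(link: str, site_url: str) -> bool:
    """A link counts as foreign when its scheme+host differ from the
    stored site URL, so if the site was stored as 'http', every
    'https' link it serves gets skipped. (Sketch of the idea only.)"""
    a, b = urlparse(link), urlparse(site_url)
    return (a.scheme, a.netloc) != (b.scheme, b.netloc)

# Stored as http, but the site serves https links -> all look foreign:
print(is_foreign("https://www.grangewoodplastics.co.uk/about/",
                 "http://www.grangewoodplastics.co.uk/"))  # True
```

If a spider uses this kind of strict match, one manually edited scheme is enough to make an entire site's worth of links fall outside the crawl, which fits the "producing nothing" outcome described above.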
Sorry I can't be more definitive at this time. I WILL keep looking around to see if I can recreate the issue of no indexing.
UPDATE: Using the soon-to-be-released 2.4.1, I deleted ALL data associated with the site (links and images) and cleaned all keywords. I then edited the URL to make it "http" instead of "https". It has just started running, but it IS indexing, and is already up to about 24 pages. Also, the quick fix for too-long-a-title is working and the home page is stored. So still no closer to duplicating the issue...