Initially, using Sphider 2.4.1 (which is as yet unreleased but MUCH like 2.4.0), I tried to index http://www.grangewoodplastics.co.uk. It immediately failed, giving me a NOHOST error.
After installing Sphider 2.2.0 and creating a database for it, I was able to successfully index the site. I upgraded to Sphider 2.3.0, reindexed successfully, then deleted all data and did a fresh index. Success. Upgraded to Sphider 2.3.1 and repeated. Success each way. Upgraded again to Sphider 2.4.0. Repeated both the reindex and the index from scratch. Both successes.
Returning to Sphider 2.4.1 to see what was wrong, I had to do a facepalm! I had entered "http://ww.grangewoodplastics.co.uk"! Yup! I didn't have enough "w's"! So I deleted, recreated... and indexed. Mind you, this is all from within a browser.
I then went over onto my Ubuntu machine where I had Sphider 2.4.0 installed but was only using the RSS indexing. The database looked like this:
- databasebefore.png
Next, I added a site, "http://www.grangewoodplastics.co.uk". When I saved, Sphider checked for redirects, converted the URL to https, and added a trailing slash:
- addgrangewood.png
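For anyone curious about what that save-time check is doing, the idea can be sketched roughly as follows. This is an illustrative Python sketch, not Sphider's actual (PHP) code: resolve the redirect's Location header against the URL the user entered, then normalize the trailing slash.

```python
from urllib.parse import urljoin

def canonicalize(entered_url: str, location_header: str) -> str:
    """Resolve a redirect's Location header against the URL the user
    entered, then make sure the stored URL ends with a trailing slash.
    (A sketch of the idea only, not Sphider's implementation.)"""
    final = urljoin(entered_url, location_header)
    if not final.endswith("/"):
        final += "/"
    return final

# The http site redirects to https, so the stored URL becomes:
print(canonicalize("http://www.grangewoodplastics.co.uk",
                   "https://www.grangewoodplastics.co.uk/"))
# https://www.grangewoodplastics.co.uk/
```

Storing the canonical form up front is what lets later link comparisons work; we will see below what happens when the stored URL and the real site disagree.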
These are my settings:
- settings.png
This time, I indexed from the command prompt:
- startindex.png
It ran for a while:
- finishindex.png
And the database looks like this:
- databaseafter.png
Basically, I cannot reproduce the no-index issue. What I CAN tell you is that not EVERY page is stored. I know the initial (home) page is not stored. It IS parsed (indexed), keywords are stored, and links are followed, but the actual page (link) is NOT. The reason is a non-fatal SQL error when attempting to store the page. Why? The page has a title that is 930 characters long, and the database field for the title is only 200 characters! Many versions ago, I had issues, and rather than have Sphider grind to a halt, it seemed better to skip over a bad page and keep on chugging along. The issues causing the original problems were found and accounted for... but now we have a NEW "what if": what if the title is crazy long? Well, we will adjust for this in the next version. (Build a better mousetrap, nature builds a better mouse.)

Now, simply not storing and skipping over a page has other ramifications (also non-fatal). When a link is not stored, there is no link_id. Keywords are still stored and get a keyword_id. When the database calculates the link-keyword association, an entry is made in one of 16 different tables with a link-keyword id pair --- except for the "missing" page, THERE IS NO LINK in link-keyword! Again, these errors are non-fatal and transparent to the user. When the next version fixes the too-long-a-title issue, the link (page) will store and the other errors will disappear as well.
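The planned fix boils down to clipping the title before the INSERT instead of letting the INSERT fail. A minimal Python sketch of that idea (the 200-character limit comes from the column width mentioned above; the function name is mine, not Sphider's):

```python
TITLE_MAX = 200  # width of the title column in the links table

def fit_title(title: str, limit: int = TITLE_MAX) -> str:
    """Clip an over-long page title so the INSERT cannot fail.
    A 930-character title (like the one on this site's home page)
    comes back trimmed to the column width, with an ellipsis."""
    if len(title) <= limit:
        return title
    return title[:limit - 3] + "..."

print(len(fit_title("x" * 930)))  # 200
```

With the title guaranteed to fit, the page stores, a link_id exists, and the downstream link-keyword entries get made normally.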
Now, all that is well and good, but it STILL doesn't explain why you aren't getting ANYTHING. I am consistently getting something like 448 pages, 777 images, and in one instance 3017 keywords.
I would like to see a screenshot of your settings page, as well as the sites tab. The correct scenario should be for Sphider to see the site as "https": when you initially enter and save the site, it should notice the redirection and change the stored URL to "https" even if you put in "http". If you MANUALLY change (edit) it back to "http" after it has been stored as "https", that may cause an issue.
I guess what I am saying is that for this particular site, if Sphider is looking for "http", it will attempt to index the first page, but since that is a redirect (although Sphider SUPPOSEDLY follows redirects), there may be nothing to index. Even if it does follow the redirect, the l-o-n-g title keeps the page from being stored, and if it finds links, their "https" won't match what Sphider thinks should be "http", so it considers them foreign. Producing nothing.
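The "considers them foreign" step is just a scheme-and-host comparison against the stored site URL. A hedged Python sketch of that idea (again, not Sphider's actual code; the function name is mine) shows how a stored "http" site makes every "https" link on the page look foreign:

```python
from urllib.parse import urlparse

def is_foreign(link: str, site_url: str) -> bool:
    """A link counts as foreign when its scheme+host differ from the
    stored site URL, so if the site was stored as 'http', every
    'https' link it serves gets skipped. (Sketch of the idea only.)"""
    a, b = urlparse(link), urlparse(site_url)
    return (a.scheme, a.netloc) != (b.scheme, b.netloc)

# Stored as http, but the site serves https links -> all look foreign:
print(is_foreign("https://www.grangewoodplastics.co.uk/about/",
                 "http://www.grangewoodplastics.co.uk/"))  # True
```

If a spider uses this kind of strict match, one manually edited scheme is enough to make an entire site's worth of links fall outside the crawl, which fits the "producing nothing" outcome described above.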
Sorry I can't be more definitive at this time. I WILL keep looking around to see if I can recreate the issue of no indexing.
UPDATE: Using the soon-to-be-released 2.4.1, I deleted ALL data associated with the site (links and images) and cleaned all keywords. I then edited the URL to make it "http" instead of "https". It has just started running, but it IS indexing, and is already up to about 24 pages. Also, the quick fix for too-long-a-title is working and the home page is stored. So still no closer to duplicating the issue...