Premature completion

Post by **petermk** » Tue Nov 26, 2019 12:02 pm

Thank you captkirk for taking on this project! I have been using the original sphider for many years, but it did not survive a php upgrade.

I have installed Sphider 3.3.0-MB on a shared server running LiteSpeed web server.

I find that the sphider indexer does not extend to all the accessible pages on my site (that is, those not blocked by robots.txt) -- it completes after 146 pages, which is about a tenth of the site.

Any suggestions as to why this happens?

Cheers
Peter

Post by **captquirk** » Wed Nov 27, 2019 4:00 am

Give me a link to your site and I will have a look.

Just a couple ideas that may or may not apply...
Do you have a sitemap, and is it linked to from the main page? Indexing using a sitemap.xml is a surefire way of finding pages.
Do you have any links that are only in JavaScript, like a JavaScript menu system? Sphider does not "read" JavaScript, so any links which reside ONLY in JavaScript will not be picked up unless they are found elsewhere.
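
To illustrate the JavaScript point, here is a minimal Python sketch of a link collector that only looks at `<a href>` tags. It is not Sphider's code (Sphider is written in PHP); the class name, function name, and sample page are all made up for the example. The destination that appears only inside the script is never seen.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags, the way a simple crawler might."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags are inspected; script contents arrive as plain data.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return every href found in <a> tags of the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical page: one ordinary link, one destination set only in script.
SAMPLE_PAGE = """
<html><body>
<a href="/about.htm">About</a>
<script>
  function go() { window.location = "/hidden.htm"; }
</script>
</body></html>
"""

if __name__ == "__main__":
    print(extract_links(SAMPLE_PAGE))
```

Only `/about.htm` is collected, so a page reachable solely through the script would need a plain link somewhere (or a sitemap entry) to be found.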

Of course there may be something else going on. Let me have a go and see what happens.

Post by **petermk** » Wed Nov 27, 2019 4:43 am

Thank you, Captain. I'm sorry I got your name wrong!

I'll investigate and report back.

Cheers

Post by **petermk** » Wed Nov 27, 2019 9:20 am

Hi captquirk, and thanks for responding.

The problematic website is http://www.marquis-kyle.com.au

The home page has a link to a human-readable site map at http://www.marquis-kyle.com.au/sitemap.shtml

There is also sitemap.xml generated by http://www.web-site-map.com/xml_sitemap.php -- I see that, like the sphider index, it is incomplete (but has more entries than the sphider index).

Another anomaly I notice is that the sphider indexer seems to be interpreting relative links in an odd way:

So if a page in the directory marquis-kyle.com.au/mt/
contains a link to ./../aboutblog.htm
the indexer goes looking for marquis-kyle.com.au/mt/aboutblog.htm
instead of marquis-kyle.com.au/aboutblog.htm
and returns the message "Not text or html"
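
For comparison, this is how a standards-following resolver handles that link. The sketch uses Python's `urllib.parse.urljoin` purely as a reference point; it says nothing about how Sphider resolves links internally, and the page name `example.htm` is invented for the example.

```python
from urllib.parse import urljoin

# Hypothetical page inside the /mt/ directory (the file name is made up).
base = "http://www.marquis-kyle.com.au/mt/example.htm"

# "./" stays in /mt/, then "../" steps up to the site root.
resolved = urljoin(base, "./../aboutblog.htm")
print(resolved)
```

The result is the root-level `aboutblog.htm`, which is what a browser requests as well.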

I'd appreciate any comments or suggestions!

Cheers

Post by **captquirk** » Wed Nov 27, 2019 7:04 pm

Using Sphider 3.3.0-MB, I added a new site: http://www.marquis-kyle.com.au/
I set index level to: full
In settings, I checked "Index pdf".
Minimum words per page is 10, minimum word length is 3.

Using these settings, I ran the spider. There were 1668 links discovered, a couple being duplicates.
The end result was 989 pages indexed with 18,686 keywords.
466 pages were not indexed due to a no index flag in the meta tag.
A couple of pages reported having fewer than 10 words. I looked at one, which was actually a pdf containing only images, so no surprise there.
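
For anyone unsure what the no index flag looks like, here is a rough Python sketch of detecting it. This is an illustration only, assuming the flag is the usual `<meta name="robots" content="noindex">` tag; the names `RobotsMetaParser` and `has_noindex` are hypothetical and not taken from Sphider.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Sets self.noindex when a robots meta tag lists the noindex directive."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated list of directives.
            directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
            if "noindex" in directives:
                self.noindex = True


def has_noindex(html):
    """Return True if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.noindex


if __name__ == "__main__":
    print(has_noindex('<head><meta name="robots" content="noindex, follow"></head>'))
    print(has_noindex('<head><meta name="robots" content="index, follow"></head>'))
```

Pages carrying the tag are skipped on purpose, so they should not be counted as missing from the index.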

Why you are only getting 146 pages is strange. When looking at the "Sites" tab, is there any indication that indexing was not completed? If so, try resuming. The only other thing I can think of is possibly a timeout issue. Try indexing from the command prompt.

Now, concerning the problem with "./../aboutblog.htm" or "./../autobio.htm"...
Your method of referencing these pages is somewhat indirect, although COMPLETELY VALID! Any browser worth its salt will properly interpret the reference. Looking at the Sphider code, it SEEMS it should be functioning the same way. Why it is not doing so for this site is something I need to look at more deeply.

As I said, the method IS VALID, but if I may borrow from one of your pages, "Do as much as necessary, as little as possible."
Consider replacing "./../aboutblog.htm" with simply "/aboutblog.htm". Both a browser and Sphider will understand.
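
A small Python sketch of why the root-relative form is the simpler choice: it resolves to the same URL no matter how deep the referring page sits. The page names below are invented for the example, and `urljoin` stands in for any standards-following resolver.

```python
from urllib.parse import urljoin

# Hypothetical pages at different depths of the site.
pages = [
    "http://www.marquis-kyle.com.au/index.htm",
    "http://www.marquis-kyle.com.au/mt/example.htm",
    "http://www.marquis-kyle.com.au/mt/archive/example.htm",
]

# A root-relative link ignores the directory of the page it appears on.
targets = {urljoin(page, "/aboutblog.htm") for page in pages}
print(targets)
```

By contrast, "./../aboutblog.htm" only reaches the root when the page is exactly one directory deep.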

Back to the issue of only getting 146 pages indexed....
Sphider can index from a sitemap, but it has to be in the form of "sitemap.xml". This is a different format than your "sitemap.shtml". During normal indexing, of course, Sphider will read and pick up links from the shtml page just as it would any other valid html page. Just out of curiosity, I ran a sitemap generator on http://www.marquis-kyle.com.au/, and got one with 1008 links, including 6 pdf files. (I used Sitemap Generator 9 from Microsys. http://www.microsystools.com/products/s ... generator/)
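
To show what the "sitemap.xml" format looks like, here is a minimal Python sketch that reads one and lists its URLs. It follows the standard sitemaps.org layout (`<urlset>` containing `<url><loc>` entries); the two sample entries and the function name `sitemap_urls` are made up, and this is not how Sphider itself reads the file.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemaps.org format.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical two-entry sitemap for illustration.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.marquis-kyle.com.au/</loc></url>
  <url><loc>http://www.marquis-kyle.com.au/aboutblog.htm</loc></url>
</urlset>
"""


def sitemap_urls(xml_text):
    """Return the <loc> values of a sitemap, in document order."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

Every entry is a full absolute URL, so nothing has to be resolved relative to another page.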

Let me know how you make out, and if still having issues I'll see what else I can come up with.

Post by **petermk** » Thu Nov 28, 2019 2:39 am

Thank you, captquirk -- I have done some more investigation:

First, I changed those relative links ( ./../somefile.htm ) to relative-to-root ( /somefile.htm ), and ran sphider. The sphider indexing came to a stop after indexing about 100 pages (just stopped responding, no completion message) after trying to index a string of non-existing URLs ( like /mt/cartes/somefile.htm where the /mt and /cartes directories are both in the root directory, not nested ).

Next, I tried starting the sphider from a URL three steps deep from the root ( /mt/002177.php ). This time it got to about 800 pages before stopping at a non-existing URL with nested directories.

Next, I installed an evaluation copy of the Microsys sitemap generator and produced a sitemap.xml which I uploaded to the root directory. I checked 'Index using a sitemap' and ran the indexer. This time it got to about 1100 pages before stopping at a non-existent URL ( http://www.marquis-kyle.com.au/sp/mt/000108.htm ).

I'm no programmer, but it seems to me that there is a fault in the way the spider is following links and building its list of pages. And perhaps that fault is causing the process to lose its way.

Captquirk, since you mentioned that you indexed pdf files, I'm guessing you are running sphider on a Windows machine -- is that right? Is it significant?

I'll stand by for any further suggestions...

Post by **captquirk** » Thu Nov 28, 2019 4:49 pm

If you could send me your spider log I might get a clue as to what is happening.

There is something strange here as I have indexed many sites using relative links with no issue. But even I come up with some invalid links while the source code appears totally valid. Give me a bit and I WILL figure out how the references are getting mangled!

Regarding pdf files, I index using Ubuntu. As far as I know, pdf-to-text converters are standard in Linux installations with a typical location of /usr/bin. Other converters, doc-to-text, xls-to-text, and ppt-to-text are optional and may or may not be present. (I also do testing on a Windows platform, but I prefer Linux. It takes me back to my glory UNIX days after I migrated away from mainframe work.)

As an aside, this is a holiday here in the States, so I may not get much done today or tomorrow...

Sphider Help Forum
