
Question about the sitemap

Posted: Wed Dec 13, 2023 11:37 am
by chef-olaf
I have the following question / problem.

If a domain has a sitemap.xml that redirects to sitemap_index.xml,
and there is also a reference to /sitemap-1.xml.gz, only the domain itself is indexed and no other pages.

This is a small problem for me, since I index via the CLI with the options -r -f -s.

With indexing via sitemap (if available) switched off, over 80 pages are indexed.


Sphider 5.4.1

Operating system DEBIAN 11
Webserver Apache 2.4.56-1~deb11u2 with Nginx 1.24.0.1-v.debian.11+p18.0.57.0+t231106.2014
php83 8.3.0-debian.11.231124.0933
CPU Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (6 core(s))
Memory 16GB

Kind regards

Olaf

Re: Question about the sitemap

Posted: Wed Dec 13, 2023 4:08 pm
by captquirk
A short history of Sphider...
Originally, Sphider read ONLY a sitemap.xml file. This worked fine on small websites, which Sphider was intended for.

I did expand this so that Sphider could use sitemap.xml as an index, and it will now accept further xml files.

HOWEVER, I did NOT take xml.gz into consideration. So the gz files are indeed being ignored.

Also, Sphider does not handle redirects very well, so sitemap_index.xml is probably not even being read.

Something like:

Code:

  <sitemapindex>
    <sitemap>
      <loc>https://www.example.com/sitemap_index.xml</loc>
    </sitemap>
  </sitemapindex>
in the main sitemap.xml would work, but then the gz files would still be an issue.
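
For illustration only, here is a minimal sketch of how SimpleXML sees a sitemap index, assuming the index follows the sitemaps.org layout in which each entry is wrapped in a <sitemap> element. The example.com URLs and the variable names are placeholders, not taken from Sphider:

```php
<?php
// Hypothetical sitemap index; the URLs are placeholders.
$xml = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap_index.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-1.xml.gz</loc></sitemap>
</sitemapindex>
XML;

$index = simplexml_load_string($xml);
$children = array();
if ($index !== false) {
    // Each direct child of the root is one <sitemap> entry holding a <loc>.
    foreach ($index as $entry) {
        $children[] = trim((string) $entry->loc);
    }
}
print_r($children);
```

The loop reads `$entry->loc` from each child of the root element, which is why every <loc> needs its own wrapping element for a loop of this shape to find it.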

While the handling of redirects may be beyond what Sphider can handle (without a lot of rework), decompressing gz files does lend itself to a future enhancement. If you are interested, the relevant code would be in spiderfuncs.php about line 425.
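
On the redirect side, one possible approach (a rough sketch only, not something Sphider does today) would be to resolve the final URL first and hand that to the sitemap reader. The helper name finalUrlFromHeaders is hypothetical. It expects a header list in the numerically indexed format that PHP's get_headers() returns, and it assumes every Location value is an absolute URL; relative redirects are not handled:

```php
<?php
/**
 * Hypothetical helper (not part of Sphider): pick the final URL out of a
 * header list in the format returned by get_headers($url).
 *
 * @param array  $headers   Numerically indexed list of header lines
 * @param string $start_url URL that was originally requested
 *
 * @return string Last redirect target, or $start_url if there was none
 */
function finalUrlFromHeaders(array $headers, $start_url)
{
    $final = $start_url;
    foreach ($headers as $header) {
        // Each redirect hop contributes one "Location:" line; keep the last.
        if (preg_match('/^Location:\s*(\S+)/i', $header, $m)) {
            $final = $m[1];
        }
    }
    return $final;
}

// Sample header list, hard-coded so the sketch runs without network access.
$sample = array(
    'HTTP/1.1 301 Moved Permanently',
    'Location: https://www.example.com/sitemap_index.xml',
    'HTTP/1.1 200 OK',
    'Content-Type: application/xml',
);
echo finalUrlFromHeaders($sample, 'https://www.example.com/sitemap.xml'), "\n";
```

On a live site the header list would come from get_headers(), which returns false on failure, so that case would need its own check before calling the helper.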
_____________________
(And just as an FYI, a fix for the "Index decimals" bug - and improvements - is being tested. Thanks! Great catch!)

Re: Question about the sitemap

Posted: Thu Dec 14, 2023 12:38 am
by captquirk
Tentative enhancement to allow reading xml.gz files:

Code:

/**
 * Read the sitemap.xml file on the server
 *
 * @param string $input_file Sitemap file name
 *
 * @return array|string Array of links found in sitemap.xml,
 *                      or '' if the sitemap could not be read
 */
function getSiteMap($input_file)
{
    $links = '';
    $sitemap = simplexml_load_file($input_file);
    // simplexml_load_file() returns false on failure
    if ($sitemap !== false) {
        $links = array();
        foreach ($sitemap as $url) {
            // Cast to string so the links are plain strings, not XML objects
            $loc = trim((string) $url->loc);
            // For some reason, wlwmanifest.xml interferes with the recursion
            // Therefore, let's ignore it (and skip entries without a <loc>)
            if ($loc === '' || preg_match("/wlwmanifest\.xml$/i", $loc)) {
                continue;
            }
            if (preg_match("/\.xml\.gz$/i", $loc)) {
                // Compressed sub-sitemap: decompress it, then parse it
                $contents = '';
                $zd = gzopen($loc, "r");
                if ($zd !== false) {
                    // Read in chunks until EOF; a single gzread() call would
                    // cut off anything longer than its length argument
                    while (!gzeof($zd)) {
                        $chunk = gzread($zd, 100000);
                        if ($chunk === false || $chunk === '') {
                            break;
                        }
                        $contents .= $chunk;
                    }
                    gzclose($zd);
                }
                if ($contents !== '') {
                    $sitemap_gz = simplexml_load_string($contents);
                    if ($sitemap_gz !== false) {
                        foreach ($sitemap_gz as $url_gz) {
                            $links[] = trim((string) $url_gz->loc);
                        }
                    }
                }
            } elseif (preg_match("/\.xml$/i", $loc)) {
                // Uncompressed sub-sitemap
                $sitemap2 = simplexml_load_file($loc);
                if ($sitemap2 !== false) {
                    foreach ($sitemap2 as $url2) {
                        $links[] = trim((string) $url2->loc);
                    }
                }
            } elseif (!preg_match("/\.gz$/i", $loc)) {
                // Ordinary page link; other .gz files are skipped
                $links[] = $loc;
            }
        }
    }
    return $links;
}
This will work for both the full version of Sphider and SphiderLite. This revised function replaces the current getSiteMap function in spiderfuncs.php.
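
To check the decompress-and-parse step without touching a live site, something like the following could be used. This is only a sketch: locsFromGzString is a hypothetical helper, not part of Sphider, the URLs are placeholders, and it uses gzdecode() on data already in memory rather than gzopen() on a URL. Like the function above, it needs the zlib extension:

```php
<?php
/**
 * Hypothetical helper (not part of Sphider): extract the <loc> values
 * from gzip-compressed sitemap data held in a string.
 *
 * @param string $gz_data Gzip-compressed sitemap XML
 *
 * @return array List of <loc> values, empty if the data cannot be read
 */
function locsFromGzString($gz_data)
{
    $locs = array();
    $xml_text = @gzdecode($gz_data); // false if the data is not valid gzip
    if ($xml_text === false) {
        return $locs;
    }
    $xml = simplexml_load_string($xml_text);
    if ($xml === false) {
        return $locs;
    }
    foreach ($xml as $entry) {
        $locs[] = trim((string) $entry->loc);
    }
    return $locs;
}

// Build a small compressed sitemap in memory, then read it back.
$plain = '<urlset><url><loc>https://www.example.com/a.html</loc></url>'
       . '<url><loc>https://www.example.com/b.html</loc></url></urlset>';
$locs = locsFromGzString(gzencode($plain));
print_r($locs);
```

Input that is not gzip data at all yields an empty array here instead of a parse error.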

Re: Question about the sitemap - THANKS

Posted: Fri Dec 15, 2023 6:25 am
by chef-olaf
Many thanks for the fast and good support says

olaf

Re: Question about the sitemap

Posted: Fri Dec 15, 2023 4:37 pm
by captquirk
:D Thanks!