Question about the sitemap

chef-olaf
Posts: 13
Joined: Wed Dec 06, 2023 7:38 am

Question about the sitemap

Post by chef-olaf »

I have the following question / problem.

If a domain has a sitemap.xml that redirects to sitemap_index.xml,
and the index also references /sitemap-1.xml.gz, only the domain itself is indexed and no other pages.

Since I index via the CLI with the options -r -f -s, this is a small problem for me.

With sitemap-based indexing switched off, over 80 pages are indexed.


Sphider 5.4.1

Operating system DEBIAN 11
Webserver Apache 2.4.56-1~deb11u2 with Nginx 1.24.0.1-v.debian.11+p18.0.57.0+t231106.2014
php83 8.3.0-debian.11.231124.0933
CPU Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (6 core(s))
Memory 16GB

Kind regards

Olaf
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Question about the sitemap

Post by captquirk »

A short history of Sphider...
Originally, Sphider read ONLY a sitemap.xml file. This worked fine on small websites, which Sphider was intended for.

I did expand this so that Sphider can use sitemap.xml as an index and will now read the further xml files it lists.

HOWEVER, I did NOT take xml.gz into consideration. So the gz files are indeed being ignored.

Also, Sphider does not handle redirects very well, so sitemap_index.xml is probably not even being read.

Something like:

Code:

  <sitemapindex>
    <sitemap>
      <loc>https://www.example.com/sitemap_index.xml</loc>
    </sitemap>
  </sitemapindex>
in the main sitemap.xml would work, but then the gz files would still be an issue.
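
As a rough illustration of how an index of this kind can be walked, the sketch below parses a sitemap index from a string and collects the child sitemap URLs. It assumes the sitemaps.org layout, where each <loc> sits inside a <sitemap> element. The helper name listSubSitemaps and the example URLs are hypothetical and not part of Sphider.

```php
<?php
/**
 * Collect the <loc> values from a sitemap index document.
 * Hypothetical helper for illustration only; not part of Sphider.
 *
 * @param string $xml Sitemap index as an XML string
 *
 * @return array List of sub-sitemap URLs (empty if the XML cannot be parsed)
 */
function listSubSitemaps(string $xml): array
{
    $index = simplexml_load_string($xml);
    if ($index === false) {
        return [];
    }
    $locs = [];
    foreach ($index as $entry) {
        // Each child element (<sitemap>) is expected to carry one <loc>
        $loc = trim((string) $entry->loc);
        if ($loc !== '') {
            $locs[] = $loc;
        }
    }
    return $locs;
}

// Example input: an index that points at one plain and one gzipped sitemap
$indexXml = <<<XML
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap_index.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-1.xml.gz</loc></sitemap>
</sitemapindex>
XML;

$subSitemaps = listSubSitemaps($indexXml);
print_r($subSitemaps);
```

Each URL returned this way would still need to be fetched and parsed in turn, and the .xml.gz entry would need decompressing first.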

While following redirects may be beyond what Sphider can do (without a lot of rework), decompressing gz files does lend itself to a future enhancement. If you are interested, the relevant code would be in spiderfuncs.php about line 425.
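
A minimal sketch of that decompression step, assuming PHP's zlib extension is available and the compressed sitemap has already been fetched into a string. The helper name linksFromGzSitemap and the sample URLs are hypothetical; the gzencode() call only manufactures test input so the example runs without a network connection.

```php
<?php
/**
 * Extract page URLs from a gzip-compressed sitemap held in memory.
 * Hypothetical helper for illustration only; not part of Sphider.
 *
 * @param string $gzData Raw bytes of a .xml.gz sitemap
 *
 * @return array List of URLs (empty if the data is not valid gzip or XML)
 */
function linksFromGzSitemap(string $gzData): array
{
    // gzdecode() returns false (and raises a warning) on invalid gzip data
    $xml = @gzdecode($gzData);
    if ($xml === false) {
        return [];
    }
    $sitemap = simplexml_load_string($xml);
    if ($sitemap === false) {
        return [];
    }
    $links = [];
    foreach ($sitemap as $url) {
        $loc = trim((string) $url->loc);
        if ($loc !== '') {
            $links[] = $loc;
        }
    }
    return $links;
}

// Build a small compressed sitemap in memory to stand in for a download
$plainXml = '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<url><loc>https://www.example.com/page-1.html</loc></url>'
    . '<url><loc>https://www.example.com/page-2.html</loc></url>'
    . '</urlset>';
$gzData = gzencode($plainXml);

$links = linksFromGzSitemap($gzData);
print_r($links);
```

Decoding the whole file in memory avoids having to pick a read length in advance, at the cost of holding the full sitemap in RAM.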
_____________________
(And just as a FYI, a fix for the "Index decimals" bug - and improvements - is being tested. Thanks! Great catch!)
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Question about the sitemap

Post by captquirk »

Tentative enhancement to allow reading xml.gz files:

Code:

/**
 * Read the sitemap.xml file on the server
 *
 * @param string $input_file Sitemap file name
 *
 * @return array $links Array of links found in sitemap.xml
 */
function getSiteMap($input_file)
{
    $links = '';
    $sitemap = simplexml_load_file($input_file);
    // simplexml_load_file() returns false when the file cannot be loaded
    if ($sitemap !== false) {
        $links = array ();
        foreach ($sitemap as $url) {
            // For some reason, wlwmanifest.xml interferes with the recursion
            // Therefore, let's ignore it
            if (preg_match("/wlwmanifest\.xml$/i", $url->loc ?? '')) {
                continue;
            }
            if (preg_match("/\.xml\.gz$/i", $url->loc ?? '')) {
                // Compressed sub-sitemap: decompress it, then parse the XML
                foreach ($url->loc as $input3) {
                    $zd = gzopen((string) $input3, "r");
                    if ($zd === false) {
                        continue;
                    }
                    // Read to end of file; a single gzread() call would cut
                    // off any sitemap longer than its length argument
                    $contents = '';
                    while (!gzeof($zd)) {
                        $chunk = gzread($zd, 100000);
                        if ($chunk === false || $chunk === '') {
                            break;
                        }
                        $contents .= $chunk;
                    }
                    gzclose($zd);
                    $sitemap_gz = simplexml_load_string($contents);
                    if ($sitemap_gz !== false) {
                        foreach ($sitemap_gz as $url_gz) {
                            $links[] = ($url_gz->loc);
                        }
                    }
                }
            } elseif (preg_match("/\.xml$/i", $url->loc ?? '')) {
                // Uncompressed sub-sitemap: load and parse it directly
                foreach ($url->loc as $input2) {
                    $sitemap2 = simplexml_load_file((string) $input2);
                    if ($sitemap2 !== false) {
                        foreach ($sitemap2 as $url2) {
                            $links[] = ($url2->loc);
                        }
                    }
                }
            } elseif (!preg_match("/\.gz$/i", $url->loc ?? '')) {
                // Ordinary page URL (any other .gz file is skipped)
                $links[] = ($url->loc);
            }
        }
        // Flatten the SimpleXMLElement objects into plain strings
        $links = explode(",", (implode(",", $links)));
    }
    return $links;
}
This will work for both the full version of Sphider and SphiderLite. This revised function replaces the current getSiteMap function in spiderfuncs.php.
chef-olaf
Posts: 13
Joined: Wed Dec 06, 2023 7:38 am

Re: Question about the sitemap - THANKS

Post by chef-olaf »

Many thanks for the fast and good support!

Olaf
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Question about the sitemap

Post by captquirk »

:D Thanks!