I have the following question / problem.
If a domain has a sitemap.xml that redirects to sitemap_index.xml,
and there is also a reference to /sitemap-1.xml.gz, only the domain itself is indexed and no other pages.
Since I index via the CLI with the options -r -f -s, this is a small problem for me.
With the option to index using a sitemap (if available) switched off, over 80 pages are indexed.
Sphider 5.4.1
Operating system DEBIAN 11
Webserver Apache 2.4.56-1~deb11u2 with Nginx 1.24.0.1-v.debian.11+p18.0.57.0+t231106.2014
php83 8.3.0-debian.11.231124.0933
CPU Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (6 core(s))
Memory 16GB
Kind regards
Olaf
Question about the sitemap
Re: Question about the sitemap
A short history of Sphider...
Originally, Sphider read ONLY a sitemap.xml file. This worked fine on small websites, which Sphider was intended for.
I did expand this so that Sphider could use sitemap.xml as an index, and now will accept further xml files.
HOWEVER, I did NOT take xml.gz into consideration. So the gz files are indeed being ignored.
Also, Sphider does not handle redirects very well, so sitemap_index.xml is probably not even being read.
Something like:
Code: Select all
<sitemapindex>
  <sitemap>
    <loc>https://www.example.com/sitemap_index.xml</loc>
  </sitemap>
</sitemapindex>
in the main sitemap.xml would work, but then the gz files would still be an issue.
While the handling of redirects may be beyond what Sphider can handle (without a lot of rework), decompressing gz files does lend itself to a future enhancement. If you are interested, the relevant code is in spiderfuncs.php at about line 425.
_____________________
(And just as an FYI, a fix for the "Index decimals" bug - and improvements - is being tested. Thanks! Great catch!)
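For anyone who wants to experiment with the redirect side, here is a minimal sketch of one possible approach, not something Sphider does today. It assumes you already have the raw response header lines for sitemap.xml (for example in the shape PHP's get_headers() returns them) and simply picks the last Location header. The function name finalSitemapUrl is hypothetical, and relative Location values are not resolved.

```php
<?php
/**
 * Hypothetical helper (not part of Sphider): return the target of the
 * last Location header in a list of raw response header lines, or the
 * original URL if no redirect was seen.
 */
function finalSitemapUrl(string $url, array $headerLines): string
{
    $final = $url;
    foreach ($headerLines as $line) {
        // Header names are case-insensitive; keep the most recent target
        if (preg_match('/^Location:\s*(\S+)/i', $line, $m)) {
            $final = $m[1];
        }
    }
    return $final;
}

// Sample header lines for a sitemap.xml that redirects once (made-up data)
$sampleHeaders = array(
    'HTTP/1.1 301 Moved Permanently',
    'Location: https://www.example.com/sitemap_index.xml',
    'HTTP/1.1 200 OK',
    'Content-Type: application/xml',
);

echo finalSitemapUrl('https://www.example.com/sitemap.xml', $sampleHeaders), "\n";
```

The resolved URL could then be handed to the sitemap reader in place of the original one.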
Re: Question about the sitemap
Tentative enhancement to allow reading xml.gz files:
This will work for both the full version of Sphider and SphiderLite. This revised function replaces the current getSiteMap function in spiderfuncs.php.
Code: Select all
/**
 * Read the sitemap.xml file on the server
 *
 * @param string $input_file Sitemap file name
 *
 * @return array|string $links Array of links found in sitemap.xml,
 *                      or '' if the sitemap could not be loaded
 */
function getSiteMap($input_file)
{
    $links = '';
    $sitemap = simplexml_load_file($input_file);
    if ($sitemap !== false) {
        $links = array();
        foreach ($sitemap as $url) {
            $loc = (string) ($url->loc ?? '');
            // For some reason, wlwmanifest.xml interferes with the recursion
            // Therefore, let's ignore it
            if (preg_match("/wlwmanifest\.xml$/i", $loc)) {
                continue;
            }
            if (preg_match("/\.xml\.gz$/i", $loc)) {
                // Compressed sub-sitemap: decompress it, then parse the XML
                foreach ($url->loc as $input3) {
                    $zd = gzopen((string) $input3, "r");
                    if ($zd === false) {
                        continue;
                    }
                    // Read until EOF so large sitemaps are not cut off
                    $contents = '';
                    while (!gzeof($zd)) {
                        $chunk = gzread($zd, 100000);
                        if ($chunk === false || $chunk === '') {
                            break;
                        }
                        $contents .= $chunk;
                    }
                    gzclose($zd);
                    $sitemap_gz = simplexml_load_string($contents);
                    if ($sitemap_gz !== false) {
                        foreach ($sitemap_gz as $url_gz) {
                            $links[] = (string) $url_gz->loc;
                        }
                    }
                }
            }
            if (preg_match("/\.xml$/i", $loc)) {
                // Uncompressed sub-sitemap
                foreach ($url->loc as $input2) {
                    $sitemap2 = simplexml_load_file((string) $input2);
                    if ($sitemap2 !== false) {
                        foreach ($sitemap2 as $url2) {
                            $links[] = (string) $url2->loc;
                        }
                    }
                }
            } else {
                // Plain page URL; skip anything ending in .gz
                if (!preg_match("/\.gz$/i", $loc)) {
                    $links[] = $loc;
                }
            }
        }
    }
    return $links;
}
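If you want to try the decompression step on its own, without a live server, the following is a minimal sketch that works on data already in memory. It assumes the zlib extension is loaded; the helper name locsFromGzSitemap and the sample URLs are made up for illustration and are not part of Sphider.

```php
<?php
/**
 * Hypothetical helper (not part of Sphider): decompress gzip-encoded
 * sitemap data and return its <loc> values as plain strings.
 */
function locsFromGzSitemap(string $gzData): array
{
    $xml = gzdecode($gzData);
    if ($xml === false) {
        return array();          // not valid gzip data
    }
    $sitemap = simplexml_load_string($xml);
    if ($sitemap === false) {
        return array();          // not well-formed XML
    }
    $locs = array();
    foreach ($sitemap as $entry) {
        $loc = trim((string) $entry->loc);
        if ($loc !== '') {
            $locs[] = $loc;
        }
    }
    return $locs;
}

// Build a small sitemap and compress it to stand in for sitemap-1.xml.gz
$sampleXml = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<url><loc>https://www.example.com/</loc></url>'
    . '<url><loc>https://www.example.com/about.html</loc></url>'
    . '</urlset>';

print_r(locsFromGzSitemap(gzencode($sampleXml)));
```

Unlike the gzopen/gzread approach, gzdecode needs the whole compressed file in memory first, so it suits small to medium sitemaps.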
Re: Question about the sitemap - THANKS
Many thanks for the fast and good support!
Olaf
Re: Question about the sitemap
Thanks!