Index from sitemaps when sitemap is a list of sitemaps

Posted: Mon Feb 28, 2022 10:33 pm
by captquirk
Sphider can index from a sitemap only if it is a simple sitemap, that is, a flat list of page URLs. If the initial sitemap is instead a list of links to additional sitemaps (a sitemap index, popular with larger websites), indexing doesn't work. This mod shows promise for correcting that. Some situations may still keep it from working, but only testing will identify them.

In spiderfuncs.php, change the function getSiteMap() as follows:

Code: Select all

function getSiteMap($input_file)
{
    $links = '';
    $sitemap = simplexml_load_file($input_file);
    // simplexml_load_file() returns false when the file cannot be loaded or parsed
    if ($sitemap !== false) {
        $links = array ();
        foreach ($sitemap as $url) {
// START MOD PART 1
            // For some reason, wlwmanifest.xml interferes with the recursion,
            // so let's ignore it
            if (preg_match("/wlwmanifest\.xml$/i", $url->loc)) {
                continue;
            }
            if (preg_match("/\.xml$/i", $url->loc)) {
                $submap = $url->loc;
                foreach ($submap as $input2) {
                    $sitemap2 = simplexml_load_file($input2);
                    if ($sitemap2 !== false) {
                        foreach ($sitemap2 as $url2) {
                            $links[] = ($url2->loc);
                        }
                    }
                }
            } else {
// END MOD PART 1
                $links[] = ($url->loc);
// START MOD PART 2
            }
// END MOD PART 2
        }
        // Convert the SimpleXMLElement entries to plain strings
        // (also safe for URLs that contain commas)
        $links = array_map('strval', $links);
    }
    return $links;
}
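
For anyone testing this, here is a minimal sketch of how the two kinds of sitemap differ once SimpleXML has parsed them. It is not part of Sphider: collectLocs() is a hypothetical helper, and the example.com URLs are made up. It assumes the standard sitemaps.org layout, where a sitemap index has a sitemapindex root with sitemap/loc children and a simple sitemap has a urlset root with url/loc children.

```php
<?php
// Hypothetical helper (not part of Sphider): parse sitemap XML held in a
// string and report the root element name plus every <loc> value found.
function collectLocs(string $xml): array
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        // Parsing failed; report that instead of guessing
        return ['type' => 'invalid', 'locs' => []];
    }
    $locs = [];
    foreach ($doc as $entry) {            // each <sitemap> or <url> child
        $locs[] = trim((string) $entry->loc);
    }
    return ['type' => $doc->getName(), 'locs' => $locs];
}

// A sitemap index: its <loc> values point at further sitemaps
$indexXml = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>'
    . '<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>'
    . '</sitemapindex>';

// A simple sitemap: its <loc> values are the pages themselves
$simpleXml = '<?xml version="1.0" encoding="UTF-8"?>'
    . '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    . '<url><loc>https://example.com/about/</loc></url>'
    . '</urlset>';

$indexResult  = collectLocs($indexXml);
$simpleResult = collectLocs($simpleXml);
echo $indexResult['type'], ' / ', $simpleResult['type'], "\n";
```

Checking the root element name this way could be an alternative to matching on a .xml extension, since nested sitemaps do not always end in .xml. That is only a suggestion and has not been tested inside Sphider.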