Index from sitemaps when sitemap is a list of sitemaps

captquirk

Sphider can index from a sitemap if it is a simple sitemap, that is, a flat list of page URLs. If the initial sitemap is instead a list of links to additional sitemaps (a sitemap index, popular with larger websites), indexing doesn't work. This mod shows promise for correcting that. There may be situations that keep it from working, but only testing will identify them.
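
For anyone unsure which kind of sitemap a site serves: under the sitemaps.org protocol, a simple sitemap has a <urlset> root element and a sitemap index has a <sitemapindex> root. Below is a rough sketch of telling them apart. The helper name isSitemapIndex() and the example.com URLs are made up for illustration and are not part of Sphider.

```php
<?php
// Sample data only: the example.com URLs are placeholders.
$index_xml = <<<'XML'
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
</sitemapindex>
XML;

$simple_xml = <<<'XML'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/about.html</loc></url>
</urlset>
XML;

// Hypothetical helper (not a Sphider function): returns true when the
// root element of the given XML text is <sitemapindex>.
function isSitemapIndex(string $xml): bool
{
    $doc = simplexml_load_string($xml);
    return $doc !== false && $doc->getName() === 'sitemapindex';
}
```

Calling isSitemapIndex($index_xml) should report an index, while isSitemapIndex($simple_xml) should not.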

In spiderfuncs.php, change the function getSiteMap() as follows:

Code:

function getSiteMap($input_file)
{
    $links = '';
    $sitemap = simplexml_load_file($input_file);
    if ($sitemap !== false) {  // simplexml_load_file() returns false on failure
        $links = array ();
        foreach ($sitemap as $url) {
// START MOD PART 1
            // For some reason, wlwmanifest.xml interferes with the recursion,
            // so skip it.
            if (preg_match("/wlwmanifest\.xml$/i", $url->loc)) {
                continue;
            }
            // An entry ending in .xml is treated as a link to another sitemap:
            // load it and collect its URLs instead of the link itself.
            if (preg_match("/\.xml$/i", $url->loc)) {
                $submap = $url->loc;
                foreach ($submap as $input2) {
                    $sitemap2 = simplexml_load_file($input2);
                    if ($sitemap2 !== false) {
                        foreach ($sitemap2 as $url2) {
                            $links[] = ($url2->loc);
                        }
                    }
                }
            } else {
// END MOD PART 1
                $links[] = ($url->loc);
// START MOD PART 2
            }
// END MOD PART 2
        }
        $links = explode(",", (implode(",", $links)));
    }
    return $links;
}
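
The mod above follows one level of nested sitemaps and decides by the .xml extension. If testing turns up sites where that isn't enough, one possible alternative is to check the root element name and recurse. This is only a sketch, not tested against Sphider itself: collectSitemapLinks(), the $load callback and the depth limit of 3 are all assumptions made for the example.

```php
<?php
// Hypothetical alternative, not part of Sphider.
// $load is any callable that takes a sitemap location and returns its
// XML text (or false on failure), e.g. a wrapper around file_get_contents().
function collectSitemapLinks(string $location, callable $load, int $depth = 0): array
{
    // Assumed safety limit so sitemaps that reference each other cannot loop forever.
    if ($depth > 3) {
        return [];
    }
    $xml = $load($location);
    $doc = is_string($xml) ? simplexml_load_string($xml) : false;
    if ($doc === false) {
        return [];
    }
    $links = [];
    if ($doc->getName() === 'sitemapindex') {
        // Sitemap index: each <sitemap><loc> points at another sitemap.
        foreach ($doc->sitemap as $entry) {
            $child = trim((string) $entry->loc);
            $links = array_merge($links, collectSitemapLinks($child, $load, $depth + 1));
        }
    } else {
        // Simple sitemap: each <url><loc> is a page to index.
        foreach ($doc->url as $entry) {
            $links[] = trim((string) $entry->loc);
        }
    }
    return $links;
}
```

Because each link is cast to a string as it is collected, URLs that contain commas come through intact.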