All title tags are showing as Untitled document

kraisor
Posts: 4
Joined: Wed Dec 16, 2020 4:03 pm

All title tags are showing as Untitled document

Post by kraisor »

Hey there,

I'm trying to index apnews.com and all of the title tags are showing as "Untitled document". I'm not sure what the issue is; everything else seems to work just fine.

Any help would be appreciated, thanks!
Attachment: sphider.JPG (72.82 KiB)
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: All title tags are showing as Untitled document

Post by captquirk »

Let me look at the issue. I'll post what I find.

I'LL BE BACK!

Post by captquirk »

Issue is confirmed! The "title" column of the links table is not being populated for apnews.com.
Now to figure out WHY!

UPDATE: Sphider is unable to find the title tags (<title> and </title>). In the case of apnews.com, there are TWO causes for this.
The first is a simple code fix, but the second is proving more elusive.
Viewing the page source, it is very obvious the title tags ARE present! The opening tag is not being recognized because of the attribute added to it. This is the easy fix in Sphider. Taking just the top of the html and running it through the fixed spider confirms that titles are found.
BUT - when scanning the entire page of html, the tags are not "seen". SOMETHING in the page is interfering.

Stand by...

Post by kraisor »

That's a weird issue, appreciate you looking into this!

Post by captquirk »

Weird indeed! There is SOMETHING in the page that prevents Sphider from seeing the title tags.
If I copy the source html into a local file and scan it with Sphider, the tags are not found. If I delete everything from the css down and save that fragment, title tags are found and read.
I have been unable to determine WHERE in the source html things go awry. When I think I am closing in, something changes.
Very sure it is NOT the css portion....

I'll keep looking and post back when I find something definitive.
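
For anyone trying to narrow down a similar failure: preg_match() returns 0 when the pattern simply does not match, but returns false when the regex engine itself gives up (for example, when a backtrack limit is exceeded). A minimal sketch of a check that tells the two cases apart follows. The helper name diagnose_title_match() is made up for illustration and is not part of Sphider.

```php
<?php
// Hypothetical helper, not part of Sphider: reports why a title lookup failed.
function diagnose_title_match(string $pattern, string $html): string
{
    $result = preg_match($pattern, $html, $regs);
    if ($result === 1) {
        // The pattern matched; $regs[1] holds the text between the tags.
        return 'found: ' . trim($regs[1]);
    }
    if ($result === 0) {
        // The pattern ran to completion and found nothing.
        return 'no match';
    }
    // false: PCRE aborted; preg_last_error() returns one of the PREG_*_ERROR codes.
    return 'PCRE error code ' . preg_last_error();
}
```

Running this against a saved copy of the page and against the trimmed fragment should show whether the full page is a plain non-match or an engine error.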

Post by captquirk »

TO OTHER USERS WHO MAY EXPERIENCE THE SAME PROBLEM OF TITLES BEING BLANK:
Sphider looks for the title tags, and the script (as of 3.5.2 and Lite 1.2.2) expects the form <title> or <title >. If the tag has any attributes, the title won't be found. This is an easy fix:
Edit spiderfuncs.php, line 856, from:
if (preg_match("@<title *>(.*?)<\/title*>@si", $file, $regs)) {
To:
if (preg_match("@<title.*>(.*?)<\/title*>@si", $file, $regs)) {
For the specific site this thread involves, the missing titles are a TWO-part problem, and this suggestion fixes only one part. Titles will not appear in the search without this fix, but this fix ALONE will not make the titles appear! There is a SECOND problem, and research is ongoing...
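
To see the difference between the two patterns on small inputs, here is a minimal sketch. The sample markup is made up; the data-rh attribute is modeled on the kind of title tag apnews.com uses.

```php
<?php
// Hypothetical sample markup, small enough that both patterns run to completion.
$plain    = '<html><head><title>Plain title</title></head></html>';
$withAttr = '<html><head><title data-rh="true">AP News</title></head></html>';

$original = "@<title *>(.*?)<\/title*>@si"; // pattern as of 3.5.2 / Lite 1.2.2
$patched  = "@<title.*>(.*?)<\/title*>@si"; // pattern with the edit above

$results = [];
foreach (['plain' => $plain, 'withAttr' => $withAttr] as $name => $html) {
    // Store the captured title, or null when the pattern does not match.
    $results[$name] = [
        'original' => preg_match($original, $html, $m) === 1 ? $m[1] : null,
        'patched'  => preg_match($patched, $html, $m) === 1 ? $m[1] : null,
    ];
}

var_dump($results);
```

The original pattern returns nothing for the tag with an attribute, while the patched one captures the title from both snippets.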

Post by captquirk »

SPECIFICALLY FOR kraisor:
I found a way to capture the titles!
The regex normally used gets mangled somewhere. It is a valid pattern, but at some point it stops working.
I HAVE FOUND A REGEX WHICH DOES WORK!
This is specifically for apnews.com and will not work for others.

Edit admin/spiderfuncs.php, line 856, to read:
if (preg_match("@<title data-rh=\"true\">(.*?)<\/title*>@si", $file, $regs)) {
Most likely, after making this change you will need to go to the site options and "Clear site", then begin a new index. The reason is that a re-index only picks up pages whose content has changed.

Hope this works as well for you as it did in my tests. No clue why the original wildcard pattern is not working for you, but this should be a good workaround.

Post by captquirk »

A definitive fix has been found!

It seems that under certain circumstances, the regex on line 856 of spiderfuncs.php entered a runaway state. The way to fix this has FINALLY been found.

spiderfuncs.php, line 856 should read:
if (preg_match("@<title.*?>(.*?)<\/title.*?>@si", $file, $regs)) {
Anyone experiencing cases of missing titles (when there ARE in fact title tags present) will benefit from this modification. The code change will be reflected in later versions of Sphider.
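
A minimal sketch of the final pattern in use, for anyone who wants to test it outside the spider. The wrapper extract_title() is hypothetical and not a Sphider function, and the "Untitled document" fallback is assumed here only because that is the text the search results showed. The large filler page is made up to stand in for a big real-world page.

```php
<?php
// Hypothetical wrapper around the pattern above; not part of Sphider.
function extract_title(string $file): string
{
    // Lazy quantifiers (.*?) stop at the first ">" and the first closing tag.
    if (preg_match("@<title.*?>(.*?)<\/title.*?>@si", $file, $regs) === 1) {
        $title = trim($regs[1]);
        if ($title !== '') {
            return $title;
        }
    }
    // Assumed fallback, matching the text seen in the search results.
    return 'Untitled document';
}

// Made-up page: a title tag with an attribute, followed by over 1 MB of markup.
$big = '<html><head><title data-rh="true">AP News</title></head><body>'
     . str_repeat('<div class="x">filler</div>', 50000)
     . '</body></html>';

echo extract_title($big), "\n";
```

If the runaway state came from the greedy .* scanning to the end of a large page and then backtracking (a plausible explanation, but not one confirmed in this thread), the lazy form avoids it because it never needs to look past the first ">" after the opening tag.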