All title tags are showing as Untitled document

Posted: Wed Dec 16, 2020 4:04 pm
by kraisor
Hey there,

I'm trying to index apnews.com and all of the title tags are showing as "Untitled document". I'm not sure what the issue is, everything else seems just fine.

Any help would be appreciated, thanks!

Re: All title tags are showing as Untitled document

Posted: Wed Dec 16, 2020 5:05 pm
by captquirk
Let me look at the issue. I'll post what I find.

I'LL BE BACK!

Re: All title tags are showing as Untitled document

Posted: Wed Dec 16, 2020 5:47 pm
by captquirk
Issue is confirmed! The "title" column of the links table is not being populated for apnews.com.
Now to figure out WHY!

UPDATE: Sphider is unable to find the title tags (<title> and </title>). In the case of apnews.com, there are TWO causes for this.
The first is a simple code fix, but the second is being more elusive.
Viewing the page source, it is very obvious the title tags ARE present! The opening tag is not being recognized because of the attribute added to it. This is the easy fix in Sphider. Taking just the top of the html and running it through the fixed spider confirms that titles are found.
BUT - when scanning the entire page of html, the tags are not "seen". SOMETHING in the page is interfering.

Stand by...

Re: All title tags are showing as Untitled document

Posted: Wed Dec 16, 2020 11:07 pm
by kraisor
That's a weird issue, appreciate you looking into this!

Re: All title tags are showing as Untitled document

Posted: Thu Dec 17, 2020 1:25 am
by captquirk
Weird indeed! There is SOMETHING in the page that prevents Sphider from seeing the title tags.
If I copy the source html into a local file and scan it with Sphider, the tags are not found. If I delete everything from the css down and save that fragment, title tags are found and read.
I have been unable to determine WHERE in the source html things go awry. When I think I am closing in, something changes.
Very sure it is NOT the css portion....

I'll keep looking and post back when I find something definitive.

Re: All title tags are showing as Untitled document

Posted: Thu Dec 17, 2020 4:30 pm
by captquirk
TO OTHER USERS WHO MAY EXPERIENCE THE SAME PROBLEM OF TITLES BEING BLANK:
Sphider looks for the title tags, and the script (as of 3.5.2 and Lite 1.2.2) is looking for the form <title> or <title >. If you happen to have any attributes in the tag, the title won't be found. This is an easy fix:
Edit spiderfuncs.php, line 856, from:
if (preg_match("@<title *>(.*?)<\/title*>@si", $file, $regs)) {
To:
if (preg_match("@<title.*>(.*?)<\/title*>@si", $file, $regs)) {
For the specific site this thread involves, the missing titles are a TWO part problem. This suggestion fixes only one of them. Titles will not appear in the search without this fix, but this fix ALONE will not make the titles appear! There is a SECOND problem and research is ongoing...
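
For anyone who wants to see the difference before editing spiderfuncs.php, here is a minimal standalone sketch. The sample HTML string and the variable names are made up for illustration; only the two patterns come from the post above. They are written as single-quoted strings here, which should give the same pattern as the double-quoted originals.

```php
<?php
// Hypothetical sample markup; the data-rh attribute mirrors the kind of tag discussed in this thread.
$file = '<html><head><title data-rh="true">AP News</title></head><body></body></html>';

// Original pattern: only spaces are allowed between "<title" and ">".
$original = '@<title *>(.*?)<\/title*>@si';

// Widened pattern from the post above: anything may follow "<title".
$widened = '@<title.*>(.*?)<\/title*>@si';

// preg_match() returns 1 on a match, 0 on no match, false on an engine error.
$foundOriginal = preg_match($original, $file, $regsOriginal);
$foundWidened  = preg_match($widened, $file, $regsWidened);

echo 'original: ', $foundOriginal === 1 ? $regsOriginal[1] : 'no title found', "\n";
echo 'widened:  ', $foundWidened === 1 ? $regsWidened[1] : 'no title found', "\n";
```

On a short fragment like this the widened pattern finds the title and the original does not. That matches the "top of the html" test described earlier, but it says nothing about how the pattern behaves on a full page.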

Re: All title tags are showing as Untitled document

Posted: Thu Dec 17, 2020 7:57 pm
by captquirk
SPECIFICALLY FOR kraisor:
I found a way to capture the titles!
The regex normally used gets mangled somewhere. It is a valid script, but at some point it stops working.
I HAVE FOUND A REGEX WHICH DOES WORK!
This is specifically for apnews.com and will not work for others.

Edit admin/spiderfuncs.php, line 856, to read:
if (preg_match("@<title data-rh=\"true\">(.*?)<\/title*>@si", $file, $regs)) {
Most likely, after making this change you will need to go to the site options and "Clear site", then begin a new index. The reason is that a re-index will only work if page content has changed.

Hope this works as well for you as it did in my tests. No clue why the original wildcard script is not working for you, but this should be a good workaround.
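
A small sketch of why the site-specific pattern above will not work for other sites: it only matches when the opening tag is exactly `<title data-rh="true">`. The two sample strings are invented for illustration, and the pattern is written single-quoted so the inner quotes need no escaping.

```php
<?php
// Pattern from the post above, as a single-quoted string.
$pattern = '@<title data-rh="true">(.*?)<\/title*>@si';

// Hypothetical sample tags, made up for illustration.
$apStyle = '<head><title data-rh="true">Example headline</title></head>';
$plain   = '<head><title>Example headline</title></head>';

// The attribute is present verbatim, so this one matches.
$matchesApStyle = preg_match($pattern, $apStyle, $regs);

// A plain <title> has no data-rh attribute, so this one does not.
$matchesPlain = preg_match($pattern, $plain);

echo $matchesApStyle === 1 ? $regs[1] : 'Untitled document', "\n";
echo $matchesPlain === 1 ? 'matched' : 'no match', "\n";
```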

Re: All title tags are showing as Untitled document

Posted: Wed Dec 23, 2020 9:55 pm
by captquirk
A definitive fix has been found!

It seems that under certain circumstances, the regex on line 856 of spiderfuncs.php entered a runaway state. The way to fix this has FINALLY been found.

spiderfuncs.php, line 856 should read:
if (preg_match("@<title.*?>(.*?)<\/title.*?>@si", $file, $regs)) {
Anyone experiencing cases of missing titles (when there ARE in fact title tags present) will benefit from this modification. The code change will be reflected in later versions of Sphider.
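
For testing outside of Sphider, the pattern above can be wrapped in a small helper. This is only a sketch: `extract_title` is a hypothetical name and not part of Sphider, and the error handling is my own addition. My reading of the "runaway state" (an assumption, the thread does not spell it out) is that the greedy `.*` in the earlier pattern had to backtrack across the whole page, and on a large page PCRE can give up, in which case preg_match() returns false rather than 0. Checking for false makes that case visible.

```php
<?php
// Hypothetical helper, not part of Sphider: wraps the pattern from the post above.
function extract_title(string $file): ?string
{
    // Lazy quantifiers (.*?) stop at the first ">" and the first "</title".
    $result = preg_match('@<title.*?>(.*?)<\/title.*?>@si', $file, $regs);

    if ($result === false) {
        // false means the regex engine reported an error, not just "no match".
        error_log('Title regex failed, preg_last_error() = ' . preg_last_error());
        return null;
    }

    // 1 means a match was found; 0 means the page has no title tag.
    return $result === 1 ? trim($regs[1]) : null;
}

// Made-up sample input for illustration.
$title = extract_title('<head><title data-rh="true">AP News</title></head>');
echo $title ?? 'Untitled document', "\n";
```

If the helper returns null, the error log tells you whether the title was genuinely missing or the pattern failed to run.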