An issue with PDF indexing.

Come here for help or to post comments on Sphider
vradova
Posts: 8
Joined: Fri Sep 01, 2023 1:07 pm

Re: An issue with PDF indexing.

Post by vradova »

I use PHP version 8.1.22
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: An issue with PDF indexing.

Post by captquirk »

Okay, I got CentOS up and working with Sphider 5.2.1.
Log file:
Spidering http://localhost/
1. Retrieving: http://localhost/ at 01:24:59.
Size of page: 10.78kb. Starting indexing at 01:24:59.
Indexed
Links found: 5. New links: 5
2. Retrieving: http://localhost/ at 01:25:00.
already in database
3. Retrieving: http://localhost/1974-1.pdf at 01:25:00.
Size of page: 0.00kb. Starting indexing at 01:25:00.
Indexed
Links found: 0. New links: 0
4. Retrieving: http://localhost/1995-2.pdf at 01:25:03.
Size of page: 0.00kb. Starting indexing at 01:25:04.
Indexed
Links found: 0. New links: 0
5. Retrieving: http://localhost/2021-4.pdf at 01:25:15.
Size of page: 0.00kb. Starting indexing at 01:25:15.
Indexed
Links found: 0. New links: 0
6. Retrieving: http://localhost/manual at 01:25:21.
Unreachable: http 404
Links found: 0. New links: 0

Completed at 01:25:21.
It reports each PDF as having a page size of 0.00kb, while my Sphider 5.3.0 in Ubuntu has REAL sizes. Also, looking at the size column for the links table in the CentOS database, size is reported as 0. But the pages DID index! I can enter a search term, such as "Algonquin", and get the proper result! The fulltxt column and title column of the links table are also appropriately populated.
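
For anyone wanting to spot the same symptom in a longer log, here is a minimal Python sketch (not part of Sphider) that scans spider log text for pages reported as 0.00kb. It assumes the log layout shown above ("N. Retrieving: URL at TIME." followed by a "Size of page:" line); the function name zero_size_pages and the sample text are made up for illustration.

```python
import re

# Patterns assume the log layout shown in this thread.
RETRIEVE = re.compile(r"^\d+\.\s+Retrieving:\s+(\S+)\s+at\s")
SIZE = re.compile(r"^Size of page:\s+([\d.]+)kb")


def zero_size_pages(log_text):
    """Return the URLs whose 'Size of page' line reports 0.00kb."""
    flagged = []
    current_url = None
    for raw in log_text.splitlines():
        line = raw.strip()
        retrieved = RETRIEVE.match(line)
        if retrieved:
            # Remember which URL the following lines belong to.
            current_url = retrieved.group(1)
            continue
        size = SIZE.match(line)
        if size and current_url and float(size.group(1)) == 0.0:
            flagged.append(current_url)
    return flagged


# Sample lines copied from the log above.
sample = """\
1. Retrieving: http://localhost/ at 01:24:59.
Size of page: 10.78kb. Starting indexing at 01:24:59.
3. Retrieving: http://localhost/1974-1.pdf at 01:25:00.
Size of page: 0.00kb. Starting indexing at 01:25:00.
"""
print(zero_size_pages(sample))
```

A zero size by itself does not prove the page failed to index, as the search test shows, so treat the flagged URLs only as a starting point.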

Now I want to find out WHY size is not being recorded. Sphider version??? That will be my next test.

On your end, try searching for a word that may be unique to a particular PDF file and see what results you get...

There IS something amiss, but is it just the size??? Or is the issue deeper on your end???

It WILL get figured out.

PS: I do not think PHP version is the issue at the moment. My CentOS runs PHP 8.0, yours is 8.1, and my Ubuntu runs 8.2. Deprecation differences, but nothing major.

Re: An issue with PDF indexing.

Post by captquirk »

I tried various things to isolate the issue, such as stemming off, stemming on, various settings changes... to no avail.

Then I upgraded the CentOS Sphider from 5.2.1 to 5.3.0....
Here is the log file:
Spidering http://localhost/
1. Retrieving: http://localhost/ at 03:44:36.
Size of page: 10.78kb. Starting indexing at 03:44:36.
Indexed
Links found: 5. New links: 5
2. Retrieving: http://localhost/ at 03:44:36.
already in database
3. Retrieving: http://localhost/1974-1.pdf at 03:44:36.
Size of page: 766.05kb. Starting indexing at 03:44:36.
Indexed
Links found: 0. New links: 0
4. Retrieving: http://localhost/1995-2.pdf at 03:44:36.
Size of page: 4258.07kb. Starting indexing at 03:44:36.
Indexed
Links found: 0. New links: 0
5. Retrieving: http://localhost/2021-4.pdf at 03:44:37.
Size of page: 3159.87kb. Starting indexing at 03:44:37.
Indexed
Links found: 0. New links: 0
6. Retrieving: http://localhost/manual at 03:44:38.
Unreachable: http 404
Links found: 0. New links: 0

Completed at 03:44:38.
The PDFs are showing a file size! Looking at the links table in the database, the size column values match the log values.

Since Sphider was equipped to handle multibyte strings, it emulated those functions whenever the mbstring extension was absent from PHP. That had appeared to be working well, but in Sphider 5.3.0 I removed the emulations, and Sphider will now require the mbstring extension. For some reason, this did the trick!!!

Last minute checking and I plan to release 5.3.0 on Monday, with the companion SphiderLite 2.4.0.

Let's cross our fingers and hope 5.3.0 will straighten out this mess...

Re: An issue with PDF indexing.

Post by vradova »

Great!

Thanks again! Keeping my fingers crossed.

Vlade.

Re: An issue with PDF indexing.

Post by vradova »

Hi, captquirk!

I updated Sphider and now have the latest version, 5.3.0, but I still have no success indexing PDF files.

This is the output I get when I try to index PDF files:

[Back to admin]

Spidering https://myccr.com/scripts/links.php?cat=3
1. Retrieving: https://myccr.com/scripts/links.php?cat=3 at 09:48:12.
Size of page: 18.76kb. Starting indexing at 09:48:14.
Indexed
Links found: 186. New links: 186
2. Retrieving: https://myccr.com/sites/default/files/s ... 1974-1.pdf at 09:48:15.
Size of page: 766.05kb.

What would you need to investigate the issue? I can send all the details you need by email.

Thanks, Vlade.

Re: An issue with PDF indexing.

Post by captquirk »

This is crazy. I tried the url and got that page, and NO LINKS FOUND!
By email, let's see the full log (looks like it started but then hung...?), the settings screen, and the site advanced edit screen. Also see if there is anything relevant in the error logs. (As shipped, Sphider has logging turned off in a number of places. We may have to turn it on.)

I want to replicate the setup as closely as I can.

We are missing SOMETHING.... they say the devil is in the details.

Re: An issue with PDF indexing.

Post by captquirk »

This issue has been resolved. It turns out that Sphider uses the PHP exec() function to run the pdftotext converter.

Some Linux installations, for security reasons, block some functions such as exec(). A slight change to php.ini allowed exec() to function, and the PDFs were then indexed.
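
The thread does not say which php.ini setting was changed. One common way exec() gets blocked is PHP's disable_functions directive, so the Python sketch below assumes that is the cause and simply checks whether exec appears in that directive. The function name disabled_functions and the sample php.ini fragment are hypothetical; point it at the text of your own php.ini.

```python
import re


def disabled_functions(ini_text):
    """Return the function names listed in the last active
    disable_functions line of the given php.ini text."""
    names = set()
    for raw in ini_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments (php.ini comments start with ';').
        if not line or line.startswith(";"):
            continue
        match = re.match(r"disable_functions\s*=\s*(.*)$", line)
        if match:
            # Drop any trailing comment and surrounding quotes.
            value = match.group(1).split(";", 1)[0].strip().strip('"')
            # A later directive overrides an earlier one.
            names = {n.strip() for n in value.split(",") if n.strip()}
    return names


# Hypothetical php.ini fragment, for illustration only.
sample_ini = """
; security hardening
disable_functions = exec,passthru,shell_exec,system
"""
blocked = disabled_functions(sample_ini)
print("exec blocked:", "exec" in blocked)
```

If exec is listed, removing it from that line and restarting the web server (or PHP-FPM) is the usual remedy, though a hosting provider may impose the restriction elsewhere.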

As I said, the devil is in the details.