An issue with PDF indexing.

vradova
Posts: 8
Joined: Fri Sep 01, 2023 1:07 pm

Post by vradova »

Hi!

I have installed Sphider version 5.2.1. I tried to index my PDF files, but with no success. I tried to attach the indexing log to this post, but the file is too big.

Here is part of it:
"Spidering https://domain.com/scripts/links.php?cat=3

1. Retrieving: https://domain.com/scripts/links.php?cat=3 at 13:35:07.
Size of page: 18.58kb. Starting indexing at 13:35:09. Page contains less than 1 words
Links found: 186. New links: 186
2. Retrieving: https://domain.com/sites/default/files/ ... 1974-1.pdf at 13:35:09.
Size of page: 0.00kb. Starting indexing at 13:35:09. Page contains less than 1 words
Links found: 0. New links: 0
3. Retrieving: https://domain.com/sites/default/files/ ... 1975-1.pdf at 13:35:09.
Size of page: 0.00kb. Starting indexing at 13:35:09. Page contains less than 1 words
Links found: 0. New links: 0
4. Retrieving: https://domain.com/sites/default/files/ ... 1975-2.pdf at 13:35:09.
Size of page: 0.00kb. Starting indexing at 13:35:09. Page contains less than 1 words
Links found: 0. New links: 0
5. Retrieving: https://domain.com/sites/default/files/ ... 1975-3.pdf at 13:35:09.
Size of page: 0.00kb. Starting indexing at 13:35:09. Page contains less than 1 words
Links found: 0. New links: 0
6. Retrieving: https://domain.com/sites/default/files/ ... 1975-4.pdf at 13:35:09.
Size of page: 0.00kb. Starting indexing at 13:35:09. Page contains less than 1 words
Links found: 0. New links: 0
7. Retrieving: https://domain.com/sites/default/files/ ... 1976-1.pdf at 13:35:09.
Size of page: 0.00kb. Starting indexing at 13:35:09. Page contains less than 1 words
Links found: 0. New links: 0
8. Retrieving: https://domain.com/sites/default/files/ ... 1976-2.pdf at 13:35:09.
Size of page: 0.00kb. Starting indexing at 13:35:10. Page contains less than 1 words
Links found: 0. New links: 0"

Thanks in advance, Vlade.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: An issue with PDF indexing.

Post by captquirk »

By default, Sphider assumes that you are using a Linux-based operating system and that the pdftotext executable is located at "/usr/bin/pdftotext".
It is possible that:
1) pdftotext isn't installed, although with a Linux system that is unlikely.
2) pdftotext is located somewhere else. You can change the path on the settings tab.
3) You are using a Windows-based operating system. You then need to install pdftotext (the Xpdf command-line tools are a likely source), tell Sphider you have a Windows OS in settings, and change the appropriate path in settings (using the Windows convention, with backslashes and ending in ".exe").
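
The first two cases above can be checked with a short script. This is a minimal Python sketch, assuming Python 3 is available on the server; `check_converter` is a hypothetical helper name, not part of Sphider, and the default path is simply the one quoted above.

```python
import os
import shutil


def check_converter(path="/usr/bin/pdftotext"):
    """Report whether a converter exists at `path` and is executable.

    Hypothetical helper for illustration; it is not part of Sphider.
    """
    if os.path.isfile(path) and os.access(path, os.X_OK):
        return f"ok: {path} exists and is executable"
    # Fall back to searching PATH in case the binary lives elsewhere.
    found = shutil.which(os.path.basename(path))
    if found:
        return f"moved: found at {found}; update the path on the settings tab"
    return f"missing: nothing executable at {path} or on PATH"


print(check_converter())
```

The result reflects the account that runs the script, so it is most informative when run as the account PHP runs under rather than as root.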

Let us know if you need more assistance.

Re: An issue with PDF indexing.

Post by vradova »

Hi!

Thanks for the fast response!

I am using a Linux-based OS: CentOS 7 on a VPS over which I have full control. pdftotext is located at "/usr/bin/pdftotext". I ran a manual conversion from a Linux shell and it worked correctly.

This is the output for the version and location:

[root@ccc ~]# pdftotext -v
pdftotext version 0.26.5
Copyright 2005-2014 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
[root@ccc ~]# whereis pdftotext
pdftotext: /usr/bin/pdftotext /usr/share/man/man1/pdftotext.1.gz

Sphider is indexing the other content I have on that website, but not the files with a .pdf extension.

Thanks!

Re: An issue with PDF indexing.

Post by captquirk »

On the settings tab, be sure "Index PDF files" is checked. (It probably is, but I just have to mention it in the off-chance... LOL!)

Also, in the admin directory, there are several sub-directories. I believe you mentioned you are getting log files, so check that the "tmp" directory has the same owner and permissions as "log". Sphider uses pdftotext to read/convert the PDF file (just as you did) and stores the result in the "tmp" directory. It then indexes the contents of "tmp". If the permissions and owner are not correct, nothing will be put in "tmp", and so there is nothing to index.
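
The owner and permission comparison can be scripted. A minimal Python sketch, assuming the "log" and "tmp" directory names mentioned above; `compare_dirs` is a hypothetical helper, not something Sphider ships.

```python
import os
import stat


def compare_dirs(reference, candidate):
    """Return a list of differences in owner, group, or mode bits.

    An empty list means `candidate` matches `reference`.
    """
    ref, cand = os.stat(reference), os.stat(candidate)
    problems = []
    if ref.st_uid != cand.st_uid:
        problems.append(f"owner uid differs: {ref.st_uid} vs {cand.st_uid}")
    if ref.st_gid != cand.st_gid:
        problems.append(f"group gid differs: {ref.st_gid} vs {cand.st_gid}")
    ref_mode = stat.S_IMODE(ref.st_mode)
    cand_mode = stat.S_IMODE(cand.st_mode)
    if ref_mode != cand_mode:
        problems.append(f"mode differs: {ref_mode:o} vs {cand_mode:o}")
    return problems


# Example, run from inside the Sphider admin directory:
#   print(compare_dirs("log", "tmp") or "tmp matches log")
```

An empty result only shows that the two directories match each other; it does not prove that the web server account can write to either one.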

If that doesn't work, I am going to have to start to think!!!!

(Of course, I was once concerned about a PDF file of my own not getting indexed --- until I realized the PDF was ALL GRAPHICS and NO text! :oops: )
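
To rule out the all-graphics case, it helps to count the words pdftotext actually extracts. A minimal Python sketch, assuming pdftotext is at the path discussed above; both function names are hypothetical.

```python
import subprocess


def count_words(text):
    """Count whitespace-separated words in extracted text."""
    return len(text.split())


def pdf_word_count(pdf_path, converter="/usr/bin/pdftotext"):
    """Run pdftotext on a PDF and count the words it extracts.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    result = subprocess.run(
        [converter, pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return count_words(result.stdout)


# Example: print(pdf_word_count("1974-1.pdf"))
```

A count of zero for a file that opens normally in a viewer suggests the pages are scanned images with no text layer.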

Re: An issue with PDF indexing.

Post by vradova »

Hi!

Now I get this:

Spidering https://domain.com/scripts/links.php?cat=3
1. Retrieving: https://domain.com/scripts/links.php?cat=3 at 06:53:57.
Size of page: 18.58kb. Starting indexing at 06:53:59.
Indexed
Links found: 186. New links: 186
2. Retrieving: https://domain.com/sites/default/files/ ... 1974-1.pdf at 06:54:00.
Size of page: 0.00kb. Starting indexing at 06:54:00.
Indexed
Links found: 0. New links: 0
3. Retrieving: https://domain.com/sites/default/files/ ... 1975-1.pdf at 06:54:00.
Size of page: 0.00kb. Starting indexing at 06:54:00. Page is a duplicate.
Links found: 0. New links: 0
4. Retrieving: https://domain.com/sites/default/files/ ... 1975-2.pdf at 06:54:00.
Size of page: 0.00kb. Starting indexing at 06:54:01. Page is a duplicate.
Links found: 0. New links: 0
5. Retrieving: https://domain.com/sites/default/files/ ... 1975-3.pdf at 06:54:01.
Size of page: 0.00kb. Starting indexing at 06:54:01. Page is a duplicate.
Links found: 0. New links: 0

It starts to index, and I saw the tmp file. But nothing actually gets indexed: the size of the first file is reported as 0 KB, and the other PDF files get the error "Page is a duplicate."
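
One possible reading of this log, offered as an assumption rather than a confirmed diagnosis: if Sphider flags duplicates by comparing a hash of each page's content, then every page that comes back empty hashes to the same value, so the first is indexed and the rest are reported as duplicates. A minimal Python sketch of that effect, using SHA-256 purely for illustration (the thread does not say which check Sphider uses); `find_duplicates` and the file names are hypothetical.

```python
import hashlib


def find_duplicates(pages):
    """Return URLs whose content hash matches an earlier page.

    `pages` is a list of (url, content_bytes) pairs.
    """
    seen = set()
    duplicates = []
    for url, content in pages:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            duplicates.append(url)
        else:
            seen.add(digest)
    return duplicates


# Three placeholder documents that all came back empty (0.00kb).
pages = [("a.pdf", b""), ("b.pdf", b""), ("c.pdf", b"")]
print(find_duplicates(pages))  # ['b.pdf', 'c.pdf']
```

If that reading is right, the duplicate messages are a symptom of the empty retrieval, not a separate problem.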

The PDF files contain text; I can attach one that I converted through the CLI.

Thanks a lot, Vlade.

Re: An issue with PDF indexing.

Post by captquirk »

We have a mystery! We don't need to blab the content of your pages to the world, so email me, say, two of the PDF files and I will try them locally to see what kind of results I get.
It seems you have things configured correctly. Is there something different about these files? I can't imagine there would be, but ...
Send them to captq and use the blog . worldspaceflight . com domain. (HEY! I'm not paranoid? LOL!)

Re: An issue with PDF indexing.

Post by captquirk »

I tried indexing the PDFs locally:
1. Retrieving: http://localhost/ at 18:51:01.
Size of page: 10.78kb. Starting indexing at 18:51:01.
Indexed
Image 1. Retrieving: http://localhost/icons/ubuntu-logo.png at 18:51:01.
Indexed
Links found: 5. New links: 5
2. Retrieving: http://localhost/ at 18:51:01.
already in database
3. Retrieving: http://localhost/1974-1.pdf at 18:51:01.
Size of page: 766.05kb. Starting indexing at 18:51:01.
Indexed
Links found: 0. New links: 0
4. Retrieving: http://localhost/1995-2.pdf at 18:51:03.
Size of page: 4258.07kb. Starting indexing at 18:51:03.
Indexed
Links found: 0. New links: 0
5. Retrieving: http://localhost/2021-4.pdf at 18:51:08.
Size of page: 3159.87kb. Starting indexing at 18:51:08.
Indexed
Links found: 0. New links: 0
6. Retrieving: http://localhost/manual at 18:51:17.
Unreachable: http 404
Links found: 0. New links: 0

Completed at 18:51:18.
Statistics for site http://localhost/
Last indexed: 2023-09-02
Pages indexed: 4
Images indexed: 1
Total index size: 6891
Cached texts: 167.77kb
Total number of keywords: 5287
Site size: 8,194.77kb
So obviously indexing on my end worked. Now we have to figure out why I can do it but you can't. Pretty sure you are NOT a novice and know what you are doing, so we have to figure out just what in blazes is going on. The problem is NOT the PDF files.

I doubt CentOS is the issue, but I need to know that for sure. I run Ubuntu but have a CentOS 7 VM available. Let me install Sphider on it and run again. I don't expect any different results, but I need to be certain.

What version of Sphider do you have installed? Maybe I broke something somewhere along the line... (Wouldn't be the first time, sadly.) I am currently running the soon-to-be-released Sphider 5.3.0. I will install the same version you have in the CentOS VM to better duplicate your setup.

A snapshot of your settings would also be useful.

Re: An issue with PDF indexing.

Post by vradova »

Sphider version is 5.2.1

A snapshot of my settings is attached.
Attachments
screenshot-2023.09.03-00_32_04.png (250.46 KiB)

Re: An issue with PDF indexing.

Post by captquirk »

Thanks. Give me a bit...
My VM of CentOS 7 had a dated version of PHP (now at least a reasonable 8.0), but my database version is also old and I am struggling to get it updated. I am used to Ubuntu, so CentOS is trying to intimidate me! I won't let it.

Once that is done I can better simulate your setup to see if I can replicate the issue.

I'll be back...

Re: An issue with PDF indexing.

Post by vradova »

Thanks!