How to 'safely' access pdftotext from the http process

wiringmaze
Posts: 4
Joined: Wed Sep 06, 2023 5:34 pm

Post by wiringmaze »

pdftotext is essential to Sphider for indexing PDFs. There is a CLI via admin/spider.php, but I'm not yet familiar with all the commands I would need to replicate the web interface (e.g., to initiate reindexing).

From the standard web interface at the Sphider/admin/ portal, I can initiate indexing, but it seems to fail silently because pdftotext is 'beyond' the reach of the http process. By "fail silently" I mean that, as far as I can tell, it moves on to the next document, cannot index it, and moves on again...

Am I missing a critical step?
Is there a 'safe' setting in the PHP configuration that grants it execute permission, while not exposing the world to too much access?
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: How to 'safely' access pdftotext from the http process

Post by captquirk »

Yes, Sphider can index PDF files. However, the translation of PDF to text is not native to Sphider. That is done by an external utility, pdftotext.
This is pretty much a standard executable on Linux-based systems, typically residing at /usr/bin/pdftotext.

On a Windows system, pdftotext.exe is NOT present by default. A common way to get it is to install the Xpdf command-line tools or a Windows build of the Poppler utilities. Where it resides on the system is not consistent. Also, the install location may not be on a known Windows executable path, so in addition to installing the utility, you have to add its folder to the Windows PATH variable.

You can test the utility from the command prompt.
In Linux: pdftotext file.pdf -
In Windows: pdftotext.exe file.pdf -

This will give a text translation to standard output.
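
Before running that test, it can help to confirm the shell can find the utility at all. Here is a rough sketch for Linux; find_converter is just a name I made up for illustration, not anything that ships with Sphider.

```shell
# Hypothetical helper (not part of Sphider): report whether a command is on PATH.
find_converter() {
    # $1 is the command name to look for, e.g. pdftotext
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $(command -v "$1")"
    else
        echo "not found: $1"
        return 1
    fi
}

# Look for pdftotext; "|| true" keeps a script running if it is absent.
find_converter pdftotext || true
```

If this reports "not found", the command-line test above will fail as well, and the utility needs to be installed or its folder added to PATH first.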

In Sphider, if pdftotext is unavailable, indexing a PDF will produce nothing. This can be because either pdftotext does not exist on the system, or the path specified in Settings is wrong. On a Windows system, you must tell Sphider you are using a Windows OS, and the path specified must be in typical Windows format (C:\yadayada).
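
To narrow down which of those cases applies on Linux, something like the following sketch may help. check_path is a made-up helper name, and /usr/bin/pdftotext is only the typical location mentioned above; substitute whatever path is entered in Settings.

```shell
# Hypothetical helper (not part of Sphider): classify a configured converter path.
check_path() {
    # $1 is the full path as entered in Settings
    if [ ! -e "$1" ]; then
        echo "missing"
        return 1
    elif [ ! -x "$1" ]; then
        echo "not executable"
        return 1
    else
        echo "ok"
    fi
}

# Check the typical Linux location; adjust to match your own Settings value.
check_path /usr/bin/pdftotext || true

# To see what the web server account sees, the same test can be run as that
# user. The account name www-data is an assumption (common on Debian/Ubuntu);
# yours may differ:
#   sudo -u www-data test -x /usr/bin/pdftotext && echo ok
```

A result of "ok" from your own login but a failure when run as the web server account would point at permissions rather than a wrong path.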