Fulltxt field

Come here for help or to post comments on Sphider
rakone
Posts: 2
Joined: Fri Jul 23, 2021 2:19 pm

Post by rakone »

And just a suggestion. This is what the fulltxt field of one of the DMOZ pages looked like in the Sphiderlinks table:

\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n \r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n \r\n \r\n \r\n\r\n \r\n \r\n\r\n \r\n\r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n \r\n \r\n \r\n DMOZ \r\n \r\n\r\n #OrganizeTheWeb \r\n \r\n \r\n \r\n Important Notice \r\n Welcome to our archive of dmoz.org. \r\n Visit resource-zone \r\n to stay in touch with the community.\r\n \r\n \r\n\r\n \r\n \r\n\r\n \r\n\r\n Follow @dmoz \r\n \r\n \r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n About \r\n Become an Editor \r\n Suggest a Site \r\n ...

To save space in my crawler, I trim all that plus double spaces out of the fulltxt field.
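
For illustration, here is a minimal sketch of that kind of cleanup. It is not taken from Sphider or from my crawler's actual source; the function name `collapse_whitespace` is hypothetical, and it assumes the field may hold either real CR/LF characters or the literal two-character sequences "\r" and "\n", as in the dump above.

```python
import re


def collapse_whitespace(fulltxt: str) -> str:
    """Hypothetical helper: reduce line breaks and repeated spaces to single spaces."""
    # Assumption: the stored text may contain the escaped sequences backslash-r
    # and backslash-n rather than real control characters, so replace those first.
    # Caveat: this would also alter legitimate text such as "C:\readme".
    text = fulltxt.replace("\\r", " ").replace("\\n", " ")
    # Collapse any run of whitespace (real CR/LF, tabs, spaces) into one space.
    return re.sub(r"\s+", " ", text).strip()


sample = "\r\n\r\n \r\n DMOZ \r\n \r\n\r\n #OrganizeTheWeb \r\n  Important Notice \r\n"
print(collapse_whitespace(sample))  # DMOZ #OrganizeTheWeb Important Notice
```

Words stay separated by a single space, so searching still works, but the original line layout of the page is lost.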
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Fulltxt field

Post by captquirk »

Sphider does remove multiple whitespace characters. It also tries to preserve the original page as much as possible.

Since Sphider is open source, it is possible to edit out multiple occurrences of "\r\n". This would be done in spiderfuncs.php.

EDIT: It also happens that "\r\n" is not HTML. It is the DOS/Windows line ending (a carriage return followed by a line feed). It is sometimes found in doc or txt files created in Windows, possibly from a pdf file converted to text on a Windows machine. Sphider retains this formatting when indexing such files (doc or pdf).