Fulltxt field

Posts: 2
Joined: Fri Jul 23, 2021 2:19 pm

Post by rakone »

And just a suggestion. This is what the fulltxt field of one of the DMOZ pages looked like in the Sphiderlinks table:

\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n \r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n \r\n \r\n \r\n\r\n \r\n \r\n\r\n \r\n\r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n \r\n \r\n \r\n DMOZ \r\n \r\n\r\n #OrganizeTheWeb \r\n \r\n \r\n \r\n Important Notice \r\n Welcome to our archive of dmoz.org. \r\n Visit resource-zone \r\n to stay in touch with the community.\r\n \r\n \r\n\r\n \r\n \r\n\r\n \r\n\r\n Follow @dmoz \r\n \r\n \r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n About \r\n Become an Editor \r\n Suggest a Site \r\n ...

To save space in my crawler, I strip all of those "\r\n" sequences, plus double spaces, out of the fulltxt field.
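
For anyone who wants to try the same cleanup, here is a minimal sketch in Python. It is not Sphider's code (Sphider is PHP), and the function name compact_fulltxt is made up for illustration. It assumes the "\r\n" in the dump above may be stored either as literal backslash sequences or as real CR/LF characters, and handles both.

```python
import re


def compact_fulltxt(text: str) -> str:
    """Collapse line-break noise and repeated spaces in extracted page text.

    Hypothetical helper, not part of Sphider.
    """
    # Case 1: the escapes are stored literally, i.e. a backslash followed
    # by 'r', 'n' or 't'. Replace each run of them with a single space.
    # Caveat: this would also alter legitimate text such as a Windows
    # path containing "\n" or "\t", so only use it on data like the dump.
    text = re.sub(r"(?:\\r|\\n|\\t)+", " ", text)

    # Case 2: real whitespace characters (CR, LF, tabs, repeated spaces).
    # Collapse every run into one space.
    text = re.sub(r"\s+", " ", text)

    # Drop the leading/trailing space left over from the steps above.
    return text.strip()


if __name__ == "__main__":
    # Raw string: the backslashes below are literal, as in the dump.
    sample = r"\r\n\r\n \r\n DMOZ \r\n \r\n\r\n #OrganizeTheWeb \r\n Important Notice \r\n"
    print(compact_fulltxt(sample))
```

Note that this throws away all line structure, so it only makes sense if the field is used for searching and snippets rather than for reproducing the page layout.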
Site Admin
Posts: 188
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: Fulltxt field

Post by captquirk »

Sphider does remove multiple whitespace. It also tries to preserve the original page as much as possible.

Since Sphider is open source, it is possible to edit out multiple occurrences of "\r\n". This would be done in spiderfuncs.php.

EDIT: Note also that "\r\n" is not HTML. It is the DOS/Windows line ending (a carriage return followed by a line feed). It is sometimes found in doc or txt files created on Windows, possibly from a PDF file converted to text on a Windows machine. Sphider retains this formatting when indexing such files (doc or pdf).