Fulltxt field

Posted: Fri Jul 23, 2021 3:46 pm
by rakone
Just a suggestion. This is what the fulltxt field of one of the DMOZ pages looked like in the Sphiderlinks table:

\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n \r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n \r\n \r\n \r\n\r\n \r\n \r\n\r\n \r\n\r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n \r\n \r\n \r\n DMOZ \r\n \r\n\r\n #OrganizeTheWeb \r\n \r\n \r\n \r\n Important Notice \r\n Welcome to our archive of dmoz.org. \r\n Visit resource-zone \r\n to stay in touch with the community.\r\n \r\n \r\n\r\n \r\n \r\n\r\n \r\n\r\n Follow @dmoz \r\n \r\n \r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n About \r\n Become an Editor \r\n Suggest a Site \r\n ...

To save space in my crawler, I trim all that plus double spaces out of the fulltxt field.
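For anyone who wants to try the same kind of trimming, here is a minimal sketch in Python. It is not the code from my crawler or from Sphider, and the function name `compact_fulltxt` is made up for illustration. It assumes the field may hold either real CR/LF characters or the literal four-character sequence `\r\n` as it appears in a database dump, and it flattens both into single spaces.

```python
import re


def compact_fulltxt(text: str) -> str:
    """Flatten stored page text into single-spaced words (hypothetical helper)."""
    # A dump may show line breaks as the literal characters backslash-r
    # backslash-n rather than real control characters; turn those into spaces.
    text = text.replace("\\r\\n", " ")
    # \s+ matches real CR, LF, tabs and spaces; collapse each run to one space.
    return re.sub(r"\s+", " ", text).strip()


sample = "\r\n\r\n \r\n DMOZ \r\n \r\n\r\n #OrganizeTheWeb \r\n  Important Notice \r\n"
print(compact_fulltxt(sample))  # DMOZ #OrganizeTheWeb Important Notice
```

Note that this throws away all line structure, which is fine for a search index but not if you want to show the text as it appeared on the page.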

Re: Fulltxt field

Posted: Mon Jul 26, 2021 6:50 pm
by captquirk
Sphider does remove multiple white space. It also tries to preserve the original page as much as possible.

Since Sphider is open source, it is possible to edit out multiple occurrences of "\r\n". This would be done in spiderfuncs.php.

EDIT: It also happens that "\r\n" is not HTML. It is a DOS/Windows line ending (carriage return plus line feed). It is sometimes found in doc or txt files created in Windows, possibly from a pdf file converted to text on a Windows machine. Sphider retains this formatting when indexing such files (doc or pdf).
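If the goal is to edit out only the repeated "\r\n" while keeping the page's line structure, the idea can be sketched roughly as follows. This is an illustration in Python, not the actual contents of spiderfuncs.php, and `collapse_line_breaks` is a hypothetical name. The same regular expressions should carry over to PHP's `preg_replace`, though that has not been tested here.

```python
import re


def collapse_line_breaks(text: str) -> str:
    """Keep one line break per run instead of removing them all (hypothetical helper)."""
    # Normalise Windows (CRLF) and bare CR line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop spaces and tabs that sit directly around a line break.
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)
    # Collapse each run of line breaks into a single one.
    text = re.sub(r"\n{2,}", "\n", text)
    # Collapse remaining runs of spaces or tabs inside a line.
    text = re.sub(r"[ \t]{2,}", " ", text)
    return text.strip()


page = "Welcome to our archive of dmoz.org. \r\n Visit resource-zone \r\n\r\n \r\n to stay in touch"
print(collapse_line_breaks(page))
```

With the sample above this leaves three lines of text separated by single line feeds, so the original line order is preserved while the empty runs are gone.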