And just a suggestion. This is what the fulltxt field of one of the DMOZ pages looked like in the Sphider links table:
\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n \r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n \r\n\r\n \r\n \r\n \r\n\r\n \r\n \r\n\r\n \r\n\r\n\r\n \r\n\r\n \r\n\r\n \r\n\r\n \r\n \r\n \r\n \r\n DMOZ \r\n \r\n\r\n #OrganizeTheWeb \r\n \r\n \r\n \r\n Important Notice \r\n Welcome to our archive of dmoz.org. \r\n Visit resource-zone \r\n to stay in touch with the community.\r\n \r\n \r\n\r\n \r\n \r\n\r\n \r\n\r\n Follow @dmoz \r\n \r\n \r\n\r\n \r\n \r\n \r\n \r\n \r\n \r\n About \r\n Become an Editor \r\n Suggest a Site \r\n ...
To save space in my crawler, I trim all that plus double spaces out of the fulltxt field.
Fulltxt field
Re: Fulltxt field
Sphider does remove repeated whitespace. It also tries to preserve the original page as much as possible.
Since Sphider is open source, it is possible to edit out multiple occurrences of "\r\n". This would be done in spiderfuncs.php.
EDIT: Note also that "\r\n" is not HTML. It is the DOS/Windows line ending (carriage return plus line feed). It is sometimes found in doc or txt files created on Windows, possibly from a PDF file converted to text on a Windows machine. Sphider retains this formatting when indexing such files (doc or pdf).
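For anyone who wants to try this kind of cleanup, here is a rough sketch of the idea. It is written in Python for illustration only and is not Sphider's own code (Sphider is PHP, and the real change would go in spiderfuncs.php). The function name collapse_whitespace and the sample string are made up for the example. It assumes the stored text may contain either real CR/LF characters or the literal four-character sequence \r\n, as in the dump quoted above.

```python
import re


def collapse_whitespace(fulltxt: str) -> str:
    """Collapse runs of line breaks and spaces in extracted page text.

    Hypothetical helper for illustration; not part of Sphider.
    """
    # Some dumps show line breaks as the literal characters backslash-r
    # backslash-n rather than real control characters; normalise those first.
    text = fulltxt.replace("\\r\\n", "\n")
    # Convert real Windows (CRLF) and old Mac (CR) line endings to LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Any run of whitespace (spaces, tabs, line breaks) becomes one space.
    text = re.sub(r"\s+", " ", text)
    # Drop the leading and trailing space left over from the collapse.
    return text.strip()


# Shortened stand-in for the DMOZ field shown earlier in the thread.
sample = "\\r\\n\\r\\n \\r\\n DMOZ \\r\\n \\r\\n\\r\\n #OrganizeTheWeb \\r\\n  Important Notice \\r\\n"
print(collapse_whitespace(sample))
```

Note that this flattens every line break into a single space, so it trades the preserved page layout for a smaller field. If paragraph breaks matter, collapsing only runs of two or more line breaks would be a gentler option.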