Pre-indexing tips

Post by captquirk

Many indexing problems can be prevented, at least on sites over which you have control. Here are three tips to improve the odds of proper indexing.

Be consistent with the use of "http" and "https" in your links. While browsers are much more forgiving on this issue, Sphider is not. If you are indexing "http://whatever.com" and a link to "https://whatever.com" is encountered, it will be considered a "foreign domain" and will be ignored. Conversely, indexing "https://whatever.com" will ignore links to "http://whatever.com".
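
To spot mixed schemes before indexing, something like the following Python sketch might help. It is not part of Sphider; the function name scheme_mismatches and the sample links are made up for illustration, and it assumes you have already collected the href values from a page.

```python
from urllib.parse import urljoin, urlsplit


def scheme_mismatches(start_url, hrefs):
    """Return links on the same host as start_url that use the other scheme."""
    start = urlsplit(start_url)
    mismatched = []
    for href in hrefs:
        absolute = urljoin(start_url, href)  # resolve relative links first
        target = urlsplit(absolute)
        if target.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if target.hostname == start.hostname and target.scheme != start.scheme:
            mismatched.append(absolute)
    return mismatched


# Hypothetical links gathered from a page on http://whatever.com/
links = [
    "/about.html",
    "https://whatever.com/contact.html",
    "mailto:me@whatever.com",
]
print(scheme_mismatches("http://whatever.com/", links))
# prints ['https://whatever.com/contact.html']
```

Any link this reports is one that would be treated as a foreign domain when indexing starts from start_url.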

Be sure HTML tags are properly closed. Sphider strips tags before indexing, and a missing closing tag may cause some text to be lost from the indexing process. Running an HTML checker can prevent this kind of problem. Many online HTML checkers are available, as well as installable HTML validation tools. Again, browsers can be more forgiving than Sphider. Just because a page LOOKS alright in a browser does not mean Sphider is going to see it the same way.
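
If you just want a quick local check, a rough tag-balance test can be sketched with Python's standard html.parser module. This is an illustration only, not a substitute for a real validator and not how Sphider itself parses pages. The names TagBalanceChecker and find_unclosed_tags are made up, and the check is deliberately stricter than the HTML spec: it also flags end tags that HTML allows you to omit, such as </p> or </li>.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of (tag, line) still waiting for a close
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if tag not in [name for name, _ in self.open_tags]:
            line = self.getpos()[0]
            self.problems.append(f"line {line}: </{tag}> has no opening tag")
            return
        # Pop back to the matching tag; anything skipped was never closed.
        while self.open_tags:
            open_tag, line = self.open_tags.pop()
            if open_tag == tag:
                break
            self.problems.append(f"line {line}: <{open_tag}> is never closed")


def find_unclosed_tags(html):
    """Return a list of human-readable tag-balance problems in html."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    for open_tag, line in checker.open_tags:
        checker.problems.append(f"line {line}: <{open_tag}> is never closed")
    return checker.problems


print(find_unclosed_tags("<div><b>bold text</div>"))
# prints ['line 1: <b> is never closed']
```

An empty list means every non-void tag the parser saw had a matching close.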

Be sure each page has its encoding identified. This can be done with HTTP headers or in a meta tag. If more than one method is used, be sure they are consistent. Sphider attempts to convert all pages to UTF-8, if needed. Obviously, if the page is already UTF-8, no conversion is needed. But be careful that the encoding stated is the encoding used. For example, if the stated encoding is UTF-8, the character "ā" should appear as exactly that literal character, and not as a character reference such as "&#257;". But if the stated encoding is Windows-1252, which has no "ā" of its own, you need to use the character reference and not the literal "ā". If the encoding is not specified, or is wrongly specified, Sphider may either fail to encode characters or double encode them, and you end up with strange things in the database. Another thing that can (unintentionally) cause issues with page encoding is the utility you use to write your code! If you write UTF-8 code in Windows Notepad, what you actually get may not be what you expect. Be sure your utility can handle the encoding you want. If you use Windows, Notepad++ has that ability. (Be sure to check the options.) If you are using Linux, this will be less of an issue, but be careful just the same.
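
One way to test whether the stated encoding matches the bytes actually saved is sketched below in Python. This is only an illustration and has nothing to do with Sphider's own conversion code. The function name check_encoding is made up, and it assumes you already have the Content-Type header value and the raw page bytes in hand. It only catches definite mismatches: a page that decodes cleanly can still be mislabeled.

```python
import re

HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([A-Za-z0-9_-]+)', re.IGNORECASE)
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)


def check_encoding(content_type, body):
    """Compare declared charsets with each other and with the raw bytes."""
    declared = {}
    header_match = HEADER_CHARSET.search(content_type or "")
    if header_match:
        declared["header"] = header_match.group(1).lower()
    meta_match = META_CHARSET.search(body[:4096])  # meta tags sit near the top
    if meta_match:
        declared["meta tag"] = meta_match.group(1).decode("ascii").lower()
    if not declared:
        return ["no charset declared in header or meta tag"]

    problems = []
    if len(set(declared.values())) > 1:
        pairs = ", ".join(f"{src}={cs}" for src, cs in declared.items())
        problems.append(f"conflicting declarations: {pairs}")
    for charset in sorted(set(declared.values())):
        try:
            body.decode(charset)  # strict decode: fails on impossible bytes
        except LookupError:
            problems.append(f"unknown charset name: {charset}")
        except UnicodeDecodeError as err:
            problems.append(f"bytes are not valid {charset} (offset {err.start})")
    return problems


# A page saved as UTF-8 but labeled Windows-1252 (hypothetical sample).
page = '<meta charset="windows-1252"><p>ā</p>'.encode("utf-8")
for problem in check_encoding(None, page):
    print(problem)
```

The sample is flagged because the UTF-8 bytes for "ā" include 0x81, which has no meaning in Windows-1252. An empty list means the declarations agree and the bytes decode under the stated charset.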

While these three simple tips can help with sites where you have creative control, they won't be of much use on sites you can't control. But if those sites do give strange results, you may now know why!