ssl redirect

Come here for help or to post comments on Sphider
gaddcasey1
Posts: 12
Joined: Wed Aug 15, 2018 4:48 pm

ssl redirect

Post by gaddcasey1 »

I have an Apache SSL redirect (http to https) and I can't get the site to index. I keep getting a "no host" error.
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: ssl redirect

Post by captquirk »

On the Sphider settings, Sites tab, are you specifying the URL as "http://" and relying on the redirect, or are you specifying "https://"?
The former could well return "no host found". The latter should work just fine.

Just as a test: if you visit the site in a browser using "http://", are you properly redirected to "https://"?
I know I went through a "rough patch" when I began forcing all web traffic to https!

If the problem persists, post or PM the problem URL and I'll look at it.
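
If you want to check the redirect without a browser, a quick PHP check like this (just a sketch, with a placeholder URL) will show whether "http://" answers with a 301 and a "Location:" header pointing at "https://":

<?php
// Fetch the response headers for the http:// URL and look for the redirect.
// get_headers() follows redirects by default, so you should see the original
// 301 line followed by the headers of the https:// page it lands on.
$headers = get_headers("http://www.example.com/"); // placeholder URL
foreach ($headers as $h) {
    echo $h . "\n"; // look for "301 Moved Permanently" and "Location: https://..."
}
?>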
gaddcasey1
Posts: 12
Joined: Wed Aug 15, 2018 4:48 pm

Re: ssl redirect

Post by gaddcasey1 »

Not sure what happened, but now everything is working. It kept saying all my pages were less than 1 word. I assumed it was because the pages were SSL. Thanks for your help.
gaddcasey1
Posts: 12
Joined: Wed Aug 15, 2018 4:48 pm

Re: ssl redirect

Post by gaddcasey1 »

Well, it looks like the SSL redirect was disabled. So it works if the SSL redirect is disabled and breaks if you enable it. I tested another site with an SSL redirect and I got the same errors as in the picture attached.
https://www.okanoganpud.org/
[Attachment: 1.png]
Robots.txt

Code:

#
# robots.txt
#
# This file is to prevent the crawling and indexing of certain parts
# of your site by web crawlers and spiders run by sites like Yahoo!
# and Google. By telling these "robots" where not to go on your site,
# you save bandwidth and server resources.
#
# This file will be ignored unless it is at the root of your host:
# Used:    http://example.com/robots.txt
# Ignored: http://example.com/site/robots.txt
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/robotstxt.html

User-agent: *
# CSS, JS, Images
Allow: /core/*.css$
Allow: /core/*.css?
Allow: /core/*.js$
Allow: /core/*.js?
Allow: /core/*.gif
Allow: /core/*.jpg
Allow: /core/*.jpeg
Allow: /core/*.png
Allow: /core/*.svg
Allow: /profiles/*.css$
Allow: /profiles/*.css?
Allow: /profiles/*.js$
Allow: /profiles/*.js?
Allow: /profiles/*.gif
Allow: /profiles/*.jpg
Allow: /profiles/*.jpeg
Allow: /profiles/*.png
Allow: /profiles/*.svg
# Directories
Disallow: /core/
Disallow: /profiles/
# Files
Disallow: /README.txt
Disallow: /web.config
# Paths (clean URLs)
Disallow: /admin/
Disallow: /comment/reply/
Disallow: /filter/tips/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
Disallow: /user/logout/
# Paths (no clean URLs)
Disallow: /index.php/admin/
Disallow: /index.php/comment/reply/
Disallow: /index.php/filter/tips/
Disallow: /index.php/node/add/
Disallow: /index.php/search/
Disallow: /index.php/user/password/
Disallow: /index.php/user/register/
Disallow: /index.php/user/login/
Disallow: /index.php/user/logout/
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: ssl redirect

Post by captquirk »

I initially indexed, successfully, https://www.okanoganpud.org/. Since I had SPECIFIED "https" when adding the site, I thought that might be the solution. So I tried http://www.okanoganpud.org (no https), expecting it to fail. Instead, this is what I got:
[Attachment: 1.gif]
Well, that isn't what I was expecting! But then I noticed something about my screenshot compared to yours... mine begins by listing files and directories disallowed by robots.txt, whereas yours does not. Earlier versions of Sphider were not consistently producing this listing, instead jumping immediately into the indexing results. Later versions will display a list of disallowed files and directories before listing the results, provided of course that the robots.txt file exists and is valid. In your case, the file DOES exist and IS valid.
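
If you want to double-check that the file is reachable from the machine doing the indexing, a throwaway script like this (just a sketch) will do:

<?php
// Can the indexing machine fetch robots.txt over https at all?
$txt = @file_get_contents("https://www.okanoganpud.org/robots.txt");
if ($txt === false) {
    echo "Could not fetch robots.txt\n";
} else {
    echo "Fetched " . strlen($txt) . " bytes\n";
}
?>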

I have successfully indexed www.okanoganpud.org using both Sphider 2.0.0 and PDO Sphider 2.0.0. I have set the site in Sphider both as https and plain old http. Indexing time runs 45 minutes to an hour (I did not index pdf files and limited the depth to 2). My results yield approximately 550 pages and 16,000+ keywords.

Is it possible you are using an earlier version?

Please check at the top of the page on the Settings tab. What version is indicated?
If you are indeed using Sphider 2.0.0, then I am missing something as I SHOULD be able to duplicate the problem.

Please verify the version id, the platform you are indexing from (Linux, Windows), and confirm the database (I suspect MySQL, but let's be sure).
We WILL figure this out!

(And, it's a good thing I didn't go shooting off about my initial suspicions because I would have been WRONG!)
gaddcasey1
Posts: 12
Joined: Wed Aug 15, 2018 4:48 pm

Re: ssl redirect

Post by gaddcasey1 »

I am using Sphider 2.0.0c-PDO. I am on CentOS (self-hosted) using MySQL.
[Attachments: 2.png, 3.png, 4.png]
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: ssl redirect

Post by captquirk »

I went to my Ubuntu box, using 2.0.0c-PDO, and used settings like yours. This time, I received very DIFFERENT results!
First, defining the site as https, I got this:
[Attachment: 1.png]
Then, defining the site simply as http, I received this:
[Attachment: 2.png]
In the first instance, a page was found but contained "less than 10 words".
In the second case, I received a NOHOST error.
In neither case did the listing of disallowed files and directories appear.

This is very odd. But you already knew that!!!
Indexing plain old http on Linux = NOHOST. Same settings, same Sphider version, same http, but on Windows = successful indexing.
Before I ran the tests, I simply viewed the web page in two different browsers. Chromium (Chrome for Linux) promptly displayed the page. Firefox (on Linux) at first complained of a problem with the SSL certificate, but after I created an exception everything was fine. I was able to view the robots.txt file in both browsers.

I will add some troubleshooting code to the spidering module to see if I can find just WHY I'm getting these results. Bear with me...
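
In the meantime, one quick sanity check you can run yourself: "NO HOST" suggests name resolution, so a snippet like this (a sketch, not Sphider's actual code) will show whether the host even resolves from the indexing machine:

<?php
// gethostbyname() returns the input string unchanged when the lookup fails.
$host = "www.okanoganpud.org";
$ip = gethostbyname($host);
echo ($ip === $host) ? "DNS lookup failed for $host\n" : "$host resolved to $ip\n";
?>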
captquirk
Site Admin
Posts: 299
Joined: Sun Apr 09, 2017 8:49 pm
Location: Arizona, USA

Re: ssl redirect

Post by captquirk »

Scenario 1: Sphider 2.0.0c-PDO, MySQL, PHP 7, Apache, Windows 7
- Successfully able to index both http://www.okanoganpud.org and https://www.okanoganpud.org
- Logs both times start with a listing of disallowed files and directories from robots.txt

Scenario 2: Sphider 2.0.0c-PDO, MySQL, PHP 7, Apache, Ubuntu 18.04
- Unable to index either http://www.okanoganpud.org or https://www.okanoganpud.org
- Using http, no disallowed listing, message that the page has fewer than 10 words
- Using https, no disallowed listing, message stating "NO HOST"

I created a simple script:
<?php

// Simple SSL connectivity test. Uncomment one $target per run.
// The host after 'ssl://' can vary; the 'ssl://' prefix must remain.
//$target = "ssl://tpwd.texas.gov";
//$target = "ssl://www.parks.ca.gov";
//$target = "ssl://southcarolinaparks.com";
//$target = "ssl://blog.worldspaceflight.com";
//$target = "ssl://www.worldspaceflight.com";
$target = "ssl://www.okanoganpud.org";
$port = 443; // SSL port is 443, plain http port is 80
$fsocket_timeout = 30;
$fp = fsockopen($target, $port, $errno, $errstr, $fsocket_timeout);
if (!$fp) {
    echo "FAIL<br>";
    die($errno.", ".$errstr);
} else {
    echo $errno."<br>".$errstr."<br>SUCCESS";
    fclose($fp);
}

?>
This script allows me to test various https (ssl) sites. For each run, I uncommented a single $target line.
In the Windows environment, each $target yielded "SUCCESS".
In the Ubuntu environment, each $target except www.okanoganpud.org yielded "SUCCESS"; www.okanoganpud.org yielded "FAIL".
Thinking MAYBE my Ubuntu box could have an issue (I recently upgraded from 16.04), I copied the script to the hosting web server for worldspaceflight.com, also Ubuntu based, and tried again. Identical results.

Running the script with "ssl://" deleted from $target for the problem site gave "SUCCESS". This was for port 443. Same result for port 80.
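
One limitation of fsockopen() is that it hides the underlying OpenSSL error. A variant using stream_socket_client() (a sketch along the same lines, not Sphider's code) can surface what the TLS layer is actually complaining about on the Linux side:

<?php
// Same connectivity test, but draining the OpenSSL error queue afterwards
// so a TLS-level failure (protocol version, cipher, certificate) is visible.
$context = stream_context_create([
    "ssl" => [
        "verify_peer"      => false, // we only care whether the handshake completes
        "verify_peer_name" => false,
        "SNI_enabled"      => true,
    ],
]);
$fp = @stream_socket_client("ssl://www.okanoganpud.org:443", $errno, $errstr,
                            30, STREAM_CLIENT_CONNECT, $context);
if (!$fp) {
    echo "FAIL: $errno, $errstr\n";
    while ($msg = openssl_error_string()) { // walk the OpenSSL error queue
        echo $msg . "\n";
    }
} else {
    echo "SUCCESS\n";
    fclose($fp);
}
?>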

So what is this all telling me? It tells me the problem is specific to https://www.okanoganpud.org, and not to SSL sites in general. I see that the issue occurs when I am indexing from a Linux platform, but not from a Windows platform. NOW THIS TRULY HAS ME BAFFLED!!! Obviously, there IS a problem... is it that Windows is just too stupid to realize it? I've been searching the web to come up with some kind of explanation on that one, but so far have had no success.

Suffice to say, Sphider IS having trouble with https://www.okanoganpud.org, but not with other SSL sites (that I have found so far). Referencing the site as plain old "http" is, theoretically, finding the site, but finding no content, so not even the parsing of robots.txt is being accomplished. This likely has to do with 1) a problem with https, and 2) the forced redirection.

If I understand correctly, you were successfully indexing http://www.okanoganpud.org BEFORE implementing SSL. Or was it before FORCING https?
If it is the latter case (you had a certificate but weren't forcing https access), then the problem MIGHT be with the directive being used to force https access. From personal experience, there are 999 ways to force https, and I had to go through 900 of them before I got it to work right. Still, this is something that can be fixed with a bit of experimenting.

IF, however, the problem began when the SSL certificate was put in place, but with no directive that it HAD to be used, then the problem would be with the certificate itself. I consider this highly unlikely.

In summary, and I hate to say this, the problem seems to be something in the site's configuration. Why on earth I can still index from Windows is a mystery I would like to solve. When my test script starts saying "SUCCESS" when run from a Linux platform, the issue should be resolved.

Going back to forcing https from .htaccess, this is what ultimately worked for me (modified to use your URL):
RewriteEngine On

RewriteCond %{SERVER_PORT} 80
RewriteRule ^(.*)$ https://www.okanoganpud.org/$1 [R=301,L]
gaddcasey1
Posts: 12
Joined: Wed Aug 15, 2018 4:48 pm

Re: ssl redirect

Post by gaddcasey1 »

If I disable the line below, I can index the site. My temporary fix is to disable the line, index the site, then re-enable the SSL redirect. This is not ideal, so I hope I can fix the issue.
[Attachment: 4.png]
I was concerned that the SSL was not set up properly, but from the research I have done it seems like it is fine. I have also tested another SSL site, https://www.bleepingcomputer.com/, and I get the same result.
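
Another way to avoid toggling the redirect by hand might be to exempt just the machine running Sphider from the rewrite (a sketch, assuming Apache mod_rewrite; 192.0.2.10 is a placeholder for the indexer's IP):

Code:

RewriteEngine On
RewriteCond %{HTTPS} !on
# Skip the redirect for the host running Sphider (placeholder IP)
RewriteCond %{REMOTE_ADDR} !^192\.0\.2\.10$
RewriteRule ^(.*)$ https://www.okanoganpud.org/$1 [R=301,L]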
[Attachment: 6.png]
gaddcasey1
Posts: 12
Joined: Wed Aug 15, 2018 4:48 pm

Re: ssl redirect

Post by gaddcasey1 »

I tried to reproduce your success, but I am getting the same error. The .htaccess code causes httpd to crash (I am using CentOS 7). I found a recommended https redirect on drupal.com that looked similar to your code. It still causes issues with Sphider.

Code:

RewriteEngine on

#RewriteCond %{SERVER_PORT} 80
#RewriteRule ^(.*)$ https://www.okanoganpud.org/$1 [R=301,L]

RewriteCond %{HTTPS} !on
RewriteCond %{HTTP_HOST} ^www\.okanoganpud\.org
RewriteRule ^(.*)$ https://www.okanoganpud.org/ [L,R=301]

# Drupal clean URLs
RewriteBase /
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
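
One thing worth noting about the rule above, separate from the Sphider problem: it drops the captured path, so every http request lands on the homepage instead of its https equivalent. A version that preserves the path (untested here) would be:

Code:

RewriteCond %{HTTPS} !on
RewriteCond %{HTTP_HOST} ^www\.okanoganpud\.org [NC]
RewriteRule ^(.*)$ https://www.okanoganpud.org/$1 [L,R=301]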
[Attachment: 8.png]