Introduction
First, welcome to the world of building really powerful websites! Come with me on a short journey of a few simple lines of code in a very special file called .htaccess. .htaccess is an incredibly powerful per-directory configuration file for the Apache HTTP Server, most often found running under some flavor of Unix. Using .htaccess, one can set up security, stop bad bots and spiders dead in their tracks, prevent hot linking (bandwidth stealing), enable and use server side includes (SSI), set up search engine friendly redirects, and much, much more. Just add a few lines of code, call it .htaccess, and put some real power in your web site. Many of you may see this as somewhat simplistic. If so, it isn't written for you, so move on. On the other hand, many of you may never have heard of .htaccess, or if you have, it has been in the context of enabling custom error pages or password protecting some of your directories. If you fall into either of these categories, then this tutorial is for you. In either case, .htaccess can do much more.
.htaccess is an Apache thing, and Apache is most at home on Unix or some variant thereof. It is NOT an IIS thing. It has nothing to do with IIS or any other Microsoft HTTP server or service.
.htaccess is a simple ASCII text file. As such, it can be made and edited with Notepad or any other text editor. Of course, if you're here, then you should be way past Notepad; perhaps PrimalScript or VEdit (a couple of my personal favorites), or some other good editor.
...which reminds me; .htaccess is the filename. Period. It's not htaccess.txt, or somefile.htaccess, or anything else you can magically conjure up to change it to. It's just .htaccess. Unlike the Windows world where the "." separates the filename from the extension, the leading "." here marks .htaccess as a hidden file on Unix/Linux systems.
Anyway, open a text editor and create a file called .htaccess. You may have to type in one character as some editors won't let you save an empty file. Also, your editor may try to append its default extension to the file name, as in .htaccess.txt. You'll have to remove the .txt.
Let's move on...
Before we begin the really fun stuff, we have to get some words of caution out there...
Although Apache puts up with a lot, an ill-formed .htaccess file will aggravate it, usually to the tune of a 500 Internal Server Error for every page in the directory. Most commands in .htaccess are meant to be on one line only. So, don't use your Java/C++ prowess here and break one directive across a bunch of lines; and make sure you turn off the line wrap in your text editor.
When you whip out your favorite FTP program for uploading your .htaccess files, ensure you upload as ASCII, not BINARY. Afterwards, be sure to CHMOD the permissions to 644 (RW-R--R--). This keeps the file readable by the server but writable only by you. It doesn't, by itself, hide the file from browsers; we'll get to that later.
Depending on the circumstances, one may use multiple .htaccess files. Accordingly, .htaccess files affect the directories (and the subdirectories below) that they're placed in. The server reads every .htaccess file on the path from the top of your site down to the content being requested, and when two of them disagree, the one closest to the content wins. Place your global .htaccess file as high in the directory structure as possible; for most of us, that's the site's root directory.
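To see how two .htaccess files stack up, here's a minimal sketch. The /downloads directory is made up for illustration, and it assumes your host lets .htaccess change Options. The root file switches directory listings off for the whole site; the file in /downloads switches them back on for that one directory and everything below it.

```apache
# ---- File 1: /.htaccess (site root, applies everywhere below) ----
Options -Indexes
DirectoryIndex index.html

# ---- File 2: /downloads/.htaccess (hypothetical subdirectory) ----
# DirectoryIndex is inherited from the root file.
# Only the listing setting is overridden, and only from here down.
Options +Indexes
```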
But, before you start putting these things everywhere, read this entire document to ensure you aren't doing things over, and over, and over, and over, and over, and over, and over... You could wind up in an infinite loop of redirects or errors.
Oh yeah, comments in .htaccess start with the number symbol (#).
Finally, check with your ISP! Don't make light of this step; .htaccess can severely compromise an Admin's attempt at server configuration. Thus, many ISPs don't allow you to set up your own custom configuration files, and consider attempts to do so as hacking. So, make sure you're allowed to use .htaccess files on your domain before you attempt to upload any.
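To recap:
• Keep each directive on one line, and turn off line wrap in your editor.
• Upload as ASCII, not BINARY, and CHMOD to 644.
• An .htaccess file affects its own directory and everything below it.
• Comments start with the number symbol (#).
• Check with your ISP before you upload anything.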
Syntax: ErrorDocument Code /Directory/filename.ext
Have you ever been surfing the ole WWW and come across a link that returned a "HTTP 404 - File not found" message? It's a turn-off, huh. Other sites give you an explanation and a way to get back to their site, but this site just doesn't get it. You want to know when things go wrong, and you want to keep these people on your site, right? .htaccess to the rescue!
This is the "biggie" that everyone seems to think of when speaking of .htaccess. To that end, if you're going to make your own error documents, you need to know the error codes a server will return. They are defined in Request For Comments RFC-2616, Sec. 10, and the groups they fall into are listed below. If you need more detailed info, consult the RFC.
Successful Client Requests
• 200 - OK
Client Request Redirected
• 300 - Multiple Choices
Client Request Errors
• 400 - Bad Request
Server Errors
• 500 - Internal Server Error
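Not all of the error codes need custom pages. In fact, some would be problematic, like a 200 page. It would wind up throwing your domain into an infinite loop. The biggies are 400 (Bad Request), 401 (Authorization Required), 403 (Forbidden), 404 (Not Found), and 500 (Internal Server Error).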
Just type in the following lines to define the calls to your custom error pages in .htaccess:
ErrorDocument 400 /errors/badrequest.html
Of course, you'll have to write the actual pages, but if you were to define custom pages for all of the examples listed above, your .htaccess would look like:
# Defining custom error pages...
ErrorDocument 400 /errors/badrequest.html
ErrorDocument 401 /errors/authreq.html
ErrorDocument 403 /errors/forbidden.html
ErrorDocument 404 /errors/notfound.html
ErrorDocument 500 /errors/servererr.html
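ErrorDocument doesn't have to point at a file. As a rough sketch, assuming Apache 2.x quoting rules, it will also take a short message in quotes, which is handy while you're still writing the real pages. The message text here is made up.

```apache
# Quick-and-dirty message until the real page is written (Apache 2.x syntax)
ErrorDocument 403 "Sorry, this directory is off limits."

# Local page; the path starts with / and is relative to your site root
ErrorDocument 404 /errors/notfound.html
```

Stick with local paths like the ones above rather than full http:// URLs. With a full URL, Apache sends the browser a redirect, so the original error code never reaches the client.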
We programmers are generally a lazy lot; our basic credo being "Work smarter, not harder." Typically, we're genetically geared to write code once, and use it a lot instead of writing a lot of code a lot of times. For example, all of the error documents listed above use the same menu on the left side of the page... I wrote it once and with one line of code on the error page, I just call it up and plug it in. Another advantage is that if I ever change it, I change it one place and it shows up everywhere I've called it. These are called Server Side Includes, or SSI for short.
** CAUTION ** Don't do this and beg forgiveness; ASK FIRST! This could violate the TOS with your web host and wind up losing you your domain and the fees you paid to get it. And depending on where you are in the food chain, it could be considered hacking, which is illegal in many places. But, if you get the go ahead, well, go ahead.
# Define SSI types and handlers...
AddType text/html .shtml
AddHandler server-parsed .shtml
Options Indexes FollowSymLinks Includes
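Congrats! SSI is now enabled. The first line is just a comment. The second line tells the server that web pages with a .shtml (Server parsed .html) extension are valid. The third line is the actual SSI handler bit; it says that any file ending in .shtml should be parsed for server side inclusions. As for the last line, Includes is the option that actually permits SSI in this directory, while Indexes and FollowSymLinks ride along so directory listings and symbolic links keep working. Because the list has no + or - signs, it replaces whatever options were set higher up, so leave Indexes off if you don't want directory listings.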
Now, be a good web weenie and rename all 743 of your web pages with a .shtml extension. What? What do you mean, "I don't want to"? In that case, change the code above to:
# Define SSI types and handlers...
AddType text/html .html .shtml
AddHandler server-parsed .html .shtml
Options Indexes FollowSymLinks Includes
Now, all of your files ending in .html or .shtml will be parsed for SSI. There's an upside and downside to this. The downside is that the server now parses every one of them: not just the handful of pages that actually contain includes, but also the 741 .html files you never touched, putting a considerable load on your server with no reason for it. The upside is that if you continue naming everything with .html, none of your users will know that you're using SSI, which can be a handy security thing. Just be aware that there can be performance tradeoffs.
One more thing; if you're going to keep all SSI pages with a .shtml extension, and you want your index page to use SSI, you'll need to add the following line to your .htaccess:
# Define index alternates...
DirectoryIndex index.shtml index.html
Now, index.shtml is your default page, but if it can't be found, then index.html will load instead.
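The AddHandler server-parsed lines above are the classic way of doing it. On Apache 2.x, the same setup is usually written with an output filter instead. This is a sketch that assumes mod_include is loaded on your server; ask your host if you're not sure.

```apache
# SSI the Apache 2.x way (assumes mod_include is available)
# +Includes adds SSI without disturbing options set higher up
Options +Includes
AddType text/html .shtml
AddOutputFilter INCLUDES .shtml

# Let index.shtml act as the default page
DirectoryIndex index.shtml index.html
```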
Syntax: XBitHack status
Valid status values are:
• off - no action on executable files
• on - all files with user-execute bit set will be parsed
• full - same as on, but allows clients and proxies to cache results
XBitHack (pronounced "X bit hack") is an alternative method to invoke SSI, and it's a two step process. First, CHMOD the page files, all of the page files, and only the page files, that you want parsed to 744 instead of 644. This is what tells the server to parse the page. Next, issue the .htaccess directive:
# Turn on SSI using XBitHack...
XBitHack on
And, you're done. XBitHack and SSI are now on. One caution though; if you used the AddType/AddHandler method above, REMOVE those two lines (keep the Options line, since Includes still has to be switched on). It's XBitHack or AddHandler, but not both. Personally, I prefer to use the AddType/AddHandler method because I can look at the file extension and know if I'm parsing the page.
Syntax: Redirect [Code] url-path url
Syntax: Redirect [Status] url-path url
Syntax: RedirectTemp url-path url
Syntax: RedirectPermanent url-path url
Have you ever wanted to reorganize your web pages to make better sense? Has your ISP ever said "We've reorganized our servers, so you have to reorganize your stuff"? Have you ever wanted to just change some things around, but thought you couldn't because the search engines already ranked your pages and you didn't want to tempt fate? .htaccess provides a simple and search engine friendly alternative. Redirect maps an old page's URL to a new URL. It's actually pretty cool; the client requests the old page. .htaccess returns the page's new URL. The client then requests the page with the new URL. The client never knows anything happened, and the search engines just update their stuff automatically.
So, here's your situation. You set up all of your custom error pages just as described above.
# Defining custom error pages...
ErrorDocument 400 /errors/badrequest.html
ErrorDocument 401 /errors/authreq.html
ErrorDocument 403 /errors/forbidden.html
ErrorDocument 404 /errors/notfound.html
ErrorDocument 500 /errors/servererr.html
But you've set up your fancy-shmancy photos database and want to capture its errors to your /errors subdirectory. You're going to have all these error log files in with your custom error pages, so you want to move your error pages to the new /cep (CustomErrorPages) directory instead. You're thinking "I'll just rewrite their new locations in my .htaccess like I did before,"
# Defining custom error pages...
ErrorDocument 400 /cep/badrequest.html
ErrorDocument 401 /cep/authreq.html
ErrorDocument 403 /cep/forbidden.html
ErrorDocument 404 /cep/notfound.html
ErrorDocument 500 /cep/servererr.html
But, you did such a fine job on the original error docs that the folks over at the 404 Research Lab have linked your error pages to show the rest of the world how it's done. The snippet above will tell your server where the new files are, but not the rest of the web. You can't just move your error pages and leave it at that... So, you redirect them (make sure you change yoursite.com to your actual site):
#Setting permanent redirects...
Redirect 301 /errors/badrequest.html http://www.yoursite.com/cep/badrequest.html
Redirect 301 /errors/authreq.html http://www.yoursite.com/cep/authreq.html
Redirect 301 /errors/forbidden.html http://www.yoursite.com/cep/forbidden.html
Redirect 301 /errors/notfound.html http://www.yoursite.com/cep/notfound.html
Redirect 301 /errors/servererr.html http://www.yoursite.com/cep/servererr.html
Some notes on Redirect:
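One note that's easy to show in code: Redirect matches on the front of the URL-path, so a single line can move a whole directory. The sketch below assumes every file under /errors should land at the same name under /cep.

```apache
# /errors/notfound.html -> http://www.yoursite.com/cep/notfound.html
# /errors/anything.html -> http://www.yoursite.com/cep/anything.html
Redirect 301 /errors http://www.yoursite.com/cep

# Equivalent spellings of the same permanent redirect (use only one):
# Redirect permanent /errors http://www.yoursite.com/cep
# RedirectPermanent /errors http://www.yoursite.com/cep
```

The catch is that it moves everything under /errors, wanted or not, so list files one by one when only some of them have moved.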
Syntax: DirectoryIndex filename.ext [filelist]
Where:
• "filename" is name of the webpage to display
• "ext" is the webpage filename extension
• "[filelist]" are any additional filenames and extensions to append to the first webpage filename
Gee, Sam, I was wondering if you could shed some light on all this DirectoryIndex stuff for me. What is it? What does it do? Help me Sam!
Well, have you ever wondered why it is that when you go to mysite.com, it loads default.html, and when you go to yoursite.com, it loads index.html, or some other file? On most servers, there's a pre-defined list of files: index.html, index.shtml, index.php, index.asp, default.html, default.shtml, and so on through all of the "normal" file names. Using .htaccess, anyone can define this...
# Define default page...
DirectoryIndex myHomePage.html
Now, Joe-blow (the user) goes to www.yoursite.com and when he gets there, myHomePage.html will be loaded. That's all there is to it. But wait, there's more, and if you act now and order in the next 30 seconds... sorry, the TV is running in the background. Anyway, you can append more than one filename to the DirectoryIndex directive. Try this.
# Define default page...
DirectoryIndex myHomePage.shtml myHomePage.php myHomePage.html
This causes myHomePage.shtml to load first. If the server can't find it, it will load myHomePage.php instead. If it can't find that either, it will resort to loading myHomePage.html. It reads the file list from left to right.
So, let's start thinking about security a little bit. Let's say you keep all your .mp3 music files in a /music directory, and you keep all of your .jpg photos in your /images directory. You don't want someone just snoopin' around your directories and pimpin' your stuff. First, make you a couple of new html pages, specifically, musicIndex.html and imageIndex.html. Put the musicIndex.html page in your /music directory, and the imageIndex.html page in your /images directory. Now, make you a new .htaccess file and throw the lines below in it, and stick a copy of the new .htaccess in each, your /music and /images directories.
# Define default page...
DirectoryIndex musicIndex.html imageIndex.html
Ole Joe-blow (the user) finds out you have some music and photos. He starts snoopin' and in an effort to see what you have (or what he wants to steal), he tries to get some directory listings and types:
http://www.mysite.com/images
or
http://www.mysite.com/music
When he does, he'll simply get the appropriate .html pages and he's done. For an alternate way of doing this, see the section entitled "Prevent Directory Listing."
Syntax: AddType type ext
Where:
• "type" is the application type
• "ext" is the normal filename extension associated with the file
Sometimes, servers aren't set up to serve all content to all people. God knows why, but you may want to serve something like audio, video, or bitmap images to the world. If that's the case, then look below:
# Adding MIME types...
AddType application/x-shockwave-flash swf
AddType application/msword .doc
AddType audio/x-mpegurl m3u
AddType audio/x-pn-realaudio ram rm
AddType image/vnd.wap.wbmp wbmp
The first line (after the comment) adds Shockwave Flash with its corresponding .swf file extension, the second adds MS Word .doc files, the third is for .m3u playlists (the files that point at your .mp3s), the fourth is for Real Audio .ram and .rm files, and finally, we add wireless bitmap .wbmp images.
Have fun with all that.
Hot linking is bad! Come on, say it with me... three times;
"Hot linking is bad!"
"Hot linking is bad!"
"Hot linking is bad!"
Otherwise known as bandwidth stealing, hot linking refers to linking directly to non-HTML content (like images, music, or JavaScript or Perl code) that belongs to someone else. In essence, someone else is using your code, showing your photos, or playing your music without paying for either the content or the bandwidth to deliver it. Now, refer to the paragraph above and say it three more times. Using .htaccess, we can prevent this:
# Don't steal my content...
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?myurl\.com/.*$ [NC]
RewriteRule \.(mp3|gif|jpg|js|css|pl)$ - [F]
The snippet above will stop anyone from directly linking to any .mp3 music files, .gif or .jpg images, or any .js, .css, or .pl files (JavaScript, cascading style sheets, or Perl code) on your server. And if you want to get a little revenge, you can serve these nice folks a little "alternate" content. Mine is nothing to be amazed at... it's just a little 3K picture saying don't steal. You can do it too, but remember that if it's a six megapixel pic of a steaming turd, until they find out, you're still paying for the bandwidth.
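# Don't steal my content...
RewriteEngine on
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://(www\.)?myurl\.com/.*$ [NC]
# Skip the replacement image itself, or the redirect would chase its own tail
RewriteCond %{REQUEST_URI} !/dontsteal\.gif$
RewriteRule \.(mp3|gif|jpg|js|css|pl)$ http://www.myurl.com/dontsteal.gif [R,L]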
A couple of quick notes: First, make sure you replace myurl.com with your actual domain name. Second, "mod rewrite" needs to be enabled on your server for this to work. Fire off an email to your web host and they can tell you if this is on. You can put it in the same email where you ask permission to make your .htaccess file in the first place.
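If you'd rather not take the whole site down with a 500 error when mod_rewrite turns out to be missing, one option is to wrap the rules in an IfModule block. This is only a sketch; myurl.com is still a placeholder for your own domain.

```apache
# Only run the rewrite rules if mod_rewrite is actually loaded
<IfModule mod_rewrite.c>
    RewriteEngine on
    # Let through requests with no referrer at all
    RewriteCond %{HTTP_REFERER} !^$
    # Let through requests coming from our own pages
    RewriteCond %{HTTP_REFERER} !^https?://(www\.)?myurl\.com/ [NC]
    # Everybody else gets a 403 for these file types
    RewriteRule \.(mp3|gif|jpg|js|css|pl)$ - [F]
</IfModule>
```

The tradeoff: without the module, the block is skipped silently and the hot linking protection simply isn't there.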
If you happen to use, or want to use .htaccess for password protecting some portions of your website (more on this in the next section), then the location of the .htpasswd file would be plainly visible. This could be a real security no-no! To prevent someone from looking at your .htaccess file (or any other file for that matter), put the following into your .htaccess file:
# No peeking at my .htaccess file...
<Files .htaccess>
order allow,deny
deny from all
</Files>
The first line tells the server that the file named .htaccess is having rules applied to it. The second tells the server the order in which allow and deny rules are evaluated, and the third line is the specific rule: deny everybody.
Configured properly, most servers will now return a 403 error code to anyone trying to view your .htaccess file. For a little added protection, also make sure you CHMOD your .htaccess file to 644 (RW-R--R--).
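The order/deny lines above are Apache 2.2 syntax. On Apache 2.4, the same idea is normally written with Require. Here's a sketch, assuming a 2.4 server, that covers every file whose name starts with .ht, which takes care of .htpasswd as well.

```apache
# Apache 2.4 syntax: refuse to serve .htaccess, .htpasswd, and friends
<FilesMatch "^\.ht">
    Require all denied
</FilesMatch>
```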
When one builds stuff for the world wide web, it is, by default, available to the whole wide world. Generally, this is the concept we're after, but on occasion, we want to be a little more reserved. For example, those photos you want to protect may be of junior (or juniorette if that's what you have) and you want all the grandparents to be able to see them, but keep the rest of the world out. That's cool. .htaccess can deliver the goods... and the photos only to the grandparents. Here's what we'll need: a User ID and password for each grandparent, a .htpasswd file to hold them, and a few lines in .htaccess.
Let's start with the User IDs and passwords. We already know the User IDs... Grammy, Grampy, Granny, and Papaw. We'll give them the following passwords:
Grammy's password is "spoiled"
Grampy's password is "rotten"
Granny's password is "cookies" and
Papaw's password is "gonefishin"
Next, we need a utility to encrypt the passwords. Listed below are a couple of places you can use to encrypt your passwords:
4webhelp online encryption utility
Alterlinks online encryption utility
I like both of these utilities because they make you confirm the passwords you've typed in.
Now whip out that ASCII editor you've been using for the .htaccess file, and make a new file. Name this one .htpasswd. After you have the file created, go to one of the online encryption utilities above and type in:
grammy
spoiled
spoiled (to confirm)
Note that we typed in grammy, not Grammy. UserIDs and passwords are case sensitive. Anyway, the utility should come back with "grammy" followed by a colon, and something cryptic. Something like:
grammy:51mwL4OPxfBoI
or
grammy:211ahsKkwIbzw
Copy and paste that into the .htpasswd file you've just created. Now, go back to the online encryption utility and type in the rest for Grampy, Granny, and Papaw. Copy and paste the returns from the encryption utility into the .htpasswd file, each on one line only. Our .htpasswd file should look something like this:
# UserIDs and passwords...
grammy:62TfNbMH9aDrw
grampy:564BNxv5yRobU
granny:95AB0MbUnNilw
papaw:20ZYH7Bmawfgs
Now, let's tackle the .htaccess file. You'll need just a few lines of code, and we'll talk about each one.
AuthName "Realm Name"
AuthUserFile /complete/server/path/.htpasswd
AuthType Basic
require valid-user
The AuthName "Realm Name" directive is the text that will appear at the top of the user validation dialog box when it pops up. When we write ours, we'll change "Realm Name" to "Grandparents Only."
The AuthUserFile /complete/server/path/.htpasswd directive is the full server path to the .htpasswd file. It is not a URL. And, as with the .htaccess file, the .htpasswd file should be uploaded as ASCII, not BINARY, and should be CHMOD to 644 (RW-R--R--). Also, with security in mind, you shouldn't upload your .htpasswd file to any web accessible directory (for example, www.mysite.com/.htpasswd). Rather, it should be in a server path like /users/local/safedir/.htpasswd. Check your directory structure, but make sure you put it above the entry point for web users.
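The AuthType directive sets the type of authentication, and is set to Basic. Basic is the type every browser and server understands, but it sends the user ID and password across the wire only lightly encoded (base64), not encrypted. A second type, Digest, is more secure, but it needs its own module and its own password file format, so check with your host before counting on it.
The require valid-user directive tells Apache to let in anyone who logs in with a user ID and password found in the .htpasswd file. The login window that pops up is how the browser collects those credentials.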
Ultimately, for our purposes, our .htaccess entry should look like this:
# Set up password protection...
AuthName "Grandparents Only"
AuthUserFile /users/local/safedir/.htpasswd
AuthType Basic
require valid-user
Make your .htpasswd file and you're all set.
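If only some of the people in the .htpasswd file should get into a particular directory, Require can name them instead of accepting any valid user. A small sketch using two of the user IDs from above:

```apache
# Same setup, but only grammy and papaw get in
AuthName "Grandparents Only"
AuthUserFile /users/local/safedir/.htpasswd
AuthType Basic
Require user grammy papaw
```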
Syntax: IndexIgnore filespec [filelist]
Where:
• "filespec" is file specification to block
• "[filelist]" are any additional file specifications to append to the first filespec
So, you've decided "I think Sam is on to something with that not letting people see my /music or /images directories" idea. But, you're not real comfy with trying to manage all those .htaccess files that would be running around either.
Most servers are now configured to prevent directory listings, but if yours isn't, take the matter into your own hands.
# Prevent directory listings...
IndexIgnore *
All done. The "*" in the code snippet above is a wildcard that matches all files and tells Apache not to list anything, at least from the point where it encounters the directive and anything below that point in the directory structure. Of course, you may want to list some stuff. In that case, try this:
# Prevent directory listings...
IndexIgnore *.jpg *.mp3
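Now, no one could list any of the photos or music files you're so interested in protecting, but all of the .html and .txt files, or anything not .mp3 and .jpg, could be listed. Keep in mind that IndexIgnore only hides files from the listing; anyone who already knows a file's URL can still request it directly.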
If your server has already been configured to show no directory listings and you want your users to have that kind of freedom, just tell .htaccess, as in:
# Show directory listings...
Options +Indexes
This will turn the directory listings on. Be careful though, as this could be a security nightmare and someone else could "own" the domain you paid for.
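Going the other way, if the listings are on and you want them off entirely, the same Options directive does it with a minus sign. This is a sketch that assumes your host allows Options in .htaccess.

```apache
# Turn directory listings off...
# A directory with no index page now returns 403 Forbidden instead of a file list
Options -Indexes
```

Unlike IndexIgnore, which still shows an (empty) listing page, this refuses the listing altogether.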
So, you want to find out about puppies, or transistor theory, or .htaccess, or anything else... how do you do it? It's fairly simple; you just head over to Google, Yahoo, MSN, or any other search engine and look it up. Question is, how do they find it? Well, they use spiders, also known as crawlers or web bots. Basically, these are automated programs that traverse the web looking for web pages to index. These are good things if you want your content found. With a little luck, normal people can head over to the search engines and find your content too.
However, believe me when I tell you that the bad guys are out there in full force, and they use spiders and web bots too! Some of these leeches don't make any pretense and figure if you're too stupid to protect yourself and your content, then you deserve what you get. In essence, if you don't want to have your stuff abused, or stolen, then don't put it on the web. Mark Pilgrim said it best:
"Some will say that the Internet is a public place, and if I don’t want something abused, I shouldn’t put it on the Internet. Well, that’s true. It is also true that if I don’t want to get mugged, I shouldn’t leave my house, and if I don’t want calls from telemarketers, I shouldn’t have a phone. But I like leaving my house, I like having a phone, and I like having this web site. I fight back against telemarketers who abuse my phone, and now I’m fighting back against robots who abuse my web site."
Others that use spiders will tell you that their cause is noble; it is to prevent plagiarism (turnitin.com), or content theft and copyright violation (Cyveillance and Copyscape), or some other transgression. Indeed, these causes may all be noble, but the methodology employed is less than admirable. For instance, is it fair to assume you are a copyright thief and download hundreds of pages of content, only to leave you with the bandwidth bill and find out you're an honest citizen? Of course, it's not enough to check once... these things may hit your site two, five, or 70 times a month.
The robots.txt standard was developed and adopted to prevent unwanted robots from traversing a web site. And, you should probably have a robots.txt file on your site for the good spiders. While Google, Yahoo, MSN, and others honor it, others don't, and couldn't care less if you don't want their bot on your site. For those folks, we have .htaccess.
Keep in mind that banning any particular bot from your site is a very personal decision. What some would call a good bot, others would never want around. So, make your own decisions and take some, none, all, any, or whatever you like from my own list (below) with a grain of salt. If one subscribes to my ideas of what bad bots are, it's a pretty good list to start with. Having said that...
.htaccess provides us with several mechanisms to prevent the Internet version of pond scum from getting on our site. Specifically, we can ban by IP address (where the scum reports back to), by User Agent (what the scum calls itself), by Referrer (the URL the scum uses), or all of them together. We'll look at all three methods below.
Now, you can ban IP addresses with an "order allow,deny" directive, and we'll talk about that in a minute. If you choose any of the regex-based methods below, though, you'll need a few lines of code in your .htaccess file to switch on the rewrite engine first. Open, edit, and add the following:
# Turn on RewriteEngine
Options +FollowSymlinks
RewriteEngine On
RewriteBase /
Let's go kick some bot hiney!
One thing .htaccess lets us do is prevent an IP address or range of addresses from getting on our site, and within the concept, .htaccess lets us do it in a couple of different ways.
First, we can use the "order allow,deny" directive.
# Banning by IP Address...
# (comments go on their own lines; Apache doesn't allow them after a directive)
order allow,deny
# blocks specific IP
deny from 123.45.67.89
# blocks address range
deny from 221.111.222.
deny from 111.112.113.
allow from all
In the first deny, the specific IP is banned, but in the other two cases, it's blocks of addresses that are banned. Regardless of the digits in the last octet, they're all gone! Don't know why you'd do it, but cool trick, huh? One could also issue the "deny from all" statement, and NO ONE would get on.
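The order/deny/allow directives are Apache 2.2 syntax. On Apache 2.4 they only work through a compatibility module, and the native way to write the same ban uses Require. A sketch, assuming a 2.4 server and the same made-up addresses as above:

```apache
# Banning by IP Address, Apache 2.4 style
<RequireAll>
    # Everyone is welcome...
    Require all granted
    # ...except this specific IP
    Require not ip 123.45.67.89
    # ...and these address ranges (any last octet)
    Require not ip 221.111.222.0/24
    Require not ip 111.112.113.0/24
</RequireAll>
```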
Another method we can use to ban IP addresses is through the use of regular expressions. Regular expressions (regex or regexp for short) are ways to formulate strings that match search patterns. Entire books have been written about the subject, but a couple of good tutorials can be found at regular-expressions.info and Virginia.edu.
# Banning by IP Address...
# Cyveillance spybot
RewriteCond %{REMOTE_ADDR} ^65\.118\.41\.(19[2-9]|2[0-1][0-9]|22[0-3])$ [OR]
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$ [OR]
# NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR]
# No [OR] on the last condition; a dangling [OR] can make the rule match every request
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$
RewriteRule .* - [F]
Either way, it's your choice.
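Just as with an IP address, we can use the "order allow,deny" directive with a domain name. Strictly speaking, this matches the host name the visitor is connecting from (found by a reverse DNS lookup), not the Referer header, so it blocks a company's own machines rather than people following its links. For example:
# Banning by host name...
order allow,deny
# blocks hosts in cyveillance.com
deny from .cyveillance.com
# blocks hosts in turnitin.com
deny from .turnitin.com
allow from all
To ban by the Referer header itself, we use regular expressions. Fact is, this is the method described in most .htaccess tutorials.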
# Banning by Referrer...
RewriteCond %{HTTP_REFERER} guestbook [NC,OR]
RewriteCond %{HTTP_REFERER} anysite\.com [NC,OR]
RewriteCond %{HTTP_REFERER} iaea\.org [NC]
RewriteRule .* - [F]
Note the absence of the "OR" clause in the last RewriteCond statement. Here again, it's your choice, but I'd recommend the regex method.
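If mod_rewrite isn't available, a Referer ban can also be sketched with SetEnvIfNoCase, which tags the request with a variable, and a deny rule that keys off that variable. This assumes mod_setenvif is loaded and Apache 2.2 style access control; bad_referer is just a name picked for illustration.

```apache
# Tag any request whose Referer header matches one of these patterns
SetEnvIfNoCase Referer "guestbook" bad_referer
SetEnvIfNoCase Referer "iaea\.org" bad_referer

# Let everyone in except tagged requests
Order Allow,Deny
Allow from all
Deny from env=bad_referer
```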
Depending on who you ask, defining a bad bot can be a real exercise. The only thing that most agree on is that they need to be exorcised! It's pretty well agreed upon that off-line site rippers, email harvesters, bots that ignore robots.txt, and any other ilk that abuses your site without returning any value is a bad bot.
In the code snippet below, the first three are done for readability, while the last three are written using regex. They may seem a little more convoluted, but are considerably more powerful. Read up on it over at regular-expressions.info. They are way cool.
# Banning BOTS By User Agent...
RewriteCond %{HTTP_USER_AGENT} ^Alligator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^attach [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BackWeb [NC,OR]
# Spambot
RewriteCond %{HTTP_USER_AGENT} ^[A-Z]+$ [NC,OR]
# IG
RewriteCond %{HTTP_USER_AGENT} ^(AcoiRobot|FlickBot|webcollage) [NC,OR]
# SB
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC]
RewriteRule .* - [F,L]
Again, note the absence of the "OR" clause in the last RewriteCond statement. Now, go kick some bad bot booty! And, although the bot names are cased properly, the "NC" in the [NC,OR] portion of the clause indicates that it's "No Case", or case insensitive.
One of my biggest problems, when trying to explain this stuff to others, is that all of the other tutorials never show the big picture; they all show code snippets, but none of them show a completed .htaccess file. To that end, if you've been keeping up, or if you want to just cheat a little, here's a working .htaccess file. Of course, you'll have to change some of the names to make it work on your site. For example, you will have to change yoursite.com to your actual domain name.
For what it's worth, this document is somewhat superficial, and .htaccess can do so much more than what we've discussed here. And, I'd love to tell you that as far as security is concerned, this is the final word, but it's not. For more information on .htaccess security, see the three part series on Hardening .htaccess. Careful though. This is the stuff that can turn one into a real geek!
Also, the section below on banning bad guys makes heavy use of Regular Expressions. For more information on the subject, see the regular-expressions.info website. So, without further ado:
# Defining custom error pages...
ErrorDocument 400 /cep/badrequest.html
ErrorDocument 401 /cep/authreq.html
ErrorDocument 403 /cep/forbidden.html
ErrorDocument 404 /cep/notfound.html
ErrorDocument 500 /cep/servererr.html
# Define SSI server parsed documents...
AddType text/html .html .shtml
AddHandler server-parsed .shtml .html
Options Indexes FollowSymLinks Includes
# Define index alternates...
DirectoryIndex index.shtml index.html
# Set up password protection...
AuthName "Grandparents Only"
AuthUserFile /users/local/safedir/.htpasswd
AuthType Basic
require valid-user
# Prevent directory listings...
IndexIgnore *.jpg *.mp3
#Setting permanent redirects...
Redirect 301 /errors/badrequest.html http://www.yoursite.com/cep/badrequest.html
Redirect 301 /errors/authreq.html http://www.yoursite.com/cep/authreq.html
Redirect 301 /errors/forbidden.html http://www.yoursite.com/cep/forbidden.html
Redirect 301 /errors/notfound.html http://www.yoursite.com/cep/notfound.html
Redirect 301 /errors/servererr.html http://www.yoursite.com/cep/servererr.html
Options +FollowSymLinks
RewriteEngine On
RewriteBase /
# Banning by Referrer...
RewriteCond %{HTTP_REFERER} guestbook [NC,OR]
RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR]
# Banning by IP Address...
# Cyveillance spybot
RewriteCond %{REMOTE_ADDR} ^65\.118\.41\.(19[2-9]|2[0-1][0-9]|22[0-3])$ [NC,OR]
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$ [NC,OR]
# NameProtect spybot
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [NC,OR]
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [NC,OR]
# Banning by User Agent
# Address harvesters
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent|Email.?Extrac) [NC,OR]
# Download managers
RewriteCond %{HTTP_USER_AGENT} ^(Alligator|DA.?[0-9]|DC\-Sakura|Download.?(Demon|Express|Master|Wonder)|FileHound) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Flash|Leech)Get [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Fresh|Lightning|Mass|Real|Smart|Speed|Star).?Download(er)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Gamespy|Go!Zilla|iGetter|JetCar|Net(Ants|Pumper)|SiteSnagger|Teleport.?Pro|WebReaper) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(My)?GetRight [NC,OR]
# Image-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(AcoiRobot|FlickBot|webcollage) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Express|Mister|Web).?(Web|Pix|Image).?(Pictures|Collector)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image.?(fetch|Stripper|Sucker) [NC,OR]
# "Gray-hats"
RewriteCond %{HTTP_USER_AGENT} ^(Atomz|BlackWidow|BlogBot|EasyDL|Marketwave|Sqworm|SurveyBot|Webclipping\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (girafa\.com|gossamer\-threads\.com|grub\-client|Netcraft|Nutch) [NC,OR]
# Site-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(eCatch|(Get|Super)Bot|Kapere|HTTrack|JOC|Offline|UtilMind|Xaldon) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(Auto|Cop|dup|Fetch|Filter|Gather|Go|Leach|Mine|Mirror|Pix|QL|RACE|Sauger) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Web.?(site.?(eXtractor|Quester)|Snake|ster|Strip|Suck|vac|walk|Whacker|ZIP) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCapture [NC,OR]
# Tools
RewriteCond %{HTTP_USER_AGENT} ^(curl|Dart.?Communications|Enfish|htdig|Java|larbin) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (FrontPage|Indy.?Library|RPT\-HTTPClient) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(libwww|lwp|PHP|Python|www\.thatrobotsite\.com|webbandit|Wget|Zeus) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Microsoft|MFC).(Data|Internet|URL|WebDAV|Foundation).(Access|Explorer|Control|MiniRedir|Class) [NC,OR]
# Unknown
RewriteCond %{HTTP_USER_AGENT} ^(Crawl_Application|Lachesis|Nutscrape) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^[CDEFPRS](Browse|Eval|Surf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Demo|Full.?Web|Lite|Production|Franklin|Missauga|Missigua).?(Bot|Locat) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net|hhjhj@yahoo\.com|lerly\.net|mapfeatures\.net|metacarta\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Industry|Internet|IUFW|Lincoln|Missouri|Program).?(Program|Explore|Web|State|College|Shareware) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mac|Ram|Educate|WEP).?(Finder|Search) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NaverRobot [NC]
RewriteRule .* - [F]
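If the ban block above looks like a wall of text, it may help to see its skeleton. The following is only a sketch of the pattern, not part of the working file: the referrer domain, the 192.0.2.x address range (a block reserved for documentation), and the ExampleBadBot user agent are all made-up placeholders.

```apache
# Sketch only. The referrer, IP range, and user agent below are
# hypothetical placeholders, not real offenders.
RewriteEngine On

# Every condition except the last carries OR, so a match on ANY one is enough.
RewriteCond %{HTTP_REFERER} spammy\.example\.com [NC,OR]

# 192.0.2.192 through 192.0.2.223, one alternative per sub-range:
#   19[2-9]      covers 192-199
#   2[0-1][0-9]  covers 200-219
#   22[0-3]      covers 220-223
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.(19[2-9]|2[0-1][0-9]|22[0-3])$ [OR]

# Last condition: no OR flag, which closes the chain.
RewriteCond %{HTTP_USER_AGENT} ^ExampleBadBot [NC]

# "-" means no substitution; F answers the request with 403 Forbidden.
RewriteRule .* - [F]
```

Each RewriteCond except the last ends with OR, so a hit on any single line is enough, and the one RewriteRule at the bottom does the actual refusing. When you add your own bad guy, slot the new line in anywhere above the last condition and give it the OR flag.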
Have fun helping your legitimate users and getting rid of the riffraff!
On many systems, the . (dot) identifies .htaccess as a system file. As such, operating systems hide it. However, all is not lost. If you can't see or find your .htaccess file to edit it, check below for a quick fix, or Google the particulars for your FTP client and OS if it isn't listed below. Also, many HTML editors (1st Page, Nvu, Dreamweaver) come with integrated FTP clients... if that's your case, check your documentation.
Using these various FTP programs:
Cute FTP Pro (v3.0):
right-click the connection
select "Site Properties"
select the "Actions" tab
select the "Filter" button
check "Enable server side filtering"
in the "Remote filter" window, type -la
FileZilla:
select "View"
check "Show hidden files"
Using these various OS programs:
Windows XP:
run Windows Explorer
select the "Tools" dropdown
select "Folder Options"
select the "View" tab
check "Show hidden files and folders"
Mac OS X:
set preferences to show hidden files
Don't move, edit or delete any other files