Alicorna Digital and Web Design
Monday, 07 July 2008
Technical Tips & Hints for the Web

Please do not copy, take, "borrow", steal, or otherwise use this material without permission. This is all original, and it's all protected by international copyright law. Feel free to link to this page, rewrite the information in your own words all you want (hey, nothing on this page is a secret!), and use the information for whatever purpose you wish, but don't just take the content, okay?

Just a few nifty little tips and tricks for webmasters and webmistresses.

Keeping Your Images From Being Indexed

There are at least three large search index sites that deliberately seek out and index images, music files, and other multimedia files. Several of these actually archive the images on their server and anyone who searches for, say, your name, can easily turn up a picture of you if you have one on your website. They can also access your images without ever having to read your copyright notices, without reading your Terms of Use statement, and without ever having to visit your site at all. Don't like that idea? Lots of graphic artists, web designers, photographers, and other artists online are disturbed by this practice (and at least one has brought a lawsuit against one of the indexing systems, charging copyright violation).

The good news is that it's a fairly easy task to keep these index spiders from grabbing your images, particularly if you have your own domain. At the time of this writing, there are four major spiders that do this (as far as I know), and all of them can be prevented from indexing your images (or music files or whatever) with a simple robots.txt file.

This isn't the place for me to explain all the ins and outs of a robots.txt file or the robots.txt protocol (for more detailed information, try: The Web Robots Page). In a nutshell, a "robots.txt" file is a text file that you place in the root of your domain. If you don't own the domain or don't have access to the domain root directory, you'll have to talk to your system administrator or webhost about the robots.txt issue; some search engines do honor a robots.txt in a subdirectory, though, so you may as well give it a try. What I'm going to do here is give the formula to keep the four biggest known image index spiders out of your site using robots.txt.

In the most basic form:

User-agent: vscooter
User-agent: DittoSpyder
User-agent: Googlebot-Image
User-agent: psbot
Disallow: /


User-agent: Googlebot
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$

Note that "vscooter" is the name of Alta Vista's image indexing spider (their page indexing spider is "scooter" so preventing vscooter from getting into your site doesn't keep your site from being indexed). DittoSpyder is the one for ditto.com (you don't want them at all; the only thing they do is index images!), and psbot is the spider for Picsearch, which is also an image indexer. Googlebot-Image is Google's image index spider. In this formula, all of those spiders are ordered to stay out completely.
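One way to sanity-check a robots.txt like this before uploading it is to run it through a parser. The following is a minimal sketch using Python's standard urllib.robotparser module. The example.com URL and the "SomeOtherBot" name are made up for illustration, and a real spider may read the file slightly differently than this parser does. Note that the sketch keeps the User-agent lines and the Disallow line together with no blank line between them, since a blank line ends a record.

```python
from urllib.robotparser import RobotFileParser

# The first record from the formula above, written as one unbroken group.
ROBOTS_TXT = """\
User-agent: vscooter
User-agent: DittoSpyder
User-agent: Googlebot-Image
User-agent: psbot
Disallow: /
"""

# Hypothetical image URL, for illustration only.
IMAGE_URL = "http://example.com/images/photo.jpg"


def is_allowed(robot_name, url, robots_txt=ROBOTS_TXT):
    """Return True if robot_name may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(robot_name, url)


for name in ("vscooter", "DittoSpyder", "Googlebot-Image", "psbot",
             "scooter", "SomeOtherBot"):
    print(name, "allowed" if is_allowed(name, IMAGE_URL) else "blocked")
```

The four image spiders should come back blocked, while scooter (Alta Vista's page spider) and any robot not named in the file remain allowed.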

However, Google provides conflicting information on how to keep images from being indexed. The second entry for "Googlebot" follows a peculiar method for keeping them from indexing your images, as shown (note: you can modify that to make them exclude any kind of file, including .zip, .mid, .wav, etc.). Please note that this method is not correct or valid in the usual protocol, so while it works fine for Googlebot, don't try to apply it to any other search engine bot (or your site may be excluded entirely and therefore not even get the pages indexed). Also, just so you know, here's the page at Google where I found the protocol for excluding images, and here is the page at Google that discusses the Googlebot-Image spider.
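To see what those Googlebot patterns actually match, here is a rough sketch in Python. It assumes the semantics Google documents for its own spider: "*" stands for any run of characters and a trailing "$" anchors the end of the URL. The function names and sample paths are made up for illustration, and this is not Google's actual matching code.

```python
import re

GOOGLEBOT_RULES = ["/*.gif$", "/*.jpg$", "/*.jpeg$", "/*.png$"]


def rule_to_regex(rule):
    """Translate a Googlebot-style Disallow pattern into a compiled regex."""
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape every piece literally, then let each "*" stand for any characters.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored else ""))


def is_disallowed(path, rules=GOOGLEBOT_RULES):
    """Return True if any rule matches the URL path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


# Hypothetical paths, for illustration only.
for path in ("/images/logo.gif", "/photos/me.jpeg",
             "/index.html", "/logo.gif.html"):
    print(path, "blocked" if is_disallowed(path) else "allowed")
```

Adding an entry such as "/*.zip$" to the list extends the check to another file type, just as you would extend the robots.txt entry itself.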

What if you don't have access to the robots.txt file? You can still protect your images using the META tag (at least somewhat). In between the <head></head> tags in your HTML, put the following tag:

<META NAME="robots" CONTENT="noimageindex">

Not all spiders recognize this tag, unfortunately. As far as I'm aware, this only works with Alta Vista. However, if you have images you really want to protect, you can put them on a page and then keep all robots/spiders from indexing the page this way:

<META NAME="robots" CONTENT="noindex">

As far as I know, all reputable index spiders do obey that.

You can also prevent a page from being put into a cache with the following META directive:

<META NAME="robots" CONTENT="nocache,noarchive">

Again, as far as I'm aware, all reputable index spiders honor that tag.
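If you want to double-check which robots directives a page is actually sending, a short script can read them back out of the HTML. This is a minimal sketch using Python's standard html.parser module; the class name, function name, and sample page are made up for illustration, and it only reports what is in the markup, not what any particular spider will do with it.

```python
from html.parser import HTMLParser


class RobotsMetaReader(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names,
        # so <META NAME=...> is handled as well.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html_text):
    """Return the set of robots directives declared in html_text."""
    reader = RobotsMetaReader()
    reader.feed(html_text)
    reader.close()
    return reader.directives


# Hypothetical page using the tags shown above.
SAMPLE_PAGE = """<html><head>
<META NAME="robots" CONTENT="noindex">
<META NAME="robots" CONTENT="nocache,noarchive">
</head><body>...</body></html>"""

print(sorted(robots_directives(SAMPLE_PAGE)))
```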

Important: You should do a follow-up if you can, to make sure your images will be removed. With Alta Vista, you can simply re-submit your site and they will remove the forbidden material when they encounter the updated robots.txt file. Google will remove images the next time they spider your site, or you can email them and request they send the spider within 48 hours. With Ditto, you need to request they remove it using an online form. Do this only after you have updated the robots.txt file or added META tags, or it's an exercise in futility.



Preventing the Internet Archive "Wayback Machine" from Indexing Your Pages

First, let me say that the service offered by The Internet Archive is really pretty interesting. You may or may not object to having your pages archived by them. What they do is keep records, sometimes going back years (as far back as 1996), of web pages. They archive them, and you can retrieve them from various dates in history. I went and had a look at the personal homepage I had online in November 1997 (pretty embarrassing now, but it was cool at the time).

However, many people, myself included, feel that when they remove pages from the web, they want to take them out of public viewing. They also don't like having their images archived, even if they're "old" ones. You'll have to decide for yourself if this is something you want to allow or disallow, and there are plenty of good reasons for letting old stuff be archived, just as there are plenty of reasons to object. This little article is to show you how to prevent it, because if you want to allow it, that's a lot easier.

As with the tip above, you'll need to use a robots.txt entry or prevent robots from archiving or indexing pages with HTML. If you skipped over the stuff immediately above this entry, go have a quick read, because most of those techniques are what you'll be using for this tip. Go on. I'll wait.

Okay. Back? Here's the pertinent information you want to have in your robots.txt file:

User-Agent: ia_archiver
Disallow: /

What that says is that ia_archiver (the name of the Internet Archive's spider) isn't allowed to enter your site. As with any other spider, you can limit it to specific directories if you wish, as well. To combine this with other robot directives to stay out, you would do something like this:

User-agent: vscooter
User-agent: DittoSpyder
User-agent: Googlebot-Image
User-agent: ia_archiver
User-agent: psbot
Disallow: /

That keeps out all of the spiders in the previous entry plus the Internet Archive spider, as well.
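For the directory-limited variant mentioned above, a sketch like the following can confirm which paths a rule covers. It uses Python's standard urllib.robotparser module; the /old-site/ directory and the example.com URLs are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rule: keep the Internet Archive's spider out of one directory only.
ROBOTS_TXT_LIMITED = """\
User-agent: ia_archiver
Disallow: /old-site/
"""


def archiver_may_fetch(url, robots_txt=ROBOTS_TXT_LIMITED):
    """Return True if ia_archiver may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ia_archiver", url)


for url in ("http://example.com/old-site/index.html",
            "http://example.com/current/index.html"):
    print(url, "allowed" if archiver_may_fetch(url) else "blocked")
```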

If you don't have access to robots.txt (i.e., your domain is hosted in such a way that it's not in the "root" for the domain), you can use the exact same HTML additions as above.

I got this information directly from the Internet Archive site, which is where you need to go once you have your robots.txt (or HTML additions) in place. Just submit your site for archiving (follow the link on the page) and when the spider comes across the robots.txt or HTML that forbids it, your pages will be removed (at least, they will be when the spider gets around to your site to be indexed and they update their own database).

If you want to remove pages that you no longer have access to (such as in an account you no longer have), you'll need to email them. My experience has been that they're quite responsive to requests to remove materials.



Preventing Bandwidth Theft (direct-linking) with .htaccess

In a nutshell, bandwidth theft is when someone links directly to your image(s), drawing the image from your server to their pages. This causes undue traffic on your server and can cost you money, as well, if you pay for bandwidth. (Note: this can happen with all kinds of files, not just images, but I'm concentrating on "images" because that's the most common and it's easier than always adding in other files. This definitely does work for .zip, .wav, .mov, and anything else you're likely to have on your website).

This technique only prevents direct-linking. It does not prevent people from just plain stealing your images by right-clicking or other means. (Preventing right-clicking takes JavaScript, which doesn't work in all browsers and is hardly "foolproof"; for places to find free scripts to prevent right-clicking, check out the Resources Index and visit some of the JavaScript repositories listed.)

Yes, this really works to prevent direct-linking and bandwidth theft. I've tested it extensively on several domains and so far it's always worked like a charm (although I did hear from someone that she managed to circumvent it; I haven't been able to reproduce that, though, not even with extensive testing). If you're unsure about fiddling with the .htaccess file (and it can be tricky), you might want to check with your webhost or with a site such as The Comprehensive Guide to .htaccess. Note that .htaccess is very useful for lots of things, including setting up custom error pages, making password-protected directories, denying access to certain addresses, redirecting pages and lots more. It's well worth looking into.

To use this, you must have the ability to use .htaccess, and mod_rewrite must be installed on the server where your pages are hosted. In almost any decent domain hosting situation, this will be the case already, but as always, you should check with your webhost or service provider if you're unsure. The actual .htaccess file is a text file named .htaccess (yes, be sure to use the dot) and stored in the root directory (or others, for some purposes; to protect your entire domain, you want it in the root).

The formula below assumes that the RewriteEngine is already turned on for your domain, but it may not be. Try it first and see, or check with your system administrator and/or tech support to find out. If it isn't turned on, you need to start your .htaccess file with this directive (including it when the engine is already on does no harm):

RewriteEngine On

Then, the exact formula to prevent the actual bandwidth theft:

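# Let requests with an empty referer through
RewriteCond %{HTTP_REFERER} !^$
# Let requests referred by your own pages through, with or without "www"
RewriteCond %{HTTP_REFERER} !^http://yourdomain\.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^http://www\.yourdomain\.com/.*$ [NC]
# Any other request for one of these file types is refused (403 Forbidden)
RewriteRule .*\.(gif|GIF|jpg|JPG|zip|ZIP|png|PNG|swf|SWF)$ - [F]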

Note that none of these lines should wrap (but they might on this screen, depending on your screen resolution). Oh, and obviously, replace "yourdomain.com" with your domain name. You can also add in any other domains that you would like to allow to link (such as bulletin boards you use and to which you might want to remotely link an avatar or other image).

If you're good at figuring out formulae and notation, you can fairly easily alter this to prevent direct linking to any kind of file (.mid, .wav, etc.). This particular formula prevents linking to .gif, .jpg, .zip, .png, and .swf files, which you can read on the bottom line. Please, though, BE CAREFUL when playing with .htaccess. You can make all sorts of weirdnesses happen with a badly written .htaccess directive. However, don't let that deter you from learning about .htaccess! As mentioned, it's extremely useful.

When this is working, anyone who tries to access any of the prohibited files by direct linking will get either an error message or just a broken image.
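The logic of that formula can be traced outside Apache. The following is a rough Python simulation, not how mod_rewrite is implemented: it assumes the conditions are combined with AND (mod_rewrite's default), and the referers and paths are made up for illustration. The dots in the domain patterns are escaped here so that they match only literal dots.

```python
import re

# "yourdomain.com" is the same placeholder used in the formula.
ALLOWED_REFERERS = [
    re.compile(r"^http://yourdomain\.com/.*$", re.IGNORECASE),      # [NC]
    re.compile(r"^http://www\.yourdomain\.com/.*$", re.IGNORECASE),  # [NC]
]
PROTECTED_FILE = re.compile(r".*\.(gif|GIF|jpg|JPG|zip|ZIP|png|PNG|swf|SWF)$")


def is_forbidden(referer, path):
    """Mimic the formula: True means the request would be refused."""
    if not PROTECTED_FILE.search(path):
        return False  # the RewriteRule pattern does not match this file
    if referer == "":
        return False  # first condition: an empty referer is let through
    if any(allowed.search(referer) for allowed in ALLOWED_REFERERS):
        return False  # second and third conditions: your own pages
    return True


# Hypothetical requests: (referer, requested path).
sample_requests = [
    ("", "images/photo.jpg"),
    ("http://www.yourdomain.com/gallery.html", "images/photo.jpg"),
    ("http://some-other-site.example/forum.html", "images/photo.jpg"),
    ("http://some-other-site.example/forum.html", "about.html"),
]
for referer, path in sample_requests:
    verdict = "forbidden" if is_forbidden(referer, path) else "served"
    print(repr(referer), path, verdict)
```

Only the third request, an image asked for from someone else's page, should come back forbidden. To allow another site to link, you would add one more pattern to the list, which corresponds to one more RewriteCond line in the .htaccess.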

A couple of people have asked how to allow direct-linking to only one or two specific images while protecting the rest of the site. I can think of several ways to do this, but this is probably the easiest: keep all your images in a sub-directory, and put the .htaccess file that prevents linking in that directory only. Then put the images that you do want linked in a different sub-directory that isn't protected by an .htaccess, and don't put the anti-linking .htaccess in the top-level directory. It's a bit of a pain in the neck, but keeping images in a sub-directory isn't a bad way to structure a website (just remember when you link to the images from your pages to specify the sub-directory they're in).

The thing to remember is that you can put an .htaccess file in any directory. The higher the directory, the more it affects. Putting it in the top-level directory affects the whole site. If you put it in only a sub-directory, only that sub-directory (and any below it) will be affected (this is how password-protected directories are set up). An .htaccess in a sub-directory overrides "higher" .htaccess files only where it contains counter information (for example, a different location for a custom 404 page that has already been specified higher up). Where it has no "counter" information, the directives from the higher .htaccess files keep applying. One caveat: rewrite rules are a special case. By default, once a sub-directory's .htaccess contains rewrite directives of its own, those replace the rewrite rules from higher directories rather than adding to them.
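To picture which .htaccess files can come into play for a given file, here is a small Python sketch that lists the candidate locations from the top-level directory down. The document root /var/www/site and the file paths are hypothetical, and whether a server actually reads these files depends on how it is configured.

```python
from pathlib import PurePosixPath


def htaccess_candidates(doc_root, requested_file):
    """List the .htaccess locations that can affect requested_file, top level first."""
    root = PurePosixPath(doc_root)
    target_dir = PurePosixPath(requested_file).parent
    # Directories between the document root and the file, e.g. images/gallery
    relative = target_dir.relative_to(root)

    candidates = [root / ".htaccess"]
    current = root
    for part in relative.parts:
        current = current / part
        candidates.append(current / ".htaccess")
    return [str(path) for path in candidates]


# Hypothetical document root and image path.
for location in htaccess_candidates("/var/www/site",
                                    "/var/www/site/images/gallery/cat.jpg"):
    print(location)
```

Files later in the list sit "lower" in the site and take precedence where they contradict the ones above them.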



Preventing the Annoying Image Toolbar Popup in IE6+

Internet Explorer 6+ has a feature many consider annoying. Basically, when you hover on an image of 200x200px or greater, a little "image toolbar" pops up to allow you to email the image, save it to your hard drive, copy it, print it, or otherwise grab or mistreat it. A "no right click" script doesn't work against it, either.

This can be turned off in the browser, of course (I certainly turn it off!), but many people don't know this or don't know how to disable it. You, however, can make sure it won't pop up in anyone's browser while they're viewing your pages, by including this META tag in the header of the page(s):

<META HTTP-EQUIV="imagetoolbar" CONTENT="no">

If you only want to disable it for specific images (this is handy if you have an auto-generated online picture gallery, for example), you can use this attribute in the image tag itself:

<img src="image.ext" GALLERYIMG="no">

(Of course, you should change "image.ext" to the actual image name).
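For an auto-generated gallery, the attribute can be added by whatever script writes the image tags. Here is a minimal Python sketch of that idea; the function name, the gallery/ directory, and the file names are made up for illustration, and your own gallery script will almost certainly look different.

```python
from html import escape


def gallery_img_tag(src, alt=""):
    """Build an <img> tag that asks IE6+ not to show the image toolbar."""
    return '<img src="{}" alt="{}" GALLERYIMG="no">'.format(
        escape(src, quote=True), escape(alt, quote=True))


# Hypothetical gallery file names, for illustration only.
for filename in ("sunset.jpg", "cats&dogs.png"):
    print(gallery_img_tag("gallery/" + filename))
```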

If it bugs you to have to put in a special tag on each and every page or in each image to stop an annoying and potentially copyright-violating feature of a browser, you're not alone. It irritates a lot of people.


If you'd like to learn more, there are many great sites out there that can help with all kinds of technical issues. Visit the Resources index for lots of resources in this area.

 

Content and design © 1998-2005, Alicorna.
No unauthorized use or reproduction.
All rights reserved.