Please do not copy, take, "borrow", steal,
or otherwise use this material without permission. This is all original,
and it's all protected
by international copyright law. Feel free to link to this page,
rewrite the information in your own words all you want
(hey, nothing on this page is a secret!), and use the information for
whatever purpose you wish, but don't just take the content, okay?
Just a few nifty little tips and tricks for webmasters and webmistresses.
Keeping Your Images From Being Indexed
There are at least three large search index sites that deliberately
seek out and index images, music files, and other multimedia files.
Several of these actually archive the images on their server and anyone
who searches for, say, your name, can easily turn up a picture of you
if you have one on your website. They can also access your images without
ever having to read your copyright notices, without reading your Terms
of Use statement, and without ever having to visit your site at all.
Don't like that idea? Lots of graphic artists, web designers, photographers,
and other artists online are disturbed by this practice (and at least
one has brought a lawsuit against one of the indexing systems, charging
copyright violation).
The good news is that it's fairly easy to keep these index
spiders from grabbing your images, particularly if you have your own
domain. At the time of this writing, there are four major spiders that
do this (as far as I know), and all of them can be prevented from indexing
your images (or music files or whatever) with a simple robots.txt file.
This isn't the place for me to explain all the ins and outs of a robots.txt
file or the robots.txt protocol (for more detailed information, try
The Web Robots Page). In a nutshell, a "robots.txt" file is a text file
that you place in the root of your domain. If you don't own the domain
or have access to the domain root directory, you'll have to talk to
your system administrator or webhost about the robots.txt issue; the
protocol only defines robots.txt at the root, so don't count on a copy
in a subdirectory being honored (the META tags described further down
are the fallback). What I'm going to do here is give the formula
to keep the biggest known image index spiders out of your site
using robots.txt.
In the most basic form:
User-agent: vscooter
User-agent: DittoSpyder
User-agent: Googlebot-Image
User-agent: psbot
Disallow: /

User-agent: Googlebot
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$
Note that "vscooter" is the name of Alta Vista's image indexing spider
(their page indexing spider is "scooter" so preventing vscooter from
getting into your site doesn't keep your site from being indexed). DittoSpyder
is the one for ditto.com (you don't want them at all; the only thing
they do is index images!), and psbot is the spider for Picsearch, which
is also an image indexer. Googlebot-Image is Google's image index spider.
In this formula, all of those spiders are ordered to stay out completely.
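One way to sanity-check the first group of that formula before uploading it is to run it through a parser. The sketch below is only an illustration: it uses Python's standard urllib.robotparser, the helper name is_allowed and the yourdomain.com URL are made up for the example, and it leaves out the Googlebot wildcard lines because that parser does not understand wildcard patterns.

```python
from urllib.robotparser import RobotFileParser

# First group from the formula above (the Googlebot wildcard group is omitted).
ROBOTS_TXT = """\
User-agent: vscooter
User-agent: DittoSpyder
User-agent: Googlebot-Image
User-agent: psbot
Disallow: /
"""

# Hypothetical image URL, used only for this check.
IMAGE_URL = "http://yourdomain.com/images/photo.jpg"


def is_allowed(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # "scooter" is Alta Vista's page spider; it is not named in the group.
    for agent in ["vscooter", "DittoSpyder", "Googlebot-Image", "psbot", "scooter"]:
        verdict = "allowed" if is_allowed(ROBOTS_TXT, agent, IMAGE_URL) else "blocked"
        print(agent, verdict)
```

The four image spiders should come back blocked, while a spider that isn't named in the group (such as "scooter") is still let in.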
However, Google provides conflicting information on how to keep images
from being indexed. The second entry, for "Googlebot", follows a peculiar
method for keeping it from indexing your images, as shown (note: you
can modify that to exclude any kind of file, including .zip,
.mid, .wav, etc.). Please note that the * and $ pattern characters are not
part of the original robots.txt protocol, so while this works fine for
Googlebot, don't try to apply it to any other search engine bot (or your
site may be excluded entirely and therefore not even get the pages
indexed). Also, just so you know, here's the page at Google where I found
the protocol for excluding images, and here is the page at Google that
discusses the Googlebot-Image spider.
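If the Googlebot patterns look cryptic, it may help to see them as ordinary regular expressions. The sketch below is my own reading of the notation, not Google's code: it assumes that * stands for any run of characters and that a trailing $ pins the match to the end of the path. The function names are made up for the example.

```python
import re

# The Googlebot Disallow patterns from the formula above.
GOOGLEBOT_PATTERNS = ["/*.gif$", "/*.jpg$", "/*.jpeg$", "/*.png$"]


def pattern_to_regex(pattern):
    """Translate a wildcard Disallow pattern into a compiled regex.

    Assumption of this sketch: '*' means "any characters" and a trailing
    '$' means "the path must end here".
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with ".*" where '*' was.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path, patterns):
    """Return True if any pattern matches the URL path."""
    return any(pattern_to_regex(p).match(path) for p in patterns)


if __name__ == "__main__":
    for path in ["/images/photo.gif", "/index.html", "/photo.gif.html"]:
        print(path, is_disallowed(path, GOOGLEBOT_PATTERNS))
```

Swapping in other extensions (.zip, .mid, .wav) is just a matter of adding more patterns to the list.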
What if you don't have access to the robots.txt file? You can still
protect your images using the META tag (at least somewhat). In between
the <head></head> tags in your HTML, put the following tag:
<META NAME="robots" CONTENT="noimageindex">
Not all spiders recognize this tag, unfortunately. As far as I'm aware,
this only works with Alta Vista. However, if you have images you really
want to protect, you can put them on a page and then keep all robots/spiders
from indexing the page this way:
<META NAME="robots" CONTENT="noindex">
As far as I know, all reputable index spiders do obey that.
You can also prevent a page from being put into a cache with the following
META directive:
<META NAME="robots" CONTENT="nocache,noarchive">
Again, as far as I'm aware, all reputable index spiders honor that
tag.
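If you want to double-check which of these directives a page actually carries, a few lines of script will do it. This is just a rough sketch using Python's standard html.parser; the class name, the function name, and the sample page are all made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <META NAME="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # html.parser hands us tag and attribute names in lowercase.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for directive in content.split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_directives(html_text):
    """Return the set of robots META directives in a page."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


# Hypothetical page that uses two of the tags shown above.
SAMPLE_PAGE = """<html><head>
<META NAME="robots" CONTENT="noindex">
<META NAME="robots" CONTENT="nocache,noarchive">
</head><body><img src="image.ext"></body></html>"""

if __name__ == "__main__":
    print(sorted(robots_directives(SAMPLE_PAGE)))
```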
Important: You should do a follow-up if you can, to make sure
your images will be removed. With Alta Vista, you can simply re-submit
your site and they will remove the forbidden material when they
encounter the updated robots.txt file. Google will remove images the
next time they spider your site, or you can email
them and request they send the spider within 48 hours. With Ditto,
you need to request they remove it using an
online form. Do this only after you have updated the robots.txt
file or added META tags, or it's an exercise in futility.
Preventing the Internet Archive "Wayback Machine" from Indexing Your Pages
First, let me say that the service offered by The
Internet Archive is really pretty interesting. You may or may not
object to having your pages archived by them. What they do is keep records,
sometimes going back years (as far back as 1996), of web pages. They
archive them, and you can retrieve them from various dates in history.
I went and had a look at the personal homepage I had online in November
1997 (pretty embarrassing now, but it was cool at the time).
However, many people, myself included, feel that when they remove
pages from the web, they want to take them out of public viewing. They
also don't like having their images archived, even if they're "old"
ones. You'll have to decide for yourself if this is something you want
to allow or disallow, and there are plenty of good reasons for letting
old stuff be archived, just as there are plenty of reasons to object.
This little article is to show you how to prevent it, because if you
want to allow it, that's a lot easier.
As with the tip above, you'll need to use a robots.txt entry
or prevent robots from archiving or indexing pages with HTML. If you
skipped over the stuff immediately above this entry, go have a quick
read, because most of those techniques are what you'll be using
for this tip. Go on. I'll wait.
Okay. Back? Here's the pertinent information you want to have in your
robots.txt file:
User-Agent: ia_archiver
Disallow: /
What that says is that ia_archiver (the name of the Internet Archive's
spider) isn't allowed to enter your site. As with any other spider,
you can limit it to specific directories if you wish, as well. To combine
this with other robot directives to stay out, you would do something
like this:
User-agent: vscooter
User-agent: DittoSpyder
User-agent: Googlebot-Image
User-agent: ia_archiver
User-agent: psbot
Disallow: /
That keeps out all of the spiders in the previous entry plus the Internet
Archive spider, as well.
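Limiting a spider to specific directories works the same way, just with more specific Disallow lines. Here is a small sketch that builds such a group and checks it with Python's standard urllib.robotparser. The helper names and the /private/ and /images/ directories are made up for the example; use whatever directories you actually want to protect.

```python
from urllib.robotparser import RobotFileParser


def build_group(agents, disallowed_paths):
    """Build one robots.txt group: the agents first, then their Disallow lines."""
    lines = ["User-agent: " + agent for agent in agents]
    lines += ["Disallow: " + path for path in disallowed_paths]
    return "\n".join(lines) + "\n"


def may_fetch(robots_txt, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical directories, chosen only for illustration.
    group = build_group(["ia_archiver"], ["/private/", "/images/"])
    print(group)
    print(may_fetch(group, "ia_archiver", "http://yourdomain.com/private/old.html"))
    print(may_fetch(group, "ia_archiver", "http://yourdomain.com/index.html"))
```

With this group the Internet Archive's spider stays out of the two listed directories but can still read the rest of the site.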
If you don't have access to robots.txt (i.e., your site is hosted
in such a way that you can't reach the "root" directory for the domain),
you can use the exact same HTML additions as above.
I got this information directly from the Internet
Archive site, which is where you need to go once you have your robots.txt
(or HTML additions) in place. Just submit your site for archiving (follow
the link on the page) and when the spider comes across the robots.txt
or HTML that forbids it, your pages will be removed (at least, they
will be when the spider gets around to your site to be indexed and they
update their own database).
If you want to remove pages that you no longer have access to (such
as in an account you no longer have), you'll need to email them. My
experience has been that they're quite responsive to requests to remove
materials.
Preventing Bandwidth Theft (direct-linking) with .htaccess
In a nutshell, bandwidth theft is when someone links directly to your
image(s), drawing the image from your server to their pages. This causes
undue traffic on your server and can cost you money, as well, if you
pay for bandwidth. (Note: this can happen with all kinds of files, not
just images, but I'm concentrating on "images" because that's
the most common and it's easier than always adding in other files. This
definitely does work for .zip, .wav, .mov, and anything else you're
likely to have on your website).
This technique prevents direct-linking only; it does not stop people from
just plain stealing your images by right-clicking or other means.
(Right-clicking can be blocked with JavaScript, but that doesn't work
in all browsers and it's hardly "foolproof"; for places
to find free scripts to prevent right-clicking, check out the Resources Index
and visit some of the JavaScript repositories listed.)
Yes, this really works to prevent direct-linking and bandwidth theft.
I've tested it extensively on several domains and so far it's always
worked like a charm (although I did hear from someone that she managed
to circumvent it; I haven't been able to reproduce that, though, not
even with extensive testing). If you're unsure about fiddling with the
.htaccess file (and it can be tricky), you might want to check with
your webhost or with a site such as The
Comprehensive Guide to .htaccess. Note that .htaccess is very useful
for lots of things, including setting up custom error pages, making
password-protected directories, denying access to certain addresses,
redirecting pages and lots more. It's well worth looking into.
To use this, you must have the ability to use .htaccess, and
mod_rewrite must be installed on the server where your pages
are hosted. In almost any decent domain hosting situation, this
will be the case already, but as always, you should check with your
webhost or service provider if you're unsure. The actual .htaccess file
is a text file named .htaccess (yes, be sure to use the dot) and stored
in the root directory (or others, for some purposes; to protect your
entire domain, you want it in the root).
The formula below assumes that the RewriteEngine is already turned on for
your domain, but it may not be. Try it first and see, or check with your
system administrator and/or tech support to find out. If you're sure it's
not turned on, you need to start your .htaccess file with this directive:
RewriteEngine On
Then, the exact formula to prevent the actual bandwidth theft:
RewriteCond %{HTTP_REFERER} !^$
RewriteCond %{HTTP_REFERER} !^https?://yourdomain.com/.*$ [NC]
RewriteCond %{HTTP_REFERER} !^https?://www.yourdomain.com/.*$ [NC]
RewriteRule .*\.(gif|jpg|zip|png|swf)$ - [F,NC]
Note that none of these lines should wrap (but they might on this
screen, depending on your screen resolution). Oh, and obviously, replace
"yourdomain.com" with your domain name. You can also add in any other
domains that you would like to allow to link (such as bulletin boards
you use and to which you might want to remotely link an avatar or other
image).
If you're good at figuring out formulae and notation, you can fairly
easily alter this to prevent direct linking to any kind of file (.mid,
.wav, etc.). This particular formula prevents linking to .gif, .jpg,
.zip, .png, and .swf files, which you can read on the bottom line. Please,
though, BE CAREFUL when playing with .htaccess. You can make
all sorts of weirdnesses happen with a badly written .htaccess directive.
However, don't let that deter you from learning about .htaccess! As
mentioned, it's extremely useful.
When this is working, anyone who tries to access any of the prohibited
files by direct linking will get either an error message (403 Forbidden)
or just a broken image.
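To see which requests the formula turns away, you can model its logic outside of Apache. The sketch below is only a stand-in for what mod_rewrite does, not a replacement for testing on your own server. It assumes both http and https referers from your own domain should pass, matches file extensions without regard to case, and treats dots in the domain name literally; the function name and the example.org referer are made up for the example.

```python
import re

# File types protected by the RewriteRule (matched case-insensitively here).
PROTECTED_FILE = re.compile(r".*\.(gif|jpg|zip|png|swf)$", re.IGNORECASE)

# Domains allowed to link; replace with your own, as in the formula.
ALLOWED_DOMAINS = ["yourdomain.com", "www.yourdomain.com"]


def is_blocked(path, referer, allowed_domains):
    """Model the formula: True means the request would be refused."""
    if not PROTECTED_FILE.match(path):
        return False  # the RewriteRule pattern doesn't match this file
    if referer == "":
        return False  # the "!^$" condition fails, so empty referers pass
    for domain in allowed_domains:
        own_site = re.compile(
            r"^https?://" + re.escape(domain) + r"/.*$", re.IGNORECASE
        )
        if own_site.match(referer):
            return False  # a "not my domain" condition fails, so it passes
    return True  # every condition held, so the rule answers with [F]


if __name__ == "__main__":
    print(is_blocked("images/photo.jpg", "http://example.org/page.html", ALLOWED_DOMAINS))
    print(is_blocked("images/photo.jpg", "http://www.yourdomain.com/gallery.html", ALLOWED_DOMAINS))
    print(is_blocked("images/photo.jpg", "", ALLOWED_DOMAINS))
```

Notice that a request with no referer at all is let through. That is what the first condition is for: some browsers and proxies never send a referer, and you don't want to break your images for those visitors.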
A couple of people have asked how to allow direct-linking to only
one or two specific images while protecting the rest of the site.
I can think of
several ways to do this, but this is probably the easiest: Keep all
your images in a sub-directory, and put the .htaccess file that prevents
linking in that directory only. Then put the images that you do want
linked in a different sub-directory that isn't protected by the .htaccess,
and don't put the anti-linking .htaccess in the top-level directory. It's
a bit of a pain in
the neck, but keeping images in a sub-directory isn't a bad way to structure
a website (just remember when you link to the images from your pages
to specify the sub-directory they're in).
The thing to remember is that you can put an .htaccess file in any
directory. The higher the directory, the more it affects. Putting it
in the top-level directory affects the whole site. If you put it in
only a sub-directory, only that sub-directory (and any below it) will
be affected (this is how password-protected directories are set up).
If you put an .htaccess in a sub-directory, it will override "higher"
.htaccess files only where it contains conflicting directives (for
example, it gives a different location for a custom 404 page that has
already been specified). Where there are no conflicting directives, the
directives from the higher .htaccess continue to apply.
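If that's hard to picture, here is a toy model of the idea. It is emphatically not how Apache works internally: the dictionaries merely stand in for .htaccess files, it only covers simple one-value settings like a custom error page, and the directory names and file locations are made up for the example.

```python
def effective_settings(htaccess_by_dir, directory):
    """Merge settings from the top-level directory down to `directory`.

    Deeper directories replace only the settings they define themselves;
    everything else carries over from the directories above.
    """
    parts = [part for part in directory.strip("/").split("/") if part]
    current = "/"
    chain = [current]
    for part in parts:
        current = current.rstrip("/") + "/" + part
        chain.append(current)

    settings = {}
    for path in chain:
        settings.update(htaccess_by_dir.get(path, {}))
    return settings


# Hypothetical site: the dictionaries stand in for .htaccess files.
SITE = {
    "/": {
        "ErrorDocument 404": "/notfound.html",
        "ErrorDocument 403": "/forbidden.html",
    },
    "/images": {
        "ErrorDocument 404": "/images/missing.html",
    },
}

if __name__ == "__main__":
    print(effective_settings(SITE, "/articles"))
    print(effective_settings(SITE, "/images/thumbs"))
```

In the model, /images/thumbs picks up the 404 page set in /images but still uses the 403 page from the top level, because nothing below the top level changed it.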
Preventing the Annoying Image Toolbar Popup in IE6+
Internet Explorer 6+ has a feature many consider annoying. Basically, when
you hover on an image of 200x200px or greater, a little "image toolbar"
pops up to allow you to email the image, save it to your hard drive, copy
it, print it, or otherwise grab or mistreat it. A "no right click" script
doesn't work against it, either.
This can be turned off in the browser, of course (I certainly turn
it off!), but many people don't know this or don't know how to disable
it. You, however, can make sure it won't pop up in anyone's browser
while they're viewing your pages, by including this META tag in the
header of the page(s):
<META HTTP-EQUIV="imagetoolbar" CONTENT="no">
If you only want to disable it for specific images (this is handy
if you have an auto-generated online picture gallery, for example),
you can add this attribute to the image tag itself:
<img src="image.ext" GALLERYIMG="no">
(Of course, you should change "image.ext" to the actual image name).
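For an auto-generated gallery, the attribute can simply be written out by whatever script produces the image tags. A minimal sketch, assuming the gallery is built from a plain list of file names (the function names and file names are made up for the example):

```python
from html import escape


def gallery_img_tag(src):
    """Return an image tag with the IE image toolbar switched off."""
    return '<img src="{}" GALLERYIMG="no">'.format(escape(src, quote=True))


def gallery_html(filenames):
    """Return one image tag per file name, one tag per line."""
    return "\n".join(gallery_img_tag(name) for name in filenames)


if __name__ == "__main__":
    # Hypothetical file names for the gallery.
    print(gallery_html(["holiday1.jpg", "holiday2.jpg"]))
```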
If it bugs you to have to put in a special tag on each and every page
or in each image to stop an annoying and potentially copyright-violating
feature of a browser, you're not alone. It irritates a lot
of people.
If you'd like to learn more, there are many great sites out there that
can help with all kinds of technical issues. Visit the Resources
index for lots of resources in this area.