The Robots Text File Or How To Get Your Site Properly Spidered, Crawled, Indexed By Bots
So you heard about someone stressing the importance of the robots.txt file, or noticed in your website's logs that the robots.txt file is causing an error, or somehow it is on the very top of the top visited pages, or you read some article about the death of the robots.txt file and about how you should not bother with it ever again. Or maybe you never heard of the robots.txt file but are intrigued by all that talk about spiders, robots and crawlers. In this article, I will hopefully make some sense out of all of the above.
There are many folks out there who vehemently insist on the uselessness of the robots.txt file, proclaiming it obsolete, a thing of the past, plain dead. I disagree. The robots.txt file is probably not in the top ten methods to promote your get-rich-fast affiliate website in 24 hours or less, but it still plays a major role in the long run.

First of all, the robots.txt file is still a very important factor in promoting and maintaining a site, and I will show you why. Second, the robots.txt file is one of the simple means by which you can protect your privacy and/or intellectual property. I will show you how.

Let's try to figure out some of the lingo.

What is this robots.txt file?

The robots.txt file is just a very plain text file (or an ASCII file, as some like to say), with a very simple set of instructions that we give to a web robot, so the robot knows which pages we need scanned (or crawled, or spidered, or indexed - all terms refer to the same thing in this context) and which pages we would like to keep out of search engines.

What is a www robot?

A robot is a computer program that automatically reads web pages and goes through every link that it finds.
The purpose of robots is to gather information. Some of the most famous robots mentioned in this article work for the search engines, indexing all the information available on the web.

The first robot was developed at MIT and launched in 1993. It was named the World Wide Web Wanderer and its initial purpose was purely scientific: its mission was to measure the growth of the web. The index generated from the experiment's results proved to be an awesome tool and effectively became the first search engine. Most of the stuff we consider today to be indispensable online tools was born as a side effect of some scientific experiment.

What is a search engine?

Generically, a search engine is a program that searches through a database. In the popular sense, as applied to the web, a search engine is considered to be a system that has a user search form, which can search through a repository of web pages gathered by a robot.

What are spiders and crawlers?

Spiders and crawlers are robots, only the names sound cooler in the press and within metro-geek circles.

What are the most popular robots? Is there a list?

Some of the most well known robots are Google's Googlebot, MSN's MSNBot, Ask Jeeves's Teoma, Yahoo!'s Slurp (funny). One of the most popular places to search for active robot info is the list maintained at http://www.robots.org.

Why do I need this robots.txt file anyway?

A great reason to use a robots.txt file is actually the fact that many search engines, including Google, post suggestions for the public to make use of this tool. Why is it such a big deal that Google teaches people about the robots.txt? Well, because nowadays, search engines are not a playground for scientists and geeks anymore, but large corporate enterprises. Google is one of the most secretive search engines out there. Very little is known to the public about how it operates, how it indexes, how it searches, how it creates its rankings, etc.
In fact, if you do a careful search in specialized forums, or wherever else these issues are discussed, nobody really agrees on whether Google puts more emphasis on this or that element to create its rankings. And when people don't agree on things as precise as a ranking algorithm, it means two things: that Google constantly changes its methods, and that it does not make them very clear or very public. There's only one thing that I believe to be crystal clear. If they recommend that you use a robots.txt ("Make use of the robots.txt file on your web server" - Google Technical Guidelines), then do it. It might not help your ranking, but used correctly it will not hurt you.

There are other reasons to use the robots.txt file. If you use your error logs to tweak your site and keep it free of errors, you will notice that most errors refer to someone or something not finding the robots.txt file. All you have to do is create a basic blank page (use Notepad in Windows, or the simplest text editor in Linux or on a Mac), name it robots.txt and upload it to the root of your server (that's where your home page is).

On a different note, nowadays, all search engines look for the robots.txt file as soon as their robots arrive on your site. There are unconfirmed rumors that some robots might even 'get annoyed' and leave if they don't find it. Not sure how true that is, but hey, why not be on the safe side? Again, even if you don't intend to block anything or just don't want to bother with this stuff at all, having a blank robots.txt is still a good idea, as it can actually act as an invitation into your site.

Don't I want my site indexed? Why stop robots?

Some robots are well designed, professionally operated, cause no harm and provide valuable service to mankind (don't we all like to "google"). Some robots are written by amateurs (remember, a robot is just a program). Poorly written robots can cause network overload, security problems, etc.
The bottom line here is that robots are devised and operated by humans and are prone to the human error factor. Consequently, robots are not inherently bad, nor inherently brilliant, and need careful attention. This is another case where the robots.txt file comes in handy - robot control.

Now, I'm sure your main goal in life, as a webmaster or site owner, is to get on the first page of Google. Then why in the world would you want to block robots? Here are some scenarios:

1. Unfinished site
You are still building your site, or portions of it, and don't want unfinished pages to appear in search engines. It is said that some search engines even penalize sites with pages that have been "under construction" for a long time.

2. Security
Always block your cgi-bin directory from robots. In most cases, cgi-bin contains applications, configuration files for those applications (that might actually have sensitive information), etc. Even if you don't currently use any CGI scripts or programs, block it anyway; better safe than sorry.

3. Privacy
You might have some directories on your website where you keep stuff that you don't want the entire Galaxy to see, such as pictures of a friend who forgot to put clothes on, etc. Keep in mind that robots.txt only asks robots to stay away and is itself a public file, so it is no substitute for password protection.

4. Doorway pages
Besides illicit attempts to increase rankings by blasting doorways all over the internet, doorway pages actually do have a very morally sound usage. They are similar pages, but each one is optimized for a specific search engine. In this case, you must make sure that individual robots do not have access to all of them. This is extremely important, in order to avoid being penalized for spamming a search engine with a series of extremely similar pages.

5. Bad bot, bad bot, what’cha gonna do...
You might want to exclude robots whose known purpose is to collect email addresses, or other robots whose activity does not agree with your beliefs on the world.

6. Your site gets overwhelmed
In rare situations, a robot goes through your site too fast, eating your bandwidth or slowing down your server. This is called "rapid-fire" and you'll notice it if you are reading your access log file. A medium performance server should not slow down. You may however have problems if you have a low performance site, such as one running off your personal PC or Mac, if you run poor server software, or if you have heavy scripts or huge documents. In these cases, you'll see dropped connections, heavy slowdowns and, in extreme cases, even a complete system crash. If this ever happens to you, read your logs, try to get the robot's IP or name, read the list of active robots and try to identify and block it.

What's in a robots.txt file anyway?

There are only two lines for each entry in a robots.txt file: the User-Agent line, which has the name of the robot you want to give orders to or the '*' wildcard symbol meaning 'all', and the Disallow line, which tells a robot all the places it should not touch. The two-line entry can be repeated for every file or directory you don't want indexed, or for each robot you want to exclude. If you leave the Disallow line empty, this means you are not disallowing anything; in other words, you are allowing the particular robot to index your entire site. Some examples and a few scenarios should make it clear:

A. Exclude a file from Google's main robot (Googlebot):

    User-Agent: Googlebot

B. Exclude a section of the site from all robots:

    User-Agent: *

Note that the directory is enclosed between two forward slashes. Although you are probably used to seeing URLs, links and folder references that do not end with a slash, note that a web server needs a slash at the end of a directory address. Even when you see links on websites that do not end with a slash, when that link is clicked, the web server has to do an extra step before serving the page, which is adding the slash through what we call a redirect. In robots.txt the slash also matters because rules match by prefix: without it, the rule would also cover any file whose name starts the same way. Always use the ending slash.

C. Allow everything (blank robots.txt):

    User-Agent: *
    Disallow:

Note that when a "blank robots.txt" is mentioned, it is not a completely blank file, but it contains the two lines above.

D. Do not allow any robot on your site:

    User-Agent: *
    Disallow: /

Note that the single forward slash means "root", which is the main entrance to your site.

E. Do not allow Google to index any of your images (Google uses Googlebot-Image for images):

    User-Agent: Googlebot-Image
    Disallow: /

F. Do not allow Google to index some of your images:

    User-Agent: Googlebot-Image

Note the use of multiple disallows. This is allowed, no pun intended.

G. Build a doorway for Google and Lycos (the Lycos robot is called T-Rex) - do not play with this unless you are 100% sure you know what you are doing:

    User-Agent: T-Rex

H. Allow only Googlebot:

    User-Agent: Googlebot
    Disallow:

    User-Agent: *
    Disallow: /

Note that the commands are sequential. The example above reads in English: Let Googlebot through, then stop everyone else.

If your file gets really large, or you just feel like writing notes for yourself or for potential viewers (remember, robots.txt is a public file, anyone can see it), you can do so by preceding your comment with a # sign. Although according to the standard you can have a comment on the same line as a command, I recommend that you start every command and every comment on a new line. This way, robots will never be confused by a potential formatting glitch. Examples:

This is correct, as per the standard, but not recommended (a newer robot or a badly written one might read the following as "disallow the # We... Directory", not complying with the "disallow all" command):

    User-Agent: *
    Disallow: / # We decided to stop all robots but we were very silly in typing a long comment which got truncated and made the robots.txt unusable

The way I recommend that you format this is:

    # We decided to stop all robots and we made sure

Although theoretically, each robot should comply to ...
Again, even if you don't intend to block anything or just don't want to bother with this stuff at all, having a blank robots.txt is still a good idea, as it can actually act as an invitation into your site.

Don't I want my site indexed? Why stop robots?

Some robots are well designed, professionally operated, cause no harm and provide a valuable service to mankind (don't we all like to "google"?). Some robots are written by amateurs (remember, a robot is just a program). Poorly written robots can cause network overload, security problems, etc. The bottom line here is that robots are devised and operated by humans and are prone to human error. Consequently, robots are neither inherently bad nor inherently brilliant, and they need careful attention. This is another case where the robots.txt file comes in handy: robot control.

Now, I'm sure your main goal in life as a webmaster or site owner is to get on the first page of Google. Then why in the world would you want to block robots? Here are some scenarios:

1. Unfinished site
You are still building your site, or portions of it, and don't want unfinished pages to appear in search engines. It is said that some search engines even penalize sites with pages that have been "under construction" for a long time.

2. Security
Always block your cgi-bin directory from robots. In most cases, cgi-bin contains applications, configuration files for those applications (which might actually hold sensitive information), etc. Even if you don't currently use any CGI scripts or programs, block it anyway; better safe than sorry.

3. Privacy
You might have some directories on your website where you keep stuff that you don't want the entire Galaxy to see, such as pictures of a friend who forgot to put clothes on, etc.

4. Doorway pages
Besides illicit attempts to increase rankings by blasting doorways all over the internet, doorway pages actually do have a very morally sound usage. They are similar pages, but each one is optimized for a specific search engine. In this case, you must make sure that individual robots do not have access to all of them. This is extremely important, in order to avoid being penalized for spamming a search engine with a series of extremely similar pages.

5. Bad bot, bad bot, what’cha gonna do...
You might want to exclude robots whose known purpose is to collect email addresses, or other robots whose activity does not agree with your beliefs on the world.

6. Your site gets overwhelmed
In rare situations, a robot goes through your site too fast, eating your bandwidth or slowing down your server. This is called "rapid-fire" and you'll notice it if you are reading your access log file. A medium performance server should not slow down. You may however have problems if you have a low performance site, such as one running off your personal PC or Mac, if you run poor server software, or if you have heavy scripts or huge documents. In these cases, you'll see dropped connections, heavy slowdowns and, in extreme cases, even a complete system crash. If this ever happens to you, read your logs, try to get the robot's IP or name, read the list of active robots and try to identify and block it.

What's in a robots.txt file anyway?

There are only two lines for each entry in a robots.txt file: the User-Agent line, which names the robot you want to give orders to (or uses the '*' wildcard symbol, meaning 'all'), and the Disallow line, which tells a robot all the places it should not touch. The two-line entry can be repeated for every file or directory you don't want indexed, or for each robot you want to exclude. If you leave the Disallow line empty, you are not disallowing anything; in other words, you are allowing that particular robot to index your entire site.
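To make the two-line entry structure concrete, here is a minimal sketch in Python that splits a robots.txt text into its entries. It is only an illustration under stated assumptions: the names Record and parse_records and the /private/ directory are hypothetical, and it understands nothing beyond the User-Agent and Disallow lines described above.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One robots.txt entry: the robots it addresses and the paths they must avoid."""
    user_agents: list = field(default_factory=list)
    disallow: list = field(default_factory=list)


def parse_records(text):
    """Split robots.txt text into entries, reading only User-Agent and Disallow lines."""
    records = []
    current = None
    in_rules = False  # True once the current entry has seen a Disallow line
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and stray spaces
        if ":" not in line:
            continue
        name, value = (part.strip() for part in line.split(":", 1))
        name = name.lower()
        if name == "user-agent":
            # A User-Agent line that follows a Disallow line opens a new entry.
            if current is None or in_rules:
                current = Record()
                records.append(current)
                in_rules = False
            current.user_agents.append(value)
        elif name == "disallow" and current is not None:
            in_rules = True
            if value:  # an empty Disallow blocks nothing
                current.disallow.append(value)
    return records


# Hypothetical sample: Googlebot may go anywhere, everyone else is kept
# out of cgi-bin and of a made-up /private/ directory.
SAMPLE = """\
User-Agent: Googlebot
Disallow:

User-Agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

records = parse_records(SAMPLE)
for record in records:
    blocked = ", ".join(record.disallow) or "nothing"
    print(f"{', '.join(record.user_agents)}: blocked from {blocked}")
```

Real robots may read the file differently, so treat this as a way to see the structure, not as a reference parser.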
Some examples and a few scenarios should make it clear:

A. Exclude a file from Google's main robot (Googlebot):

User-Agent: Googlebot

B. Exclude a section of the site from all robots:

User-Agent: *

Note that the directory is enclosed between two forward slashes. Although you are probably used to seeing URLs, links and folder references that do not end with a slash, note that a web server always needs a slash at the end of a directory. Even when you see links on websites that do not end with a slash, when that link is clicked the web server has to do an extra step before serving the page, which is adding the slash through what we call a redirect. Always use the ending slash.

C. Allow everything (blank robots.txt):

User-Agent: *
Disallow:

Note that when a "blank robots.txt" is mentioned, it is not a completely blank file; it contains the two lines above.

D. Do not allow any robot on your site:

User-Agent: *
Disallow: /

Note that the single forward slash means "root", which is the main entrance to your site.

E. Do not allow Google to index any of your images (Google uses Googlebot-Image for images):

User-Agent: Googlebot-Image
Disallow: /

F. Do not allow Google to index some of your images:

User-Agent: Googlebot-Image

Note the use of multiple disallows. This is allowed, no pun intended.

G. Build a doorway for Google and Lycos (the Lycos robot is called T-Rex) - do not play with this unless you are 100% sure you know what you are doing:

User-Agent: T-Rex

H. Allow only Googlebot:

User-Agent: Googlebot
Disallow:

User-Agent: *
Disallow: /

Note that a robot follows the entry that names it and falls back to the '*' entry only when no entry does. The example above reads in English: let Googlebot through, then stop everyone else.

If your file gets really large, or you just feel like writing notes for yourself or for potential viewers (remember, robots.txt is a public file, anyone can see it), you can do so by preceding your comment with a # sign.
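If you want to check how entries like the ones above are interpreted before uploading them, Python's standard urllib.robotparser module can read the text and answer "may this robot fetch this path?". The sketch below is only an illustration: the helper name allowed, the robot name SomeOtherBot and the file names are hypothetical, and real robots are not guaranteed to behave exactly like this parser.

```python
from urllib.robotparser import RobotFileParser

# Small robots.txt texts modelled on the scenarios above. The comments
# start on their own lines, and every path is a made-up example.
EXAMPLES = {
    "section": """\
# Keep every robot out of the script directory.
User-Agent: *
Disallow: /cgi-bin/
""",
    "allow_all": """\
User-Agent: *
Disallow:
""",
    "block_all": """\
User-Agent: *
Disallow: /
""",
    "only_googlebot": """\
# Let Googlebot through, then stop everyone else.
User-Agent: Googlebot
Disallow:

User-Agent: *
Disallow: /
""",
}


def allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    checks = [
        ("section", "SomeOtherBot", "/cgi-bin/form.cgi"),
        ("section", "SomeOtherBot", "/index.html"),
        ("allow_all", "SomeOtherBot", "/index.html"),
        ("block_all", "SomeOtherBot", "/index.html"),
        ("only_googlebot", "Googlebot", "/index.html"),
        ("only_googlebot", "SomeOtherBot", "/index.html"),
    ]
    for name, agent, path in checks:
        verdict = "allowed" if allowed(EXAMPLES[name], agent, path) else "blocked"
        print(f"{name:15} {agent:13} {path:18} {verdict}")
```

Running the script prints one verdict per check, so you can compare what the parser decides with what you meant to say in the file.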
Although the standard allows a comment on the same line as a command, I recommend that you start every command and every comment on a new line; this way, robots will never be confused by a potential formatting glitch.

Examples:

This is correct as per the standard, but not recommended (a newer robot or a badly written one might read the following as "disallow the # We... directory" instead of complying with the "disallow all" command):

User-Agent: *
Disallow: / # We decided to stop all robots but we were very silly in typing a long comment which got truncated and made the robots.txt unusable

The way I recommend that you format this is:

# We decided to stop all robots and we made sure
Although theoretically each robot should comply with the standards introduced around 1994 and enhanced in 1996, each robot acts a little differently. You are advised to check the documentation provided by the owners of those robots; you'll be surprised to discover a world of useful facts and techniques. For instance, from Google's site we learn that Googlebot completely disregards any URL that contains "&id=".

Here are some sites to check:

Google: http://www.google.com/bot.html
Yahoo: http://help.yahoo.com/help/us/ysearch/slurp/
MSN: http://search.msn.com/docs/siteowner.aspx

A database of robots is maintained at http://www.robotstxt.org/wc/active/html/contact.html

A robots.txt validation tool, invaluable in finding potential typos that can completely change the way search engines see your site, can be found at: http://searchengineworld.com/cgi-bin/robotcheck.cgi

There are also some extensions to the standard. For example, some robots allow wildcards in the Disallow line, and some even allow different commands.
My advice is: don't bother with anything outside the standard and you will not be unpleasantly surprised.

A final word of caution: in this article I showed you how things should work in a perfect world. Earlier in this article I mentioned that there are good bots and bad bots. Let's stop for a moment and think from a deranged person's perspective. Is there anything to prevent someone from writing a robot program that reads a robots.txt file and specifically looks at the pages you marked as "disallowed"? Absolutely not. This entire standard is based on the honor system and on the idea that everyone should work hard to make the internet a better place. Basically, do not rely on this for real security or privacy. Use passwords when necessary.

In conclusion, do not forget that indexing robots are your best friends. While you should build your site for your human visitors and not for robots, do not underestimate the power of those mindless crawlers. Make sure the pages you want indexed are clearly seen by robots, and make sure you have regular hyperlinks that robots can follow without roadblocks (robots can't follow Flash-based navigation systems, for instance). To keep your site at tip-top performance and to keep your logs clean and your applications, scripts and private data safe, always use a robots.txt file and make sure you read your logs to monitor all robotic activity.
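Reading the logs by hand gets tedious, so here is a rough sketch in Python of how the "rapid-fire" check could be automated. It rests on assumptions the article does not state: that your server writes the common Apache "combined" log format, and that more than 60 requests per minute from one visitor counts as too fast. The function name rapid_fire, the sample addresses and the robot name HypotheticalBot are made up for illustration.

```python
import re
from collections import Counter

# Assumed "combined" log format:
# ip - - [day/mon/year:hh:mm:ss zone] "request" status size "referer" "agent"
# The first 17 characters of the timestamp identify the minute.
LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<minute>[^\]]{17})[^\]]*\] '
    r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def rapid_fire(lines, limit=60):
    """Return {(ip, agent): busiest-minute count} for visitors above `limit`."""
    hits = Counter()
    for line in lines:
        match = LINE.match(line)
        if match:  # lines in any other format are skipped
            hits[(match["ip"], match["agent"], match["minute"])] += 1
    offenders = {}
    for (ip, agent, _minute), count in hits.items():
        if count > limit:
            key = (ip, agent)
            offenders[key] = max(offenders.get(key, 0), count)
    return offenders


# Made-up sample: one robot requesting five pages within the same minute,
# plus a single ordinary browser request.
SAMPLE = [
    f'203.0.113.9 - - [10/Oct/2006:13:55:{s:02d} -0700] '
    f'"GET /page{s}.html HTTP/1.0" 200 512 "-" "HypotheticalBot/1.0"'
    for s in range(5)
] + [
    '198.51.100.7 - - [10/Oct/2006:13:55:01 -0700] '
    '"GET / HTTP/1.0" 200 2326 "-" "Mozilla/4.0"'
]

if __name__ == "__main__":
    for (ip, agent), count in rapid_fire(SAMPLE, limit=3).items():
        print(f"{ip} ({agent}): {count} requests in one minute")
```

To use it on a real log, pass an open file instead of SAMPLE and pick a limit that suits your server. Once you have the robot's name, you can look it up in the list of active robots and add a Disallow entry for it.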