This special file is incredibly useful for keeping redundant or unsavory parts of your website from popping up in search engines. Before we can get into the "why", we first need to understand what robots.txt is and how it works.
Robots.txt is a file that gives instructions to web crawlers about which pages to include in or exclude from search engine results. A page that is blocked by this file won't show up in the results of search engines such as Google or Bing.
While the file itself is fairly simple to add, it can be modified in a number of different ways. The short answer is that robots.txt is simply placed in the top-level (root) directory on your hosting server to tell search engine robots which pages or files you don't want them to look at.
When a search engine robot goes through a website, it first checks the robots.txt file. This file works as a suggestion as to what the robot can or cannot look at; if a robot skips a page because of it, that page won't be added to the search results.
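To make that concrete, here is a minimal sketch of how a well-behaved crawler consults robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The rules and the `example.com` URLs are hypothetical placeholders; a real crawler would download the rules from the site's own `/robots.txt`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, as they would appear in a site's robots.txt.
rules = """
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A polite crawler checks every URL against the rules before fetching it.
print(parser.can_fetch("MyBot", "https://example.com/index.html"))  # True
print(parser.can_fetch("MyBot", "https://example.com/private/x"))   # False
```

Note that this check is entirely voluntary: nothing stops a crawler from skipping it and fetching the disallowed URL anyway, which is exactly the limitation discussed below.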
Gorilla Marketing, an SEO agency in Manchester, notes that because the protocol is entirely advisory, it relies on compliance from the bot in question. Malicious robots, such as those used to spread malware, will ignore it, as it benefits them to skirt past any such barriers to harvest the information they need.
Malware is merely an umbrella term that can refer to anything intrusive, such as viruses, worms, and spyware. Because robots.txt is an advisory convention rather than a real security control, malicious crawlers can simply bypass it to harvest information and do harm to the web user.
Robots.txt is perfect for excluding parts of your website that you don't want crawled. For instance, it can be useful to block duplicate pages and internal search results, along with certain files on a website such as images, downloads, or adult chats.
Depending on what you want it to do, you can have it target just about anything on your website. You can block irrelevant parts of forums and chats, or target specific images. You can even target certain URLs or pages.
First, open a plain text editor; Notepad will do. After you enter the correct format, save the file as robots.txt and upload it to your website's root directory. From there, it can be modified depending on what you want it to block.
The format is actually quite simple. The first line is "User-agent:", followed by the bot in question, with universal rules using the user agent *. The lines that follow each start with "Disallow:", followed by the parts of your website you don't want the bot to look at.
Whatever you put after "User-agent:" specifies the target bot, while each "Disallow:" line lists a path that bot should not crawl. For instance, if you put in "User-agent: BingBot" followed by "Disallow: /tmp/", you will stop Bing from crawling the files in /tmp/.
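Putting those pieces together, a complete robots.txt might look like the sketch below. The BingBot rule comes from the example above; the paths under the * rule are hypothetical placeholders for the kinds of pages, such as internal search results, that you might want to exclude.

```
User-agent: BingBot
Disallow: /tmp/

User-agent: *
Disallow: /internal-search/
Disallow: /duplicate-page/
```

Each "User-agent:" line starts a new group of rules, and the "Disallow:" lines beneath it apply only to that bot (or to all bots, in the case of *).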
While not recommended, some users might want to take their site completely off the grid. In that case, simply set the user agent to * (shift + 8 on most keyboards) and put / on the Disallow line. To allow everything, simply delete the / and keep the user agent as *.
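The two extremes described above look like this. First, blocking every bot from the entire site:

```
User-agent: *
Disallow: /
```

And second, allowing every bot to crawl everything:

```
User-agent: *
Disallow:
```

The only difference is the single / character, so it pays to double-check this line before uploading the file.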
While we may live in the modern world, this isn't instantaneous. Search engines often need several days to a couple of weeks to notice which parts of a website are disallowed and remove them from their index.
According to robotstxt.org, there are over 300 active search engine bots out there right now; the full list can be found on their site. To specify the disallowed actions for a certain bot, simply put its name on the user-agent line.
Robots.txt is a flexible file that can be used to block web bots from accessing parts of your website. While it doesn't replace good security, it can help make sure that only the right parts of your website show up on search engines.