
Don’t waste your server resources: block unwanted bots using Nginx


Search engine bots (crawlers) are special programs that scan websites on the Internet. Search engines need them to find, index and display pages in search results. But not all bots are useful!

Sometimes your site may be visited by unwanted bots that:

  • Collect data without your permission.
  • Consume server resources, slowing it down.
  • Are used to look for vulnerabilities.

If you want to protect your site from such bots, it’s time to configure Nginx! In this article, we’ll show you how to easily and quickly block them using a special configuration file.


Why Nginx configuration instead of robots.txt?

The robots.txt file is a tool for managing search bots’ behavior. It tells them which parts of the site should not be crawled. It’s very easy to use this file: simply create one in the site’s root directory, for example:

User-agent: BadBot
Disallow: /

However, there is a problem: the instructions in robots.txt are a recommendation rather than an enforced rule. Well-behaved crawlers follow them, but many unwanted bots simply ignore the file.

By contrast, configuring Nginx lets you block access for unwanted bots at the server level, so their requests are rejected before they ever reach your site.


How Nginx blocks unwanted bots: using response 444

Unlike robots.txt, which only provides recommendations to bots, Nginx physically blocks their access. One way to achieve this is by using a special server response with the code 444. 

In Nginx, 444 is a non-standard, internal status code: when a request is answered with return 444, Nginx simply closes the connection without sending anything back to the client. This is an efficient way to drop unwanted requests and minimize server load.
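For illustration, here is a minimal sketch of this approach for a single bot, assuming an existing server block and using "badbot" as a placeholder User-Agent pattern (the full, map-based setup is described below):

    server {
        # Close the connection without a response when the User-Agent
        # matches the pattern (~* makes the match case-insensitive).
        if ($http_user_agent ~* "badbot") {
            return 444;
        }

        # ...the rest of your server block...
    }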


Setting up the blocking

Step 1: How to identify unwanted bots?

Unwanted bots can be identified by their User-Agent, which is a parameter sent by all clients when visiting your site. For example, some User-Agents might look like this:

  • AhrefsBot
  • SemrushBot
  • MJ12bot

You can find suspicious User-Agent values in the Nginx access log (if your site uses PHP-FPM):

sudo grep -i bot /var/log/nginx/access.log

Or in the Apache access log (if your site uses the Apache module or FastCGI as a PHP handler):

  • For Ubuntu/Debian:
sudo grep -i bot /var/log/apache2/access.log
  • For CentOS/AlmaLinux/RockyLinux:
sudo grep -i bot /var/log/httpd/access.log

If you’re using a control panel such as FASTPANEL, each site will have its own separate log file. You can analyze them individually or all at once using a command like:

  • If your site uses the Apache module or FastCGI as the PHP handler:
sudo cat /var/www/*/data/logs/*-backend.access.log | grep -i bot | tail -n 500
  • If your site uses PHP-FPM:
sudo cat /var/www/*/data/logs/*-frontend.access.log | grep -i bot | tail -n 500

This command will display the last 500 requests made to all your sites where the User-Agent parameter contains the word “bot.” An example of one line (one request to your site) might look like this:

IP - [03/Nov/2022:10:25:52 +0300] "GET link HTTP/1.0" 301 811 "-" "Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; [email protected])"

or

IP - [24/Oct/2022:17:32:37 +0300] "GET link HTTP/1.0" 404 469 "-" "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"

The bot’s User-Agent appears inside the parentheses at the end of the request line, between the “compatible;” segment and the “/version.number” part. So in the examples above, the User-Agents are DotBot and BLEXBot.

Analyze the information you gather and note the User-Agent strings of the most active bots for the next step of configuring the block. 
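To speed up this analysis, you can also summarize which User-Agents send the most requests. Here is a sketch assuming the default combined log format, where the User-Agent is the sixth double-quote-delimited field; adjust the log path to match your setup:

sudo awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

This prints the 20 most frequent User-Agent strings together with their request counts, so the most active bots stand out immediately.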

Step 2: Create a file to block bots

  1. Connect to your server via SSH.
  2. Before making changes, ensure that your current Nginx configuration has no errors:
sudo nginx -t

If everything is fine, you’ll see:

nginx: the configuration file /etc/nginx/nginx.conf syntax is ok

nginx: configuration file /etc/nginx/nginx.conf test is successful

If there are any errors, review them and fix them in the file indicated by the error messages.

  3. Create a separate file listing the bots to block:
sudo nano /etc/nginx/conf.d/block_bots.conf

Add the following code to the file:

    map $http_user_agent $block_bot {
        default 0;
        ~*AhrefsBot 1;
        ~*SemrushBot 1;
        ~*MJ12bot 1;
    }

    server {
        if ($block_bot) {
            return 444;
        }

        # ...the rest of your server block...
    }

Here the map directive compares the request’s User-Agent ($http_user_agent) against the listed patterns; the ~* prefix makes each match case-insensitive. When a pattern matches, $block_bot is set to 1, and the if check answers the request with code 444. The map block can stay in block_bots.conf (files in /etc/nginx/conf.d/ are included at the http level), while the if ($block_bot) check belongs inside the server block of each site you want to protect.

Following this pattern, list the User-Agent strings of the bots you want to block: one bot per line, with a semicolon (;) at the end of each line as a delimiter.
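For example, if the log analysis in Step 1 showed DotBot and BLEXBot among your most active visitors, the map could be extended like this (a sketch; list whichever bots you actually see in your own logs):

    map $http_user_agent $block_bot {
        default 0;
        ~*AhrefsBot 1;
        ~*SemrushBot 1;
        ~*MJ12bot 1;
        ~*DotBot 1;
        ~*BLEXBot 1;
    }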

After you finish building your list, press "Ctrl + O" on your keyboard to save the file, then "Ctrl + X" to exit the nano editor.

Step 3: Apply the changes

After making your changes, always test the Nginx configuration for correctness to ensure there are no syntax errors:

sudo nginx -t

If everything is fine, you’ll see:

nginx: the configuration file /etc/nginx/nginx.conf syntax is ok

nginx: configuration file /etc/nginx/nginx.conf test is successful

If there are errors, review the output to identify and correct them in the file specified.

Then reload the Nginx configuration to apply the changes:

sudo systemctl reload nginx

In the future, if you need to add more bots to the block_bots.conf file, you should repeat this step each time. 
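To verify that the block works, you can send a test request with a blocked User-Agent, for example with curl (your-site.example is a placeholder for your actual domain):

curl -I -A "AhrefsBot" https://your-site.example/

Because Nginx closes the connection without sending a response, curl should report something like “curl: (52) Empty reply from server”, while a request with a normal User-Agent still returns your page.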


Conclusion

Now you know how to easily block unwanted search bots on your server using Nginx! Keep an eye on your logs and add new lines to the block_bots.conf configuration file as needed.

Make sure you only block malicious bots so that you don't prevent useful search engines like Google or Bing from indexing your site.