NGINX Lua Scraping Protection

If you have followed the guide to configuring NGINX with Lua, you can now use BotGuard for Application's scraping protection module.

Prerequisites

Getting Started

Your account manager can provide a software package which includes the following files:

  • README.md - contains this information and a link to this documentation
  • example.block.conf.nginx - an example NGINX config file demonstrating blocking/redirecting with NGINX and Lua
  • example.scraping.conf.nginx - an example NGINX config file demonstrating scraping protection with NGINX and Lua
  • lua-plugins/injector.lua - injects a script into the <head> of an HTML document
  • lua-plugins/mitigation.lua - requests an ACTION from the mitigation API
  • lua-plugins/mitigation/ - contains the plugin modules
  • lua-plugins/scraping_check.lua - contains the code to check whether the scraping protection has passed
  • lua-plugins/scraping_guard.lua - contains the code to protect an endpoint
  • lua-plugins/xss.lua - helper to protect redirects from attacks
  • lua-plugins/tests/ - contains the unit tests
  • lua-plugins/interstitial.html - an example interstitial page

Considerations

Before integrating the scraping solution, it is worth considering the following:

  • The scraping solution uses domain-specific cookies. As a result, every page on the same domain that is protected by scraping is covered by the same cookie. If a page needs to load assets from a different domain, that domain cannot also be protected by scraping, because it would require its own cookie before it could serve the content. If content is loaded from a separate domain, either an IP-protected content server or a subdomain is recommended.
  • Because NGINX can be configured in many ways, we recommend starting with the Docker Sandbox demo and configuring a local setup that matches your production setup as closely as possible. Once that setup runs locally, use the same NGINX.conf and Docker containers in your cloud environment. Move to a different configuration or environment only when this works as expected. Ask your account manager for the DEMO sandbox.
  • Providing your account manager with an accurate topology of your production infrastructure, as well as your application-level design, will help them design a solution for you. Demonstration applications that closely represent the real setup will allow them to foresee any issues that might occur.
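
The first consideration can be sketched as a separate server block for the asset host that deliberately leaves out scraping protection. This is an illustration only: the hostname assets.example.com and the root path are assumptions, not values from the package.

```nginx
# Hypothetical asset host: intentionally NOT protected by scraping,
# so the browser can load assets without needing a second scraping cookie.
server {
    listen 3000;
    server_name assets.example.com;   # assumed subdomain for static content

    location / {
        root /usr/share/nginx/html;   # assumed asset directory
        # no access_by_lua_file /etc/lua-plugins/scraping_guard.lua; here
    }
}
```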

Configuration

  1. Follow the configuration setup here.

  2. Server block configuration:

    Configuration | Required | Type | Default | Example | Description
    --- | --- | --- | --- | --- | ---
    $custom_scraping_fields | false | string | NONE | '{"field":"value"}' | Used to configure custom fields to control scraping.
    $session_secret | false | string | NONE | 'correcthorsebatterystaple' | The key that will be used to encrypt the cookies. This will default to a random string, however if you have multiple NGINX instances you will want to set this to the same value on all instances.
    $scraping_interstitial_url | true | string | NONE | '/interstitial' | The route where to redirect a protected page to. You will need to create the location block to serve your custom interstitial page.
    $session_name | false | string | session | scraping-session | The session name stored in the browser for the cookie. This defaults to "session" so it is recommended to set this to something unique.
    $scraping_cookie_ttl | false | number | 15 | 10 | Specifies how regularly the interstitial page should be shown to a user if they have navigated away from the site. It is recommended that this is set to 10-15 seconds.
    $scraping_referer_parameter | false | string | next | referer | On successfully passing the scraping checks, this is the query parameter that HUMAN will look for to find out where to redirect the user to.
    $session_cookie_lifetime | false | number | 3600 | 63072000 | Length of time (in seconds) that the cookie will be valid for. This should be set to a high value (e.g. 2 years).
    $scraping_protection | false | number | nil | 1 | This field, when set to 1 is used to define pages that should be protected against scraping. When set to 2 this field defines the interstitial page.
    $scraping_refresher | true | string | nil | /nginx/scraping/refresh | This field informs scraping protection of the URL to poll to renew the user's cookie. This is a required field.

  3. Protecting an endpoint.

    To protect an endpoint from scraping, add the following to any NGINX location block that you wish to be protected:

    access_by_lua_file /etc/lua-plugins/scraping_guard.lua;
    
  4. Set up a location endpoint that displays the interstitial page. This page will be shown while the protection is running checks. See the example interstitial.html file provided.

    • Please note that it's important to protect this endpoint from XSS attacks. BotGuard offers a protection against this that is straightforward to implement. Add the following to the location block for the interstitial page:
    access_by_lua_file /etc/lua-plugins/xss.lua;
    
  5. To prevent users from repeatedly seeing the interstitial page as they browse around the site, add a session refresh endpoint that the BotGuard tag will use to check the status of the current user in the background:

    location /refresh {
        set $detection_tag_mo "2";
        content_by_lua_file /etc/lua-plugins/scraping_check.lua;
    }
    
    • Note that it's important that the mo value for this endpoint is 2, so that BotGuard is requested regularly and can update the status of the user. This example sets it explicitly for demonstration, although 2 is the default.
  6. Please note that it doesn't make sense to block/redirect an endpoint and also apply scraping protection to it. Choose one or the other depending on your needs.
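
The path variables from step 2 and the location blocks from steps 4 and 5 presumably have to agree with each other. The following is a minimal sketch of that pairing, assuming the plugins are installed in /etc/lua-plugins and using the example refresh path from the table in step 2. That the paths must match exactly is an assumption inferred from the variable descriptions, not something the package states.

```nginx
# server block: tell the plugin which URLs to use
set $scraping_interstitial_url "/interstitial";
set $scraping_refresher "/nginx/scraping/refresh";

# assumed to need the same path as $scraping_refresher
location = /nginx/scraping/refresh {
    set $detection_tag_mo "2";
    content_by_lua_file /etc/lua-plugins/scraping_check.lua;
}
```

If the refresh path is changed, both the variable and the location would need to be updated together.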

Examples

All examples and conf files will need the following set:

set $mitigation_api_key "API_KEY"; # the api key provided to you by your account manager
set $mitigation_api_et "12"; # the event type (scraping protection)
set $detection_tag_ci "CUSTOMER_ID"; # your customer ID
set $detection_tag_dt "DETECTION_TAG_ID"; # your tag ID
set $detection_tag_si "SITE_ID"; # a site identifier, specified by the customer

#scraping management
set $session_secret "$SCRAPING_SESSION_SECRET";
set $scraping_interstitial_url "/interstitial";
set $scraping_refresher "/refresh";
set $session_name "x-reload-session"; #some discreet name for the scraping session
set $scraping_referer_parameter "next"; #customise the parameter that will be used as the query parameter (defaults to next)
set $session_cookie_lifetime 63072000;
set $scraping_cookie_ttl 15;

Catch All

The following is the most basic example. It will send all non-GET requests to the mitigation API and inject the script tag on all responses that contain a </head> and/or <body> tag. This assumes that you have unzipped the release to /etc/lua-plugins.

worker_processes auto;
pcre_jit on;

events {
    worker_connections 1024;
}

http {

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    include mime.types;
    default_type application/octet-stream;
    gzip on;

    access_log /dev/stdout;

    lua_package_path "/etc/lua-plugins/?.lua;;";
    more_clear_headers Server;
    server_tokens off;

    server {
        listen 3000;
        server_name some.example.com localhost;
        resolver 8.8.8.8;
        client_header_buffer_size 8k;
        large_client_header_buffers 8 64k;
        error_log /dev/stdout debug;

        # required variables
        set $mitigation_api_key "API_KEY";
        set $detection_tag_ci "CUSTOMER_ID";
        set $detection_tag_dt "DETECTION_TAG_ID";
        set $mitigation_api_et "12";
        set $detection_tag_si "SITE_ID";
        set $detection_tag_host "sub.example.com";
        set $detection_tag_path "/ag/CUSTOMER_ID/clear.js";
        set $detection_tag_spa "0";
        set $detection_tag_mo "2";

        #scraping management
        set $session_secret "$SCRAPING_SESSION_SECRET";
        set $scraping_interstitial_url "/interstitial";
        set $scraping_refresher "/refresh";
        set $session_name "x-reload-session"; #some discreet name for the scraping session
        set $scraping_referer_parameter "next"; #customise the parameter that will be used as the query parameter (defaults to next)
        set $session_cookie_lifetime 63072000;
        set $scraping_cookie_ttl 15;

        location ~* \.(?:ico|css|js|gif|jpe?g|png|woff2|woff|ttf)$ {
            root /usr/share/nginx/html;
            index index.html index.htm;
        }

        location ^~ /refresh {
            set $detection_tag_mo "2";
            content_by_lua_file /etc/lua-plugins/scraping_check.lua;
        }

        location ^~ /interstitial {
            default_type text/html;
            header_filter_by_lua_block {
                ngx.header.content_length = nil;
            }

            set $detection_tag_spa "1";
            set $scraping_protection 2;
            body_filter_by_lua_file /etc/lua-plugins/injector.lua;
            access_by_lua_file /etc/lua-plugins/xss.lua;
            return 200 '<html><body><h1>Please wait while we check some things....</h1></body></html>';
        }

        location ^~ / {
            default_type text/html;

            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $remote_addr;

            header_filter_by_lua_block {
                ngx.header.content_length = nil;
            }
            body_filter_by_lua_file /etc/lua-plugins/injector.lua;
            lua_need_request_body on;

            set $scraping_protection 1;
            # protect all endpoints from scraping 
            access_by_lua_file /etc/lua-plugins/scraping_guard.lua;

            proxy_pass http://localhost:$BACKEND_PORT;
        }
        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
            root html;
        }
    }
}
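
The catch-all configuration protects every route. If only part of the site needs protection, the guard can presumably be attached to just that location, as step 3 of the configuration describes. In the sketch below, the /products path is a hypothetical example, and the remaining directives are copied from the catch-all location above. These two blocks would replace the `location ^~ /` block rather than sit alongside it.

```nginx
# only /products is protected by scraping
location ^~ /products {
    header_filter_by_lua_block {
        ngx.header.content_length = nil;
    }
    body_filter_by_lua_file /etc/lua-plugins/injector.lua;

    set $scraping_protection 1;
    access_by_lua_file /etc/lua-plugins/scraping_guard.lua;

    proxy_pass http://localhost:$BACKEND_PORT;
}

# everything else is proxied without the guard
location / {
    proxy_pass http://localhost:$BACKEND_PORT;
}
```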

Route Management

The following example is very similar to the one above, but it differs in several important ways, which are listed below and commented within the example for easier reading.

  1. Variables that are shared between endpoints are part of the server, and not location block.
  2. An NGINX location block to handle signup attempts that will be redirected to if a signup is blocked by the mitigation API.
  3. The /signup route is a vanilla HTML/CSS website and redirects a blocked user to a /catch endpoint, however it informs the client that the redirect is a 200. This code is configured to deceive the client rather than inform them.
  4. Different routes to define different configurations for /login vs /signup.
  5. The /login route is an SPA and defines a response code and body to respond with when a request is blocked.
  6. The interstitial page is served at /interstitial.
  7. All endpoints are protected by scraping.

worker_processes auto;
pcre_jit on;

events {
    worker_connections 1024;
}

http {

    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 65;
    types_hash_max_size 2048;
    include mime.types;
    default_type application/octet-stream;
    gzip on;

    access_log /dev/stdout;

    lua_package_path "/etc/lua-plugins/?.lua;;";
    more_clear_headers Server;
    server_tokens off;

    server {
        listen 3000;
        server_name some.example.com localhost;
        resolver 8.8.8.8;

        underscores_in_headers on; #required for signal headers
        # buffers for headers and body
        client_header_buffer_size 512k;
        large_client_header_buffers 8 512k;
        client_max_body_size 100M;
        proxy_busy_buffers_size   512k;
        proxy_buffers   4 512k;
        proxy_buffer_size   256k;

        location ~* \.(?:ico|css|js|gif|jpe?g|png|woff2|woff|ttf)$ {
            root /usr/share/nginx/html;
            index index.html index.htm;
        }

        # 1. Variables that are shared between endpoints are part of the server, and not location block
        set $mitigation_api_key "API_KEY";
        set $mitigation_api_et "12";
        set $detection_tag_ci "CUSTOMER_ID";
        set $detection_tag_dt "DETECTION_TAG_ID";
        set $detection_tag_host "sub.example.com";
        set $detection_tag_path "/ag/CUSTOMER_ID/clear.js";

        #scraping management
        set $session_secret "$SCRAPING_SESSION_SECRET";
        set $scraping_interstitial_url "/interstitial";
        set $scraping_refresher "/refresh";
        set $session_name "x-reload-session"; #some discreet name for the scraping session
        set $scraping_referer_parameter "next"; #customise the parameter that will be used as the query parameter (defaults to next)
        set $session_cookie_lifetime 63072000;
        set $scraping_cookie_ttl 15;

        location ^~ /refresh {
            set $detection_tag_mo "2";
            content_by_lua_file /etc/lua-plugins/scraping_check.lua;
        }

        location ^~ /interstitial {
            default_type text/html;
            header_filter_by_lua_block {
                ngx.header.content_length = nil;
            }

            set $detection_tag_spa "1";
            set $scraping_protection 2;
            body_filter_by_lua_file /etc/lua-plugins/injector.lua;
            access_by_lua_file /etc/lua-plugins/xss.lua;
            return 200 '<html><body><h1>Please wait while we check some things....</h1></body></html>';
        }

        # 2. An NGINX location block to handle signup attempts that will be redirected to if a signup is blocked by the mitigation API
        location ^~ /signup {
            default_type text/html;

            # required variables
            set $detection_tag_spa "0";
            set $detection_tag_mo "2";
            set $detection_tag_si "SITE_ID";

            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $remote_addr;

            header_filter_by_lua_block {
                ngx.header.content_length = nil;
            }
            body_filter_by_lua_file /etc/lua-plugins/injector.lua;
            lua_need_request_body on;
            set $scraping_protection 1;
            # protect endpoint from scraping 
            access_by_lua_file /etc/lua-plugins/scraping_guard.lua;

            proxy_pass http://localhost:$BACKEND_PORT;
        }

        # 4. Different routes to define different configuration for /login vs /signup
        location ^~ /login {
            default_type text/html;

            # required variables
            set $detection_tag_spa "1";
            set $detection_tag_mo "2";
            set $detection_tag_si "SITE_ID";
            # 5. The /login route is an SPA and defines a response code and body to respond with when a request is blocked in the case of an SPA
            set $block_spa_response_code "200";
            set $block_spa_response_body '{"success":"you are now logged in"}';

            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $remote_addr;

            header_filter_by_lua_block {
                ngx.header.content_length = nil;
            }
            body_filter_by_lua_file /etc/lua-plugins/injector.lua;
            lua_need_request_body on;
            set $scraping_protection 1;
            # protect endpoint from scraping
            access_by_lua_file /etc/lua-plugins/scraping_guard.lua;

            proxy_pass http://localhost:$BACKEND_PORT;
        }

        location ^~ / {
            header_filter_by_lua_block {
                ngx.header.content_length = nil;
            }
            body_filter_by_lua_file /etc/lua-plugins/injector.lua;
            set $scraping_protection 1;
            # protect all endpoints from scraping 
            access_by_lua_file /etc/lua-plugins/scraping_guard.lua;
            proxy_pass http://localhost:$BACKEND_PORT;
        }

        error_page 500 502 503 504 /50x.html;

        location = /50x.html {
            root html;
        }
    }
}

Final Remarks

The interstitial page can be any static HTML page, or any other web-servable content that can run JavaScript. An example is supplied as part of the Lua zip package, but it could be as simple as the following:

<html>
<head>
    <link rel="stylesheet" href="/public/checking.css">
    <style>
        section {
            text-align: center;
            background-color: #CCBCBC;
            margin: 0 auto;
            width: 80%;
            padding: 1.5em;
        }
        #message {
            padding: 1em;
            text-align: center;
        }
        .loader {
            margin: 0 auto;
            border: 16px solid #1C1D21;
            border-radius: 50%;
            border-top: 16px solid #F1E3E4;
            width: 120px;
            height: 120px;
            -webkit-animation: spin 2s linear infinite; /* Safari */
            animation: spin 2s linear infinite;
        }

        /* Safari */
        @-webkit-keyframes spin {
            0% { -webkit-transform: rotate(0deg); }
            100% { -webkit-transform: rotate(360deg); }
        }

        @keyframes spin {
            0% { transform: rotate(0deg); }
            100% { transform: rotate(360deg); }
        }
    </style>
</head>
<body>
<section>
    <div class="loader"></div>
    <div id="message"></div>
</section>
</body>
</html>
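
A page like this can be served from disk instead of from an inline return. The following is a sketch only: it assumes the file was unpacked to /etc/lua-plugins/interstitial.html and reuses the directives from the /interstitial location in the examples above.

```nginx
location = /interstitial {
    default_type text/html;
    header_filter_by_lua_block {
        ngx.header.content_length = nil;
    }

    set $detection_tag_spa "1";
    set $scraping_protection 2;   # marks this location as the interstitial page
    access_by_lua_file /etc/lua-plugins/xss.lua;
    body_filter_by_lua_file /etc/lua-plugins/injector.lua;

    # assumed location of the example page from the package
    alias /etc/lua-plugins/interstitial.html;
}
```

The exact-match location ignores the query string, so a request such as /interstitial?next=/some/page is still handled by this block.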