Configuring ML Based Bot Detection policy

The AI-based machine learning bot detection model complements the existing signature-based and threshold-based rules. It detects sophisticated bots that those rules can miss. The bot detection model observes user behavior across thirteen dimensions, for example, how many HTTP requests the user initiates, whether the requests use illegal HTTP versions, and whether they fetch JSON/XML resources.

Compared with the traditional bot detection mechanisms, the bot detection model saves you the trouble of experimenting to find an appropriate threshold for abnormal user behavior. For example, how would you know how many HTTP requests from a single user should be considered abnormal? With the traditional mechanism, you may need to try different threshold values and keep checking the attack log until regular traffic no longer generates related attack logs.

Things are much easier if you use the bot detection model. FortiWeb uses the SVM (Support Vector Machine) algorithm to build a bot detection model that self-learns the traffic profiles of regular clients. When traffic from a new client flows in, it is compared against the profiles of the regular clients. If it doesn't match, the bot detection model classifies the new client as an anomaly. When the traffic profiles of the regular clients change dramatically (e.g. the functions of your application have changed, so users behave differently when they visit it), FortiWeb automatically refreshes the bot detection model to adapt to the changes.

Moreover, tests show that the bot detection model performs much better than the traditional mechanisms, especially when detecting crawlers and scrapers. Because the traffic is evaluated comprehensively across thirteen dimensions, detection accuracy increases and the false positive rate decreases.

Basic Concepts

The ML Based bot detection model has three stages: sample collecting, model building, and model running.

Sample collecting

To build a bot detection model, the system collects samples (also called vectors) of users' behaviors while they visit your application. Each sample records a certain user's behaviors in a certain time range.

The samples are split into two parts. Three quarters of the samples are assigned to the training sample set, and the remaining quarter is assigned to the testing sample set.
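The three-quarters/one-quarter split described above can be sketched as follows. This is only an illustration, not FortiWeb's implementation: the function name `split_samples` and the shuffling step are assumptions, since the document does not say how samples are assigned to each set.

```python
import random


def split_samples(samples, seed=0):
    """Split samples into a training set (3/4) and a testing set (1/4).

    Shuffling before the split is an assumption made for this sketch.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 3) // 4
    return shuffled[:cut], shuffled[cut:]


# Hypothetical sample IDs standing in for collected vectors.
training, testing = split_samples(range(1000))
print(len(training), len(testing))
```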

Model building

During the model building stage, the system observes the training samples to self-learn user behavior profiles and builds up mathematical models using the SVM (Support Vector Machine) algorithm. The SVM parameters are used to eliminate rogue training samples and control individual sample influence on the overall result.

Multiple models are built based on different parameter combinations in the SVM algorithm. According to the training accuracy, cross-validation value, testing accuracy, and the model type you have configured, the system narrows down the selection to one model and uses it as the bot detection model.
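As a rough sketch of the selection step, the following code filters candidate models by the three qualification values and then picks one according to the model type, following the Model Type description later in this document (Moderate picks the highest training accuracy, Strict the lowest). The class `CandidateModel`, the function names, and the sample figures are hypothetical; the default thresholds of 95, 90, and 95 percent are the defaults given for the advanced Model Building Settings.

```python
from dataclasses import dataclass


@dataclass
class CandidateModel:
    name: str
    training_accuracy: float   # percent
    cross_validation: float    # percent
    testing_accuracy: float    # percent


def accuracy_percent(regular_samples, total_samples):
    """Share of samples that a model treats as regular, in percent."""
    return 100.0 * regular_samples / total_samples


def select_model(candidates, model_type,
                 min_training=95.0, min_cross_validation=90.0, min_testing=95.0):
    """Return the final model, or None if no candidate qualifies."""
    qualified = [
        m for m in candidates
        if m.training_accuracy >= min_training
        and m.cross_validation >= min_cross_validation
        and m.testing_accuracy >= min_testing
    ]
    if not qualified:
        return None
    if model_type == "moderate":
        return max(qualified, key=lambda m: m.training_accuracy)
    if model_type == "strict":
        return min(qualified, key=lambda m: m.training_accuracy)
    raise ValueError(f"unknown model type: {model_type}")


# Made-up candidates; "svm-c" fails the training accuracy threshold.
candidates = [
    CandidateModel("svm-a", accuracy_percent(99, 100), 95.0, 98.0),
    CandidateModel("svm-b", 96.0, 92.0, 96.0),
    CandidateModel("svm-c", 94.0, 95.0, 99.0),
]
print(select_model(candidates, "moderate").name)
print(select_model(candidates, "strict").name)
```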

Model running

When the bot detection model is in the running state, the system compares users' behaviors against the bot detection model. If the traffic from a certain user doesn't match the model, the system records the traffic as an anomaly. If a certain number of anomalies are recorded for this user, the system takes actions such as sending alert emails or blocking the traffic from this user.
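The per-user anomaly counting can be pictured with the minimal sketch below. It follows the Anomaly Count description later in this document: the count grows with each anomalous vector, and it is cleared when no valid traffic is collected for a vector. The class `AnomalyTracker` and its method names are hypothetical, and what happens to the count after the limit is reached is not specified in the document, so the sketch simply keeps reporting that the limit was reached.

```python
class AnomalyTracker:
    """Counts anomalous vectors per client (illustrative only)."""

    def __init__(self, anomaly_count):
        self.anomaly_count = anomaly_count
        self.counts = {}

    def observe(self, client_id, vector_is_anomaly):
        """Record one vector for a client.

        vector_is_anomaly is True or False for a collected vector, or None
        when no valid traffic was collected (the client left the site).
        Returns True when the client has reached the anomaly count.
        """
        if vector_is_anomaly is None:
            # No valid traffic: forget the client and start afresh later.
            self.counts.pop(client_id, None)
            return False
        if vector_is_anomaly:
            self.counts[client_id] = self.counts.get(client_id, 0) + 1
        return self.counts.get(client_id, 0) >= self.anomaly_count


tracker = AnomalyTracker(anomaly_count=4)
# Three anomalies in the first six vectors, then a fourth on the 7th vector.
for is_anomaly in [True, False, True, False, False, True, True]:
    reached = tracker.observe("client-1", is_anomaly)
print(reached)
```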

Sometimes regular traffic is falsely detected as an anomaly (a false positive). The system uses Bot Confirmation to confirm whether an anomaly is indeed a bot. If false positives occur so often that they exceed a certain threshold, the system considers the current bot detection model invalid and automatically updates the model.

Bot detection policies are part of a server policy. They are created on the Policy > Server Policy page.

To create a Bot Detection policy:

Click Policy > Server Policy.
Select an existing server policy.
Note that machine learning policies can't be created during the server policy creation process. First create a server policy, then click its Edit button to create a machine learning policy.
Scroll down to the Machine Learning section at the bottom of the page, click the Bot Detection tab, then click Create. The New Machine Learning dialog opens.
Click the + (Add) sign below the IP Range field to add an IP address or range, which limits the system to collecting data only from the specified IP range. Leave this field empty to collect data from all sources.
Click OK.

When this is done, go back to Server Policy and select the server policy that contains the Bot Detection policy you just created. The following buttons appear in the Bot Detection tab.

Button	Function
View	Click to view and edit machine learning policies and their learning results. Note: You can also access the Machine Learning page by clicking Machine Learning, and then selecting a specific policy.
Start/Stop	Click to start/stop Machine Learning for the policy.
Refresh	Click to restart machine learning. Note: This will discard all existing learning results and then relearn all data.
Discard	Click to remove all learned data from the policy.
Export	Click to export all the data generated by the machine learning policy.
Import	Click to import the machine learning data from your local directory to FortiWeb.

All bot detection policies that you have created will show up on the Bot Mitigation > ML Based Bot Detection page, where you can configure or edit them to your preference.

To configure a bot detection policy:

Click Bot Mitigation > ML Based Bot Detection.
Double-click a bot detection policy of interest (or highlight it and then click the Edit button on top of the page) to open it. The Edit bot detection page opens, which breaks down bot detection policy into several sections, each of which has various parameters you can use to configure the policy.
Follow the instructions in the following subsections to configure a bot detection policy.
Click OK when done.

The Advanced settings in the bot detection policy are hidden by default. Run the following commands to show the settings:

config waf bot-detection-policy
  edit <bot-detection-policy_ID>
    set advanced-mode enable
end

Sections & Parameters	Function
Sample Settings
Client Identification Method	The data collected in one sample should be from the same user. The system uses IP, IP and User-Agent, or Cookie to identify a user. IP: The traffic data in one sample should come from the same source IP. IP and User-Agent: The traffic data in one sample should come from the same source IP and User-Agent (the browser). Cookie: The traffic data in one sample should have the same cookie value.
Sampling Time per Vector	Each vector (also called a sample) records a certain user's behaviors in a certain time range. This option defines the length of that time range. For example, if Sampling Time per Vector is 5 minutes, the system records a certain user's behaviors over 5 minutes and counts them as one sample.
Sample Count per Client per Hour	This option controls how many samples FortiWeb collects from each client (user) in an hour. For example, if the value is set to 3 and a client generates 10 samples in an hour, the system collects only the first 3 samples from this client in that hour. If the client generates more samples in the second hour, the system resumes collecting samples from this client until the count for that hour reaches 3. This option prevents the system from continuously collecting samples from one client, which limits the interference of bot traffic in the sampling stage.
Sample Count	This option controls how many samples are collected during the sample collection period. More samples make the model more accurate, but sample collection takes longer to complete. Not all traffic data is collected as samples. The system discards traffic data that meets any of the following criteria: (1) The client fails the JavaScript challenge. The system sends a JavaScript challenge to user clients before collecting samples from them, and it does not collect sample data from a client that doesn't pass the challenge. (2) The traffic is from malicious IPs reported by the IP Intelligence feature, or is recognized as a bot by the system. (3) The traffic is from Known Engines, such as Google and Bing. The system also skips known engine traffic when executing bot detection. These criteria exclude malicious traffic and traffic from known engines that act like bots, so that the bot detection model is built on valid data collected from regular users.
Model Building Settings
Model Type	Multiple models are built during the model building stage. The system uses training accuracy, cross-validation value, and testing accuracy to select qualified models. The Model Type is used to select the one final model out of all the qualified models. If you set the Model Type to Moderate, the system chooses the model with the highest training accuracy among all the qualified models. If you set the Model Type to Strict, the system chooses the model with the lowest training accuracy among all the qualified models. The Strict model detects more anomalies, but regular users are more likely to be falsely detected as bots. The Moderate model is comparatively loose. It is less likely to produce false positives, but real bots are more likely to escape detection. There isn't a perfect option for every situation. Whichever model type you choose, you can always use the options in Anomaly Detection Settings and Action Settings to mitigate the side effects, for example, using Bot Confirmation to avoid false positive detections.
Advanced (Model Building Settings)
Training Accuracy	The training accuracy is calculated by this formula: (the number of regular samples in the training sample set / the total number of training samples) * 100%. As introduced in the Basic Concepts section, multiple models are built based on multiple parameter combinations in the SVM algorithm. The system uses each model to detect anomalies in the sample set, and calculates the training accuracy for each model. For example, if there are 100 training samples, and 90 of them are treated as regular samples by a model, then the training accuracy for this model is 90%. The default value for the training accuracy is 95%, which means only the models whose training accuracy is equal to or higher than 95% are selected as qualified models.
Cross-Validation Value	The system divides the training sample set evenly into three parts, say Part A, B and C, and executes three rounds of bot detection. First, the system observes the samples in Part A and B to build a mathematical model, then uses this model to detect anomalies in Part C. Next, the system observes the samples in Part B and C to build a mathematical model, then uses this model to detect anomalies in Part A. Finally, the system observes the samples in Part A and C to build a mathematical model, then uses this model to detect anomalies in Part B. The cross-validation value is calculated by this formula: (the total number of regular samples / the total number of samples) * 100%. For example, if there are 100 samples, and 10 anomalies are detected in the three rounds, then the cross-validation value for this model is: (100 - 10)/100 * 100% = 90%. The default value for the cross-validation value is 90%, which means only the models whose cross-validation value is equal to or higher than 90% are selected as qualified models.
Testing Accuracy	Three quarters of the samples are assigned to the training sample set, and one quarter of the samples are assigned to the testing sample set. The system uses the models built from the training sample set to detect anomalies in the testing sample set. If the training accuracy and testing accuracy for a model differ greatly, it may indicate that the model is not valid. The testing accuracy is calculated by this formula: (the number of regular samples in the testing sample set / the number of testing samples) * 100%. For example, if there are 100 testing samples, and 95 of them are treated as regular samples by a model, then the testing accuracy for this model is 95%. The default value for the testing accuracy is 95%, which means only the models whose testing accuracy is equal to or higher than 95% are selected as qualified models.
Anomaly Detection Settings
Anomaly Count	If the system detects a certain number of anomalies from a user, it takes actions such as sending alert emails or blocking the traffic from this user. Anomaly Count controls how many anomalies are allowed for each user. For example, suppose the Anomaly Count is set to 4, and the system has detected 3 anomalies in the last 6 vectors. If the 7th vector is also detected as an anomaly, the system takes action. Note that if no valid traffic is collected for the 7th vector (for example, the user leaves your application), the system clears the anomaly count and the user information. If the user revisits your application, the user is treated as a new user and the system starts counting anomalies afresh. Since this option tolerates a certain number of anomalies from a user, it might be a good choice if you want to avoid false positive detections.
Bot Confirmation	If the number of anomalies from a user has reached the Anomaly Count, the system executes Bot Confirmation before taking actions. Bot Confirmation checks whether the user is indeed a bot. The system sends RBE (Real Browser Enforcement) JavaScript or CAPTCHA to the client to double-check whether it is a real bot.
For Browser
Verification Method	Disable: Do not execute browser verification. Real Browser Enforcement: The system sends a JavaScript to the client to verify whether it is a web browser. CAPTCHA Enforcement: The system requires the client to successfully fulfill a CAPTCHA request. reCAPTCHA Enforcement: The system requires the client to successfully fulfill a reCAPTCHA request. The system triggers the action policy if the traffic is not from a web browser.
reCAPTCHA	Select the reCAPTCHA server you have created in the reCAPTCHA Server tab in User > Remote Server. See Creating reCAPTCHA servers
Validation Timeout	Enter the maximum amount of time (in seconds) that FortiWeb waits for results from the client for Bot Confirmation. The default value is 20. The valid range is 5–30.
Max Attempt Times	Enter the maximum number of times that FortiWeb attempts to validate whether the request is from a browser. Available only when CAPTCHA Enforcement is selected.
For mobile client Apps
Verification Method	Disable: Do not execute mobile client verification. Mobile-Token-Validation: The system checks the mobile token to verify whether the traffic is from mobile devices. The system triggers the action policy if the traffic is not from mobile devices.
Dynamically Update Model	With the option enabled, FortiWeb can detect if the current model is applicable. If not, FortiWeb will refresh the current model automatically.
Advanced (Anomaly Detection Settings)
Auto Refresh Factor	Auto Refresh Factor controls when a model refresh is triggered after a certain number of false positive vectors are detected. FortiWeb keeps statistics for bot detection in the past 24 hours. It counts the following vectors: all vectors in the past 24 hours (A), anomaly vectors (B), and the anomaly vectors that are confirmed as bots (C). If (B - C)/(A - C) > 1 - Auto Refresh Factor * training accuracy, the model is refreshed. (B - C) is the number of false positive vectors, and (A - C) is the number of regular vectors, so (B - C)/(A - C) represents the false positive rate. (1 - Auto Refresh Factor * training accuracy) is an adjusted anomaly vector rate. You can consider it as an auto refresh threshold. If the false positive rate (B - C)/(A - C) becomes greater than the auto refresh threshold (1 - Auto Refresh Factor * training accuracy), the system determines that the current model is not applicable and automatically refreshes the model. The following table calculates the value of the auto refresh threshold when the Auto Refresh Factor is set to 0-1 (assuming the training accuracy is the default value 95%). For example, if the Auto Refresh Factor is set to 0.8, the auto refresh threshold is 1 - 0.8 * 95% = 0.24, which means the system automatically refreshes the model when the false positive rate is greater than 0.24 (e.g. 24 false positive vectors and 100 regular vectors). You can use this table to quickly decide a value for the Auto Refresh Factor that is suitable for your situation.
Minimum Vector Number	As mentioned above, the system decides whether to update the bot detection model based on the statistics from the past 24 hours. If very few vectors are detected in the past 24 hours, the refresh decision may be unreliable. Set a value for the Minimum Vector Number so that the system won't update the model if the number of vectors hasn't reached this value. If the value is set to 0, the system uses the value of the Sample Count as the Minimum Vector Number.
Action Settings
Action	Double click the cells in the Action Settings table to choose the action FortiWeb takes when a user client is confirmed as a bot: Alert—Accepts the connection and generates an alert email and/or log message. Alert & Deny—Blocks the requests from the user (or resets the connection) and generates an alert and/or log message. Period Block—Blocks the requests from the user for a certain period of time.
Block Period	Enter the number of seconds that you want to block the requests. The valid range is 1–3,600 seconds (1 hour). This option only takes effect when you choose Period Block in Action.
Severity	Select the severity level for this anomaly type. The severity level will be displayed in the alert email and/or log message.
Trigger Action	Select a trigger policy that you have set in Log&Report > Log Policy > Trigger Policy. If an anomaly is detected, it will trigger the system to send email and/or log messages according to the trigger policy.
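The auto refresh arithmetic from the Auto Refresh Factor description can be reproduced with a few lines of code. The sketch below is only an illustration: the function names are hypothetical, the training accuracy is assumed to be the default 95%, and the zero-regular-vector guard is an assumption added to keep the example safe to run. It prints the auto refresh threshold for factors from 0 to 1 in steps of 0.1.

```python
def auto_refresh_threshold(auto_refresh_factor, training_accuracy=0.95):
    """Threshold = 1 - Auto Refresh Factor * training accuracy."""
    return 1.0 - auto_refresh_factor * training_accuracy


def should_refresh(all_vectors, anomaly_vectors, confirmed_bots,
                   auto_refresh_factor, training_accuracy=0.95):
    """Apply (B - C)/(A - C) > threshold to the 24-hour counts A, B, C."""
    regular_vectors = all_vectors - confirmed_bots          # A - C
    false_positives = anomaly_vectors - confirmed_bots      # B - C
    if regular_vectors <= 0:
        return False  # assumption: nothing to evaluate without regular vectors
    false_positive_rate = false_positives / regular_vectors
    return false_positive_rate > auto_refresh_threshold(
        auto_refresh_factor, training_accuracy)


# Threshold for each Auto Refresh Factor from 0.0 to 1.0.
for tenths in range(0, 11):
    factor = tenths / 10
    print(f"{factor:.1f}  {auto_refresh_threshold(factor):.3f}")
```

For instance, with a factor of 0.8, hypothetical counts of A = 130, B = 60, and C = 30 give a false positive rate of 30/100 = 0.3, which is above the 0.24 threshold, so `should_refresh(130, 60, 30, 0.8)` returns `True`.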

Limit sample collection from IPs

Add IP addresses in this table so that the system will collect sample data only from the specified IP addresses.

If you leave this table blank, there will be no limitation for the IP addresses, which means the system will collect sample data from any IP addresses.

To collect samples only from certain IP addresses:

In the Limit Sample Collections From IPs section, click Create New.
Enter the IP range. Both IPv4 and IPv6 addresses are supported.
Click OK.
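The effect of this table can be pictured with the short sketch below, which uses Python's standard `ipaddress` module. It is not FortiWeb code: the function name `in_collection_scope` and the representation of each range as a (start, end) pair are assumptions made for illustration. An empty list means no limitation, as described above.

```python
import ipaddress


def in_collection_scope(client_ip, ip_ranges):
    """Return True if samples may be collected from client_ip.

    ip_ranges is a list of (start, end) address pairs; IPv4 and IPv6
    ranges may be mixed. An empty list means every address is allowed.
    """
    if not ip_ranges:
        return True
    ip = ipaddress.ip_address(client_ip)
    for start, end in ip_ranges:
        low = ipaddress.ip_address(start)
        high = ipaddress.ip_address(end)
        # Only compare addresses of the same IP version.
        if ip.version == low.version == high.version and low <= ip <= high:
            return True
    return False


# Documentation-reserved example addresses.
ranges = [("192.0.2.1", "192.0.2.100"), ("2001:db8::1", "2001:db8::ff")]
print(in_collection_scope("192.0.2.50", ranges))
print(in_collection_scope("198.51.100.7", ranges))
```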

Exception URLs

The system builds machine learning models for any URL except the ones in the Exception URLs list.

Due to the nature of some web pages, such as a stock list web page, even regular users may behave like bots because they tend to refresh the page frequently. You may need to add these URLs to the exception list; otherwise the model may be invalid because too many bot-like behaviors are recorded in the samples.

To add Exception URLs:

In the Exception URLs section, click Create New.

Configure the settings:

Parameters	Functions
Host Status	Enable to compare the URLs to the `Host:` field in the HTTP header.
Host	Select the IP address or FQDN of a protected host.
Type	Select how the Exception URL is matched. Simple String: The field is a string that the Exception URL must match exactly. Regular Expression: The field is a regular expression that defines a set of matching URLs.
URL Pattern	Depending on your selection in Type, enter either of the following. Simple String: The literal URL, such as `/index.php`, that the HTTP request must contain in order to match the rule. The URL must begin with a slash ( `/` ). Regular Expression: A regular expression, such as `^/.php`, matching the URLs to which the rule should apply. The pattern does not require a slash ( `/` ), but it must match URLs that begin with a slash, such as `/index.cfm`. Do not include the domain name, such as `www.example.com`, which is configured separately in Host. To test a regular expression, click the >> (test) icon. This icon opens the Regular Expression Validator window, where you can fine-tune the expression.

Click OK.
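To make the difference between the two match types concrete, here is a minimal sketch of how an exception list could be evaluated. It is not FortiWeb's implementation: the rule dictionary layout, the function name `is_exception_url`, the exact-match reading of Simple String, and the example patterns are all assumptions, and Python's `re` module only stands in for whatever regular expression engine the product uses.

```python
import re


def is_exception_url(host, url_path, rules):
    """Return True if the request matches any exception rule.

    Each rule is a dict with keys: host_status (bool), host (str),
    type ("simple" or "regex"), and pattern (str).
    """
    for rule in rules:
        # When Host Status is enabled, the Host header must match as well.
        if rule["host_status"] and rule["host"] != host:
            continue
        if rule["type"] == "simple":
            matched = url_path == rule["pattern"]
        else:
            matched = re.search(rule["pattern"], url_path) is not None
        if matched:
            return True
    return False


# Hypothetical rules: one literal URL and one regular expression.
rules = [
    {"host_status": True, "host": "www.example.com",
     "type": "simple", "pattern": "/index.php"},
    {"host_status": False, "host": "",
     "type": "regex", "pattern": r"^/stocks/"},
]
print(is_exception_url("www.example.com", "/index.php", rules))
print(is_exception_url("shop.example.com", "/stocks/list", rules))
```

Requests that match an exception rule would be left out of the samples, so pages that regular users refresh frequently do not distort the model.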