Configuring bot detection profiles

Basic Concepts

The bot detection model has three stages: sample collecting, model building, and model running.

Sample collecting

To build a bot detection model, the system collects samples (also called vectors) of users' behavior while they visit your application. Each sample records one user's behavior over a set time range.

The samples are split into two sets: three quarters go into the training sample set, and one quarter goes into the testing sample set.
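The three-to-one split can be pictured with a minimal Python sketch. The function name `split_samples`, the shuffle, and the fixed seed are assumptions made for illustration; the guide does not say how the system assigns samples to each set.

```python
import random


def split_samples(samples, train_fraction=0.75, seed=0):
    """Split samples into (training, testing) lists at the given fraction."""
    shuffled = list(samples)
    # Shuffling is an assumption; the product may assign samples differently.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


training, testing = split_samples(range(1000))
print(len(training), len(testing))
```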

Model building

During the model building stage, the system observes the training samples to self-learn user behavior profiles and builds up mathematical models using the SVM (Support Vector Machine) algorithm. The SVM parameters are used to eliminate rogue training samples and control individual sample influence on the overall result.

Multiple models are built based on different parameter combinations in the SVM algorithm. According to the training accuracy, cross-validation value, testing accuracy, and the model type you have configured, the system narrows down the selection to one model and uses it as the bot detection model.
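As a rough illustration of this narrowing step, the sketch below filters candidate models by the three accuracy measures and then picks one according to the model type. The `Candidate` class, the `select_model` function, and the sample figures are hypothetical; the default thresholds and the Moderate/Strict rule follow the Model Type and Advanced (Model Building Settings) descriptions in this guide.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    training_accuracy: float
    cross_validation: float
    testing_accuracy: float


def select_model(candidates, model_type,
                 min_training=0.95, min_cross_validation=0.90, min_testing=0.95):
    """Return the chosen candidate, or None if no candidate qualifies."""
    qualified = [
        c for c in candidates
        if c.training_accuracy >= min_training
        and c.cross_validation >= min_cross_validation
        and c.testing_accuracy >= min_testing
    ]
    if not qualified:
        return None
    if model_type == "moderate":
        # Moderate: highest training accuracy among the qualified models.
        return max(qualified, key=lambda c: c.training_accuracy)
    if model_type == "strict":
        # Strict: lowest training accuracy among the qualified models.
        return min(qualified, key=lambda c: c.training_accuracy)
    raise ValueError(f"unknown model type: {model_type!r}")


candidates = [
    Candidate("a", 0.99, 0.93, 0.97),
    Candidate("b", 0.96, 0.91, 0.95),
    Candidate("c", 0.97, 0.88, 0.96),  # fails the cross-validation threshold
    Candidate("d", 0.94, 0.95, 0.96),  # fails the training accuracy threshold
]
print(select_model(candidates, "moderate").name)
print(select_model(candidates, "strict").name)
```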

Model running

When the bot detection model is running, the system compares users' behavior against it. If the traffic from a user doesn't match the model, the system records the traffic as an anomaly. Once a configured number of anomalies has been recorded for that user, the system takes action, such as sending alert emails or blocking the traffic from that user.
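The counting behavior can be sketched as follows. The `AnomalyTracker` class and its method names are hypothetical. The threshold of 4 and the rule that a vector with no valid traffic clears the count are taken from the Anomaly Count description in this guide. What happens to the count after the threshold is reached is not stated, so the sketch simply keeps counting.

```python
class AnomalyTracker:
    """Per-client anomaly counter (illustrative only)."""

    def __init__(self, anomaly_count=4):
        self.anomaly_count = anomaly_count
        self.counts = {}

    def observe(self, client, vector_result):
        """Record one vector; return True when the threshold is reached.

        vector_result is "regular", "anomaly", or None when no valid
        traffic was collected for the vector.
        """
        if vector_result is None:
            # No valid traffic: forget the client and its anomaly count.
            self.counts.pop(client, None)
            return False
        if vector_result == "anomaly":
            self.counts[client] = self.counts.get(client, 0) + 1
            return self.counts[client] >= self.anomaly_count
        return False


tracker = AnomalyTracker(anomaly_count=4)
vectors = ["anomaly", "regular", "anomaly", "regular",
           "regular", "anomaly", "anomaly"]
results = [tracker.observe("client-a", v) for v in vectors]
print(results)
```

In this run the first six vectors contain three anomalies, and the seventh vector is the fourth anomaly, so only the last call reports that the threshold has been reached.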

Traffic is sometimes flagged as an anomaly by mistake (a false positive). The system uses Bot Confirmation to confirm whether an anomaly really is a bot. If false positives occur often enough to exceed a certain threshold, the system considers the current bot detection model invalid and automatically updates it.
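The refresh decision uses the formula given under Auto Refresh Factor in the Advanced (Anomaly Detection Settings) section: the model is refreshed when (B - C)/(A - C) > 1 - Auto Refresh Factor * training accuracy. The sketch below works through that arithmetic. The function names and the vector counts are made up for illustration, and the guard against an empty denominator is an assumption.

```python
def auto_refresh_threshold(auto_refresh_factor, training_accuracy=0.95):
    """Threshold = 1 - Auto Refresh Factor * training accuracy."""
    return 1 - auto_refresh_factor * training_accuracy


def should_refresh(all_vectors, anomaly_vectors, confirmed_bots,
                   auto_refresh_factor, training_accuracy=0.95):
    """Apply (B - C)/(A - C) > threshold to the 24-hour counts A, B, and C."""
    false_positives = anomaly_vectors - confirmed_bots  # B - C
    regular_vectors = all_vectors - confirmed_bots      # A - C
    if regular_vectors <= 0:
        # Guard for the sketch only; the guide doesn't cover this case.
        return False
    rate = false_positives / regular_vectors
    return rate > auto_refresh_threshold(auto_refresh_factor, training_accuracy)


# Thresholds for factors 0.0 to 1.0 at the default 95% training accuracy.
for step in range(11):
    factor = step / 10
    print(f"{factor:.1f} -> {auto_refresh_threshold(factor):.3f}")

print(should_refresh(130, 55, 30, auto_refresh_factor=0.8))  # rate 25/100
print(should_refresh(120, 40, 20, auto_refresh_factor=0.8))  # rate 20/100
```

With a factor of 0.8 the threshold is 0.24, so a false positive rate of 0.25 triggers a refresh and a rate of 0.20 does not.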

Creating bot detection profiles

Bot detection profiles are part of a server policy. They are created on the Policy > Server Policy page. All bot detection profiles that you create appear on the Machine Learning > Bot Detection page, where you can configure or edit them.

To configure a bot detection profile:

Click Machine Learning > Bot Detection.
Double-click the bot detection profile you want to configure (or highlight it and then click the Edit button at the top of the page). The Edit bot detection page opens. It breaks the bot detection profile down into several sections, each with parameters you can use to configure the profile.
Follow the instructions in the following subsections to configure a bot detection profile.
Click OK when done.

The Advanced settings in the bot detection profile are hidden by default. Run the following commands to show the settings:

config waf bot-detection-policy
  edit <bot-detection-policy_ID>
    set advanced-mode enable
end

Sections & Parameters	Function
Sample Settings
Client Identification Method	The data collected in one sample should be from the same user. The system uses IP, IP and User-Agent, or Cookie to identify a user. IP: The traffic data in one sample should come from the same source IP. IP and User-Agent: The traffic data in one sample should come from the same source IP and User-Agent (the browser). Cookie: The traffic data in one sample should have the same cookie value.
Sampling Time per Vector	Each vector (also called a sample) records one user's behavior over a set time range. This option defines how long the time range is. For example, if Sampling Time per Vector is 5 minutes, the system records a user's behavior for 5 minutes and counts it as one sample.
Sample Count per Client per Hour	This option controls how many samples FortiWeb collects from each client (user) in an hour. For example, if the value is set to 3 and a client generates 10 samples in an hour, the system collects only the first 3 samples from this client in that hour. If the client generates more samples in the second hour, the system again collects samples from this client until the sample count for that hour reaches 3. This option prevents the system from continuously collecting samples from one client, which limits the influence of bot traffic during the sampling stage.
Sample Count	This option controls how many samples are collected during the sample collection period. More samples make the model more accurate, but sample collection takes longer to complete. Not all traffic data is collected as samples. The system discards traffic data in any of the following cases: The system sends a JavaScript challenge to clients before collecting samples from them. If a client doesn't pass the challenge, the system does not collect sample data from it. The traffic is from malicious IPs reported by the IP Intelligence feature, or is recognized as a bot by the system. The traffic is from Known Engines, such as Google and Bing. The system also skips known engine traffic when executing bot detection. These criteria exclude malicious traffic and traffic from known engines that act like bots, so that the bot detection model is built on valid data collected from regular users.
Model Building Settings
Model Type	Multiple models are built during the model building stage. The system uses training accuracy, cross-validation value, and testing accuracy to select qualified models. The Model Type is used to select the one final model out of all the qualified models. If you set the Model Type to Moderate, the system chooses the model with the highest training accuracy among the qualified models. If you set the Model Type to Strict, the system chooses the model with the lowest training accuracy among the qualified models. The Strict model detects more anomalies, but regular users are more likely to be falsely detected as bots. The Moderate model is comparatively loose. It is less likely to produce false positives, but real bots are more likely to escape detection. No option is perfect for every situation. Whichever model type you choose, you can use the options in Anomaly Detection Settings and Action Settings to mitigate the side effects, for example, using Bot Confirmation to avoid false positives.
Advanced (Model Building Settings)
Training Accuracy	The training accuracy is calculated by this formula: (number of regular samples in the training sample set / total number of training samples) * 100%. As described in the Basic Concepts section, multiple models are built from different parameter combinations in the SVM algorithm. The system uses each model to detect anomalies in the training sample set and calculates the training accuracy for each model. For example, if there are 100 training samples and a model treats 90 of them as regular samples, the training accuracy for this model is 90%. The default value for the training accuracy is 95%, which means only models whose training accuracy is equal to or higher than 95% are selected as qualified models.
Cross-Validation Value	The system divides the training sample set evenly into three parts, for example Part A, B, and C, and runs three rounds of bot detection. First, the system observes the samples in Part A and B to build a mathematical model, then uses this model to detect anomalies in Part C. Next, it observes the samples in Part B and C to build a model, then uses this model to detect anomalies in Part A. Finally, it observes the samples in Part A and C to build a model, then uses this model to detect anomalies in Part B. The cross-validation value is calculated by this formula: (total number of regular samples / total number of samples) * 100%. For example, if there are 100 samples and 10 anomalies are detected across the three rounds, the cross-validation value for this model is (100 - 10)/100 * 100% = 90%. The default value for the Cross-Validation Value is 90%, which means only models whose Cross-Validation Value is equal to or higher than 90% are selected as qualified models.
Testing Accuracy	Three quarters of the samples go into the training sample set, and one quarter goes into the testing sample set. The system uses the models built from the training sample set to detect anomalies in the testing sample set. If the training accuracy and testing accuracy for a model differ greatly, the model may not be valid. The testing accuracy is calculated by this formula: (number of regular samples in the testing sample set / total number of testing samples) * 100%. For example, if there are 100 testing samples and a model treats 95 of them as regular samples, the testing accuracy for this model is 95%. The default value for the testing accuracy is 95%, which means only models whose testing accuracy is equal to or higher than 95% are selected as qualified models.
Anomaly Detection Settings
Anomaly Count	If the system detects a certain number of anomalies from a user, it takes action, such as sending alert emails or blocking the traffic from this user. Anomaly Count controls how many anomalies are allowed for each user. For example, suppose the Anomaly Count is set to 4 and the system has detected 3 anomalies in the last 6 vectors. If the 7th vector is also detected as an anomaly, the system takes action. Note that if no valid traffic is collected for the 7th vector (for example, because the user has left your application), the system clears the anomaly count and the user information. If the user revisits your application, the user is treated as a new user and the system starts counting anomalies afresh. Because this option tolerates a certain number of anomalies from a user, it can be a good choice if you want to avoid false positives.
Bot Confirmation	When the number of anomalies from a user reaches the Anomaly Count, the system runs Bot Confirmation before taking action, to confirm whether the user is indeed a bot. The system sends an RBE (Real Browser Enforcement) JavaScript or a CAPTCHA to the client to double-check.
Verification Method	Real Browser Enforcement: The system sends a JavaScript to the client to test whether it is a web browser or automated tool. CAPTCHA Enforcement: The system requires clients to successfully fulfill a CAPTCHA request.
Validation Timeout	Enter the maximum amount of time (in seconds) that FortiWeb waits for results from the client for Bot Confirmation. The default value is 20. The valid range is 5–30.
Dynamically Update Model	With the option enabled, FortiWeb can detect if the current model is applicable. If not, FortiWeb will refresh the current model automatically.
Advanced (Anomaly Detection Settings)
Auto Refresh Factor	Auto Refresh Factor controls when a model refresh is triggered after a certain number of false positive vectors are detected. FortiWeb keeps statistics on bot detection over the past 24 hours. It counts the following vectors: all vectors in the past 24 hours (A), anomaly vectors (B), and anomaly vectors that are confirmed as bots (C). If (B - C)/(A - C) > 1 - Auto Refresh Factor * training accuracy, the model is refreshed. (B - C) is the number of false positive vectors, and (A - C) is the number of regular vectors, so (B - C)/(A - C) represents the false positive rate. (1 - Auto Refresh Factor * training accuracy) is an adjusted anomaly vector rate that you can think of as an auto refresh threshold. When the false positive rate becomes greater than the auto refresh threshold, the system determines that the current model is no longer applicable and automatically refreshes it. The following table calculates the value of the auto refresh threshold when the Auto Refresh Factor is set to 0-1 (assuming the training accuracy is the default value 95%). For example, if the Auto Refresh Factor is set to 0.8, the auto refresh threshold is 1 - 0.8 * 95% = 0.24, which means the system automatically refreshes the model when the false positive rate is greater than 0.24 (for example, more than 24 false positive vectors per 100 regular vectors). You can use this table to quickly decide a value for the Auto Refresh Factor that is suitable for your situation.
Minimum Vector Number	As mentioned above, the system decides whether to update the bot detection model based on the statistics from the past 24 hours. If very few vectors are detected in the past 24 hours, the refresh decision may be unreliable. Set a value for the Minimum Vector Number so that the system won't update the model until the number of vectors reaches this value. If the value is set to 0, the system uses the value of the Sample Count as the Minimum Vector Number.
Action Settings
Action	Double click the cells in the Action Settings table to choose the action FortiWeb takes when a user client is confirmed as a bot: Alert—Accepts the connection and generates an alert email and/or log message. Alert & Deny—Blocks the requests from the user (or resets the connection) and generates an alert and/or log message. Period Block—Blocks the requests from the user for a certain period of time.
Block Period	Enter the number of seconds that you want to block the requests. The valid range is 1–3,600 seconds. The default value is 60 seconds. This option only takes effect when you choose Period Block in Action.
Severity	Select the severity level for this anomaly type. The severity level will be displayed in the alert email and/or log message.
Trigger Action	Select a trigger policy that you have set in Log&Report > Log Policy > Trigger Policy. If an anomaly is detected, it will trigger the system to send email and/or log messages according to the trigger policy.
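The three-round procedure described under Cross-Validation Value can be illustrated with a short Python sketch. The model here is a toy stand-in, not the SVM the system uses: it accepts any value within two standard deviations of the training mean. The function names, the single numeric feature, and the sample values are all assumptions made for illustration.

```python
import statistics


def build_model(samples, k=2.0):
    """Toy stand-in for the SVM: an accepted range around the mean."""
    mean = statistics.mean(samples)
    spread = statistics.pstdev(samples)
    return (mean - k * spread, mean + k * spread)


def is_regular(model, sample):
    low, high = model
    return low <= sample <= high


def cross_validation_value(samples, parts=3):
    """Build on all parts but one, test on the held-out part, repeat."""
    folds = [samples[i::parts] for i in range(parts)]
    regular = 0
    for held_out in range(parts):
        training = [s for i, fold in enumerate(folds) if i != held_out
                    for s in fold]
        model = build_model(training)
        regular += sum(is_regular(model, s) for s in folds[held_out])
    return regular / len(samples)


# Hypothetical per-vector feature values with one obvious outlier (95).
samples = [10, 12, 11, 13, 12, 11, 10, 12, 95, 11, 13, 12]
value = cross_validation_value(samples)
print(f"{value:.1%}", "qualified" if value >= 0.90 else "rejected")
```

Each sample is tested exactly once, by a model that never saw it during building, which is why the regular counts from the three rounds can be summed and divided by the total number of samples.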

Limit sample collection from IPs

Add IP addresses in this table so that the system will collect sample data only from the specified IP addresses.

If you leave this table blank, there will be no limitation for the IP addresses, which means the system will collect sample data from any IP addresses.

To collect samples only from certain IP addresses:

In the Limit Sample Collections From IPs section, click Create New.
Enter the IP range. Both IPv4 and IPv6 addresses are supported.
Click OK.
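The effect of this table, including the rule that an empty table means no restriction, can be sketched in Python. The `start-end` notation for a range is an assumption made for the example, as are the function names; check the format that the IP range field actually accepts.

```python
import ipaddress


def parse_range(text):
    """Parse "start-end" or a single address into (first, last)."""
    start, _, end = text.partition("-")
    first = ipaddress.ip_address(start.strip())
    last = ipaddress.ip_address(end.strip()) if end else first
    if first.version != last.version or first > last:
        raise ValueError(f"invalid range: {text!r}")
    return first, last


def may_collect_from(client_ip, allowed_ranges):
    """Return True if samples may be collected from client_ip."""
    if not allowed_ranges:
        # Empty table: samples are collected from any IP address.
        return True
    address = ipaddress.ip_address(client_ip)
    for first, last in map(parse_range, allowed_ranges):
        if address.version == first.version and first <= address <= last:
            return True
    return False


ranges = ["192.0.2.10-192.0.2.20", "2001:db8::1"]
print(may_collect_from("192.0.2.15", ranges))
print(may_collect_from("192.0.2.21", ranges))
print(may_collect_from("2001:db8::1", ranges))
print(may_collect_from("203.0.113.7", []))
```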

Exception URLs

The system collects samples from any URL except the ones in the Exception URLs list.

On some web pages, such as a stock list page, even regular users may behave like bots because they tend to refresh the page frequently. You may need to add these URLs to the exception list. Otherwise, the model may be invalid because too many bot-like behaviors are recorded in the samples.

To add Exception URLs:

In the Exception URLs section, click Create New.

Configure the settings:

Parameters	Functions
Host Status	Enable to compare the URLs to the `Host:` field in the HTTP header.
Host	Select the IP address or FQDN of a protected host.
Type	Select how the URL Pattern is interpreted. Simple String: The field is a string that the Exception URL must match exactly. Regular Expression: The field is a regular expression that defines a set of matching URLs.
URL Pattern	Depending on your selection in Type, enter either: Simple String: The literal URL, such as `/index.php`, that the HTTP request must contain in order to match the rule. The URL must begin with a slash ( `/` ). Regular Expression: A regular expression, such as `^/.php`, matching the URLs to which the rule should apply. The pattern does not require a slash ( `/` ), but it must match URLs that begin with a slash, such as `/index.cfm`. Do not include the domain name, such as `www.example.com`, which is configured separately in Host. To test a regular expression, click the >> (test) icon. This icon opens the Regular Expression Validator window, where you can fine-tune the expression.

Click OK.
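The matching logic behind an exception rule can be pictured with the sketch below. The rule layout (a dictionary with `host_status`, `host`, `type`, and `pattern` keys), the function name, and the example patterns are hypothetical. The sketch treats Simple String as an exact match and Regular Expression as a search within the URL, which is one reading of the Type and URL Pattern descriptions above.

```python
import re


def is_exception(url, host, rule):
    """Return True if the request matches the exception rule."""
    if rule["host_status"] and host != rule["host"]:
        # Host Status enabled: the Host: header must match the rule's host.
        return False
    if rule["type"] == "simple":
        return url == rule["pattern"]
    # Regular expression: the pattern may match anywhere in the URL.
    return re.search(rule["pattern"], url) is not None


simple_rule = {"host_status": False, "host": None,
               "type": "simple", "pattern": "/index.php"}
regex_rule = {"host_status": True, "host": "www.example.com",
              "type": "regex", "pattern": r"^/stocks/.*\.php$"}

print(is_exception("/index.php", "www.example.com", simple_rule))
print(is_exception("/stocks/list.php", "www.example.com", regex_rule))
print(is_exception("/stocks/list.php", "other.example.com", regex_rule))
```

Requests that match an exception rule would be left out of sample collection, so pages that users refresh heavily don't skew the model.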