In-depth explanation of the SOAR script under the private cloud

Many people are not sure what SOAR (Security Orchestration, Automation, and Response) is actually used for, so today I will introduce its core part, the script, to help everyone get to know it.

I. Introduction

Before discussing how to write a SOAR script, let’s review what SOAR is.

Gartner defines SOAR as Security Orchestration, Automation and Response, and describes it as covering four stages: detection (Detect), triage (Triage), response (Respond) and prioritization (Prioritize).

That may still not explain what SOAR is used for, or which scenarios it suits. No matter. By analyzing related products and our department’s own rollout on real user data, we arrived at a five-word summary of where SOAR applies.

  • Lack: Security analysts and operations staff lack experience in handling security alerts and events.
  • Mess: The internal process for handling security issues is too long and often requires cooperation among multiple departments.
  • Miscellaneous: Security equipment comes from many brands and models, which makes it hard to operate in a simple, fast way.
  • Many: The security events and alarms generated every day, or even every hour, take a lot of manpower and time to troubleshoot.
  • Urgent: Certain types of security incident must be handled immediately, but there is no 24-hour emergency staff or handling method in place.

With this five-word summary you should have a preliminary understanding of where SOAR applies. Next, let’s look at the core part of SOAR: the script.

II. What is a script?

A script is a description of how to deal with one type of security problem: it states what needs to be handled and how. (Many SOAR products call this a playbook.) In our view, a script consists of actions and processing logic.

In the script, the actions to be done can be divided into the following four categories:

  1. Query class: What data should I get at the start?
  2. Parsing class: Which fields should I extract after getting the data?
  3. Response class: What should I do in this case?
  4. Output class: How should I notify the user when I’m done?

The processing logic in a script can be summed up as: the conventional routine, adapted to local conditions.

For example, when malware suddenly appears on a client machine, the conventional routine is: run the antivirus software, remove the malware, then do a full scan.

What does adapting to local conditions mean? Suppose the Agent deployed on the client only detects and cannot delete files, but a blocking device is deployed. Then the method changes: first block the related traffic and notify staff to handle it, then automatically restore normal traffic once handling is complete.
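The malware example can be sketched in a few lines. This is only a minimal illustration, not our platform’s code: the function names, the capability strings (`agent_delete`, `traffic_block`) and the fixed alarm are all hypothetical. The four small functions play the role of actions, and `run_script` holds the processing logic that picks the conventional routine or the locally adapted one.

```python
from typing import Dict, List, Set


def query_alarm(ctx: Dict) -> None:
    # Query class: a hypothetical stand-in that returns one fixed alarm.
    ctx["alarm"] = {"type": "malware", "host": "10.0.0.5", "file": "/tmp/x.bin"}


def parse_alarm(ctx: Dict) -> None:
    # Parsing class: extract the fields the later steps need.
    ctx["host"] = ctx["alarm"]["host"]
    ctx["file"] = ctx["alarm"]["file"]


def respond_delete(ctx: Dict) -> None:
    # Response class (conventional routine): remove the file and scan.
    ctx["steps"].append(f"delete {ctx['file']} on {ctx['host']} and run full scan")


def respond_block(ctx: Dict) -> None:
    # Response class (local adaptation): block traffic instead.
    ctx["steps"].append(f"block traffic of {ctx['host']}")


def notify_staff(ctx: Dict) -> None:
    # Output class: tell a human what happened.
    ctx["steps"].append(f"notify staff about {ctx['host']}")


def run_script(capabilities: Set[str]) -> List[str]:
    """Processing logic: choose the response that the environment supports."""
    ctx: Dict = {"steps": []}
    query_alarm(ctx)
    parse_alarm(ctx)
    if "agent_delete" in capabilities:
        respond_delete(ctx)
    elif "traffic_block" in capabilities:
        respond_block(ctx)
    notify_staff(ctx)
    return ctx["steps"]


print(run_script({"traffic_block"}))
```

The same script yields a different list of steps depending on what the customer environment can do, which is the whole point of separating actions from logic.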

By now you can sense that actions relate to processing logic the way limbs relate to the brain, and that is indeed the case. We summarize it in the following figure:

After reading the two core components of the script, the next question is how to write a script.

III. How to write a script?

Before discussing how to write a script, I want to raise one question: is a bigger script really better? (“Big” here means the script contains more actions and more judgment logic.)

There is a well-known joke tool in the security community called “one-click satellite”.

(The tool is just funny. Don’t take it seriously, and don’t do such things that endanger national security.)

I mention it only to make the point that “a lot of complicated and difficult things can be done in a simple way”. Users certainly want a few scripts to cover every security scenario and issue, so that they do not have to invest much manpower, material resources and energy. In theory, a sufficiently rich script can indeed handle most cases of one type of problem, but in practice we did not go that way.

Too large a script will cause the following two problems:

1). Difficult to read and modify

Early in our research on SOAR, we saw a huge script in a Demisto product (Demisto is a foreign security vendor with a SOAR product): counting actions alone, the script had 100+. Counting judgment logic as well, the number is even greater.

The number 100+ may not feel very intuitive, so here is another way to describe it. The script is displayed on a page that you browse by scrolling. It can take tens of seconds of scrolling with the mouse wheel to get from the top of the script to the bottom, and the script also contains quite a few logical branches. A large script is very unfriendly to anyone who has to read or modify it.

2). The bigger the script, the worse the portability

The larger the script, the wider the problem it covers; the wider the problem, the more equipment it uses. In real environments each customer owns different devices, so the corresponding actions (device calls) differ too. Therefore, the more device-calling actions a script contains, the more likely it is to fail or return abnormal results in a different environment.

This also gives us a warning: an excellent script should not depend on one specific device to complete an action. Instead, it should be backed by many actions that can call any device of that type. Only then can the same script work in different customer environments, without being abandoned because a certain device is missing.
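One way to picture this is a small registry in which a script asks for a capability and whichever installed device supports it does the work. This is a sketch under assumptions: the device names (`vendor_a_firewall`, `soft_waf`) and the returned strings are made up for illustration and do not correspond to any real product API.

```python
from typing import Callable, Dict, List, Set, Tuple

# Registry: capability name -> list of (device name, handler).
_BACKENDS: Dict[str, List[Tuple[str, Callable[[str], str]]]] = {}


def register(capability: str, device: str):
    """Decorator that registers a device-specific handler for a capability."""
    def deco(fn: Callable[[str], str]) -> Callable[[str], str]:
        _BACKENDS.setdefault(capability, []).append((device, fn))
        return fn
    return deco


@register("block_ip", "vendor_a_firewall")
def _block_on_vendor_a(ip: str) -> str:
    # Hypothetical device call; a real one would talk to the firewall.
    return f"vendor_a_firewall: deny {ip}"


@register("block_ip", "soft_waf")
def _block_on_soft_waf(ip: str) -> str:
    # Hypothetical device call; a real one would talk to the WAF.
    return f"soft_waf: blacklist {ip}"


def block_ip(ip: str, installed: Set[str]) -> str:
    """Device-agnostic action: use the first installed device that can block."""
    for device, handler in _BACKENDS.get("block_ip", []):
        if device in installed:
            return handler(ip)
    raise LookupError("no installed device can block IPs")


print(block_ip("203.0.113.7", {"soft_waf"}))
```

The script only ever calls `block_ip`, so moving it to a customer with a different firewall means registering one more handler rather than rewriting the script.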

Let’s return to our theme: how to write a script. We start with actions.

Earlier we talked about dividing actions into four categories: query, parsing, response, and output. Next we will explain by category.

1. Query class

When designing query actions, we considered whether to design an action that runs on a schedule. After discussing it with teammates, we decided to leave scheduled execution to the platform and keep the design idea of “one type of action completes only one function”. In the query class we mainly implement three types of query function, introduced one by one below.

First, a query action must be able to read security events or alarms. We implemented several ways of doing this, which fall into two categories: single read and batch read. Each has its own subdivisions, as shown in the figure below:

Single read

Single read has two actions: reading the most recently generated security event / alarm, and reading the most recent security event / alarm of a specified type. The “read recent events / alarms” action is designed to pair with the platform’s scheduled execution, so that under certain configurations it achieves so-called “minute-level” detection. Why there is also a read for a specified type of event / alarm is explained after the introduction of batch read.

Batch read

Some people may ask: isn’t running several “single read” actions the same as one “batch read”? We answer from two practical angles:

1). In actual use, customers often need to assess the security situation of yesterday or of a recent period, and then write analysis notes or draw up plans. If only “single read” actions were provided, the script might contain a large number of them, which would make it bloated or even unable to run properly. Single read simply cannot cope here.

2). The alarms generated during an intrusion are often related, so the security events / alarms within a period of time need to be analyzed together. For example, when an alarm for a “one-sentence Trojan” (a one-line webshell) implant appears, the request alone cannot tell whether a real vulnerability was exploited successfully or whether it is just an automated scanner. But if a large number of “SQL injection” alarms appear in the window before that request, a real vulnerability is likely: the SQL injection succeeded and was used to write the one-sentence Trojan. This is where batch reading of security events / alarms shows its value.

Back to the question raised under “Single read”: why is there a read for a specified type of event / alarm? This is a problem we did not anticipate at design time and only discovered when running on real data: customers often have several security devices whose detection functions overlap, and each device has its own category name for the same type of vulnerability. So, to obtain a specific type of security event / alarm, it is often necessary to specify keywords.
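A batch read with a time window and keywords might look roughly like the sketch below. The in-memory `ALARMS` list, the field names (`time`, `device`, `category`) and the category strings are assumptions for illustration; a real action would read from the platform’s alarm store.

```python
from datetime import datetime
from typing import Dict, List

# Hypothetical alarms from two devices that name the same attack differently.
ALARMS: List[Dict] = [
    {"time": datetime(2020, 1, 1, 10, 0), "device": "waf", "category": "SQL Injection"},
    {"time": datetime(2020, 1, 1, 10, 5), "device": "ids", "category": "sqli attempt"},
    {"time": datetime(2020, 1, 1, 10, 7), "device": "ids", "category": "XSS"},
    {"time": datetime(2020, 1, 1, 8, 0), "device": "waf", "category": "SQL Injection"},
]


def batch_read(alarms: List[Dict], start: datetime, end: datetime,
               keywords: List[str]) -> List[Dict]:
    """Return alarms inside [start, end] whose category matches any keyword."""
    lowered = [k.lower() for k in keywords]
    return [
        a for a in alarms
        if start <= a["time"] <= end
        and any(k in a["category"].lower() for k in lowered)
    ]


hits = batch_read(
    ALARMS,
    start=datetime(2020, 1, 1, 9, 30),
    end=datetime(2020, 1, 1, 10, 30),
    keywords=["sql injection", "sqli"],
)
for alarm in hits:
    print(alarm["device"], alarm["category"])
```

Passing a list of keywords is what lets one action collect the same vulnerability type across devices that label it differently.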

Besides the “read security event / alarm” actions, we also need a way to query various commonly used information. We therefore implemented what we call “regular query”, which contains four types of action: ES query, MySQL query, resource query, and call query, as shown in the following figure:

The “limited field value” and “limited range” actions under “ES query” are easy to understand: ES (Elasticsearch) is often used for data storage in big data platforms, so we need a way to get data out of it. But there is a roadblock: how do we support more complex ES queries? Must users type complex ES query statements themselves, or can we only support simple queries? After thinking about this for a long time, we found a neat answer: hand the problem to the “parsing class” actions mentioned above. It is not described in detail here; see the parsing class section below.
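As a rough idea of what the two simple ES actions could send, the sketch below builds an Elasticsearch query body that combines a `term` filter (limited field value) with a `range` filter (limited range). How our actions map onto the query DSL is an assumption made for illustration, and the field names `alarm_type` and `@timestamp` are hypothetical.

```python
import json
from typing import Dict


def es_field_and_range(field: str, value: str,
                       range_field: str, gte: str, lte: str) -> Dict:
    """Build a simple Elasticsearch query body: exact field value plus a range."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {field: value}},
                    {"range": {range_field: {"gte": gte, "lte": lte}}},
                ]
            }
        }
    }


body = es_field_and_range(
    field="alarm_type", value="sql_injection",       # hypothetical field name
    range_field="@timestamp", gte="now-1h", lte="now",
)
print(json.dumps(body, indent=2))
```

The query stays deliberately simple; anything more elaborate is left to the parsing actions applied to the returned documents.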

“MySQL query” is similar to “ES query” and just as straightforward: it queries data from a MySQL database. Because of how databases are structured, you may need to know the names of the databases, tables, and fields when using this action. It does not currently support overly complex SQL; only basic query functions are supported.

“Resource query” does what its name says: it obtains third-party information. We divide it into two categories, “open source intelligence query” and “external query”. We split them rather than merging them because open source intelligence is rich enough to deserve separate study. “External query” is specifically for obtaining data from internal or external systems. For example, if the relevant information sits in a file and must be read to make a judgment, this kind of action can be used.

The last type, “call query”, is closely tied to the actual environment.

“Call device query”: In the customer environment, agents on endpoints or certain devices have detection functions, so we provide this type of action to obtain their detection results.

“Collaborative query”: Collaborating with whom? With people, of course. This type of action interacts with other software and systems, or obtains instructions or settings issued by the relevant personnel.

That completes the introduction of the “query class” actions. In brief: the “query class” has two major classes and six subclasses in total, as shown in the figure below.

2. Parsing class

The “parsing class” introduced next has four categories: structure parsing, description parsing, condition parsing, and hybrid parsing, as shown in the following figure:

First, let’s look at the “structure parsing” actions.

The “single structure” action is the easiest to understand: logs or security events / alarms are often read in as a dictionary, so you can read a field’s value directly by its key name. This is the most basic action when parsing data.

The “embedded structure” action may look harder at first glance, but a plain explanation helps: the data is like a nesting doll. The data you want, C, is stored in data B, and data B is a field value of data A. You could parse this with several basic “single structure” actions, but that is tedious, so we provide a type of action that parses such nesting-doll data directly.

The “list nested” action parses each element in a list. As mentioned above, “query class” actions fetch data from databases or elsewhere, and that data mostly arrives as lists. So an action for parsing the data in a list is necessary.
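The three structure-parsing actions can be illustrated with plain dictionaries and lists. This is a minimal sketch: the function names and the sample record are hypothetical, and missing keys simply yield `None` here, which is one design choice among several.

```python
from typing import Any, Dict, List, Optional


def get_single(data: Dict, key: str) -> Optional[Any]:
    """Single structure: read one field by its key name."""
    return data.get(key)


def get_nested(data: Dict, path: List[str]) -> Optional[Any]:
    """Embedded structure: walk A -> B -> C along a list of keys."""
    current: Any = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def get_each(items: List[Dict], path: List[str]) -> List[Optional[Any]]:
    """List nested: apply the nested lookup to every element of a list."""
    return [get_nested(item, path) for item in items]


record = {"id": 7, "http": {"request": {"uri": "/login.php"}}}
rows = [record, {"id": 8, "http": {"request": {"uri": "/index.php"}}}, {"id": 9}]

print(get_single(record, "id"))
print(get_nested(record, ["http", "request", "uri"]))
print(get_each(rows, ["http", "request", "uri"]))
```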

The second category, “description parsing”, is used to extract the part of a field value that is actually needed. It provides two types of action: regular-expression parsing and feature description.

“Regular-expression parsing” is easy to understand. In practice there is often unstructured data, or a field from which features must be extracted. Regular expressions handle this well, so we think this type of action is necessary.

Anyone who has used regular expressions knows they can be brittle: a change in the data or an incomplete expression often means that nothing is matched, or even that wrong information is extracted. So we created the “feature description” action as a complement to “regular-expression parsing”. A “feature description” action parses by delimiters and positions, for example: first split the field on spaces, then split the third piece on commas, and finally take the third of the comma-separated values.
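The two description-parsing styles can be compared side by side. The sketch below is illustrative only: the log line, the IPv4 pattern, and the `(delimiter, position)` step format are assumptions, not the format our platform actually uses.

```python
import re
from typing import List, Optional, Tuple

LINE = "2020-01-01 WARN 10.0.0.5,10.0.0.9,deny extra"


def regex_parse(text: str, pattern: str) -> Optional[str]:
    """Regular-expression parsing: return the first capture group, if any."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def feature_parse(text: str, steps: List[Tuple[str, int]]) -> Optional[str]:
    """Feature description: repeatedly split on a delimiter and keep the
    piece at a 1-based position; return None if a position does not exist."""
    current = text
    for delimiter, position in steps:
        parts = current.split(delimiter)
        if position < 1 or position > len(parts):
            return None
        current = parts[position - 1]
    return current


# First dotted-quad in the line (a loose IPv4 pattern, for illustration).
print(regex_parse(LINE, r"\b(\d{1,3}(?:\.\d{1,3}){3})\b"))

# Split on spaces, take the 3rd piece, split on commas, take the 3rd value.
print(feature_parse(LINE, [(" ", 3), (",", 3)]))
```

The step list reads almost like the sentence that describes it, which makes it easier for non-programmers to maintain than a regular expression.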

After “description parsing”, let’s look at the “condition parsing” actions, which carry some logic.

Why do we need “condition parsing” actions that carry logic? Because in practice some logic is simple to state but takes real effort to implement, such as judging whether an IP falls within an IP range. Our SOAR platform (the engine that runs scripts) includes the most basic judgment functions, such as string containment and greater-than / less-than comparison of values, but it is powerless for judgments like the one above. We make up for this shortcoming with this type of action.

The “multi-field value condition” action selects data in which several fields must satisfy conditions at the same time, while the “complex judgment condition” action suits scenarios where field values must be converted or computed before they can be compared.
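Both kinds of condition action are short once written as code. The sketch below uses Python’s standard `ipaddress` module for the IP range check; the function names, the sample record and the predicates are hypothetical.

```python
import ipaddress
from typing import Any, Callable, Dict


def ip_in_segment(ip: str, segment: str) -> bool:
    """Complex judgment: is the IP inside the CIDR segment?"""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(segment, strict=False)


def multi_field_match(record: Dict[str, Any],
                      conditions: Dict[str, Callable[[Any], bool]]) -> bool:
    """Multi-field value condition: every listed field must exist and pass."""
    return all(
        field in record and predicate(record[field])
        for field, predicate in conditions.items()
    )


alarm = {"src_ip": "192.168.1.20", "status_code": "200", "severity": 4}

is_internal_success = multi_field_match(alarm, {
    "src_ip": lambda ip: ip_in_segment(ip, "192.168.1.0/24"),
    # The field value is converted before comparison.
    "status_code": lambda code: int(code) == 200,
    "severity": lambda level: level >= 3,
})
print(is_internal_success)
```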

The final category, “hybrid parsing”, combines existing parsing actions to handle more complex scenarios. One point to state here: scripts can be nested, and the output of one script can serve as the input of another. Since scripts are composed of actions, actions can be nested too (on the page an action appears as a draggable component, but behind it is still a piece of code). With the parsing actions above, the earlier question of how to get the results of complex ES queries in the “ES query” action is easily solved: fetch the results with a simple query, then parse them.

In our view this is also the more efficient way to get complex results: parsing fields with draggable components on the page is quicker than writing elegant ES statements (though, of course, that may just be because we are not very good at writing ES queries).

3. Response class

The “response-type” actions described below are designed based on actual equipment conditions.

When thinking about how to divide “response” actions, many ideas came to mind: by domestic versus foreign equipment? By device function? By something else? In the end we concluded that they should be divided along two dimensions: call design and device function. So there are two ways to classify “response” actions.

Method 1: call design

Response actions inevitably deal with all kinds of equipment, so we need to know how to communicate with it. We analyze this for open source software (software WAFs, etc.) and closed source devices.

In real environments, open source security software, or a secondary development of it, is often deployed. Such software generally has detailed documentation, usage specifications, and mature APIs exposed externally, so its response actions are relatively simple to write. The bigger pitfall is writing “difficult to call” actions: these require reading the source code to find the relevant parameters before a calling method can be written, so the cost of writing them is relatively high.

The “closed source” category mostly means equipment from various manufacturers. Some manufacturers provide good manuals and mature APIs for their products, so you can write a calling method from the manual (as with open source). Others require the customer to request that the relevant interface be enabled; once it is, the implementation is similar to the above.

All of the action types above can therefore be implemented on top of a simple HTTP request interface.
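A response action built on plain HTTP might be sketched as follows, using only Python’s standard `urllib.request`. The URL path `/api/v1/blocklist`, the JSON body and the bearer-token header are invented for illustration; every real device defines its own interface, so treat these as placeholders. The sketch builds the request without sending it.

```python
import json
import urllib.request


def build_block_request(base_url: str, token: str, ip: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking a device to block an IP.
    The endpoint and payload shape are hypothetical."""
    payload = json.dumps({"action": "block", "ip": ip}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/blocklist",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


req = build_block_request("https://device.example", "TOKEN", "203.0.113.7")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Sending it would be a call such as `urllib.request.urlopen(req, timeout=5)`, with the response status checked before the script marks the step as done.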

The other case is that the device provides a command-line window for interaction. If the messages it sends are encrypted, this is hard to handle; if not, you can capture packets to learn the request format, then change the corresponding parameters and issue an HTTP request to achieve the same purpose.

Finally, the “independent” action targets devices that provide neither an interface nor a command line and only support operation through a login page. How we handle that case is a secret for now!

Method 2: Equipment functions

The other way to classify response actions is by the function of the called device. The figure above shows a fairly simple classification, and since each function has its own characteristics I won’t go into detail here; readers can look up the details themselves.

Next is the last type of action, and the one closest to the user: the output class.

4. Output class

Again the classification diagram is presented directly. The output class has four parts: status feedback, general notification, product combination, and collaborative processing.

SOAR has to process a large number of security events / alarms. After processing, the status must be changed or the relevant field values modified to indicate “this has been processed”. We designed the “status feedback” actions for this purpose.

The “general notification” actions are designed to notify system administrators promptly of emergencies or of scenarios that need their decision. They also support sending to non-SOAR users.

Our SOAR is part of the SOC product developed by our department, so some results need to be passed back to other SOC functions. However, as a platform for orchestration and automated response, SOAR should integrate easily with other platforms as well, not only with the SOC. We therefore defined the “product combination” actions (although, for now, most of them combine with our own SOC product =,=).

The final “collaborative processing” actions emphasize recording how the responsible person handled the security issue, because in real scenarios machines are usually assigned to specific people for management.

Action summary

That completes the introduction of actions in SOAR scripts. They fall into four categories, as shown in the figure below.

This summary of actions is based on analysis of other vendors’ products and on our own rollout experience, so it should have good reference value.

Processing logic

As mentioned earlier, a script consists of two parts: actions and processing logic. Having introduced actions, we now turn to processing logic. We described it above as the conventional routine adapted to local conditions. That statement is rather general, so we refine it into the following questions:

What data to use

A script solves a certain type of problem, so we need to know what data that problem requires. Only after confirming which data to use can you choose the actions that obtain and parse it. The data you obtain also affects the quality of the solution. As mentioned earlier, in real scenarios several different devices may generate events / alarms for the same type of vulnerability, each with a different level of detail. Choosing the right data source at this stage often achieves more with less effort.

What is the basis of judgment

Handling security events / alarms usually requires judging the attack: what type of vulnerability it targets, what the attack payload is, whether the payload succeeded, and other related factors. These judgments need to be solidified into the script because, as offensive and defensive techniques evolve, many attack characteristics can no longer be spotted by eye or by simple rules. Once the basis for judgment is determined, you also need to consider whether it can be implemented through actions (the solidifying step).

Whether to combine other data

In attack and defense, many vulnerabilities or attack methods cannot be confirmed from a single event / alarm, so other data often needs to be combined for confirmation. When combining data, consider both the time dimension and the alarm type.

A simple example: if a SQL injection alarm appears on a server and is followed by a one-sentence Trojan (one-line webshell) alarm, the attack has probably succeeded. Likewise, if a server raises an abnormal-login alarm after a login brute-force attempt, the correct password was probably guessed and then used to log in.
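Combining the time dimension with the alarm type can be sketched as a “B follows A on the same host within a window” check. The alarm fields (`time`, `host`, `type`), the type names and the one-hour window are assumptions for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def followed_by(alarms: List[Dict], first_type: str, second_type: str,
                window: timedelta) -> Set[str]:
    """Return hosts where second_type occurs within `window` after first_type."""
    hosts: Set[str] = set()
    for first in alarms:
        if first["type"] != first_type:
            continue
        for second in alarms:
            gap = second["time"] - first["time"]
            if (second["type"] == second_type
                    and second["host"] == first["host"]
                    and timedelta(0) <= gap <= window):
                hosts.add(first["host"])
    return hosts


alarms = [
    {"time": datetime(2020, 1, 1, 10, 0), "host": "web-1", "type": "sql_injection"},
    {"time": datetime(2020, 1, 1, 10, 20), "host": "web-1", "type": "webshell"},
    # On web-2 the webshell alarm comes BEFORE the injection, so no match.
    {"time": datetime(2020, 1, 1, 9, 0), "host": "web-2", "type": "webshell"},
    {"time": datetime(2020, 1, 1, 10, 0), "host": "web-2", "type": "sql_injection"},
]

print(followed_by(alarms, "sql_injection", "webshell", timedelta(hours=1)))
```

The same helper covers the second example by calling it with a brute-force type followed by an abnormal-login type.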

How to handle it

How to handle events / alarms depends on the actual on-site conditions and on the type of event / alarm. Note that not all handling has to rely on security equipment (though having it is more convenient). Some operations, such as adjusting network policies to block traffic or deleting malicious files on a server, can be done with custom actions. Where security equipment is available, consider which device can achieve the goal while keeping the business running normally.

Whether human intervention is required

Two terms come up often in detection: false negatives and false positives. Both occur in SOAR too, so you must consider them when writing a script. Our goal is to make SOAR a “platform where users can drink coffee and solve problems with a few clicks of the mouse”, but that is still very hard in the short term. At this stage, people make the decisions and SOAR does most of the work. When a person needs to decide, the script should use actions to gather plenty of relevant information to assist them.

Do you need to verify the results

Whether a device is called or a colleague is asked to assist, you need to consider whether the step actually achieves the expected result. If it may not, add actions and processing logic to confirm or retry. Take removing a Trojan as an example: if the malware cannot be deleted, or can resurrect itself, we need to confirm whether the Trojan file was really deleted, and decide what to do if it was not.

If the handling of a security event / alarm is in a false state (reported as successful when it actually failed), the same type of event / alarm will keep appearing, and resources end up being consumed over and over on the same failed step.
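A respond-then-verify loop is one way to avoid that false state. The sketch below is hypothetical: `FakeHost` simulates a Trojan that restores itself a fixed number of times, and the retry limit and the `needs_human` status are illustrative choices rather than platform behaviour.

```python
from typing import Callable, Dict


def respond_with_verification(respond: Callable[[], None],
                              verify: Callable[[], bool],
                              max_attempts: int = 3) -> Dict:
    """Run the response, check the result, retry, and escalate if it never holds."""
    for attempt in range(1, max_attempts + 1):
        respond()
        if verify():
            return {"status": "resolved", "attempts": attempt}
    return {"status": "needs_human", "attempts": max_attempts}


class FakeHost:
    """Simulated host whose Trojan resurrects a limited number of times."""

    def __init__(self, resurrections: int) -> None:
        self.file_present = True
        self.resurrections = resurrections

    def delete_trojan(self) -> None:
        self.file_present = False

    def trojan_gone(self) -> bool:
        # The malware may restore its file between deletion and the check.
        if self.resurrections > 0:
            self.resurrections -= 1
            self.file_present = True
        return not self.file_present


stubborn = FakeHost(resurrections=1)
print(respond_with_verification(stubborn.delete_trojan, stubborn.trojan_gone))

persistent = FakeHost(resurrections=5)
print(respond_with_verification(persistent.delete_trojan, persistent.trojan_gone))
```

Only a verified result changes the alarm status; anything else is handed to a person instead of being silently marked as processed.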

The above is our thinking on which processing logic should be solidified into a script. It may be incomplete owing to the limits of personal experience; please bear with me.

IV. Script scenarios

Seeing this, you may be wondering which scenarios SOAR can be used in. Don’t worry, we have also briefly summarized the existing scenarios:

After seeing the figure above, you may wonder why we separate network attacks from website attacks. In reality, website (Web) attacks account for a large share, and network attacks such as DDoS differ greatly from website attacks in attack form and defense method. To make clear which problems each script solves, we split them into two scenarios.

The “other” scenario category covers cases that are special or hard to place in any of the categories above.

V. Actual effect

As for actual running results, I don’t think long text descriptions help. Simple pictures plus comments work better!

The above is a practical application of the “Suspicious Behavior” category. SOAR resolves most of the security alarms in it and singles out only a few that need manual handling.

The following are scripts used by other teams in our department to solve problems in their daily security operations:

The figure above shows the execution of each action node of the script in the backend; green marks the path taken during the run (the web page had not been developed at that time, so it looks ugly =,=).

VI. Expansion

1. Combine with other types of work

After developing the basic framework of SOAR, we asked ourselves: what else can SOAR do besides handling security events / alarms? One day a teammate happened to ask me: can SOAR be combined with operations and maintenance work?

After thinking about it, it seemed feasible, so I drafted the following script prototype:

2. How to integrate detection guidance model

Several intrusion detection models are popular as guidance in the security world, such as the Kill Chain and MITRE ATT&CK. While writing SOAR scripts I tried to integrate the Kill Chain into the processing flow, and the practical feedback was very valuable because it raised real questions. For example, if the attacker uses several different IPs, how do you construct the chain in the script? And how do you map security events / alarms to the stages of the chain? Once you connect the stages, a large amount of data has to be correlated. In theory, the later stages of the chain should account for a smaller share of events (a later stage means a deeper attack), but in practice we found that the share of security events / alarms in later stages gradually increased.
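Mapping alarms to chain stages and measuring each stage’s share could be sketched as below. The stage names follow the Lockheed Martin Cyber Kill Chain; the alarm types and the `STAGE_OF` mapping are assumptions made for illustration, since how a given alarm maps to a stage is exactly the open question.

```python
from collections import Counter
from typing import Dict, List

KILL_CHAIN = [
    "Reconnaissance", "Weaponization", "Delivery", "Exploitation",
    "Installation", "Command and Control", "Actions on Objectives",
]

# Hypothetical mapping from alarm type to Kill Chain stage.
STAGE_OF = {
    "port_scan": "Reconnaissance",
    "sql_injection": "Exploitation",
    "webshell_upload": "Installation",
    "c2_beacon": "Command and Control",
}


def stage_share(alarms: List[Dict]) -> Dict[str, float]:
    """Share of mapped alarms per stage, listed in chain order."""
    counts = Counter(STAGE_OF[a["type"]] for a in alarms if a["type"] in STAGE_OF)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {stage: counts[stage] / total for stage in KILL_CHAIN if counts[stage]}


sample = [
    {"type": "port_scan"}, {"type": "port_scan"},
    {"type": "sql_injection"}, {"type": "webshell_upload"},
    {"type": "unknown_type"},  # unmapped alarms are ignored
]
print(stage_share(sample))
```

Running this on alarms before and after script processing is one way to compare the two chains.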

So the question is: should the chain be built from events / alarms before processing or after processing? We welcome discussion on the practical application of these two models!


Because the author mainly works on private clouds, this article focuses on developing SOAR scripts in the private cloud. I once wanted to cover the public cloud as well, but since I don’t know the public cloud scene well, I did not analyze it above. For the private cloud scenario, the article describes in detail the application scope of SOAR, the composition of the script, and the soul of the script: actions and processing logic. Finally, it briefly considers what more SOAR can do.
