The battles between attackers and defenders get more sophisticated every day. Both sides are locked in a constant back-and-forth game of stealth and visibility. Detection engineering plays a central role in these struggles – a role that enables organizations to see and stop threats before they escalate to full-blown incidents.
Detection systems are driven by data, and distilling that data into high-fidelity signals of malicious activity takes considerable work. Understanding detection engineering is critical to successful security monitoring and incident response.
In this blog post, we will explore the intricate world of detection engineering. We’ll start by examining the inputs and outputs of detection engineering, and then we’ll illustrate the detection engineering lifecycle by walking through a relevant case study where we create a detection for remote execution over DCOM.
The big picture
At its core, detection engineering is the process of creating code that filters a set of data to produce signals of malicious artifacts and activity. This definition already raises several questions… How do detection engineers know what to look for? What data is being analyzed? How are the malicious signals used?
To answer these questions, we need to look at where detection engineering sits in relation to other roles within cyber security. If we think about detecting and responding to threats as a step-by-step process, detection engineering sits somewhere in the middle.
Before detection engineering can happen, we need two things. We need to know what object or activity we are looking for, and we need a stream of data that will contain the evidence when the malicious action occurs.
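To make those two inputs concrete, here is a minimal Python sketch of a detection as a filter over telemetry. It is only an illustration: the field names, the sample records, and the `mmc_spawns_shell` rule are all invented for this example and do not come from any real product schema.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, str]


def run_detection(events: Iterable[Event], matches: Callable[[Event], bool]) -> List[Event]:
    """Filter a stream of telemetry down to the events the detection logic flags."""
    return [event for event in events if matches(event)]


# Hypothetical telemetry; field names and values are invented for illustration.
telemetry = [
    {"process": "excel.exe", "parent": "explorer.exe"},
    {"process": "cmd.exe", "parent": "mmc.exe"},
]


def mmc_spawns_shell(event: Event) -> bool:
    """Hypothetical behavior of interest, the kind of thing threat intelligence would supply."""
    return event["parent"] == "mmc.exe" and event["process"] == "cmd.exe"


signals = run_detection(telemetry, mmc_spawns_shell)
print(signals)
```

The predicate stands in for "what we are looking for", and the list of records stands in for the stream of data that should contain the evidence.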
Threat intelligence
Threat intelligence is the role that evaluates the current landscape of cyber threats, and produces reports that highlight which attacker behaviors are relevant for detection. These relevant behaviors can be prioritized into a kind of “wish list” of detections for detection engineers to create in order to fill in the visibility gaps of an organization. Detection engineers can’t do their jobs without some idea of what they should be looking for.
Security data engineering
The other thing we need before we can start creating detections is a stream of data or “telemetry” that will contain the threat we are looking for. Some people may consider the act of gathering telemetry to be part of detection engineering, but in my opinion, it really deserves to be its own discipline that I call “security data engineering”.
Security data engineering relies on deep, low-level knowledge of the systems involved in an attack. Specific knowledge domains might include things like network protocols, OS internals, ETW (Event Tracing for Windows), or even radio spectrum. Understanding these systems enables you to know which signals will be produced when an attacker performs a malicious action. You then need to create a stream of logs containing those signals to make those events available for detection engineering.
Now that we know what goes into the detection engineering process, let’s look at how detections are used downstream, which shows why they play such an important role.
Security monitoring
The primary goal of detection engineering is to discover an active threat and put a stop to it before it can do any damage. When a detection is triggered by live activity or a malware signature, it’s the role of security monitoring to investigate and determine whether the detection actually found something malicious.
Incident response
When security monitoring confirms a threat, the incident response role steps in to remediate it. Incident responders also rely on detections to carry out their tasks, as they look for both artifacts and additional event logs to paint the full picture of the security incident.
At this point, we’ve illustrated detection engineering’s role in the overall process of detecting and responding to threats. The actual work of a detection engineer still isn’t very clear, though. For the rest of this article, we’ll walk through a day in the life of a detection engineer to show what it looks like to create detections.
Zooming in
The process of creating detections can be broken down into some high-level steps that make up a detection engineering lifecycle. To show how this lifecycle works, we will be walking through a case study: creating a detection for DCOM remote code execution. We will be using KQL with Microsoft Defender XDR to query a stream of logs for our detection.
Research
The first phase of the detection engineering cycle is research. This is where you become intimately familiar with the technique you want to detect. Threat intelligence reports will help here, and so will technical research articles and actual testing with open-source tools that perform the technique.
We won’t go through a full research phase for DCOM remote execution in this blog post, but there are several great resources out there if you want to learn more about it before moving on. The earliest reference I could find was this blog post by Matt Nelson. In the article, Matt walks through how to remotely tell the MMC (Microsoft Management Console) DCOM server to spawn a process for you.
Hypothesis
After doing our research, we have a few basic pieces we can work with:
- There will have to be a network connection involved
- There will be a DCOM host process
- Most DCOM host processes are created by the DCOMLaunch Windows service
Based on these indicators, we develop a hypothesis that we can detect remote DCOM execution by looking for network connections to processes spawned by the DCOMLaunch service.
Since we are using Microsoft Defender in this example, we have access to network and process logs, which should be sufficient to find all the indicators we just went over. This is where we will start our detection.
Development
The development stage is where we actually write our detection code and iterate on our hypothesis as we create the query. We start out by getting all inbound network connections to processes spawned by svchost.exe.
DeviceNetworkEvents
| where ActionType == "InboundConnectionAccepted"
| where InitiatingProcessParentFileName == "svchost.exe"
| where RemoteIP !contains "127.0.0.1" // Exclude loopback traffic
...
Svchost is the process that runs the DCOMLaunch service, but we have a problem already. Svchost also runs many other Windows services, and DCOMLaunch is the only service we care about.
Svchost’s command line arguments will allow us to identify which process is for DCOMLaunch, but unfortunately, the DeviceNetworkEvents table doesn’t contain the full command line arguments we need to do that. To filter on just the DCOMLaunch service, we need to pull in data from the DeviceProcessEvents table to get the InitiatingProcessCommandLine field for our svchost processes.
...
// Get full svchost command line so we can filter on DcomLaunch
| join kind=innerunique DeviceProcessEvents on
DeviceName,
$left.InitiatingProcessId == $right.ProcessId,
$left.InitiatingProcessFileName == $right.FileName
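// The join adds a "1" suffix to right-side columns whose names collide, so this is the value from DeviceProcessEvents
| where InitiatingProcessCommandLine1 contains "svchost.exe -k DcomLaunch"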
...
Now our detection code does what we originally wanted. The query finds all network connections to processes spawned by the DCOMLaunch service.
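For readers who find joins easier to follow in a general-purpose language, the logic so far can be approximated in Python. This is a rough sketch under stated assumptions, not a replacement for the query: the records are hypothetical in-memory dictionaries that reuse the Defender column names, the sample values are invented, and the row deduplication that `innerunique` performs is not reproduced.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def dcom_host_connections(network_events: List[Record], process_events: List[Record]) -> List[Record]:
    """Return inbound connections whose receiving process was started by the DcomLaunch svchost."""
    # Index process-creation events by the same keys the KQL join uses.
    created: Dict[tuple, List[Record]] = {}
    for proc in process_events:
        key = (proc["DeviceName"], proc["ProcessId"], proc["FileName"])
        created.setdefault(key, []).append(proc)

    hits = []
    for conn in network_events:
        if conn["ActionType"] != "InboundConnectionAccepted":
            continue
        if conn["InitiatingProcessParentFileName"] != "svchost.exe":
            continue
        if "127.0.0.1" in conn["RemoteIP"]:  # skip loopback traffic
            continue
        key = (conn["DeviceName"], conn["InitiatingProcessId"], conn["InitiatingProcessFileName"])
        for proc in created.get(key, []):
            # The parent's command line is only available on the process event.
            parent_cmd = proc["InitiatingProcessCommandLine"]
            if "svchost.exe -k dcomlaunch" in parent_cmd.lower():
                hits.append({**conn, "ParentCommandLine": parent_cmd})
    return hits


# Hypothetical sample data for illustration only.
network_events = [
    {"DeviceName": "ws01", "ActionType": "InboundConnectionAccepted", "RemoteIP": "10.0.0.5",
     "InitiatingProcessId": 4120, "InitiatingProcessFileName": "mmc.exe",
     "InitiatingProcessParentFileName": "svchost.exe"},
    {"DeviceName": "ws01", "ActionType": "InboundConnectionAccepted", "RemoteIP": "10.0.0.7",
     "InitiatingProcessId": 2200, "InitiatingProcessFileName": "taskhostw.exe",
     "InitiatingProcessParentFileName": "svchost.exe"},
]
process_events = [
    {"DeviceName": "ws01", "ProcessId": 4120, "FileName": "mmc.exe",
     "InitiatingProcessCommandLine": "C:\\Windows\\system32\\svchost.exe -k DcomLaunch -p"},
    {"DeviceName": "ws01", "ProcessId": 2200, "FileName": "taskhostw.exe",
     "InitiatingProcessCommandLine": "C:\\Windows\\system32\\svchost.exe -k netsvcs -p -s Schedule"},
]

hits = dcom_host_connections(network_events, process_events)
print([h["InitiatingProcessFileName"] for h in hits])
```

Both sample connections pass the svchost.exe parent check, but only the one whose parent command line names DcomLaunch survives, which is exactly why the second table is needed.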
Testing
Next up in the detection engineering lifecycle is to test our detection. We want to make sure that our detection catches what we want, doesn’t have too many false positives, and isn’t performance-intensive or expensive to run.
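One way to make "catches what we want" and "not too many false positives" measurable is to compare the events a detection alerted on against the events we know were malicious test actions. The sketch below assumes each event has a unique identifier; the `evt-N` identifiers and the `score_detection` helper are hypothetical.

```python
from typing import Any, Dict, Iterable, Set


def score_detection(alerted: Iterable[str], malicious: Iterable[str]) -> Dict[str, Any]:
    """Compare alerted event IDs against the IDs of known-malicious test actions."""
    alerted_ids: Set[str] = set(alerted)
    malicious_ids: Set[str] = set(malicious)
    true_positives = alerted_ids & malicious_ids
    return {
        # Share of malicious test actions that the detection caught.
        "recall": len(true_positives) / len(malicious_ids) if malicious_ids else 0.0,
        # Share of alerts that were actually malicious.
        "precision": len(true_positives) / len(alerted_ids) if alerted_ids else 0.0,
        "false_positives": sorted(alerted_ids - malicious_ids),
        "missed": sorted(malicious_ids - alerted_ids),
    }


# Hypothetical test run: both malicious actions were caught, plus two benign events.
results = score_detection(
    alerted=["evt-1", "evt-2", "evt-3", "evt-4"],
    malicious=["evt-1", "evt-2"],
)
print(results)
```

A recall of 1.0 with a precision of 0.5 describes a detection that finds every test action but still produces as many benign alerts as real ones.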
During our testing of the remote DCOM execution detection, our query is able to find our malicious test actions. However, we find that we have too many false positives for legitimate tools that communicate over DCOM…
To address the false positives, we’ll return to our research findings to see if we can get more specific about what we are looking for. In the case of the DCOM execution from Matt’s blog post, the MMC DCOM host spawns a child process. Will the DCOM host always spawn a child process during malicious execution?
Unfortunately, no. As explained in this blog post from MDSec, Excel’s DCOM server allows remote DLL injection and script injection. Neither of these remote execution procedures involves spawning a child process.
Still, we need to narrow our original detection down somehow. We decide to temporarily accept the DLL and script injection gaps and add those to our backlog to address in separate detections. For now, we will focus on remote DCOM execution that involves spawning a malicious child process.
Now that we’re getting more specific with our detection, we need more information from the process table. This time, we’re looking for child processes spawned by the DCOM hosts that accepted an inbound network connection. This is the new part of our detection query that does that:
...
// Take only what we need for joins and final output -- prevents mismatched process fields after join
| project DeviceName, FileName, ProcessId, RemoteIP, RemotePort, LocalIP, LocalPort
// Get DCOM host child processes
| join kind=innerunique DeviceProcessEvents on
DeviceName,
$left.ProcessId == $right.InitiatingProcessId,
$left.FileName == $right.InitiatingProcessFileName
// Clean up old fields -- DCOM host is now parent process
| extend FileName = FileName1, ProcessId = ProcessId1
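// Note: *1 matches every column ending in "1", so SHA1 and InitiatingProcessSHA1 are dropped as well
| project-away *1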
| sort by Timestamp
...
By adding an additional requirement to our detection (spawning a child process), we’ve fixed all the false positives we had before, except for one. When the DCOM host crashes, Windows will spawn WerFault.exe. We filter this child process out from our query, and now we have no false positives in our current set of telemetry. This is what the final detection query looks like:
DeviceNetworkEvents
| where ActionType == "InboundConnectionAccepted"
| where InitiatingProcessParentFileName == "svchost.exe"
| where RemoteIP !contains "127.0.0.1" // Exclude loopback traffic
// Get full svchost command line so we can filter on DcomLaunch
| join kind=innerunique DeviceProcessEvents on
DeviceName,
$left.InitiatingProcessId == $right.ProcessId,
$left.InitiatingProcessFileName == $right.FileName
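// The join adds a "1" suffix to right-side columns whose names collide, so this is the value from DeviceProcessEvents
| where InitiatingProcessCommandLine1 contains "svchost.exe -k DcomLaunch"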
// Take only what we need for join and final output -- prevents mismatched process fields after join
| project DeviceName, FileName, ProcessId, RemoteIP, RemotePort, LocalIP, LocalPort
// Get DCOM host child processes
| join kind=innerunique DeviceProcessEvents on
DeviceName,
$left.ProcessId == $right.InitiatingProcessId,
$left.FileName == $right.InitiatingProcessFileName
// Clean up old fields -- DCOM host is now parent process
| extend FileName = FileName1, ProcessId = ProcessId1
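// Note: *1 matches every column ending in "1", so SHA1 and InitiatingProcessSHA1 are dropped as well
| project-away *1
// Child process exclusions -- filter before sorting so fewer rows are sorted
| where FileName != "WerFault.exe" // DCOM server crash
| sort by Timestamp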
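The second half of the query (finding children of the DCOM hosts and excluding the crash handler) can be mimicked in Python to sanity-check the logic against a handful of records. As before, this is a sketch under assumptions: the dictionaries reuse the Defender column names, the sample values are invented, and timestamps are ISO 8601 strings so that string ordering matches time ordering.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]

# Child processes we have decided to ignore (WerFault.exe appears when the DCOM server crashes).
EXCLUDED_CHILDREN = {"WerFault.exe"}


def suspicious_children(dcom_hosts: List[Record], process_events: List[Record]) -> List[Record]:
    """Return non-excluded processes whose parent is one of the DCOM hosts, newest first."""
    host_keys = {(h["DeviceName"], h["ProcessId"], h["FileName"]) for h in dcom_hosts}
    children = [
        p for p in process_events
        if (p["DeviceName"], p["InitiatingProcessId"], p["InitiatingProcessFileName"]) in host_keys
        and p["FileName"] not in EXCLUDED_CHILDREN
    ]
    # Newest first, like the default descending order of "sort by Timestamp".
    return sorted(children, key=lambda p: p["Timestamp"], reverse=True)


# Hypothetical sample data for illustration only.
dcom_hosts = [{"DeviceName": "ws01", "ProcessId": 4120, "FileName": "mmc.exe"}]
process_events = [
    {"Timestamp": "2024-01-01T10:00:05Z", "DeviceName": "ws01", "FileName": "cmd.exe",
     "InitiatingProcessId": 4120, "InitiatingProcessFileName": "mmc.exe"},
    {"Timestamp": "2024-01-01T10:00:09Z", "DeviceName": "ws01", "FileName": "WerFault.exe",
     "InitiatingProcessId": 4120, "InitiatingProcessFileName": "mmc.exe"},
    {"Timestamp": "2024-01-01T10:00:07Z", "DeviceName": "ws01", "FileName": "powershell.exe",
     "InitiatingProcessId": 4120, "InitiatingProcessFileName": "mmc.exe"},
    {"Timestamp": "2024-01-01T10:00:08Z", "DeviceName": "ws01", "FileName": "notepad.exe",
     "InitiatingProcessId": 3000, "InitiatingProcessFileName": "explorer.exe"},
]

children = suspicious_children(dcom_hosts, process_events)
print([c["FileName"] for c in children])
```

Keeping the exclusions in a named set makes it easy to review exactly which child processes the detection has been told to ignore.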
Deployment
Once the detection is working the way we want it to, we will deploy it. This will look different depending on which platform you use for detection. Ideally, the detections you write are version-controlled using Git, which enables you to utilize CI/CD processes.
By applying DevOps principles to your detections, you can do things like automated testing with frameworks like Atomic Red Team, and deployment-on-merge using the APIs of your detection platform.
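As one possible shape for automated testing, a CI job could run regression tests that replay recorded events through the detection logic. The sketch below uses Python's standard `unittest` module. The `detect` function is a simplified stand-in for the deployed query, and the `GrandparentCommandLine` field and sample events are hypothetical; it does not use any Atomic Red Team or detection platform API.

```python
import unittest
from typing import Dict, List

Event = Dict[str, str]


def detect(events: List[Event]) -> List[Event]:
    """Stand-in for the deployed detection: children of DCOM hosts, minus the crash handler."""
    return [
        e for e in events
        if "-k dcomlaunch" in e["GrandparentCommandLine"].lower()
        and e["FileName"] != "WerFault.exe"
    ]


class DetectionRegressionTest(unittest.TestCase):
    """Replays hypothetical recorded events and checks the detection's verdicts."""

    def test_fires_on_known_malicious_sample(self):
        sample = [{"FileName": "cmd.exe", "GrandparentCommandLine": "svchost.exe -k DcomLaunch -p"}]
        self.assertEqual(detect(sample), sample)

    def test_ignores_crash_handler(self):
        sample = [{"FileName": "WerFault.exe", "GrandparentCommandLine": "svchost.exe -k DcomLaunch -p"}]
        self.assertEqual(detect(sample), [])

    def test_ignores_unrelated_service(self):
        sample = [{"FileName": "cmd.exe", "GrandparentCommandLine": "svchost.exe -k netsvcs -p"}]
        self.assertEqual(detect(sample), [])
```

Saved as a test module, this can be run with `python -m unittest`, and a merge can be blocked whenever a change to the detection breaks one of the recorded cases.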
Revising
The last step in the detection engineering lifecycle is to continuously revise the detections you’ve deployed. Purple team and red team exercises may reveal gaps in your query logic that you missed during your testing. You also might read some new research that includes a new method of remote DCOM abuse that breaks your detection.
One more thing to consider when revising your detections is the false positive rate. Keeping a low false positive rate is important for those in the security monitoring role who only have a limited amount of time and mental bandwidth to triage all the alerts that your detections create. Produce too many false positives, and you could be causing alert fatigue or burying a dangerous activity beneath a pile of benign events.
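To keep an eye on this over time, the triage verdicts from security monitoring can be aggregated per detection. The sketch below assumes a simple log of (detection name, verdict) pairs; the detection names, the verdict strings, and the 0.5 threshold are hypothetical choices for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def false_positive_share(verdicts: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Map each detection name to the share of its triaged alerts marked 'false_positive'."""
    totals: Counter = Counter()
    false_positives: Counter = Counter()
    for detection, verdict in verdicts:
        totals[detection] += 1
        if verdict == "false_positive":
            false_positives[detection] += 1
    return {name: false_positives[name] / totals[name] for name in totals}


def needs_revision(shares: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """List detections whose false positive share exceeds the (hypothetical) threshold."""
    return sorted(name for name, share in shares.items() if share > threshold)


# Hypothetical triage log for illustration only.
triage_log = [
    ("dcom_remote_exec", "true_positive"),
    ("dcom_remote_exec", "false_positive"),
    ("dcom_remote_exec", "true_positive"),
    ("dcom_remote_exec", "true_positive"),
    ("noisy_rule", "false_positive"),
    ("noisy_rule", "false_positive"),
    ("noisy_rule", "true_positive"),
]

shares = false_positive_share(triage_log)
print(needs_revision(shares))
```

A report like this gives a starting list of detections to revisit before they contribute to alert fatigue.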
Conclusion
Hopefully this blog post has clarified what detection engineering is all about. We discussed the role that detection engineering plays among other roles in cyber defense, and highlighted the dependencies within that process.
The case study showed the kinds of decisions and challenges that someone might face in detection engineering. A key point was that the work of detection engineering is never done. Even within our one example, we identified gaps in our detection (DLL and script injection) that will need to be addressed by follow-on work.
Detection engineering is critical for organizations everywhere to do better in the ongoing battle against cyber threats. If this blog post interested you, read some incident reports and research how you could detect some of the threats that organizations face today.