Playbook of the Week: Uncovering Unknown Malware Using SSDeep
By Ido Van Dijk
Cryptographic hashes like MD5 are often used to identify malware, but malware evolves, and new iterations of the same malware are released constantly. Malware may be modified, or may modify itself, to avoid detection by traditional hashes. For example, in 2018, 12,000 variants of the WannaCry ransomware were detected in the wild. SSDeep has proven effective in identifying malware of the same family, as large portions of the code usually remain the same between versions.
In this post, we’ll explain how to improve the detection, enrichment and investigation of malware by leveraging the power of SSDeep, a fuzzy hash often used in digital forensics, research and incident response. Before talking about SSDeep, though, let’s talk about cryptographic hashes.
Cryptographic hashes vs. fuzzy hashes
Cryptographic hashes are used to identify data. In digital forensics, they serve as a means of identification of files. They are used to answer the question: “Is file X exactly the same as file Y?” A change in one bit of a file will cause a completely new cryptographic hash to be produced for that file. Some of the most commonly used cryptographic hashes are MD5, SHA-256 and SHA-512.
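To make the one-bit claim concrete, here is a minimal sketch using Python's standard `hashlib` module. The sample input and the `hex_digest` helper are our own illustrative choices, not part of any product mentioned in this post.

```python
import hashlib

# Illustrative input; any byte string behaves the same way.
original = b"The quick brown fox jumps over the lazy dog"
# Flip the lowest bit of the first byte to get a one-bit modification.
modified = bytes([original[0] ^ 0x01]) + original[1:]


def hex_digest(data: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of data under the named hashlib algorithm."""
    return hashlib.new(algorithm, data).hexdigest()


digest_a = hex_digest(original)
digest_b = hex_digest(modified)

# Count the hex positions at which the two digests happen to agree.
matching = sum(a == b for a, b in zip(digest_a, digest_b))

print(digest_a)
print(digest_b)
print(f"{matching} of {len(digest_a)} hex characters match")
```

The two digests share no visible structure, which is exactly what makes a cryptographic hash good for answering "same file or not?" and poor for answering "similar file or not?".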
What exactly is SSDeep?
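SSDeep is a fuzzy hash, and more specifically a type of Context-Triggered Piecewise Hash (CTPH). It is calculated differently from the previously mentioned hashes: a rolling hash over the file's content decides where each segment ends, and a hash function is then run on each of those segments. Because the segment boundaries depend on the local content rather than on fixed offsets, a small change only affects the segments around it. This produces a special hash that answers the question: “Is file X similar to file Y?” Small changes in a file will therefore only slightly alter the SSDeep hash it produces.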
Let’s look at an example. Here are two files that differ by only one byte:
Let’s look at the SHA-256 produced for those files:
As you can see, a completely different SHA-256 hash is produced for each file.
Let’s take a look at the SSDeep hashes of those files:
The SSDeep hashes are almost identical, differing by only one character.
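The behavior above can be reproduced with a toy context-triggered piecewise hash. This is a simplified sketch written for this post, not the real SSDeep algorithm: the window size, trigger value, window hash and piece hash are all assumptions chosen for readability, and the real tool also adapts its block size and emits two signatures per file.

```python
import hashlib
import os
import random

B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
WINDOW = 7    # bytes of context that decide where a piece ends (assumed)
TRIGGER = 64  # a piece boundary fires roughly once every 64 bytes (assumed)


def window_hash(window: bytes) -> int:
    """Hash only the last few bytes, so boundaries depend on local content."""
    value = 0
    for byte in window:
        value = (value * 31 + byte) & 0xFFFFFFFF
    return value


def piece_char(piece: bytes) -> str:
    """Reduce one piece to a single base64 character (6 bits)."""
    return B64[hashlib.sha256(piece).digest()[0] % 64]


def toy_ctph(data: bytes) -> str:
    """Emit one character per content-defined piece of data."""
    signature = []
    start = 0
    for i in range(len(data)):
        window = data[max(0, i - WINDOW + 1): i + 1]
        if window_hash(window) % TRIGGER == TRIGGER - 1:
            signature.append(piece_char(data[start: i + 1]))
            start = i + 1
    if start < len(data):
        signature.append(piece_char(data[start:]))  # trailing piece
    return "".join(signature)


# Two synthetic "files" that differ in exactly one byte.
rng = random.Random(0)
file_a = bytes(rng.getrandbits(8) for _ in range(4096))
changed = bytearray(file_a)
changed[2000] ^= 0xFF
file_b = bytes(changed)

sig_a = toy_ctph(file_a)
sig_b = toy_ctph(file_b)

# Measure how much of the two signatures is shared at the start and the end.
prefix = len(os.path.commonprefix([sig_a, sig_b]))
suffix = len(os.path.commonprefix([sig_a[::-1], sig_b[::-1]]))
shared = min(prefix + suffix, len(sig_a), len(sig_b))

print(sig_a)
print(sig_b)
print(f"{shared} of {len(sig_a)} signature characters are shared")
```

Only the pieces that touch the changed byte can produce different characters. Everything before the change, and everything after the boundaries resynchronize, hashes to the same characters in both signatures.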
Why use fuzzy hashes?
As we saw before, cryptographic hashes can help us uniquely identify malware. If we find a suspicious file, we can query our threat intelligence products to see whether they know about its MD5 hash, and we may get insightful results about the malware.
However, if the malware we are investigating is a fork with even the slightest modification, such as a changed string, then enriching it using its MD5 hash may not yield any results at all, unless that exact fork is known to the threat intelligence product we are using.
This is where SSDeep comes in.
SSDeep is a common implementation of CTPH. The “ssdeep” command-line tool is widely available on Unix-like systems (it typically has to be installed as a package), and it has a built-in matching mode that allows us to find files with similar SSDeep hashes. We may find out that we investigated a very similar piece of malware just the week before, which has a similar SSDeep hash, even though its MD5 hash is not known to any of our sources.
Leveraging SSDeep in XSOAR to uncover unknown malware
Combining SSDeep with Cortex XSOAR provides tremendous value due to the following capabilities:
Threat Intelligence Management (TIM) module
Fetch indicators from feeds.
Fetch indicators from incidents.
Get reputation based on multiple sources (sandbox, threat intel, manual).
Build relationships between entities.
Integrate with threat intelligence products for indicator enrichment.
Integrate with EDRs, firewalls, email security gateways and more - to provide a holistic remediation plan.
Automatically detect files with similar hashes.
Automatically hunt for files in endpoints.
Link similar incidents together.
Display all relevant information in a special custom layout.
Taking into account all of the tools we have, we created the following general flow in an XSOAR playbook:
1. We obtain an MD5 hash from an alert on one of our endpoints.
2. We retrieve the file from the endpoint, to obtain its SSDeep hash.
3. We generate a pool of SSDeep hashes that we know about. We use existing SSDeeps in our system (from various feeds and incidents we have), and we enrich other hash types using threat intelligence tools to obtain even more SSDeep hashes.
4. We use an automation we provide out of the box called SSDeepSimilarity, to find similarities between the file we started with and the SSDeep hashes from the pool.
5. We begin to pivot:
- We create relationships between our original hash and previously unknown hashes by uncovering the SSDeep similarity of the original SSDeep hash and the hashes from the hash pool.
- We link the current incident with another incident where the similar hash came from. This provides us with additional IOCs which may be related to the malware we are investigating.
- We search for the similar hashes we found on our endpoints, and we detect one of them on an endpoint.
6. Finally, we save all of our findings in the incident layout and allow for remediation of any malicious indicators found in our investigation.
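The pool-matching step in the flow above can be sketched in a few lines of Python. This is an illustration only and not the SSDeepSimilarity automation itself: the `find_similar` helper, the threshold of 80, the signatures and the source labels are all hypothetical, and `stand_in_similarity` uses `difflib` as a placeholder for a real SSDeep comparison so that the example runs with the standard library alone.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple


def stand_in_similarity(sig_a: str, sig_b: str) -> int:
    """Placeholder 0-100 score; NOT the real SSDeep comparison."""
    return round(SequenceMatcher(None, sig_a, sig_b).ratio() * 100)


def find_similar(
    target: str,
    pool: Dict[str, str],
    threshold: int = 80,  # assumed cut-off, tune for your environment
    compare: Callable[[str, str], int] = stand_in_similarity,
) -> List[Tuple[str, int]]:
    """Return (source, score) pairs scoring at or above threshold, best first."""
    scored = [(source, compare(target, sig)) for source, sig in pool.items()]
    matches = [pair for pair in scored if pair[1] >= threshold]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical signatures and source labels, invented for illustration.
target_signature = "kQbXr9mLw2TzPq8VnYc4HsJd"
hash_pool = {
    "incident-1041": "kQbXr9mLw2TzPq8VnYc4HsJe",  # last character differs
    "feed-entry-7": "ZZtopAAAbbbCCCdddEEEfffG",   # unrelated
    "incident-0977": "kQbXr9mLw2TzPq8VnYc4HsJd",  # identical
}

related = find_similar(target_signature, hash_pool)
for source, score in related:
    print(f"{source}: {score}")
```

Passing the comparison in as a parameter keeps the pivoting logic separate from the scoring, so the placeholder can be swapped for an actual SSDeep comparison without touching the rest of the flow.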