Chuck Leaver – Why Edit Distance Is Important Part One

Written By Jesse Sampson And Presented By Chuck Leaver CEO Ziften


Why do attackers keep using the same tricks over and over? The simple answer is that they continue to work. For example, Cisco’s 2017 Cyber Security Report tells us that after years of decline, spam email with malicious attachments is growing again. In that traditional attack vector, malware authors typically mask their activity by using a filename similar to that of a common system process.

There is not necessarily a connection between a file’s path name and its contents: anyone who has tried to hide sensitive information by giving it a boring name like “taxes”, or changed the extension on a file attachment to get around email rules, understands this principle. Malware authors understand it as well, and will often name malware to resemble common system processes. For example, “iexplore.exe” is Internet Explorer, but “iexplorer.exe” with an extra “r” could be anything. It is easy even for experts to overlook this slight difference.

The opposite problem, known .exe files running in unusual locations, is easy to solve using string functions and SQL set operations.
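As a rough illustration of that check outside the database, here is a minimal Python sketch. The allow-list `EXPECTED_DIRS` and the function `unexpected_locations` are hypothetical names invented for this example, and the directories listed are illustrative rather than a complete inventory of legitimate locations.

```python
import ntpath

# Hypothetical allow-list: executable name -> directories where we expect it.
# Illustrative only; a real list would come from your own baseline data.
EXPECTED_DIRS = {
    "svchost.exe": {r"c:\windows\system32", r"c:\windows\syswow64"},
    "explorer.exe": {r"c:\windows", r"c:\windows\syswow64"},
}


def unexpected_locations(paths):
    """Return paths whose file name is known but whose directory is not expected."""
    flagged = []
    for path in paths:
        # ntpath parses Windows-style paths regardless of the host OS.
        directory, name = ntpath.split(path.lower())
        expected = EXPECTED_DIRS.get(name)
        if expected is not None and directory not in expected:
            flagged.append(path)
    return flagged


observed = [
    r"C:\Windows\System32\svchost.exe",
    r"C:\Users\bob\AppData\Roaming\svchost.exe",
    r"C:\Tools\unrelated.exe",
]
print(unexpected_locations(observed))
```

Names that are not in the allow-list are ignored here; catching near-miss names is a separate problem.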


What about the other case, finding close matches to the executable name? Most people begin their hunt for near string matches by sorting data and visually looking for discrepancies. This usually works well for a small set of data, perhaps even a single system. Finding these patterns at scale, however, requires an algorithmic approach. One established technique for “fuzzy matching” is to use Edit Distance.
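For readers who have not met the term, edit distance is commonly taken to mean Levenshtein distance: the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into another. The sketch below is a plain Python version of the standard dynamic-programming algorithm, shown only to make the idea concrete; it is not the Vertica function discussed in this post.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insert, delete, substitute)."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("iexplore.exe", "iexplorer.exe"))  # one inserted "r"
```

A single stray character gives a distance of 1, which is the kind of near miss that is hard to spot by eye.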

What is the best way to calculate edit distance? For Ziften, our technology stack includes HP Vertica, which makes this task easy. The internet is full of data scientists and data engineers singing Vertica’s praises, so it will suffice to mention that Vertica makes it easy to build custom functions that take full advantage of its power, from C++ power tools to statistical modeling scalpels in R and Java.

This Git repo is maintained by Vertica enthusiasts working in industry. It is not an officially supported offering, but the Vertica team is certainly aware of it, and is thinking every day about how to make Vertica more useful for data scientists, so it is a good space to watch. Best of all, it includes a function to calculate edit distance! There are also other tools for natural language processing here, such as word stemmers and tokenizers.

Using edit distance on the top executable paths, we can quickly find the closest match to each of our top hits. This is an interesting data set because we can sort by distance to find the closest matches across the entire dataset, or we can sort by frequency of the top path to see what is the closest match to our most commonly used processes. This data can also surface on contextual “report card” pages, to show, for example, the top five closest strings for a given path. Below is a toy example to give a sense of usage, based on real data ZiftenLabs observed in a customer environment.


Setting a threshold of 0.2 seems to give good results in our experience, but the point is that these values can be adjusted to fit specific use cases. Did we find any malware? We notice that “teamviewer_.exe” (should be just “teamviewer.exe”), “iexplorer.exe” (should be “iexplore.exe”), and “cvshost.exe” (should be “svchost.exe”, unless perhaps you work for CVS pharmacy…) all look odd. Since we are already in our database, it is also trivial to pull the associated MD5 hashes, Ziften suspicion scores, and other attributes for a deeper dive.
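To make the threshold idea concrete, here is a minimal Python sketch. It assumes the 0.2 cut-off applies to edit distance divided by the length of the longer name; the post does not say how the score is normalized, so treat that as an assumption. The helper names `normalized_distance` and `near_misses` and the short list of known names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def normalized_distance(a, b):
    """Assumed normalization: distance divided by the longer string's length."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def near_misses(observed, known, threshold=0.2):
    """Pair each unknown name with its closest known name, if close enough."""
    hits = []
    for name in observed:
        if name in known:
            continue  # exact matches are not near misses
        best = min(known, key=lambda k: normalized_distance(name, k))
        score = normalized_distance(name, best)
        if score <= threshold:
            hits.append((name, best, round(score, 3)))
    return sorted(hits, key=lambda hit: hit[2])


known = ["teamviewer.exe", "iexplore.exe", "svchost.exe"]
observed = ["teamviewer_.exe", "iexplorer.exe", "cvshost.exe",
            "notepad.exe", "svchost.exe"]
for name, closest, score in near_misses(observed, known):
    print(name, "->", closest, score)
```

Under this assumed normalization all three odd-looking names fall below 0.2, while an unrelated name such as "notepad.exe" does not. Raising the threshold surfaces more candidates at the cost of more noise.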


In this particular real-life environment, it turned out that teamviewer_.exe and iexplorer.exe were portable applications, not known malware. We helped the customer investigate further the user and system where we observed the portable applications, since use of portable apps on a USB drive could be evidence of suspicious activity. The more disturbing find was cvshost.exe. Ziften’s intelligence feeds indicate that this is a suspicious file. Looking up the MD5 hash for this file on VirusTotal confirms the Ziften data, indicating that this is a potentially serious Trojan infection that may be part of a botnet or doing something even more malicious. Once the malware was found, however, it was easy to fix the problem and make sure it stays fixed, using Ziften’s ability to kill and persistently block processes by MD5 hash.

Even as we develop advanced predictive analytics to detect malicious patterns, it is important that we keep improving our ability to hunt for known patterns and old tricks. Just because new threats emerge does not mean the old ones go away!

If you enjoyed this post, watch this space for part 2 of this series, where we will apply this method to hostnames to detect malware droppers and other malicious sites.

