Chuck Leaver – Why Edit Distance Is Important Part Two

Written By Jesse Sampson And Presented By Chuck Leaver CEO Ziften


In the first post on edit distance, we looked at hunting for malicious executables with edit distance (i.e., the number of character edits it takes to make two text strings match). Now let’s look at how we can use edit distance to hunt for malicious domains, and how we can build edit distance features that can be combined with other domain features to identify suspicious activity.
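As a refresher, here is a minimal Python sketch of edit distance. It assumes the plain Levenshtein definition (insertions, deletions, and substitutions each cost 1); the function name `edit_distance` is chosen for illustration and is not the Vertica function discussed later.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
```

Note that under this definition a swap of two adjacent characters counts as two edits, not one.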

Here is the Background

What exactly are bad actors doing with malicious domains? It could be as simple as using a close misspelling of a popular domain to trick careless users into viewing ads or picking up adware. Legitimate websites are gradually catching on to this technique, sometimes called typo-squatting.

Other malicious domains are the product of domain generation algorithms, which can be used to do all sorts of nefarious things, such as evading countermeasures that block known compromised sites, or overwhelming domain name servers in a distributed denial-of-service (DoS) attack. Older variants use randomly generated strings, while more advanced ones add tricks like injecting common words, further confusing defenders.

Edit distance can help with both use cases; here is how. First, we exclude common domain names, because these are generally safe, and a list of common domains also provides a baseline for spotting anomalies. One good source is Quantcast. For this discussion, we will stick to domain names and avoid sub-domains (e.g., not

After data cleaning, we compare each candidate domain name (input data observed in the wild by Ziften) to its potential neighbors in the same top-level domain (the last part of a domain name, such as .org, though now it can be almost anything). The basic task is to find the nearest neighbor in terms of edit distance. By finding domains that are one step away from their nearest neighbor, we can easily spot typo-ed domain names. By finding domains far from their neighbor (the normalized edit distance we introduced in the first post is useful here), we can also find anomalous domain names in edit distance space.
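The nearest-neighbor search described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the production query: the domain names are made up, the helper names (`split_domain`, `nearest_neighbor`) are hypothetical, the top-level domain is taken to be everything after the last dot, and the normalized distance is assumed to be the edit distance divided by the length of the longer name.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def split_domain(domain: str):
    """Split 'name.tld' at the last dot (an assumption: multi-part
    suffixes such as 'co.uk' are not handled)."""
    name, _, tld = domain.rpartition(".")
    return name, tld


def nearest_neighbor(candidate: str, popular: list):
    """Return the closest popular domain in the same TLD, or None.

    `popular` is assumed to be ordered by popularity, so the list
    position doubles as the neighbor's rank."""
    name, tld = split_domain(candidate)
    best = None
    for rank, other in enumerate(popular, start=1):
        other_name, other_tld = split_domain(other)
        if other_tld != tld:
            continue  # only compare within the same top-level domain
        d = edit_distance(name, other_name)
        if best is None or d < best["distance"]:
            best = {
                "neighbor": other,
                "rank": rank,
                "distance": d,
                # assumed normalization: divide by the longer name's length
                "normalized": d / max(len(name), len(other_name)),
            }
    return best


# Hypothetical popularity-ranked list and observed candidates.
popular = ["example.com", "example.org", "wikipedia.org"]
observed = ["wikipeda.org", "xkqzjvbw.org", "example.org"]

# Drop candidates that are themselves popular, then score the rest.
for domain in (d for d in observed if d not in popular):
    print(domain, nearest_neighbor(domain, popular))
```

A distance of 1 flags a likely typo of the neighbor, while a normalized distance close to 1 flags a name that resembles nothing on the popular list.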

Exactly what were the Results?

Let’s take a look at how these results appear in practice. Use caution when navigating to these domain names, since they may contain malicious content!

Here are a few potential typos. Typo-squatters target well-known domains because there is a better chance that someone will visit them. Several of these are suspect according to our threat feed partners, but there are some false positives too, with cute names like “wikipedal”.


Here are some odd-looking domain names far from their neighbors.


So now we have created two useful edit distance metrics for hunting. Not only that, we have three features to potentially add to a machine learning model: rank of nearest neighbor, distance from neighbor, and edit distance 1 from neighbor, indicating a risk of typo-squatting. Other features that could work well with these include lexical features such as word and n-gram distributions, entropy, and string length, as well as network features such as the number of failed DNS requests.
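To make the lexical features a little more concrete, here is a small Python sketch of three of them: string length, character entropy, and character bigram counts. The function names and the choice of Shannon entropy over characters are assumptions for illustration; a real model might use different definitions.

```python
import math
from collections import Counter


def shannon_entropy(s: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not s:
        return 0.0
    n = len(s)
    counts = Counter(s)
    # Sum p * log2(1/p) over each distinct character.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def char_ngrams(s: str, n: int = 2) -> list:
    """All overlapping character n-grams of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def lexical_features(name: str) -> dict:
    """Bundle a few lexical features for one domain name (without the TLD)."""
    return {
        "length": len(name),
        "entropy": shannon_entropy(name),
        "bigrams": Counter(char_ngrams(name, 2)),
    }


print(lexical_features("wikipedia"))
```

Randomly generated names tend to have higher entropy and rarer bigrams than names built from dictionary words, which is why these features pair well with the edit distance features.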

Streamlined Code that you can Play Around with

Here is a streamlined version of the code to play with! It was developed on HP Vertica, but this SQL should run on most modern databases. Note that the Vertica editDistance function may have a different name in other implementations (e.g., levenshtein in Postgres or UTL_MATCH.EDIT_DISTANCE in Oracle).



