The following is a low-to-moderate technical overview of how CASM® (Continuous Attack Surface Management) performs asset discovery using machine learning.
“You can’t manage what you don’t know.” It’s a phrase I heard many times early on when I joined the team at SynerComm in 2016, and it succinctly captures the idea that you cannot secure systems unless you’re aware of them. It’s not a deep insight, but it is an important reminder that the cybersecurity journey really starts with an awareness of the systems you’re responsible for.
Before we dive into that, my name is Bill Kiley. I am the software architect of CASM®, and in this series I will be sharing a developer’s perspective on a business’s attack surface, along with an under-the-hood look at how CASM® does attack surface monitoring.
Let’s bring that phrase into the attack surface monitoring space. If the attack surface is the set of public-facing, attackable systems owned or managed by a company, then “You can’t manage what you don’t know” suggests that we first need to figure out which assets those are. To do that, CASM® does what we call asset discovery. The core idea is to automate the discovery of as many company assets as possible, because maintaining an accurate inventory is hard. It’s especially hard when that inventory spans multiple data centers and multiple clouds, and harder still when your company acquired another company last quarter.
When we discover assets, we broadly categorize them as networks (IP address space), domain names, or web applications. For today, let’s take a narrower look at how CASM® discovers networks. Most businesses with offices, data centers, facilities, or retail locations have some manner of static IP space allocated by their internet service provider (ISP). For business connections, the ISP typically registers a business name such as “EXAMPLE CO” against the IPs it assigns to you. If you are in North America, that registration information is managed by ARIN, the American Registry for Internet Numbers; if you are in Europe, the registry is RIPE. Overall, there are five regional internet registries that manage IP addresses across the globe, and their registration data is known as WHOIS, a term which dates back to ARPANET.
Let’s pause and recap:
If your business pays for internet service, then there is a high chance that your business name is associated with public IP addresses, saved at a regional internet registry such as ARIN.
With that in mind, how can we automate CASM® to perform asset discovery with this information? Simple: we look up names. But wait – as anyone with a keyboard can attest: typos happen. As it turns out, typos do appear within WHOIS data. Not only typos, but also inconsistencies and legacy oddities tend to emerge for any business that has been around long enough.
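To make “we look up names” concrete, here is a minimal sketch of what a registry lookup by organization name could look like, using an RDAP entity search. The endpoint URL, the wildcard pattern, and the response handling below are assumptions chosen for illustration; this is not a description of how CASM® actually gathers or processes its registry data.

```python
# Hypothetical sketch of a registry lookup by organization name via RDAP.
# The endpoint, wildcard support, and field handling are assumptions for
# illustration; this is not CASM's actual discovery pipeline.
import requests

RDAP_ENTITY_SEARCH = "https://rdap.arin.net/registry/entities"  # assumed endpoint

def search_registered_names(pattern: str) -> list[str]:
    """Return the formatted name ('fn') of each entity matching the pattern."""
    resp = requests.get(RDAP_ENTITY_SEARCH, params={"fn": pattern}, timeout=30)
    resp.raise_for_status()
    names = []
    for entity in resp.json().get("entitySearchResults", []):
        # vCard properties look like ["fn", {}, "text", "EXAMPLE CO"]
        for prop in entity.get("vcardArray", ["vcard", []])[1]:
            if prop[0] == "fn":
                names.append(prop[3])
    return names

# e.g. search_registered_names("SYNERCOMM*")
```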
If you expect to look up “SYNERCOMM”, then be prepared to occasionally see things like “SYNERCOM”, “SYNER COMM”, and “SYNERCOMM INC”. That’s a short, one-word company name, and so the variations tend to be small. With longer, more complicated names the variations can become stranger. Lastly, on the subject of name variants, bear in mind that companies acquire one another, companies change names, and over time this leads to some interesting outcomes that all need to be handled by our discovery platform.
With our simple goal in mind, and the known oddities in the data accounted for, it’s time to talk implementation. There are many ways to search text. For CASM®, our trade-off analysis led us to select machine learning. This approach lets us quickly find new assets and, by feeding back confirmation when the results are good, train our model to make ever-better suggestions. There’s a lot to talk about on this subject. For this article’s purposes, let’s focus on just one component of machine learning: tokenization.
The idea behind tokenization with textual data is to split a text string into smaller pieces for the purpose of comparison. In our case, comparison means searching a large number of company names for fuzzy matches. It’s what allows us to search using “SYNERCOMM” as our query and find “SYNER COMM INC” as a result, even though it is an imperfect match.
Tokenization has a lot of academic research behind it, and from that a lot of excellent ideas and projects have emerged. The core concept is that by splitting text such as “SYNER COMM INC” into smaller parts, known as tokens, we can then search a large data set for token matches in novel, efficient ways that produce more useful matches than traditional string searching.
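As a toy illustration of why this helps (this is not CASM®’s production matcher, and sliding character trigrams are just one common tokenization choice), consider scoring candidate names by the overlap of their token sets:

```python
# Toy illustration of token-based fuzzy matching: compare names by the
# Jaccard similarity of their character-trigram sets.
import re

def trigrams(text: str) -> set[str]:
    """Uppercase, strip non-alphanumerics, then take sliding 3-character tokens."""
    s = re.sub(r"[^A-Z0-9]", "", text.upper())
    return {s[i:i + 3] for i in range(len(s) - 2)} if len(s) >= 3 else ({s} if s else set())

def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

query = trigrams("SYNERCOMM")
for name in ["SYNER COMM INC", "SYNERCOM", "ACME WIDGETS LLC"]:
    print(f"{name:>20}  {jaccard(query, trigrams(name)):.2f}")
# "SYNER COMM INC" and the typo "SYNERCOM" both score well above 0.5,
# while the unrelated name scores 0.00, even though none of the three
# is an exact string match for the query.
```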
As it turns out, searching via tokens can help CASM® be better at matching company names and, consequently, better at discovering assets belonging to a company. To prove it, we explored the performance of various tokenizers. Which one will be the best for matching company names? Let’s test and find out.
In setting up our testing methodology, let’s first discuss some basic options, which include the following (a short code sketch of each appears after the list):
- Splitting text into 2-character tokens. Example:
- “SYNERCOMM” -> “SY”, “NE”, “RC”, “OM”, “M”
- Splitting text into 3-character tokens. Example:
- “SYNERCOMM” -> “SYN”, “ERC”, “OMM”
- Splitting text into words by whitespace. Example:
- “SYNERCOMM INC” -> “SYNERCOMM”, “INC”
- Splitting text by non-word characters. Example:
- “SYNERCOMM L.L.C.” -> “SYNERCOMM”, “L”, “L”, “C”
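To make those options concrete, here is a small sketch of each split in Python. The function names are mine, chosen for illustration; they are not CASM®’s internal APIs.

```python
# Minimal sketches of the four tokenizer options above (illustrative only).
import re

def chunk_tokens(text: str, size: int) -> list[str]:
    """Split into fixed-size, non-overlapping character chunks."""
    s = text.upper()
    return [s[i:i + size] for i in range(0, len(s), size)]

def whitespace_tokens(text: str) -> list[str]:
    """Split on runs of whitespace."""
    return text.upper().split()

def non_word_tokens(text: str) -> list[str]:
    """Split on any non-word character (anything besides letters, digits, underscore)."""
    return [t for t in re.split(r"\W+", text.upper()) if t]

print(chunk_tokens("SYNERCOMM", 2))         # ['SY', 'NE', 'RC', 'OM', 'M']
print(chunk_tokens("SYNERCOMM", 3))         # ['SYN', 'ERC', 'OMM']
print(whitespace_tokens("SYNERCOMM INC"))   # ['SYNERCOMM', 'INC']
print(non_word_tokens("SYNERCOMM L.L.C."))  # ['SYNERCOMM', 'L', 'L', 'C']
```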
As you might’ve guessed, there will be cases where a tokenizer performs well with some company names and poorly with others. Additionally, more complex tokenizer options include natural language processing (NLP) steps, such as stemming and stop word handling, that normalize the text in an effort to produce higher quality tokens. Such steps are highly dependent upon the data set. Company names are an odd mix of proper nouns, common words, corporate suffixes, portmanteaus, and more. Simply put, company names are very different from long-form text documents such as blog posts and require a tokenizer that is suited to the task.
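As a hedged example of what such normalization might look like, the sketch below strips a handful of corporate suffixes before splitting. The suffix list is a tiny, made-up sample, and the function is mine for illustration; a production list would be far larger and tuned to the data set.

```python
# Illustrative normalization step: drop common corporate suffixes (stop words
# for company names) before tokenizing. The suffix list is a made-up sample.
import re

CORPORATE_SUFFIXES = {"INC", "LLC", "LTD", "CORP", "CO", "COMPANY"}

def normalize_company_name(name: str) -> list[str]:
    """Uppercase, split on non-word characters, and drop corporate suffixes."""
    words = [w for w in re.split(r"\W+", name.upper()) if w]
    return [w for w in words if w not in CORPORATE_SUFFIXES]

print(normalize_company_name("SynerComm Inc."))     # ['SYNERCOMM']
print(normalize_company_name("Example Co., Ltd."))  # ['EXAMPLE']
# Note: dropping short suffixes like "CO" can also remove meaningful words in
# some names, which is one reason these rules are so data-set dependent.
```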
Some results were expected: whitespace splitting produces high quality tokens such as “SYNERCOMM” which, in turn, produce high quality matches for us. And yet it suffers from legal entity suffixes such as “INC” and “LLC”, which appear in the names of countless unrelated companies. Others, such as 2-character tokens, produced low quality results in most cases, because such small tokens are commonly shared by dissimilar company names.
In the end, we found our favorite tokenizers. From that we were able to build out a powerful discovery engine that helps companies find their assets across the world, from corporate offices to warehouses, retail locations, cloud providers, and data centers. All of this thanks to machine learning.