Regular expressions, or regex, are supported in nearly all scripting languages as well as general-purpose programming languages. Regular expressions are critical to cyber security experts because they are used for searching and can save you a lot of time. Think of a regular expression as an advanced form of Control+F. Not only can you use regex to find things, but you can also add, remove, isolate, and manipulate all kinds of text and data. It’s useful in many applications, such as web crawlers, data scraping, data wrangling, and machine learning, to name a few. In Python’s re module, for example, you can use the match function to check whether a pattern matches at the beginning of a given string, the search function to look for the first occurrence of a pattern anywhere in the string, the findall function to return all non-overlapping matches, the sub (substitute) function to replace whatever you find, and so on.
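To make the difference between these functions concrete, here is a minimal sketch using Python's re module. The log line and the variable names are made up for illustration.

```python
import re

# Hypothetical log line, invented for this example.
log_line = "user=alice failed login; user=bob failed login"

# re.match only succeeds if the pattern matches at the start of the string.
at_start = re.match(r"user=\w+", log_line)

# re.search scans the string and returns the first match found anywhere.
first_failure = re.search(r"failed", log_line)

# re.findall returns every non-overlapping match; with one group, it returns the group text.
users = re.findall(r"user=(\w+)", log_line)

# re.sub replaces every match with the replacement text.
redacted = re.sub(r"user=\w+", "user=<redacted>", log_line)

print(at_start.group())       # user=alice
print(first_failure.start())  # 11
print(users)                  # ['alice', 'bob']
print(redacted)
```

Note that match and search return a match object (or None when nothing matches), while findall returns a plain list and sub returns a new string.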
Regular expressions are commonly used in cyber security because analysts deal with logs and other large data files. You need to know how to write a script that searches for common patterns and locates specific information, for example a list of email addresses that end with @suspicious.com, or IP addresses that have specific numbers in the middle. You can write a regex script in almost any language, such as Python. A script like this lets you extract information from servers or web pages without someone having to go there and pull the information manually.
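As a rough illustration of those two searches, the sketch below runs them over a made-up log. The log contents, the choice of 42 as the "specific number", and the assumption that "in the middle" means the third octet are all hypothetical.

```python
import re

# Hypothetical log text, invented for this example.
log_text = (
    "login from 10.0.42.7 by bob@suspicious.com\n"
    "login from 192.168.1.5 by alice@example.com\n"
    "login from 172.16.42.9 by eve@suspicious.com\n"
)

# Email addresses at suspicious.com; the dot is escaped so it only matches a literal ".".
email_pattern = r"[\w.+-]+@suspicious\.com\b"
suspicious_emails = re.findall(email_pattern, log_text)

# IPv4-looking addresses whose third octet is 42 (an assumed reading of "in the middle").
ip_pattern = r"\b\d{1,3}\.\d{1,3}\.42\.\d{1,3}\b"
matching_ips = re.findall(ip_pattern, log_text)

print(suspicious_emails)  # ['bob@suspicious.com', 'eve@suspicious.com']
print(matching_ips)       # ['10.0.42.7', '172.16.42.9']
```

The IP pattern only checks the shape of the address, not that each octet is in the range 0 to 255, which is usually good enough for a first pass over logs.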
Don’t worry too much about remembering all the syntax, since you’ll have the internet as a cheat sheet. The syntax is what makes your search more precise. For example, the ^ character matches the beginning of a string (or of each line in multiline mode), the $ character matches the end, and the + sign matches one or more repetitions of whatever comes immediately before it.
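Here is a small sketch of those three symbols in action. The sample strings are invented for illustration.

```python
import re

# Hypothetical sample lines.
lines = ["ERROR disk full", "disk ERROR", "warning: disk at 90%"]

# ^ anchors the pattern to the start of the string.
starts_with_error = [s for s in lines if re.search(r"^ERROR", s)]

# $ anchors the pattern to the end of the string.
ends_with_error = [s for s in lines if re.search(r"ERROR$", s)]

# + means "one or more of the preceding item", so \d+ matches a whole run of digits.
digit_runs = re.findall(r"\d+", "port 8080 retried 3 times")

print(starts_with_error)  # ['ERROR disk full']
print(ends_with_error)    # ['disk ERROR']
print(digit_runs)         # ['8080', '3']
```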
Ultimately, regex helps you automate tasks like pulling out the information you are looking for. This is critical because data is always changing. For example, you can write a script that runs every morning and looks for new images by finding every file name that ends with .jpg. If new images have been created, or existing images have been modified (which you can detect from a change in file size), you can pull them for further analysis.
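One way to sketch that morning check is shown below. It assumes you already have two snapshots mapping file names to sizes in bytes; the function name new_or_changed_jpgs and the sample data are hypothetical, and a real script would have to collect the snapshots from the file system itself.

```python
import re

# Matches names ending in .jpg; IGNORECASE also accepts .JPG (an assumption, not from the text).
JPG_PATTERN = re.compile(r"\.jpg$", re.IGNORECASE)

def new_or_changed_jpgs(previous, current):
    """Return sorted names of .jpg files that are new or whose size changed.

    Both arguments map file name -> size in bytes (hypothetical snapshots).
    """
    flagged = []
    for name, size in current.items():
        if not JPG_PATTERN.search(name):
            continue  # not a .jpg file
        if previous.get(name) != size:
            flagged.append(name)  # new file, or size differs from the last snapshot
    return sorted(flagged)

# Hypothetical snapshots from two consecutive mornings.
yesterday = {"cat.jpg": 1200, "notes.txt": 300, "dog.jpg": 800}
today = {"cat.jpg": 1200, "notes.txt": 350, "dog.jpg": 950, "bird.jpg": 400}

to_review = new_or_changed_jpgs(yesterday, today)
print(to_review)  # ['bird.jpg', 'dog.jpg']
```

Comparing sizes is only a cheap signal. A file can be modified without changing size, so a stricter check might compare hashes instead.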
Let’s do a very simple review. Here’s the skeletal structure of a regex function call in Python.
re.findall(<pattern>, string)
Using this function, let’s say I want to find how many times a movie theatre will be playing The Lion King on a particular day (let’s say Tuesday), and we can pull this from a string called “tuesdaymovielist”.
import re
tuesdaymovielist = "lion king at 1030, lilo and stitch at 1230, lord of the rings at 1430, lion king again at 1630"
The findall function returns all non-overlapping matches of the pattern in the string as a list of strings, in the order they are found. Inside the function call we give the pattern we want to look for, followed by a comma and the string. The string is the text we want to search.
The call below returns ‘lion king’ twice, which you can see by wrapping the whole call in print(). This is how we know they’re playing The Lion King twice on Tuesday. You can make the pattern more complicated by adding more syntax to capture specific show times, and you can get even fancier by pulling all movies that have the letter “a” in the title, and so on.
re.findall('lion king', tuesdaymovielist)
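As a sketch of those fancier searches, the example below pulls the show times for lion king and lists the titles containing the letter "a". The patterns assume every entry has the form "title at HHMM", optionally with the word "again" before "at", as in the sample string.

```python
import re

tuesdaymovielist = "lion king at 1030, lilo and stitch at 1230, lord of the rings at 1430, lion king again at 1630"

# Capture the four-digit time after each "lion king"; "(?: again)?" allows the optional word.
lion_king_times = re.findall(r"lion king(?: again)? at (\d{4})", tuesdaymovielist)

# Split into entries, then strip the trailing " at HHMM" (and optional " again") to get titles.
entries = tuesdaymovielist.split(", ")
titles = [re.sub(r"(?: again)? at \d{4}$", "", entry) for entry in entries]

# Keep only the titles that contain the letter "a".
titles_with_a = [title for title in titles if re.search(r"a", title)]

print(lion_king_times)  # ['1030', '1630']
print(titles_with_a)    # ['lilo and stitch']
```

Stripping " at HHMM" first matters here. Searching the raw entries for "a" would match every one of them, because the word "at" itself contains the letter.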