Lec4 Data Analysis
Data Analysis Techniques
SIT Internal
Lecture 3 Review
Raw logs → Filtering & Normalization → Correlation
Correlation Patterns
• Micro-Level
• Macro-Level
Lecture 4 Contents
Data analysis techniques
• Linux commands for log analysis
• Regular expressions
• Statistical analysis for data exploration
Linux commands for log analysis
• grep, awk: data filtering
• sed: parsing utility like awk; good at search and text replacements to format the log output
• sort: data summary
• head, tail
• …
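The roles of these commands can be sketched in Python. This is only an illustration under stated assumptions: the log lines and the helper names `grep` and `head` are invented for this example and are not the real command-line tools.

```python
import re

# Hypothetical log lines, invented for illustration only.
LOG_LINES = [
    "10.0.3.1 GET /index.html 200",
    "10.0.3.9 GET /admin 403",
    "10.0.3.1 POST /login 200",
    "10.0.3.7 GET /admin 403",
]

def grep(pattern, lines):
    """Keep only the lines containing a match for pattern (the role of grep)."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]

def head(lines, n=10):
    """Return the first n lines (the role of head)."""
    return lines[:n]

# Roughly: grep ' 403$' access.log | sort
denied = sorted(grep(r" 403$", LOG_LINES))

# Roughly: ... | head -n 1
first_denied = head(denied, 1)

# Roughly: ... | sed 's/ /,/g'  (search and replace to reformat the output)
as_csv = [re.sub(r" ", ",", line) for line in denied]
```

Chaining small filters this way mirrors a shell pipeline: each step takes a list of lines and returns a smaller or reformatted list.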
Operations on Data
• Target operations
• awk: a Linux/Unix text-processing tool
• http://www.gnu.org/software/gawk/manual/gawk.html
• We focus on awk's print command, which helps us piece together what a malicious attacker has done
• Use $n to reference a specific field
• e.g., awk '{print $1}' prints the first field of each line, here the client IP address
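As a rough Python analogue of awk's field references, the sketch below splits each line on whitespace and picks the n-th field. The log lines and the helper name `field` are hypothetical, chosen only so that the first field is a client IP address.

```python
from collections import Counter

# Hypothetical access-log lines; the first field is the client IP address.
LOG_LINES = [
    '192.0.2.10 - - "GET /index.html" 200',
    '192.0.2.44 - - "GET /admin" 403',
    '192.0.2.10 - - "POST /login" 200',
]

def field(line, n):
    """Return the n-th whitespace-separated field (1-based), like awk's $n.

    A field number past the end of the line yields an empty string.
    """
    parts = line.split()
    return parts[n - 1] if 1 <= n <= len(parts) else ""

# Equivalent in spirit to: awk '{print $1}'
client_ips = [field(line, 1) for line in LOG_LINES]

# Counting requests per client is a common next step in an investigation.
requests_per_ip = Counter(client_ips)
```

Here `requests_per_ip` shows that 192.0.2.10 appears twice, the kind of summary that helps single out a busy client.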
Regular Expressions
Quantifier   Legend                 Example          Sample Match
+            One or more            Version \w-\w+   Version A-b1_1
{3}          Exactly three times    \D{3}            ABC
{2,4}        Two to four times      \d{2,4}          156
{3,}         Three or more times    \w{3,}           regex_tutorial
*            Zero or more times     A*B*C*           AAACC
?            Once or none           plurals?         plural

Quantifier   Legend                              Example     Sample Match
+            The + (one or more) is "greedy"     \d+         12345
?            Makes quantifiers "lazy"            \d+?        1 in 12345
*            The * (zero or more) is "greedy"    A*          AAA
?            Makes quantifiers "lazy"            A*?         empty in AAA
{2,4}        Two to four times, "greedy"         \w{2,4}     abcd
?            Makes quantifiers "lazy"            \w{2,4}?    ab in abcd
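The greedy and lazy rows of the quantifier table can be checked with Python's `re` module. This is a minimal sketch; the variable names are chosen for this example, and the patterns and sample strings are the ones from the table.

```python
import re

# Greedy quantifiers take as much as they can.
greedy_plus = re.search(r"\d+", "12345").group()         # "12345"
greedy_star = re.search(r"A*", "AAA").group()            # "AAA"
greedy_range = re.search(r"\w{2,4}", "abcd").group()     # "abcd"

# Adding ? after a quantifier makes it lazy: it takes as little as it can.
lazy_plus = re.search(r"\d+?", "12345").group()          # "1"
lazy_star = re.search(r"A*?", "AAA").group()             # "" (empty match)
lazy_range = re.search(r"\w{2,4}?", "abcd").group()      # "ab"

# Plain quantifiers from the first table.
version = re.search(r"Version \w-\w+", "Version A-b1_1").group()
letters = re.search(r"A*B*C*", "AAACC").group()
```

The lazy `A*?` case is worth noting: zero repetitions already satisfies the pattern, so the match is empty even though the input is AAA.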
Construct    Legend                      Example                  Sample Match
|            Alternation / OR operand    22|33                    33
( … )        Capturing group             A(nt|pple)               Apple (captures "pple")
\1           Contents of Group 1         r(\w)g\1x                regex
\2           Contents of Group 2         (\d\d)\+(\d\d)=\2\+\1    12+65=65+12
(?: … )      Non-capturing group         A(?:nt|pple)             Apple
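The grouping and back-reference rows can be tried out the same way with Python's `re` module. A minimal sketch, with variable names invented for this example:

```python
import re

# Alternation: either branch may match.
alternation = re.search(r"22|33", "33").group()

# A capturing group remembers the text it matched.
captured = re.fullmatch(r"A(nt|pple)", "Apple").group(1)

# \1 must repeat exactly what group 1 captured (here the letter "e").
backref_one = re.fullmatch(r"r(\w)g\1x", "regex")

# \2 and \1 reuse both captured numbers in swapped order.
backref_two = re.fullmatch(r"(\d\d)\+(\d\d)=\2\+\1", "12+65=65+12")

# A non-capturing group matches the same text but stores nothing.
non_capturing = re.fullmatch(r"A(?:nt|pple)", "Apple")
```

`captured` is "pple", `backref_two.groups()` is ("12", "65"), and `non_capturing.groups()` is an empty tuple because nothing was captured.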
https://www.rexegg.com/regex-quickstart.html
https://regexr.com/
• Copy and paste sample log messages into https://regexr.com/, then try to
develop a regular expression that matches IPv4 addresses.
Date flow start Duration Proto Src IP Addr:Port Dst IP Addr:Port Packets Bytes Flows
2007-02-24 04:54:54.917 42.682 UDP 84.77.114.176:57024 -> 10.16.54.6:19522 2 58 1
IP Address Validation
\d+\.\d+\.\d+\.\d+ or /([0-9]{1,3}\.){3}[0-9]{1,3}/g
• While these patterns match valid IP addresses like 10.0.3.1, they also match
invalid ones like 300.500.27.900
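The behaviour of the loose pattern can be seen by running it over the sample flow record shown earlier. This sketch uses Python's `re` module; the variable names are chosen for this example.

```python
import re

# The loose pattern: four dot-separated runs of digits, with no range check.
LOOSE_IP = re.compile(r"\d+\.\d+\.\d+\.\d+")

# The sample flow record from the exercise.
record = ("2007-02-24 04:54:54.917 42.682 UDP "
          "84.77.114.176:57024 -> 10.16.54.6:19522 2 58 1")

# Finds both addresses; the timestamp and duration have too few dots to match.
found = LOOSE_IP.findall(record)

# The same pattern also accepts an address whose octets are out of range.
accepts_invalid = LOOSE_IP.fullmatch("300.500.27.900") is not None
```

`found` holds the source and destination addresses, and `accepts_invalid` is true, which is exactly the weakness described above.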
• The regular expression for matching IP addresses should make sure each octet
is in the proper range.
^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
• The expression still matches 0.0.0.0, which is invalid in some network
types
• Some security systems report spoofed IP addresses as 0.0.0.0
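A minimal Python sketch of the range-checked expression follows. It builds the pattern from one octet sub-pattern repeated four times; the names `OCTET`, `STRICT_IP`, and `is_valid_ipv4` are assumptions made for this example, and the groups are written as non-capturing since nothing needs to be extracted.

```python
import re

# One octet in the range 0-255, as in the expression above.
OCTET = r"(?:[01]?\d\d?|2[0-4]\d|25[0-5])"

# Four octets separated by literal dots, anchored at both ends.
STRICT_IP = re.compile(rf"^{OCTET}\.{OCTET}\.{OCTET}\.{OCTET}$")

def is_valid_ipv4(text):
    """Return True if text is a dotted quad with every octet in 0-255."""
    return STRICT_IP.match(text) is not None

valid = is_valid_ipv4("10.0.3.1")              # True
out_of_range = is_valid_ipv4("300.500.27.900")  # False
all_zero = is_valid_ipv4("0.0.0.0")            # True: still accepted
```

Because 0.0.0.0 passes the range check, an analyst who wants to flag it has to test for it separately after the match.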
DATA EXPLORATION
Statistical Techniques
• Techniques include the mean, median, standard deviation,
inter-quartile range, and distance formulas
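A short sketch of the mean, the median, and one distance formula using Python's standard library. The counts below are hypothetical (invented hourly failed-login counts), and Euclidean distance is used as one example of a distance formula, not necessarily the one intended in the lecture.

```python
import math
import statistics

# Hypothetical failed-login counts per hour; the last value is an outlier.
failed_logins = [2, 3, 3, 4, 50]

mean_value = statistics.mean(failed_logins)      # 62 / 5 = 12.4
median_value = statistics.median(failed_logins)  # middle value = 3

# Euclidean distance between two hypothetical feature vectors.
distance = math.dist((0, 0), (3, 4))             # sqrt(3^2 + 4^2) = 5.0
```

The single outlier pulls the mean far above the median, which is why both are usually reported together when exploring log data.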
Central Tendency
Dispersion
Exercise: Calculate the standard deviation (σ) of the dataset containing 3, 4, 4, 5, 6, 8.
• Dispersion – Spread of the data
• Range
• range = max – min
• difference between the maximum and minimum values
• Variance (σ²)
• measures how far each value in the dataset is from the mean
• defined as the sum of the squared distances of each term in
the distribution from the mean (μ), divided by the number of
terms in the distribution (N): σ² = Σ(xᵢ − μ)² / N
• the standard deviation (σ) is the square root of the variance
• Quartiles
• divide the population into four equal parts (¼ each), ordered by some attribute
• First and third quartiles (the 25th and 75th percentiles, or
the median value of the first and last halves of the data)
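The dispersion measures above can be worked through on the exercise dataset 3, 4, 4, 5, 6, 8. This is a sketch under stated assumptions: the variance divides by N (population variance), and the helper `quartiles_by_halves` is a name invented here that follows the "median of each half" rule; for an odd number of values it leaves the middle value out of both halves, which is one convention among several.

```python
import statistics

data = sorted([3, 4, 4, 5, 6, 8])

# Range: difference between the maximum and minimum values.
data_range = max(data) - min(data)                        # 8 - 3 = 5

# Variance: mean of the squared distances from the mean (divide by N).
mu = statistics.mean(data)                                # 30 / 6 = 5
variance = sum((x - mu) ** 2 for x in data) / len(data)   # 16 / 6

# Standard deviation: square root of the variance.
sigma = variance ** 0.5

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Needs at least two values. For an odd count, the middle value is
    excluded from both halves (an assumed convention).
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half:] if len(ordered) % 2 == 0 else ordered[half + 1:]
    return statistics.median(lower), statistics.median(upper)

q1, q3 = quartiles_by_halves(data)   # medians of [3, 4, 4] and [5, 6, 8]
iqr = q3 - q1                        # inter-quartile range
```

For this dataset the variance is 16/6 ≈ 2.67, so σ ≈ 1.63; the quartiles are 4 and 6, giving an inter-quartile range of 2.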
Graph to visualize data distribution
Two-way Tables
• Table of counts for reliability and risk
• Shown as a two-way table and as a graphical representation
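A two-way table of counts can be built from paired categorical observations. The sketch below uses Python's `collections.Counter`; the observations and the category labels "high" and "low" are hypothetical, since the original table's values are not reproduced here.

```python
from collections import Counter

# Hypothetical (reliability, risk) observations, invented for illustration.
observations = [
    ("high", "low"), ("high", "low"), ("high", "high"),
    ("low", "high"), ("low", "high"), ("low", "low"),
]

# Count how often each (reliability, risk) combination occurs.
counts = Counter(observations)

reliability_levels = sorted({rel for rel, _ in observations})
risk_levels = sorted({risk for _, risk in observations})

# Rows are reliability levels, columns are risk levels.
table = {
    rel: {risk: counts[(rel, risk)] for risk in risk_levels}
    for rel in reliability_levels
}

# Marginal totals per row.
row_totals = {rel: sum(table[rel].values()) for rel in reliability_levels}
```

Each cell of `table` is the count for one combination, and the row totals give the marginal distribution of reliability.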
Lecture 4 Summary
• Regular expressions
• Data exploration
• Quantitative variables: five-number summary
• Qualitative variables: frequency table
• Charts and graphs