Lec4 Data Analysis

The document discusses data analysis techniques for security analytics. It covers topics like data filtering, normalization, and correlation. It also discusses Linux commands like grep, awk and sed that can be used for log analysis and parsing logs. Finally, it provides an introduction to regular expressions and how they can be used for searching logs.

Uploaded by

Aqil Syahmi

SIT Internal

ICT 3204 Security Analytics
Lecture 4: Data Analysis Techniques

Lecture 3 Review

• Events of security interest

Lecture 3 Review

Raw logs -> Filtering & Normalization -> Correlation

• Data filtering (Extraction)
• Irrelevant data fields
• Duplicated data entries, possibly from different sources
• Redundant data that is heavily dependent on, and can be derived from, other data, e.g., collinearity between DoB and Age
• Data normalization and reformatting (Transformation)
• Break down known log messages into a normalized format, e.g., to resolve inconsistent representations between data sources
• Reformatting, e.g., .pcap (for Wireshark) to CSV (for Splunk)
• Handling data discrepancy (Feature engineering)
• Noise, outliers
• Missing values

Lecture 3 Review

Correlation Patterns

Micro-Level
• Source IP correlation
• Destination IP correlation
• Time correlation
• Interleaving address correlation
• Port correlation

Macro-Level
• Anti-port correlation
• Geographic location correlation
• Vulnerability correlation
• Watch list correlation

Lecture 4 Contents
Data analysis techniques
• Linux commands for log analysis
• Regular expressions
• Statistical analysis for data exploration

Linux commands for log analysis
• grep, awk
• Data filtering
• sed
• Parsing utility like awk
• Good at search and text replacements to format the log output
• sort
• Data summary
• head, tail
• …

Operations on Data

• Target operations

• Data reformatting: Modifying the way we see it


• .pcap -> Splunk

• Data filtering: We want to only see specific stuff

• Data summarization: Seeing a condensed view


• E.g., count, uniq


Linux Command - grep

• Linux/Unix utility, also ported to Cygwin on Windows
• Searches input files for a pattern or regular expression
• Works on human-readable text files
• Users need to know the search term or what they are looking for
• http://www.gnu.org/software/grep/manual/grep.html

Using grep for Log Analysis

• See all messages except those containing ssh or telnet

# grep -Ev 'ssh|telnet' /var/log/messages

• See all messages matching the patterns from a file "patterns"

# grep -f patterns /var/log/messages

• Look for records with the string "Failed" or "failed"

# tail -n 1000 /var/log/messages | grep ailed
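The three grep filters above can also be scripted. The following is a minimal Python sketch of the same idea, not part of the lecture material. The log lines and the pattern list are invented for illustration; the list `patterns` stands in for the contents of the file "patterns".

```python
import re

# Hypothetical log lines, invented for illustration only.
log_lines = [
    "Jan 10 10:00:01 host1 sshd[101]: Failed password for root",
    "Jan 10 10:00:05 host2 telnetd[202]: connection from 10.0.0.5",
    "Jan 10 10:00:09 host3 kernel: eth0 link up",
    "Jan 10 10:00:12 host1 su: authentication failed for user bob",
]

# Like `grep -Ev 'ssh|telnet'`: keep lines that match neither word.
exclude = re.compile(r"ssh|telnet")
without_remote = [line for line in log_lines if not exclude.search(line)]

# Like `grep -f patterns`: keep lines that match any pattern in the list.
patterns = ["kernel", "su:"]  # hypothetical stand-in for the file "patterns"
any_pattern = re.compile("|".join(patterns))
from_patterns = [line for line in log_lines if any_pattern.search(line)]

# Like `grep ailed`: the substring matches both "Failed" and "failed".
failed = [line for line in log_lines if "ailed" in line]

for line in failed:
    print(line)
```

Searching for "ailed" instead of "Failed" is a small trick to stay case-insensitive on the first letter without an extra option.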


Using grep for Log Analysis

Someone at address 192.168.0.6 is doing something malicious by incrementing a customer account number by one and trying to guess the valid accounts we have in the system!

Linux Command - awk

• Linux/Unix tool
• http://www.gnu.org/software/gawk/manual/gawk.html
• We focus on the "print" command of awk to help us piece together what the malicious attacker has done

• View which devices and systems have logged to our file

# cat messages | awk '{print $4}' | sort -u

Using awk for Log Analysis

Use $n to reference a specific field
• e.g., awk '{print $1}' gives the client IP addresses
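To make the field numbering concrete, here is a small Python sketch of what awk's `$n` does with its default whitespace splitting. The helper name `field` and the sample lines are assumptions made for this illustration.

```python
# Hypothetical syslog-style lines, invented for illustration only.
lines = [
    "Jan 10 10:00:01 host1 sshd[101]: Failed password for root",
    "Jan 10 10:00:05 host2 telnetd[202]: connection from 10.0.0.5",
    "Jan 10 10:00:09 host1 kernel: eth0 link up",
]


def field(line: str, n: int) -> str:
    """Return awk-style field $n (1-based), or "" if the field is missing."""
    parts = line.split()  # like awk's default: split on runs of whitespace
    return parts[n - 1] if 1 <= n <= len(parts) else ""


# Like `awk '{print $4}' | sort -u`: unique values of the fourth field, sorted.
hosts = sorted({field(line, 4) for line in lines})
print(hosts)
```

awk counts fields from 1, while Python lists count from 0, which is why the helper subtracts one.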

Combined Usage of grep and awk

• Show the URLs that were accessed by the attacker at 192.168.0.6, which pages returned an error (status code 403), and which pages were accessed successfully (status code 200)

The attacker at 192.168.0.6 was able to brute-force guess the account number 111111114 and changed the password on this client account.
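The slide's actual command is not preserved in this text, so the following Python sketch only illustrates the kind of grep-plus-awk combination described above. It assumes a web access log in Common Log Format, where field 1 is the client IP, field 7 the requested URL and field 9 the status code. The log lines and the `/account?id=` URL shape are invented.

```python
# Hypothetical access-log lines in Common Log Format (invented data).
access_log = [
    '192.168.0.6 - - [10/Jan/2024:10:00:01 +0800] "GET /account?id=111111113 HTTP/1.1" 403 199',
    '192.168.0.9 - - [10/Jan/2024:10:00:02 +0800] "GET /index.html HTTP/1.1" 200 1043',
    '192.168.0.6 - - [10/Jan/2024:10:00:03 +0800] "GET /account?id=111111114 HTTP/1.1" 200 2310',
]

ATTACKER = "192.168.0.6"

results = []
for line in access_log:
    fields = line.split()
    # Like grep for the attacker's address, restricted to the client-IP field ($1).
    if fields[0] != ATTACKER:
        continue
    # Like `awk '{print $7, $9}'`: requested URL and HTTP status code.
    results.append((fields[6], fields[8]))

denied = [url for url, status in results if status == "403"]
allowed = [url for url, status in results if status == "200"]
print("403:", denied)
print("200:", allowed)
```

Comparing the first field exactly avoids false hits such as 192.168.0.60, which a plain substring search for 192.168.0.6 would also match.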

Regular Expressions

• Information on regular expressions
• http://www.regular-expressions.info/quickstart.html

• grep can be used with regex
• https://www.digitalocean.com/community/tutorials/using-grep-regular-expressions-to-search-for-text-patterns-in-linux

• Splunk can be used with regex

Regex - Characters https://www.rexegg.com/regex-quickstart.html

Character  Example             Sample Match    Legend
\d         file_\d\d           file_25         Most engines: one digit from 0 to 9
\d         file_\d\d           file_9੩         .NET, Python 3: one Unicode digit in any script
\w         \w-\w\w\w           A-b_1           Most engines: "word character": ASCII letter, digit or underscore
\w         \w-\w\w\w           字-ま_۳          Python 3: "word character": Unicode letter, ideogram, digit, or underscore
\w         \w-\w\w\w           字-ま‿۳          .NET: "word character": Unicode letter, ideogram, digit, or connector
\s         a\sb\sc             a b(newline)c   Most engines: "whitespace character": space, tab, newline, carriage return, vertical tab
\s         a\sb\sc             a b(newline)c   .NET, Python 3, JavaScript: "whitespace character": any Unicode separator
\D         \D\D\D              ABC             One character that is not a digit as defined by your engine's \d
\W         \W\W\W\W\W          *-+=)           One character that is not a word character as defined by your engine's \w
\S         \S\S\S\S            Yoyo            One character that is not a whitespace character as defined by your engine's \s
.          a.c                 abc             Any character except line break
.          .*                  whatever, man.  Any character except line break
\.         a\.c                a.c             A period (special character: needs to be escaped by a \)
\          \.\*\+\? \$\^\/\\   .*+? $^/\       Escapes a special character
\          \[\{\(\)\}\]        [{()}]          Escapes a special character

Regex - Quantifiers https://www.rexegg.com/regex-quickstart.html

Quantifier  Example         Sample Match    Legend
+           Version \w-\w+  Version A-b1_1  One or more
{3}         \D{3}           ABC             Exactly three times
{2,4}       \d{2,4}         156             Two to four times
{3,}        \w{3,}          regex_tutorial  Three or more times
*           A*B*C*          AAACC           Zero or more times
?           plurals?        plural          Once or none

Quantifier  Example         Sample Match    Legend
+           \d+             12345           The + (one or more) is "greedy"
?           \d+?            1 in 12345      Makes quantifiers "lazy"
*           A*              AAA             The * (zero or more) is "greedy"
?           A*?             empty in AAA    Makes quantifiers "lazy"
{2,4}       \w{2,4}         abcd            Two to four times, "greedy"
?           \w{2,4}?        ab in abcd      Makes quantifiers "lazy"
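The difference between greedy and lazy quantifiers is easiest to see by running the table's examples. A short Python sketch using the standard `re` module (other engines may differ in details):

```python
import re

# Greedy: \d+ takes as many digits as it can.
greedy_digits = re.search(r"\d+", "12345").group()

# Lazy: \d+? stops as soon as the minimum (one digit) is satisfied.
lazy_digits = re.search(r"\d+?", "12345").group()

# A* is greedy and consumes all three letters; A*? is happy with nothing.
greedy_star = re.search(r"A*", "AAA").group()
lazy_star = re.search(r"A*?", "AAA").group()

# {2,4} is greedy by default; adding ? makes it take only the minimum of two.
greedy_range = re.search(r"\w{2,4}", "abcd").group()
lazy_range = re.search(r"\w{2,4}?", "abcd").group()

print(greedy_digits, lazy_digits, repr(lazy_star), greedy_range, lazy_range)
```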

Regex - Character class https://www.rexegg.com/regex-quickstart.html

Character  Example         Sample Match                                                              Legend
[…]        [AEIOU]         One uppercase vowel                                                       One of the characters in the brackets
[…]        T[ao]p          Tap or Top                                                                One of the characters in the brackets
-          [a-z]           One lowercase letter                                                      Range indicator
[x-y]      [A-Z]+          GREAT                                                                     One of the characters in the range from x to y
[…]        [AB1-5w-z]      One of either: A,B,1,2,3,4,5,w,x,y,z                                      One of the characters in the brackets
[x-y]      [ -~]+          Characters in the printable section of the ASCII table                    One of the characters in the range from x to y
[^x]       [^a-z]{3}       A1!                                                                       One character that is not x
[^x-y]     [^ -~]+         Characters that are not in the printable section of the ASCII table       One of the characters not in the range from x to y
[\d\D]     [\d\D]+         Any characters, including new lines, which the regular dot doesn't match  One character that is a digit or a non-digit
[\x41]     [\x41-\x45]{3}  ABE                                                                       Matches the character at hexadecimal position 41 in the ASCII table, i.e. A

Regex - logic https://www.rexegg.com/regex-quickstart.html

Logic      Example                Sample Match             Legend
|          22|33                  33                       Alternation / OR operand
( … )      A(nt|pple)             Apple (captures "pple")  Capturing group
\1         r(\w)g\1x              regex                    Contents of Group 1
\2         (\d\d)\+(\d\d)=\2\+\1  12+65=65+12              Contents of Group 2
(?: … )    A(?:nt|pple)           Apple                    Non-capturing group
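The grouping rows above can be checked directly. This is a brief Python sketch using the standard `re` module, reusing the table's own patterns and sample strings:

```python
import re

# Capturing group: the part inside ( ) is remembered as group 1.
capture = re.fullmatch(r"A(nt|pple)", "Apple")
captured_text = capture.group(1)

# Non-capturing group: same match, but nothing is remembered.
no_capture = re.fullmatch(r"A(?:nt|pple)", "Apple")
remembered = no_capture.groups()

# \1 must repeat whatever group 1 matched ("e" in "regex").
backref = re.fullmatch(r"r(\w)g\1x", "regex")

# \2 and \1 reuse both groups in swapped order.
swapped = re.fullmatch(r"(\d\d)\+(\d\d)=\2\+\1", "12+65=65+12")
not_swapped = re.fullmatch(r"(\d\d)\+(\d\d)=\2\+\1", "12+65=12+65")

print(captured_text, remembered, backref.group(1), swapped.groups())
```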

Regex - Anchors and Boundaries https://www.rexegg.com/regex-quickstart.html


Anchor  Example          Sample Match               Legend
^       ^abc .*          abc (line start)           Start of string or start of line depending on multiline mode. (But when [^inside brackets], it means "not")
$       .*? the end$     this is the end            End of string or end of line depending on multiline mode. Many engine-dependent subtleties.
\A      \Aabc[\d\D]*     abc (string... ...start)   Beginning of string (all major engines except JS)
\z      the end\z        this is...\n...the end     Very end of the string. Not available in Python and JS
\Z      the end\Z        this is...\n...the end\n   End of string or (except Python) before final line break. Not available in JS
\G                                                  Beginning of String or End of Previous Match. .NET, Java, PCRE (C, PHP, R…), Perl, Ruby
\b      Bob.*\bcat\b     Bob ate the cat            Word boundary. Most engines: position where one side only is an ASCII letter, digit or underscore
\b      Bob.*\b\кошка\b  Bob ate the кошка          Word boundary. .NET, Java, Python 3, Ruby: position where one side only is a Unicode letter, digit or underscore
\B      c.*\Bcat\B.*     copycats                   Not a word boundary
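Anchors match positions rather than characters, which a quick experiment shows well. The sketch below uses Python's `re` module; the two-line sample string is invented, and the multiline behaviour shown is Python's and may differ in other engines.

```python
import re

text = "abc first line\nabc second line"  # invented sample text

# Without MULTILINE, ^ matches only at the very start of the string.
start_only = re.findall(r"^abc", text)

# With MULTILINE, ^ also matches at the start of each line.
each_line = re.findall(r"^abc", text, flags=re.MULTILINE)

# \b needs a word boundary on both sides of "cat".
whole_word = re.search(r"\bcat\b", "Bob ate the cat")
no_whole_word = re.search(r"\bcat\b", "copycats")

# \B needs "cat" to sit inside a longer word.
inside_word = re.search(r"\Bcat\B", "copycats")

print(len(start_only), len(each_line), bool(whole_word), bool(no_whole_word), bool(inside_word))
```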

Regular Expression Online Engine


Splunk uses Perl Compatible Regular Expressions (PCRE)

https://regexr.com/

Regular Expression Exercise

• Copy and paste the sample log messages into https://regexr.com/, then try to develop a regular expression that matches IPv4 addresses.

Date flow start Duration Proto Src IP Addr:Port Dst IP Addr:Port Packets Bytes Flows
2007-02-24 04:54:54.917 42.682 UDP 84.77.114.176:57024 -> 10.16.54.6:19522 2 58 1

2007-02-24 04:55:06.552 15.202 UDP 84.77.114.176:57024 -> 10.16.54.6:18278 2 58 1

2007-02-24 04:54:54.806 13.998 UDP 84.77.114.176:57024 -> 10.16.54.6:31991 2 58 1

2007-02-24 04:54:52.434 96.322 UDP 89.106.22.3:54606 -> 10.16.54.6:38662 166 4814 1

2007-02-24 04:55:03.714 72.352 UDP 84.77.114.176:57024 -> 10.16.54.6:34016 2 58 1

2007-02-24 04:54:34.830 91.019 UDP 213.144.110.130:3656 -> 10.16.54.6:4027 160 4640 1

2007-02-24 04:54:54.941 80.638 UDP 84.77.114.176:57024 -> 10.16.54.6:34197 2 58 1
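One possible attempt at the exercise, sketched in Python rather than on regexr. The pattern is a loose first try that only checks the shape of an address (four groups of one to three digits), not the value of each octet. Two of the sample lines above are reused as input.

```python
import re

# Two of the sample flow records from the exercise.
sample = (
    "2007-02-24 04:54:54.917 42.682 UDP 84.77.114.176:57024 -> 10.16.54.6:19522 2 58 1\n"
    "2007-02-24 04:54:52.434 96.322 UDP 89.106.22.3:54606 -> 10.16.54.6:38662 166 4814 1\n"
)

# Four dot-separated groups of 1 to 3 digits, delimited by word boundaries.
ipv4_shape = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

found = ipv4_shape.findall(sample)
unique_ips = sorted(set(found))
print(found)
```

The timestamps and durations (for example 54.917 or 42.682) are not picked up because they contain only one dot.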



IP Address Validation

\d+\.\d+\.\d+\.\d+ or /([0-9]{1,3}\.){3}[0-9]{1,3}/g
• While these will catch IP addresses like 10.0.3.1, they will also catch an invalid IP address like 300.500.27.900

• The regular expression for matching IP addresses should make sure each octet is in the proper range (0 to 255).
^([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])\.([01]?\d\d?|2[0-4]\d|25[0-5])$
• The expression will also accept an IP address of 0.0.0.0, which is invalid in some network types
• Some security systems report spoofed IP addresses as 0.0.0.0
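The two approaches can be compared side by side. This Python sketch builds the range-checked pattern from the octet expression on the slide; the names `OCTET`, `LOOSE`, `STRICT` and `is_valid_ipv4` are chosen here for illustration.

```python
import re

# One octet in the range 0-255, as on the slide.
OCTET = r"([01]?\d\d?|2[0-4]\d|25[0-5])"

# Loose pattern: only checks the shape.
LOOSE = re.compile(r"^\d+\.\d+\.\d+\.\d+$")
# Strict pattern: four range-checked octets joined by literal dots.
STRICT = re.compile("^" + r"\.".join([OCTET] * 4) + "$")


def is_valid_ipv4(text: str) -> bool:
    """Return True if text is a dotted quad with every octet in 0-255."""
    return STRICT.match(text) is not None


for candidate in ["10.0.3.1", "300.500.27.900", "0.0.0.0", "256.1.1.1"]:
    print(candidate, bool(LOOSE.match(candidate)), is_valid_ipv4(candidate))
```

Note that 0.0.0.0 passes the strict check, which is the caveat raised in the last two bullets.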


• On Lecture-3 contents
• 3 MCQs

• Open www.classpoint.app on your web browser
• Key in the class code that appears in the top right-hand corner of the presentation
• Type in your student ID and join

DATA EXPLORATION

Exploratory Data Analysis

○ uncovering interesting trends, outliers, and patterns in the data
○ identifying areas of interest, understanding the context of data

Exploratory Data Analysis (EDA)

• Process to understand data


• Learn about variables
• Significance of variables
• Entities involved
• Relationships between variables
• Relation with other datasets

• Without understanding the data


• Cannot assess usefulness
• Cannot refine it over time
• Cannot visualize suitably
• Cannot think algorithmically
• Cannot comprehend the capabilities

Statistical Techniques
• Techniques such as mean, median, standard deviation, inter-quartile range, and distance formulas

Analysis on a Single Variable


• Univariate Analysis
• Analyzing a single variable/attribute
• Purpose is to describe the quantitative data
• Does not deal with relationships

• Describing patterns using


• Central Tendency – Concentration of the data
• Mean, Mode and Median

• Dispersion – Spread of the data


• Range
• Variance
• Quartiles

Central Tendency

• Central Tendency – Concentration of the data
• Mean, Mode and Median
• Mean – sum of all values divided by the number of values
• Mode – value that occurs most frequently
• Median – the value at the middle of the sorted data set
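As a quick illustration of the three measures, here is a Python sketch using the standard `statistics` module on the small dataset 3, 4, 4, 5, 6, 8 that also appears in this lecture's dispersion exercise.

```python
import statistics

data = [3, 4, 4, 5, 6, 8]

mean = statistics.mean(data)      # sum of values / number of values = 30 / 6
median = statistics.median(data)  # even count: average of the two middle values (4 and 5)
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)
```

With an even number of values there is no single middle element, so the median is taken as the average of the two middle ones.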

Dispersion

• Dispersion – Spread of the data
• Range
• range = max – min
• difference between the maximum and minimum values
• Variance (σ²)
• measures how far the values in the dataset are from the mean
• defined as the sum of the squared distances of each term in the distribution from the mean (μ), divided by the number of terms in the distribution (N): σ² = Σ(xᵢ − μ)² / N
• Quartiles
• each quartile covers ¼ of the population, ordered by some attribute
• First and third quartiles (the 25th and 75th percentiles, or the median value of the first and last halves of the data)

Exercise: Calculate the standard deviation (σ) of the dataset containing 3, 4, 4, 5, 6, 8.
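The definitions above can be worked through on the dataset 3, 4, 4, 5, 6, 8. The Python sketch below follows the population form of the variance given in the text (dividing by N); a sample variance would divide by N − 1 instead, which is an alternative convention and not what the slide defines.

```python
import math

data = [3, 4, 4, 5, 6, 8]
n = len(data)

mu = sum(data) / n                               # mean: 30 / 6
value_range = max(data) - min(data)              # range = max - min
squared_distances = [(x - mu) ** 2 for x in data]
variance = sum(squared_distances) / n            # population variance: 16 / 6
sigma = math.sqrt(variance)                      # standard deviation

print(mu, value_range, round(variance, 3), round(sigma, 3))
```

The squared distances from the mean are 4, 1, 1, 0, 1 and 9, which sum to 16, so the standard deviation comes out at roughly 1.63.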

Analysis on Multiple Variables


• Multivariate analysis (MVA)
• Analysing two or more attributes
• Quantitative measures

• Relationship between two attributes


• How attribute 1 affects attribute 2
• Interesting patterns

• Relationship between three attributes


• How attributes affect each other
• Interesting trends

Analysis on Multiple Variables


• Regression Analysis
• Predicting the outcome of an attribute from another attribute
• Predictive Modelling

• Principal Component Analysis
• Identify dominant patterns in data
• Detect outliers using a box plot
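To illustrate the regression bullet, here is a minimal sketch of simple linear regression (ordinary least squares with one predictor) in plain Python. The data points and variable names are invented for illustration and do not come from the lecture.

```python
# Hypothetical paired observations: predict attribute y from attribute x.
x = [1, 2, 3, 4, 5]
y = [3, 5, 7, 9, 11]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope: covariance of x and y divided by variance of x.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
s_xx = sum((xi - mean_x) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x


def predict(new_x: float) -> float:
    """Predict y for a new x using the fitted line."""
    return intercept + slope * new_x


print(slope, intercept, predict(6))
```

Real security data is rarely this clean; the points here lie exactly on a line so the fitted slope and intercept are easy to verify by hand.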

Five Number Summary for quantitative variables
• Use five numbers to summarize the range and distribution of a quantitative variable
• Minimum and maximum values
• taking the difference of these gives the range (range = max - min)
• Median
• the value at the middle of the data set
• First and third quartiles
• 25th and 75th percentiles
• Mean
• also called the average (reported in addition to the five numbers)
• Provides an exploratory step to look at descriptive statistics of quantitative variables
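A sketch of such a summary in Python, using the standard `statistics` module on the dataset 3, 4, 4, 5, 6, 8 from this lecture. Quartile conventions differ between tools; the "inclusive" method used here is one common choice, so other software may report slightly different quartiles for the same data.

```python
import statistics

data = [3, 4, 4, 5, 6, 8]

# Cut points that split the data into four parts: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

summary = {
    "min": min(data),
    "q1": q1,
    "median": median,
    "mean": statistics.mean(data),
    "q3": q3,
    "max": max(data),
}
print(summary)
```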

• Even though we represent Reliability and Risk as numbers, they are ordinal variables
• meaning each entry is assigned an integer, and a value of 4 is not necessarily twice the Reliability or Risk of 2.
• It only means that a Reliability or Risk scored 4 is higher than one scored 2.

Frequency for qualitative variables

• Display the count for each category of a qualitative variable

table(av$Reliability)
##      1      2      3      4      5      6      7      8      9
##   5612 149117  10892  87040      7   4758    297     21    686
##     10
##    196

table(av$Risk)
##      1      2      3      4      5      6      7
##     39 213852  33719   9588   1328     90     10

# summary sorts by the counts by default
# maxsum sets how many factors to display
summary(av$Type, maxsum=10)
##                Scanning Host               Malware Domain
##                       234180                         9274
##                   Malware IP               Malicious Host
##                         6470                         3770
##                     Spamming                          C&C
##                         3487                          610
## Scanning Host;Malicious Host    Malware Domain;Malware IP
##                          215                          173
## Malicious Host;Scanning Host                      (Other)
##                          163                          284
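The same kind of frequency count can be sketched in Python with `collections.Counter`. The short list of categories below is invented for illustration; it is not the dataset behind the R output above.

```python
from collections import Counter

# Hypothetical values of a qualitative "Type" variable (invented data).
types = [
    "Scanning Host",
    "Malware Domain",
    "Scanning Host",
    "Malware IP",
    "Scanning Host",
    "Malware Domain",
]

freq = Counter(types)

# most_common() lists categories from the highest count to the lowest.
ranked = freq.most_common()
for category, count in ranked:
    print(f"{category:15} {count}")
```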

Graph to visualize data distribution

Bar charts giving a visual overview of Country, Risk and Reliability factors respectively

Two-way Tables
Table of counts for Reliability and Risk
[Figures: two-way table; graphical representation]
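A two-way table counts how often each combination of two variables occurs. The Python sketch below builds one from a handful of invented (Reliability, Risk) pairs; the values are for illustration only and are not taken from the lecture's dataset.

```python
from collections import Counter

# Hypothetical (Reliability, Risk) pairs, invented for illustration only.
records = [(2, 2), (2, 2), (4, 2), (4, 3), (2, 3), (4, 2)]

pair_counts = Counter(records)
reliabilities = sorted({rel for rel, _ in records})
risks = sorted({risk for _, risk in records})

# Rows are Reliability values, columns are Risk values.
table = {
    rel: [pair_counts[(rel, risk)] for risk in risks]
    for rel in reliabilities
}

print("Risk:", risks)
for rel, row in table.items():
    print("Reliability", rel, row)
```

Each row total equals the one-way frequency of that Reliability value, which is a handy consistency check.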

Lecture 4 Summary

• Linux commands for log analysis


• grep, awk ‘{print $n}’
• sed
• sort, uniq, count

• Regular expressions

• Data exploration
• Quantitative variables: five number summary
• Qualitative variables: frequency table
• Charts and graphs
