MIDS 701 Final Project

Final Project- Machine Malware Prediction

Overview

Malware or malicious software is any form of software that is intentionally designed to cause damage to a computer, server, or a computer network. Enterprises like Microsoft take this to be a very serious problem. With over a billion enterprise and consumer customers, they want to develop data-science based techniques to predict if a machine will soon be hit with malware. Microsoft’s next-generation detection solutions, like Windows Defender Antivirus uses data science, machine learning, automation, and behavioral analysis to provide ‘real-time protection’ to customers against threats. However, early detection of the possibility of a malware-attack will strengthen Microsoft’s security. The goal of this project is to use data on different properties of a Windows machine to predict the probability of it getting infected by any kind of malware.

Research questions

Although Microsoft has increasingly used machine-learning algorithms along with AI to predict and prevent malware attacks within milliseconds, they are constantly working to improve the efficiency of these algorithms. They want to “to stop malware before it is even seen”. The main objective questions being answered in this project are: * What system traits/properties of a machine make it the most vulnerable to a malware attack? * What parameter, if controlled for can provide a Windows machine the maximum security against malware? * Are there geographic factors that make a machine more vulnerable to malware attacks? * Do hardware traits and software traits reflect on a machines’ vulnerability to a malware attack differently?

Data

The data is being sourced from Kaggle where the data had been put up by Microsoft as a part of a sponsored competition. The telemetry data containing data on the machines’ properties and the machine infections was generated by combining heartbeat and threat reports collected by Microsoft’s endpoint protection solution, Windows Defender. This data contains anonymized data from 16.8M devices and has over 8,000,000 records of data. Some of the relevant variables in the dataset are:

MachineIdentifier - Individual machine ID
IsBeta - Defender state information e.g. false
HasTpm - True if machine has tpm
CountryIdentifier - ID for the country the machine is located in
CityIdentifier - ID for the city the machine is located in
GeoNameIdentifier - ID for the geographic region a machine is located in
Platform - Calculates platform name (of OS related properties and processor property)
Processor - This is the process architecture of the installed operating system
OsVer - Version of the current operating system
OsBuild - Build of the current operating system
OsSuite - Product suite mask for the current operating system.
OsPlatformSubRelease - Returns the OS Platform sub-release
Census_PrimaryDiskTotalCapacity - Amount of disk space on primary disk of the machine in MB
Census_PrimaryDiskTypeName - Friendly name of Primary Disk Type - HDD or SSD

The complete data dictionary can be found in the data dictionary file.

Project plan

This looks like a binary classification problem. It might be a good decision to use decision trees owing to the high number of variables present in the data and compare its performance to basic logistic regression. An ensemble model can be incorporated to learn and borrow information from various models. Data cleaning, imputation and EDA will be a major chunk of the project and will take 1.5 weeks (10 days). The model building and validation stage will then another 7 days. Validation and final results should take another 3 days to close the project.

Inspiration: Kaggle Competition

Final Model

It was a bagging model: mod6 <- randomForest(HasDetections ~ ., data = train_input,mtry=4)

This was a classification model with mtry=4 which means:

Number of trees: 500
No. of variables tried at each split: 4

The model results were as follows:

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
Final Report		Final Report
Presentation		Presentation
data		data
notebooks		notebooks
proposal		proposal
Data_Dictionary.txt		Data_Dictionary.txt
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MIDS 701 Final Project

Overview

Research questions

Data

Project plan

Final Model

About

Releases

Packages

Languages

srishtis/MIDS_701_Final_Project

Folders and files

Latest commit

History

Repository files navigation

MIDS 701 Final Project

Overview

Research questions

Data

Project plan

Final Model

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages