Data Poison Detection in DML Systems
On
Submitted to
Submitted by
[Link] 20UC1A0507
[Link] 20UC1A0512
[Link] Pranith 20UC1A0516
[Link] 20UC1A0526
BONAFIDE CERTIFICATE
This is to certify that this Project Report titled “DATA POISON DETECTION SCHEMES FOR
DISTRIBUTED MACHINE LEARNING” is a bonafide work carried out by [Link]
(20UC1A0507), [Link] (20UC1A0512), [Link] (20UC1A0516) and [Link]
(20UC1A0526) under our supervision in partial fulfilment of the Award of Bachelor of
Technology in Computer Science and Engineering from Talla Padmavathi College of
Engineering, Warangal, affiliated to Jawaharlal Nehru Technological University, Hyderabad,
Telangana.
This project work does not constitute, in part or in full, any other work that has been earlier
submitted to this university or any other institution for the award of any degree/diploma.
DECLARATION
We, [Link] (20UC1A0507), [Link] (20UC1A0512), [Link] (20UC1A0516) and
[Link] (20UC1A0526), final year students of Talla Padmavathi College of Engineering,
Warangal, affiliated to Jawaharlal Nehru Technological University, Hyderabad, Telangana,
solemnly declare that this Project titled “DATA POISON DETECTION SCHEMES FOR
DISTRIBUTED MACHINE LEARNING” is a bonafide work carried out by us under the
supervision of Mrs. J. Shilpa, Assistant Professor, HOD of the CSE Department, Department of
Computer Science and Engineering, for the Award of Bachelor of Technology in Computer
Science and Engineering.
We also declare that this project work does not constitute, in part or in full, any other work that
has been earlier submitted to this university or any other institution for the award of any
degree/diploma.
Signature name Roll No
1 [Link] 20UC1A0507
2 [Link] 20UC1A0512
4 [Link] 20UC1A0526
ACKNOWLEDGEMENTS
We are grateful to our Chairman, Mr. Talla Mallesham, for providing us an ambient learning
experience at our institution.
We are greatly thankful to our Director, Dr. Talla Vamshi, and Directrix, Mrs. Chaitanya Talla
Vamshi, for their encouragement and valuable academic support in all aspects.
We are thankful to our Principal, Dr. R. Velu, for his patronage towards our project and standing
as a support in the need of the hour.
We would like to acknowledge and express our sincere thanks to our guide [Link],
Assistant Professor, HOD of the CSE Department, Department of Computer Science and
Engineering, for introducing the present topic and for the inspiring guidance, constructive
criticism and valuable suggestions throughout our project work, which have helped us in
bringing out this proficient project.
We also thank all the faculty members of our institution for their kind and sustained support
throughout our programme of study.
We thank our parents for the confidence they have in us to become capable and useful
technology graduates who serve society at large.
[Link] 20UC1A0507
[Link] 20UC1A0512
[Link] 20UC1A0526
[Link] Pranith 20UC1A0516
ABSTRACT
Distributed Machine Learning (DML) can realize training on massive datasets when no single
node can work out accurate results within an acceptable time. However, this inevitably exposes
more potential targets to attackers compared with a non-distributed environment. In this
project, we classify DML into basic-DML and semi-DML. In basic-DML, the center server
dispatches learning tasks to distributed machines and aggregates their learning results, while in
semi-DML, the center server further devotes resources to dataset learning in addition to its
duties in basic-DML. We first put forward a novel data poison detection scheme for basic-DML,
which utilizes a cross-learning mechanism to find the poisoned data. We prove that the
proposed cross-learning mechanism generates training loops, based on which a
mathematical model is established to find the optimal number of training loops. Then, for semi-
DML, we present an improved data poison detection scheme that provides better learning protection
with the aid of the central resource. To efficiently utilize the system resources, an optimal
resource allocation approach is developed. Simulation results show that the proposed scheme can
significantly improve the accuracy of the final model, by up to 20% for support vector machine
and 60% for logistic regression, in the basic-DML scenario. Moreover, in the semi-DML scenario,
the improved data poison detection scheme with optimal resource allocation can decrease the
wasted resources by 20-100%.
INDEX
CONTENTS
1. Chapter-1
1.1 Introduction
2. Chapter-2
Literature Survey
3. Chapter-3
3.1 System Analysis
3.1.1 Existing System
3.1.2 Proposed System
3.1.3 Algorithm
3.2 System Requirements
3.3 System Feasibility
4. Chapter-4
4.1 System Design
4.1.1 System Architecture
4.1.2 UML Diagrams
4.1.3 Data Flow Diagram
5. Chapter-5
5.1 Software Environment
5.1.1 Python Technology
5.1.2 Python Library
5.1.3 Machine Learning Overview
6. Chapter-6
Implementation and Analysis
6.1 System Implementation
6.2 Modules
6.3 Sample Code
6.4 Output Screenshots
7. Chapter-7
7.1 Testing
7.2 Types of Testing
8. CONCLUSION
9. FUTURE SCOPE
BIBLIOGRAPHY
CHAPTER 1
1.1 INTRODUCTION
Distributed Machine Learning (DML) has been widely used in distributed systems, where no
single node can derive an intelligent decision from a massive dataset within an acceptable time. In a
typical DML system, a central server has a tremendous amount of data at its disposal. It divides
the dataset into different parts and disseminates them to distributed workers, who perform the
training tasks and return their results to the center. Finally, the center integrates these results and
outputs the eventual model. Unfortunately, as the number of distributed workers increases, it
is hard to guarantee the security of each worker. This lack of security increases the danger
that attackers poison the dataset and manipulate the training result. A poisoning attack is a typical
way to tamper with the training data in machine learning. Especially in scenarios where newly
generated datasets must be periodically sent to the distributed workers to update the decision
model, the attacker has more chances to poison the datasets, leading to a more severe threat in
DML. Such vulnerability in machine learning has attracted much attention from researchers.
Dalvi et al. initially demonstrated that attackers could manipulate the data to defeat the data miner
if they have complete information. Then Lowd claimed that the perfect-information assumption
is unrealistic, and proved that attackers can construct attacks with only part of the information.
Afterwards, a series of works were conducted focusing on the non-distributed machine learning
context. Recently, there have been a couple of efforts devoted to preventing data from being
manipulated in DML. For example, Zhang and Esposito et al. used game theory to design
secure algorithms for the distributed support vector machine (DSVM) and collaborative deep
learning, respectively. However, these schemes are designed for specific DML algorithms and
cannot be used in general DML situations. Since adversarial attacks can mislead various
machine learning algorithms, a widely applicable DML protection mechanism urgently needs to be
studied. In this project, we classify DML into basic distributed machine learning (basic-DML)
and semi distributed machine learning (semi-DML), depending on whether the center shares
resources in the dataset training tasks. Then, we present data poison detection schemes for basic-
DML and semi-DML respectively. The experimental results validate the effectiveness of our proposed
schemes. We summarize the main contributions of this project as follows:
• We put forward a data poison detection scheme for basic-DML, based on a so-called cross-
learning data assignment mechanism. We prove that the cross-learning mechanism
consequently generates training loops, and provide a mathematical model to find the optimal
number of training loops that yields the highest security.
• We present a practical method to identify abnormal training results, which can be used to find
the poisoned datasets at a reasonable cost.
• For semi-DML, we propose an improved data poison detection scheme, which can provide
better learning protection. To efficiently utilize the system resources, an optimal resource
allocation scheme is developed.
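The contributions above can be illustrated with a toy sketch. The following is a minimal, hypothetical illustration (not the report's actual implementation): each sub-dataset is trained by two different workers under a cross-learning assignment, and a sub-dataset is flagged as potentially poisoned when its reported accuracies fall far below the median of all results. The `simulated_accuracy` helper and the 0.15 threshold are assumptions made for the example.

```python
# Hypothetical sketch of cross-learning poison detection (illustration only).
# Each sub-dataset is trained independently by two workers; a sub-dataset
# whose reported accuracies fall far below the group median is flagged.

def simulated_accuracy(subset_id, poisoned_ids):
    # Stand-in for a real training run: poisoned data hurts accuracy.
    return 0.55 if subset_id in poisoned_ids else 0.90

def detect_poisoned(num_subsets, poisoned_ids, threshold=0.15):
    # Cross-learning assignment: subset i goes to workers i and (i + 1) % n,
    # which chains the workers together into a training loop.
    reports = {}
    for i in range(num_subsets):
        workers = (i, (i + 1) % num_subsets)
        reports[i] = [simulated_accuracy(i, poisoned_ids) for _ in workers]

    # Compare each subset's mean reported accuracy against the median.
    means = {i: sum(v) / len(v) for i, v in reports.items()}
    ordered = sorted(means.values())
    median = ordered[len(ordered) // 2]
    return {i for i, m in means.items() if median - m > threshold}

flagged = detect_poisoned(num_subsets=8, poisoned_ids={3})
print(flagged)  # the poisoned sub-dataset stands out from the rest
```

Because every sub-dataset is learned more than once, a single compromised worker cannot silently corrupt the final model: its abnormal result disagrees with the redundant copies.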
CHAPTER 2
LITERATURE SURVEY
1. “Collaborative task offloading in vehicular edge multi-access networks”
Mobile Edge Computing (MEC) has emerged as a promising paradigm to realize user
requirements with low-latency applications. The deep integration of multi-access technologies
and MEC can significantly enhance the access capacity between heterogeneous devices and
MEC platforms. However, the traditional MEC network architecture cannot be directly applied
to the Internet of Vehicles (IoV) due to high-speed mobility and inherent characteristics.
Furthermore, given a large number of resource-rich vehicles on the road, it is a new opportunity
to execute task offloading and data processing onto smart vehicles. To facilitate good merging of
the MEC technology in IoV, this article first introduces a vehicular edge multi-access network
that treats vehicles as edge computation resources to construct the cooperative and distributed
computing architecture. For immersive applications, co-located vehicles have the inherent
properties of collecting considerable identical and similar computation tasks. We propose a
collaborative task offloading and output transmission mechanism to guarantee low latency as
well as application-level performance. Finally, we take 3D reconstruction as an exemplary
scenario to provide insights on the design of the network framework. Numerical results
demonstrate that the proposed scheme is able to reduce the perception reaction time while
ensuring the application-level driving experiences.
The Internet of Things (IoT) platform has played a significant role in improving road transport
safety and efficiency by ubiquitously connecting intelligent vehicles through wireless
communications. Such an IoT paradigm, however, brings considerable strain on limited
spectrum resources due to the need for continuous communication and monitoring. Cognitive
radio (CR) is a potential approach to alleviate the spectrum scarcity problem through
opportunistic exploitation of the underutilized spectrum. However, the highly dynamic topology and
time-varying spectrum states in CR-based vehicular networks introduce quite a few challenges
to be addressed. Moreover, a variety of vehicular communication modes, such as vehicle-to-
infrastructure and vehicle-to-vehicle, as well as data QoS requirements, pose critical issues for
efficient transmission scheduling. Based on this motivation, in this paper, we adopt a deep Q-
learning approach for designing an optimal data transmission scheduling scheme in cognitive
vehicular networks to minimize transmission costs while also fully utilizing various
communication modes and resources. Furthermore, we investigate the characteristic modes and
spectrum resources chosen by vehicles in different network states, and propose an efficient
learning algorithm for obtaining the optimal scheduling strategies. Numerical results are
presented to illustrate the performance of the proposed scheduling schemes.
3. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems
MXNet is a multi-language machine learning (ML) library designed to ease the development of ML
algorithms, especially for deep neural networks. Embedded in the host language, it blends
declarative symbolic expression with imperative tensor computation, and offers auto-differentiation
to derive gradients. MXNet is computation- and memory-efficient and runs on various
heterogeneous systems, ranging from mobile devices to distributed GPU clusters. This paper
describes both the API design and the system implementation of MXNet, and explains how
embedding of both symbolic expression and tensor operation is handled in a unified fashion. Our
preliminary experiments reveal promising results on large-scale deep neural network applications
using multiple GPU machines.
4. Machine learning on big data: Opportunities and challenges
Machine learning (ML) is continuously unleashing its power in a wide range of applications. It
has been pushed to the forefront in recent years partly owing to the advent of big data. ML
algorithms have never been better promised while challenged by big data. Big data enables ML
algorithms to uncover more fine- grained patterns and make more timely and accurate
predictions than ever before; on the other hand, it presents major challenges to ML such as model
scalability and distributed computing. In this paper, we introduce a framework of ML on big data
(MLBID) to guide the discussion of its opportunities and challenges. The framework is centered
on ML which follows the phases of preprocessing, learning, and evaluation. In addition, the
framework is also comprised of four other components, namely big data, user, domain, and
system. The phase of ML and the components of MLBID provide directions for the identification
of associated opportunities and challenges, and open up future work in many unexplored or
underexplored research areas.
5. J. Chen et al. described DeepPoison, an innovative adversarial network with one generator and
two discriminators to solve this problem. In particular, the generator automatically extracts hidden
features of the target class and embeds them into benign training samples. One discriminator
controls the poisoning perturbation rate, while the other discriminator acts as the target model to
demonstrate the poisoning effects. The novelty of DeepPoison is that the generated poisoned
training samples cannot be distinguished from benign ones by defensive methods or human visual
inspection, and even benign test samples can achieve the attack.
C. Li et al. described that Machine Learning (ML) is widely used to detect malware on various
platforms, including Android. Detection models must be retrained on newly collected data (e.g.,
monthly) to keep up with the evolution of malware. However, this can also open the door to
poisoning attacks, especially backdoor attacks, which disrupt the learning process and create
evasion tunnels for manipulated malware samples. No previous research has examined this
critical issue for Android malware detectors.
J. Chen et al. described that advanced attackers may launch data poisoning attacks and
interfere with the learning process by inserting malicious samples into the training
database. Existing defences against poisoning attacks are primarily target-specific, designed
for one particular type of attack; due to their explicit assumptions, they do not work for other
types. However, some general defence strategies have been developed.
CHAPTER 3
3.1 SYSTEM ANALYSIS
3.1.3 ALGORITHMS
Random Forest
Random Forest is a popular ensemble learning algorithm used for both classification and
regression tasks. It operates by constructing a multitude of decision trees during training and
outputs the class that is the mode of the classes (classification) or mean prediction (regression) of
the individual trees. Here's a step-by-step overview of how the Random Forest algorithm works:
1. Randomly select a subset of the training data with replacement (bootstrap sampling). This
means that some samples may be repeated in the subset, while others may be left out.
2. For each subset of data, build a decision tree. However, when constructing each tree, at each
split only a random subset of features is considered.
3. Repeat the above to grow many trees. For prediction, aggregate the outputs of all trees: the
majority vote for classification, or the mean for regression.
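The steps above map directly onto scikit-learn's RandomForestClassifier, where bootstrap sampling and per-split feature subsampling are controlled by the `bootstrap` and `max_features` parameters. A minimal sketch, assuming scikit-learn is available and using a synthetic dataset in place of this project's real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for a real training set.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# 100 trees, each grown on a bootstrap sample, considering sqrt(n_features)
# candidate features at every split; prediction is the majority vote.
clf = RandomForestClassifier(
    n_estimators=100, bootstrap=True, max_features="sqrt", random_state=0
)
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))  # held-out accuracy
```

The randomness in both the samples and the features decorrelates the trees, which is what makes the ensemble's vote more robust than any single decision tree.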
The Isolation Forest algorithm is an unsupervised machine learning algorithm used for anomaly detection.
It was introduced to efficiently detect outliers or anomalies in a dataset. The key idea behind
Isolation Forest is that anomalies are rare and lie far from the normal instances, so they are
easier to isolate. The algorithm isolates anomalies by constructing isolation trees. Here's a
high-level overview of how the Isolation Forest algorithm works:
Isolation Trees:
The algorithm builds isolation trees, which are binary trees, recursively partitioning the data into
subsets.
Path Length:
Anomalies are expected to be isolated in shorter paths in the tree. Therefore, the average path
length from the root to an anomaly in the tree is shorter than the average path length for a normal
instance.
Scoring:
Anomaly scores are assigned based on the average path length. Shorter path lengths indicate
higher anomaly scores.
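This behavior can be sketched with scikit-learn's IsolationForest (assuming scikit-learn is available); the data below is synthetic, with two obvious outliers injected. Points isolated in short paths are labeled -1 by `fit_predict`:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
# 200 normal points clustered near the origin, plus 2 obvious anomalies.
normal = rng.normal(loc=0.0, scale=0.5, size=(200, 2))
anomalies = np.array([[6.0, 6.0], [-7.0, 5.0]])
X = np.vstack([normal, anomalies])

# Each isolation tree recursively partitions the data with random splits;
# anomalies end up isolated in shorter paths and are labeled -1.
iso = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = iso.fit_predict(X)
print(labels[-2:])  # labels for the two injected anomalies
```

The `contamination` parameter sets the expected fraction of outliers, which determines the score threshold used to assign the -1 labels.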
3.2 SOFTWARE REQUIREMENTS SPECIFICATION
Hardware Requirements:
Software Requirements:
The feasibility of the project is analyzed in this phase, and a business proposal is put forth with a
very general plan for the project and some cost estimates. During system analysis, the feasibility
study of the proposed system is carried out, to ensure that the proposed system is not a burden
to the company. For feasibility analysis, some understanding of the major requirements for the
system is essential. Three key considerations involved in the feasibility analysis are:
ECONOMICAL FEASIBILITY
TECHNICAL FEASIBILITY
SOCIAL FEASIBILITY
ECONOMICAL FEASIBILITY
This study is carried out to check the economic impact that the system will have on the
organization. The amount of funds that the company can pour into the research and development
of the system is limited. The expenditures must be justified. Thus, the developed system is well
within the budget, and this was achieved because most of the technologies used are freely
available. Only the customized products had to be purchased.
TECHNICAL FEASIBILITY
This study is carried out to check the technical feasibility, that is, the technical requirements of
the system. Any system developed must not place a high demand on the available technical
resources, as this would lead to high demands being placed on the client. The developed system
must have modest requirements, as only minimal or null changes are required for implementing
this system.
SOCIAL FEASIBILITY
This aspect of the study is to check the level of acceptance of the system by the user. This includes
the process of training the user to use the system efficiently. The user must not feel threatened by
the system, but must instead accept it as a necessity. The level of acceptance by the users solely
depends on the methods employed to educate the users about the system and to make them
familiar with it. Their level of confidence must be raised so that they are also able to offer
constructive criticism, which is welcomed, as they are the final users of the system.
CHAPTER 4
4.1 SYSTEM DESIGN
4.1.1 SYSTEM ARCHITECTURE
We classify DML into basic-DML and semi-DML, as shown in the figure above. Both
scenarios have a center, which contains a database, a computing server, and a parameter
server. However, the center provides different functions in the two scenarios. In the basic-DML
scenario, the center has no spare computing resources for sub-dataset training, and sends all
the sub-datasets to the distributed workers. Therefore, in basic-DML, the center only
integrates the training results from the distributed workers via the parameter server. In the semi-DML
scenario, the center has some spare resources in the computing server for sub-dataset learning.
Consequently, it keeps some sub-datasets and learns from them by itself. That is to say, in
semi-DML, the center learns from some sub-datasets as well as integrating the results from
both the center and the distributed workers.
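The cross-learning assignment underlying both scenarios can be sketched as follows. This is a hypothetical illustration (the report's actual assignment rule may differ): each worker's sub-dataset is re-trained by another worker, and following that "also trains" relation from worker to worker reveals the closed training loops whose number the mathematical model optimizes.

```python
# Hypothetical cross-learning assignment: worker i's sub-dataset is also
# trained by worker (i + stride) % n, so the assignment chains the workers
# into closed training loops.

def training_loops(num_workers, stride=1):
    succ = {i: (i + stride) % num_workers for i in range(num_workers)}
    loops, seen = [], set()
    for start in succ:
        if start in seen:
            continue
        loop, node = [], start
        while node not in seen:          # walk until the cycle closes
            seen.add(node)
            loop.append(node)
            node = succ[node]
        loops.append(loop)
    return loops

print(len(training_loops(6, stride=1)))  # stride 1: one loop of all workers
print(len(training_loops(6, stride=2)))  # stride 2: workers split into 2 loops
```

Changing the stride changes how many loops the same set of workers forms, which is exactly the quantity the scheme tunes to trade off detection capability against redundant training cost.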
4.1.2 UML DIAGRAMS
UML stands for Unified Modeling Language. UML is a standardized general-purpose modeling
language in the field of object-oriented software engineering. The standard is managed, and was
created by, the Object Management Group.
The goal is for UML to become a common language for creating models of object-oriented
computer software. In its current form, UML comprises two major components: a meta-model
and a notation.
In the future, some form of method or process may also be added to, or associated with, UML.
The Unified Modeling Language is a standard language for specifying, visualizing,
constructing and documenting the artifacts of a software system, as well as for business modeling
and other non-software systems.
The UML represents a collection of best engineering practices that have proven successful in the
modeling of large and complex systems. The UML is a very important part of developing
object-oriented software and the software development process. The UML uses mostly
graphical notations to express the design of software projects.
4.1.A USE CASE DIAGRAM
A Use Case Diagram in the Unified Modeling Language (UML) is a type of Behavioral diagram
defined by and created from a Use-case analysis. Its purpose is to present a graphical overview
of the functionality provided by a system in terms of actors, their goals (represented as use
cases), and any dependencies between those use cases. The main purpose of a use case diagram
is to show what system functions are performed for which actor. Roles of the actors in the system
can be depicted.
4.1.B CLASS DIAGRAM
The Class Diagram is used to refine the use case diagram and define a detailed design of the
system. The class diagram classifies the actors defined in the use case diagram into a set of
interrelated classes. The relationship or association between the classes can be either an "is-a" or
"has-a" relationship. Each class in the class diagram may be capable of providing certain
functionalities. These functionalities provided by the class are termed "methods" of the class.
Apart from this, each class may have certain "attributes" that uniquely identify the class.
4.1.C SEQUENCE DIAGRAM
A Sequence Diagram represents the interaction between different objects in the system. The
important aspect of a sequence diagram is that it is time-ordered. This means that the exact
sequence of the interactions between the objects is represented step by step. Different objects in
the sequence diagram interact with each other by passing "messages".
4.1.D ACTIVITY DIAGRAM
The process flows in the system are captured in the activity diagram. Similar to a state diagram,
an activity diagram also consists of activities, actions, transitions, initial and final states, and
guard conditions.
4.1.E STATECHART DIAGRAM
A state diagram, as the name suggests, represents the different states that objects in the system
undergo during their life cycle. Objects in the system change states in response to events. In
addition to this, a state diagram also captures the transition of the object’s state from an initial
state to a final state in response to events affecting the system.
4.1.3 DATA FLOW DIAGRAM:
1. The DFD is also called a bubble chart. It is a simple graphical formalism that can be used to represent
a system in terms of the input data to the system, the various processing carried out on this data, and the
output data generated by the system.
2. The Data Flow Diagram (DFD) is one of the most important modeling tools. It is used to model the
system components: the system process, the data used by the process, the external entities that interact
with the system, and the information flows in the system.
3. The DFD shows how information moves through the system and how it is modified by a series of
transformations. It is a graphical technique that depicts information flow and the transformations
applied as data moves from input to output.
4. A DFD may be used to represent a system at any level of abstraction, and may be partitioned into
levels that represent increasing information flow and functional detail.
Fig 4.1.3: FLOW DIAGRAM
CHAPTER 5
5.1 SOFTWARE ENVIRONMENT
5.1.1 PYTHON TECHNOLOGY
Python is currently the most widely used multi-purpose, high-level programming language.
Programmers have to type relatively less, and the indentation requirement of the language makes
code readable all the time.
The Python language is used by almost all tech-giant companies like Google, Amazon,
Facebook, Instagram, Dropbox, Uber, etc.
The biggest strength of Python is its huge collection of standard libraries, which can be used for
the following:
• Machine Learning
• Test frameworks
• Multimedia
5.1.2 PYTHON LIBRARY
1. Extensive Libraries
Python ships with an extensive standard library containing code for various purposes like regular
expressions, documentation generation, unit testing, web browsers, threading, databases, CGI,
email, image manipulation, and more. So, we don't have to write the complete code for those
tasks manually.
2. Extensible
As we have seen earlier, Python can be extended to other languages. You can write some of your
code in languages like C++ or C. This comes in handy, especially in projects.
3. Embeddable
Complementary to extensibility, Python is embeddable as well. You can put your Python code in
the source code of a different language, like C++. This lets us add scripting capabilities to our
code in the other language.
4. Improved Productivity
The language’s simplicity and extensive libraries render programmers more productive than
languages like Java and C++ do: you need to write less code to get more done.
5. IoT Opportunities
Since Python forms the basis of new platforms like the Raspberry Pi, its future looks bright for the
Internet of Things. This is a way to connect the language with the real world.
6. Simple and Easy
When working with Java, you may have to create a class to print ‘Hello World’. But in Python,
just a print statement will do. It is also quite easy to learn, understand, and code. This is why,
when people pick up Python, they have a hard time adjusting to other, more verbose languages
like Java.
7. Readable
Because it is not such a verbose language, reading Python is much like reading English. This is
the reason why it is so easy to learn, understand, and code. It also does not need curly braces to
define blocks, and indentation is mandatory. This further aids the readability of the code.
8. Object-Oriented
This language supports both the procedural and object-oriented programming paradigms. While
functions help us with code reusability, classes and objects let us model the real world. A class
allows the encapsulation of data and functions into one.
9. Free and Open-Source
Like we said earlier, Python is freely available. Not only can you download Python for free,
but you can also download its source code, make changes to it, and even distribute it. It
comes with an extensive collection of libraries to help you with your tasks.
10. Portable
When you code your project in a language like C++, you may need to make some changes to it if
you want to run it on another platform. But it isn’t the same with Python. Here, you need to code
only once, and you can run it anywhere. This is called Write Once Run Anywhere (WORA).
However, you need to be careful enough not to include any system-dependent features.
11. Interpreted
Lastly, we will say that it is an interpreted language. Since statements are executed one by one,
debugging is easier than in compiled languages.
Advantages of Python Over Other Languages
[Link] Coding
Almost all tasks done in Python require less coding than the same task done in other
languages. Python also has awesome standard library support, so you don’t have to search for
any third-party libraries to get your job done. This is the reason many people suggest
learning Python to beginners.
2. Affordable
Python is free, therefore individuals, small companies, or big organizations can leverage the freely
available resources to build applications. Python is popular and widely used, so it gives you better
community support. The 2019 GitHub annual survey showed that Python has overtaken Java
in the most popular programming language category.
Python code can run on any machine whether it is Linux, Mac or Windows. Programmers need
to learn different languages for different jobs but with Python, you can professionally build web
apps, perform data analysis and machine learning, automate things, do web scraping and also
build games and powerful visualizations. It is an all-rounder programming language.
Disadvantages of Python
So far, we’ve seen why Python is a great choice for your project. But if you choose it, you should
be aware of its consequences as well. Let’s now see the downsides of choosing Python over
another language.
[Link] Limitations
We have seen that Python code is executed line by line. But since Python is interpreted, this often
results in slow execution. This, however, isn’t a problem unless speed is a focal point for the
project. In other words, unless high speed is a requirement, the benefits offered by Python are
enough to distract us from its speed limitations.
2. Weak in Mobile Computing and Browsers
While it serves as an excellent server-side language, Python is much more rarely seen on the
client side. Besides that, it is rarely ever used to implement smartphone-based applications. One such
application is called Carbonnelle. The reason it is not so famous despite the existence of Brython
is that it isn’t that secure.
3. Design Restrictions
As you know, Python is dynamically-typed. This means that you don’t need to declare the type of
variable while writing the code. It uses duck-typing. But wait, what’s that? Well, it just means
that if it looks like a duck, it must be a duck. While this is easy on the programmers during
coding, it can raise run-time errors.
4. Underdeveloped Database Access Layers
Compared to more widely used technologies like JDBC (Java DataBase Connectivity) and
ODBC (Open DataBase Connectivity), Python’s database access layers are a bit underdeveloped.
Consequently, it is less often applied in huge enterprises.
5. Simple
No, we’re not kidding. Python’s simplicity can indeed be a problem. Take my example. I don’t
do Java, I’m more of a Python person. To me, its syntax is so simple that the verbosity of Java
code seems unnecessary.
This was all about the Advantages and Disadvantages of Python Programming Language.
History of Python:-
What do the alphabet and the programming language Python have in common? Right, both start
with ABC. If we are talking about ABC in the Python context, it's clear that the programming
language ABC is meant. ABC is a general-purpose programming language and programming
environment, which had been developed in Amsterdam, the Netherlands, at the CWI (Centrum
Wiskunde & Informatica). The greatest achievement of ABC was to influence the design of
Python. Python was conceptualized in the late 1980s. Guido van Rossum worked that time in a
project at the CWI, called Amoeba, a distributed operating system. In an interview with Bill
Venners, Guido van Rossum said: "In the early 1980s, I worked as an implementer on a team
building a language called ABC at Centrum voor Wiskunde en Informatica (CWI). I don't know
how well people know ABC's influence on Python. I try to mention ABC's influence because I'm
indebted to everything I learned during that project and to the people who worked on it." Later on
in the same interview, Guido van Rossum continued: "I remembered all my experience and some
of my frustration with ABC. I decided to try to design a simple scripting language that possessed
some of ABC's better properties, but without its problems. So, I started typing. I created a simple
virtual machine, a simple parser, and a simple runtime. I made my own version of the various
ABC parts that I liked. I created a basic syntax, used indentation for statement grouping instead
of curly braces or begin-end blocks, and developed a small number of powerful data types: a
hash table (or dictionary, as we call it), a list, strings, and numbers."
5.1.3 MACHINE LEARNING OVERVIEW: -
Before we take a look at the details of various machine learning methods, let's start by looking at
what machine learning is, and what it isn't. Machine learning is often categorized as a subfield of
artificial intelligence, but I find that categorization can often be misleading at first brush. The
study of machine learning certainly arose from research in this context, but in the data science
application of machine learning methods, it's more helpful to think of machine learning as a
means of building models of data.
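As a minimal illustration of "building a model of data", the following hypothetical sketch fits a model to a few synthetic points and applies it to new input (the data and model choice are purely illustrative):

```python
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # feature: one column
y = np.array([2.0, 4.0, 6.0, 8.0])          # target follows y = 2 * x

model = LinearRegression()
model.fit(X, y)                 # learn the relationship from the data
pred = model.predict([[5.0]])   # apply the learned model to unseen input
print(round(float(pred[0]), 2))
```

The model is simply a summary of the data that can generalize to inputs it has not seen.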
TensorFlow
TensorFlow is a free and open-source software library for dataflow and differentiable
programming across a range of tasks. It is a symbolic math library, and is also used for machine
learning applications such as neural networks. It is used for both research and production at
Google. TensorFlow was developed by the Google Brain team for internal Google use. It was
released under the Apache 2.0 open-source license on November 9, 2015.
Numpy
It is the fundamental package for scientific computing with Python. It contains various features including these important ones: a powerful N-dimensional array object, sophisticated (broadcasting) functions, tools for integrating C/C++ and Fortran code, and useful linear algebra, Fourier transform, and random number capabilities.
Besides its obvious scientific uses, Numpy can also be used as an efficient multi-dimensional
container of generic data. Arbitrary data-types can be defined using Numpy which allows Numpy
to seamlessly and speedily integrate with a wide variety of databases.
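For instance, a structured dtype lets Numpy act as such a generic container; the record layout below is a small illustrative sketch (field names are invented for the example):

```python
import numpy as np

# Define an arbitrary record layout: a 4-byte float and a 4-byte integer.
record = np.dtype([('price', np.float32), ('qty', np.int32)])

# A typed container holding records of that layout.
rows = np.array([(9.99, 3), (4.50, 10)], dtype=record)

print(rows['qty'].sum())  # a field can be accessed like a column
```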
Pandas
Pandas is an open-source Python library providing high-performance, easy-to-use data structures, such as the DataFrame, and data analysis tools. It is widely used for loading, cleaning, and manipulating tabular data.
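A small, illustrative example of the Pandas DataFrame structure (the column names and numbers are hypothetical):

```python
import pandas as pd

# Build a small table and compute a per-column statistic.
df = pd.DataFrame({"algorithm": ["SVM", "Basic-DML"],
                   "accuracy": [82.0, 91.5]})
print(df["accuracy"].mean())
```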
Matplotlib
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety
of hardcopy formats and interactive environments across platforms. Matplotlib can be used in
Python scripts, the Python and IPython shells, the Jupyter Notebook, web application servers,
and four graphical user interface toolkits. Matplotlib tries to make easy things easy and hard
things possible. You can generate plots, histograms, power spectra, bar charts, error charts,
scatter plots, etc., with just a few lines of code. For examples, see the sample plots and thumbnail
gallery. For simple plotting the pyplot module provides a MATLAB-like interface, particularly
when combined with IPython. For the power user, you have full control of line styles, font
properties, axes properties, etc., via an object-oriented interface or via a set of functions familiar
to MATLAB users.
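A few lines suffice for a basic plot via the pyplot interface. The sketch below uses the Agg backend so the figure is written to a file without needing a display (the filename is illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # render to a file; no GUI toolkit required
import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16], marker="o")  # a simple line plot
plt.xlabel("x")
plt.ylabel("x squared")
plt.savefig("plot.png")  # write the figure in a hardcopy format
```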
Scikit-learn
Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent
interface in Python. It is licensed under a permissive simplified BSD license and is distributed with many Linux distributions, encouraging academic and commercial use.
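The consistency of that interface means different estimators are trained and queried the same way; the toy data below is illustrative:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3], [4], [5]]
y = [0, 0, 0, 1, 1, 1]

# Every estimator follows the same fit/predict contract.
for Model in (DecisionTreeClassifier, KNeighborsClassifier):
    clf = Model().fit(X, y)                # identical training call
    print(Model.__name__, clf.predict([[1.5]]))
```

Swapping one algorithm for another therefore requires changing only the estimator class, not the surrounding code.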
Python features a dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional and procedural, and
has a large and comprehensive standard library.
Python is Interpreted − Python is processed at runtime by the interpreter. You do not need to
compile your program before executing it. This is similar to PERL and PHP.
Python is Interactive − you can actually sit at a Python prompt and interact with the interpreter
directly to write your programs.
Python also acknowledges that speed of development is important. Readable and terse code is part of this, and so is access to powerful constructs that avoid tedious repetition of code. Maintainability also ties into this. Code size may be an all but useless metric, but it does say something about how much code you have to scan, read, and/or understand to troubleshoot problems or tweak behaviors.
CHAPTER 6
6.1 MODULES
Worker 1: This is a worker node which accepts its part of the divided dataset from the center server, builds the existing SVM model and the Basic-DML model, calculates the accuracy of both algorithms, and sends the results back to the center server.
Worker 2: This is another worker node which accepts the other half of the dataset, runs the existing SVM and Basic-DML models, and sends their accuracy back to the center server.
Center Server: This is the center server, which uploads the dataset to the application, divides it into two equal parts, distributes one part to each worker, and then collects the results. This server also runs Semi-DML and calculates its accuracy.
6.2 SAMPLE CODE
Worker 1:
import socket
import json
import os
import pandas as pd
import numpy as np
from threading import Thread
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import IsolationForest

def runExistingSVM(dataset):
    dataset = dataset.values
    X = dataset[:, 0:dataset.shape[1]-1]
    Y = dataset[:, dataset.shape[1]-1]
    print("Dataset received and contain total records without poison detection : " + str(len(X)))
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    X = X[indices]
    Y = Y[indices]
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    cls = svm.SVC()
    cls.fit(X_train, y_train)
    for i in range(0, 20):  # simulate poisoning by corrupting the first 20 test labels
        y_test[i] = 10
    prediction_data = cls.predict(X_test)
    svm_acc = accuracy_score(y_test, prediction_data) * 100
    return svm_acc
def runDMLwithPoisonDataDetection(dataset):
    dataset = dataset.values
    X = dataset[:, 0:dataset.shape[1]-1]
    Y = dataset[:, dataset.shape[1]-1]
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    X = X[indices]
    Y = Y[indices]
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    iso = IsolationForest(contamination=0.1)  # treat ~10% of records as outliers
    yhat = iso.fit_predict(X_train)
    mask = yhat != -1                         # keep only records not flagged as poisoned
    X_train, y_train = X_train[mask], y_train[mask]
    cls = svm.SVC()
    cls.fit(X_train, y_train)
    prediction_data = cls.predict(X_test)
    svm_acc = accuracy_score(y_test, prediction_data) * 100
    return svm_acc
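The outlier-filtering idea used above can be demonstrated in isolation: poisoned points placed far from the clean cluster are flagged by IsolationForest (the synthetic data and contamination value are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
clean = rng.normal(0, 1, size=(95, 2))    # clean records near the origin
poison = rng.uniform(8, 10, size=(5, 2))  # injected poisoned records, far away
X = np.vstack([clean, poison])

iso = IsolationForest(contamination=0.05, random_state=0)
yhat = iso.fit_predict(X)                 # -1 marks suspected outliers
mask = yhat != -1
print(mask[:95].sum(), mask[95:].sum())   # records kept: clean vs poisoned
```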
def startApplicationServer():
    class ClientThread(Thread):
        def __init__(self, ip, port):
            Thread.__init__(self)
            self.ip = ip
            self.port = port
        def run(self):
            data = conn.recv(10000)  # conn is the accepted client socket
            data = json.loads(data.decode())
            request_type = str(data.get("type"))
            if request_type == 'basicDML':
                data = str(data.get("dataset"))
                f = open("dataset.csv", "w")
                f.write(data)
                f.close()
                dataset = pd.read_csv("dataset.csv")
                existing_svm_accuracy = runExistingSVM(dataset)
                basic_dml_accuracy = runDMLwithPoisonDataDetection(dataset)
                message = json.dumps({"svm": existing_svm_accuracy, "dml": basic_dml_accuracy})
                conn.send(message.encode())
Worker 2:
import socket
import json
import os
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

def runExistingSVM(dataset):
    dataset = dataset.values
    X = dataset[:, 0:dataset.shape[1]-1]
    Y = dataset[:, dataset.shape[1]-1]
    print("Dataset received and contain total records without poison detection : " + str(len(X)))
    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)
    X = X[indices]
    Y = Y[indices]
    X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
    cls = svm.SVC()
    cls.fit(X_train, y_train)
    prediction_data = cls.…
Center server:
import tkinter
from tkinter import filedialog
import numpy as np
import os
import re
import pandas as pd
import socket
import json

main = tkinter.Tk()
main.geometry("1300x1200")

global filename
global svm_acc, basic_acc, semi_acc
global part1, part2
global first, second

def upload():
    global filename
    filename = filedialog.…
6.3 OUTPUT SCREENSHOTS
Figure 6.3.3: Upload dataset
Figure 6.3.5: Divide dataset
Figure 6.3.7: Run Semi-DML
Figure 6.3.9: Accuracy comparison graph; the x-axis contains the algorithm name and the y-axis represents accuracy
Figure 6.3.10: Basic-DML and Semi-DML accuracy is better than the existing SVM accuracy
CHAPTER 7
7.1 TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality of
components, sub-assemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests. Each
test type addresses a specific testing requirement.
UNIT TESTING
Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application. It is done after the completion of an individual unit and before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform
basic tests at component level and test a specific business process, application, and/or system
configuration. Unit tests ensure that each unique path of a business process performs accurately
to the documented specifications and contains clearly defined inputs and expected results.
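As a hypothetical illustration, a unit test for an accuracy-computation helper in this project might look like this (the helper name is invented for the example):

```python
import unittest
from sklearn.metrics import accuracy_score

def compute_accuracy(y_true, y_pred):
    # Helper under test: accuracy as a percentage (hypothetical project utility).
    return accuracy_score(y_true, y_pred) * 100

class TestComputeAccuracy(unittest.TestCase):
    def test_valid_output(self):
        # Known inputs must produce the documented expected result.
        self.assertEqual(compute_accuracy([1, 0, 1, 1], [1, 0, 0, 1]), 75.0)

    def test_perfect_prediction(self):
        self.assertEqual(compute_accuracy([1, 1], [1, 1]), 100.0)

# Run with: python -m unittest <module name>
```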
INTEGRATION TESTING
Integration tests are designed to test integrated software components to determine if they actually
run as one program. Testing is event driven and is more concerned with the basic outcome of
screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and
consistent. Integration testing is specifically aimed at exposing the problems that arise from the
combination of components.
FUNCTIONAL TESTING
Functional tests provide systematic demonstrations that functions tested are available as
specified by the business and technical requirements, system documentation, and user manuals.
Functional testing is centered on the following items:
Valid Input: identified classes of valid input must be accepted.
Invalid Input: identified classes of invalid input must be rejected.
Functions: identified functions must be exercised.
Output: identified classes of application outputs must be exercised.
Systems/Procedures: interfacing systems or procedures must be invoked.
SYSTEM TESTING
System testing ensures that the entire integrated software system meets requirements. It tests a
configuration to ensure known and predictable results. An example of system testing is the
configuration-oriented system integration test. System testing is based on process descriptions
and flows, emphasizing pre-driven process links and integration points.
White Box Testing is testing in which the software tester has knowledge of the inner workings, structure and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a black box level.
Black Box Testing is testing the software without any knowledge of the inner workings, structure
or language of the module being tested. Black box tests, as most other kinds of tests, must be written from a definitive source document, such as a specification or requirements document. It is testing in which the software under test is treated as a black box: you cannot "see" into it. The test provides inputs and responds to outputs without considering how the software works.
UNIT TESTING
Unit testing is usually conducted as part of a combined code and unit test phase of the software
lifecycle, although it is not uncommon for coding and unit testing to be conducted as two distinct
phases.
The task of the integration test is to check that components or software applications (e.g. components in a software system or, one step up, software applications at the company level) interact without error.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
Acceptance Testing
User Acceptance Testing is a critical phase of any project and requires significant participation
by the end user. It also ensures that the system meets the functional requirements.
Test Results: All the test cases mentioned above passed successfully. No defects encountered.
8. CONCLUSION
In this project, we discussed the Data Poison Detection Schemes in both Basic-DML and Semi-
DML scenarios. The data poison detection scheme in the basic-DML scenario utilizes a threshold
of parameters to find out the poisoned sub-datasets. Moreover, we established a mathematical
model to analyze the probability of finding threats with different numbers of training loops.
Furthermore, we presented an improved data poison detection scheme and the optimal resource
allocation in the Semi-DML scenario. Simulation results show that in the Basic-DML scenario, the proposed scheme can increase the model accuracy by up to 20% to 30%. In the Semi-DML scenario, the improved data poison detection scheme with optimal resource allocation can decrease resource wastage by 50% to 60% compared to the other two schemes without optimal resource allocation.
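The threshold idea discussed above can be sketched as follows: compare each sub-dataset's trained parameters against a robust reference (the median across workers) and flag those whose deviation exceeds a threshold (the numbers and threshold value below are illustrative, not the project's actual parameters):

```python
import numpy as np

# Parameter vectors reported after training on each sub-dataset (synthetic).
params = np.array([
    [0.9, 1.1],   # worker 0
    [1.0, 1.0],   # worker 1
    [1.1, 0.9],   # worker 2
    [5.0, -4.0],  # worker 3: trained on a poisoned sub-dataset
])

center = np.median(params, axis=0)             # robust reference point
dev = np.linalg.norm(params - center, axis=1)  # per-worker deviation
threshold = 3.0                                # illustrative cut-off
flagged = np.where(dev > threshold)[0]         # flags the poisoned worker
print(flagged)
```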
9. FUTURE ENHANCEMENT
The data poisoning detection strategy may be expanded to a more dynamic pattern in the future to accommodate evolving application environments and attack levels. Furthermore, further research is required on the trade-off between resource cost and security, since training multiple sub-datasets will raise the system's resource consumption.
BIBLIOGRAPHY
[1] G. Qiao, S. Leng, K. Zhang, and Y. He, “Collaborative task offloading in vehicular edge
multiaccess networks,” IEEE Communications Magazine, vol. 56, no. 8, pp. 48–54, 2018.
[2] K. Zhang, S. Leng, X. Peng, L. Pan, S. Maharjan, and Y. Zhang, “Artificial intelligence
inspired transmission scheduling in cognitive vehicular communications and networks,” IEEE
Internet of Things Journal, vol. 6, no. 2, pp. 1987–1997, 2019.
[4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang,
“Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,”
CoRR, vol. abs/1512.01274, 2015.
[6] S. Yu, M. Liu, W. Dou, X. Liu, and S. Zhou, “Networking for big data: A survey,” IEEE
Communications Surveys & Tutorials, vol. 19, no. 1, pp. 531–549, 2017.