Unit6 - Working With Data

The document discusses working with data in Python using Pandas. It covers reading and writing files, loading data with Pandas, and working with and saving Pandas dataframes. Specifically, it shows how to open, read, and write files, load CSV data into a dataframe, access columns and rows of a dataframe, slice dataframes, and describes additional dataframe methods and properties.

Uploaded by

vvloggingzone05

WORKING WITH DATA

Ms. Margaret Salve,

Department of Computer Science
• READING AND WRITING FILES
• LOADING DATA WITH PANDAS
• WORKING WITH AND SAVING DATA WITH PANDAS
• DATA CLEANING IN PYTHON
File Handling in Python:
• Python has various functions and methods for file handling:
open() – opens a file in one of various modes
read() – reads the contents of a file
write() – writes contents to a file
close() – closes a file
Opening a File:
• Python provides an open() function that accepts two arguments: the filename and the mode in which to open the file
• open() returns a file object, which is used to perform the various file operations
Syntax:

fileobj = open("filename", "access mode")

where,
access modes are: r, w, a, rb, wb, ab etc.

Example:

fileobj = open("demo.txt", "r")
if fileobj:
    print("File opened successfully")

Output:

File opened successfully
Reading a File:
• The read() method reads all the contents of the opened file
• readline() reads only a single line from the file
• read(count) reads the specified count of characters from the file
• After read() reaches the end of the file, seek(0) moves back to the beginning so the file can be read again
Example:
# File handling
fileobj = open("demo.txt", "r")      # Opening a file
if fileobj:
    print("File opened successfully")
print("All Contents")
print(fileobj.read())                # Read all contents of the file
fileobj.seek(0)                      # Return to the start of the file
print("Only 10 characters")
print(fileobj.read(10))              # Read only the first 10 characters of the file
fileobj.seek(0)                      # Return to the start of the file
print("Single Line")
print(fileobj.readline())            # Read only a single line of the file

Output:
File opened successfully
All Contents
Hello Everyone
Welcome to Python Programming
Only 10 characters
Hello Ever
Single Line
Hello Everyone
Writing to/Creating a File:
• To create a new file or write to an existing file, the file needs to be opened in "w" mode (which overwrites any existing contents) or "a" mode (which appends to the end of the file)
Example:

fileptr = open("new.txt","a")
fileptr.write("This is a new file")

Closing a File:
• Once all the operations on the file are done, we must close it from our Python script using the close() method.

Example:
fileptr.close()
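The open, write, read, and close steps above can also be sketched with a with statement, which closes the file automatically when the block ends. This is only an illustrative sketch: the temporary directory and the second line of text are assumptions added here so the example can run without touching existing files.

```python
import os
import tempfile

# Hypothetical location: a temporary directory, so no real file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "new.txt")

# "w" creates the file (or empties an existing one) before writing.
with open(path, "w") as fileptr:
    fileptr.write("This is a new file\n")

# "a" keeps the existing contents and adds to the end.
with open(path, "a") as fileptr:
    fileptr.write("A second line\n")

# "r" opens the file for reading; readlines() returns a list of lines.
with open(path, "r") as fileobj:
    lines = fileobj.readlines()

print(lines)
print(fileobj.closed)   # The file object is already closed here
```

With this form an explicit close() call is not needed, even if an error occurs inside the block.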
NumPy in Python:
• NumPy (Numerical Python) is a library used for computation and processing of single-dimensional and multidimensional arrays.

• NumPy arrays work faster than lists because their elements are stored in one contiguous block of memory.
• The NumPy package needs to be installed using the following command:
pip install numpy
• Once installed, the package has to be imported in the Python program using the import keyword
• The n-dimensional array object in NumPy is known as ndarray
• We can create a NumPy ndarray object by using the array() function
Example:
import numpy as np                 # Import the numpy package
arr = np.array([1, 2, 3, 4, 5])    # Creating an ndarray object
print(arr)
print(type(arr))

Output:
[1 2 3 4 5]
<class 'numpy.ndarray'>

Array Indexing and Slicing:

• Any array element can be accessed by using the index number
• The indexes in NumPy arrays start with 0
• Like lists, arrays can also be sliced using the slice operator [:]

Type of array | Creating the array | Accessing the element using index
1-D array | arr = np.array([1, 2, 3, 4]) | print(arr[1])  # Prints 2
2-D array | arr = np.array([[1,2,3,4,5], [6,7,8,9,10]]) | print(arr[1, 3])  # Prints 9
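The slice operator mentioned above is not demonstrated in the table, so here is a small sketch of it on the same kind of arrays. The variable names are chosen only for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
first_three = arr[:3]      # Elements at positions 0, 1 and 2
middle = arr[1:4]          # Positions 1 to 3 (the end position 4 is excluded)
every_other = arr[::2]     # Every second element, starting at position 0

arr2d = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
second_row = arr2d[1, :]   # All columns of the row at position 1
sub_block = arr2d[:, 1:3]  # Columns 1 and 2 of every row

print(first_three, middle, every_other)
print(second_row)
print(sub_block)
```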
Pandas Library:
• Python Pandas is an open-source library that provides high-performance data manipulation in Python.

• It provides various data structures and operations for manipulating numerical data and time series
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• The name Pandas is derived from the term "Panel Data"; the library was developed by Wes McKinney in 2008

• Pandas is built on top of the NumPy package, which means NumPy is required for Pandas to work.
• It also has to be installed as follows:
pip install pandas
Pandas DataFrames:
• A Pandas DataFrame is a 2-dimensional data structure, i.e. a table with rows and columns.
• A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
• Syntax: pandas.DataFrame(data, index, columns, dtype, copy)
Creating a simple DataFrame:
• A simple DataFrame can be created using a single list or a list of lists
Example:
import pandas as pd
data = ["C", "C++", "Java", "Python", "PHP"]    # Create list
df = pd.DataFrame(data)                         # Create dataframe using the list
print(df)

Output:
        0
0       C
1     C++
2    Java
3  Python
4     PHP
Dataframes:

Example:

import pandas as pd
data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])
print(df1)

Output:
   Roll  Name  Percent
a   101   Abc       90
b   102   Xyz       80
c   103   Pqr       75
d   104   Mno       85
e   105   Def       77

The top row (Roll, Name, Percent) holds the columns; the labels on the left (a to e) form the index.
Loading Data with Pandas:
• Using the Pandas library, data can be imported from csv (comma separated values) files and loaded into Pandas DataFrames
Example:

import pandas as pd

df = pd.read_csv("IRIS.csv")    # Read the csv file; read_csv() already returns a DataFrame
print(df)
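Since IRIS.csv may not be at hand, the same loading step can be tried on CSV text held in memory. This is a sketch under that assumption: the csv_text contents are made up for illustration, and io.StringIO simply stands in for a file on disk.

```python
import io
import pandas as pd

# Made-up CSV contents standing in for a real file on disk.
csv_text = "Roll,Name,Percent\n101,Abc,90\n102,Xyz,80\n103,Pqr,75\n"

df = pd.read_csv(io.StringIO(csv_text))   # The first line becomes the column names
print(df.head(2))                         # Show only the first two rows

# usecols limits loading to the listed columns.
selected = pd.read_csv(io.StringIO(csv_text), usecols=["Roll", "Percent"])
print(selected)
```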
Additional information about Dataframes:
Example:
print(len(df1))       # Returns number of rows
print(df1.shape)      # Returns rows and columns as a tuple
print(df1.size)       # Returns number of 'cells' in the table
print(df1.columns)    # Returns the column names
print(df1.dtypes)     # Returns data types of the columns

Output:
5
(5, 3)
15
Index(['Roll', 'Name', 'Percent'], dtype='object')
Roll        int64
Name       object
Percent     int64
dtype: object
Creating Dataframes:
We can create a DataFrame using following ways:
• dict
• Lists
• NumPy ndarrays
• Series
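Each of the four sources listed above can be passed to pd.DataFrame(). The following is a minimal sketch; the data values and variable names are chosen here only for illustration.

```python
import numpy as np
import pandas as pd

# From a dict: the keys become the column names.
df_dict = pd.DataFrame({"Roll": [101, 102], "Name": ["Abc", "Xyz"]})

# From a list of lists: column names are supplied separately.
df_list = pd.DataFrame([[101, "Abc"], [102, "Xyz"]], columns=["Roll", "Name"])

# From a NumPy ndarray.
df_array = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=["x", "y"])

# From a Series: the name of the Series becomes the column name.
df_series = pd.DataFrame(pd.Series([90, 80], name="Percent"))

for frame in (df_dict, df_list, df_array, df_series):
    print(frame.shape, list(frame.columns))
```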
1. Accessing any column in DataFrame:
The columns of a dataframe can be accessed by their column names.
Example:
print(df1["Name"])    # Accessing a column of the dataframe

Output:
a    Abc
b    Xyz
c    Pqr
d    Mno
e    Def
Name: Name, dtype: object
2. Accessing any row in DataFrame:
• Pandas provides special indexers to retrieve rows from a DataFrame.
• The loc[] indexer retrieves a row by its index label
Example:
import pandas as pd

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])

print(df1.loc['b'])    # Retrieving data at the row indexed "b"

data = df1.loc['b']    # Retrieving data at index 'b' and storing it in a variable

print(data)

Output (printed once by each print call):
Roll       102
Name       Xyz
Percent     80
Name: b, dtype: object
3. Accessing any row in DataFrame using integer index:
• Rows can be selected by passing an integer position to the iloc[] indexer
• Example:

import pandas as pd

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])

print(df1.iloc[2])    # Retrieving data at the row at position 2

Output:
Roll       103
Name       Pqr
Percent     75
Name: c, dtype: object
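Both loc[] and iloc[] also accept a second entry that selects columns. The sketch below, on a shortened version of the same table, shows the label-based and position-based forms side by side; the variable names are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame(
    [[101, "Abc", 90], [102, "Xyz", 80], [103, "Pqr", 75]],
    columns=["Roll", "Name", "Percent"],
    index=["a", "b", "c"],
)

by_label = df1.loc["b", "Name"]       # Row label "b", column "Name"
by_position = df1.iloc[1, 1]          # Second row, second column

# Label slices include the end label; position slices exclude the end position.
label_slice = df1.loc["a":"b", ["Roll", "Percent"]]
position_slice = df1.iloc[0:2, [0, 2]]

print(by_label, by_position)
print(label_slice)
print(position_slice)
```

Here both slices select the same two rows and two columns, one by name and one by position.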

4. Slicing a Dataframe:
• Multiple rows can be selected using the : operator.
Example:

import pandas as pd

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])

print(df1[2:4])    # Retrieving rows from position 2 to position (4 - 1); df1.iloc[2:4] is equivalent

Output:
   Roll  Name  Percent
c   103   Pqr       75
d   104   Mno       85

5. Appending/Adding row to Dataframe:
• New rows can be added to a dataframe by concatenating it with another dataframe using pd.concat(). (The older DataFrame.append() method was deprecated in pandas 1.4 and removed in pandas 2.0.)
Example:

import pandas as pd

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a", "b", "c", "d", "e"])
df2 = pd.DataFrame([[106, 'AAA', 60], [107, 'asfd', 65]],
                   columns=["Roll", "Name", "Percent"], index=['f', 'g'])
df1 = pd.concat([df1, df2])    # Appending the rows of df2 to df1
print(df1)

Output:
   Roll  Name  Percent
a   101   Abc       90
b   102   Xyz       80
c   103   Pqr       75
d   104   Mno       85
e   105   Def       77
f   106   AAA       60
g   107  asfd       65
6. Dropping or deleting rows from a Dataframe:
• Rows can be deleted by passing their index label to the drop() method
Example:

df1 = df1.drop('a')
print(df1)
Working with and saving data with pandas:
• A Pandas DataFrame can be saved as a CSV file using the to_csv() method.
data1 = [[101, "Abc", 90],
[102, "Xyz", 80],
[103, "Pqr", 75],
[104, "Mno", 85],
[105, "Def", 77]
]

df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"], index=["a","b","c","d","e"])

df1.to_csv("StudentData.csv")
Saving Dataframe without header and index:
• Setting the parameter header=False removes the column headings, and setting index=False removes the index, while writing the DataFrame to a CSV file.

data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]

df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"], index=["a","b","c","d","e"])

df1.to_csv("StudentData.csv", header=False, index=False)    # Saving file without header and index
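To see what the header and index options change, the saved file can be read back. This is a sketch that writes to a temporary directory (an assumption made so that no existing file is overwritten) and uses a shortened two-row table.

```python
import os
import tempfile
import pandas as pd

df1 = pd.DataFrame(
    [[101, "Abc", 90], [102, "Xyz", 80]],
    columns=["Roll", "Name", "Percent"],
    index=["a", "b"],
)

folder = tempfile.mkdtemp()   # Hypothetical location for the output files

# Default: the header row and the index column are both written.
full_path = os.path.join(folder, "StudentData.csv")
df1.to_csv(full_path)
restored = pd.read_csv(full_path, index_col=0)   # Read the first column back as the index

# header=False and index=False: only the data values are written.
plain_path = os.path.join(folder, "StudentDataPlain.csv")
df1.to_csv(plain_path, header=False, index=False)
with open(plain_path) as f:
    raw_lines = f.read().splitlines()

print(restored)
print(raw_lines)
```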
Data Cleaning in Python:
• Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
• Data cleaning aims at filling missing values, smoothing out noise while determining outliers, and rectifying inconsistencies in the data
• Data cleaning consists of removing null records, dropping unnecessary columns, treating missing values, rectifying junk values (otherwise called outliers), restructuring the data into a more readable format, etc.

• There are various techniques to clean the data like:

1. Removing Null/Duplicate Records/Ignore tuple
2. Dropping unnecessary Columns
3. Treating missing values
Removing Null/Duplicate Records/Ignore tuple:
• If a significant amount of data is missing in a particular row, it is better to drop that row, as it would not add any value to our model.
• Sometimes a csv file has null values, which are later displayed as NaN in the DataFrame
Example (data in the csv file, with empty cells left blank):
RollNo | Name | Percent | City
101 | Abc | 80 | Pune
102 | Xyz | 85 | Mumbai
103 | Def | 90 | Goa
104 |  | 75 | Delhi
105 |  |  | Pune
106 | Bbb |  |
 | Pqr | 60 | Nagpur
108 |  |  | Nashik
109 |  | 50 | Mumbai
110 | dgfg | 74 | Pune

Output (after loading the file into a DataFrame):
   RollNo  Name  Percent    City
0   101.0   Abc     80.0    Pune
1   102.0   Xyz     85.0  Mumbai
2   103.0   Def     90.0     Goa
3   104.0   NaN     75.0   Delhi
4   105.0   NaN      NaN    Pune
5   106.0   Bbb      NaN     NaN
6     NaN   Pqr     60.0  Nagpur
7   108.0   NaN      NaN  Nashik
8   109.0   NaN     50.0  Mumbai
9   110.0  dgfg     74.0    Pune


Removing Null/Duplicate Records/Ignore tuple:
• The Pandas DataFrame dropna() function is used to remove rows and columns with Null/NaN values
• This function returns a new DataFrame and the source DataFrame remains unchanged.
Example:
import pandas as pd

df = pd.read_csv("Cleaning.csv")    # Read the csv file into a dataframe
print(df)
newdf = df.dropna()                 # Drop the rows that contain NaN values
print(newdf)

Output (newdf):
   RollNo  Name  Percent    City
0   101.0   Abc     80.0    Pune
1   102.0   Xyz     85.0  Mumbai
2   103.0   Def     90.0     Goa
9   110.0  dgfg     74.0    Pune
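The heading also mentions duplicate records and rows with a significant amount of missing data. A possible sketch of both uses drop_duplicates() and the thresh parameter of dropna(). The rows below are made up for illustration and are not the contents of Cleaning.csv.

```python
import numpy as np
import pandas as pd

# Made-up rows: row 2 repeats row 1, rows 3 and 4 have missing values.
df = pd.DataFrame({
    "RollNo": [101, 102, 102, 105, 106],
    "Name": ["Abc", "Xyz", "Xyz", np.nan, "Bbb"],
    "Percent": [80, 85, 85, 70, np.nan],
    "City": ["Pune", "Mumbai", "Mumbai", "Pune", np.nan],
})

no_nulls = df.dropna()              # Drop every row that has at least one NaN
mostly_full = df.dropna(thresh=3)   # Keep rows with at least 3 non-null values
no_dupes = df.drop_duplicates()     # Drop rows that exactly repeat an earlier row

print(no_nulls.index.tolist())
print(mostly_full.index.tolist())
print(no_dupes.index.tolist())
```

All three calls return new DataFrames, so df itself still holds all five rows afterwards.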
2. Dropping unnecessary Columns
• Sometimes the dataset we receive is huge, with a large number of rows and columns
• Some columns from the dataset may not be useful for our model
• Such data is better removed, as it would otherwise consume valuable resources like memory and processing time.
• Example: In the previous data the column City is not required for finding the student progression, hence it can be dropped using the drop() method

new2 = df.drop(['City'], axis=1, inplace=False)

print(new2)

where,
axis : Takes value 0 for 'index' and 1 for 'columns'
inplace : If True, the change is made in the original dataframe; if False (the default), a new dataframe is returned and the original is left unchanged
3. Treating missing values
• Missing data can occur when no information is provided for one or more items or for a whole unit
• Missing data can be a very big problem in real-life scenarios.
• It must be handled carefully, as it can be an indication of something important.
• We can fill in missing values using the fillna() method, which lets the user replace NaN values with a value of their own.
Syntax:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)
where,
value : Static (scalar) value, dictionary, array, series or dataframe to fill in place of NaN.
method : Used if the user doesn't pass any value. Pandas has different methods like bfill, backfill or ffill (this parameter is deprecated since pandas 2.1; the ffill() and bfill() methods are used instead)
axis : Takes an int or string value for rows/columns.
inplace : A boolean; if True, the changes are made in the data frame itself.
Example:
new3 = df['City'].fillna("No Data", inplace=False)
print(new3)

Output:
0       Pune
1     Mumbai
2        Goa
3      Delhi
4       Pune
5    No Data
6     Nagpur
7     Nashik
8     Mumbai
9       Pune
Name: City, dtype: object
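fillna() can also take a dictionary, so that each column gets its own replacement value, and ffill() copies the previous row's value downward. The sketch below uses a small made-up table; filling Percent with the column mean is just one possible choice, not a recommendation from the slides.

```python
import numpy as np
import pandas as pd

# Made-up table with one missing value in each column.
df = pd.DataFrame({
    "Name": ["Abc", np.nan, "Def"],
    "Percent": [80.0, np.nan, 90.0],
    "City": ["Pune", "Delhi", np.nan],
})

# A dict maps each column name to the value used for its NaN cells.
filled = df.fillna({
    "Name": "Unknown",
    "Percent": df["Percent"].mean(),   # Mean of the known values (85.0 here)
    "City": "No Data",
})

# ffill() fills each NaN with the last valid value above it in the same column.
forward = df.ffill()

print(filled)
print(forward)
```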
