Unit6 - Working With Data
Example:
fileptr = open("new.txt","a")
fileptr.write("This is a new file")
Closing a File:
• Once all the operations are done on the file, we must close it through our Python script using the
close() method.
Example:
fileptr.close()
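The open, write, and close steps can be sketched end to end as follows. This is a minimal sketch, assuming a temporary directory so that no real file is touched; the path and the second line of text are illustrative and not part of the original example. It also shows the with statement, a standard Python alternative that closes the file automatically.

```python
import os
import tempfile

# Illustrative location: a temporary directory instead of the working directory.
path = os.path.join(tempfile.mkdtemp(), "new.txt")

# Open in append mode ("a"); the file is created if it does not exist.
fileptr = open(path, "a")
fileptr.write("This is a new file")
fileptr.close()                 # release the file once all operations are done
print(fileptr.closed)           # True

# Alternative: a with-block closes the file automatically when it ends.
with open(path, "a") as f:
    f.write("\nSecond line")

# Read the file back to confirm both writes were appended.
with open(path) as f:
    content = f.read()
print(content)
```

Calling close() explicitly works, but the with form is usually preferred because the file is closed even if an error occurs inside the block.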
NumPy in Python:
• NumPy (Numerical Python) is a library used for computing with and processing single-dimensional and
multidimensional arrays.
• NumPy arrays are faster than lists because their elements are stored in one contiguous block of memory.
• The NumPy package needs to be installed using the following command:
pip install numpy
• Once installed, the package has to be imported in the Python program using the import keyword
• The n-dimensional array object in numpy is known as ndarray
• We can create a NumPy ndarray object by using the array() function
Example:
import numpy as np                   # import the numpy package
arr = np.array([1, 2, 3, 4, 5])      # Creating ndarray object
print(arr)
print(type(arr))
Output:
[1 2 3 4 5]
<class 'numpy.ndarray'>
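The same array() function also builds multidimensional arrays. The following is a small sketch, assuming a 2 x 3 list of lists chosen only for illustration, that inspects the number of dimensions and the shape and applies arithmetic to every element at once.

```python
import numpy as np

# One-dimensional array built from a flat list.
arr = np.array([1, 2, 3, 4, 5])

# Two-dimensional array built from a list of lists (illustrative values).
matrix = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim, arr.shape)        # 1 (5,)
print(matrix.ndim, matrix.shape)  # 2 (2, 3)

# Arithmetic applies to every element without an explicit loop.
doubled = arr * 2
print(doubled.tolist())           # [2, 4, 6, 8, 10]
print(matrix.sum())               # 21
```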
• Pandas provides various data structures and operations for manipulating numerical data and time series.
• It has functions for analyzing, cleaning, exploring, and manipulating data.
• The name Pandas is derived from the term "Panel Data"; the library was developed by Wes
McKinney in 2008.
• Pandas is built on top of the NumPy package, which means NumPy is required to use Pandas.
• It also has to be installed as follows:
pip install pandas
Pandas DataFrames:
• A Pandas DataFrame is a two-dimensional data structure: a table with rows and columns.
• A Pandas DataFrame consists of three principal components: the data, the rows, and the columns.
• Syntax: pandas.DataFrame( data, index, columns, dtype, copy)
Creating a simple DataFrame:
• A simple DataFrame can be created using a single list or a list of lists.
import pandas as pd
data = ["C", "C++", "Java", "Python", "PHP"]    # Create list
df = pd.DataFrame(data)                         # Create DataFrame using the list
print(df)
Output:
        0
0       C
1     C++
2    Java
3  Python
4     PHP
Dataframes:
Example:
import pandas as pd
data1 = [ [101, "Abc", 90],
          [102, "Xyz", 80],
          [103, "Pqr", 75],
          [104, "Mno", 85],
          [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a","b","c","d","e"])
print(df1)
Output:
   Roll  Name  Percent
a   101   Abc       90
b   102   Xyz       80
c   103   Pqr       75
d   104   Mno       85
e   105   Def       77
(Roll, Name and Percent are the columns; a to e are the index.)
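The three components of a DataFrame (data, rows, and columns) can be inspected directly. A short sketch, assuming a three-row subset of the student table to keep it compact:

```python
import pandas as pd

# Three-row subset of the student table (shortened for illustration).
data1 = [[101, "Abc", 90], [102, "Xyz", 80], [103, "Pqr", 75]]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"], index=["a", "b", "c"])

print(df1.shape)             # (3, 3): number of rows, number of columns
print(list(df1.columns))     # ['Roll', 'Name', 'Percent']
print(list(df1.index))       # ['a', 'b', 'c']
print(df1.values.tolist())   # the data itself, row by row
```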
Loading Data with Pandas:
• Using the Pandas library, data can be imported from CSV (comma-separated values) files and
loaded into Pandas DataFrames with the read_csv() function.
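A minimal sketch of loading a CSV file with read_csv(). The file contents are hypothetical and are written to a temporary directory first, only so that the sketch is self-contained; in practice the file would already exist.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample file, created here only to make the sketch runnable.
csv_path = os.path.join(tempfile.mkdtemp(), "StudentData.csv")
with open(csv_path, "w") as f:
    f.write("Roll,Name,Percent\n101,Abc,90\n102,Xyz,80\n103,Pqr,75\n")

# The first line of the file is used as the column headings by default.
df = pd.read_csv(csv_path)
print(df)
print(df.shape)              # (3, 3)
```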
Example:
import pandas as pd
Output:
a. Abc
b. Xyz
c. Pqr
d. Mno
e. Def
2. Accessing any row in DataFrame:
• Pandas provides the loc[] indexer to retrieve rows from a DataFrame by their index label.
Example:
import pandas as pd
data1 = [ [101, "Abc", 90],
          [102, "Xyz", 80],
          [103, "Pqr", 75],
          [104, "Mno", 85],
          [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a","b","c","d","e"])
print(df1.loc["b"])       # retrieve the row with index label "b"
Output:
Roll       102
Name       Xyz
Percent     80
df1 = df1.drop('a')       # remove the row with index label "a"
print(df1)
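Row retrieval with loc[] and row removal with drop() can be compared side by side. This is a sketch on a shortened, three-row version of the table; the variable names row_b, rows_ac, and remaining are illustrative.

```python
import pandas as pd

data1 = [[101, "Abc", 90], [102, "Xyz", 80], [103, "Pqr", 75]]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"], index=["a", "b", "c"])

row_b = df1.loc["b"]            # one label gives a Series holding that row
print(row_b["Name"])            # Xyz

rows_ac = df1.loc[["a", "c"]]   # a list of labels gives a DataFrame
print(list(rows_ac.index))      # ['a', 'c']

remaining = df1.drop("a")       # returns a new DataFrame; df1 is unchanged
print(list(remaining.index))    # ['b', 'c']
print(list(df1.index))          # ['a', 'b', 'c']
```

Because drop() returns a new DataFrame by default, the result has to be assigned back (as in df1 = df1.drop('a')) for the removal to persist.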
Working with and saving data with pandas:
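• A Pandas DataFrame can be saved as a CSV file using the to_csv() method.
import pandas as pd
data1 = [[101, "Abc", 90],
         [102, "Xyz", 80],
         [103, "Pqr", 75],
         [104, "Mno", 85],
         [105, "Def", 77]
        ]
df1 = pd.DataFrame(data1, columns=["Roll", "Name", "Percent"],
                   index=["a","b","c","d","e"])
df1.to_csv("StudentData.csv")       # write the DataFrame to a CSV file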
Saving Dataframe without header and index:
• Setting the parameter header=False removes the column heading and setting index=False removes
the index while writing DataFrame to CSV file.
df1.to_csv("StudentData.csv", header=False, index=False)   # Saving file without header and index
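The effect of header=False and index=False is easiest to see by comparing the generated text. In this sketch no file name is passed to to_csv(), in which case it returns the CSV content as a string instead of writing a file; the two-row table is illustrative.

```python
import pandas as pd

df1 = pd.DataFrame([[101, "Abc", 90], [102, "Xyz", 80]],
                   columns=["Roll", "Name", "Percent"], index=["a", "b"])

# Default: column headings on the first line, index labels in the first column.
with_all = df1.to_csv()
print(with_all)

# header=False drops the heading line; index=False drops the index column.
bare = df1.to_csv(header=False, index=False)
print(bare)
```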
Data Cleaning in Python:
• Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted,
duplicate, or incomplete data within a dataset.
• Data cleaning aims at filling missing values, smoothing out noise while determining outliers, and
rectifying inconsistencies in the data.
• Data cleaning consists of removing null records, dropping unnecessary columns, treating
missing values, rectifying junk values (otherwise called outliers), restructuring the data into
a more readable format, etc.
where,
axis : Takes the value 0 for ‘index’ (rows) and 1 for ‘columns’.
inplace : If True, the changes are made in the original DataFrame.
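How axis and inplace behave can be sketched with drop() and dropna(). The choice of these two methods is an assumption, since the syntax these parameters belong to is not shown here, and the small table with its "Temp" column is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing City and one column that is not needed.
df = pd.DataFrame({
    "Name": ["Abc", "Xyz", "Pqr"],
    "City": ["Pune", np.nan, "Goa"],
    "Temp": [1, 2, 3],
})

# axis=1 targets columns; a new DataFrame is returned and df keeps "Temp".
without_temp = df.drop("Temp", axis=1)

# axis=0 targets rows, identified by their index label.
without_first = df.drop(0, axis=0)

# dropna() removes the rows that contain a missing value.
no_nulls = df.dropna()

# inplace=True modifies df itself and returns None.
result = df.drop("Temp", axis=1, inplace=True)
print(result)                 # None
print(list(df.columns))       # ['Name', 'City']
```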
3. Treating missing values
• Missing data can occur when no information is provided for one or more items, or for a whole unit.
• Missing data can be a serious problem in real-life scenarios.
• Missing values must be handled carefully, as they can be an indication of something important.
• Missing values can be filled in using the fillna() method, which replaces NaN values with a value
chosen by the user.
Syntax:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)
where,
value : Scalar, dictionary, Series or DataFrame to use in place of NaN.
method : Fill method to use when no value is passed, such as bfill (or its alias backfill) or ffill. Recent Pandas versions deprecate this parameter in favour of the separate ffill() and bfill() methods.
axis : Takes an int or string value to select rows or columns.
inplace : Boolean; if True, the changes are made in the DataFrame itself.
Example:
new3 = df['City'].fillna("No Data", inplace=False)
print(new3)
Output:
1        Pune
2      Mumbai
3         Goa
4       Delhi
5        Pune
6     No Data
7      Nagpur
8      Nashik
9      Mumbai
10       Pune
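The different ways of filling missing values can be compared on a small table. This is a sketch on hypothetical data (the "Score" column and its values are assumptions); it uses the ffill() and bfill() methods for forward and backward filling.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "City": ["Pune", "Mumbai", np.nan, "Delhi"],
    "Score": [90.0, np.nan, 75.0, np.nan],
})

# A single value replaces every NaN in the selected column.
filled_city = df["City"].fillna("No Data")

# A dictionary gives a different replacement for each column.
filled_all = df.fillna({"City": "No Data", "Score": 0})

# Forward fill copies the previous valid value down into the gap.
forward = df["Score"].ffill()

# Backward fill copies the next valid value up; a trailing gap stays NaN.
backward = df["Score"].bfill()

print(filled_city.tolist())
print(filled_all["Score"].tolist())
print(forward.tolist())
print(backward.tolist())
```

None of these calls changes df, because inplace defaults to False; each one returns a new object.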