PANDAS Python

Pandas consists of two core objects: DataFrames, which are tables of data, and Series, which are sequences of data values. It offers two indexing paradigms—index-based selection (iloc) and label-based selection (loc)—with distinct behaviors regarding inclusivity in indexing. Additionally, Pandas provides mapping functions, data type management, and methods for combining DataFrames and Series, such as concat() and join().

Uploaded by

Emanuel André
Copyright
© All Rights Reserved

*PANDAS*

There are two core objects in pandas: the DataFrame and the Series.

DataFrame: A DataFrame is a table. It contains an array of individual entries,
each of which has a certain value. Each entry corresponds to a row (or record) and
a column.
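
As a minimal sketch of the idea, a DataFrame can be built from a dictionary whose
keys become column names. The fruit figures and row labels below are made up for
illustration; they are not from the original notes.

```python
import pandas as pd

# Hypothetical data: each dict key becomes a column, each list position a row.
fruit_sales = pd.DataFrame(
    {"Apples": [35, 41], "Bananas": [21, 34]},
    index=["2017 Sales", "2018 Sales"],  # optional row labels
)

print(fruit_sales)
print(fruit_sales.shape)  # (number of rows, number of columns)
```

Every entry sits at the intersection of one row label and one column name.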

Series: A Series, by contrast, is a sequence of data values. If a DataFrame is a
table, a Series is a list. In fact, you can create one with nothing more than a
list.

A Series is, in essence, a single column of a DataFrame. So you can assign row
labels to a Series the same way as for a DataFrame, using an index parameter.
However, a Series does not have a column name; it only has one overall name.
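
The following sketch shows both points: a Series created from a plain list, and
one with row labels and an overall name. The product labels and ratings are
hypothetical.

```python
import pandas as pd

# A Series from nothing more than a list; pandas assigns positions 0, 1, 2 as labels.
plain = pd.Series([4, 5, 3])

# The same values with explicit row labels and a single overall name.
ratings = pd.Series(
    [4, 5, 3],
    index=["Product A", "Product B", "Product C"],
    name="Rating",
)

print(ratings)
print(ratings.name)
```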

Index-based selection
Pandas indexing works in one of two paradigms. The first is index-based selection:
selecting data based on its numerical position in the data. iloc follows this
paradigm.

Both loc and iloc are row-first, column-second. This is the opposite of plain
bracket indexing on a DataFrame, such as df['col'][0], which is column-first,
row-second.

This means that it's marginally easier to retrieve rows, and marginally harder to
retrieve columns. To get a column with iloc, select every row with a bare colon and
give the column's position, for example df.iloc[:, 0].
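
A short sketch of position-based selection with iloc. The small wines table is
an assumed stand-in for the wine reviews dataset these notes refer to.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews data.
wines = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US", "France"],
        "points": [87, 87, 86, 90],
    }
)

first_row = wines.iloc[0]        # first row, returned as a Series
first_col = wines.iloc[:, 0]     # all rows, first column
top_three = wines.iloc[:3, 0]    # first three rows of the first column
picked = wines.iloc[[0, 2], 0]   # rows at positions 0 and 2
last_two = wines.iloc[-2:]       # negative positions count from the end

print(first_col)
```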

Label-based selection
The second paradigm for attribute selection is the one followed by the loc
operator: label-based selection. In this paradigm, it's the data index value, not
its position, which matters.
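
To illustrate that loc looks at labels rather than positions, the sketch below
uses a hypothetical table whose index labels (10, 20, 30) deliberately differ
from the row positions (0, 1, 2).

```python
import pandas as pd

# Hypothetical data with index labels that are not 0, 1, 2.
wines = pd.DataFrame(
    {"country": ["Italy", "Portugal", "US"], "points": [87, 87, 86]},
    index=[10, 20, 30],
)

by_label = wines.loc[10, "country"]            # row labelled 10, column "country"
by_position = wines.iloc[0, 0]                 # same cell, addressed by position
two_columns = wines.loc[:, ["country", "points"]]  # columns are chosen by name

print(by_label, by_position)
```

Here wines.loc[0] would raise a KeyError, because no row carries the label 0.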

Choosing between loc and iloc
When choosing or transitioning between loc and iloc, there is one "gotcha" worth
keeping in mind, which is that the two methods use slightly different indexing
schemes.

iloc uses the Python stdlib indexing scheme, where the first element of the range
is included and the last one excluded. So 0:10 will select entries 0,...,9. loc,
meanwhile, indexes inclusively. So 0:10 will select entries 0,...,10.

Why the change? Remember that loc can index any stdlib type: strings, for example.
If we have a DataFrame with index values Apples, ..., Potatoes, ..., and we want to
select "all the alphabetical fruit choices between Apples and Potatoes", then it's
a lot more convenient to index df.loc['Apples':'Potatoes'] than it is to index
something like df.loc['Apples':'Potatoet'] (t coming after s in the alphabet).

This is particularly confusing when the DataFrame index is a simple numerical list,
e.g. 0,...,1000. In this case df.iloc[0:1000] will return 1000 entries, while
df.loc[0:1000] returns 1001 of them! To get 1000 elements using loc, you will need
to go one lower and ask for df.loc[0:999].

Otherwise, the semantics of using loc are the same as those for iloc.
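
The off-by-one difference described above can be checked directly. This is a
small sketch; the frame with index 0,...,1000 and the fruit stock Series are
made up for the purpose.

```python
import pandas as pd

# Default integer index 0, 1, ..., 1000 (1001 rows in total).
df = pd.DataFrame({"value": range(1001)})

n_iloc = len(df.iloc[0:1000])      # end position excluded
n_loc = len(df.loc[0:1000])        # end label included
n_loc_fixed = len(df.loc[0:999])   # go one lower to match iloc

# Label slices on a sorted string index are inclusive as well.
stock = pd.Series([3, 7, 5, 2], index=["Apples", "Bananas", "Potatoes", "Tomatoes"])
fruit = stock.loc["Apples":"Potatoes"]

print(n_iloc, n_loc, n_loc_fixed)
print(fruit)
```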

Pandas comes with a few built-in conditional selectors, two of which we will
highlight here.

The first is isin. isin lets you select data whose value "is in" a list of
values.
The second is isnull (and its companion notnull). These methods let you highlight
values which are (or are not) empty (NaN). For example, notnull can be used to
filter out wines lacking a price tag in the dataset.
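
A sketch of both selectors used as row filters inside loc. The wines table and
its column names are assumptions chosen to match the wine example in the text.

```python
import pandas as pd

# Hypothetical wine data; None becomes NaN in the numeric price column.
wines = pd.DataFrame(
    {
        "country": ["Italy", "Portugal", "US", "France"],
        "price": [None, 15.0, 14.0, 24.0],
    }
)

# Rows whose country "is in" the given list.
italy_or_france = wines.loc[wines["country"].isin(["Italy", "France"])]

# Rows that do have a price, and rows that lack one.
priced = wines.loc[wines["price"].notnull()]
unpriced = wines.loc[wines["price"].isnull()]

print(priced)
```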

Maps
A map is a term, borrowed from mathematics, for a function that takes one set of
values and "maps" them to another set of values. In data science we often have a
need for creating new representations from existing data, or for transforming data
from the format it is in now to the format that we want it to be in later. Maps are
what handle this work, making them extremely important for getting your work done!

There are two mapping methods that you will use often.

map() is the first, and slightly simpler, one. For example, suppose that we wanted
to remean the scores the wines received to 0, that is, shift every score so that
the mean of the column becomes 0.
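
One way to sketch this with map(): compute the mean once, then map each score to
its distance from that mean. The four scores are hypothetical.

```python
import pandas as pd

# Hypothetical wine scores.
points = pd.Series([87, 87, 86, 90], name="points")

points_mean = points.mean()

# map() calls the function once per value and returns a new Series;
# the original Series is left unchanged.
centered = points.map(lambda p: p - points_mean)

print(centered)
print(centered.mean())
```

The same result can be had with the plain expression points - points_mean;
map() is shown here because it generalizes to arbitrary per-value functions.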

Dtypes
The data type for a column in a DataFrame or a Series is known as the dtype.

You can use the dtype property to grab the type of a specific column.

Data types tell us something about how pandas is storing the data internally.
float64 means that it's using a 64-bit floating point number; int64 means a
similarly sized integer instead, and so on.

One peculiarity to keep in mind is that columns consisting entirely of strings do
not get their own type; they are instead given the object type.

It's possible to convert a column of one type into another wherever such a
conversion makes sense by using the astype() method. For example, we may
transform the points column from its existing int64 data type into a float64 data
type.
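
A brief sketch of inspecting and converting dtypes. The wines table is assumed
for illustration; the points column typically comes out as int64.

```python
import pandas as pd

# Hypothetical wine data with an integer and a floating point column.
wines = pd.DataFrame({"points": [87, 86, 90], "price": [15.0, 14.0, 24.0]})

print(wines["price"].dtype)   # dtype of a single column
print(wines.dtypes)           # dtype of every column at once

# astype() returns a converted copy; the original column keeps its type.
points_as_float = wines["points"].astype("float64")

print(points_as_float.dtype)
```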

Combining
When performing operations on a dataset, we will sometimes need to combine
different DataFrames and/or Series in non-trivial ways. Pandas has three core
methods for doing this. In order of increasing complexity, these are concat(),
join(), and merge(). Most of what merge() can do can also be done more simply with
join(), so we will omit it and focus on the first two functions here.

The simplest combining method is concat(). Given a list of elements, this function
will smush those elements together along an axis.

This is useful when we have data in different DataFrame or Series objects but
having the same fields (columns). One example: the YouTube Videos dataset, which
splits the data up based on country of origin (e.g. Canada and the UK).
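
A minimal sketch of concat() on two frames with the same columns. The two tiny
tables below are invented stand-ins for the per-country files; the titles and
view counts are not real data.

```python
import pandas as pd

# Hypothetical per-country tables sharing the same columns.
canada = pd.DataFrame({"title": ["Video A", "Video B"], "views": [100, 250]})
uk = pd.DataFrame({"title": ["Video C", "Video D"], "views": [300, 50]})

# Stack the rows of both frames (axis=0 is the default).
combined = pd.concat([canada, uk])

# The original row labels are kept, so 0 and 1 each appear twice;
# ignore_index=True renumbers the rows instead.
renumbered = pd.concat([canada, uk], ignore_index=True)

print(combined)
print(renumbered)
```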
