Python for Data Analytics, Scientific and Technical Applications
Abstract: Since the invention of computers, their capability to perform various tasks has grown exponentially. In recent times, data science and analytics, a branch of computer science, has revived due to the major increase in computing power, the availability of huge amounts of data, and a better understanding of techniques in the areas of Data Analytics, Artificial Intelligence, Machine Learning, Deep Learning, etc. Hence, these fields have become an essential part of the technology industry and are being used to solve many challenging problems. In the search for a good programming language in which many data science applications can be developed, Python has emerged as a complete programming solution. Due to its low learning curve and flexibility, Python has become one of the fastest growing languages. Python's ever-evolving libraries make it a good choice for data analytics. The paper discusses the features and characteristics of the Python programming language, and then the reasons why Python is credited as one of the fastest growing programming languages and why it is at the forefront of data science applications, research and development.

Keywords: Python, Data Analytics, Artificial Intelligence, Deep Learning, Machine Learning, Natural Language Processing, Scientific Computations, Computer Languages, Frameworks.

[1]. There has been a tradition among scientists and developers of using compiled languages like C++, C and LISP for scientific applications and data analytics. But in recent years the usage of many compiled languages has been decreasing, giving way to the rise of interpreted environments, such as Octave,

Nowadays, the programming language Python is rising as an alternative to MATLAB, R and other related environments. Python is an interpreted language, like MATLAB. Moreover, it also has object-oriented (OO) features and provides a cross-platform interface. Developers can gain more flexibility and elegance, and can easily port software designs written in C to Python. Object-oriented programming in MATLAB is less convenient than in Python.

Another advantage of Python is that interfacing with legacy software written in C, C++ and other languages is far less complicated than in most other environments. This is because Python was initially designed so that it could be extended using compiled code to increase its efficiency. Many tools are also available to ease this integration. Having features (i)–(v), and power, the ability to perform parallel programming and interfacing capabilities, Python represents a fine environment for doing computational science (CS). Many Python-based frameworks have improved how we program many scientific applications. Frameworks like TensorFlow, PyTorch, Keras, etc. have revolutionized the field of deep learning. Using these frameworks, even a new developer with little knowledge can build neural networks.
Authorized licensed use limited to: b-on: Instituto Politecnico de Braganca. Downloaded on November 23,2022 at 17:19:08 UTC from IEEE Xplore. Restrictions apply.
create tools that are more effective and simpler to use for analysis and program development. He proposed that Python meets both needs, as it is not a "toy" language and is very suitable for teaching. Now, nearly 19 years after his proposal [3], Python is widely taught as the introductory language in many US computer science departments. Fig. 1 shows the number of US computer science departments and the languages they use to teach their introductory courses.
This is of great help to developers writing algorithms in many domains, viz. AI, neural networks, deep learning and computer networks. It runs on different platforms with the same interface.

C. Platform Independent
Python gives developers the flexibility to provide an API from their existing programming language, which is very convenient for new Python developers. Moreover, it is platform independent: developers need to change the source code of a project only slightly to get it to run on a different OS, which saves a lot of time and work.

D. Balance of Low-Level and High-Level Programming
Python can balance high-level programming with low-level programming, which is one of its most important features. It provides high performance on higher-level objects like arrays and matrices. One example is vectorization in algorithms, which makes it possible to work with an entire array rather than a single number. As a result, coding accuracy and coding speed increase. Such operations are very important for good scientific coding. However, vectorization is not always helpful during the development of new algorithms. Cython [6], a superset of Python, is used to solve this problem. It first translates Python code into C code and then uses the Python C API to run the code. A shared library, equivalent to the original module, is created; it is loaded as a Python module and runs slightly faster.

Let us consider a simple example, finding the nth term of the Fibonacci series, to compare the speed of Python and Cython. Figures 4 and 5 show the code for the Fibonacci series and Table I compares their runtimes. As seen from Table I, Cython is about 38 times faster than Python even for such a simple example.

    def fib(n):
        # Return the nth term of the Fibonacci series (fib(0) == 0).
        a = 0
        b = 1
        for i in range(n):
            a, b = a + b, a
        return a

Fig. 4. Code to find the nth term of Fibonacci series in Python

TABLE I: SPEED COMPARISON BETWEEN PYTHON AND CYTHON

Language          Time      Speedup
Python            6.39 µs   1x
Compiled Python   4.1 µs    1.56x
Cython            167 ns    38.26x

E. Continuity
Python is nearly an ideal candidate for a first programming language. It is an advantage for students to keep working in the language they first studied at school. Another benefit is that Python was not made for educational purposes only, but can also be used later in a professional career. It is used, for example, in network administration, web development and computer game programming, and many programs have integrated support for Python scripts.

F. Data Structures
It is important for a developer to use the correct data structure (DS) for an algorithm. This is particularly required in research-oriented coding. A language that lacks such built-in structures makes it hard for a developer to learn and use good design patterns. Python has sets, lists, dictionaries, tuples, thread-safe queues, strings, etc. Lists hold any number of data objects and can be joined, indexed, sliced, split, and used as stacks. Sets hold unique, unordered items. Dictionaries map from a unique key to any value. Heaps, similar to STL heaps, are built on top of lists. NumPy provides an n-dimensional array structure with broadcasting and matrix operations. SciPy provides image objects, sparse matrices, time-series, KD-trees and much more.
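The built-in structures listed under Data Structures can be illustrated with a short sketch. It uses only the Python standard library; the variable names (`stack`, `unique_tags`, `word_counts`, `jobs`) and the sample values are hypothetical and chosen purely for illustration.

```python
import heapq
import queue

# List used as a stack: append pushes, pop removes the most recent item.
stack = [1, 2, 3]
stack.append(4)
top = stack.pop()
first_two = stack[:2]          # slicing
joined = stack + [10, 20]      # joining two lists

# Set: duplicate items collapse and no order is guaranteed.
unique_tags = set(["ai", "ml", "ai", "nlp"])

# Dictionary: maps a unique key to any value.
word_counts = {}
for word in "to be or not to be".split():
    word_counts[word] = word_counts.get(word, 0) + 1

# Heap maintained on top of a plain list by the heapq module.
heap = [5, 1, 4, 2]
heapq.heapify(heap)
smallest = heapq.heappop(heap)

# Thread-safe FIFO queue.
jobs = queue.Queue()
jobs.put("load")
jobs.put("clean")
next_job = jobs.get()
```

The heap is an ordinary list whose ordering is maintained by `heapq`, which is what "used on top of lists" refers to.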
III. POPULAR PYTHON LIBRARIES

A. NumPy
Python contains high-level data structures such as lists, for enumerating a collection of objects, and dictionaries, for building hash tables. However, these are not able to provide high-performance numerical computation. NumPy [7], one of the most used Python libraries, derives its name from the two words Numerical Python. It is used by most developers because it provides high-performance numerical and scientific computation on multidimensional arrays. It consists of many library functions and operations for numerical computation, and it maintains good performance even while providing a high-level abstraction. It helps to cut down on Python loops, so that most of the time is spent in vectorized array operations. Python code written in this manner cuts computation time by a large amount. More performance is achieved by using optimized compilers; one example is Cython, which allows good control over cache effects. NumPy also contains many smaller sub-packages for mathematical tasks such as linear algebra, FFTs, random number generation and polynomial manipulation. SciPy, another Python library, is built on NumPy.

B. Pandas
Pandas [8], a Python library for real-world data analysis, has been under development since 2008. It aims to be the building block for data analysis. Being open source, it is developed by a community of users. It gives high performance and provides tools and data structures that are simple and easy to use. Developers use it for loading, preparing and manipulating data, followed by modeling and analysis. Figure 7 [9] shows that NumPy performs better than pandas. Despite its lower performance, pandas is preferred among users because it handles many otherwise time-consuming data science tasks, such as Boolean indexing, checking for NaNs in a dataset, and selecting or dropping a column from a dataset. None of these tasks requires loops.

Fig. 7. Performance comparison between Python, NumPy and pandas

C. TensorFlow
TensorFlow [10], an open source software library, offers strong support for ML and deep learning together with a flexible numerical computation core. Applications are written using its front-end API, while TensorFlow executes them in C++, giving high performance.

TensorFlow is used for training and running deep neural networks for natural language processing (NLP), image recognition, sequence-to-sequence models for machine translation, recurrent neural networks (RNNs), etc. Developers can also create a data flow graph, a structure that describes how data moves through a series of nodes. Each node in the graph represents a mathematical operation, while each edge between nodes is a multi-dimensional array of data known as a tensor. Instead of worrying about implementing algorithms and connecting different functions together, developers can concentrate on the main logic.

The math operations in TensorFlow are not performed in Python. The libraries provided by TensorFlow are written as C++ binaries; Python is used to direct traffic between the pieces and hook them together. TensorFlow models can be deployed across various platforms such as desktops, clusters of servers and mobile devices.

D. NLTK
NLTK [11] is a collection of libraries or modules for natural language processing (NLP) of English in Python. It was developed for research and teaching in NLP and related fields like cognitive science, information retrieval, AI, ML and empirical linguistics. Implemented as a collection of interdependent modules, it consists of a set of core modules that define the basic data types. Each of the other modules is used to perform a specific NLP task.
For example, NLTK's parser module is used for parsing, i.e. deriving the syntactic structure of a given sentence, and NLTK's tokenizer module is used for tokenizing, i.e. breaking the text into parts. It provides full support for tasks such as stemming, classification, tagging and semantic reasoning.

Consequently, more errors arise only at runtime and thus more testing is required. This can be frustrating, especially for scientists who are testing different models and want to focus more on the algorithm than on the actual code. In CPython, only one thread can execute Python bytecode at a time because of its global interpreter lock (GIL).
The training was performed using a GTX1080 GPU on Kaggle's Dog Breed Identification dataset [14] for 10 epochs. All models were trained on exactly the same data, with the same method of data loading and preprocessing. Each model was trained with the Adam and SGD optimizers, with batch size = 4 and batch size = 16 respectively. Tables III and IV show the training time of the three frameworks on the three models for Adam and SGD. With the Adam optimizer there is little difference in time between TensorFlow and PyTorch, while Keras takes noticeably more time on all three models. With SGD the picture is less uniform: PyTorch is fastest on VGG 16 and VGG 19, and Keras is clearly slower only on RESNET50. When the dataset is not very big and the focus is on rapid experimentation, Keras is preferred due to its simplicity. On the other hand, it is preferable to use TensorFlow or PyTorch when high performance is required.

TABLE III: FRAMEWORKS AND THEIR SPEED FOR ADAM OPTIMIZER

              Time (in seconds)
Framework     VGG 16    VGG 19    RESNET50
TensorFlow    21.41     24.49     10.39
Keras         27.9      31.56     20.2
PyTorch       22.28     24.58     11.35

TABLE IV: FRAMEWORKS AND THEIR SPEED FOR SGD OPTIMIZER

              Time (in seconds)
Framework     VGG 16    VGG 19    RESNET50
TensorFlow    14.41     16.29     7.3
Keras         14.12     16.51     12.41
PyTorch       11.28     12.81     7.38

VI. CONCLUSION
In this paper, we discussed the major reasons for the increasing popularity of and demand for Python. We also compared the performance of different deep learning frameworks when training a CNN. Even though Python is slower at runtime and has some design restrictions compared to compiled languages like C or C++, it is preferred by scientists and developers in data analytics, numerical computation and almost all technical domains such as machine learning, AI and deep learning. Its open source deep learning libraries such as TensorFlow, Keras and PyTorch have made the process of training a deep learning model much easier. That is why it is the first choice of even the topmost companies in the world, such as Amazon, Facebook, Spotify and Instagram, which have to deal with enormous amounts of data, for their data processing and analysis needs.

REFERENCES
[1] X. Cai, H. Langtangen and H. Moe, "On the Performance of the Python Programming Language for Serial and Parallel Scientific Computations," Scientific Programming, vol. 13, no. 1, pp. 31-56, 2005.
[2] G. V. Rossum, "Computer Programming for Everybody-A Scouting Expedition for the Programmers of Tomorrow," CNRI Proposal 90120-1a, Corporation for National Research Initiatives, 1999.
[3] P. J. Guo. (2014) "Python is now the most popular introductory teaching language at top U.S. universities". [Online]. Available: http://cacm.acm.org/blogs/blog-cacm/176450-python-is-now-the-most-popular-introductory-teaching-language-at-top-u-s-universities/fulltext, 2014
[4] R. Sand. (2012) The slideshare website. [Online]. Available: https://www.slideshare.net/blackducksoftware/open-source-by-the-numbers/
[5] G. Piatetsky. (2017) Kdnuggets website. [Online]. Available: https://www.kdnuggets.com/2017/08/python-overtakes-r-leader-analytics-data-science.html
[6] S. Behnel, R. Bradshaw, C. Citro, L. Dalcin, D. Seljebotn and K. Smith, "Cython: The Best of Both Worlds," Computing in Science & Engineering, vol. 13, no. 2, pp. 31-39, 2011.
[7] S. V. D. Walt, S. Colbert and G. Varoquaux, "The NumPy Array: A Structure for Efficient Numerical Computation," Computing in Science & Engineering, vol. 13, no. 2, pp. 22-30, 2011.
[8] W. McKinney, "Pandas: a foundational Python library for data analysis and statistics," Python for High Performance and Scientific Computing, pp. 1-9, 2011.
[9] KM Lukas. (2017) The machinelearningexp website. [Online]. Available: http://machinelearningexp.com/data-science-performance-of-python-vs-pandas-vs-numpy/
[10] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis et al., "Tensorflow: A system for large-scale machine learning," in OSDI, 2016, pp. 265-283.
[11] S. Bird and E. Loper, "NLTK: the natural language toolkit," in Proc. of the ACL 2004 on Interactive poster and demonstration sessions, 2004, p. 31.
[12] F. Pedregosa, G. Varoquaux, A. Gramfort et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825-2830, 2011.
[13] J. D. Hunter, "Matplotlib: A 2D Graphics Environment," Computing in Science & Engineering, vol. 9, no. 3, pp. 90-95, May-June 2007.
[14] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei, "Novel dataset for Fine-Grained Image Categorization: Stanford dogs," in Proc. Computer Vision and Pattern Recognition workshop on Fine-Grained Visual Categorization (FGVC), vol. 2, p. 1, 2011.