Big Data Analytics
With the advent of new technologies, devices, and communication means such as social
networking sites, the amount of data produced by mankind is growing rapidly every year,
and the growth rate itself keeps increasing. Although all this information is meaningful
and can be useful when processed, much of it is being neglected.
Big data is a term that describes the large volume of data, both structured and
unstructured, that inundates a business on a day-to-day basis. But it's not the amount of data
that's important; it's what organizations do with the data that matters. Big data can be
analysed for insights that lead to better decisions and strategic business moves.
Big data is a collection of large datasets that cannot be processed using traditional
computing techniques.
What Comes Under Big Data?
Big data involves the data produced by different devices and applications.
Black Box Data:
It is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight
crew, recordings of microphones and earphones, and the performance information of the
aircraft.
Social Media Data:
Social media sites such as Facebook and Twitter hold information and views
posted by millions of people across the globe.
Stock Exchange Data:
Stock exchange data holds information about the buy and sell decisions
made by customers on shares of different companies.
Power Grid Data:
Power grid data holds information about the power consumed by a particular node with
respect to a base station.
Transport Data:
Transport data includes the model, capacity, distance, and availability of a vehicle.
Search Engine Data:
Search engines retrieve lots of data from different databases.
Why is big data analytics important?
Big data analytics helps organizations harness their data and use it to identify new
opportunities.
1. Cost reduction.
Big data technologies such as Hadoop and cloud-based analytics bring
significant cost advantages when it comes to storing large amounts of data, and they
can identify more efficient ways of doing business.
2. Faster, better decision making.
With the speed of Hadoop and in-memory analytics, combined with the ability
to analyze new sources of data, businesses are able to analyze information
immediately and make decisions based on what they've learned.
3. New products and services.
With the ability to gauge customer needs and satisfaction through analytics
comes the power to give customers what they want. Davenport points out that with
big data analytics, more companies are creating new products to meet customers'
needs.
3) Variety (range of data types, domains and sources)
Variety refers to the different types of data we can now use. With big data
technology we can now analyse and bring together data of different types, such as messages,
social media conversations, photos, sensor data, and video or voice recordings.
There are many different types of data, and each type requires a different kind of analysis or
a different tool.
Q2. Explain the different big data types with suitable example.
It is important to be able to manage this variety of data types, because big data
encompasses everything from dollar transactions to tweets to images to audio. Taking
advantage of big data therefore requires that all this information be integrated for analysis
and data management.
The three main types of big data:-
1. Structured data
2. Unstructured data
3. Semi-structured data
1. Structured data :-
The term structured data generally refers to data that has a defined length and format and
that follows defined, repeating patterns; in other words, structured data is data organized in a
predefined format.
This kind of data accounts for about 20% of the data that is out there. It is usually
stored in a database.
For example:
Structured data includes numbers, dates, and groups of words and numbers called strings.
Relational databases (in the form of tables)
Flat files in the form of records (like tab-separated files)
Multidimensional databases
Legacy databases
The sources of data are divided into two categories:-
1. Computer- or machine-generated:-
Machine-generated data generally refers to data that is created by a machine without
human intervention.
Ex:- Sensor data, web log data, financial data.
2. Human-generated:-
This is data that humans, in interaction with computers, supply.
Ex: - Input data, Gaming-related data.
2. Unstructured data:-
Unstructured data is data that does not follow a specified format; it refers to information that
either does not have a pre-defined data model or is not organized in a predefined manner.
Unstructured information is typically text-heavy, but may also contain data such as dates,
numbers and facts. About 80% of business-relevant information originates in unstructured
form, primarily text.
Sources:-
Social media:- YouTube, Twitter, Facebook.
Mobile data:- Text messages and location information.
Call center notes, e-mails, written comments in a survey, blog entries.
Mainly two types:-
1. Machine generated :-
It refers to data that is created by a machine without human intervention.
Ex:-Satellite images, scientific data, photographs and video.
2. Human generated :-
It is generated by humans, in interaction with computers, machines etc.
Ex:-Website content, text messages.
3. Semi-structured data:-
Semi-structured data is data that has not been organized into a specialized repository
such as a database, but that nevertheless has associated information, such as metadata, that
makes it more amenable to processing than raw data.
Schema-less or self-describing structure refers to a form of structured data that
contains tags or mark-up elements in order to separate elements and generate hierarchies of
records and fields in the given data. Semi-structured data is a kind of data that falls between
structured and unstructured data.
Sources:-
File systems such as web data in the form of cookies.
Web server log and search patterns.
Sensor data.
Q3. Write down the difference between traditional Analytics & Big Data
Analytics.
Big Data Analytics vs. Traditional Analytics
1. Data type: big data analytics uses structured, unstructured and semi-structured data; traditional analytics uses structured data formatted in rows and columns.
2. Volume: big data analytics processes very large volumes, roughly 100 terabytes up to zettabytes; traditional analytics processes tens of terabytes or less.
3. Data flow: in big data analytics there is a continuous flow of data; in traditional analytics there is no continuous flow of data.
4. Decision support: big data analytics supports decisions based on real-time data and gives valuable business information and insights; traditional analytics supports decisions based on historical data.
5. Data sources: big data analytics collects data from inside or outside the organization (for example device data, social media, sensor data and log data); traditional analytics collects data from internal sources, such as records from transactional systems.
6. Hardware: big data analytics uses inexpensive commodity boxes as cluster nodes; traditional analytics uses specialized high-end hardware and software.
7. Distribution: in big data analytics the data is often physically distributed; in traditional analytics all data is centralized.
8. Relationships: big data contains massive, voluminous data, which increases the difficulty of figuring out relationships between data items; in traditional analytics the relationships between data items can be explored easily because the amount of stored information is small.
9. Storage: big data is stored in HDFS or NoSQL stores; traditional data is stored in an RDBMS.
Q4. What is Big Data Analytics? Explain its different types.
Predictive Analytics:
This type of analysis looks at likely scenarios of what might happen.
The deliverable is usually a predictive forecast.
Predictive analytics uses big data to identify past patterns in order to predict the future.
For example, some companies are using predictive analytics for sales lead scoring.
Some companies have gone one step further and use predictive analytics for the entire sales
process, drawing on social media, documents, CRM data, etc.
Properly tuned predictive analytics can be used to support sales, marketing, or for other
types of complex forecasts.
Predictive analytics include next best offers, churn risk and renewal risk analysis.
Features:
1. Forward looking
2. Focused on non-discrete predictions of future states, relationships, and patterns
3. Describes the prediction result set in terms of probability distributions and likelihoods
4. Model application
5. Non-discrete forecasting (forecasts communicated as probability distributions)
Prescriptive analytics:
This type of analysis reveals what actions should be taken.
This is the most valuable kind of analysis and usually results in rules and
recommendations for the next step.
This analysis is really valuable, but it is not yet widely used: about 13% of organizations use
predictive analysis of big data and only 3% use prescriptive analysis of big data.
It gives you a laser-like focus on a particular question.
It shows the best solution among a variety of choices, given the known parameters, and
suggests options for taking advantage of a future opportunity or mitigating a future risk.
Features:
1. Forward looking
2. Focused on optimal decisions for future situations
3. Simple rules to complex models that are applied on an automated or programmatic
basis
4. Optimization and decision rules for future events
Diagnostic analytics:
It gives a look at past performance to determine what happened and why.
The result of the analysis is often an analytical dashboard.
They are used for discovery or to determine why something happened in the first place.
For example, for a social media marketing campaign, you can use descriptive analytics to
assess the number of posts, mentions, followers, fans, page views, reviews, pins, etc.
There can be thousands of online mentions that can be distilled into a single view to see
what worked in your past campaigns and what didn't.
Features:
1. Backward looking and Focused on causal relationships and sequences
2. Target/dependent variable with independent variables/dimensions
Q5. Enlist and explain the different technologies used for handling Big Data.
In-memory computing (IMC) improves the performance and scalability of applications for
research and business organizations.
Hadoop
This technology develops open source software for reliable, scalable, distributed
computation.
The Apache Hadoop software library is a framework that allows for the distributed
processing of large datasets across clusters of computers using simple programming
models.
It is used to distribute the processing of big data datasets using the MapReduce programming
model.
All the modules in Hadoop are designed with the fundamental assumption that hardware
failures are common occurrences and should be automatically handled by the framework.
Hadoop performs well with several nodes without requiring shared memory or disks
among them.
Hadoop follows a client-server architecture in which the server works as a master and is
responsible for data distribution among the clients, which are commodity machines that work
as slaves and carry out all the computational tasks. The master node also performs the tasks of
job control, disk management and work allocation.
Modules in Hadoop:-
1. Hadoop Common:- contains the libraries and utilities needed by the other Hadoop modules.
2. Hadoop Distributed File System (HDFS):- a distributed file system that stores data
on commodity machines.
3. Hadoop YARN:- a platform responsible for managing computing resources and
user applications.
4. Hadoop MapReduce:- an implementation of the MapReduce programming model for
large-scale data processing (a word-count sketch of this model is shown below).
Cloud delivery models for Big Data
IaaS (Infrastructure as a Service):- The huge storage and computational power
requirements of big data are fulfilled by the practically limitless storage space and computing
ability obtained from an IaaS cloud.
PaaS (Platform as a Service):- The PaaS offerings of various vendors have started adding
popular big data platforms such as MapReduce and Hadoop. These offerings
save organizations from many of the hassles that occur in managing individual
hardware components and software applications.
SaaS (Software as a Service):- An organization may need to identify and analyze the
voice of the customer, particularly on social media platforms. The social media data and the
platform for analyzing that data are provided by SaaS vendors.
Q6. Explain different Big Data Business models.
Information as a Service (IaaS)
IaaS focuses on providing insights based on the analysis of processed data. In this
case the customer's job-to-be-done is more about coming up with their own
conclusions, or even selling an idea, based on certain information.
Additionally, IaaS customers don't want to, or do not have the resources to, process
and analyze data. Rather, they are willing to exchange value for analysis from trusted
parties.
Unlike the DaaS business model, which is about aggregation and dissemination of lots
of processed data for customers to create their own value propositions from, the IaaS
business model is all about turning data into information for customers who need
something more tailored and are willing to pay for it.
4. This approach lowers the risk of catastrophic system failure, even if a significant number
of nodes become inoperative.
5. Hadoop was inspired by Google's MapReduce, a software framework in which an
application is broken down into numerous small parts. Any of these parts (also called
fragments or blocks) can be run on any node in the cluster.
6. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy
elephant.
7. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the
Hadoop distributed file system (HDFS) and a number of related projects such as Apache
Hive, HBase and Zookeeper.
8. The Hadoop framework is used by major players including Google, Yahoo and IBM,
largely for applications involving search engines and advertising.
2. Code to data
Traditional data processing architecture
Nodes are broken up into separate processing and storage nodes connected by high-
capacity links.
Many data-intensive applications are not CPU-demanding, so moving large volumes of data
to the processing nodes causes bottlenecks in the network.
4. Abstract Complexity
Hadoop abstracts away many of the complexities of distributed and concurrent applications:
It defines a small number of components.
It provides simple and well-defined interfaces for interaction between these components.
It frees the developer from worrying about system-level challenges such as race conditions,
data starvation, processing pipelines, data partitioning, code distribution, etc.
This allows developers to focus on application development and business logic.
Q8. Write a note on the different problems for which Hadoop is suitable.
Aggregate data, score data, and present the score as a rank: at its simplest, this is what
Hadoop can do. The introduction of the idea of a data sandbox, together with the ability
(using Sqoop) to push the analysed data back into a relational database (for a data warehouse,
for example), means that you can run Hadoop independently and prove its worth in your
business very cheaply.
Q10. Explain Hadoop ECO system.
The Hadoop ecosystem refers to the various components of the Apache Hadoop
software library, as well as to the accessories and tools provided by the Apache Software
Foundation for these types of software projects, and to the ways that they work together.
Hadoop is a Java-based framework that is extremely popular for handling and analyzing large
sets of data.
1. MapReduce:-
MapReduce is now the most widely used general-purpose computing model and
runtime system for distributed data analytics. MapReduce is based on a parallel
programming framework for processing large amounts of data dispersed across different
systems. The process is initiated when a user request is received to execute the MapReduce
program and terminated once the results are written back to HDFS. MapReduce enables the
computational processing of data stored in a file system without the requirement of first
loading the data into a database.
2. Pig:-
Pig is a platform for constructing data flows for Extract, Transform & Load (ETL)
processing and analysis of large data sets. Pig Latin, the programming language for Pig,
provides common data manipulation operations such as grouping, joining and filtering. Pig
generates Hadoop MapReduce jobs to perform the data flows. The Pig Latin scripting language
is not only a higher-level data flow language but also has operators similar to SQL (e.g.
FILTER, JOIN).
3. Hive:-
Hive is a SQL-based data warehouse system for Hadoop that facilitates data
summarization, ad-hoc queries, and the analysis of large data sets stored in Hadoop-
compatible file systems (e.g. HDFS) and some NoSQL databases.
Hive is not a relational database but a query engine that supports the parts of SQL specific to
querying data, with some additional support for writing new tables or files, but not for
updating individual records.
4. HDFS:-
HDFS is an effective, scalable, fault-tolerant, and distributed approach for storing and
managing huge volumes of data. HDFS follows a write-once, read-many-times approach, and
this, together with the replication of data, makes it capable of handling huge volumes of data
with very little chance of error.
5. Hadoop YARN:-
YARN is a core Hadoop Service that supports two major services:-
1. Resource Manager
2. Application Master
The Resource Manager is a master service that manages the Node Manager in each
node of the cluster. It also has a scheduler that allocates system resources to specific running
applications; the scheduler, however, does not track the application's status. The resource
container stores all the needed system information and maintains detailed resource attributes
that are important for running applications on the node and in the cluster. The Application
Master notifies the Node Manager if more resources are required for executing the
application.
6. HBASE:-
HBASE is one of the projects of the Apache Software Foundation, distributed
under the Apache Software License v2.0. It is a non-relational database suitable for distributed
environments and uses HDFS as its persistent storage. HBASE facilitates reading and writing
Big Data randomly and efficiently in real time. It is highly configurable, allows efficient
management of huge amounts of data, and helps in dealing with Big Data challenges in many
ways.
7. Sqoop:-
Sqoop is a tool for data transfer between Hadoop and relational databases. MapReduce
processes are employed to move data into Hadoop and back out to other data
sources. Sqoop is a command-line interpreter which sequentially executes Sqoop commands.
Sqoop operates by selecting an appropriate import function for the source data from the
specified database.
8. Zookeeper:-
Zookeeper helps in coordinating all the elements of distributed applications.
Zookeeper enables different nodes of a service to communicate and coordinate with each
other and also find other master IP addresses. Zookeeper provides a central location for
keeping information, thus acting as a coordinator that makes the stored information available
to all nodes of a service.
9. Flume:-
Flume aids in transferring large amounts of data from distributed resources to a single
centralized repository. It is robust and fault tolerant, and it efficiently collects, assembles, and
transfers data. Flume is used for real-time data capture in Hadoop. The simple and
extensible data model of Flume facilitates fast online data analytics.
10. Oozie :-
Oozie is an open source Apache Hadoop service used to manage and process
submitted jobs. It supports the workflow/coordination model and is highly extensible and
scalable. Oozie is a data-aware service that coordinates dependencies among different jobs
executing on different Hadoop platforms such as HDFS, Pig, and MapReduce.
11. Mahout:-
Mahout is a scalable machine learning and data mining library. There are currently four
main groups of algorithms in Mahout:-
1. Recommendations, or collaborative filtering
2. Classification, categorization
3. Clustering
4. Frequent itemset mining, parallel frequent pattern mining.
Q11. Explain the storing & querying (reading & writing) the Big Data in
HDFS.
NAMENODE
The NameNode runs on commodity hardware that contains the GNU/Linux operating system
and the namenode software. The namenode is software that can be run on commodity hardware.
The system having the NameNode acts as the master server, and it does the following tasks:
Manages the file system namespace.
Regulates clients' access to files.
Executes file system operations such as renaming, closing, and opening files and
directories.
DATANODE
The DataNode is commodity hardware having the GNU/Linux operating system and
the datanode software. For every node (commodity hardware/system) in a cluster, there will
be a DataNode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client request.
They also perform operations such as block creation, deletion, and replication according
to the instructions of the namenode.
BLOCK
Generally the user data is stored in the files of HDFS.
The file in a file system will be divided into one or more segments and/or stored in
individual data nodes.
These file segments are called blocks. In other words, the minimum amount of data
that HDFS can read or write is called a block.
The default block size is 64 MB, but it can be changed as needed in the HDFS
configuration.
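As a rough illustration of how a client writes and then reads a file stored in HDFS blocks, the following sketch uses the Hadoop FileSystem Java API. The NameNode address hdfs://localhost:9000 and the path /user/hadoop/demo.txt are placeholder assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://localhost:9000");   // placeholder NameNode address
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/hadoop/demo.txt");       // placeholder path

    // Write once: the client asks the NameNode where to place blocks,
    // then streams the data to DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many: open the file and stream it back.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      System.out.println(reader.readLine());
    }

    // Block size and replication factor are per-file metadata kept by the NameNode.
    FileStatus status = fs.getFileStatus(file);
    System.out.println("block size = " + status.getBlockSize()
        + ", replication = " + status.getReplication());

    fs.close();
  }
}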
Q12. Draw & explain the Hadoop architecture
Hadoop follows a master slave architecture design for data storage and distributed
data processing using HDFS and MapReduce respectively.
The master node for data storage in Hadoop HDFS is the NameNode, and the master
node for parallel processing of data using Hadoop MapReduce is the Job Tracker.
The slave nodes in the hadoop architecture are the other machines in the Hadoop
cluster which store data and perform complex computations.
Every slave node has a Task Tracker daemon and a DataNode that synchronizes the
processes with the Job Tracker and NameNode respectively.
A file on HDFS is split into multiple blocks, and each block is replicated within the Hadoop
cluster. The Hadoop Distributed File System (HDFS) stores the application data and the file
system metadata separately on dedicated servers.
It has two components: NameNode and DataNode
NameNode
File system metadata is stored on servers referred to as NameNode. All the files and
directories in the HDFS are represented on the NameNode. NameNode maps the entire
file system structure into memory.
DataNode
Application data is stored on servers referred to as DataNodes. HDFS replicates the
file content on multiple DataNodes. A DataNode manages the state of an HDFS node and
interacts with its blocks. A DataNode can perform CPU-intensive jobs and I/O-intensive
jobs such as clustering, data import, data export, search, decompression, and indexing.
On receiving the job configuration, the Job Tracker identifies the number of splits
based on the input path and selects Task Trackers based on their network vicinity to
the data sources. The Job Tracker then sends a request to the selected Task Trackers.
The processing of the Map phase begins where the Task Tracker extracts the input
data from the splits. Map function is invoked for each record. On completion of the
map task, Task Tracker notifies the Job Tracker.
When all Task Trackers are done, the Job Tracker notifies the selected Task Trackers
to begin the reduce phase. Task Tracker reads the region files and sorts the key-value
pairs for each key. The reduce function is then invoked which collects the aggregated
values into the output file.
Options
The -R option will make the change recursively through the
directory structure.
4. Chown Usage: hdfs dfs -chown [-R] [OWNER][:[GROUP]] URI [URI ]
Change the owner of files. The user must be a super-user. Additional
information is in the Permissions Guide.
Options
The -R option will make the change recursively through the
directory structure.
5. copyFromLocal Usage: hdfs dfs -copyFromLocal <localsrc> URI
Similar to put command, except that the source is restricted to a local
file reference.
Options:
The -f option will overwrite the destination if it already
exists.
6. copyToLocal Usage: hdfs dfs -copyToLocal [-ignorecrc] [-crc] URI <localdst>
Similar to get command, except that the destination is restricted to a
local file reference.
Example (of the related cp command, which copies files within HDFS):
hdfs dfs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
Exit Code:
Returns 0 on success and -1 on error.
7. moveToLocal Usage: hdfs dfs -moveToLocal [-crc] <src> <dst>
Displays a "Not implemented yet" message.
Q14. Write a note on Hadoop advantages & challenges.
Hadoop has also proven valuable for many other more traditional enterprises based on some
of its big advantages:
1. Scalable
Hadoop is a highly scalable storage platform, because it can store and distribute very
large data sets across hundreds of inexpensive servers that operate in parallel. Unlike
traditional relational database systems (RDBMS) that can't scale to process large amounts of
data, Hadoop enables businesses to run applications on thousands of nodes involving
thousands of terabytes of data.
2. Cost effective
Hadoop also offers a cost-effective storage solution for businesses' exploding data
sets. The problem with traditional relational database management systems is that it is
extremely cost prohibitive to scale to such a degree in order to process such massive volumes
of data. In an effort to reduce costs, many companies in the past would have had to down-
sample data and classify it based on certain assumptions as to which data was the most
valuable. The cost savings are staggering: instead of costing thousands to tens of thousands of
pounds per terabyte, Hadoop offers computing and storage capabilities for hundreds of
pounds per terabyte.
3. Flexible
Hadoop enables businesses to easily access new data sources and tap into different
types of data (both structured and unstructured) to generate value from that data. This means
businesses can use Hadoop to derive valuable business insights from data sources such as
social media, email conversations or clickstream data.
4. Fast
Hadoop's unique storage method is based on a distributed file system that basically maps
data wherever it is located on a cluster. The tools for data processing are often on the same
servers where the data is located, resulting in much faster data processing. If you're dealing
with large volumes of unstructured data, Hadoop is able to efficiently process terabytes of
data in just minutes, and petabytes in hours.
5. Advanced data analysis can be done in house
Hadoop makes it practical to work with large data sets and customize the outcome without
having to outsource the task to specialist service providers. Keeping operations in house helps
organizations be more agile, while also avoiding the ongoing operational expense of
outsourcing.
6. Run a commodity vs. custom architecture
Some of the tasks that Hadoop is being used for today were formerly run by MPCC and other
specialty, expensive computer systems. Hadoop commonly runs on commodity hardware.
Because it is the de facto big data standard, it is supported by a large and competitive solution
provider community, which protects customers from vendor lock-in.
Challenges of Hadoop:
1. Hadoop is a cutting edge technology
Hadoop is a new technology, and as with adopting any new technology, finding people who
know the technology is difficult!
2. Hadoop in the Enterprise Ecosystem
Hadoop is designed to solve Big Data problems encountered by Web and Social companies.
In doing so a lot of the features Enterprises need or want are put on the back burner. For
example, HDFS does not offer native support for security and authentication.
3. Hadoop is still rough around the edges
The development and admin tools for Hadoop are still pretty new. Companies like Cloudera,
Hortonworks, MapR and Karmasphere have been working on this issue. However, the tooling
may not be as mature as enterprises are used to (say, Oracle admin tools, etc.).
4. Hadoop is NOT cheap
Hardware Cost
Hadoop runs on 'commodity' hardware. But these are not cheap machines; they are server-
grade hardware.
So standing up a reasonably large Hadoop cluster, say 100 nodes, will cost a significant
amount of money.
IT and Operations costs
A large Hadoop cluster will require support from various teams like : Network Admins, IT,
Security Admins, System Admins.
Also one needs to think about operational costs like Data Center expenses: cooling,
electricity, etc.
5. Map Reduce is a different programming paradigm
Solving problems using Map Reduce requires a different kind of thinking. Engineering teams
generally need additional training to take advantage of Hadoop.
Q15. Explain different features of HBase.
1. HBase is not an eventually consistent data store; it provides strongly consistent reads and
writes, which makes it very suitable for tasks such as high-speed counter aggregation
(strongly consistent reads/writes).
2. HBase tables are distributed on the cluster via regions, and regions are automatically split
and re-distributed as your data grows (automatic sharding).
3. Automatic RegionServer failover.
4. HBase supports HDFS out of the box as its distributed file system (Hadoop/HDFS integration).
5. HBase supports massively parallelized processing via MapReduce, using HBase as both
source and sink (MapReduce).
6. HBase supports an easy-to-use Java API for programmatic access (Java Client API); a
minimal sketch using this API is shown after this list.
7. HBase also supports Thrift and REST for non-Java front-ends (Thrift/REST API).
8. HBase supports a Block Cache and Bloom Filters for high-volume query
optimization (Block Cache and Bloom Filters).
9. HBase provides built-in web pages for operational insight as well as JMX
metrics (Operational Management).
Use of HBase:
1. We should use HBase only when we have millions or billions of rows and columns in a
table; otherwise it is better to go with an RDBMS (which is used when we have thousands of
rows and columns).
2. An RDBMS runs on a single database server, but HBase is distributed and scalable and also
runs on commodity hardware.
3. Typed columns, secondary indexes, transactions, advanced query languages, etc. are
features provided by an RDBMS, not by HBase.
Q16. Write down the difference between HDFS and HBase
Q17. Write down the difference between RDBMS and HBase
RDBMS vs. HBase
1. An RDBMS is governed by its schema, hence it follows a fixed schema that describes the whole structure of the tables; HBase is schema-less, it doesn't follow the concept of a fixed schema, hence it has flexible schemas.
2. An RDBMS is mostly row oriented; HBase is column oriented and defines only column families.
3. An RDBMS doesn't natively scale to distributed storage; HBase is a distributed, versioned data storage system.
4. Since an RDBMS has a fixed schema, it doesn't support dynamic addition of columns; HBase supports dynamic addition of columns to the table schema.
5. An RDBMS is built for narrow tables; HBase is built for wide tables.
6. An RDBMS is vertically scaled but hard to scale; HBase is horizontally scaled.
7. An RDBMS is not optimized for sparse tables; HBase is good with sparse tables.
8. An RDBMS has SQL as its query language; HBase has no query language, only three commands: put, get and scan.
9. An RDBMS supports secondary indexes and improves data retrieval through SQL; in HBase, HDFS is the underlying layer providing fault tolerance and linear scalability, but HBase doesn't support secondary indexes and stores data as key-value pairs.
10. An RDBMS is transactional; there are no transactions in HBase.
11. An RDBMS has normalized data; HBase has de-normalized data.
12. An RDBMS is good for structured data; HBase is good for semi-structured as well as structured data.
13. In an RDBMS the maximum data size is in terabytes; in HBase the maximum data size is in hundreds of petabytes.
14. RDBMS read/write throughput limits are thousands of queries per second; HBase read/write throughput limits are millions of queries per second.
15. RDBMS database technology is very proven, consistent, mature and highly supported by the world's best companies, hence it is more appropriate for real-time OLTP processing; HBase helps Hadoop overcome the challenges in random reads and writes.
Q18. Explain the storage mechanism of HBase
HBase is a column-oriented database and the tables in it are sorted by row. The table
schema defines only column families, which are the key-value pairs. A table can have multiple
column families, and each column family can have any number of columns. Subsequent
column values are stored contiguously on the disk. Each cell value of the table has a
timestamp.
Example schema of table in HBase:
Column:
A column in HBase consists of a column family and a column qualifier, which are
delimited by a colon (:) character.
Column Family:
Column families physically colocate a set of columns and their values, often for
performance reasons. Each column family has a set of storage properties, such as whether its
values should be cached in memory, how its data is compressed or its row keys are encoded,
and others. Each row in a table has the same column families, though a given row might not
store anything in a given column family.
Following image shows column families in a column-oriented database:
Timestamp:
A timestamp is written alongside each value, and is the identifier for a given version
of a value. By default, the timestamp represents the time on the RegionServer when the
data was written, but you can specify a different timestamp value when you put data into
the cell.
Conceptual View:
At a conceptual level, tables may be viewed as a sparse set of rows, although they are
physically stored by column family.
Cells in this table that appear to be empty do not take space, or in fact exist, in HBase. This is
what makes HBase "sparse."
Row Key          Time Stamp   ColumnFamily contents        ColumnFamily anchor
"com.cnn.www"    t9                                        anchor:cnnsi.com = "CNN"
"com.cnn.www"    t8                                        anchor:my.look.ca = "CNN.com"
"com.cnn.www"    t6           contents:html = "<html>"
Physical View:
A new column qualifier (column_family:column_qualifier) can be added to an
existing column family at any time.
The empty cells shown in the conceptual view are not stored at all. Thus a request for the
value of the contents:html column at time stamp t8 would return no value.
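A minimal sketch of how this row key / column family:qualifier / timestamp model looks through the HBase Java client API. The table name "webtable" is hypothetical and is assumed to already exist with an "anchor" column family, mirroring the conceptual view above.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WebTableClient {
  public static void main(String[] args) throws Exception {
    try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = connection.getTable(TableName.valueOf("webtable"))) {  // hypothetical table

      // Write one cell: row key, column family, column qualifier, value.
      // HBase stamps the cell with the RegionServer time unless a timestamp is supplied.
      Put put = new Put(Bytes.toBytes("com.cnn.www"));
      put.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"), Bytes.toBytes("CNN"));
      table.put(put);

      // Read the latest version of that cell back.
      Get get = new Get(Bytes.toBytes("com.cnn.www"));
      get.addColumn(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"));
      Result result = table.get(get);
      System.out.println(Bytes.toString(
          result.getValue(Bytes.toBytes("anchor"), Bytes.toBytes("cnnsi.com"))));
    }
  }
}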
HBase architecture consists mainly of four components:
1. HMaster
2. HRegionServer
3. HRegion
4. Zookeeper
HMaster
HMaster is the implementation of the Master server in the HBase architecture. It acts as a
monitoring agent that monitors all Region Server instances present in the cluster, and it acts as
an interface for all metadata changes.
In a distributed cluster environment, Master runs on NameNode. Master runs several
background threads.
A master is responsible for:
Coordinating the region servers
Assigning regions on start-up , re-assigning regions for recovery or load balancing
Monitoring all RegionServer instances in the cluster
Interface for creating, deleting, updating tables
HRegionServer
HRegionServer is the Region Server implementation. It is responsible for serving and
managing regions or data that is present in distributed cluster.
The region servers run on Data Nodes present in the Hadoop cluster.
HRegion servers perform the following functions:
Hosting and managing regions
Splitting regions automatically
Handling read and write requests
Communicating with the client directly
HRegion
An HRegion is the basic building element of an HBase cluster; regions make up the distribution
of tables and are comprised of column families.
It contains multiple stores, one for each column family. It consists of mainly two
components, which are Memstore and HFile.
1. Memstore
When something is written to HBase, it is first written to an in-memory store
(memstore), once this memstore reaches a certain size, it is flushed to disk into a store file
(and is also written immediately to a log file for durability). The store files created on disk are
immutable.
2. HFile
An HFile contains a multi-layered index which allows HBase to seek the data without
having to read the whole file. The data is stored in HFile in key-value pair in increasing
order.
ZooKeeper
HBase uses ZooKeeper as a distributed coordination service to maintain server state in
the cluster.
Zookeeper maintains which servers are alive and available, and provides server failure
notification. Zookeeper uses consensus to guarantee common shared state.
Services provided by ZooKeeper are as follows :
Maintains Configuration information
Provides distributed synchronization
Client Communication establishment with region servers
Enables master servers to discover available servers in the cluster using ephemeral nodes
Tracks server failures and network partitions
HLog
HLog is the HBase Write Ahead Log (WAL) implementation and there is one HLog
instance per RegionServer.
Each RegionServer adds updates (Puts, Deletes) to its write-ahead log first and then to
the Memstore.
Q20. What is Pig? Explain its advantages/ benefits.
Pig is a high level scripting language that is used with Apache Hadoop. Pig enables
data workers to write complex data transformations without knowing Java.
Pig's simple SQL-like scripting language is called Pig Latin, and it appeals to developers
already familiar with scripting languages and SQL.
Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig.
Through the User Defined Functions(UDF) facility in Pig, Pig can invoke code in many
languages like JRuby, Jython and Java. You can also embed Pig scripts in other languages.
The result is that you can use Pig as a component to build larger and more complex
applications that tackle real business problems.
Pig works with data from many sources, including structured and unstructured data, and stores
the results in the Hadoop Distributed File System. Pig scripts are translated into a series of
MapReduce jobs that are run on the Apache Hadoop cluster.
Advantages of PIG
1. Decrease in development time. This is the biggest advantage especially considering
vanilla map-reduce jobs' complexity, time-spent and maintenance of the programs.
2. The learning curve is not steep; anyone who does not know how to write vanilla map-
reduce, or SQL for that matter, can pick it up and write map-reduce jobs. It is not easy
to master, though.
3. Procedural, not declarative unlike SQL, so it is easier to follow the commands and it
provides better expressiveness in the transformation of data at every step. Compared to
vanilla map-reduce, it is much more like an English language. It is concise, and unlike
Java it reads more like Python.
4. Since it is procedural, you can control the execution of every step. If you want to
write your own UDF (User Defined Function) and inject it at one specific point in the
pipeline, it is straightforward (a minimal Java UDF sketch is shown after this list).
5. Speaking of UDFs, you could also write your UDFs in Python.
6. Lazy evaluation: unless you produce an output file or output a message, nothing gets
evaluated. This has an advantage for the logical plan: the program can be optimized from
beginning to end, and the optimizer can produce an efficient plan to execute.
7. Enjoys everything that Hadoop offers: parallelization and fault tolerance, along with many
relational-database features.
8. It is quite effective for unstructured and messy large datasets. In fact, Pig is one of
the best tools for turning large unstructured data into structured data.
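As mentioned in points 4 and 5 above, UDFs extend Pig. A minimal sketch of a Java UDF, assuming the Pig libraries are on the classpath and using a made-up function/class name, looks like this:

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF that upper-cases its single chararray argument.
public class UpperCaseUdf extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;                       // Pig treats null as "no value"
    }
    return input.get(0).toString().toUpperCase();
  }
}

In a Pig Latin script such a UDF would be registered with REGISTER (pointing at the jar containing the class) and then invoked by its fully qualified class name, just like a built-in function.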
Q21. Write a note on Pig (like what is pig, its components, its different user
& for what it is used)
Apache Pig is a high level data flow platform for executing MapReduce programs of
Hadoop.
The language for Pig is Pig Latin. Pig scripts get internally converted to MapReduce jobs
and get executed on data stored in HDFS.
Every task which can be achieved using Pig can also be achieved using Java in MapReduce.
Apache Pig Components :
As shown in the figure, there are various components in the Apache Pig framework.
Let us take a look at the major components.
1. Parser
Initially the Pig Scripts are handled by the Parser. It checks the syntax of the script,
does type checking, and other miscellaneous checks. The output of the parser will be a DAG
(directed acyclic graph), which represents the Pig Latin statements and logical operators.
In the DAG, the logical operators of the script are represented as the nodes and the data flows
are represented as edges.
2. Optimizer
The logical plan (DAG) is passed to the logical optimizer, which carries out the
logical optimizations such as projection and pushdown.
3. Compiler
The compiler compiles the optimized logical plan into a series of MapReduce jobs.
Execution engine
The MapReduce jobs are then submitted to Hadoop in a sorted order. Finally, these
MapReduce jobs are executed on Hadoop, producing the desired results.
4. USES:
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc
processing and quick prototyping. Apache Pig is used
To process huge data sources such as web logs.
To perform data processing for search platforms.
To process time sensitive data loads.
The Pig programming language is designed to handle any kind of data. Pig is made up
of two components:
the first is the language itself, which is called PigLatin, and the second is a runtime
environment where PigLatin programs are executed.
Pig Latin statements are the basic constructs you use to process data using Pig. A Pig Latin
statement is an operator that takes a relation as input and produces another relation as output.
(This definition applies to all Pig Latin operators except LOAD and STORE, which read data
from and write data to the file system.)
Pig Latin statements may include expressions and schemas.
Pig Latin statements can span multiple lines and must end with a semicolon ( ; ).
By default, Pig Latin statements are processed using multi-query execution.
Pig Latin statements are generally organized as follows:
-A LOAD statement to read data from the file system.
-A series of "transformation" statements to process the data.
-A DUMP statement to view results or a STORE statement to save the results.
Note that a DUMP or STORE statement is required to generate output.
Example :
A = LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float);
B = FOREACH A GENERATE name;
DUMP B;
Output:
(John)
(Mary)
(Bill)
(Joe)
Pig's language layer currently consists of a textual language called Pig Latin, which has
the following key properties:
1. Ease of programming.
It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data
analysis tasks.
Complex tasks comprised of multiple interrelated data transformations are explicitly encoded
as data flow sequences, making them easy to write, understand, and maintain.
2. Optimization opportunities.
The way in which tasks are encoded permits the system to optimize their execution
Automatically, allowing the user to focus on semantics rather than efficiency.
3. Extensibility.
Users can create their own functions to do special-purpose processing.
Q23. Explain relational operators in PigLatin with syntax, example.
Apache Pig, developed by Yahoo!, helps in analysing large datasets so that developers spend
less time writing mapper and reducer programs.
Pig enables users to write complex data analysis code without prior knowledge of Java. Pig's
simple SQL-like scripting language is called Pig Latin, and it has its own Pig runtime
environment where Pig Latin programs are executed.
Relational Operators:
1. FOREACH
Generates data transformations based on columns of data.
Syntax
alias = FOREACH { gen_blk | nested_gen_blk } [AS schema];
Example:
A = load 'input' as (user:chararray, id:long, address:chararray, phone:chararray,
preferences:map[]);
B = foreach A generate user, id;
2. LOAD:
LOAD operator is used to load data from the file system or HDFS storage into a Pig
relation.
Syntax :
LOAD 'data' [USING function] [AS schema];
Example:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING
PigStorage(',') as ( id:int, firstname:chararray, lastname:chararray, phone:chararray,
city:chararray );
grunt> student: { id: int,firstname: chararray,lastname: chararray,phone: chararray,city:
chararray }
3. FILTER:
This operator selects tuples from a relation based on a condition.
Syntax
grunt> Relation2_name = FILTER Relation1_name BY (condition);
Example:
X = FILTER A BY f3 ==3;
4. JOIN:
The JOIN operator is used to perform an inner equijoin of two or more relations
based on common field values. The JOIN operator always performs an inner join. Inner joins
ignore null keys, so it makes sense to filter them out before the join.
Syntax
alias = JOIN alias BY {expression|'('expression [, expression ]')'} (, alias BY
{expression|'('expression [, expression ]')'} ) [USING 'replicated' | 'skewed' | 'merge']
[PARALLEL n];
Example:
grunt> customers3 = JOIN customers1 BY id, customers2 BY id;
5. ORDER BY:
Order By is used to sort a relation based on one or more fields. You can do sorting in
ascending or descending order using ASC and DESC keywords.
Syntax
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias
[ASC|DESC] ] } [PARALLEL n];
Example:
grunt> order_by_data = ORDER student_details BY age DESC;
6. DISTINCT:
Distinct removes duplicate tuples in a relation.
Syntax
grunt> Relation_name2 = DISTINCT Relation_name1;
Example:
grunt> distinct_data = DISTINCT student_details;
7. STORE:
Store is used to save results to the file system.
Syntax:
STORE Relation_name INTO ' required_directory_path ' [USING function];
Example:
grunt> STORE student INTO ' hdfs://localhost:9000/pig_Output/ ' USING PigStorage (',');
8. GROUP:
The GROUP operator groups together the tuples with the same group key (key field).
The key field will be a tuple if the group key has more than one field, otherwise it will be the
same type as that of the group key. The result of a GROUP operation is a relation that
includes one tuple per group.
Syntax
alias = GROUP alias { ALL | BY expression} [, alias ALL | BY expression ] [USING
'collected'] [PARALLEL n];
Example:
grunt> group_data = GROUP student_details by age;
9. LIMIT:
LIMIT operator is used to limit the number of output tuples.
Syntax
grunt> Result = LIMIT Relation_name required number of tuples
Example:
grunt> limit_data = LIMIT student_details 4;
10. SPLIT:
The SPLIT operator is used to partition the contents of a relation into two or more
relations, depending on the conditions stated in an expression.
Syntax
grunt> SPLIT Relation1_name INTO Relation2_name IF (condition1), Relation3_name IF
(condition2);
Example:
SPLIT student_details into student_details1 if age<23, student_details2 if (22<age and
age>25);
Q24. Explain the data types such as tuple, bag, relation and map in
PigLatin
The data model of Pig is fully nested. A relation is the outermost structure of the Pig
Latin data model, and it is a bag, where:
A bag is a collection of tuples.
A tuple is an ordered set of fields.
A field is a piece of data.
Example of a simple (numeric) value: 60708090709
Complex Types: the complex types in Pig are the tuple, the bag and the map, as described above.
Null Values
Values for all the above data types can be NULL. Apache Pig treats null values in a
similar way as SQL does.
A null can be an unknown value or a non-existent value. It is used as a placeholder for
optional values. These nulls can occur naturally or can be the result of an operation.
Relational Operations
Operator and Description
LOAD: to load the data from the file system (local/HDFS) into a relation.
Filtering operators (e.g. FILTER, DISTINCT).
Sorting operators (e.g. ORDER BY, LIMIT).
UNION: to combine two or more relations into a single relation.
Diagnostic Operators (e.g. DUMP, DESCRIBE, EXPLAIN).
Hive is an open source volunteer project under the Apache Software Foundation. Hive
provides a mechanism to project structure onto data stored in Hadoop and to query it using a
Structured Query Language (SQL)-like syntax called HiveQL (Hive Query Language).
This language also lets traditional map/reduce programmers plug in their custom mappers
and reducers when it's inconvenient or inefficient to express this logic in HiveQL.
It supports SQL-like access to structured data, as well as Big Data analysis with the help
of MapReduce.
HQL statements are broken down by the Hive service into MapReduce jobs and executed
across a Hadoop cluster.
Hive's primary responsibility is to provide data summarization, query and analysis.
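A minimal sketch of submitting a HiveQL statement from Java over JDBC, assuming HiveServer2 is running on localhost:10000, the hive-jdbc driver is on the classpath, and an employee table like the one in the CREATE TABLE example below exists; connection details and credentials are placeholder assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQlExample {
  public static void main(String[] args) throws Exception {
    // Explicit driver load (not strictly required with JDBC 4 auto-registration).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // HiveServer2 JDBC URL; host, port, database and user are placeholder assumptions.
    String url = "jdbc:hive2://localhost:10000/default";
    try (Connection con = DriverManager.getConnection(url, "hive", "");
         Statement stmt = con.createStatement()) {
      // Hive turns this HiveQL into one or more batch jobs (MapReduce, Tez or Spark).
      ResultSet rs = stmt.executeQuery(
          "SELECT destination, count(*) AS cnt FROM employee GROUP BY destination");
      while (rs.next()) {
        System.out.println(rs.getString("destination") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}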
Partitions:
Hive tables can have one or more partitions. Partitions are mapped to subdirectories in the
underlying file system.
Buckets: In Hive, data may be divided into buckets. Buckets are stored as files within partitions
in the underlying file system.
EX : Create table and load data :
CREATE TABLE IF NOT EXISTS employee ( eid int, name String, salary String,
destination String)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
Ex : creating a bucketed table
CREATE TABLE bucketed_user(
name VARCHAR(64),
city VARCHAR(64),
state VARCHAR(64),
phone VARCHAR(64) )
PARTITIONED BY (country VARCHAR(64))
CLUSTERED BY (state) SORTED BY (city) INTO 32 BUCKETS
STORED AS SEQUENCEFILE;
Sr No. Unit Name Operation
1. User Interface Hive is a data warehouse infrastructure software that can create
interaction between user and HDFS. The user interfaces that Hive
supports are Hive Web UI, Hive command line, and Hive HD Insight
(In Windows server).
2. Meta Store Hive chooses respective database servers to store the schema or
Metadata of tables, databases, columns in a table, their data types,
and HDFS mapping.
5. HDFS or HBASE: Hadoop Distributed File System or HBASE are the data storage
techniques used to store data in the file system.
Q27. Write down the difference between RDBMS and Hive.
Q28 . Write a note on Apache Hive.
Apache Hive is an open-source data warehouse system for querying and analyzing large
datasets stored in Hadoop files. Hadoop is a framework for handling large datasets in a
distributed computing environment.
It is built on top of Hadoop to summarize Big Data, and it makes querying and analyzing
easy. It was developed by Facebook and later open-sourced. It stores data directly
on top of HDFS.
Hive has three main functions: data summarization, query and analysis. It supports
queries expressed in a language called HiveQL, which automatically translates SQL-like
queries into MapReduce jobs executed on Hadoop.
Traditional SQL queries must be implemented in the MapReduce Java API to execute
SQL applications and queries over distributed data. Hive provides the necessary SQL
abstraction to integrate SQL-like queries (HiveQL) into the underlying Java API, without
the need to implement queries in the low-level Java API.
According to the Apache Hive wiki, "Hive is not designed for OLTP workloads and does
not offer real-time queries or row-level updates. It is best used for batch jobs over large
sets of append-only data (like web logs)."
Hive supports text files (flat files), Sequence Files (flat files consisting of
binary key/value pairs) and RC Files (Record Columnar Files which store columns of a
table in a columnar database way.)
Features of Hive:
1. Familiar: Query data with a SQL-based language for querying called HiveQL or HQL.
2. Fast: Interactive response times, even over huge datasets
3. Functions: Built-in user defined functions (UDFs) to manipulate dates, strings, and other
data-mining tools. Hive supports extending the UDF set to handle use-cases not supported by
built-in functions.
4. Scalable and Extensible: As data variety and volume grows, more commodity machines
can be added, without a corresponding reduction in performance
5. Compatible: Works with traditional data integration and data analytics tools.
6. Different storage types: Different storage types such as plain text, Sequence Files, RCFile
and others
Applications of Hive:
Best suited for Data Warehousing Applications
Data Mining
Ad-hoc Analysis
Business Intelligence
Data Visualization
Hive/Hadoop usage at Facebook:
Summarization:
E.g.: Daily/Weekly aggregations of impressions/click counts
Complex measures of user engagement
Ad-hoc Analysis:
E.g.: how many group admins broken down by state/country
Ad Optimization
Spam Detection
Application API usage patterns
Reports on pages
E.g.: Graphical representation of popularity of page and number of likes increase for a
particular page
While originally developed by Facebook, Apache Hive is now used and developed by other
companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon
maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on
Amazon Web Services.
Hive Architecture
The main components of Hive are:
UI: The user interface for users to submit queries and other operations to the system. As
of 2011 the system had a command line interface, and a web-based GUI was being
developed.
Driver: The component which receives the queries. This component implements the
notion of session handles and provides execute and fetch APIs modeled on JDBC/ODBC
interfaces.
Compiler: The component that parses the query, does semantic analysis on the different
query blocks and query expressions, and eventually generates an execution plan with the
help of the table and partition metadata looked up from the metastore.
Metastore: The component that stores all the structure information of the various tables
and partitions in the warehouse, including column and column type information, the
serializers and deserializers necessary to read and write data, and the corresponding HDFS
files where the data is stored.
Execution Engine: The component which executes the execution plan created by the
compiler. The plan is a DAG of stages. The execution engine manages the dependencies
between these different stages of the plan and executes these stages on the appropriate
system components.
The above figure shows how a typical query flows through the system.
Step 1:-
The UI calls the execute interface to the Driver.
Step 2:-
The Driver creates a session handle for the query and sends the query to the compiler to
generate an execution plan.
Step 3 and 4:-
The compiler gets the necessary metadata from the metastore.
Step 5:-
This metadata is used to typecheck the expressions in the query tree as well as to prune
partitions based on query predicates. The plan generated by the compiler (step 5) is a
DAG of stages, with each stage being either a map/reduce job, a metadata operation, or an
operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator
trees that are executed on the mappers) and a reduce operator tree (for operations that
need reducers).
Step 6:-
The execution engine submits these stages to appropriate components (steps 6, 6.1, 6.2
and 6.3). In each task (mapper/reducer) the deserializer associated with the table or
intermediate outputs is used to read the rows from HDFS files and these are passed
through the associated operator tree. Once the output is generated, it is written to a
temporary HDFS file through the serializer (this happens in the mapper in case the
operation does not need a reduce). The temporary files are used to provide data to
subsequent map/reduce stages of the plan. For DML operations the final temporary file is
moved to the table's location. This scheme is used to ensure that dirty data is not read (file
renaming being an atomic operation in HDFS).
Q30. Write down the difference between partition and bucket concepts with advantages.
Hive Partition
Hive partitioning divides a large amount of data into a number of folders based on the
values of table columns.
Hive partitioning is often used for distributing load horizontally; it has a performance
benefit and helps in organizing data in a logical fashion.
If you want to use partitioning in Hive, you should use the PARTITIONED BY (COL1,
COL2, ...) clause while creating the Hive table.
We can perform partitioning on any number of columns in a table using the Hive partition
concept.
We can apply the Hive partitioning concept to Hive tables, whether managed tables or
external tables.
Partitioning works best when the cardinality of the partitioning field is not too high.
Suppose we partition on a date column: a new partition directory is created for every date,
which is a heavy burden on the NameNode metadata.
Example:
Assume that you are storing information about people from the entire world, spread across 196+
countries and spanning around 500 crores of entries. If you want to query people from a particular
country (say, Vatican City), then in the absence of partitioning you have to scan all 500 crores of
entries even to fetch the thousand entries of that country. If you partition the table based on country,
you can fine-tune the querying process by checking the data for only one country partition. Hive
partitioning creates a separate directory for each column(s) value.
Advantages with Hive Partition
Distribute execution load horizontally
Faster execution of queries when a partition holds a low volume of data, e.g. getting the
population of Vatican City returns very fast instead of searching the entire
population of the world.
No need to search the entire table for a single record.
Disadvantages with Hive Partition
There is a possibility of creating too many folders in HDFS, which is an extra burden on the
NameNode metadata.
So there is no guarantee of query optimization at all times.
Hive Bucketing
Hive bucketing is responsible for dividing the data into a number of equal parts.
If you want to use bucketing in Hive, you should use the CLUSTERED BY (col)
clause while creating a table in Hive.
We can apply the Hive bucketing concept to Hive managed tables or external tables.
We can perform Hive bucketing optimization on only one column, not more than
one.
The values of this column will be hashed into a user-defined number of buckets.
Bucketing works well when the field has high cardinality and data is evenly distributed
among buckets.
If you want to perform queries on date or timestamp columns, or other columns that have
many distinct values, then the Hive bucketing concept is preferable.
Clustering, aka bucketing, will result in a fixed number of files, since you specify the number
of buckets. What Hive will do is take the field, calculate a hash and assign a record to that
bucket.
But what happens if you use, let's say, 256 buckets and the field you're bucketing on has low
cardinality (for instance, it's a US state, so there can be only 50 different values)? You'll have 50
buckets with data, and 206 buckets with no data.
Due to the equal volumes of data in each bucket, joins at the map side will be quicker.
You can define the number of buckets during table creation, but loading an equal volume of
data into each bucket has to be done manually by programmers.
Q31. String & math operators in PigLatin
String Functions
Pig function names are case sensitive and UPPER CASE.
Pig string functions have an extra, first parameter: the string to which all the operations
are applied.
Pig may process results differently than stated in the Java API Specification. If any
of the input parameters are null, or if an insufficient number of parameters are supplied,
NULL is returned.
STRING OPERATORS:
Name and Description
LAST_INDEX_OF: returns the index of the last occurrence of a character in a string, searching backward from the end.
Syntax: LAST_INDEX_OF(expression)
LCFIRST: converts the first character of a string to lower case.
Syntax: LCFIRST(expression)
LOWER: converts all characters in a string to lower case.
Syntax: LOWER(expression)
REGEX_EXTRACT_ALL: performs regular-expression matching and extracts all matched groups.
Example: REGEX_EXTRACT_ALL('192.168.1.5:8020', '(.*)\:(.*)');
TRIM: returns a copy of a string with leading and trailing white space removed.
Syntax: TRIM(expression)
UPPER: returns a string converted to upper case.
Syntax: UPPER(expression)
MATH Functions:
Pig function names are case sensitive and UPPER CASE.
Pig may process results differently than as stated in the Java API Specification:
NAME and DESCRIPTION
1. ABS: Returns the absolute value of an expression.
Syntax: ABS(expression)
Usage: Use the ABS function to return the absolute value of an expression. If the result is not
negative (x >= 0), the result is returned. If the result is negative (x < 0), the negation of the
result is returned.
2. ACOS: Returns the arc cosine of an expression.
Syntax: ACOS(expression)
3. ASIN: Returns the arc sine of an expression.
Syntax: ASIN(expression)
4. ATAN: Returns the arc tangent of an expression.
Syntax: ATAN(expression)
Usage: Use the ATAN function to return the arc tangent of an expression.
5. CBRT: Returns the cube root of an expression.
Syntax: CBRT(expression)
6. CEIL: Returns the value of an expression rounded up to the nearest integer.
Syntax: CEIL(expression)
Usage: Use the CEIL function to return the value of an expression rounded up to the
nearest integer. This function never decreases the result value.
Example:
X      CEIL(X)
4.6    5
3.5    4
2.4    3
7. COSH: Returns the hyperbolic cosine of an expression.
Syntax: COSH(expression)
8. COS: Returns the trigonometric cosine of an expression.
Syntax: COS(expression)
9. EXP: Returns Euler's number e raised to the power of x.
Syntax: EXP(expression)
Usage: Use the EXP function to return the value of Euler's number e raised to the power
of x (where x is the result value of the expression).
10. FLOOR: Returns the value of an expression rounded down to the nearest integer.
Syntax: FLOOR(expression)
Usage: Use the FLOOR function to return the value of an expression rounded down to
the nearest integer. This function never increases the result value.
Example:
X      FLOOR(X)
4.6    4
3.5    3
2.4    2
LOG: Returns the natural logarithm of an expression.
Syntax: LOG(expression)
Usage: Use the LOG function to return the natural logarithm (base e) of an expression.
RANDOM: Returns a pseudo-random number.
Syntax: RANDOM()
Usage: Use the RANDOM function to return a pseudo-random number (type double)
greater than or equal to 0.0 and less than 1.0.
ROUND: Returns the value of an expression rounded to the nearest integer.
Syntax: ROUND(expression)
Usage: Use the ROUND function to return the value of an expression rounded to the
nearest integer.
Example:
X      ROUND(X)
4.6    5
3.5    4
2.4    2