Data Processing and Data Analysis
SUBMITTED TO
Prof. Mohammed Harisur Rahman Howlader
DEPARTMENT OF MANAGEMENT
UNIVERSITY OF CHITTAGONG
SUBMITTED BY
TILAK KARMAKAR
ID: 16302124
DEPARTMENT OF MANAGEMENT
UNIVERSITY OF CHITTAGONG
Means of coding data in case of category data, rating data and ranking data:
Coding of data, whether category data, rating data, or ranking data, refers to the process of
transforming collected information or observations into a set of meaningful, cohesive categories. It
is a process of summarizing and re-presenting data in order to provide a systematic account of
the recorded or observed phenomenon. Data refer to a wide range of empirical objects such as
historical documents, newspaper articles, TV programming, field notes, interview or focus group
transcripts, pictures, face-to-face conversations, social media messages (e.g., tweets or YouTube
comments), and so on. Codes are concepts that link data with theory. They can either be
predefined by the researcher or emerge inductively from the coding process. By coding data,
researchers classify and attach conceptual labels to empirical objects under study in order to
organize and interpret them.
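As a rough illustration of how the three kinds of data might be coded numerically, the sketch below uses Python. The codebook, the five-point scale and the sample answers are all hypothetical assumptions and are not taken from any study.

```python
# Hypothetical codebook: every label and number here is an illustrative assumption.

# Category (nominal) data: codes are arbitrary labels with no order.
CATEGORY_CODES = {"male": 1, "female": 2}

# Rating data: codes follow the order of the rating scale.
RATING_CODES = {
    "strongly disagree": 1,
    "disagree": 2,
    "neutral": 3,
    "agree": 4,
    "strongly agree": 5,
}


def code_category(answer):
    """Return the numeric code for a category answer."""
    return CATEGORY_CODES[answer.strip().lower()]


def code_rating(answer):
    """Return the numeric code for a rating-scale answer."""
    return RATING_CODES[answer.strip().lower()]


def code_ranking(ordered_items):
    """Map each item to its rank position (1 = ranked first)."""
    return {item: rank for rank, item in enumerate(ordered_items, start=1)}


coded_gender = code_category("Female")
coded_rating = code_rating("Agree")
coded_ranking = code_ranking(["price", "quality", "brand"])
print(coded_gender, coded_rating, coded_ranking)
```

Keeping the codebook in one place makes it easy to document what each number stands for when the results are reported.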
Including numbers in the main body of text: Numbers are most effective in the main body of the
text of an essay, report or dissertation when there are only two values to compare. For
example: 86% of male students said they regularly ate breakfast compared to 62% of female
students.
If we are discussing three or more numbers, including them within the main body of text does
not facilitate comprehension or comparison and it is often more useful to use a table incorporated
within the text.
Using tables: Tables are the format in which most numerical data are initially stored and
analyzed, and they are likely to be the means we use to organize data collected during experiments
and dissertation research. However, when writing up, we have to decide whether a table is the
best way of presenting the data, or whether a graph or chart would be easier to understand.
Tables are an effective way of presenting data:
When we wish to show how a single category of information varies when measured at
different points (in time or space). For example, a table would be an appropriate way of showing
how the category unemployment rate varies between different countries in the EU (different
points in space);
When the dataset contains relatively few numbers. This is because it is very hard for a reader
to assimilate and interpret many numbers in a table. In particular, avoid the use of complex tables
in talks and presentations, where the audience has a relatively short time to take in the
information and little or no opportunity to review it at a later stage;
When the precise value is crucial to our argument and a graph would not convey the same
level of precision. For example, when it is important that the reader knows that the result was
2.48 and not 2.45.
Graphs: Graphs are a good means of describing, exploring or summarizing numerical data
because the use of a visual image can simplify complex information and help to highlight
patterns and trends in the data. They are a particularly effective way of presenting a large amount
of data but can also be used instead of a table to present smaller datasets.
Types of Graph
Bar charts: Bar charts are one of the most commonly used types of graph and are used to
display and compare the number, frequency or other measure (e.g. mean) for different discrete
categories or groups. The graph is constructed such that the heights or lengths of the different
bars are proportional to the size of the category they represent. Since the x-axis (the horizontal
axis) represents the different categories it has no scale. The y-axis (the vertical axis) does have a
scale and this indicates the units of measurement. The bars can be drawn either vertically or
horizontally depending upon the number of categories and length or complexity of the category
labels.
Histograms: Histograms are a special form of bar chart where the data represent continuous
rather than discrete categories. For example, instead of drawing a bar for each individual age
between 0 and 65, the data could be grouped into a series of continuous age ranges such as 16-24,
25-34, 35-44 etc. Unlike a bar chart, in a histogram both the x- and y-axes have a scale.
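The grouping step behind a histogram can be sketched in Python as follows. The ages are made-up values, and the bins simply mirror the ranges mentioned above; this is an illustration, not a plotting routine.

```python
from collections import Counter

# Hypothetical ages; the bins mirror the ranges 16-24, 25-34, 35-44 used in the text.
ages = [17, 22, 24, 25, 31, 33, 34, 36, 40, 44]
bins = [(16, 24), (25, 34), (35, 44)]


def bin_label(age):
    """Return the label of the range an age falls into, or None if outside all ranges."""
    for low, high in bins:
        if low <= age <= high:
            return f"{low}-{high}"
    return None


# Count how many observations fall into each range.
counts = Counter(bin_label(age) for age in ages)

# Print a simple text histogram: one '#' per observation.
for low, high in bins:
    label = f"{low}-{high}"
    print(f"{label:>6} | {'#' * counts[label]}")
```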
Pie charts: Pie charts are a visual way of displaying how the total data are distributed between
different categories. For example, a pie chart could show the proportional distribution of
visitors between different types of tourist attractions, or the percentage of the total votes
received by each party in an election. Pie charts should only be used for
displaying nominal data (i.e. data that are classed into different categories).
Line graphs: Line graphs are usually used to show time series data, that is, how one or more
variables vary over a continuous period of time. Typical examples of the types of data that can be
presented using line graphs are monthly rainfall and annual unemployment rates. Line graphs are
particularly useful for identifying patterns and trends in the data such as seasonal effects, large
changes and turning points.
SAS- SAS is a software suite that can mine, alter, manage and retrieve data from a variety of
sources and perform statistical analysis on it.
SYSTAT- SYSTAT is statistical analysis and graphics software that offers a comprehensive suite
of statistical functions and 2D and 3D charts and graphs.
Microsoft Excel- Excel's basic data analysis tool allows descriptive statistics, including
frequencies and measures of central tendency, to be computed. Spreadsheet packages like Excel
continue to evolve and are becoming more viable for performing many statistical analyses.
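The descriptive statistics mentioned above (frequencies and measures of central tendency) are not specific to any one package. As a minimal sketch, the same measures can be computed with Python's standard library; the responses below are hypothetical.

```python
import statistics
from collections import Counter

# Hypothetical survey responses on a 1-5 rating scale.
responses = [3, 4, 4, 5, 2, 4, 3, 5, 4, 1]

frequencies = Counter(responses)              # how often each value occurs
mean_value = statistics.mean(responses)       # arithmetic mean
median_value = statistics.median(responses)   # middle value of the sorted data
mode_value = statistics.mode(responses)       # most frequent value

print("Frequencies:", dict(sorted(frequencies.items())))
print("Mean:", mean_value, "Median:", median_value, "Mode:", mode_value)
```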
ANOVA: An ANOVA (Analysis of Variance) test is a way to find out whether survey or experiment
results are significant. In other words, it helps us decide whether to reject the null
hypothesis in favor of the alternative hypothesis. We might use ANOVA when we want to test a
particular hypothesis about how different groups respond, with a null hypothesis for the test
that the means of the different groups are equal. If there is a statistically significant
result, it means that at least one group mean differs from the others. For example:
A group of psychiatric patients are trying three different therapies: counseling, medication
and biofeedback. We want to see if one therapy is better than the others.
A manufacturer has two different processes to make light bulbs and wants to know if one
process is better than the other.
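To make the idea concrete, the following Python sketch computes the one-way ANOVA F statistic by hand as the ratio of between-group to within-group variance. The three groups of scores are invented for illustration and do not come from a real therapy study.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - mean) ** 2
        for group, mean in zip(groups, group_means)
        for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Hypothetical improvement scores for three therapies.
counseling = [1, 2, 3]
medication = [2, 3, 4]
biofeedback = [5, 6, 7]

f_value, df_between, df_within = one_way_anova_f([counseling, medication, biofeedback])
print(f_value, df_between, df_within)
```

The resulting F value would then be compared with the critical value of the F distribution for the two degrees-of-freedom values.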
T-TEST OR Z-TEST:
1) When the population standard deviation (σ) is known, the z-test is most appropriate.
2) When σ is unknown (the situation in most marketing research studies) and the sample size
is greater than 30, the z-test can also be used.
3) When σ is unknown and the sample size is small, the t-test is most appropriate. Since the
two distributions are similar at larger sample sizes, the two tests often yield the same
conclusion.
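The three rules above can be sketched as a small Python function for a one-sample test of a mean. The function name, the sample values and the hypothesized mean are assumptions made for illustration; the cut-off of 30 follows the rule of thumb stated in the list.

```python
import math
import statistics


def one_sample_statistic(sample, mu0, sigma=None):
    """Return (test_name, statistic) for testing whether the mean equals mu0.

    sigma is the population standard deviation, or None when it is unknown.
    """
    n = len(sample)
    mean = statistics.mean(sample)

    if sigma is not None:
        # Rule 1: sigma known, so use the z-test.
        return "z", (mean - mu0) / (sigma / math.sqrt(n))

    # Sigma unknown: estimate it with the sample standard deviation.
    s = statistics.stdev(sample)
    statistic = (mean - mu0) / (s / math.sqrt(n))

    # Rule 2: large sample, z-test can be used. Rule 3: small sample, t-test.
    return ("z" if n > 30 else "t"), statistic


# Hypothetical sample of eight observations, tested against a mean of 4.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
known = one_sample_statistic(sample, mu0=4, sigma=2)
unknown = one_sample_statistic(sample, mu0=4)
print(known, unknown)
```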
CHI SQUARE TEST: The Chi Square statistic is commonly used for testing relationships
between categorical variables. The null hypothesis of the Chi-Square test is that no relationship
exists between the categorical variables in the population; they are independent. The Chi-Square
statistic is most commonly used to evaluate Tests of Independence when using a cross tabulation
(also known as a bivariate table). Cross tabulation presents the distributions of two categorical
variables simultaneously, with the intersections of the categories of the variables appearing in the
cells of the table. The Test of Independence assesses whether an association exists between the
two variables by comparing the observed pattern of responses in the cells to the pattern that
would be expected if the variables were truly independent of each other. Calculating the
Chi-Square statistic and comparing it against a critical value from the Chi-Square distribution
allows the researcher to assess whether the observed cell counts are significantly different from
the expected cell counts.
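The comparison of observed and expected cell counts described above can be sketched in Python. The cross tabulation below is hypothetical; expected counts are computed as (row total × column total) / grand total, which is what independence implies.

```python
def chi_square_statistic(observed):
    """Return (chi_square, degrees_of_freedom) for a cross tabulation.

    observed is a list of rows, each row a list of cell counts.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi_square = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi_square += (obs - expected) ** 2 / expected

    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi_square, df


# Hypothetical 2x2 table: rows are two groups, columns are yes/no answers.
table = [[10, 20],
         [20, 10]]
chi_square, df = chi_square_statistic(table)
print(chi_square, df)
```

The statistic is then compared with the critical value of the Chi-Square distribution for the computed degrees of freedom.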
If we get a large F value (one that is bigger than the F critical value found in a table), it
means the result is significant; equivalently, a small p value (below the chosen significance
level) indicates significance. The F statistic compares the joint effect of all the variables
together.
Here the F value is 2.34 and the p value is .156. Because .156 is greater than the conventional
.05 significance level, the joint effect is not statistically significant at that level.
Here Y is the dependent variable of the regression equation. Note that all the signs in the
equation are positive. Thus, the regression equation indicates that Y is positively related to
X1, X2 and X3. The coefficients show the effect on the dependent variable of a 1-unit increase
in an independent variable, holding the other independent variables constant. The value
associated with X1 is 0.387. Thus a 1-unit increase in X1 is associated with an increase of
0.387 (0.387*1) in Y. The same applies to X2 and X3.
So if the effects associated with X1, X2 and X3 are positive, we can support the hypotheses, and
vice versa.
And the value R2 = .843 means that 84.3 percent of the variance in the dependent variable is
explained by the independent variables.
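How an R2 value is obtained can be sketched in Python as one minus the ratio of unexplained to total variation. The observed and predicted values below are invented for illustration and are unrelated to the R2 of .843 reported above.

```python
def r_squared(observed, predicted):
    """Return the share of variance in observed explained by predicted."""
    mean_observed = sum(observed) / len(observed)
    # Variation left unexplained by the regression (residual sum of squares).
    ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    # Total variation of the dependent variable around its mean.
    ss_total = sum((y - mean_observed) ** 2 for y in observed)
    return 1 - ss_residual / ss_total


# Hypothetical observed values of Y and the values predicted by a regression.
observed_y = [1, 2, 3, 4]
predicted_y = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(observed_y, predicted_y)
print(round(r2, 3))
```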
Canonical correlation is appropriate in the same situations where multiple regression would be,
but where there are multiple intercorrelated outcome variables.
As an example, consider variables related to exercise and health. On one hand, we have variables
associated with exercise, observations such as the climbing rate on a stair stepper, how fast you
can run a certain distance, the amount of weight lifted on bench press, the number of push-ups
per minute, etc. On the other hand, you have variables that attempt to measure overall health,
such as blood pressure, cholesterol levels, glucose levels, body mass index, etc. Two types of
variables are measured and the relationships between the exercise variables and the health
variables are of interest.