Ecotrix With R and Python
Estimation
Testing hypotheses
Forecasting / predicting / simulation
Econometrics models
Data
Experimental vs observational data
Randomized controlled trials: data are collected in a way that lets us establish causation. A claim
that x causes y must be established by research, so we design an experiment; otherwise all we have
is a correlation. E.g. when small kids are given sugar candies, most parents believe it makes them
hyperactive, but research had to be conducted to check whether the candy is really the cause.
We have a control group and a treatment group, formed by randomly assigning the members of the sample.
Divide the sample into 2 groups, 50 in control and 50 in treatment, with no selection bias. Then give
the treatment to the treatment group only, to test whether it is the candy alone that makes the kids
hyperactive (or, in another setting, to check the efficacy of a covid vaccine). Later, check the
difference between the control group and the treatment group. Because assignment was random, any
difference can be attributed to the treatment. Data collected this way are experimental data.
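The random assignment described above can be sketched in R as follows. This is only an illustration on simulated numbers, not real trial data; the names kid_id, group and activity and the outcome scale are assumptions made up for the example.

```r
# Simulated illustration only: 100 hypothetical kids, no real data
set.seed(1)
n <- 100
kid_id <- 1:n

# Randomly pick 50 kids for treatment; the remaining 50 form the control
treated <- sample(kid_id, size = 50)
group <- ifelse(kid_id %in% treated, "treatment", "control")

# Hypothetical activity score, simulated here with no true effect of candy
activity <- rnorm(n, mean = 10, sd = 2)

# Compare the average outcome between the two groups
group_means <- tapply(activity, group, mean)
difference <- group_means[["treatment"]] - group_means[["control"]]
print(table(group))
print(difference)
```

Because the groups are formed at random, a large difference in means would point to the treatment rather than to how the kids were selected.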
National Sample Surveys and NFHS data are observational data. The drawback of this type of data is
that causation is hard to establish from it.
Pooled cross section data- taking a sample at one point in time, and taking another sample at a
different point in time, possibly from a different place. This differs from panel data because the
same units are not followed over time; the samples are taken from different locations at different times.
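What is an estimator: OLS (minimizes the sum of squared distances between the data and the fitted
line), MLE (chooses the parameters that maximize the likelihood of the observed data),
Logit models- for yes or no type of data, where ordinary linear regression does not fit well.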
Omitted variable bias (some variable is very important in determining the effect, but we don't have
it or haven't included it; the solution is to obtain it or get a proxy for that variable (like
monthly per capita income, since no one would actually disclose their own income; this does not
work for high income groups though)),
Multicollinearity (not a big problem unless it is perfect multicollinearity, where one variable
perfectly explains another),
Autocorrelation (mostly a problem of time series data: the error term in the previous period affects
the present error term, which can in turn affect the future error term, making the usual standard
errors unreliable)
Assessment of validity
Causation- an additional year of education causes wages to increase by a given amount, all else
equal
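As a rough sketch of how such an effect would be estimated with OLS in R, the example below fits a regression on simulated data. The numbers are invented for illustration: the "true" effect of one extra year of education is set to 0.5 by assumption, and the variable names education and wage are hypothetical.

```r
# Simulated illustration only: the "true" effect of education is set to 0.5
set.seed(42)
n <- 500
education <- sample(8:20, size = n, replace = TRUE)   # years of schooling
wage <- 2 + 0.5 * education + rnorm(n, mean = 0, sd = 1)

# OLS: lm() chooses the line that minimizes the sum of squared residuals
fit <- lm(wage ~ education)
slope <- coef(fit)[["education"]]
print(slope)
```

The estimated slope should land close to 0.5 here only because the data were simulated without omitted variables; with observational data the same regression need not have a causal reading.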
In R, we do not need to declare a variable before we use it, unlike other programming languages like
Java, C, C++, etc.
In R we can use the following for assignment of values to variables: '=', '<-', '->'
In R, the data type of a variable is not specified in advance; rather, the variable gets the data
type of the R object assigned to it. R is called a dynamically typed language, which means we can
change the data type of the same variable again and again while using it in a program.
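A short sketch of the three assignment forms and of dynamic typing. The variable names (a, b, c_value, v) are chosen only for illustration.

```r
# Three ways to assign a value
a = 10
b <- 20
30 -> c_value   # rightward assignment

# Dynamic typing: the same name can hold different types over time
v <- 5L
class_1 <- class(v)   # "integer"
v <- "five"
class_2 <- class(v)   # "character"
v <- TRUE
class_3 <- class(v)   # "logical"
print(c(class_1, class_2, class_3))
```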
Data type tells us which types of value a variable has and what types of mathematical, relational or
logical operations can be applied to it without causing an error.
Data types-
Character/String- alphabets
Vectors
They are the most basic R data objects and there are six types of atomic vectors. The six atomic
vector types are-
Logical
Numeric
Integer
Complex
Character
Raw
In general, a vector is defined and initialized in the following manner
In the case of c(1L, 1.2, 1.3), the vector gets stored as numeric and the integer 1L is converted to 1.0.
Vec 6, which mixes character values with other types, becomes all characters
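A minimal sketch of defining vectors and of the type conversion described above. The names vec1 to vec5, mixed_num and mixed_chr are made up for this example and are not the ones from the original notes.

```r
vec1 <- c(TRUE, FALSE)     # logical
vec2 <- c(1.5, 2, 3)       # numeric (double)
vec3 <- c(1L, 2L, 3L)      # integer
vec4 <- c(1+2i, 3-1i)      # complex
vec5 <- c("a", "b")        # character

# Mixing types forces every element to one common type
mixed_num <- c(1L, 1.2, 1.3)    # the integer 1L is stored as numeric 1.0
mixed_chr <- c(1, "a", TRUE)    # everything becomes character

print(class(mixed_num))   # "numeric"
print(mixed_chr)          # "1" "a" "TRUE"
```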
If we don’t want all to become characters, we use “lists” which stores the data as it is without
changing its type.
Lists are quite similar to vectors, but lists are the R objects which can contain elements of different
types like numbers, strings, vectors and another list inside it.
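A small sketch of a list holding elements of different types, including another list. The name my_list is only illustrative.

```r
my_list <- list(1.5, "text", c(1, 2, 3), list(TRUE, "inner"))

# Each element keeps its own type
types <- sapply(my_list, class)
print(types)             # "numeric" "character" "numeric" "list"

# Double brackets extract one element; here the third element is a vector
print(my_list[[3]][2])   # 2
```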
Matrix
Matrix is the R object in which the elements are arranged in a 2 dimensional rectangular layout.
Basic syntax for creating a matrix in R- matrix(data, nrow, ncol, byrow, dimnames)
Where data is the input vector which becomes the data elements of the matrix.
Byrow is a logical clue. If TRUE, then the input vector elements are arranged by row.
If the number of rows and columns requested needs more elements than the data supplies, the data
elements start repeating (they are recycled).
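The effect of byrow and of repeating data can be seen in a sketch like the following. The object names m_col, m_row and m_rec are assumptions for the example.

```r
# Default: the input vector fills the matrix column by column
m_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE: the input vector fills the matrix row by row
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

# Only 3 values for 6 cells, so the values repeat to fill the matrix
m_rec <- matrix(1:3, nrow = 3, ncol = 2)

print(m_col)
print(m_row)
print(m_rec)
```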
Array
Arrays in R are data objects which can be used to store data in more than two dimensions. The
array() function takes vectors as input and uses the values in the 'dim' parameter to create an array.
Basic syntax for creating an array in R- array(data, dim, dimnames)
where
data is the input vector which becomes the data elements of the array
dim is the dimension of the array, where you pass the number of rows, columns and the number of
matrices to be created with the mentioned dimensions.
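For instance, a three dimensional array can be built as in this sketch (the name arr is illustrative):

```r
# Two matrices, each with 3 rows and 4 columns, filled from the vector 1:24
arr <- array(1:24, dim = c(3, 4, 2))

print(dim(arr))       # 3 4 2
print(arr[2, 3, 1])   # row 2, column 3 of the first matrix: 8
print(arr[1, 1, 2])   # first cell of the second matrix: 13
```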
Data Frame
A dataframe is a table or a two dimensional array-like structure in which each column contains
values of one variable and each row contains one set of values from each column. In a data frame;
Emp_data
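A data frame along these lines could be created as follows. The column names and values are hypothetical and are not taken from the original example.

```r
# Hypothetical employee data: each column is one variable, each row one employee
emp_data <- data.frame(
  emp_id = 1:3,
  emp_name = c("A", "B", "C"),
  salary = c(500, 650, 720),
  stringsAsFactors = FALSE
)

print(emp_data)
print(nrow(emp_data))   # number of rows
print(ncol(emp_data))   # number of columns
```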
Data Operators
a+b addition
a-b subtraction
a*b multiplication
a/b division
a%%b modulus (remainder after dividing a by b)
> x=22
> y=7
> x+y #addition
[1] 29
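The remaining arithmetic operators can be tried on the same two values. The integer division operator %/% is not in the list above and is added here only for comparison with %%.

```r
x <- 22
y <- 7

print(x - y)     # subtraction: 15
print(x * y)     # multiplication: 154
print(x / y)     # division: about 3.142857
print(x %% y)    # modulus (remainder): 1
print(x %/% y)   # integer division: 3
```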
Relational Operators
These operators help us perform relational operations, like checking if a variable is greater than,
less than or equal to another variable. The output of a relational operator is always a logical
value.
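A quick sketch of the relational operators, using illustrative values:

```r
x <- 22
y <- 7

print(x > y)     # TRUE
print(x < y)     # FALSE
print(x == y)    # FALSE
print(x != y)    # TRUE
print(x >= y)    # TRUE

# On vectors the comparison is done element by element
above_four <- c(1, 5, 9) > 4
print(above_four)   # FALSE TRUE TRUE
```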
Logical Operators
These operators combine or negate logical (Boolean) values: and, or, not
a&b combines the vectors element by element and gives TRUE where both values are TRUE
a|b combines the vectors element by element and gives TRUE where at least one of the values is TRUE
!a takes each element of the vector and gives the opposite logical value
a&&b, a||b double-and / double-or; they compare single logical values rather than whole vectors
(older versions of R used only the first element of each vector, while R 4.3.0 and later give an
error if a vector is longer than one element)
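The difference between the element-wise and the single-value operators can be sketched like this (the vectors are made up for the example):

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

and_result <- a & b   # element-wise and: TRUE FALSE FALSE
or_result  <- a | b   # element-wise or:  TRUE TRUE TRUE
not_result <- !a      # negation:         FALSE TRUE FALSE

# && and || are meant for single values, for example inside if()
single_and <- a[1] && b[1]   # TRUE && TRUE is TRUE
single_or  <- a[2] || b[3]   # FALSE || FALSE is FALSE

print(and_result)
print(or_result)
print(not_result)
print(single_and)
print(single_or)
```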
a=7
b=7
if (a>b) {
  print("a is greater than b")
} else if (a<b) {
  print("a is less than b")
} else {
  print("a is equal to b")
}
Loops
Repeat loop: it repeats a statement or a group of statements until a break statement is reached.
The repeat loop is an example of an exit controlled loop, where the code is executed first and the
condition is checked afterwards to determine whether control should stay inside the loop or exit from it.
While loop: it repeats a statement or a group of statements while a given condition is true.
Compared to the repeat loop, the while loop is slightly different. It is an example of an entry
controlled loop, where the condition is checked first and control enters the loop to execute the
code only if the condition is found to be true.
For loop: it repeats a statement or a group of statements a fixed number of times. Unlike the repeat
and while loops, the for loop is used in situations where we know beforehand the number of times the
code needs to be executed. It is like the while loop in that the condition is checked first and only
then does the code written inside get executed.
x=2
repeat {
  x=x^2          # square x on every pass
  if (x>100) {   # exit condition, checked after the code has run
    print(x)
    break
  }
}
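For comparison, the while and for loops described above might look like this. The starting values are chosen only for illustration.

```r
# while loop: the condition is checked before each pass
x <- 2
while (x <= 100) {
  x <- x^2
}
print(x)   # 256

# for loop: the number of passes is known in advance
total <- 0
for (i in 1:5) {
  total <- total + i
}
print(total)   # 15
```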
Fibonacci Sequence
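One possible way to generate the sequence with a for loop is sketched below. Starting the sequence at 0 and 1 and producing 10 terms are assumptions; the names n_terms and fib are illustrative.

```r
n_terms <- 10
fib <- numeric(n_terms)   # empty numeric vector to hold the terms
fib[1] <- 0
fib[2] <- 1

# Each term is the sum of the two terms before it
for (i in 3:n_terms) {
  fib[i] <- fib[i - 1] + fib[i - 2]
}
print(fib)   # 0 1 1 2 3 5 8 13 21 34
```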
Mean
Median
Mode
Functions
A <- c (1,2,3,4,5,6,7,8)
y <- table(A)
y
names(y)[which(y==max(y))]
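In the vector A above every value occurs once, so every value is returned as a mode. The sketch below uses a hypothetical vector B with one repeated value to show the mean, the median and the same table-based approach to the mode.

```r
B <- c(2, 4, 4, 5, 7, 4, 9)   # hypothetical values, with 4 repeated

mean_B <- mean(B)       # 35 / 7 = 5
median_B <- median(B)   # middle of the sorted values 2 4 4 4 5 7 9 is 4

# R has no built-in function for the statistical mode, so count frequencies
freq <- table(B)
mode_B <- names(freq)[which(freq == max(freq))]   # returned as character

print(mean_B)
print(median_B)
print(mode_B)   # "4"
```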
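(a) is an argument of the function
myfunction(data) calls the function with data as its argument
return() passes the result of the function back to the caller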
Function syntax
Function_name<-function(arg1,arg2….) {
#code fragments
}
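Following this syntax, the mode calculation could be wrapped in a function, as in the sketch below. The function name get_mode and its argument name values are hypothetical.

```r
# values is the argument; the function returns the most frequent value(s)
get_mode <- function(values) {
  freq <- table(values)
  most_common <- names(freq)[which(freq == max(freq))]
  return(as.numeric(most_common))
}

result <- get_mode(c(3, 7, 7, 1, 7, 3))
print(result)   # 7
```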
Importing and Exporting data
R works most easily with datasets stored as text files. Typically, values in text files are separated or
delimited, by tabs or spaces:
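A sketch of exporting and importing a delimited text file. To keep the example self-contained it writes to a temporary file first; the data values and the names path, csv_path and sample_data are made up for illustration.

```r
# Write a small tab-delimited text file to a temporary location
path <- tempfile(fileext = ".txt")
sample_data <- data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))
write.table(sample_data, file = path, sep = "\t", row.names = FALSE, quote = FALSE)

# Import: header = TRUE says the first line holds the variable names
mydata <- read.table(path, header = TRUE, sep = "\t")
print(mydata)

# Export to and import from a comma-separated file
csv_path <- tempfile(fileext = ".csv")
write.csv(mydata, file = csv_path, row.names = FALSE)
mydata_csv <- read.csv(csv_path)
```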
Variables, or columns of a data-frame can be selected with the $ operator, and the resulting object is
a vector. We can further subset elements of the selected column vector using [ ].
Use str() to see the structure of the object, including its class and the data types of elements. We
also see the first few values of each variable.
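These steps can be sketched on a small made-up data frame (the values are illustrative only):

```r
mydata <- data.frame(height = c(150, 160, 170, 180), weight = c(50, 60, 70, 80))

h <- mydata$height                           # the whole column, as a vector
first_two <- mydata$height[1:2]              # first two elements of the column
tall <- mydata$height[mydata$height > 165]   # elements meeting a condition

str(mydata)   # class, dimensions, column types and first few values
```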
Adding new variables to the data frame
mydata$logHeight = log(mydata$height)
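You cannot add a variable that has more values than the data frame has rows; doing so gives an error.
Correct: mydata$z = c(0,3) (a shorter vector is recycled, provided the number of rows is a multiple of its length)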
log() -logarithm
cut() -cut a continuous variable into intervals, with the new value (a factor level, stored as an
integer code) signifying into which interval the original value falls
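A short sketch of adding variables with log() and cut(), and of the error raised by a vector that is too long. The data, the break points and the column name heightGroup are assumptions for the example.

```r
mydata <- data.frame(height = c(150, 160, 170, 180))

# New variable: natural logarithm of height
mydata$logHeight = log(mydata$height)

# New variable: which interval each height falls into, (140,160] or (160,180]
mydata$heightGroup <- cut(mydata$height, breaks = c(140, 160, 180))

# A vector with more values than rows cannot be added
too_long <- tryCatch({
  mydata$z <- 1:10
  "no error"
}, error = function(e) "error")

print(mydata)
print(too_long)   # "error"
```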
Appending
Appending is the combining of two data frames by stacking them, and is often done before merging.
Make sure to check that the columns are the same using ‘names’. Rows of one data frame get added to
the other.
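A minimal sketch of appending with base R's rbind(), using two made-up data frames:

```r
df1 <- data.frame(id = 1:2, score = c(10, 20))
df2 <- data.frame(id = 3:4, score = c(30, 40))

# Check that both data frames have the same columns before appending
same_columns <- identical(names(df1), names(df2))

# Rows of df2 are added below the rows of df1
appended <- rbind(df1, df2)
print(appended)
```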
Merging
Merging is the adding together of columns from different data frames into one, matching rows on a
common key variable.
The ‘dplyr’ package has several types of merges that can be done using different functions like
inner_join(), left_join(), right_join() and such.
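The sketch below uses base R's merge() rather than dplyr, so that it runs without installing a package; the three calls correspond roughly to an inner, a left and a right join. The data frames people and wages are made up for illustration.

```r
people <- data.frame(id = c(1, 2, 3), name = c("A", "B", "C"))
wages  <- data.frame(id = c(2, 3, 4), wage = c(100, 200, 300))

# Keep only ids present in both data frames (like an inner join)
inner <- merge(people, wages, by = "id")

# Keep every row of people; wage is NA where there is no match (like a left join)
left <- merge(people, wages, by = "id", all.x = TRUE)

# Keep every row of wages; name is NA where there is no match (like a right join)
right <- merge(people, wages, by = "id", all.y = TRUE)

print(inner)
print(left)
print(right)
```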