A committee of researchers is examining the ranking of a set of postgraduate dissertations submitted to a dissertation writing competition. The committee is concerned that Randolph, one of the committee members, tends to be strongly biased in favor of comparative grading. In short, they suspect that the grade Randolph assigns to the current paper depends strongly on the grade he assigned to the last paper. Below is the sequence of scores that Randolph has assigned to the dissertations in the prize competition. You’ll be responsible for determining whether the concerns about Randolph are founded.
scores = c(16, 14, 19, 17, 15, 20, 16, 14, 17,
15, 20, 17, 18, 19, 15, 17, 16, 18,
15, 17, 18, 14, 16, 15, 17, 14, 20)
scores
Describe how the score of the papers changes as Randolph grades.1

Below, we can identify the number of times Randolph scores a dissertation above the average and then scores the next one below the average (and vice versa):
low.to.hi = ((scores[1:26] < mean(scores)) * (scores[2:27] > mean(scores)))
hi.to.low = ((scores[1:26] > mean(scores)) * (scores[2:27] < mean(scores)))
crosses.mean = low.to.hi + hi.to.low
total.crossings = sum(crosses.mean)
Thus, Randolph’s series crosses the mean score 17 times in 26 chances to do so.
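To judge whether 17 crossings out of 26 chances is surprising, one approach (in the spirit of the hints below) is to simulate many independent score sequences and count how often each crosses its own mean. A minimal sketch, assuming — my assumption, not the text's — that unbiased scores are uniform on the observed 14–20 range; count.crossings and sim.crossings are names of my own invention:

```r
# count how often a sequence crosses its own mean between consecutive scores
count.crossings = function(x){
  n = length(x)
  lo.hi = (x[1:(n-1)] < mean(x)) * (x[2:n] > mean(x))
  hi.lo = (x[1:(n-1)] > mean(x)) * (x[2:n] < mean(x))
  sum(lo.hi + hi.lo)
}

# simulate 1000 independent sequences of 27 scores drawn from 14..20
set.seed(42)
sim.crossings = replicate(1000, count.crossings(sample(14:20, 27, replace=TRUE)))

# fraction of independent sequences with at least as many crossings as Randolph's
mean(sim.crossings >= 17)
```

If Randolph's 17 crossings sit well inside the simulated distribution, the alternation in his scores is consistent with independent grading.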
outcome = c(0) # starting value(s) of the process
phi = c(.25) # vector of AR coefficients
sigma = 1 # standard deviation of the innovations
t = 1000 # number of time periods
p = 1 # order of the process
for(i in p:(t+p)){
outcome = c(outcome, sum(outcome[(i-p+1):i] * phi) + rnorm(1, 0, sigma))
}
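The loop above can be wrapped into a function so that each specification (different \(\phi\) vectors, different orders) is one call; simulate.ar is a name I've made up, and it reproduces the loop's logic, including the p starting zeros:

```r
# simulate t steps of an AR(p) process with coefficient vector phi
simulate.ar = function(phi, t=1000, sigma=1){
  p = length(phi)
  outcome = rep(0, p)  # p starting values, mirroring the loop above
  for(i in p:(t+p)){
    outcome = c(outcome, sum(outcome[(i-p+1):i] * phi) + rnorm(1, 0, sigma))
  }
  outcome
}

ar1 = simulate.ar(c(.25))
ar5 = simulate.ar(c(.1, .1, .1, .1, .1))
```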
Return to the specifications we worked with in the previous section, looking only at the ones that were AR(1):
In case the results in Exercise 2.2 are in any doubt, when \(\phi\) is larger than 1 (or smaller than \(-1\)) in an AR(1) model, the error term becomes cumulative. Thinking intuitively, this means that the “total” error in the process (the sum of all “epsilon” terms) grows larger and larger as time goes on. If \(|\phi| < 1\), the contribution of errors far enough in the past shrinks toward zero.6
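A quick way to see this cumulative behaviour is to drive a stable and an unstable AR(1) with the same shocks and compare their spread; an illustrative sketch, not part of the exercises:

```r
set.seed(1)
eps = rnorm(200)  # one shared set of shocks
stable = numeric(200); explosive = numeric(200)
for(i in 2:200){
  stable[i]    = 0.5 * stable[i-1]    + eps[i]  # |phi| < 1: bounded variation
  explosive[i] = 1.1 * explosive[i-1] + eps[i]  # |phi| > 1: errors accumulate
}
c(sd(stable), sd(explosive))  # the explosive series has a vastly larger spread
```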
Some basic time-series functionality is built directly into R’s default stats package. For one of the series you simulated above, estimate an AR(p) model using the ar function in R:
model = ar(outcome)
Note that, by default, this will fit an autoregressive process where \(p\) is selected using the Akaike information criterion. Since you know the “right” order of each of the outcomes we worked with in the previous questions, fit an AR to them and see if the parameters are close to correct, given their standard errors and point estimates.7
Further, since you do know the “right” order, you can force the ar function to fit an autoregressive model at the “wrong” order. For instance, if I wanted to force ar to fit a model with order k, I’d have to specify that I do not want the order chosen by AIC and instead request that order k be fit:
k=3
model.bad.specification = ar(outcome, aic=FALSE, order.max=k)
Fit an AR on the AR(5) where you force the model estimated to be an AR(20). Then, read about what the model’s $aic attribute means in the help page for the ar function by running ?ar. Plot this attribute. Do you see a tradeoff between model fit and model complexity?
You can also examine the autocorrelation structure of a series using the acf and pacf functions in R.
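For example, on a simulated AR(1) the ACF should decay geometrically while the PACF should cut off after lag 1; a self-contained sketch using stats::arima.sim:

```r
set.seed(7)
x = arima.sim(list(ar = 0.6), n = 1000)  # an AR(1) with phi = 0.6
par(mfrow = c(1, 2))
acf(x)   # geometric decay for an AR(1)
pacf(x)  # a single pronounced spike at lag 1
```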
In general, there are many spatially-correlated mixed-effect structures; further, many spatially-varying slopes models in multilevel settings (even if they’re not continuously-varying) can be thought of as having spatial correlation in their errors or effects. So, consider a typical varying-slope multilevel model.
NOTE: This is an example exercise; it is not required to complete, but it provides a working knowledge of the models discussed in the Spatial Autoregression section above.
First, we’ll fit some spatially-autoregressive models. We’ll need to synthesize a few ideas to do this. If you haven’t done this before or don’t recall some of this, consult Rich Harris’s “Introduction to Mapping and Spatial Modelling in R”, as well as Roger Bivand’s vignettes on creating spatial weights. For this, we’ll use some data I have on Airbnb pricing in Brooklyn, NY.
library(spdep)
library(sp)
library(ggmap)
library(readr)
library(dplyr)
bklyn = ggmap::get_map(location='Brooklyn, NY', zoom=12, source='stamen', maptype='toner-lite')
airbnb = read_csv('http://ljwolf.org/teaching/listings.csv')
Since the data is quite large (and this is a cleaned extract), let’s only work with a small random subsample of it.
airbnb = sample_n(airbnb, 1500)
head(airbnb, 10)
The data is quite wide, with large string columns holding nearly all of the information about each listing. In addition, each listing carries a long list of its amenities. We’ll focus on a few conventional price-relevant factors in our modeling.
To see a map of prices, we can use ggmap:
ggmap(bklyn) + geom_point(aes(x=X, y=Y, col=price), data=airbnb)
Note that the listings are not evenly distributed over the space. Airbnbs are much more common in central and north Brooklyn than they are in south Brooklyn. So, we need to use some method to account for this.
Fortunately, many of the methods we consider can incorporate this structure well, for instance by using K-nearest-neighbour weights. To construct these from our data:
coords = sp::coordinates(airbnb[c('X','Y')])
k200 = nb2listw(knn2nb(knearneigh(coords, k=200)))
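Before fitting anything spatially autoregressive, a sensible first check is whether OLS residuals are spatially autocorrelated at all, via spdep::lm.morantest (also mentioned in the hints). A minimal self-contained sketch on synthetic points; for the listings, you would instead fit lm(log(price) ~ ...) on airbnb and pass k200 as the weights:

```r
library(spdep)

set.seed(11)
# synthetic stand-in for the listing coordinates: 200 random points
xy = cbind(runif(200), runif(200))
w = nb2listw(knn2nb(knearneigh(xy, k = 10)))

# an outcome with a smooth spatial trend, plus an unrelated covariate
y = xy[, 1] + xy[, 2] + rnorm(200, 0, 0.1)
x1 = rnorm(200)

ols = lm(y ~ x1)
moran = lm.morantest(ols, w)  # residuals inherit the trend, so Moran's I is large
moran
```

A small p-value here suggests spatially patterned residuals, which motivates the autoregressive specifications below.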
log(price) ~ accommodates + bathrooms + bedrooms + days_since_last_review + host_is_superhost
+ instant_bookable + review_scores_rating
+ has_pool + has_wifi + has_beach + is_private + has_pets
and identify whether an autoregressive model may even be appropriate.9

Then, fit a model with an S.CARbym or S.CARdissimilarity structure.12 How are its marginal effects affected by the spatial component of that model relative to the standard linear model?

hint for 1.1: a line plot of the data centered on zero may be more helpful than just what plot(scores) would provide.↩
hint for 1.2: since we’re only dealing with approximates, round(runif(…)) can get you close↩
hint for 1.3: Write a function to do each step: first, a function to make a sequence of random scores; then, a function to get the number of mean-crossings in this sequence. Finally, write a loop to do this 1000 times.↩
hint for Exercise 2.1 part 2: you could do the standard deviation with a result vector sdeviations using the following loop, which computes the running standard deviation of the outcome starting at iteration 10: for(i in 10:length(outcome)){sdeviations = c(sdeviations, sd(outcome[1:i]))}↩
hint for Exercise 2.1 part 2: specifically, focus on the properties of \(\phi\) as a whole… what’s different about the \(\phi\) vectors for those last two AR(5) processes? Can you adjust the stable one to become unstable by changing just one \(\phi_i\)?↩
But, if you recall from Exercise 2.1, in an AR(p) with \(p > 1\), it’s not sufficient for every \(\phi_j\), \(j = 1 \dots p\), to be less than one in absolute value for the process to be stable. The full stability criterion has to do with the location of the roots of the lag polynomial. For more, see Sargent & Stachurski on stability/ergodicity conditions for these kinds of processes (or a generalization thereof).↩
hint for 2.3 part 1: note that you can grab a few attributes of the fitted AR object, such as asy.se.coef and aic, using the typical model$asy.se.coef pattern. Also, you can access help on the object or its output by running ?ar.↩
hint for Exercise 3.1 part 1: A good answer to this will talk about the main, core idea discussed in the Autoregressive Processes Section. Note that the Markov property is not necessary to have an autoregressive process (e.g. the distinction between SAR and CAR structures in spatial models), but it is helpful in reducing its complexity. Multilevel models with autoregressive structures are possible.↩
hint for Exercise 3.1 part 2: One reasonable way to do this uses spdep::lm.morantest. Again, consult your SM2 documents if you do not recall how to do this.↩
Thus, we are all atomized & alienated from both our current states and our predicted futures. If this were a different course, here would be a great place to begin an assault on the fundamentally neoliberal assumptions of statistical orthodoxy, but I’ll stick to the math here…↩
This is kind of like how \(\beta\) changes interpretation in many GLM specifications. If you’re interested, a brief intro to this problem in the spatial lag model is discussed by Golgher & Voss (2015).↩
hint for Exercise 3.1 last part: One nice thing that makes repeatedly fitting long models in R easier is the formula function. Using this, you can call formula(model.i.fit.already) and get the formula object from that previous model.↩