These exercises will primarily use functions in R but we will also
briefly touch upon the freebayes
software package. The
exercises are build up over two types of problems: introductory text
that you should run through to make sure you understand how to use and a
few additional questions for you to explore.
We will be needing the MESS
, energy
,
multiway
, pscl
and compositions
packages, which can all be installed using the
install.packages()
function.
install.packages(c("MESS", "energy", "compositions", "multiway", "pscl"))
Here we will look at a dataset concerned with the genetic associations to body-mass index (BMI) that we looked at earlier. The data can be downloaded directly from the web page using
options(timeout = 5*60)
load(url("https://www.biostatistics.dk/teaching/bioinformatics/data/gwasgt.rda"))
load(url("https://www.biostatistics.dk/teaching/bioinformatics/data/gwaspt.rda"))
We are interested in looking for other relationships that we did not find previously.
Use the same data as before. Pick 20 SNPs from columns 791 to 810 which are located around one of the SNPs of interest. If we use all the SNPs then the computations will take too long.
Compute the Pearson correlation matrix for the 20 genes/SNPs
selected above. You can use the cor()
function for
this.
Load the energy
package to get access to the
dcor()
function. Compute the distance correlation for each
of the SNPs against SNP 802. Does the results differ from the findings
from cor()
? [ Hint: you can use something like
sapply(791:810, function(i) { dcor(genotypes[,802], genotypes[,i])})
to do the computations. Make sure you understand what goes on here. ]
Use dcor()
to compare the SNPs to the BMI outcome
for the SNPs that were found on Wednesday to be related to the outcome.
Do you find any apparent relationships between the SNPs and the
phenotypes, that you did not find using the traditional regression
model?
Compute t tests for each of the distance correlations. What are
your findings? [Hint: use the dcorT.test()
function.
]
Why does it make sense to compare the outcome from the regression
model to results from dcor()
?
Zero-inflated and hurdle regression models with Poisson and
negative-binomial models can be modeled in R using the pscl
package. There are two primary functions we will be using:
hurdle()
and zeroinfl()
. Both functions work
similary to the glm()
function that we have used
previously.
Load the data from webpage by running the following command
load(url("http://www.biostatistics.dk/microbiome.rda"))
The data consists of two conditions, A and B, the relative abundance from 15 taxa and 35 samples from each condition. There is also information on age an sex of the individuals the provided the samples.
Fit a zero-inflated Poisson model for the raw relative abundance
of taxa 1 (otu 1) to compare the two conditions using the
zeroinfl()
function. Hint: you can use the
subset
argument to restrict the analysis to only one
taxa.
Also, the zeroinfl()
function uses the Poisson model by
default so it is necessary to round the relative abundance to the
closest integer using, say, the round()
function.
What is the conclusion? What are the directions of the effect of the conditions? Is this what you would expect?
Fit the same model with argument dist="negbin"
(still for the raw relative abundance of taxa 1). Which of the two
models fit best?
Now try the same two models but include additional information on
the gender and age. You may run into problems with the model fit /
convergence. You can specify individual models for the proper
distribution and the logistic regression model by specifying individual
models. This can be done in the model formula by giving two models
separate split with a vertical bar such as
y~x | x+z
.
Try fitting the same models with a hurdle model (use the
hurdle()
function)
Use the microbiome
data from the exercise above. The
data consists of two conditions, A and B, the relative abundance from 15
taxa and 35 samples from each condition. There is also information on
age an sex of the individuals the provided the samples. We start by
extracting the intensities as a matrix with 15 columns and 70 rows.
We start be extracting the values as a matrix with 70 rows and 15 columns
Arrange by otu!
exprmatrix <- matrix(microbiome$value, nrow=70)
compositions
package and compute the
isometric-log-ratio for each row of the expression matrix. Store the
result in some object for later use. [ Hint: You can use the
ilr()
function.]For this exercise we will be using the multiway
package
and the array x
that is grabbed from an R dataset on the
web. We wish to decompose 5 small metabolite arrays that are found in
x
and possibly get information on the basis functions.
library("multiway")
load(url("http://www.biostatistics.dk/parafac.rda")) # Loads x
The individual metabolite scans can be accessed through the last
index of the array. For example will x[,,1]
give the \(120\times 120\) matrix of values for
individual 1. To get an idea of the data try plotting a 3D plot of the
output from person 1:
persp(x[,,1])
If you have the rgl
package installed you can create a
3D plot that can be rotated:
library("rgl")
persp3d(x[,,1], col="lightblue")
There are two peaks on the array. The same is true for the other individuals although the relative height of the peaks vary from person to person.
Let us create the matrices used to create the tensor product:
res <- parafac(x, nfac=2)
summary(res)
res
contain? In particular look at the elements A
,
B
, and C
. Why do they have the dimensions they
have?res$A[1,]
. Plot the three components from
the first matrix and interpret them.We can create our matrix approximation with the fitted()
function.
appx <- fitted(res)
nfac
argument. Try to approximate the
tensor with 3 components. How much gain is there in the tensor
approximation now?Claus Ekstrøm 2024