Visualization - PCA

Post date: Nov 11, 2019 6:16:5 PM

I realized last Friday that only the odd numbered individuals had been sequenced.

I use the following command to keep them nin a new file called oddPandoSamples.txt

grep '^[0126][0-9][13579]' PandoSamples.csv > oddPandoSamples.csv

I modified the script to map the samples under mappingPandoOdd.R and I use source("mappingPandoOdd.R") to run it in R but I did not have any output so I now try under the bash terminal using R CMD BATCH mappingPandoOdd.R.

To visualize the output I openend ssh_config file situated in /etc/ssh/ and uncommented the lines related to X11Forward.

Running the R script in the terminal opened the map in a new window. Yayy!

Now I would like to export the colour column to use it for the PCA.

I use the colour column from the mappingPandoOdd.R file which correcponds to column number 4. I use

write.table(samps[,4], "type.txt") in R

Now I have a "type.txt" column that is sorted according to the individuals sample numbers (maybe I should export sample ID here as well) and that has a different color number for each sample.

However, the number of lines is 283 while th enumber of sequenced samples is 280.

I go back to the vcf file and extract the individual names from there.

These are the two line sthat worked:

grep 'potr' filtered2xHiCov_pando_variants.vcf > ./rozenn/type.txt

cut -f 10- type.txt > type_clean.txt

I was able to get the first PCA using the following code:

library(jpeg)

library(Cairo)

library(FactoMineR)

library(factoextra)

# Work with FactoMineR

#http://factominer.free.fr/factomethods/principal-components-analysis.html

######## Prepare the data ##############

# First, look at genotypes point estimates

pntest <- as.matrix(read.table("pntest_filtered2xHiCov_pando_variants.txt"))

names <- as.matrix(read.table("sampleID.txt"))

type <- as.matrix(read.table("type_clean.txt"))

# Specify which individuals are Pando

track_type = rep("aroundPando",each=length(type))

for (i in c(which(type == "potr-017-B"):(which(type == "potr-277-B"))))

{

track_type[i] <- "Pando"

}

for (i in c(which(type == "potr-603-B"):(which(type == "potr-659-B"))))

{

track_type[i] <- "nearbyClone"

}

data <- rbind(names,track_type,pntest)

data <- as.data.frame(t(data))

#data <- rbind(type,pntest)

# Transpose matrix to have it with individuals as rows and variables as columns

#data[,c(1,2)] <- as.character(data[,c(1,2)])

for (i in c(3:dim(data)[2]))

{

data[,i] <- as.numeric(as.character(unlist(data[[i]])))

}

######## Plot the data ##############

# Use habillage to specify groups for coloring

fviz_pca_ind(res.pca,

label = "none", # hide individual labels

habillage = data[,2], # color by groups

palette = c("#af8dc3", "#E7B800", "#5ab4ac"),

#addEllipses = TRUE # Concentration ellipses

)

which gives the following:

There is no filter being applied now, this PCA represents all individuals.

This coloring is rather arbitrary and based on microsatellite and mapping data.

Pando = from individual 017 to individual 277

nearbyClone = individuals from 603 to 659

nearby Pando = everything else

Applying read depth filter:

dp <- as.matrix(read.table("rawDepth.txt"))

dp <- sort(dp)

mean(dp)

[1] 1122.135

median(dp)

[1] 34

> sd(dp)

[1] 2620.133

Read depths per SNPs:

> quantile(dp)

0% 25% 50% 75% 100%

1 1 34 1311 76707

Find out the read depth per individual.

Page updated

Google Sites

Report abuse