Post date: Jun 06, 2017 7:21:51 PM
6vi17. My notes:
Bayesian sparse linear mixed model -- all SNPs at once, one trait at a time
There were 1895 genotyped inds (all GBSed Lycaeides).
My ind names are different than the GBS names: an extra 0, stuff on the end. So, ZG wrote a perl script to match individuals between the two files (but not to change names in either of the files).
**We caught some names that needed to be fixed -- the measurements that got added late didn't have 0s, two "DSB" weren't corrected to DBS, two DOP inds were 12-xxx not 02-xxx... These are now in resid-coord-centroidsize-procrustes-30v17.csv. On ZG computer, called coordPheno.txt. coordPheno.txt: no header, got rid of two ind with no genetic data (EGP11-027, LAN10-020), substituted commas for spaces.
We have 1000 individuals for mapping element position.
Fixed the same name issues in resid-size-3vi17.csv. ZG's file is called sizePheno.txt. We have 1113 inds for mapping element size.
The genetic data we'll use (point estimate of genotype from k=2 because some of the other k's didn't finish properly / but k's are usually highly correlated): gen_coordPheno.txt, gen_sizePheno.txt.
We have 75,476 SNPs (include common and not as common SNPs, not very rare SNPs).
ZG reformated the genotype files so that there's one ind per row, and scaffold information.
Reformatted: geno_size.txt, geno_coord.txt; mod_geno_size.txt, mod_geno_coord.txt
Phenotype files needed a final reformatting for mapping (one column per character, no individual names): pheno_size.txt, pheno_coord.txt
17 size traits to map
26 position traits to map
Name of program: Gemma
ZG's notes:
1. missing genetic data from EGP11-27 and LAN10-20, dropped from the
phenotype files
2. subset and sorted genotype data to match the phenotype data, this is
in
king:/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Geno
## for size, generates gen_sizePheno.txt with 1113 inds.
perl sortSubsetGenos.pl genEstWings.txt sizePheno.txt
## for position, generates gen_coordPheno.txt with 1000 inds.
perl sortSubsetGenos.pl genEstWings.txt coordPheno.txt
genotype files contain 75476 SNPs
3. Convert to final format for genotype
run code in transposeG.R (pasted below)
#######
## size
N<-1113
L<-75476
g<-matrix(scan("gen_sizePheno.txt",n=N*L,sep=","),nrow=N,ncol=L,byrow=TRUE)
G<-t(g)
write.table(G,"geno_size.txt",row.names=FALSE,col.names=FALSE,quote=FALSE)
## position
N<-1000
L<-75476
g<-matrix(scan("gen_coordPheno.txt",n=N*L,sep=","),nrow=N,ncol=L,byrow=TRUE)
G<-t(g)
write.table(G,"geno_coord.txt",row.names=FALSE,col.names=FALSE,quote=FALSE)
########
run perl script for final formatting
perl mkGeno.pl geno_size.txt
perl mkGeno.pl geno_coord.txt
4. format phenotype files, drops id columns
cut -f 2-53 -d " " coordPheno.txt > pheno_coord.txt
cut -f 4-20 -d " " sizePheno.txt > pheno_size.txt
5. Final infiles are moved to
king:/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma/
-----
Running gemma:
1 = quantitative data
First it does a relatedness matrix file. Parsing out individual SNP associations and how much phenotypic variation is explained by the relatedness matrix.
----
Gemma BSLMM for whole data set
Jobs were submitted from,
king:/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma
And results will be in,
king:/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma/output
Submission command = perl wrap_qsub_slurm_gemma.pl
This was run twice, once with the size files and once with the
coordinate files. This is edited in the script. The key part of the
script is here:
###################################
## build array of jobs to be run individually (serially) by slurm
my $dir =
'/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma';
my @jobarray;
my $cd = "cd $dir\n";
my $job;
#my $nph = 17; ## size, 17 phenos
my $nph = 52; ## position, 52 phenos
foreach my $ph (1..$nph){
$job = "perl forkRunGemma.pl mod_geno_coord.txt pheno_coord.txt
$ph\n";
#$job = "perl forkRunGemma.pl mod_geno_size.txt pheno_size.txt
$ph\n";
push (@jobarray, "$cd"."$job");
print "$cd"."$job";
}
##################################
This really just runs the forkRunGemma.pl script, which is used to fit
the BSLMM for each phenotype with 5 chains. Here is the script:
##################################
#!/usr/bin/perl
#
# creates a child process per chain
#
$g = shift(@ARGV); ## genotype file
$p = shift(@ARGV); ## phenotype file
$n = shift(@ARGV); ## phenotype number
$out = $p;
$out =~ s/\.txt// or die "failed sub $p\n";
$out = "o_$n"."_$out";
foreach $ch (0..4){
$pid = fork;
if ($pid) { ## fork successful
$outch = "$out"."_$ch";
exec "gemma-0.94.1 -g $g -p $p -bslmm 1 -n $n -o $outch -maf
0.001 -w 200000 -s 1000000";
}
push(@pids,$pid);
}
#wait for all children to finish
for my $pid (@pids) {
waitpid $pid, 0;
}
#################################
MCMC details are: number of chains = 5, burnin = 200,000, sampling steps
= 1 million, thinning interval = 10
------
For mapping within subgroups, the following files have subgroup info in them: resid-coord-centroidsize-procrustes-6vi17-subgroups.csv, resid-size-6vi17-subgroups.csv. These files also have EGP11-27 and LAN10-20 removed. Wing area was removed from the size file.
Other one we will run: multivariate linear mixed model; use this datafile for this model (both size and position are in the same file, which means there are lots of NAs for individuals that don't have the position data): resid-sizeANDcoord-6vi17-subgroups
This manual is useful: http://www.xzlab.org/software/GEMMAmanual.pdf