mapping: BSLMM for all inds

Post date: Jun 06, 2017 7:21:51 PM

6vi17. My notes:

Bayesian sparse linear mixed model -- all SNPs at once, one trait at a time

There were 1895 genotyped inds (all GBSed Lycaeides).

My ind names are different than the GBS names: an extra 0, stuff on the end. So, ZG wrote a perl script to match individuals between the two files (but not to change names in either of the files).

**We caught some names that needed to be fixed -- the measurements that got added late didn't have 0s, two "DSB" weren't corrected to DBS, two DOP inds were 12-xxx not 02-xxx... These are now in resid-coord-centroidsize-procrustes-30v17.csv. On ZG computer, called coordPheno.txt. coordPheno.txt: no header, got rid of two ind with no genetic data (EGP11-027, LAN10-020), substituted commas for spaces.

We have 1000 individuals for mapping element position.

Fixed the same name issues in resid-size-3vi17.csv. ZG's file is called sizePheno.txt. We have 1113 inds for mapping element size.

The genetic data we'll use (point estimate of genotype from k=2 because some of the other k's didn't finish properly / but k's are usually highly correlated): gen_coordPheno.txt, gen_sizePheno.txt.

We have 75,476 SNPs (include common and not as common SNPs, not very rare SNPs).

ZG reformated the genotype files so that there's one ind per row, and scaffold information.

Reformatted: geno_size.txt, geno_coord.txt; mod_geno_size.txt, mod_geno_coord.txt

Phenotype files needed a final reformatting for mapping (one column per character, no individual names): pheno_size.txt, pheno_coord.txt

17 size traits to map

26 position traits to map

Name of program: Gemma

ZG's notes:

1. missing genetic data from EGP11-27 and LAN10-20, dropped from the

phenotype files

2. subset and sorted genotype data to match the phenotype data, this is

king:/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Geno

## for size, generates gen_sizePheno.txt with 1113 inds.

perl sortSubsetGenos.pl genEstWings.txt sizePheno.txt

## for position, generates gen_coordPheno.txt with 1000 inds.

perl sortSubsetGenos.pl genEstWings.txt coordPheno.txt

genotype files contain 75476 SNPs

3. Convert to final format for genotype

run code in transposeG.R (pasted below)

#######

## size

N<-1113

L<-75476

g<-matrix(scan("gen_sizePheno.txt",n=N*L,sep=","),nrow=N,ncol=L,byrow=TRUE)

G<-t(g)

write.table(G,"geno_size.txt",row.names=FALSE,col.names=FALSE,quote=FALSE)

## position

N<-1000

L<-75476

g<-matrix(scan("gen_coordPheno.txt",n=N*L,sep=","),nrow=N,ncol=L,byrow=TRUE)

G<-t(g)

write.table(G,"geno_coord.txt",row.names=FALSE,col.names=FALSE,quote=FALSE)

########

run perl script for final formatting

perl mkGeno.pl geno_size.txt

perl mkGeno.pl geno_coord.txt

4. format phenotype files, drops id columns

cut -f 2-53 -d " " coordPheno.txt > pheno_coord.txt

cut -f 4-20 -d " " sizePheno.txt > pheno_size.txt

5. Final infiles are moved to

king:/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma/

-----

Running gemma:

1 = quantitative data

First it does a relatedness matrix file. Parsing out individual SNP associations and how much phenotypic variation is explained by the relatedness matrix.

----

Gemma BSLMM for whole data set

Jobs were submitted from,

king:/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma

And results will be in,

king:/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma/output

Submission command = perl wrap_qsub_slurm_gemma.pl

This was run twice, once with the size files and once with the

coordinate files. This is edited in the script. The key part of the

script is here:

###################################

## build array of jobs to be run individually (serially) by slurm

my $dir =

'/uufs/chpc.utah.edu/common/home/u6000989/projects/lycaeides_wings/Gemma';

my @jobarray;

my $cd = "cd $dir\n";

my $job;

#my $nph = 17; ## size, 17 phenos

my $nph = 52; ## position, 52 phenos

foreach my $ph (1..$nph){

$job = "perl forkRunGemma.pl mod_geno_coord.txt pheno_coord.txt

$ph\n";

#$job = "perl forkRunGemma.pl mod_geno_size.txt pheno_size.txt

$ph\n";

push (@jobarray, "$cd"."$job");

print "$cd"."$job";

}

##################################

This really just runs the forkRunGemma.pl script, which is used to fit

the BSLMM for each phenotype with 5 chains. Here is the script:

##################################

#!/usr/bin/perl

# creates a child process per chain

$g = shift(@ARGV); ## genotype file

$p = shift(@ARGV); ## phenotype file

$n = shift(@ARGV); ## phenotype number

$out = $p;

$out =~ s/\.txt// or die "failed sub $p\n";

$out = "o_$n"."_$out";

foreach $ch (0..4){

$pid = fork;

if ($pid) { ## fork successful

$outch = "$out"."_$ch";

exec "gemma-0.94.1 -g $g -p $p -bslmm 1 -n $n -o $outch -maf

0.001 -w 200000 -s 1000000";

}

push(@pids,$pid);

}

#wait for all children to finish

for my $pid (@pids) {

waitpid $pid, 0;

}

#################################

MCMC details are: number of chains = 5, burnin = 200,000, sampling steps

= 1 million, thinning interval = 10

------

For mapping within subgroups, the following files have subgroup info in them: resid-coord-centroidsize-procrustes-6vi17-subgroups.csv, resid-size-6vi17-subgroups.csv. These files also have EGP11-27 and LAN10-20 removed. Wing area was removed from the size file.

Other one we will run: multivariate linear mixed model; use this datafile for this model (both size and position are in the same file, which means there are lots of NAs for individuals that don't have the position data): resid-sizeANDcoord-6vi17-subgroups

This manual is useful: http://www.xzlab.org/software/GEMMAmanual.pdf