Manual

Last update: 22 November 2020

Model Fitting

Usage

Half-Cauchy

./ssgp --phenotype_file phen.csv --trait_name milk --covariate_file covar.csv --covariate_names g1,g2,g3 --snp_info_file snp_info.csv --snp_set_name group --binary_genotype_file geno --snp_scale_value 10 --output out --num_threads 2

Inverse-gamma

./ssgp --model ig --phenotype phen.csv --trait milk --error_weight_name milk_wt --covariate_file covar.csv --covariate_names all --snp_info_file snp_info.csv --snp_set group --snp_weight_name snp_wt --binary_genotype_file geno --snp_shape_value 0.01 --snp_scale_value 1 --error_shape_value 0.01 --error_scale_value 0.01 --output out --num_threads 2

Hyper-parameter optimization

./ssgp --model ig --phenotype_file phen.csv --trait_name milk --covariate_file covar.csv --covariate_names all --snp_info_file snp_info.csv --snp_set_name group --binary_genotype_file geno --hyper_opt --snp_hyper_exp_rate 1 --output out --num_threads 2

Beta weights

Options on the command line can be recognized so long as they provide a prefix of the option name that matches exactly one of the accepted options. For example, the option --trait will match as --trait_name, but --covariate (either --covariate_file or --covariate_names) will not match uniquely, so an error will be raised.

Input

See the input files used by the commands above. All plain text files are CSV (comma-separated values) files.

Model

Inverse-gamma prior for variances of SNP effect size and for error variance

--model ig

Half-Cauchy prior for variances of SNP effect size

--model hc

The error variance is assumed to have a prior probability, p(σ²)∝1/σ².

The default model option is half-Cauchy.

Phenotypes

--phenotype_file phen.csv --trait_name milk

--phenotype_file phen.csv --trait_name milk --error_weight_name milk_wt

The phenotype file should be a CSV file with the first row being header. The first column must be individual ID, and the following columns are phenotypes (one trait per column) or individual-specific weights for error variance (one trait per column). You can use names in header to specify trait and error weight.

Leave missing values empty in the phenotype file. Individuals with missing phenotype or missing error weight (if specified) will not be used in model fitting.

When the error weight option is not specified, the weights will be all set to 1.

Genotypes

SSGP binary file

We defined a simple binary file (with extension bin) to store individual-major genotype data, in which each genotype is stored in a byte. There are also two flat files accompanying binary file, i.e., indi file and mrk file. The indi file lists all the individuals in the binary file in the same order. The mrk file lists all the SNP markers in the binary file in the same order, of which the three columns are SNP ID, chromosome and physical position, respectively. Neither of the files has header.

SSGP has an option to convert CSV genotype file to bin, indi and mrk files, as shown below. Three files, geno.bin, geno.indi and geno.mrk would be generated.

./ssgp --csv_genotype_file geno.csv --binary_genotype_file geno

In the CSV genotype file, the first five rows list SNP IDs, SNP chromosomes, physical positions, allele 1 and allele 2, respectively, and the first column lists individual IDs. SSGP assumes that users use the same genotype coding rule in training and validation populations. When the CSV genotype file option is used, SSGP will do file conversion only. To do model fitting, the binary genotype file option must be given.

PLINK binary file

SSGP can also read PLINK binary file (bed/bim/fam) to do model fitting. Refer to the PLINK website for description of the file formats. Note that SSGP only uses within-family ID in fam file.

Option to read genotype files

--binary_genotype_file geno

With this option, SSGP will first search for SSGP files (geno.bin, geno.indi and geno.mrk). If they are not found, SSGP will further search for PLINK files (geno.bed, geno.bim and geno.fam).

SSGP assumes that there is no missing genotype and that SNPs have been sorted by physical positions.

Covariates

--covariate_file covar.csv

--covariate_file covar.csv --covariate_names all

--covariate_file covar.csv --covariate_names covar1,covar2,covar3

The covariate file should be a CSV file with the first row being header. The first column must be individual ID, and the following columns are covariates (one per column). When you want to use all the covariates in the covariate file, typing "all" is sufficient.

If the covariate file option is not specified, SSGP will search covariates in the phenotype file. In such case, do not use the key word "all." If the covariate names option is not specified, SSGP will only consider intercept. However, if the covariate names option is specified, SSGP will not intentionally add intercept into covariates, so you have to consider intercept in your covariate file.

Leave missing values empty in the covariate file. Individuals with any missing value for the specified covariates will not be used in model fitting.

SNP info

--snp_info_file snp_info.csv --snp_set_name group --snp_weight_name snp_wt

--snp_info_file snp_info.csv --snp_set_name group

--snp_info_file snp_info.csv --snp_weight_name snp_wt

--snp_info_file snp_info.csv

The SNP info file should be a CSV file with the first row being header. The first column must be SNP ID. There may be additional columns specifying SNP sets or SNP weights for variance of effect size.

When the SNP set option is not specified, all SNPs will be considered in one group called "NULL." When the SNP weight option is not given, all SNPs will be given a weight of 1. Leave missing values empty in the SNP info file. Individuals with missing SNP set or missing SNP weight (if specified) will not be used in model fitting.

SSGP only uses the SNPs listed in the SNP info file. Thus, you can easily specify what SNPs to be used in model fitting without changing big genotype file. SSGP also provides an option to filter SNPs by minor allele frequency (MAF), as shown blow.

--min_maf 0.01

By default, SSGP excludes markers that have MAF of 0 in the analysis population.

Set info

We need to set hyper-parameters of the variance prior for each SNP set to do model fitting. Inverse-gamma model has two hyper-parameters (i.e., shape and scale of inverse-gamma prior for SNP effect variance) for each SNP set, while half-Cauchy model has one (i.e., [squared] scale of half-Cauchy prior for SNP effect variance). SSGP has two ways to get this info.

One value for all SNP sets

SSGP has the options snp_shape_value and snp_scale_value. With the options below, we set both shape and scale parameters to 0.01 for all SNP sets. If the half-Cauchy model is used, only the snp_scale_value option is needed. To make hyper-parameters comparable between inverse-gamma prior and half-Cauchy prior, we actually use the snp_scale_value option to set squared scale rather than scale for the half-Cauchy model.

--snp_shape_value 0.01 --snp_scale_value 1

The default value is 0.001 for both shape and scale.

Distinct values for SNP sets

If you want to set set-specific hyper-parameter values, use a file instead.

--set_info_file set_hyper.csv

The set info file is a CSV file with the first row being header. The first column is SNP set ID. There must be one additional column for shape (not necessary for the half-Cauchy model) and one for scale. SSGP will search for key words "shape" and "scale" in header to determine which column lists shape and which column lists scale.

Any SNP sets that are not listed or have missing value in the set info file are given hyper-parameter values set by the options snp_shape_value and snp_scale_value.

--set_info_file set_hyper.csv --snp_shape_value 0.01 --snp_scale_value 0.01

Error prior

--error_shape_value 0.01 --error_scale_value 0.01

The two options are used to set hyper-parameter values of the inverse-gamma prior for error variance. A small value usually works well for model fitting. The default value is set to 0.001 for both shape and scale.

These two options are not needed when the half-Cauchy model is used.

Output

--output out

This option specify the prefix of output filenames. For example, with --output out, SSGP generates four CSV files with the same prefix, i.e., out.snp.csv, out.set.csv, out.covar.csv and out.f.csv, which contain SNP effect estimates, variance estimates of SNP sets, covariate effect estimates and variational free energy of model fitting, respectively. The contents of these files are self-evident (see an example).

MAF-based SNP weights

--beta_weight a,b

In this option, a and b are the two parameters of a beta distribution. When you want to give more weight to rare variants, you may use --beta_weight 1,10. SSGP automatically replaces any SNP weights specified by --snp_weight_name with the beta weights.

LD-based optimization of SNP weights

--optimize_weight_by_ld

Drawing on improved GRM estimator obtained by LD-based SNP weighting (Wang et al. 2017), we implemented LD-based SNP weighting in SSGP. This is supposed to result in a better estimation of genetic variance when all SNPs are treated in one set. However, its effect on genomic prediction is still unclear. Thus, it is not recommended to use this option.

This option has higher priority than beta weights; that is, SSGP will always use LD-based SNP weights if both of the options are used.

Hyper-parameter optimization

--hyper_optimization

When this option is turned on, SSGP will optimize hyper-parameters (shape and scale parameters in inverse-gamma prior of variances, or scale in half-Cauchy prior of variances) regarding likelihood or posterior. That is, SSGP will obtain their maximum likelihood (ML) or maximum a posteriori (MAP) estimates.

In the inverse-gamma model with hyper-parameter optimization, shape and scale parameters for each of SNP sets and error term have the same value, which is further given an exponential hyper-prior. The rate parameter in the exponential distribution can be set by --snp_hyper_exp_rate for SNP sets (one value for all) and by --error_hyper_exp_rate or error, e.g., --snp_hyper_exp_rate 0.1 --error_hyper_exp_rate 0.1. The default value for both of the options is 0.001.

In the half-Cauchy model, SSGP can obtain ML estimate of the scale parameter in half-Cauchy prior, when hyper-parameter optimization is used. However, we do not recommend it for the half-Cauchy model, because it is generally easy to determine a good scale value (to be given --snp_scale_value).

Other options

Set number of threads

--num_threads 10

Using multiple threads accelerates computation. The default number is 1.

Set subset size

--subset_size 1000

When SNP sets are big, SSGP can split them into smaller subsets to reduce memory use. The number is also used to control the block size in computation of LD-based SNP weights. The default number is 2000.

Note that splitting big SNP sets into smaller subsets will compromise estimation of SNP effects unless the subsets are independent from one another regarding genotypes.

Set tolerance

--tolerance 0.01

Iterations end, when the change in variational free energy from the previous iteration to the current one is smaller than the tolerance. The default value is 0.01.

Set max number of iterations

--max_num_iterations 300

The default number is 500.

Prediction

This option is used to predict total genetic values (TGVs) or genomic breeding values (GBVs) of new individuals.

./ssgp --prediction --snp_estimate_file out.snp.csv --binary_genotype_file validation.geno --output validation.gebv

Input files include the SNP effect estimate file obtained from model fitting and the binary genotype file containing the individuals of which you want to predict TGVs. The output is a two-column CSV file, of which the first column is individual ID and the second is predicted TGV.

Cross Validation

--cross_validation 2000,1000,10

The three numbers (joined by comma) mean the size of training population, the size of validation population and the number of replications, respectively. For each replication, SSGP randomly selects 2000 individuals as training and other 1000 as validation. Note that this is not k-fold cross validation. The benefit of such a design is that we can use only a small part of total data in each replication, reducing computational burden for large data sets.

This option lets SSGP generate two output files, out.corr.cv.csv and out.regr.cv.csv. The first file (*.corr.cv.csv) contains the Pearson correlation coefficient of predicted total genomic value and observed phenotype in the validation population, which can be used to assess prediction accuracy (the higher, the better). The second file (*.regr.cv.csv) contains the slope coefficient of the simple linear regression of observed phenotype on predicted total genomic value in the validation population, which can be used to assess unbiasedness (the closer to 1, the better).

The kth replicate in one run of cross validation will have exactly the same training population and validation population as the kth replicate in another run, as long as genotype files, phenotype file and covariate file are not changed. Thus, you can run cross validation multiple times with different SNP-set settings or with different hyper-parameter values, and then compare prediction performance between them based on the CV output (even with paired t-test).

Grid Search

Whole-genome markers can be divided into continuous, non-overlapping chunks, each containing k SNPs. This is a simple way to define SNP sets without using any external information. However, optimal k value is unknown. To find it, you can let SSGP do grid search.

--grid_search 10,20,50,100 --cross_validation 2000,1000,10

Each number in the grid search option is a set size (i.e., k) to examine. The numbers (as many as 20) must be joined by comma. With the options above, SSGP will evaluate four different set sizes by cross validation. The cross validation option must be set if you want to use the grid search option.

The cross validation results for grid search are written in two output files (*.corr.cv.csv and *.regr.cv.csv), so that you can manually check the optimal set size based on either "correlation" or "regression." SSGP selects optimal set size based on "correlation" by default. If you want to use "regression," you must put s (indicating slope of simple linear regression) at the end of the cross validation option.

--grid_search 10,20,50,100 --cross_validation 2000,1000,10,s

After grid search, SSGP will automatically complete analysis with the selected optimal set size.

Note that in grid search, the SNP sets are defined using the SNPs filtered by MAF rather than the original SNP list.

References

Wang, Bowen, Serge Sverdlov, and Elizabeth Thompson. "Efficient Estimation of Realized Kinship from Single Nucleotide Polymorphism Genotypes." Genetics 205.3 (2017): 1063-1078.

Google Sites

Report abuse