How can I see the SNPs that were actually used to build the principal components?
In the subdirectory pcaer_output_name/ you will find multiple bim files. Look for output_name.menv.bim or output_name.mepr.bim. Both should be identical, and they list the SNPs used to build the principal components.
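To check that the two files really list the same SNPs, a small script can help. The following is a minimal sketch, not part of Ricopili: it only assumes the standard PLINK .bim layout (SNP ID in the second whitespace-separated column), and the function names are made up for illustration.

```python
def read_bim_snps(path):
    """Return the SNP IDs (second column) of a PLINK .bim file, in file order."""
    snps = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:  # skip empty or malformed lines
                snps.append(fields[1])
    return snps


def same_snp_set(bim_a, bim_b):
    """True if both .bim files contain exactly the same set of SNP IDs."""
    return set(read_bim_snps(bim_a)) == set(read_bim_snps(bim_b))
```

For example, same_snp_set("pcaer_output_name/output_name.menv.bim", "pcaer_output_name/output_name.mepr.bim") should return True if the two files agree.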
How can I set the threshold for related pairs?
With the option --rel you can set the IBD threshold that defines a related pair. For example, if you want only duplicated samples (not related ones) to be removed, use --rel 0.9.
How can I perform only the PCA, without deduplicating the dataset?
Have a look at the file output_name.menv.trans.mds. It contains all individuals. The PCs for related or duplicated individuals are duplicated, i.e. exactly the same.
I am getting an error when calculating PCs (epca step of pca module) for my datasets.
ID too long
fatalx
The error is caused by the IDs of some SNPs being too long. This problem occurs with the new phase 3 imputation, because indel alleles are coded in some SNP names, which then exceed 20 characters. Use the latest version of Ricopili.
For checking relatedness between a large number of individuals I want to run the pcaer, but I am unable to finish this module. Just before the generation of the genome file the module halts. Do you know what is going on?
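To see which SNP names are affected in a given dataset, one could scan the .bim file for long IDs. This is a minimal sketch under the assumption that the file is a standard PLINK .bim with the SNP ID in the second column; the function name is hypothetical, and the 20-character limit is taken from the answer above.

```python
def long_snp_ids(bim_path, max_len=20):
    """Return the SNP IDs in a PLINK .bim file that are longer than max_len characters."""
    long_ids = []
    with open(bim_path) as handle:
        for line in handle:
            fields = line.split()
            # The SNP ID is the second column of a .bim file.
            if len(fields) >= 2 and len(fields[1]) > max_len:
                long_ids.append(fields[1])
    return long_ids
```

An empty result suggests that the error has a different cause.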
The --dedup12 option might not be working, so it was removed from the command. (It had been included to compare only the first dataset with all the datasets that follow in the command, so that no pairs are calculated among the other datasets. This was preferable because the purpose was to find duplicates of the first dataset in all the other datasets.) Without the option the module finished properly, but the output (*.mepr.overlap.pdf) was not interpretable because the table was too large. The advice was therefore to grep the *overlap file for the individuals of interest. This was a good solution for quickly extracting the individuals of the first dataset with a PI_HAT score > 0.2.
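The grep step can also be scripted. The sketch below is only an illustration and rests on an assumption that should be checked against your own file: that the *overlap file is a whitespace-delimited table whose header line contains a PI_HAT column, as in a PLINK .genome file. The function name and the pattern argument are hypothetical.

```python
def related_pairs(overlap_path, threshold=0.2, pattern=None):
    """Return the rows of a whitespace-delimited table whose PI_HAT exceeds threshold.

    Assumes the first line is a header containing a column named PI_HAT.
    If pattern is given, only lines containing that substring are kept,
    which mimics a plain grep for an ID or dataset prefix.
    """
    pairs = []
    with open(overlap_path) as handle:
        header = handle.readline().split()
        pi_hat_col = header.index("PI_HAT")  # raises ValueError if the column is missing
        for line in handle:
            fields = line.split()
            if len(fields) <= pi_hat_col:
                continue  # skip empty or truncated lines
            if pattern is not None and pattern not in line:
                continue
            if float(fields[pi_hat_col]) > threshold:
                pairs.append(fields)
    return pairs
```

Passing an ID prefix that is specific to the first dataset as pattern would restrict the result to pairs involving that dataset.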
I did run PCA, but I'm getting this weird PC1-vs-PC2 plot in the end, what's wrong?
This is clearly an issue of insufficient pruning, owing to the HLA region. The three clusters appearing here correspond to the three genotypes of a set of SNPs in very high LD, which dominate the principal components. You should prune your dataset and re-run.
The epca step of the PCA module gives a walltime error, even after increasing walltime and memory. What should I do?
Please update the 'eloc' variable in the ricopili.conf file (in your home directory) with the path to the latest version of EIGENSTRAT (EIG6.0 or higher).
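To check how many SNPs from the HLA region are still in the dataset, the .bim file can be scanned by position. This is a minimal sketch, not Ricopili's own pruning procedure. The default window of chromosome 6, 25 to 35 Mb, is an assumption (a commonly used approximation of the extended MHC region) and depends on the genome build; the function name is hypothetical.

```python
def snps_in_region(bim_path, chrom="6", start=25_000_000, end=35_000_000):
    """Return the SNP IDs of a PLINK .bim file that fall inside chrom:start-end (inclusive)."""
    hits = []
    with open(bim_path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 4:
                continue  # skip empty or malformed lines
            # .bim columns: chromosome, SNP ID, genetic distance, base-pair position, ...
            if fields[0] == chrom and start <= int(fields[3]) <= end:
                hits.append(fields[1])
    return hits
```

The resulting IDs could be written to a file, one per line, and passed to plink's --exclude option before re-running the PCA.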
How can I integrate the 1KG populations with the Ricopili PCA module? Are they publicly available in the correct format for easy integration?
Yes, the 1KG populations can easily be integrated with the Ricopili PCA module for your analysis. You can download the plink files from here and place them into a new directory along with your plink files (study cohort), then invoke the PCA module. The download consists of two plink file sets: pop_4pop_mix_SEQ.* should be used to compare with a worldwide set of individuals, while pop_euro_eur_SEQ.det.* contains only European populations, separated by country.