Quicklinks
When most people think of DNA, they think of the double helix from Rosalind Franklin's notes. This helix can be decomposed into several structural elements. The original version of DNAshape was derived from Monte Carlo simulations of DNA fragments whose structural features had been determined experimentally. The resulting model is used to predict the structural features shown below for sequences that lack experimental data. An updated model was published in 2017 and is available through the DNAshapeR R package. This update expanded the number of features considered by the model from 4 to 13 and is shown below.
https://rohslab.usc.edu/Papers/2017_Li_etal_NAR.pdf
https://rohslab.usc.edu/Papers/2013_NAR_DNAshape.pdf
System and R configuration on the cluster
Installing DNAshapeR
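The installation commands are not shown here. A minimal sketch, assuming the package is installed from Bioconductor through BiocManager and that the cluster allows installing into a user library:

```r
# Assumption: DNAshapeR is pulled from Bioconductor via BiocManager.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("DNAshapeR")

# Load the package to confirm that the installation succeeded.
library(DNAshapeR)
```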
Obtaining matrices of first and second order DNA shape features using DNAshapeR
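One way this step might look is sketched below. The wrapper `encode_shape_features` and the file name `sequences.fa` are hypothetical. The sketch assumes the default shape features from `getShape()` and the `"1-shape"` and `"2-shape"` feature names accepted by `encodeSeqShape()`; the original calls and their arguments may have differed.

```r
# Sketch: predict DNA shape for a FASTA file and encode the first and second
# order shape features as matrices with one row per encoded sequence.
encode_shape_features <- function(fasta_file) {
  # Fail early with a clear message if DNAshapeR is not installed.
  if (!requireNamespace("DNAshapeR", quietly = TRUE)) {
    stop("DNAshapeR is not installed")
  }
  # Predict the default set of shape features for every sequence.
  shape_pred <- DNAshapeR::getShape(fasta_file)
  list(
    first_order  = DNAshapeR::encodeSeqShape(fasta_file, shape_pred, "1-shape"),
    second_order = DNAshapeR::encodeSeqShape(fasta_file, shape_pred, "2-shape")
  )
}

# Usage (hypothetical path):
# features <- encode_shape_features("sequences.fa")
# dim(features$first_order)
# dim(features$second_order)
```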
The matrix of first order encoded features has dimensions 18789x386. The matrix of second order encoded features has dimensions 18789x382. Given that the input FASTA file has 18853 sequences, 64 of them were discarded in the process of running DNAshapeR. The row names of the matrices do not contain identifying information that relates each row to the sequence in the FASTA file that it represents. Thus, it is not apparent which sequence corresponds to which shape vector, so analysis at the individual gene level will not be possible without more information.
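The gap between the input and the output is 18853 - 18789 = 64 sequences. One way to start looking for them is to list the FASTA records that stand out. The base R sketch below flags records that contain characters other than A, C, G and T, or whose length differs from the most common length. Both criteria are assumptions about why a sequence might be dropped, not documented DNAshapeR behaviour, and the function names are hypothetical.

```r
# Read a FASTA file into a data frame with one row per record.
read_fasta_records <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]
  is_header <- startsWith(lines, ">")
  # Assign every sequence line to the record opened by the preceding header.
  record <- factor(cumsum(is_header)[!is_header],
                   levels = seq_len(sum(is_header)))
  seqs <- vapply(split(lines[!is_header], record), paste, character(1),
                 collapse = "")
  data.frame(
    id = sub("^>", "", lines[is_header]),
    sequence = toupper(unname(seqs)),
    stringsAsFactors = FALSE
  )
}

# Flag records that might be candidates for having been discarded.
flag_suspect_records <- function(records) {
  records$length <- nchar(records$sequence)
  # Assumption: ambiguous bases such as N could cause a sequence to be dropped.
  records$non_acgt <- grepl("[^ACGT]", records$sequence)
  # Assumption: sequences shorter or longer than the rest could be dropped.
  modal_length <- as.integer(names(which.max(table(records$length))))
  records$odd_length <- records$length != modal_length
  records
}
```

If the number of flagged records matched the 64 missing rows, that would suggest, but not prove, which sequences were removed. Assigning the remaining headers as row names would further require that the encoder preserves input order.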
The expression values from additional file 3 of "RNA-seq of life stages of the oomycete Phytophthora infestans reveals dynamic changes in metabolic, signal transduction, and pathogenesis genes and a major role for calcium signaling in development" will be used as additional features for clustering.
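A rough sketch of how the two feature sets could be combined is given below. It assumes that gene identifiers have been recovered as row names for the shape matrix and that the expression table has been loaded as a numeric matrix with the same kind of row names. k-means is used only as a stand-in, since the clustering method is not specified here, and `combine_and_cluster` is a hypothetical helper.

```r
# Sketch: join shape and expression features on shared gene IDs, then cluster.
combine_and_cluster <- function(shape_features, expression, k) {
  # Keep only genes present in both matrices, in a common order.
  shared <- intersect(rownames(shape_features), rownames(expression))
  combined <- cbind(shape_features[shared, , drop = FALSE],
                    expression[shared, , drop = FALSE])
  # Scale columns so shape and expression values contribute comparably.
  scaled <- scale(combined)
  # Drop constant columns, which scale() turns into NaN.
  scaled <- scaled[, colSums(is.na(scaled)) == 0, drop = FALSE]
  # Assumption: k-means stands in for whichever clustering method is used.
  kmeans(scaled, centers = k, nstart = 10)
}
```

The returned object holds one cluster label per shared gene in `fit$cluster`, named by gene ID.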
Source: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3585-x#Sec31