Summary: Machine learning-derived embeddings are a compressed representation of high-content data modalities. Embeddings can capture detailed information about disease states and have been qualitatively shown to be useful in genetic discovery. Despite their promise, embeddings have a major limitation: it is unclear whether genetic variants associated with embeddings are relevant to the disease or trait of interest. In this work, we describe EmbedGEM (Embedding Genetic Evaluation Methods), a framework to systematically evaluate the utility of embeddings in genetic discovery.
PODCAST: https://notebooklm.google.com/notebook/cc2ecf95-3142-4186-8fad-1622a96de732/audio
NOTEBOOK: Link.
GitHub: https://github.com/insitro/EmbedGEM
* Can we use this approach with the TCGA report data (linked to TCGA genetics data)? For example, check whether the predictive value for some disease (or other outcome) goes up when PRSs derived from embedded TCGA reports are included in a model, compared with a model that uses only structured TCGA fields.
* We could use this to measure how well a visualizable low-dimensional (2D or 3D) representation of the data captures heritability and how predictive it is of disease. This could be useful for EDA (the basic form of Jonas' data cartography idea).
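The model comparison in the first bullet could be sketched roughly as follows. This is a minimal illustration and not EmbedGEM's own evaluation: it assumes each model has already produced one risk score per sample and compares the two by AUC. The helper name `auc` and all numbers are hypothetical toy values, not TCGA data.

```javascript
// Rank-based AUC (Mann-Whitney U): the probability that a randomly chosen
// case (label 1) scores higher than a randomly chosen control (label 0).
// Ties count as half a win.
function auc(scores, labels) {
  let wins = 0;
  let pairs = 0;
  for (let i = 0; i < scores.length; i++) {
    if (labels[i] !== 1) continue;
    for (let j = 0; j < scores.length; j++) {
      if (labels[j] !== 0) continue;
      pairs++;
      if (scores[i] > scores[j]) wins += 1;
      else if (scores[i] === scores[j]) wins += 0.5;
    }
  }
  if (pairs === 0) throw new Error("need at least one case and one control");
  return wins / pairs;
}

// Toy, made-up scores for six samples (assumption: higher = more at risk).
const labels = [0, 0, 1, 1, 0, 1];
const structuredOnly = [0.2, 0.6, 0.5, 0.7, 0.4, 0.3];   // structured fields only
const withEmbeddingPRS = [0.1, 0.5, 0.6, 0.8, 0.3, 0.4]; // plus embedding-derived PRS

const aucBase = auc(structuredOnly, labels);
const aucWith = auc(withEmbeddingPRS, labels);
console.log({ aucBase, aucWith, gain: aucWith - aucBase });
```

On real data the gain would need a held-out test set and a confidence interval (for example by bootstrapping) before reading anything into it.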
How to avoid the fate of ICGC? (https://dcc.icgc.org)
Aviv Regev talking about cell atlas Thursday 1-2pm
TCGA appears to be the most important and most solid data resource for biomolecular cancer research. GDC appears solid. Is https://www.cbioportal.org ok?
Start here: https://developer.chrome.com/docs/ai/built-in
[ thank you Praful ! ]
Start here: https://huggingface.co/models?pipeline_tag=feature-extraction&library=transformers.js&sort=trending&search=xenova
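const transformers = await import('https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.4.0/+esm')
// "task" and "model" are placeholders, e.g. the "feature-extraction" task with a model id from the list above
const pipe = await transformers.pipeline("task","model")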
etc
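One way the "etc" might continue, as a hedged sketch: embed a few texts with a feature-extraction pipeline and compare them by cosine similarity. The model id `Xenova/all-MiniLM-L6-v2` is an assumed pick from the Hub list linked above, the example texts are made up, and `embed`, `cosineSimilarity`, and `demo` are hypothetical helper names. The network-dependent part is wrapped in `demo()` so nothing is downloaded until it is called.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Embed a list of texts. `pipe` is assumed to be a transformers.js
// feature-extraction pipeline; mean pooling gives one vector per text.
async function embed(pipe, texts) {
  const output = await pipe(texts, { pooling: "mean", normalize: true });
  return output.tolist(); // nested array: one embedding per input text
}

// Browser usage (needs network access, so it is not invoked here).
async function demo() {
  const transformers = await import(
    "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.4.0/+esm"
  );
  const pipe = await transformers.pipeline(
    "feature-extraction",
    "Xenova/all-MiniLM-L6-v2" // assumed model choice
  );
  const [a, b] = await embed(pipe, [
    "invasive ductal carcinoma",
    "ductal carcinoma in situ",
  ]);
  console.log(cosineSimilarity(a, b));
}
```

Calling `demo()` from the browser console downloads the model on first use, so the first call may be slow.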