Glossary

ABC - Adam, Bob, Chris - the three PIs, also often referred to by Roy via their initials
APA - Adam P. Arkin (Roy often refers to Adam by his initials)
Bench science/biologist - experimental work(er) in scientific research (i.e. doing science on a workbench in a lab rather than purely analytical/theoretical work)
BER - DOE’s (Department of Energy) office of Biological and Environmental Research - main funders of KBase
ENIGMA - Ecosystems and Networks Integrated with Genes and Molecular Assemblies. One of Adam Arkin’s other projects, focused on advancing understanding of microbial biology and the impact of microbial communities on their ecosystems. It is a multi-institutional consortium funded by the U.S. Department of Energy (DOE) through its Scientific Focus Area (SFA) grant program and managed by DOE’s Lawrence Berkeley National Laboratory (Berkeley Lab). https://enigma.lbl.gov/
GSP - [From Bob] the DOE Genomic Sciences Program Annual Principal Investigator Meeting - Every year the agenda of this conference includes presentations from the leaders of the user facilities that the DOE Genomic Sciences Program supports, so we think of this as an important milestone of our progress.
- yearly event in February where KBase presents their work, connects with users
- Check out the abstracts from what they presented in the past: https://genomicscience.energy.gov/pubs/awardeeworkshops.shtml
KBase Leads - Roy, Elisha, Shane, Paramvir, Steve
KE - Knowledge Engine (See RESKE)
MAG - Metagenome-assembled genome (see KBase Science Refresher/Study Guide)
- MAG is a data type. Metagenome Assembled Genome, as in JGI MAGs:
- Janaka: An annotated bin
- Ex: Sean J: Leveraging the power of sequencing from mixed populations is this new data type
- Zach clarification: “Metagenomics data” would be the data that the MAGs are extracted from.
PIs - Principal Investigators. In Canada and the United States, the term principal investigator (PI) refers to the holder of an independent grant and the lead researcher for the grant project, usually in the sciences, such as a laboratory study or a clinical trial.
- KBase PIs: Adam, Chris, Bob
PD - product description, the document currently used by KBase staff in order to detail a proposal for something they want to build. Example for KBase concierge: https://docs.google.com/document/d/1MLVJuGM78f7D7MNsQ1zmERvuCGCyEZcqEMvuzscrO_4/edit
RE - Relation Engine (see RESKE)
RESKE - Relation Engine, Search, Knowledge Engine. Combined services which are a major roadmap goal for KBase. The Relation Engine holds "precomputed" relationships among objects on the system. This relationship mapping supports search of those objects, which in turn supports the Knowledge Engine, which uses the relationships to make predictions.
- All hands presentation notes where Adam talks more about these concepts: https://docs.google.com/document/d/1ir8L_WILIQnPnwyCQjvv7QmZMP7IMHvCDpYhiwRl_pU/edit
RSV - Reverse Site Visit - yearly (?) event in September where BER reviews KBase’s progress, gives feedback, approves further funding
- See documentation section for proposals and responses between BER and KBase
- This directly informs their roadmap
- RSV: [from Bob] Every 3 years KBase is formally reviewed based on a written proposal we will submit in July, and a subsequent in-person, day & a half series of presentations and Q&A before a panel of reviewers selected by DOE. This latter event we refer to as a Reverse Site Visit or RSV since it is held in DC in September, not at any of our sites. [But this year it was moved to Feb 2021 due to COVID]
- We will have two presentations this month from our fearless leaders related to our plans and preparation for the RSV.
S&P - Shane & Paramvir, who run software dev
SAC - Scientific Advisory Committee
SFA - Science Focus Area

File types

Assembly: (1) generically, raw sequence data, often non-contiguous, from a sequencing machine; (2) if created in KBase, a set of reads contiguously assembled via a program that may also store quality reads from fastq files. In simpler terms, according to Steve: the DNA is sliced into tiny bits in a chemical process that shreds it, then a machine looks at the shreds and tells you what the sequence of each one is; then an assembler has to reassemble the bits. It's like shredding a book and then trying to guess what the page numbers were and put it back together. After importing, multiple assembly files can be combined via an app into an AssemblySet.
- FASTA (fa, faa, fas, fasta, fna)
  - faa: contains amino acid sequence (protein, peptide)
  - fna: contains nucleotide sequence (DNA)
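To make the FASTA layout concrete, here is a minimal Python sketch of reading one. It is not how KBase imports files; it only assumes the usual convention that a record starts with a ">" header line followed by one or more sequence lines. The record names and the nucleotide check are made up for illustration.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Return {record_id: sequence} from FASTA-formatted text."""
    records: Dict[str, str] = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: use the first whitespace-delimited token as the ID.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so append to the current record.
            records[current_id] += line
    return records


def looks_like_nucleotide(seq: str) -> bool:
    """Rough heuristic (an assumption): only A, C, G, T, U, N characters."""
    return bool(seq) and set(seq.upper()) <= set("ACGTUN")


# Hypothetical example: one nucleotide record (fna-style), one protein (faa-style).
example = """>contig_1 made-up example
ACGTACGT
ACGT
>protein_1
MKVLAAGIV
"""
records = parse_fasta(example)
for name, seq in records.items():
    kind = "nucleotide" if looks_like_nucleotide(seq) else "amino acid"
    print(name, len(seq), kind)
```

The heuristic is only there to show the faa/fna difference; in practice the file extension is the usual signal.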
Reads: raw sequence and quality score data, often non-contiguous, from a sequencing machine; one such file may contain millions of reads. Depending on the use case, an import cell could take one single interleaved file or one forward and one reverse file.
- FASTQ (fastq, fastq.gz, fnq, fq)
- SRA (sra)
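The difference between an interleaved file and a forward/reverse pair can be sketched in Python as follows. This assumes the simple FASTQ layout of four lines per read (header starting with "@", sequence, a "+" line, quality string) and that an interleaved file alternates forward and reverse reads. The read names are hypothetical, and this is not the KBase importer.

```python
from typing import List, Tuple

Record = Tuple[str, str, str]  # (read name, sequence, quality string)


def parse_fastq(text: str) -> List[Record]:
    """Parse unwrapped, 4-line-per-record FASTQ text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected 4 lines per FASTQ record")
    records: List[Record] = []
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            # i counts non-blank lines only.
            raise ValueError(f"malformed record at non-blank line {i + 1}")
        records.append((header[1:], seq, qual))
    return records


def split_interleaved(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Assume reads alternate forward, reverse, forward, reverse, ..."""
    return records[0::2], records[1::2]


# Hypothetical interleaved input with two read pairs.
interleaved = """@pair1/1
ACGT
+
IIII
@pair1/2
TGCA
+
IIII
@pair2/1
GGCC
+
IIII
@pair2/2
CCGG
+
IIII
"""
forward, reverse = split_interleaved(parse_fastq(interleaved))
print([name for name, _, _ in forward])
print([name for name, _, _ in reverse])
```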
Annotated genomes: assembly/reads file annotated with the positions of features of interest (genes mostly)
- GenBank Genome (gb, gbk, gbff, gpff): a more detailed description of sequences and the genes encoded in these
  - gbff: contains genome sequence and annotation
  - gpff: contains protein sequence and annotation
- GFF [meta] genome (gff): standardized, tab-delimited format for [meta] genome annotations. This must be paired with a FASTA assembly file when importing to a narrative.
- ContigSet is an older version of this type
Expression Matrix: gene expression values taken under given sampling conditions.
- tsv: tab delimited text file that has genes across the rows and sample observations across the columns
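As a rough illustration of that layout, the Python sketch below reads a small tab-delimited matrix with genes as rows and samples as columns. It assumes the first row is a header naming the samples and the first column holds gene IDs; the gene and sample names are invented, and the exact header KBase expects is not specified here.

```python
import csv
import io
from typing import Dict, List, Tuple


def read_expression_matrix(text: str) -> Tuple[List[str], Dict[str, Dict[str, float]]]:
    """Return (sample names, {gene: {sample: value}}) from TSV text."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    samples = header[1:]  # first column is assumed to hold gene IDs
    matrix: Dict[str, Dict[str, float]] = {}
    for row in reader:
        if not row:
            continue
        gene, values = row[0], row[1:]
        matrix[gene] = {s: float(v) for s, v in zip(samples, values)}
    return samples, matrix


# Hypothetical two-gene, two-sample matrix.
example_tsv = (
    "gene_id\tsample_A\tsample_B\n"
    "geneX\t1.5\t2.0\n"
    "geneY\t0.0\t3.25\n"
)
samples, matrix = read_expression_matrix(example_tsv)
print(samples)
print(matrix["geneX"]["sample_B"])
```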
Media: set of chemical compounds and reactions an organism can use for its growth
- tsv/xls: tab delimited text file with four columns; if using Excel, the worksheet must be named "MediaCompounds"
Flux Balance Analysis (FBA) Model: computational models of biological processes
- Systems Biology Markup Language (sbml, xml)
Phenotype Set: experimental data about an organism’s ability to grow on a specific media condition, recorded as either growth or no growth
- tsv/xls/tab: must have five columns
Sample Set: group of RNA-seq reads with associated experimental metadata for running RNA-seq apps in batch mode. This requires a lot of metadata to import.

Data Dictionary

"finish year" field = the year in which the app run occurred, as used in Jason B.'s report.

Context: I (Jason Baumohl) have a start_date and finish_date for each app run. So I chose the finish date.

"Total User Accounts and Retention" tab = retention here simply means that users logged into the system again at some point after signing up.

Data Fields

days_since_last_sign_in = tells you whether they have been active recently

days_signin_minus_signup = will tell you how long they stayed with KBase

From Jason B: So ideally you want users with a low days_since_last_sign_in and a high days_signin_minus_signup.
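A small Python sketch of how these two fields could be derived, for intuition only. It assumes days_since_last_sign_in is the report date minus the last sign-in date, and days_signin_minus_signup is the last sign-in date minus the signup date. The function name and the dates are hypothetical and are not taken from Jason's actual report code.

```python
from datetime import date
from typing import Dict


def retention_fields(signup: date, last_signin: date, today: date) -> Dict[str, int]:
    """Compute both fields under the assumptions stated above."""
    return {
        # Low value = the user has been active recently.
        "days_since_last_sign_in": (today - last_signin).days,
        # High value = the user kept coming back long after signing up.
        "days_signin_minus_signup": (last_signin - signup).days,
    }


# Hypothetical user: signed up on 1 Jan 2020, last seen 31 Dec 2020,
# with the report run on 10 Jan 2021.
fields = retention_fields(
    signup=date(2020, 1, 1),
    last_signin=date(2020, 12, 31),
    today=date(2021, 1, 10),
)
print(fields)
```

This made-up user is the ideal case: a low days_since_last_sign_in and a high days_signin_minus_signup.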
