DOING MAPS IN STATA
https://medium.com/the-stata-guide/maps-in-stata-iii-geoplot-a764cf42688a
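*a minimal sketch of geoplot itself, adapted from memory of the geoplot help-file example (the italy-regions files and the fortell variable are Ben Jann's example data on the Boston College server; treat the exact names as assumptions and check help geoplot)
ssc install geoplot, replace
ssc install moremata, replace
local url http://fmwww.bc.edu/repec/bocode/i/
geoframe create regions `url'italy-regions.dta, replace shp(`url'italy-regions_shp.dta)
geoplot (area regions fortell) //choropleth of the fortell variable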
REGULAR SYNTAX TO SHIFT GRAPH LEGENDS TO THE BOTTOM (6 o'clock) WITH 1 ROW
legend(rows(1) position(6))
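*for example, applied to a full graph command (auto data used only for illustration):
sysuse auto, clear
twoway (scatter price mpg) (lfit price mpg), legend(rows(1) position(6))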
STRIPPLOT GENERAL SYNTAX
*stripplot is a community-contributed command: ssc install stripplot
stripplot soldprice, over(city) box iqr vertical xla(, ang(45)) stack msize(0.4pt) msymbol(smcircle) h(0) jitter(8)
*the h(#) option shifts the points to the right; without jitter, it makes the points look like a histogram
DISPLAYING N OF OBSERVATION PER GROUP ON THE STRIPPLOT
gen where = 0.4 //create a y-axis position where the n of observations will be displayed
egen count = count(probability41enc) if n_vaccines==0, by(lineage cleavage41enc) //create a variable holding the n of observations, matching the over() and by() of the stripplot below
stripplot probability41 if n_vaccines==0, over(cleavage41enc) by(lineage) box iqr vertical xla(, ang(45)) stack msize(1pt) msymbol(square) h(0.5) ///
addplot(scatter where cleavage41enc, mla(count) ms(none) mlabpos(0) mlabsize(small)) //add a layer to the stripplot on the where coordinates with the count values
CREATING MACRO FOR CURRENT DATE (USEFUL FOR SAVING FILES)
local suffix: display %tdDmCY =daily("`c(current_date)'", "DMY")
di "`suffix'"
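*for example, to stamp an output file with today's date (hypothetical file name):
save "results_`suffix'.dta", replace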
EXTRACTING CONTINUOUS QUARTER (QUARTERLY)
gen quarter = quarter(date)
levelsof year
gen quarterly = .
foreach lev in `r(levels)' {
replace quarterly = quarter + ((year-FIRST_YEAR_OF_THE_SERIES)*4) if year == `lev'
}
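*a shorter alternative sketch using Stata's built-in quarterly dates (the numbering runs from 1960q1 instead of the first year of the series, but it is equally continuous; assumes date is a daily %td variable):
gen quarterly2 = qofd(date)
format quarterly2 %tq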
CREATING INDICATOR OF ADVANCE IN A LOOP
count
local obs = r(N)
local indicator = 1
foreach lev in `r(levels)' {
whatever command
di "`indicator' of `obs'"
local indicator = `indicator'+1
}
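*a self-contained sketch with the auto data; note that looping over `r(levels)' requires a prior levelsof, and here the total shown is the number of iterations (levels) rather than r(N):
sysuse auto, clear
levelsof rep78
local obs : word count `r(levels)'
local indicator = 1
foreach lev in `r(levels)' {
summarize price if rep78 == `lev'
di "`indicator' of `obs'"
local indicator = `indicator' + 1
}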
AVOIDING GRAY LINES AROUND GRAPHS
graphregion(color(white) fcolor(white) ifcolor(white))
BOXES FOR SYMBOLS IN GRAPHS
symy(3) symx(4)
SIZES OF LEGENDS
legend(on symxsize(4) symysize(4))
RETRIEVING AND USING STORED RESULTS
count
return list //try running after commands other than count to see if they have stored results
local number = r(N)
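*for example, summarize also leaves results in r() (run return list after it to see them):
sysuse auto, clear
summarize price
local mean = r(mean)
gen price_centered = price - `mean'
di "mean price = `mean'"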
CREATING BOOKMARKS ON DO FILE
**# FIRST LEVEL BOOKMARK
**## SECOND LEVEL BOOKMARK
**### THIRD LEVEL BOOKMARK, AND SO ON
*A SELECTABLE LIST OF BOOKMARKS SHOWS UP IN THE BOTTOM-LEFT CORNER OF THE DO-FILE EDITOR
LOOPING OVER ALL VARIABLES
foreach v of varlist var* {
do something if `v'
}
LOOPING THROUGH A LIST OF VARIABLES
foreach v of varlist xxx - yyy {
do something if `v'
}
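*a concrete sketch of both looping patterns using the auto data (standardizing every variable from price through weight):
sysuse auto, clear
foreach v of varlist price - weight {
summarize `v'
gen z_`v' = (`v' - r(mean)) / r(sd)
}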
LOOPING OVER ALL VALUES OF A VARIABLE
levelsof id
foreach lev in `r(levels)' {
gen test`lev' = 1 if id == `lev'
}
DIRECTLY IMPORTING DATA FROM THE INTERNET - SEVERAL DIFFERENT APPROACHES THAT *MAY* WORK
gen strL s = fileread("LINK_TO_WEBPAGE") //strL avoids truncating the page contents
readhtmltable LINK_TO_WEBPAGE // notice no quotation marks; readhtmltable and readhtmllist are community-contributed (findit readhtml)
readhtmllist LINK_TO_WEBPAGE // notice no quotation marks
clear
import delimited "LINK_TO_WEBPAGE.txt" // good for txt, csv, tsv... files
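*another approach that may work: download the file to disk first with Stata's copy command, then import the local copy (local_copy.csv is a hypothetical name):
copy "LINK_TO_WEBPAGE.csv" "local_copy.csv", replace
import delimited "local_copy.csv", clear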
DOING AN ACTION TO ALL FILES OF CERTAIN EXTENSION IN A FOLDER
TRANSFORMING ALL CSV INTO DTA
local files : dir "PATH_TO_FOLDER" files "*.csv"
cd "PATH_TO_FOLDER" //dir returns file names without the path, so work from that folder
foreach file in `files' {
clear
import delimited "`file'"
save "`file'.dta", replace //note this saves as NAME.csv.dta
}
APPENDING ALL DTA
local files : dir "PATH_TO_FOLDER" files "*.dta"
clear //start from an empty dataset so only the listed files end up appended
foreach file in `files' {
append using "`file'"
capture gen file = "" //capture so the variable is only created on the first pass
capture replace file = "`file'" if file == "" //tag each observation with its source file
}
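*a sketch that skips the intermediate .dta files and builds the combined dataset in one pass, assuming the csv files live in PATH_TO_FOLDER (a tempfile holds the running result):
local files : dir "PATH_TO_FOLDER" files "*.csv"
clear
tempfile building
save `building', emptyok
foreach file in `files' {
import delimited "PATH_TO_FOLDER/`file'", clear
gen file = "`file'" //tag each observation with its source file
append using `building'
save `building', replace
}
use `building', clear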
OPENING FASTA FILES ON STATA
import delimited "PATH_TO_YOUR_FASTA_FILE\FILENAME.fasta", delimiter(whitespace)
rename v1 fastaid //give whatever name you want
gen trim_ntseq = fastaid[_n+1] //in FASTA the sequence sits on the line after its ID
gen char = strlen(fastaid)
drop if char > 20 //depends on how long your sequence IDs actually are; run tab char before dropping to identify a good cutoff
compress
drop char
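*FASTA ID lines start with ">"; a small optional step to strip it once the IDs are isolated in fastaid:
replace fastaid = subinstr(fastaid, ">", "", 1)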
ASSIGNING FIRST OBSERVATION AS VAR NAMES
ds // https://www.stata.com/manuals/dds.pdf for options
capture renvars `r(varlist)' , map(strtoname(@[1])) //renvars is a community-contributed command (findit renvars)
drop in 1
CREATE AND ERASE DIRECTORIES
mkdir "PATH\NAME_NEW_DIR"
rmdir "PATH\NAME_DIR" //rmdir removes (empty) directories; erase deletes individual files
USING STOPBOXES AND IF COMMAND (NOT IF QUALIFIER)
*CREATE A POOR-LENGTH WARNING - THIS IS JUST AN EXAMPLE
count if (char!=603 & char!=606)
local i = r(N)
if `i' != 0 {
capture window stopbox rusure "There are `i' sequences with inadequate length. They may or may not be typed. Do you wish to continue?"
if _rc == 1 { //clicking No makes the stopbox return error code 1, which capture traps
exit
}
}
TODAY DATE
gen date = date("$S_DATE","DMY")
COMPARING MULTIPLE REGRESSION MODELS
logistic outbreak var1 var2 var3##var4 var5##var6 var7##var8 var9, base or
estimates store test1
logistic outbreak var1 var2 var3 var4 var5##var6 var7##var8 var9, base or
estimates store test2
(...)
estimates table test1 test2, eform base star stats(aic bic N r2_p)
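*if two of the stored models are nested and fit on the same observations, a likelihood-ratio test can complement the AIC/BIC comparison; estimates stats prints the information criteria on their own:
lrtest test1 test2
estimates stats test1 test2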
DESCRIBE ALL CHARACTERS PRESENT ON A STRING VARIABLE
charlist var
*you may need to install charlist first: findit charlist
F1 SCORE IN STATA
logistic OUTCOME PREDICTORS //run your logistic model first
estat classification, cutoff(0.5)
di 2*((r(P_1p)*r(P_p1))/(r(P_1p)+r(P_p1))) //r(P_1p) is the PPV and r(P_p1) the sensitivity, both on a 0-100 scale, so this F1 is also 0-100
USING FRAMES IN STATA
frame create xxx //xxx is the name of the new frame (default frame is called "default")
frame change xxx
frame drop xxx
sysuse auto.dta
logistic foreign price mpg headroom trunk, base or
frame create f1 cutoff f1score sens esp ppv npv accuracy
forvalues i = 0(0.01)1{
estat classification, cutoff(`i')
local f = 2*((r(P_1p)*r(P_p1))/(r(P_1p)+r(P_p1)))
frame post f1 (`i') (`f') (r(P_p1)) (r(P_n0)) (r(P_1p)) (r(P_0n)) (r(P_corr))
}
frame change f1
twoway (line f1score cutoff) (line accuracy cutoff), ylabel(0(10)100) xlabel(0(0.1)1) xmtick(0(0.01)1)
frame change default
CLASSIFICATION (CATEGORICAL) RANDOM FOREST IN STATA
SOURCE:
https://www.stata.com/meeting/canada18/slides/canada18_Zou.pdf
FOR REGRESSION (CONTINUOUS) RANDOM FOREST IN STATA, CHECK LINK ABOVE.
*REQUIRES THE COMMUNITY-CONTRIBUTED rforest PACKAGE: ssc install rforest
*DOWNLOAD THE FOLLOWING FILE TO YOUR COMPUTER:
https://drive.google.com/file/d/1HATpqf-tQ_3l0DOUdVmhRVH8_vbIydXr/view?usp=share_link
clear
import delimited "PATH TO FILE YOU JUST DOWNLOADED"
set seed 201807
*discarding obs to make this run faster; in a real scenario, use all the data
gen discard = uniform()
sort discard
gen n = _n
drop if n > 1000
*create dummy variables for MARRIAGE
tab MARRIAGE, gen(m)
*interesting setup that records the errors for each value of iter() (the number of trees)
gen u=uniform()
sort u
gen out_of_bag_error1 = .
gen validation_error = .
gen iter1 = .
local j = 0
forvalues i = 10(5)500{
local j = `j' + 1
rforest defaultpaymentnextmonth LIMIT_BAL SEX ///
EDUCATION m* AGE PAY* BILL* in 1/500, ///
type(class) iter(`i') numvars(1)
replace iter1 = `i' in `j'
replace out_of_bag_error1 = `e(OOB_Error)' in `j'
predict p in 501/1000
replace validation_error = `e(error_rate)' in `j'
drop p
display `i'
}
twoway (scatter out_of_bag_error1 iter1) (scatter validation_error iter1)
*500 was chosen as iter number because graph above shows that's more or less where error levels off and stabilizes
*same setup, now recording the errors for each value of numvars() (the number of variables randomly sampled at each split)
gen oob_error = .
gen nvars = .
gen val_error = .
local j = 0
forvalues i = 1(1)26{
local j = `j' + 1
rforest defaultpaymentnextmonth LIMIT_BAL SEX ///
EDUCATION m* AGE PAY* BILL* ///
in 1/500, type(class) iter(500) numvars(`i')
replace nvars = `i' in `j'
replace oob_error = `e(OOB_Error)' in `j' //gets the OOB error from the rforest command (run only on the training part of the data)
predict p in 501/1000
replace val_error = `e(error_rate)' in `j' //gets the error_rate from the predict command (run only on the test part of the data)
drop p
display `i'
}
twoway (scatter oob_error nvars) (scatter val_error nvars)
*12 was chosen as the final numvars because the graph above shows that's where the validation error is lowest
*Finalized model is:
rforest defaultpaymentnextmonth LIMIT_BAL SEX ///
EDUCATION m* AGE PAY* BILL* ///
in 1/500, type(class) iter(1000) numvars(12)
*Compute expected classes of variable defaultpaymentnextmonth
predict p1
*Compute expected class probabilities of variable defaultpaymentnextmonth
predict c1 c2, pr
*VARIABLE IMPORTANCE PLOT
matrix importance = e(importance)
*convert matrix into variable
svmat importance
list importance in 1/5
gen id=""
local mynames : rownames importance
local k : word count `mynames'
// If there are more variables than observations
if `k'>_N {
set obs `k'
}
forvalues i = 1(1)`k' {
local aword : word `i' of `mynames'
local alabel : variable label `aword'
if ("`alabel'"!="") qui replace id= "`alabel'" in `i'
else qui replace id= "`aword'" in `i'
}
graph hbar (mean) importance, over(id, sort(1)) ytitle(Importance)
BEAST ANALYSIS
*BEAST CONFIGURATIONS (MCC TREE AND OTHER BAYESIAN ANALYSIS)
{
*CREATE xml FOR BEAST RUN
*RUN BEAUTI
*DRAG ALIGNED AND TRIMMED FASTA FILE TO IT
*"TIPS"
*CHECK USE TIP DATES
*CLICK PARSE DATES (DEFINED BY A PREFIX AND ITS ORDER, ORDER LAST, PREFIX _ )
*GUARANTEE THAT DATES ARE BEING PARSED CORRECTLY. IT SHOULD BE ID.L_YYYY-MM-DD
*"SITES"
*CHANGE SUBSTITUTION MODEL FROM HKY TO GTR
*CHANGE SITE HETEROGENEITY MODEL FROM NONE TO GAMMA+INVARIANT
*CHANGE PARTITION INTO CODON TO "3 PARTITIONS: POSITIONS 1, 2, 3"
*UNCHECK "UNLINK SUBSTITUTION RATE PARAMETERS ACROSS CODON POSITION"
*"CLOCKS"
*CHANGE FROM STRICT CLOCK TO UNCORRELATED RELAXED CLOCK (THE DECISION OF WHICH TO USE SHOULD BE BASED ON TEMPEST)
******************SON'S SUGGESTION - SEE THE DISTRIBUTION IN THE PAPER ON BEAST WITH PRRS - CHECK WHICH RELAXED DISTRIBUTION WAS USED
*"TREES"
******************SON'S SUGGESTION - TRY SKYRIDE ON ONE RUN JUST TO SEE WHAT HAPPENS, SINCE ITS CONFIGURATION IS EASIER
*CHANGE TREE PRIOR FROM "CONSTANT SIZE" TO "BAYESIAN SKYGRID"
*TIME AT LAST TRANSITION POINT SHOULD BE SCALED TO REFLECT TIME ELAPSED SINCE ROOT.
*TAKE ALL SEQS FROM ONE LINEAGE/SUB-LINEAGE AND CONSTRUCT A MAXIMUM LIKELIHOOD TREE (100 BOOTSTRAPS) FOR IT. ENTER IT IN TEMPEST AND SEE THE TIME OF THE MOST RECENT COMMON ANCESTOR
*CALCULATE THE TIME DIFFERENCE BETWEEN THE LAST SAMPLE AND THE MRCA - FOR MOST PRRS LINEAGES IT SHOULD BE ~20-30
*THIS WILL BE FIXED FOR ALL SAMPLES OF A GIVEN LINEAGE
***-----------> SEE RANGE ON TEMPEST. DENNIS SUGGESTION. RANGE ON TEMPEST SHOULD SAY WHAT THAT VALUE SHOULD BE
*"MCMC"
*CHANGE LENGTH OF CHAIN FROM 10 MILLION TO 200 MILLION
*SAVE
*RUN BEAST
*CIPRES
*UPLOAD XML CREATED BY BEAUTI
*BEAST on XSEDE
*LEAVE ALL PARAMETERS DEFAULT, EXCEPT TIME (CHANGE FROM 0.5 HOURS TO 100 HOURS)
*RUN TRACER
*DRAG "LOG" FILE CREATED BY BEAST
*SEE IF PARAMETERS ARE GOOD
*BURNIN SHOULD INCLUDE NOISY PORTION
*BURNIN FRACTION CAN BE INCREASED ABOVE ON TOP LEFT
*ESS (EFFECTIVE SAMPLE SIZE) SHOULD BE >150 OR 200 FOR ALL PARAMETERS. IF NOT, MORE REPETITIONS ARE NEEDED
*WHEN CREATING THE SKYGRID PLOT, AGE OF YOUNGEST TIP SHOULD BE THE YEAR OF THE MOST RECENT SEQUENCE IN THAT PARTICULAR RUN
**ESTIMATING MOLECULAR CLOCK
*TEMPEST REQUIRES NEWICK TREE FILE, ONE PER LINEAGE/SL
*OPEN MEGA
*OPEN FASTA FILE, ANALYZE, NUCLEOTIDE, PROTEIN CODING, STANDARD
*PHYLOGENY, MAXIMUM LIKELIHOOD TREE, NUMBER OF BOOTSTRAPS IDEALLY 1000 (BUT TAKES FOREVER). CHANGE TO 100
*RUN
*SAVE AS NEWICK (INCLUDE BRANCH LENGTH AND BOOTSTRAP VALUES) - FILE SHOULD HAVE NWK EXTENSION
*SAVE MEGA SESSION (FOR EASE OF MIND IN CASE THIS NEEDS TO BE REVIEWED) - FILE SHOULD HAVE MTSX EXTENSION
*NWK CAN BE OPENED ON FIGTREE OR TEMPEST
*OPEN TEMPEST
*SELECT NWK FILE CREATED ABOVE
*ON WINDOW "SAMPLE DATES", MODIFY DATE SPECIFIED AS YEARS TO DAYS
*CLICK PARSE DATES (DEFINED BY A PREFIX AND ITS ORDER, ORDER LAST, PREFIX _ )
*CHECK TREE WINDOW FOR TREE
*CHECK ROOT-TO-TIP WINDOW FOR OUTLIERS (CONSIDER REMOVING THOSE?)
*CHECK RESIDUALS WINDOW FOR OUTLIERS (CONSIDER REMOVING THOSE?)
*CHECK THE "BEST-FITTING ROOT" OPTION ON THE TOP-LEFT CORNER AND LOOK AT THE R-SQUARED VALUE - R2 > 0.3 INDICATES SOMEWHAT STRONG EVIDENCE OF A MOLECULAR CLOCK
*TO CREATE TREES OF EACH LINEAGE/SUBLINEAGE, SHOULD USE RAXML
*USING THE RAND1, RAND2, RAND3 FASTA FILES FOR EACH L/SL, GO TO CIPRES, CREATE THE APPROPRIATE FOLDERS, AND UPLOAD THE FASTA
*CHOOSE "RAXML HPC 2"
*ON PARAMETERS, SET TIME TO 100 HOURS
*LOGCOMBINER
*UPLOAD EACH LOG FILE OF EACH RAND RUN FOR EACH SL INTO LOGCOMBINER AND COMBINE THEM. NO MYSTERY.
*TRACER
*READ COMBINED LOG FILE
*ENSURE ESS ARE ALL ABOVE 150 (IF NOT, RERUN BEAST WITH MORE REPETITIONS)
*GO TO ANALYSIS TAB, SKYGRID. ENTER THE LAST YEAR IN WHICH A SEQ WAS IDENTIFIED
*TO OBTAIN SUBSTITUTION RATES
*RUN BEAST. UPLOAD LOG FILES TO TRACER
*SEE "MEAN RATE" (ESS SHOULD BE >150, IF NOT, INCREASE CHAINS AND REPEAT BEAST RUN)
*TO OBTAIN SUBSTITUTION RATES FOR EACH BRANCH
*MAKE MCC TREE (SEE BELOW). UPLOAD MCC TREE TO FIGTREE
*GO TO NODE LABELS, DISPLAY (AND COLOR) BY RATE (OR RATE 95% HPD)
*TO MAKE MCC TREE
*RUN BEAST. UPLOAD TREES FILE TO TREEANNOTATOR
*SPECIFY BURNIN (LOOK ON TRACER HOW MUCH BURN IN IS GOOD)
*SELECT TARGET TREE TYPE AS MCC
*IF TREEANNOTATOR DOESN'T FINISH, DIMINISH TREES FILE SIZE ON LOGCOMBINER (RESAMPLE AT LOWER FREQUENCY)
*SAVE FILE AS .TREE EXTENSION
*TO VISUALIZE MCC TREE, OPEN TXT ON FIGTREE OR R AND...
*library(ggtree); library(treeio) #read.beast() comes from treeio, ggtree() from ggtree
*tree<-read.beast("D:/Igor/Sublineages BEAST CIPRES run aug2020/L1 only rand1/L1 rand1 treeannotator.tree")#for single tree
*trx <- ggtree(tree, mrsd="2018-12-31",right=F,lwd = 0.1)+
*geom_tippoint(size=3, alpha=.75) +
*theme_tree2()
*# +theme(panel.grid.major = element_line(color="black", size=.2),
*# panel.grid.minor = element_line(color="grey", size=.2),
*# panel.grid.major.y = element_blank(),
*# panel.grid.minor.y = element_blank())+
*# theme(axis.text.x = element_text( size=14,color="black"))
*trx
*lineage <- read.csv("D:/Igor/Sublineages BEAST CIPRES run aug2020/L1 only rand1/MCC tree L1 rand1.csv")
*lineage$newsublinstring <- as.factor(lineage$newsublinstring)
*trx<-ggtree(tree, mrsd="2018-12-31",right=F,lwd = 0.8) %<+% lineage +
*theme_tree2()+
*geom_tippoint(aes(color=newsublinstring),size=4)+
*scale_color_manual(values=c("#00ffff","#c10534","#01b001","#ffd200","#800080","#ff0000","#938dd2","#d7d29e","#d9c263"),
* breaks = c("L1A","L1B","L1C","L1Dalpha","L1E","L1F","L1G","L1H","L1Dbeta"))
*trx
}