SAS for R users
September 2018
My learnings of SAS
Why should I learn SAS?
SAS is legacy software in many industries (40+ years!). Its point-and-click interface makes it fairly easy for newcomers to pick up, and its 'business analytics' side makes it attractive to industry. It also integrates fairly well with databases.
Can I learn SAS?
SAS is proprietary, but you can get a university version, which I've used for these notes. However, it's not really built to work on a Mac.
SAS
I will use up some of my LinkedIn Premium features and take a SAS course on there https://www.linkedin.com/learning/sas-programming-for-r-users-part-1/introduction-to-sas-and-sas-studio. I will skip over the R material and take notes on the SAS material, as it is not easy to call R from SAS on my Mac (http://support.sas.com/documentation/cdl/en/imlug/64248/HTML/default/viewer.htm#imlug_r_sect003.htm)
Base SAS - Built in functions
SAS/ACCESS - Reading in data
SAS/STAT - Analytic models
SAS/IML - Interactive matrix language
SAS/ETS - Time series, forecasting
SAS Studio has different windows:
- code editor/work area
- navigation pane
Store data in Libraries permanently. WORK library is temporary.
Introduction and Working in SAS
Working in SAS Studio
Click on Libraries -> SASHELP -> CARS (double click)
Opens the table in a new tab. The blue arrow goes to the next page.
Change columns shown by using the tick boxes.
Right click a column header to sort the data by that column (e.g. ascending) or to filter it (e.g. Invoice >= 30000). To remove the filter click on the 'x' in the View tab. To view the code for this action click on 'Display the query that creates the current table'. There are three procedures (a SQL procedure, a DATASETS procedure and a PRINT procedure):
PROC SQL;
CREATE TABLE WORK.query AS
SELECT Make , DriveTrain , Invoice , Cylinders FROM SASHELP.CARS WHERE Invoice>=30000;
RUN;
QUIT;
PROC DATASETS NOLIST NODETAILS;
CONTENTS DATA=WORK.query OUT=WORK.details;
RUN;
PROC PRINT DATA=WORK.details;
RUN;
Writing a program
Click on 'New Options' (seven dots button in top right). You apply a procedure to a data table. Type PROC PRINT and the help text box will pop up. If you don't want this, you can turn it off by clicking 'More application options' (three horizontal lines button in top right) -> Preferences -> Editor -> untick 'Enable autocomplete' -> Save. You can also right click a keyword, e.g. PRINT, and click 'Syntax Help'. Click the running man button to run. Check the LOG tab for errors. Click on a Note and it will take you to that line. You can click on 'open in a new browser tab' to see the table more clearly.
Every statement must end with a ;
* Print the CARS data table;
*PROC PRINT DATA=SASHELP.CARS;
*RUN;
* Prints the entire data table;
/*
* Print certain columns in the CARS data table.
*/
PROC PRINT DATA=SASHELP.CARS;
* VAR is additional arguments to the PROC;
* You can choose the columns by clicking on libraries -> CARS;
* Hold control on the keyboard and click the columns of interest;
* Drag and drop to after VAR;
* Remove the 'SASHELP.CARS';
VAR Make Model MPG_City;
RUN;
Using Tasks and Snippets in SAS Studio
Tasks = Point-and-click features which generate code behind the scenes
Click 'Tasks and Utilities' -> 'Tasks' -> 'Statistics' -> 'Summary Statistics' -> 'Select a table' -> add 'Weight' to Analysis variables and this will generate some code. -> 'options' -> 'PLOTS' -> 'Histogram' and 'Add normal density curve' -> 'Run'.
ods noproctitle;
ods graphics / imagemap=on;
proc means data=SASHELP.CARS chartype mean std min max n vardef=df;
var Weight;
run;
proc univariate data=SASHELP.CARS vardef=df noprint;
var Weight;
histogram Weight / normal(noprint);
run;
Click on 'Snippets' = starter codes -> 'Snippets' -> 'Graph' -> 'Scatter Plot Matrix' -> 'Run'
Click on 'New Snippet' and copy above 'proc print' code -> 'Save' as 'Print Variables'. Will save under 'My Snippets'. Can then drag and drop it into code.
Bayesian Logistic Regression
Logistic Regression = Regression on data when the dependent variable is binary (e.g. True, False).
Bayesian = Update statistical inference (e.g. the distribution of a parameter) as more data becomes available. Here is a nice plot showing the distribution being updated.
MCMC = Markov Chain Monte Carlo. A Markov chain is a stochastic model in which the probability of each event depends only on the state attained in the previous event (here is an example). Markov Chain Monte Carlo draws samples from a Markov chain whose long-run distribution is the target probability distribution (here, the posterior), so the samples approximate that distribution once the chain has converged. You can read more about it in the SAS documentation.
The code for this is HERE
Data step is for reading in data, altering data, subsetting data.
You can highlight code just to run that part.
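As a rough illustration of the three ideas above (this is not the course code linked above), here is a minimal sketch of a Bayesian logistic regression with PROC MCMC. The data set names logit_sim and post, the parameter names beta0 and beta1, the simulated true values and the normal priors are all my own assumptions.

```sas
/* Simulate a binary outcome whose log-odds depend on x.        */
/* The true intercept (-1) and slope (2) are made-up values.    */
data logit_sim (drop=i prob);
call streaminit(123);
do i=1 to 200;
x = rand('Normal',0,1);
prob = 1/(1+exp(-(-1 + 2*x)));
y = rand('Bernoulli',prob);
output;
end;
run;

/* Sample the posterior of beta0 and beta1.                     */
/* OUTPOST= saves the posterior draws as a data set.            */
proc mcmc data=logit_sim nbi=1000 nmc=10000 seed=123 outpost=post;
parms beta0 0 beta1 0;
prior beta0 beta1 ~ normal(0, var=100);
p = logistic(beta0 + beta1*x);
model y ~ binary(p);
run;

/* Summarise the posterior draws */
proc means data=post mean std p5 p95;
var beta0 beta1;
run;
```

The posterior means should land near the simulated values if the chain has converged; the diagnostics that PROC MCMC prints are the place to check that.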
Poker Simulation
IML = Interactive Matrix Language
Texas Hold 'em:
- 9 players at a table
- Each player: 2 cards face-down
- Dealer: 5 cards face-up
The code for this is HERE
Multiple Linear Regression Power Analysis
The code for this is HERE
Calling R from SAS
The code for this is HERE
SAS libraries
The WORK library is temporary. SASHELP has sample datasets. SASUSER saves datasets you work with often.
Define your own library as:
libname SP4R "s:\workshop";
Add data sets to your library and refer to them as sp4r.frog.
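A minimal sketch of the difference between a permanent and a temporary library, assuming the s:\workshop folder exists on your machine (swap in any folder you can write to). The data set name cars is only an example.

```sas
/* Point the libref sp4r at a folder (assumed to exist) */
libname sp4r "s:\workshop";

/* Permanent copy: written to the folder and kept between sessions */
data sp4r.cars;
set sashelp.cars;
run;

/* Temporary copy: work.cars (or just cars) is deleted when the session ends */
data cars;
set sashelp.cars;
run;

/* List the data sets stored in the sp4r library */
proc contents data=sp4r._all_ nods;
run;
```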
Procedure syntax
In a PROC step: STATEMENT ... <option>;
Can save data using OUTPOST= in PROC MCMC, for example.
Click on the product e.g. SAS Analytical Products 14.1 -> What's new in SAS/STAT -> Contents -> Procedures -> The MCMC or Topics -> Bayesian analysis -> MCMC and you can see the statements. Click on 'PROPDIST' and click on Metropolis and Metropolis-Hastings Algorithms to find more about the method. Click the examples tab and click on Logistic Regression Model with a Diffuse Prior then copy and paste the example.
Importing and Reporting Data
Creating datasets
The code for this is HERE
* Save as example_data in sp4r library;
* Specify a length of 25 for the character variables;
data sp4r.example_data;
length First_Name $ 25 Last_Name $ 25;
input First_Name $ Last_Name $ age height;
datalines;
Jordan Bakerman 27 68
Bruce Wayne 35 70
Walter White 51 70
Henry Hill 65 66
JeanClaude VanDamme 55 69
;
run;
* The trailing @@ holds the line so several observations can be read from one line;
data sp4r.example_data2;
length First_Name $ 25 Last_Name $ 25;
input First_Name $ Last_Name $ age height @@;
datalines;
Jordan Bakerman 27 68 Bruce Wayne 35 70 Walter White 51 70
Henry Hill 65 66 JeanClaude VanDamme 55 69
;
run;
Importing raw data files
Can use DATA step as above or use PROC IMPORT (.csv, .xlsx)
The code for this is HERE and data is HERE
* Import data using DATA step;
data sp4r.all_names;
length First_Name $ 25 Last_Name $ 25;
infile "&path\allnames.csv" dlm=',';
input First_Name $ Last_Name $ age height;
run;
* Import data using PROC import;
* REPLACE will overwrite sp4r.baseball;
* getnames=yes will use header;
* data starts on second row;
proc import out=sp4r.baseball
datafile= "&path\baseball.csv" DBMS=CSV REPLACE;
getnames=yes;
datarow=2;
run;
* Rename the variables;
data sp4r.baseball;
set sp4r.baseball;
rename nAtBat = At_Bats
nHits = Hits
nHome = Home_Runs
nRuns = Runs
nRBI = RBIs
nError = Errors;
run;
Reporting data
See data type and length:
proc contents data=sp4r.cars varnum;
run;
See the data head(1:6):
proc print data=sp4r.cars (firstobs=1 obs=6);
run;
Print unique level
proc sql;
SELECT UNIQUE origin FROM sp4r.cars;
quit;
Use upcase to print conditionally if you don't know the case of the values.
proc print data=sp4r.cars;
var gender;
where upcase(gender)='MALE'; * The stored value is actually Male;
run;
/* ^= is not equal and = is equal. Can also use NE, GE, etc. */
/* IN tests whether a value equals one of a list: where country in ('US','CA'); keeps USA or Canada. */
/* & and | are AND and OR, and ^ is NOT: where country not in ('US','CA'); */
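A small sketch combining these operators in one WHERE statement. It uses SASHELP.CARS so it can run without the course data; the cut-off of 25 mpg is an arbitrary choice of mine.

```sas
/* Asian or European cars that are not SUVs and do at least 25 mpg in the city */
proc print data=sashelp.cars (obs=10);
var make model origin type mpg_city;
where origin in ('Asia','Europe') and type ne 'SUV' and mpg_city >= 25;
run;
```

OBS=10 is applied after the WHERE filter, so this prints the first ten matching rows.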
Change column labels and data format. The code for this is HERE
* Change FN column to first name;
proc print data=sp4r.business label;
label FN='First Name';
run;
* Change data format;
* $ = character;
* DOLLAR12.2 = show as dollars, 12 characters wide with 2 decimal places;
* MMDDYY10. = show a SAS numeric date as mm/dd/yyyy;
proc print data=sp4r.business;
format salary dollar12.2 hire_date mmddyy10.;
run;
* Can create your own format or look at the docs online;
proc format;
value $jobformat 'SR'='Sales Rep'
'SM'='Sales Manager';
value bonusformat 0='No' 1='Yes';
run;
proc print data=sp4r.business;
format job $jobformat. bonus bonusformat.;
run;
data employees;
input name $ bday :mmddyy8. @@;
datalines;
Jill 01011960 Jack 05111988 Joe 08221975
;
run;
proc print data=employees;
run;
* Now use label and format;
data employees;
input name $ bday :mmddyy8. @@;
format bday mmddyy10.;
label name="First Name" bday="Birthday";
datalines;
Jill 01011960 Jack 05111988 Joe 08221975
;
run;
proc print data=employees label;
run;
Create variables and new data
* Add two columns;
data sp4r.cars;
set sp4r.cars;
wheelbase_plus_length = wheelbase+length;
run;
* Change values conditionally;
data sp4r.cars;
set sp4r.cars;
if mpg_highway<20 then bonus=0;
else if mpg_highway<30 then bonus=1000;
else bonus=2000;
run;
* Create new character variable;
data sp4r.cars;
set sp4r.cars;
length type2 $ 25;
if type in ('Hybrid','SUV')
then type2='Family Vehicle';
else type2='Truck or Sports Vehicle';
run;
* Use a DO group when more than one variable needs to be set;
data sp4r.cars;
set sp4r.cars;
length frequency $ 12;
if mpg_highway<20 then do;
bonus=0;
frequency='No Payment';
end;
else if mpg_highway<30 then do;
bonus=1000;
frequency='One Payment';
end;
else do;
bonus=1000;
frequency='Two Payments';
end;
run;
Create and use functions
data sp4r.cars;
set sp4r.cars;
log_price = log(msrp);
run;
* Mean across columns within each row;
data sp4r.cars;
set sp4r.cars;
mean_mpg = mean(mpg_highway,mpg_city);
run;
* _NULL_ = run the step without creating a data set;
data _NULL_;
a=mean(1,2,3,4,5);
b=exp(3);
c=var(10,20,30);
d=poisson(1,2);
put a b c d; * to Log;
run;
* String functions e.g. SUBSTR, SCAN;
newstr = substr(str,length(str),1); * last character of str;
newstr = scan(str,2,','); * second word, using a comma as the delimiter;
concatstr = catx(' ',str1,str2); * join str1 and str2 with a space between;
newvar = tranwrd(var,'str ','newstr '); * replace str with newstr in the column;
Functions can be found at SAS 9.4 -> documentation by title -> functions and CALL routines -> dictionary of functions and CALL routines. e.g. look at example in FIND
Create functions in functions compiler procedure (return single values)
* Switch the order of the two parts of a name;
proc fcmp outlib=sp4r.functions.newfuncs;
function ReverseName(name $) $;
length newname $ 40;
newname=catx(' ',scan(name,2,','),scan(name,1,','));
return(newname);
endsub;
quit;
options cmplib=sp4r.functions;
data sp4r.school;
set sp4r.school;
FLName=ReverseName(name);
run;
Subset data
* Keep some variables (columns). Can also use drop=;
data sp4r.cars2 (keep=make msrp invoice);
set sp4r.cars;
run;
* Subset by row [25:50];
data sp4r.cars2;
set sp4r.cars (firstobs=25 obs=50);
run;
* Subset conditionally;
data sp4r.cars2;
set sp4r.cars;
where mpg_city > 35;
run;
* Create table;
proc sql;
create table sp4r.origin as
SELECT UNIQUE origin FROM sp4r.cars;
quit;
Concat data
* Combine rows using SET;
data a_all;
SET a1 a2;
run;
* Combine columns;
data b_all;
SET b_col1;
SET b_col2;
run;
* Use MERGE to combine data tables with different dimensions (cbind);
data c_all;
merge c_small_col c_long_col;
run;
* Can also merge on a common variable (similar to a SQL join);
* Sort the data first using PROC SORT;
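A minimal sketch of a merge on a common variable. The data sets heights and ages and their values are made up for illustration; the IN= flags are what make this behave like an inner join.

```sas
data heights;
input name $ height;
datalines;
Walter 70
Bruce 70
Henry 66
;
run;

data ages;
input name $ age;
datalines;
Jordan 27
Bruce 35
Walter 51
;
run;

/* Both tables must be sorted by the BY variable before merging */
proc sort data=heights;
by name;
run;
proc sort data=ages;
by name;
run;

/* in_h and in_a flag which table each row came from.  */
/* Keeping rows found in both gives an inner join.     */
data both;
merge heights (in=in_h) ages (in=in_a);
by name;
if in_h and in_a;
run;

proc print data=both;
run;
```

Dropping the subsetting IF keeps every name from either table (a full outer join), with missing values where a table has no match.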
DO loop
do i=2 to 10 by 2;
*do i=10 to 2 by -2;
end;
data loop;
*data loop (keep=x rep);
*data loop (drop=i);
do i=2 to 10 by 2;
x = i+1;
rep = 1;
output; * save each iteration as a row;
end;
run;
* Iterate over values in data (similar to enumerate) to append a column;
data doloop;
do i=1 to 2;
output;
end;
run;
data doloop;
set doloop;
do j=1 to 2;
output;
end;
run;
Generate random numbers
RAND('Normal',mean,std)
Do loop and random number generator. Code HERE
/*Part A*/
data sp4r.random (drop=i);
call streaminit(123);
do i=1 to 10;
rnorm = rand('Normal',20,5);
rbinom = rand('Binomial',.25,1);
runif = rand('Uniform')*10;
rexp = rand('Exponential')*5;
output;
end;
run;
proc print data=sp4r.random;
run;
/*Part B*/
data sp4r.random;
call streaminit(123);
set sp4r.random;
rgeom = rand('Geometric',.1);
run;
proc print data=sp4r.random;
run;
/*Part C*/
data sp4r.doloop (drop=j);
call streaminit(123);
do group=1 to 5;
do j=1 to 3;
rpois = rand('Poisson',25);
rbeta = rand('Beta',.5,.5);
seq+1;
output;
end;
end;
run;
proc print data=sp4r.doloop;
run;
/*Part D*/
data sp4r.quants;
do q=-3 to 3 by .5;
pdf = pdf('Normal',q,0,1);
cdf = cdf('Normal',q,0,1);
quantile = quantile('Normal',cdf,0,1);
output;
end;
run;
proc print data=sp4r.quants;
run;
R-like plots using PROC SGPLOT (statistical graphics plot). Code HERE
proc sgplot data=sales;
scatter x=month y=revenue;
scatter x=month y=revenue_2;
*series x=month y=revenue;
*series x=month y=revenue_2;
run;
proc sgplot data=sales;
scatter x=month y=revenue / group=company;
run;
proc sgplot data=sales;
scatter x=month y=revenue;
by company;
run;
/*Part A*/
data sp4r.hist_data;
call streaminit(123);
do i=1 to 1000;
x = rand('exponential')*10;
output;
end;
run;
proc sgplot data=sp4r.hist_data;
histogram x;
run;
proc sgplot data=sp4r.hist_data;
histogram x / binwidth=1;
density x / type=normal;
density x / type=kernel;
run;
/*Part B*/
data sp4r.boxplot_data (drop=rep);
call streaminit(123);
do group=1 to 3;
do rep=1 to 100;
response = rand('exponential')*10;
output;
end;
end;
run;
proc sgplot data=sp4r.boxplot_data;
hbox response;
run;
proc sgplot data=sp4r.boxplot_data;
hbox response / category=group;
run;
/*Part C*/
data sp4r.sales;
call streaminit(123);
do month=1 to 12;
revenue = rand('Normal',10000,5000);
output;
end;
run;
proc sgplot data=sp4r.sales;
vbar month / response=revenue;
run;
/*Part D*/
data sp4r.series_data (keep=x y1 y2);
call streaminit(123);
do x=1 to 30;
beta01 = 10;
beta11 = 1;
y1 = beta01 + beta11*x + rand('Normal',0,5);
beta02 = 35;
beta12 = .5;
y2 = beta02 + beta12*x + rand('Normal',0,5);
output;
end;
run;
proc sgplot data=sp4r.series_data;
scatter x=x y=y1;
scatter x=x y=y2;
run;
proc sgplot data=sp4r.series_data;
series x=x y=y1;
series x=x y=y2;
run;
proc sgplot data=sp4r.series_data;
series x=x y=y1;
scatter x=x y=y1;
series x=x y=y2;
scatter x=x y=y2;
run;
/*Part E*/
* Regression with confidence limits (clm) and prediction limits (cli);
proc sgplot data=sp4r.series_data;
reg x=x y=y1 / clm cli;
reg x=x y=y2 / clm cli;
run;
Enhancing the plot. Can save as a pdf. Code HERE
/*Part A*/
data sp4r.sales;
call streaminit(123);
do month=1 to 12;
revenue = rand('Normal',10000,1000);
revenue_2 = rand('Normal',13000,500);
output;
end;
run;
/*Part B*/
proc sgplot data=sp4r.sales;
series x=month y=revenue / legendlabel='Company A'
lineattrs=(color=blue pattern=dash);
series x=month y=revenue_2 / legendlabel='Company B'
lineattrs=(color=red pattern=dash);
title 'Monthly Sales of Company A and B for 2015';
xaxis label="Month" values=(1 to 12 by 1);
yaxis label="Revenue for 2015";
inset "Jordan Bakerman" / position=bottomright;
refline 6.5 / transparency= 0.5 axis=x;
refline 11000 / transparency= 0.5;
run;
title;
/*Part C*/
proc sgplot data=sp4r.sales;
series x=month y=revenue / legendlabel='Company A' name='Company A'
lineattrs=(color=blue pattern=dash);
scatter x=month y=revenue / markerattrs=(color=blue
symbol=circlefilled);
series x=month y=revenue_2 / legendlabel='Company B'
name='Company B' lineattrs=(color=red pattern=dash);
scatter x=month y=revenue_2 / markerattrs=(color=red
symbol=circlefilled);
title 'Monthly Sales of Company A and B for 2015';
xaxis label="Month" values=(1 to 12 by 1);
yaxis label="Revenue for 2015" min=8000 max=14000;
inset "Jordan Bakerman" / position=bottomright;
refline 11000 / transparency= 0.5;
refline 6.5 / transparency= 0.5 axis=x;
keylegend 'Company A' 'Company B';
run;
title;
Create faceted plots with PROC SGSCATTER (matrix, plot, compare). Code HERE
proc sgscatter data=sp4r.cars;
plot mpg_highway*weight mpg_city*length
weight*length / columns=3;
run;
* Multi-cell plot;
ods layout start rows=1 columns=3;
ods region row=1 column=3;
proc sgplot data=sp4r.cars;
hbox mpg_city;
run;
ods layout end;
proc sgpanel data=sp4r.cars;
panelby origin / columns=3;
histogram mpg_city;
run;
proc sgpanel data=sp4r.lesscars;
panelby origin type / rows=1 columns=3;
reg x=weight y=mpg_city;
run;
/*Part A*/
data sp4r.multi;
call streaminit(123);
do Sex='F', 'M';
do j=1 to 1000;
if sex='F' then height = rand('Normal',66,2);
else height = rand('Normal',72,2);
output;
end;
end;
run;
/*Part B*/
proc sgpanel data=sp4r.multi;
panelby sex;
histogram height;
density height / type=normal;
title 'Heights of Males and Females';
colaxis label='Height';
run;
title;
/*Part C*/
ods layout Start rows=1 columns=3 row_height=(1in) column_gutter=0;
ods region row=1 column=1;
proc sgplot data=sp4r.multi (where= (sex='F'));
histogram height / binwidth=.5;
title 'Histogram of Female Heights';
run;
title;
ods region row=1 column=2;
proc sgplot data=sp4r.multi (where= (sex='F'));
density height / type=kernel;
title 'Density Estimate of Female Heights';
run;
title;
ods region row=1 column=3;
proc sgplot data=sp4r.multi (where= (sex='F'));
hbox height;
title 'Boxplot of Female Heights';
run;
title;
ods layout end;
Descriptive Procedures, Output Delivery System, and Macros
CORR, FREQ, MEANS, UNIVARIATE Procedures on a variable (column)
* Correlation matrix and covariance matrix;
proc corr data=sp4r.cars cov;
var horsepower weight length;
run;
* Categorical data;
proc freq data=sp4r.cars;
tables origin type;
run;
proc freq data=sp4r.cars;
tables origin*type; * cross table;
*tables origin*type / norow nocol nopercent;
run;
proc freq data=sp4r.cars nlevels;
tables origin*type; * cross table;
*tables origin*type / noprint;
run;
MEANS and UNIVARIATE give summary statistics, e.g. SKEWNESS, P10. Code HERE
proc means data=sp4r.cars maxdec=2 mean median var;
var mpg_city mpg_highway;
run;
HISTOGRAM
QQPLOT
/*Part A*/
proc contents data=sp4r.ameshousing varnum;
run;
/*Part B*/
proc univariate data=sp4r.ameshousing;
var saleprice;
histogram saleprice / normal kernel;
inset n mean std / position=ne;
qqplot saleprice / normal(mu=est sigma=est);
run;
Output Delivery System (ODS) (makes the tables). Code HERE
ods trace on;
proc univariate data=sp4r.ameshousing;
var saleprice;
qqplot saleprice / normal(mu=est sigma=est);
run;
ods trace off;
ods select basicmeasures qqplot;
proc univariate data=sp4r.ameshousing;
var saleprice;
qqplot saleprice / normal(mu=est sigma=est);
run;
Save as a new SAS data set in PROC Step. code HERE
ods output basicmeasures = SP_BasicMeasures; * object = data-set-name;
proc univariate data=sp4r.ameshousing;
var saleprice;
run;
* To save a value;
proc univariate data=sp4r.ameshousing;
var saleprice;
output out=stats mean=sp_mean;
run;
/*Part A*/
ods select basicmeasures;
ods output basicmeasures = sp4r.SalePrice_BasicMeasures;
proc univariate data=sp4r.ameshousing;
var saleprice;
run;
proc print data=sp4r.saleprice_basicmeasures;
run;
/*Part B*/
proc univariate data=sp4r.ameshousing;
var saleprice;
* Choose percentile points;
output out=sp4r.stats mean=saleprice_mean pctlpts= 40, 45, 50, 55, 60
pctlpre=saleprice_;
run;
proc print data=sp4r.stats;
run;
/*Part C*/
proc means data=sp4r.ameshousing;
var saleprice garage_area;
output out=sp4r.stats mean(saleprice)=sp_mean median(garage_area)=ga_med;
run;
proc print data=sp4r.stats;
run;
/*Part D*/
proc means data=sp4r.ameshousing;
var saleprice garage_area;
output out=sp4r.stats mean= std= / autoname;
run;
proc print data=sp4r.stats;
run;
Global macro variables (use variables in other datasets)
%let height = 67;
%let name = Ray Bell;
/* Use &height to reference the value */
* value;
%let year = 2010;
proc print data=sp4r.ameshousing;
where yr_sold = &year;
var yr_sold saleprice;
title "Price of Homes Sold in &year";
run;
* str;
%let gtype = Attached;
proc print data=sp4r.ameshousing;
where g = "&gtype";
var yr_sold saleprice;
title "a &gtype";
run;
Automating creating global macro variables in PROC SQL. Code HERE
proc means data=sp4r.ameshousing;
var saleprice;
output out=stats mean=mean std=sd; * output mean and sd;
run;
proc sql;
select mean into :sp_mean from stats;
select sd into :sp_sd from stats;
quit;
%put The mean and sd are &sp_mean ...; /* writes to the log */
%put _USER_;
/*Part A*/
proc means data=sp4r.ameshousing;
var saleprice;
output out=sp4r.stats mean=sp_mean std=sp_sd;
run;
proc sql;
select sp_mean into :sp_mean from sp4r.stats;
select sp_sd into :sp_sd from sp4r.stats;
quit;
/*Part B*/
data sp4r.ameshousing;
set sp4r.ameshousing;
sp_stan = (saleprice - &sp_mean) / &sp_sd;
run;
proc print data=sp4r.ameshousing (obs=6);
var saleprice sp_stan;
run;
proc means data=sp4r.ameshousing mean std;
var saleprice sp_stan;
run;
/*Part C*/
proc contents data=sp4r.cars varnum out=carscontents;
run;
proc print data=carscontents;
var name type;
run;
/*Part D*/
proc sql;
select distinct name into: vars_cont separated by ' ' from carscontents where type=1;
select distinct name into: vars_cat separated by ' ' from carscontents where type=2;
quit;
%put The continuous variables are &vars_cont and the categorical variables are &vars_cat;
Macro programs = R Function. code HERE?
%macro today;
%put Today is &sysday &sysdate9;
%mend;
%today
%macro calc(dsn,vars);
proc means data=&dsn;
var &vars;
run;
%mend calc;
%calc(business,yield)
Keyword parameters (like python)
start=01jan08
Business example
Daily report
Weekly report every Friday
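A minimal sketch of keyword parameters with default values. The macro name headprint and its defaults are hypothetical; it uses SASHELP data sets so it can run on its own.

```sas
/* Keyword parameters are written name=default in the %MACRO statement */
%macro headprint(dsn=sashelp.cars, nobs=6);
proc print data=&dsn (obs=&nobs);
run;
%mend headprint;

/* Use both defaults */
%headprint()

/* Override one parameter by name and keep the other default */
%headprint(nobs=3)

/* Keyword parameters can be given in any order */
%headprint(nobs=10, dsn=sashelp.class)
```

This is the same idea as default arguments in an R or Python function: callers only name the values they want to change.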
/*Part A*/
%macro mymac(dist=,param1=,param2=,n=100,stats=no,plot=no);
/*Part B*/
%if &dist= %then %do;
%put Dist is a required argument;
%return;
%end;
%if &param1= %then %do;
%put Param1 is a required argument;
%return;
%end;
/*Part C*/
%if &param2= %then %do;
data random (drop=i);
do i=1 to &n;
y=rand("&dist",&param1);
x+1;
output;
end;
run;
%end;
%else %do;
data random (drop=i);
do i=1 to &n;
y=rand("&dist",&param1,&param2);
x+1;
output;
end;
run;
%end;
/*Part D*/
%if %upcase(&stats)=YES %then %do;
proc means data=random mean std;
var y;
run;
%end;
/*Part E*/
%if %upcase(&plot)=YES %then %do;
proc sgplot data=random;
histogram y / binwidth=1;
density y / type=kernel;
run;
%end;
%mend;
/*Part F*/
%mymac(param1=0.2,stats=yes)
/*Part G*/
%mymac(dist=Geometric,param1=0.2,param2=,stats=yes)
/*Part H*/
options mprint;
%mymac(dist=Normal,param1=100,param2=10,n=1000,plot=yes)
Macro program for iterative processing. Code HERE
%macro myappend(start,stop);
%do year=&start %to &stop;
proc import datafile="&path\sales_&year..csv" out=sp4r.sales_&year dbms=csv replace;
run;
proc append base=sp4r.sales_all data=sp4r.sales_&year;
run;
proc datasets library=sp4r noprint;
delete sales_&year;
quit;
%end;
%mend;
options mprint;
%myappend(2000,2009)
/*Why did we use a double period to specify the DATAFILE above?*/
%let mypath = s:\workshop\;
%put &mypathmydata.csv;
%put &mypath.mydata.csv;
%let mydata = sales_data;
%put &mydata.csv;
%put &mydata..csv;
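To answer the question above: a period straight after a macro variable reference marks the end of the name and is consumed, so a second period is needed when a literal period should follow. A small sketch (the variable year and the text around it are my own example):

```sas
%let year = 2005;

/* The period ends the name and is consumed. Log: sales_2005csv */
%put sales_&year.csv;

/* First period ends the name, second one is kept. Log: sales_2005.csv */
%put sales_&year..csv;

/* Without a period SAS looks for a macro variable called year_total, */
/* which does not exist, so the log shows a warning                   */
%put &year_total;

/* The period marks where the name stops. Log: 2005_total */
%put &year._total;
```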
SAS Webinars
My notes on SAS Webinars I listened to:
AI = Training computers to perform tasks to mimic human reasoning.
Machine Learning = Subset of AI to automatically learn and improve from experience without being explicitly programmed.
Applications:
- Graph analytics
- Correlation, Regression analysis
- Cluster Analysis (groups, spot outliers)
- Neural Network, Predictive Analytics (complex and unknown patterns)
Robotic Process Automation = Software to mimic human action by automating simple and repetitive tasks. e.g. chat bot.
SAS Adaptive Learning and Intelligent Agent System.
https://www.sas.com/en_us/software/anti-money-laundering.html
You can use SAS in Python and use Python in SAS.
swat library
impute to fill in missing values.
Can 'promote' dataset to colleagues.
https://github.com/sassoftware/sas-prog-for-r-users ; https://github.com/sassoftware/saspy ; https://github.com/sassoftware/saspy-examples ; ... https://github.com/sassoftware
SAS automated ML pipeline
SAS Model Studio
Data -> Imputation -> SAS logistic regression
-> Python model
-> R model -> model comparison
Create 'New Pipeline' OpenSourceHMEQ template. Write R and Python code (sklearn ensembles random forest) in SAS.
Can compare against other pipelines and see 'Gradient Boosting' is best. Register the model.
Register models -> compare models -> select champion -> validate champion -> deploy -> score new -> monitor -> retrain/new -> back to start.
SAS Model Manager
Lift? (model validate; https://en.wikipedia.org/wiki/Lift_(data_mining))
Python Flask application to make a binary decision based on the model e.g. can I get a loan.
support.sas.com/rusers
sas-viya-programming https://github.com/sassoftware/sas-viya-programming
Challenges: too much data; poor quality; multiple sources (inconsistent); inability to deliver data.
Best practices: Profile data; preparation; standardization; match identification (e.g. Sam vs Samuel); monitoring (e.g. anomalies); repeatable process and workflow.
Can check for missing values; min and max; date/time issues. Pattern frequency distribution e.g. FL, F.L.
Build scheme. change all things that should be FL to FL e.g. F.L. Florida, florida.
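The webinar showed this in a point-and-click tool. As a rough equivalent in Base SAS, here is a minimal sketch of such a scheme, assuming a made-up data set states with a raw column holding the inconsistent spellings.

```sas
/* Made-up raw values with inconsistent spellings */
data states;
length raw $ 10;
input raw $;
datalines;
FL
F.L.
Florida
florida
Fla
;
run;

/* Build a comparison key: remove periods and upper-case, */
/* then map the known variants to one standard value      */
data states_clean;
set states;
length state $ 2;
key = upcase(compress(raw,'.'));
if key in ('FL','FLORIDA') then state='FL';
run;

/* Check the mapping. Rows with a missing state need a new rule */
proc freq data=states_clean;
tables raw*state / missing list;
run;
```

Here 'Fla' is deliberately left unmapped to show how unmatched patterns surface in the frequency table.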