SciDB-R

Notes on the XLDB 2013 tutorial on SciDB and SciDB-R by Alex Poliakov and Paul Brown: http://www.youtube.com/watch?v=SsF_Mke0Mlw&feature=youtube_gdata

Some SciDB R commands:

iquery('count(modis.data)', return = 1)
Per instance: 1-2 CPU cores, 4 GB RAM

config.ini
-------------------------------------------------------------------------------------------------------------------
[x16]
server-0=p4xen7.local.paradigm4.com,3 #means 4 on coord
server-1=p4xen8.local.paradigm4.com,4
server-2=p4xen9.local.paradigm4.com,4
server-3=p4xen10.local.paradigm4.com,4
db_user=x16_user #PG credentials
db_passwd=x16_password
install_root=/opt/scidb/13.6 #binaries
pluginsdir=/opt/scidb/13.6/lib/scidb/plugins
logconf=/opt/scidb/13.6/share/scidb/log4cxx.properties
base-port=1239 #coordinator port
base-path=/home/scidb/scidbdata #see also data-dir-prefix
tmp-path=/datadisk1 #room for tmp storage

## Thread settings: up to 4 queries, 4 threads per query
execution-threads=6 #MAX_NO_QUERIES+2
result-prefetch-queue-size=4 #threads per query
operator-threads=4 #threads per query
result-prefetch-threads=16 #MAX_NO_QUERIES * t.p.q
## Memory settings
mem-array-threshold=512 #temp query cache
smgr-cache-size=512 #persistent array cache
network-buffer=512 #scatter/gather buffer size
merge-sort-buffer=64 #used for sort, one per thread
-------------------------------------------------------------------------------------------------------------------
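Going by the inline comments, the thread settings in the config above follow a simple pattern. A minimal Python sketch of that arithmetic, assuming the comments describe the intended rule (the function name and its two inputs are hypothetical planning numbers, not SciDB options):

```python
def thread_settings(max_queries, threads_per_query):
    """Derive the config.ini thread values from two planning numbers.

    Follows the comments in the config above; the function and its
    argument names are illustrative only.
    """
    return {
        "execution-threads": max_queries + 2,
        "result-prefetch-queue-size": threads_per_query,
        "operator-threads": threads_per_query,
        "result-prefetch-threads": max_queries * threads_per_query,
    }


settings = thread_settings(max_queries=4, threads_per_query=4)
for key, value in settings.items():
    print(f"{key}={value}")
```

With 4 queries and 4 threads per query this reproduces the values 6, 4, 4 and 16 used in the config.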
$ scidb.py stopall x16
$ scidb.py init_syscat x16 #as postgres user
$ scidb.py initall x16 #x16 is the config name
$ scidb.py startall x16
$ cat ~/.config/scidb/iquery.conf #iquery config file
{
"format":"lcsv+"
}

$ iquery -aq "list('instances')"

==================
• Install shim on coordinator machine:
– https://github.com/Paradigm4/shim/wiki/Installing-shim
• Access shim from browser
• Access shim from R / RStudio:
> install.packages("scidb")
> library("scidb")
> scidbconnect()
> scidblist()
===================
$ scidb.py stopall demo_db
$ scidb.py startall demo_db
$ sudo rstudio-server restart
===================
Loading Data
• Overall process:
1. Visualize the desired array (or arrays)
2. Prepare the files
3. Load the data into SciDB (typically as a 1-dimensional array)
4. Handle Load/Data-Quality Errors
5. Rearrange the 1-D array into the desired array(s)
• Several Loading Techniques
– CSV Load
– Binary Load (faster)
– Parallel Load
– Opaque Load (across instances)
• The overall process holds for all of these techniques.
=======================
$ head -n 5 laml_methyl_composite.csv
TCGA-AB-2802-03A-01D-0741-05,cg00000029,0.521668865344633,RBL2,16,53468112
TCGA-AB-2802-03A-01D-0741-05,cg00000108,NA,C3orf35,3,37459206
TCGA-AB-2802-03A-01D-0741-05,cg00000109,NA,FNDC3B,3,171916037
TCGA-AB-2802-03A-01D-0741-05,cg00000165,0.100722321673368,,1,91194674
TCGA-AB-2802-03A-01D-0741-05,cg00000236,0.837944995677383,VDAC3,8,42263294
$ iquery -aq "create array laml_methylation_flat
<sample_id:string,
probe_id:string,
beta_value:double,
gene_id:string,
chromosome:string,
genomic_coordinate:uint64>
[row_num=0:*, 1000000,0]"
$ loadcsv.py -i laml_methyl_composite.csv -a laml_methylation_flat #parallel load
$ iquery -aq "count(laml_methylation_flat)"
i,count
0,94201938
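For the "Prepare the files" and "Handle Load/Data-Quality Errors" steps, it can help to check the CSV before loading. A small Python sketch that parses the five sample rows above into the flat schema and reports which rows lack a numeric beta_value. It assumes "NA" marks a missing value; how SciDB itself treats such fields is not covered here.

```python
import csv
import io

SAMPLE = """\
TCGA-AB-2802-03A-01D-0741-05,cg00000029,0.521668865344633,RBL2,16,53468112
TCGA-AB-2802-03A-01D-0741-05,cg00000108,NA,C3orf35,3,37459206
TCGA-AB-2802-03A-01D-0741-05,cg00000109,NA,FNDC3B,3,171916037
TCGA-AB-2802-03A-01D-0741-05,cg00000165,0.100722321673368,,1,91194674
TCGA-AB-2802-03A-01D-0741-05,cg00000236,0.837944995677383,VDAC3,8,42263294
"""

FIELDS = ["sample_id", "probe_id", "beta_value", "gene_id",
          "chromosome", "genomic_coordinate"]


def parse_flat(text):
    """Yield (row_num, record) pairs mirroring the 1-D flat array."""
    for row_num, row in enumerate(csv.reader(io.StringIO(text))):
        rec = dict(zip(FIELDS, row))
        # Assumption: "NA" marks a missing beta value.
        raw = rec["beta_value"]
        rec["beta_value"] = None if raw == "NA" else float(raw)
        rec["genomic_coordinate"] = int(rec["genomic_coordinate"])
        yield row_num, rec


rows = list(parse_flat(SAMPLE))
missing = [n for n, rec in rows if rec["beta_value"] is None]
print(len(rows), "rows; missing beta_value at row_num", missing)
```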
======================
loadcsv.py: the CSV file is piped through splitcsv, which splits it and passes the pieces along to the instances; they start loading while the file is still being transferred.
$ loadcsv.py --help
-d DB_ADDRESS SciDB Coordinator Hostname or IP Address (Default = "localhost")
-p DB_PORT SciDB Coordinator Port (Default = 1239)
-r DB_ROOT SciDB Installation Root Folder (Default = "/opt/scidb/13.6")
-i INPUT_FILE CSV Input File (Default = stdin)
-n SKIP # Lines to Skip (Default = 0)
-t TYPE_PATTERN N number, S string, s nullable-string, C char (e.g., "NNsCS")
-D DELIMITER Delimiter (Default = ",")
-f STARTING_COORDINATE Starting Coordinate (Default = 0)
-c CHUNK_SIZE Chunk Size (Default = 500000)
-o OUTPUT_BASE Output File Base Name (Default = INPUT_FILE or "stdin.csv")
-P SSH_PORT SSH Port (Default = System Default)
-u SSH_USERNAME SSH Username
-k SSH_KEYFILE SSH Key/Identity File
-a LOAD_NAME Load Array Name
-s LOAD_SCHEMA Load Array Schema
-w SHADOW_NAME Shadow Array Name
-e ERRORS_ALLOWED # Load Errors Allowed per Instance (Default = 0)
-A TARGET_NAME Target Array Name
-S TARGET_SCHEMA Target Array Schema
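The splitting step that loadcsv.py performs can be pictured with a small Python sketch. It assumes chunks of CHUNK_SIZE lines are dealt to the instances in round-robin order; that is my reading of the description, not something these notes confirm, and the function name is hypothetical.

```python
def split_round_robin(lines, n_instances, chunk_size):
    """Deal consecutive chunks of lines to instances in round-robin order."""
    parts = [[] for _ in range(n_instances)]
    for start in range(0, len(lines), chunk_size):
        chunk_no = start // chunk_size
        parts[chunk_no % n_instances].extend(lines[start:start + chunk_size])
    return parts


lines = [f"row{i}" for i in range(10)]
parts = split_round_robin(lines, n_instances=4, chunk_size=2)
for instance_id, part in enumerate(parts):
    print(instance_id, part)
```

Because each instance receives whole chunks, it can begin loading its first chunk before the rest of the file has been split.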
======================

#Save to coordinator
iquery -aq "save(laml_methylation_flat,
'/home/scidb/laml_methyl_flat.scidb',
-2,
'lcsv+')"
#Save a piece on each instance in parallel;
#the relative file name goes into each instance's data_dir
iquery -aq "save(laml_methylation_flat,
'laml_methyl_flat.scidb',
-1,
'opaque')"
#Reload in parallel: useful for poor man’s elasticity, e.g. to increase the
#number of instances from 8 to 16; SciDB takes care of redistributing the
#data for you (http://youtu.be/SsF_Mke0Mlw?t=49m16s)
iquery -aq "load(laml_methylation_flat,
'symlink_to_laml_methyl_flat.scidb',
-1,
'opaque')"
======================
Arrays Versus Tables
• Everything is an Array:
– 1 or more dimensions
– 1 or more attributes in each cell
– chunked and distributed
– sparse or dense
• Operators redimension() and redimension_store() can be used to turn attributes into dimensions and vice versa
• Chunk sizing is important
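To picture what "turning attributes into dimensions" means, here is a conceptual Python sketch. It is a model only, not how SciDB implements redimension(); the sample records are made up, and refusing colliding cells is a simplification of this sketch.

```python
def redimension(flat_cells, dims, attrs):
    """Turn a list of records into a dict keyed by dimension coordinates.

    Conceptual model only: the fields named in `dims` become coordinates,
    the fields named in `attrs` stay as cell values.
    """
    array = {}
    for cell in flat_cells:
        coord = tuple(cell[d] for d in dims)
        if coord in array:
            # Simplification: this sketch refuses two records in one cell.
            raise ValueError(f"cell collision at {coord}")
        array[coord] = {a: cell[a] for a in attrs}
    return array


# Made-up records standing in for a flat 1-D array.
flat = [
    {"symbol_id": 0, "time": 10, "price": 101.5, "volume": 200},
    {"symbol_id": 0, "time": 20, "price": 101.7, "volume": 50},
    {"symbol_id": 1, "time": 10, "price": 33.0, "volume": 900},
]
trades = redimension(flat, dims=["symbol_id", "time"], attrs=["price", "volume"])
print(trades[(1, 10)])
```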
======================
SciDB can compute a 3x3 moving-window average.
SciDB distinguishes null cells from empty cells.
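A toy Python model of both points: a 3x3 window average over a sparse grid, where a missing key is an empty cell and a value of None is a null cell. Skipping nulls in the average is an assumption of this sketch, and the function name is hypothetical.

```python
def window_avg_3x3(cells):
    """Average each non-empty cell's 3x3 neighbourhood.

    `cells` maps (row, col) -> value. A missing key is an empty cell;
    a key whose value is None is a null cell. This sketch skips both
    when averaging, which is an assumption about the intended semantics.
    """
    result = {}
    for (r, c) in cells:
        values = [
            cells[(r + dr, c + dc)]
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if cells.get((r + dr, c + dc)) is not None
        ]
        result[(r, c)] = sum(values) / len(values) if values else None
    return result


grid = {(0, 0): 1.0, (0, 1): 3.0, (1, 0): None, (1, 1): 5.0}
averages = window_avg_3x3(grid)
print(averages)
```

The null cell at (1, 0) still produces an output cell, while an empty position such as (2, 2) produces none.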
======================
Materialized views are planned as future work.
======================
Pick a chunk size so that each chunk holds about 1 million non-empty cells (count only the non-empty cells when not all cells are full, i.e. when the matrix is sparse).
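A rough Python sketch of that rule of thumb. It assumes the same chunk length along every dimension and a known fill density; both are simplifications, and the function name is hypothetical.

```python
def chunk_length(density, n_dims, target_cells=1_000_000):
    """Per-dimension chunk length for roughly `target_cells` non-empty cells.

    `density` is the fraction of logical cells that are non-empty (0 < d <= 1).
    Assumes equal chunk length along every dimension, for simplicity.
    """
    logical_cells = target_cells / density
    return round(logical_cells ** (1.0 / n_dims))


print(chunk_length(density=1.0, n_dims=2))   # dense 2-D array
print(chunk_length(density=0.01, n_dims=2))  # 1% of cells filled
```

The sparser the array, the larger the logical chunk has to be to hold the same number of non-empty cells.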
$ iquery -aq "load_library('example_udos')"
# this adds an operator 'uniq'
# index_lookup converts the stock symbol to an integer symbol_id
$ iquery -ocsv+ -aq "
between(
index_lookup(
trades_flat,
stock_symbol_index,
trades_flat.stock_symbol,
symbol_id
),
0, 5
)"



$ iquery -aq "
aggregate(
redimension(
substitute(
index_lookup(
trades_flat,
stock_symbol_index,
trades_flat.stock_symbol,
symbol_id
),
build(<val:int64>[x=0:0,1,0], -1)
),
<count:uint64 null> [symbol_id=0:*,200,0, time=0:*,86400000,0],
count(*) as count
),
max(count), avg(count)
)"



$ iquery -aq "
create array trades
<price:double, volume:uint64>
[symbol_id=0:*,200,0, time=0:*,86400000,0, trade_no=0:499,500,0]"
$ iquery -anq "
redimension_store(
index_lookup(
trades_flat,
stock_symbol_index,
trades_flat.stock_symbol,
symbol_id
),
trades
)"


$ iquery -aq "list('operators')"
======================
In the between() operator, use null to specify an unlimited bound.
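A small Python model of that behaviour for a 1-D array, with None standing in for null. This illustrates the idea and is not SciDB code.

```python
def between(cells, low, high):
    """Keep cells of a 1-D array whose coordinate lies in [low, high].

    `cells` maps coordinate -> value; None stands in for null and means
    "no limit on this side".
    """
    return {
        coord: value
        for coord, value in cells.items()
        if (low is None or coord >= low) and (high is None or coord <= high)
    }


array = {0: "a", 3: "b", 7: "c", 12: "d"}
print(between(array, 3, None))   # everything from coordinate 3 upward
print(between(array, None, 3))   # everything up to coordinate 3
```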
======================
regrid: aggregates fixed-size blocks of cells into a coarser array; changing chunk sizes is what repart() does.
======================
window aggregate
window(
aggregate(
trades,
avg(price) as price,
symbol_id, time
),
0, 0,
60000, 0,
avg(price)
)
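What the query above computes along the time dimension can be sketched in Python. This assumes the four numbers are (preceding, following) pairs per dimension, so that 60000, 0 means "the previous 60000 ms up to the current cell"; that is my reading of the query, and the prices are made up.

```python
def trailing_window_avg(prices, preceding):
    """Average price over [t - preceding, t] for every non-empty time t.

    `prices` maps time (ms) -> price for a single symbol_id.
    """
    result = {}
    for t in prices:
        window = [p for s, p in prices.items() if t - preceding <= s <= t]
        result[t] = sum(window) / len(window)
    return result


# Made-up prices for one symbol, keyed by time in milliseconds.
prices = {0: 10.0, 30_000: 20.0, 90_000: 30.0, 120_000: 50.0}
print(trailing_window_avg(prices, preceding=60_000))
```

The window has a fixed extent in coordinate space, so the number of trades that fall into it varies from cell to cell.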

---------
FIXED WINDOW Aggregates
• repart() is an operator that changes chunk sizes and adds overlap
• window() inserts a repart() operation if needed
• Consider storing the array with overlap to speed things up: store(repart()) query
• Sometimes adjusting the repart() config speeds repart() up: setopt('repart-algorithm', 'sparse')

-----------
VARIABLE WINDOW Aggregates
• Window expands or contracts to ensure that each aggregate value is calculated on the same number of non-empty cells
• Applies only to one-dimensional windows
• Example: moving average price over the last 100 trades
• To speed up: put the entire row in a single chunk
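The contrast with a fixed window can be shown in a short Python sketch: the window is defined by a count of non-empty cells, not by a coordinate range. The function name and data are made up, and 2 preceding cells are used here in place of 100 to keep the example small.

```python
def variable_window_avg(cells, n_preceding):
    """Average over each cell and the `n_preceding` non-empty cells before it."""
    coords = sorted(cells)
    result = {}
    for i, coord in enumerate(coords):
        window = [cells[c] for c in coords[max(0, i - n_preceding):i + 1]]
        result[coord] = sum(window) / len(window)
    return result


# Sparse 1-D array: coordinate -> price; gaps are empty cells.
trades = {1: 10.0, 2: 20.0, 50: 30.0, 900: 40.0}
print(variable_window_avg(trades, n_preceding=2))
```

The gap between coordinates 50 and 900 does not matter: each average still covers at most three non-empty cells.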
======================
AFL% list('arrays');
AFL% list('operators');
AFL% list('types');
AFL% list('functions');
AFL% list('aggregates');
AFL% list('instances');
AFL% list('queries');
======================
insert(a,b) = store(merge(a,b),b)
If you have modified an array many times, SciDB keeps all of the previous versions and may become slow. In that case, store the array into a new array, remove the previous one, and rename the new array to the former name.
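A toy Python model of the insert() identity and of the version build-up described above. The class is a hypothetical stand-in, not SciDB's storage; merge() here takes a's cell where it is non-empty and b's cell otherwise.

```python
def merge(a, b):
    """Cell-wise merge: take a's cell where non-empty, else b's cell."""
    merged = dict(b)
    merged.update(a)
    return merged


class VersionedArray:
    """Toy array that keeps every stored version, as a stand-in for SciDB."""

    def __init__(self, cells):
        self.versions = [dict(cells)]

    def latest(self):
        return self.versions[-1]

    def store(self, cells):
        self.versions.append(dict(cells))


def insert(a, b):
    """insert(a, b) modelled as store(merge(a, b), b)."""
    b.store(merge(a, b.latest()))


b = VersionedArray({0: "old0", 1: "old1"})
insert({1: "new1", 2: "new2"}, b)
print(b.latest(), len(b.versions))

# Compaction: copy only the latest version into a fresh array.
compacted = VersionedArray(b.latest())
print(len(compacted.versions))
```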
======================
USER DEFINED
• Datatypes: (src/examples/point, rational)
• Functions: support for point, rational
• Aggregates: penmax
http://www.scidb.org/forum/viewtopic.php?f=18&t=1122
• Operators: example_udos
======================
SciDB-R UNDER THE HOOD
rewrite_r_expressions = function()
{
options(scidb.debug=TRUE)
x = as.scidb(iris) #upload from R to SciDB
head(x) #R expr on SciDB object

y = scidb("laml_matrix") #R -> SciDB pointer
y[,3][] #R subset and download

z = cbind(rnorm(485578)) #R vector
A = y %*% z #SciDB matrix * vector
}
======================
http://illposed.net/
======================
