https://dask.org/
Dask is a Python framework for parallel and distributed computing (distributed DataFrames, arrays, and task scheduling). Tutorial video: https://www.youtube.com/watch?v=mbfsog3e5DA
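The core idea is a lazy task graph that only executes when .compute() is called. A minimal sketch using dask.delayed (the inc/total names are just illustrative, not from the tutorial):
import dask

@dask.delayed
def inc(x):
    return x + 1

# build a lazy graph of 5 increments plus a sum, then execute it in parallel
total = dask.delayed(sum)([inc(i) for i in range(5)])
print(total.compute())   # 15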
--install
pip install dask distributed tornado==4.5.3 dask-ml --upgrade
pip install pyzmq==17.0.0 --upgrade
(jupyter nbextension enable --py --sys-prefix widgetsnbextension)
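Quick sanity check that the install worked (not part of the original tutorial):
python -c "import dask, distributed, dask_ml; print(dask.__version__, distributed.__version__)"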
Get the tutorial notebooks for Jupyter from:
https://github.com/dask/dask-tutorial
and
https://www.analyticsvidhya.com/blog/2018/08/dask-big-datasets-machine_learning-python/
--setup network
launch dask-scheduler on the scheduler node and dask-worker on each worker node
$ dask-scheduler
Scheduler started at 127.0.0.1:8786
$ dask-worker 127.0.0.1:8786
$ dask-worker 127.0.0.1:8786
$ dask-worker 127.0.0.1:8786
or from Python:
>>> from dask.distributed import Client
>>> client = Client()  # set up local cluster on your laptop
>>> client
>>> from dask.distributed import Client
>>> client = Client('127.0.0.1:8786')
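To confirm the client sees the workers, scheduler_info() from the distributed API can be used (addresses shown will depend on your setup):
>>> client.scheduler_info()['workers']   # dict with one entry per connected dask-worker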
# Submit tasks through the client
>>> def square(x):
...     return x ** 2
>>> def neg(x):
...     return -x
>>> A = client.map(square, range(10))
>>> B = client.map(neg, A)
>>> total = client.submit(sum, B)
>>> total.result()
-285
From: http://distributed.dask.org/en/latest/quickstart.html
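The futures returned by client.map can also be pulled back all at once with client.gather (standard distributed API):
>>> client.gather(A)
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]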
--setup network using python API or MPI
http://distributed.dask.org/en/latest/setup.html
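A minimal sketch of the Python API route using LocalCluster from dask.distributed (worker counts here are arbitrary; for MPI clusters the dask-mpi package provides dask_mpi.initialize()):
from dask.distributed import Client, LocalCluster

cluster = LocalCluster(n_workers=4, threads_per_worker=1)  # start scheduler + workers in-process
client = Client(cluster)
# ... run work ...
client.close()
cluster.close()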
1. create dask array
import dask.array as da
#using arange to create an array with values from 0 to 10
X = da.arange(11, chunks=5)
X.compute()
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
#to see size of each chunk
X.chunks
((5, 5, 1),)
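Operations on the array stay lazy until .compute() is called (a small illustration, not from the original tutorial):
X.sum().compute()         # 55
(X ** 2).sum().compute()  # 385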
2. convert a numpy array to a dask array
import numpy as np
x = np.arange(10)
y = da.from_array(x, chunks=5)
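The resulting dask array behaves like the numpy one but is split into chunks (again just an illustrative check):
y.chunks            # ((5, 5),)
y.mean().compute()  # 4.5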
3. read a csv file with dask.dataframe
>>> import dask.dataframe as dd
>>> df = dd.read_csv('power_CRAC3.csv')
>>> df.groupby(df.dev_name).Consumed_active_energy_kW.mean()
Dask Series Structure:
npartitions=1
float64
...
...
Name: Consumed_active_energy_kW, dtype: float64
Dask Name: truediv, 8 tasks
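The result above is still a lazy Dask Series; appending .compute() triggers the actual read and aggregation and returns an ordinary pandas Series (values depend on the CSV, so no output is shown here):
>>> df.groupby(df.dev_name).Consumed_active_energy_kW.mean().compute()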
4. work through 05_distributed.ipynb, 06_distributed_advanced.ipynb, and 08_machine_learning.ipynb in https://github.com/dask/dask-tutorial
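Since dask-ml is installed above, a minimal sketch of fitting a model on chunked dask arrays (make_classification and LogisticRegression are dask_ml APIs; the sizes and chunking are arbitrary):
from dask.distributed import Client
from dask_ml.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

client = Client()  # or Client('127.0.0.1:8786') to use the cluster set up above

# generate a chunked dataset as dask arrays and fit a model on it
X, y = make_classification(n_samples=1000, n_features=10, chunks=100)
lr = LogisticRegression()
lr.fit(X, y)
print(lr.score(X, y))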