Memo

I often spend lots of time setting up the same software over and over again for different purposes. To save time, I decided to write down some useful procedures and hints here. I hope they can be helpful to me and to others who visit this page.

Run Apache Spark with Docker

posted May 5, 2015, 7:32 PM by Teng-Yok Lee   [ updated May 9, 2015, 8:34 AM ]

I want to learn Spark, but I don't have a cluster, so I use Docker to simulate one for practice. This memo mainly re-organizes multiple online tutorials (see the references at the end).

Prepare a virtual machine

This step is needed because I am using Windows. At the beginning I used the virtual machine that came with boot2docker, but its root (/) was stored in RAM, and thus I lost all configuration changes (e.g. to BASH) after rebooting the guest OS. Thus I decided to install Ubuntu on VirtualBox instead. The tutorial in the link below has clear illustrations:

http://www.wikihow.com/Install-Ubuntu-on-VirtualBox

NOTE

  1. Create a disk with at least 20GB. At the beginning I only prepared 8GB, which quickly ran out of space.
  2. Also, don't use a fixed-size disk, because it cannot be resized later if needed (REF).
  3. Assign enough CPUs. I used 4 cores.
  4. Once the guest Ubuntu is installed, install openssh-server (REF) so that you can log in to Ubuntu via PuTTY.

Install Docker


Docker's official site has clear instructions. Once logged in to Ubuntu, run the following two commands:

$ wget -qO- https://get.docker.com/ | sh

$ sudo docker run hello-world


Run Spark


This step follows the AMPLab tutorial in the references, but the git repo URL has changed. Now the git command should be:

$ git clone -b blogpost https://github.com/amplab/docker-scripts.git

Then launch the docker containers for spark:
$ sudo ./docker-scripts/deploy/deploy.sh -i amplab/spark:0.8.0 -c

NOTE
  1. These scripts do not work with newer versions of Spark. First, they only support up to 1.0.0. Second, even the script for Spark 1.0.0 fails: the docker command keeps waiting for the master, in vain, because the master cannot launch Spark.
  2. This command launches a Scala shell. To run pyspark, type exit to terminate this shell first (otherwise it will take all CPUs for its own workers).

Run PySpark

The following is based on the instructions by Aris. The previous step should have printed information about the master. For instance:

***********************************************************************

start shell via:            sudo /home/leeten/projects/docker-scripts/deploy/start_shell.sh -i amplab/spark-shell:0.8.0 -n 5b37cadb558db3380eef69adfd9bcc533dd98a604f529447c81331533dfa951b

visit Spark WebUI at:       http://172.17.0.4:8080/

visit Hadoop Namenode at:   http://172.17.0.4:50070

ssh into master via:        ssh -i /home/leeten/projects/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.17.0.4

/data mapped:

kill master via:           sudo docker kill 7fff30fe8ef3b766504844e0f5eace95a10c66d1e72327fbdd5604b5b8536a16

***********************************************************************

Now you can log in to the master after changing the permissions of the id_rsa file:

$ chmod 400 /home/leeten/projects/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa
$ ssh -i /home/leeten/projects/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.17.0.4

Inside the Docker container, launch pyspark:

$ /opt/spark-0.8.0/pyspark


Example: Estimate PI



The following Python code estimates pi. It is based on the code segment in Spark Examples, and the complete source can be found on GitHub. However, that version fails at the statement that creates the SparkContext with Spark 0.8.0 (the pyspark shell already provides one as sc), so I revised it as follows:

# REF: https://spark.apache.org/examples.html
# Complete (but not directly runnable) code: https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
# NOTE: sc is the SparkContext that the pyspark shell predefines.

from random import random

def sample(p):
    # p (the element index) is ignored; each call draws one random point in the unit square.
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

NUM_SAMPLES = 100
count = sc.parallelize(xrange(0, NUM_SAMPLES)).map(sample).reduce(lambda a, b: a + b)
print "Pi is roughly %f" % (4.0 * count / NUM_SAMPLES)


Troubleshooting

What if docker keeps waiting for the master?

You can log in to the master manually. As mentioned in the previous section, the following command prints the master's IP.

$ sudo ./docker-scripts/deploy/deploy.sh -i amplab/spark:0.8.0 -c


Then you can log in to the master directly:

$ ssh -i /home/leeten/projects/docker-scripts/deploy/../apache-hadoop-hdfs-precise/files/id_rsa -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no root@172.17.0.4

Once logged in, check whether Spark is running, or launch Spark manually to see whether it works. In my case, the master failed to launch Spark, so docker kept waiting.

Error: WARN ClusterScheduler: Initial job has not accepted any resources

If spark or pyspark shows the following message, it means that no worker is available (REF):


WARN ClusterScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

You can check Spark's status on the WebUI mentioned above. In this example, it is http://172.17.0.4:8080/.
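If you prefer to check from a script, below is a small sketch that probes the WebUI (using the IP from the example output above):

# Probe the Spark master WebUI to see whether the master is up.
import urllib2

try:
    urllib2.urlopen('http://172.17.0.4:8080/', timeout=5)
    print 'The Spark master WebUI is reachable.'
except Exception as e:
    print 'Cannot reach the WebUI: %s' % e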

References

http://www.wikihow.com/Install-Ubuntu-on-VirtualBox
https://docs.docker.com/installation/ubuntulinux/
https://amplab.cs.berkeley.edu/author/schumach/
http://www.rankfocus.com/run-berkeley-sparks-pyspark-using-docker-couple-minutes/

My Python Porting of TestScatterPlotMatrix for VTK

posted Apr 12, 2015, 6:25 AM by Teng-Yok Lee   [ updated Apr 12, 2015, 6:26 AM ]

This code is based on the sample: https://github.com/Kitware/VTK/blob/master/Charts/Core/Testing/Cxx/TestScatterPlotMatrix.cxx

I ported it to Python and loaded the iris dataset as a demo.

import numpy as np
import vtk
import vtk.util.numpy_support as VN

# Load the iris dataset.
# NOTE: It is downloaded from:
# http://aima.cs.berkeley.edu/data/iris.csv
csv_filepath = r'F:\data\multivariate\iris\iris.csv'
n_cols = 4
csv_table = np.loadtxt(open(csv_filepath, "rb"), delimiter=",", usecols=xrange(n_cols))

# Convert into the table format for VTK.
vtk_table = vtk.vtkTable()
for columni in range(n_cols):
    # Convert the numpy array to a vtk array.
    # NOTE: https://pyscience.wordpress.com/2014/09/06/numpy-to-vtk-converting-your-numpy-arrays-to-vtk-arrays-and-files/
    array = VN.numpy_to_vtk(np.ascontiguousarray(csv_table[:, columni]), deep=1)
    array.SetName("%d" % (columni))
    vtk_table.AddColumn(array)

######################################################################
# REF: http://fossies.org/dox/ParaView-v4.1.0-source/TestScatterPlotMatrix_8cxx_source.html
matrix = vtk.vtkScatterPlotMatrix()
# Fine-tune the colors if needed.
# matrix.SetPlotColor(matrix.SCATTERPLOT, vtk.vtkColor4ub(0, 0, 0, 1))
# matrix.SetPlotColor(matrix.ACTIVEPLOT, vtk.vtkColor4ub(0, 0, 0, 1))
matrix_view = vtk.vtkContextView()
matrix_view.GetScene().AddItem(matrix)
matrix.SetInput(vtk_table)
matrix.SetNumberOfBins(7)
matrix_view.Render()
matrix_view.GetInteractor().Start()
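Running the script pops up an interactive scatter plot matrix window. Remember to adjust csv_filepath to wherever iris.csv is stored on your machine.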




Use libSDF to read SDF format on Windows

posted Mar 20, 2015, 9:46 AM by Teng-Yok Lee   [ updated Mar 20, 2015, 9:47 AM ]

This is my memo on reading the data for the IEEE SciVis 2015 Contest. The files are in SDF format. There is a C/C++ library, libSDF, to open them, but some instructions are unclear, especially for porting to Windows. Also, I could not find examples of its usage, so I wrote one:

Prerequisites

To build libSDF for Visual Studio 2010, I use Cygwin and MinGW x64. The procedure to install them can be found here:

http://www.recheliu.org/memo/suggestionstoavoidarpackcompilationerrors

Build libSDF for Windows x64 platform

  • Edit SDFfuncs.c: Remove the preprocessor definition USE_ALLOCA.
  • Edit utils.c: Change the function MPMY_Fopen() to open the file in binary mode (otherwise not all bytes can be read):
MPMYFile *MPMY_Fopen(const char *path, int mpmy_flags)
{
    MPMYFile *fp;
    int iomode = MPMY_SINGL;
    char mode[8] = {[0] = 'r'};

    Msgf("Fopen %s\n", path);
    if (mpmy_flags & MPMY_RDONLY) mode[0] = 'r'; /* MPMY_RDONLY is 0 since O_RDONLY is (stupidly) 0 */
    if (mpmy_flags & MPMY_WRONLY) mode[0] = 'w';
    if (mpmy_flags & MPMY_APPEND) mode[0] = 'a';
    // TEST-ADD-BEGIN
    switch(mode[0]) {
    case 'r':
        strcpy(mode, "r+b");
        break;
    case 'w':
        strcpy(mode, "w+b");
        break;
    case 'a':
        strcpy(mode, "a+b");
        break;
    }
    // TEST-ADD-END
    ...

  • Edit Makefile: Change CC from gcc to the MinGW one.
CC=/usr/bin/x86_64-w64-mingw32-gcc.exe
  • Use cygwin to build the library.
$ make libSDF.a
$ /usr/bin/x86_64-w64-mingw32-dllwrap.exe --export-all-symbols *.o -lm --output-def libSDF_x64.def -o libSDF_x64.dll

  • Use a Visual Studio (64-bit) Command Prompt to generate the import library (.lib):

lib.exe /machine:X64 /def:libSDF_x64.def

My quick example to read the array "x"

    // A self-contained version of my example. NOTE: the header name "sdf.h"
    // below is an assumption; adjust it to match your copy of libSDF.
    #include <iostream>
    #include <vector>
    #include <stdint.h>
    extern "C" {
    #include "sdf.h"
    }
    using namespace std;

    #define LOG_VAR(x) cout<<x<<endl;

    int main()
    {
        char* szSdfFilepath = "F:/data/viscontest/scivis2015/ds14_scivis_0128_e4_dt04_0.0200";
        LOG_VAR(SDFissdf(szSdfFilepath));

        SDF *sdf = SDFopen(szSdfFilepath, "");
        SDFdebug(1); // Pass 0 to disable the debug information.

        // Read all records of the array "x" into a float vector.
        int64_t uNrOfRecs = SDFnrecs("x", sdf);
        LOG_VAR(uNrOfRecs);
        vector<float> vfData(uNrOfRecs);
        SDFrdvecs(sdf, "x", uNrOfRecs, vfData.data(), 0, NULL);
        SDFclose(sdf);

        // Print the non-zero values.
        for(int64_t i = 0; i < uNrOfRecs; i++)
        {
            if( 0.0f != vfData[i] )
            {
                LOG_VAR(vfData[i]);
            }
        }
        return 0;
    }
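To link this example in Visual Studio, add the generated libSDF_x64.lib to the linker inputs, and make sure libSDF_x64.dll (plus the MinGW runtime DLLs; see the "Link the lib" section of the ARPACK memo below) can be found on your PATH.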


Suggestions to avoid ARPACK++ compilation errors

posted Jan 21, 2015, 7:24 AM by Teng-Yok Lee   [ updated Jan 21, 2015, 6:42 PM ]

I plan to put my patches for ARPACK++ (http://www.ime.unicamp.br/~chico/arpack++/) into my Google Code repo at the end of Jan. 2015. Here I manually list the fixes for my applications. Note that my applications only use ardsmat.h and ardssym.h, so there could be more errors elsewhere, but I guess the fixes for the other parts should be similar.

arch.h

Comment out the include of arcomp.h (otherwise, arcomplex<float>/arcomplex<double> will be declared inside an extern "C" block, which is not allowed since C does not understand C++ templates):

// #include "arcomp.h"

Also, replace the include of generic.h

#include <generic.h>

with the only macro actually needed, name2, which simply pastes its two arguments together into a single token (e.g. name2(foo, bar) expands to foobar):

// REF: http://www-d0.fnal.gov/KAI/doc/migrate/gnu_generic.h
#define name2(a,b) gEnErIc2(a,b)
#define gEnErIc2(a,b) a ## b

ardssym.h

Replace
DefineParameters(A.ncols(), nevp, &A, ARdsSymMatrix<FLOAT>::MultMv,
by
DefineParameters(A.ncols(), nevp, &A, &ARdsSymMatrix<FLOAT>::MultMv,

arerror.h

Replace
#include <iostream.h>

by
#include <iostream>

arpackf.h

Change the lines at the end (because arcomp.h is no longer included, ARCOMP_H is undefined, so the closing '}' would be skipped by the preprocessor and the extern "C" block would never be closed) from:

}
#endif // ARCOMP_H

to
#endif // ARCOMP_H
} // extern "C" {


Build ARPACK on Windows for Visual Studio

posted Jan 19, 2015, 11:33 PM by Teng-Yok Lee   [ updated Jan 21, 2015, 7:27 AM ]

This is a revision of Jernej Barbic's steps to build ARPACK for Visual Studio. The original steps are in the URL below:

http://www-bcf.usc.edu/~jbarbic/arpack.html

I met issues when executing step 3, so I decided to put my modified procedure here. For 32-bit, you can use either gfortran or g77. For 64-bit, only gfortran works.

My recommendation is to use the MinGW64 that comes with Cygwin 64.

Use gfortran (Win32)

  1. Modify ARmake.inc in the source code root. I extracted the ARPACK source code to D:\src\ARPACK. Also, I could not find f77 in the latest MinGW, so I used gfortran instead. Thus the following variables should be changed accordingly:
    home = /d/src/ARPACK
    PLAT = win32
    FC      = gfortran
    FFLAGS    = -O
  2. Open an MSYS window and:
    cd /d/src/ARPACK
  3. Compile the .f to .o:
    make lib
  4. Wrap the *.o into a .dll. I needed to add the library gfortran:
    dllwrap --export-all-symbols BLAS/*.o LAPACK/*.o SRC/*.o UTIL/*.o -lg2c -lgfortran --output-def arpack_win32.def -o arpack_win32.dll
  5. Open a Visual Studio Command Prompt and:
    cd d:\src\ARPACK
  6. Generate the library:
    lib.exe /machine:i386 /def:arpack_win32.def

Use g77 (Win32)

  1. Modify ARmake.inc in the source code root. I extracted the ARPACK source code to D:\src\ARPACK. The following variables should be changed accordingly:
    home = /d/src/ARPACK
    PLAT = win32
    FC      = g77
    FFLAGS    = -O
  2. Open an MSYS window and:
    cd /d/src/ARPACK
  3. Compile the .f to .o:
    make lib
  4. Wrap the *.o into a .dll:
    dllwrap --export-all-symbols BLAS/*.o LAPACK/*.o SRC/*.o UTIL/*.o -lg2c --output-def arpack_win32.def -o arpack_win32.dll
  5. Open a Visual Studio Command Prompt and:
    cd d:\src\ARPACK
  6. Generate the library:
    lib.exe /machine:i386 /def:arpack_win32.def

Use gfortran (Win 64) in MSYS

Before the procedure, install MinGW 64. In my case, it is installed to C:\Program Files (x86)\mingw-w64, and gfortran.exe is in C:\Program Files (x86)\mingw-w64\i686-4.9.2-posix-dwarf-rt_v3-rev1\mingw32\bin.
  1. Modify ARmake.inc in the source code root. I extracted the ARPACK source code to D:\src\ARPACK. Also, I could not find f77 in the latest MinGW, so I used gfortran instead. Thus the following variables should be changed accordingly:
    home = /d/src/ARPACK
    PLAT = x64
    FC = /c/Program\ Files\ \(x86\)/mingw-w64/i686-4.9.2-posix-dwarf-rt_v3-rev1/mingw32/bin/gfortran.exe
    FFLAGS    = -O
    RANLIB = /c/Program\ Files\ \(x86\)/mingw-w64/i686-4.9.2-posix-dwarf-rt_v3-rev1/mingw32/bin/ranlib.exe

  2. Open an MSYS window and:
    cd /d/src/ARPACK
  3. Extend PATH:
    export PATH=$PATH:/c/Program\ Files\ \(x86\)/mingw-w64/i686-4.9.2-posix-dwarf-rt_v3-rev1/mingw32/opt/bin
  4. Edit UTIL/second.f: Remove the line EXTERNAL ETIME.
  5. Compile the .f to .o:
    make lib
  6. Wrap the *.o to .dll:
    /c/Program\ Files\ \(x86\)/mingw-w64/i686-4.9.2-posix-dwarf-rt_v3-rev1/mingw32/bin/dllwrap.exe --export-all-symbols BLAS/*.o LAPACK/*.o SRC/*.o UTIL/*.o -lgfortran --output-def arpack_x64.def -o arpack_x64.dll
  7. Open a Visual Studio (64-bit) Command Prompt and:
    cd d:\src\ARPACK
  8. Generate the library:
    lib.exe /machine:X64 /def:arpack_x64.def

Use gfortran (Win 64) in Cygwin

Later I found that MinGW is also available in Cygwin, which makes the commands shorter.

Before the procedure, install mingw64-x86_64-gcc in CYGWIN 64.

  1. Modify ARmake.inc in the source code root. I extracted the ARPACK source code to D:\src\ARPACK. Also, I could not find f77 in the latest MinGW, so I used gfortran instead. Thus the following variables should be changed accordingly:
    home = /cygdrive/d/src/ARPACK
    PLAT = x64
    FC = /usr/bin/x86_64-w64-mingw32-gfortran.exe
    FFLAGS    = -O
    RANLIB = /usr/bin/x86_64-w64-mingw32-ranlib.exe

  2. Open a Cygwin terminal and:
    cd /cygdrive/d/src/ARPACK
  3. Edit UTIL/second.f: Remove the line EXTERNAL ETIME.
  4. Compile the .f to .o:
    make lib
  5. Wrap the *.o to .dll:
    /usr/bin/x86_64-w64-mingw32-dllwrap.exe --export-all-symbols BLAS/*.o LAPACK/*.o SRC/*.o UTIL/*.o -lm -lgfortran --output-def arpack_x64.def -o arpack_x64.dll
  6. Open a Visual Studio (64-bit) Command Prompt and:
    cd d:\src\ARPACK
  7. Generate the library:
    lib.exe /machine:X64 /def:arpack_x64.def

Link the lib

After building the .lib and .dll, add the directories of the related .dlls to your PATH. For Cygwin, the path is

C:\cygwin64\usr\x86_64-w64-mingw32\sys-root\mingw\bin

For MinGW, it should be the path to the 64-bit binaries.

If the .dll does not match the linked library, the application might show the error "The application was unable to start correctly (0xc000007b)". I saw it when my system had both the 32-bit and 64-bit versions of MinGW. After I uninstalled both and kept only Cygwin 64, the problem was solved.

References

  1. External issue of etime: https://gcc.gnu.org/ml/fortran/2007-03/msg00305.html
  2. Use CYGWIN and MinGW64 to build Visual Studio libraries: https://github.com/arrayfire/arrayfire/wiki/CBLAS-for-Windows
  3. Cannot start my application: http://stackoverflow.com/questions/25124182/mingw-gcc-the-application-was-unable-to-start-correctly-0xc000007b

A dirty trick to install numpy via distutils without gcc 4.9

posted Dec 15, 2014, 12:17 AM by Teng-Yok Lee   [ updated Dec 15, 2014, 12:17 AM ]

Since Python 2.7.8, its distutils uses gcc 4.9 as the C compiler, which introduces a new flag, -fstack-protector-strong; the AR executable is changed as well. On a system without gcc 4.9, compilation fails, and consequently numpy cannot be built.

Since not all package management systems have gcc 4.9 (e.g. old Ubuntu AMIs from AWS), a quick trick is to modify the following file:

/usr/lib/python2.7/plat-x86_64-linux-gnu/_sysconfigdata_nd.py

First, remove -fstack-protector-strong in this file. Second, change the value of AR to 'ar'.
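For instance, below is a minimal Python sketch that applies both edits in place; it assumes the stock layout of the file, where the flag and the AR setting appear as quoted string values.

# Patch _sysconfigdata_nd.py: drop -fstack-protector-strong and reset AR.
# NOTE: This is only a sketch; back up the file before running it.
import re

path = '/usr/lib/python2.7/plat-x86_64-linux-gnu/_sysconfigdata_nd.py'
with open(path) as f:
    src = f.read()
src = src.replace('-fstack-protector-strong', '')  # Remove the gcc 4.9 flag.
src = re.sub(r"'AR': '[^']*'", "'AR': 'ar'", src)  # Change the value of AR to 'ar'.
with open(path, 'w') as f:
    f.write(src)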


Export the link to pdf format of Google Doc

posted Nov 15, 2014, 4:12 PM by Teng-Yok Lee   [ updated Dec 15, 2014, 12:20 AM ]

As I use Google Docs to edit my CV, I previously needed to download the file as a PDF and upload it to my webpage. Nevertheless, linking to the PDF directly is very simple: first, open the file for editing to get its URL; then change /edit in the URL to /export?format=pdf.
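For example, the rewrite is a simple string substitution (the document ID below is a hypothetical placeholder):

# Hypothetical URL; replace <DOC_ID> with the real document ID.
url = 'https://docs.google.com/document/d/<DOC_ID>/edit'
pdf_url = url.rsplit('/edit', 1)[0] + '/export?format=pdf'
print pdf_url  # https://docs.google.com/document/d/<DOC_ID>/export?format=pdf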

NOTE: To make the link accessible to others, the doc itself must be viewable by others.

REF: http://webapps.stackexchange.com/questions/8106/link-to-view-pdf-version-of-a-google-doc 

Disable GPU Acceleration on Firefox

posted Sep 14, 2014, 8:13 AM by Teng-Yok Lee   [ updated Dec 15, 2014, 12:19 AM ]

  1. Type about:config as the URL.
  2. Search for gfx.direct2d.disabled.
  3. Double-click the entry to set it to true.

REF: https://support.mozilla.org/en-US/questions/922995

PyDev: My FAQ

posted Aug 15, 2014, 10:55 AM by Teng-Yok Lee   [ updated Aug 15, 2014, 1:57 PM ]

Q: What should I do when I receive the error message "Failed to read server's response: Connection refused: connect"?
A: Increase the number of connection attempts.
REF: https://www.mail-archive.com/pydev-users@lists.sourceforge.net/msg03298.html

Generate histograms from bash commands

posted Jun 22, 2014, 8:01 PM by Teng-Yok Lee

REF: http://www.smallmeans.com/notes/shell-history/

For instance, to compute a histogram of the commands used:
history|awk '{print $2}'|sort|uniq -c
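A rough Python equivalent, assuming the history is stored in ~/.bash_history (history prefixes each line with its number, which is why awk prints $2; the history file itself has no numbers, so the command name is the first field):

# Count how often each command appears in the bash history file.
from collections import Counter
import os

with open(os.path.expanduser('~/.bash_history')) as f:
    commands = [line.split()[0] for line in f if line.strip()]
for command, count in Counter(commands).most_common(10):
    print '%6d %s' % (count, command)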
