Data Science Tips

Virtual environments under conda/mamba

Export the conda environment to a YAML file without the machine-specific prefix line:

conda env export | grep -v "^prefix: " > environment.yml
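The same filtering can be done in Python, for example on a machine where grep is not available. This is only a minimal sketch: the function name strip_prefix is hypothetical and not part of conda, and it assumes the exported text has already been read into a string.

```python
def strip_prefix(yaml_text: str) -> str:
    """Return the exported environment text without its 'prefix: ' line."""
    kept = [
        line
        for line in yaml_text.splitlines()
        # Mirror grep -v "^prefix: ": drop only lines starting with "prefix: ".
        if not line.startswith("prefix: ")
    ]
    return "\n".join(kept) + "\n"
```

The result can then be written to environment.yml as in the shell command above.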

To create a conda environment from that file:

conda env create -f environment.yml

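Note that conda itself remains free, but under Anaconda's terms of service, commercial use of Anaconda's default channels can require a paid license. To avoid this, make conda-forge the default channel as follows (see https://tenpy.readthedocs.io/en/v0.8.0/install/conda.html):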

conda config --add channels conda-forge

conda config --set channel_priority strict

Then remove the defaults channel:

conda config --remove channels defaults

To check the configured channels:

conda config --show channels

Virtual environments under pip

Save the packages installed in the current pip environment to a requirements file:

pip freeze > requirements.txt

Create a new virtual environment and install the packages from the requirements file:

python3 -m venv env_name

source env_name/bin/activate

pip install -r requirements.txt

To use the new environment in Jupyter, you need one more step: register it as a kernel (the ipykernel package must be installed in the environment):

python -m ipykernel install --user --name=env_name
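The environment creation step above can also be scripted with the standard library venv module. The following is a minimal sketch: the helper name create_env is hypothetical, and the returned interpreter path assumes a POSIX layout (bin/python).

```python
import venv
from pathlib import Path


def create_env(env_dir: str, with_pip: bool = True) -> Path:
    """Create a virtual environment and return the path to its interpreter."""
    # Equivalent in spirit to: python3 -m venv env_dir
    venv.create(env_dir, with_pip=with_pip)
    # POSIX layout assumed; on Windows the interpreter lives under Scripts.
    return Path(env_dir) / "bin" / "python"
```

The returned interpreter can then be used to run "-m pip install -r requirements.txt", for example through subprocess.run.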

Sqlite3 database

GUI tool: https://sqlitebrowser.org/

Copy a table from one database to another:

sqlite3 old.db ".dump mytable" | sqlite3 new.db
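If the sqlite3 command-line tool is not available, a similar copy can be sketched with Python's built-in sqlite3 module. The function name copy_table is hypothetical. Unlike .dump, this sketch copies only the table definition and its rows, and it assumes the table name is trusted and that the table does not yet exist in the destination.

```python
import sqlite3


def copy_table(src_path: str, dst_path: str, table: str) -> None:
    """Copy one table (definition and rows) from src_path to dst_path."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # Fetch the original CREATE TABLE statement from the schema table.
        row = src.execute(
            "SELECT sql FROM sqlite_master WHERE type='table' AND name=?",
            (table,),
        ).fetchone()
        if row is None:
            raise ValueError(f"table {table!r} not found in {src_path}")
        dst.execute(row[0])
        # Stream the rows across; one placeholder per column.
        cur = src.execute(f'SELECT * FROM "{table}"')
        placeholders = ",".join("?" * len(cur.description))
        dst.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', cur)
        dst.commit()
    finally:
        src.close()
        dst.close()
```

For example, copy_table("old.db", "new.db", "mytable") corresponds to the shell command above.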

Profiling code in Jupyter notebooks

Using line_profiler: https://github.com/pyutils/line_profiler

First, follow the instructions at the link to install line_profiler. Then load it in one cell of the notebook:

%load_ext line_profiler

Assume you have a Python function f(x). You can profile the time spent on each line of the function with the following (note that x must be real data):

%lprun -f f f(x)

The function name f comes first, after -f, followed by the call f(x) to execute. Here "lp" stands for line_profiler. The per-line timing results are shown after the command runs.
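Outside a notebook, or when line_profiler is not installed, a coarser profile can be obtained with the standard library. Note that this swaps in a different tool: cProfile reports time per function call, not per line. The helper name profile_call is hypothetical.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, **kwargs):
    """Run func(*args, **kwargs) under cProfile; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time and keep only the first `top` entries.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

For a function f and real data x, result, report = profile_call(f, x) followed by print(report) shows the function-level timings.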

Running tasks as a system daemon through systemd

If you want to run a Streamlit dashboard, you can run it inside a screen session. However, a better way is to run it as a system daemon (service) using systemd. Here is a nice tutorial on using systemd (https://www.shubhamdipt.com/blog/how-to-create-a-systemd-service-in-linux/). I am adding more details here.

cd /etc/systemd/system

Create a file named epcaldash.service in that directory with the following content. WorkingDirectory is the directory that stores your Streamlit script. ExecStart is the command that runs the Streamlit script. You must give the full path to the right streamlit executable (e.g., the one inside a virtual environment).

--------------------------------------------------

[Unit]

Description=EPCAL Dashboard

[Service]

User=ubuntu

WorkingDirectory=/home/ubuntu/epcal/epcal_dashboard

ExecStart=/home/ubuntu/miniconda3/envs/py39/bin/streamlit run epcal_dashboard.py

Restart=always

[Install]

WantedBy=multi-user.target

------------------------------------------------
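Before installing a unit file like the one above, a quick sanity check can catch a relative path in ExecStart or WorkingDirectory. The sketch below is an assumption-laden helper, not a systemd tool: the name check_unit is hypothetical, and it relies on Python's configparser, which handles simple unit files like this one but not every systemd syntax feature.

```python
import configparser


def check_unit(text: str) -> list:
    """Return a list of problems found in a simple systemd unit file."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep key case, e.g. ExecStart
    parser.read_string(text)
    problems = []
    service = parser["Service"] if parser.has_section("Service") else {}
    # The example above uses absolute paths; flag anything else.
    if not service.get("ExecStart", "").startswith("/"):
        problems.append("ExecStart should start with an absolute path")
    workdir = service.get("WorkingDirectory", "")
    if workdir and not workdir.startswith("/"):
        problems.append("WorkingDirectory should be an absolute path")
    if not parser.has_section("Install"):
        problems.append("missing [Install] section (needed for enable)")
    return problems
```

An empty list means none of these particular problems were found; it does not prove the unit is valid.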

The above works directly. However, if you want to run additional commands (e.g., set up port forwarding through iptables), you may want to put them in a shell script in your working directory and run that script from the service. An example is as follows:

--------------------------------------------------

[Unit]

Description=EPCAL Dashboard

[Service]

User=ubuntu

WorkingDirectory=/home/ubuntu/epcal/epcal_dashboard

ExecStart=/bin/bash epcaldash_service.sh

Restart=always

[Install]

WantedBy=multi-user.target

------------------------------------------------

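where epcaldash_service.sh is as follows. You then need to run chmod +x epcaldash_service.sh to make it executable. Note that iptables -A appends a new rule every time the script runs, so each service restart adds a duplicate redirect rule.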

-------------------------

#! /bin/bash

sudo iptables -A PREROUTING -t nat -p tcp --dport 80 -j REDIRECT --to-ports 8501

/home/ubuntu/miniconda3/envs/py39/bin/streamlit run /home/ubuntu/epcal/epcal_dashboard/epcal_dashboard.py

---------------------------

------------------------------------------------------------------------------------------------------

Reload the service files to include the new service:

sudo systemctl daemon-reload

Start your service:

sudo systemctl start epcaldash.service

Check the status of your service:

sudo systemctl status epcaldash.service

Enable your service so that it starts automatically on every reboot:

sudo systemctl enable epcaldash.service

Disable the automatic start on reboot:

sudo systemctl disable epcaldash.service

Check the logs:

journalctl -u epcaldash.service

Check the most recent 1000 log lines:

journalctl -u epcaldash.service -n 1000

File descriptor limit

The OS often sets a maximum on the number of open file descriptors. Sometimes the Streamlit app may exceed this limit, which leads to errors. A practical solution is:

First, check the current per-process limit (usually 1024) with: ulimit -n

Second, check the system-wide usage with: cat /proc/sys/fs/file-nr (the three numbers are the allocated file handles, the allocated but unused file handles, and the system-wide maximum)

Third, increase the limit by editing the file /etc/security/limits.conf (using sudo) to add the following entries:

* soft nofile 40000

* hard nofile 40000

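Note that 40000 is only an example; other values can be used. These entries take effect at the next login session. For a service started by systemd, such as the dashboard service above, limits.conf is not applied; set LimitNOFILE=40000 in the [Service] section of the unit file instead.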
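The first two checks above can also be done from inside a Python process, which is handy for logging the limits the app actually runs with. This is a minimal sketch for Linux: the function names fd_limits and parse_file_nr are hypothetical, while resource.getrlimit and RLIMIT_NOFILE are part of the standard library on Unix.

```python
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def parse_file_nr(text: str) -> dict:
    """Parse the three fields of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {"allocated": allocated, "unused": unused, "maximum": maximum}
```

On Linux, parse_file_nr(open("/proc/sys/fs/file-nr").read()) gives the system-wide numbers, and fd_limits() gives the same soft value that ulimit -n reports for the process.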

See these two posts for more:

https://serverfault.com/questions/235356/open-file-descriptor-limits-conf-setting-isnt-read-by-ulimit-even-when-pam-limi

https://serverfault.com/questions/48717/practical-maximum-open-file-descriptors-ulimit-n-for-a-high-volume-system

Some useful Git commands

# Clone only one branch

git clone --single-branch -b branch-name your_repo_url.git

# List all remote branches

git branch -r