Data Science Tips

Virtual environments under conda/mamba

conda env export | grep -v "^prefix: " > environment.yml

conda env create -f environment.yml
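The grep step above drops the machine-specific "prefix:" line that conda env export writes, so the file can be reused on another machine. A minimal Python sketch of the same filter, assuming the exported YAML is already in a string (the name strip_prefix and the sample text are illustrative, not part of conda):

```python
def strip_prefix(env_yaml: str) -> str:
    """Return the environment text without any line starting with 'prefix: '."""
    kept = [line for line in env_yaml.splitlines() if not line.startswith("prefix: ")]
    return "\n".join(kept) + "\n"


# Hypothetical export, for illustration only.
sample = (
    "name: py39\n"
    "channels:\n"
    "  - conda-forge\n"
    "dependencies:\n"
    "  - python=3.9\n"
    "prefix: /home/ubuntu/miniconda3/envs/py39\n"
)
cleaned = strip_prefix(sample)
print(cleaned)
```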

Note that Anaconda's default channels are no longer free for all commercial use (conda itself is still free). To make conda-forge the default channel instead (see https://tenpy.readthedocs.io/en/v0.8.0/install/conda.html):

conda config --add channels conda-forge

conda config --set channel_priority strict

Then remove the defaults channel:

conda config --remove channels defaults

To check the configured channels:

conda config --show channels

Virtual environments under pip

pip freeze > requirements.txt

python3 -m venv env_name 

source env_name/bin/activate 

pip install -r requirements.txt
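The same environment can also be created from Python with the standard-library venv module. The sketch below is only an illustration: make_env is a hypothetical helper, and it builds the environment in a temporary directory so nothing is left behind.

```python
import tempfile
import venv
from pathlib import Path


def make_env(env_dir: str) -> Path:
    """Create a virtual environment at env_dir and return its config file path."""
    # with_pip=False keeps this sketch fast; use with_pip=True to mirror `python3 -m venv`.
    venv.create(env_dir, with_pip=False)
    return Path(env_dir) / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    cfg = make_env(str(Path(tmp) / "env_name"))
    created = cfg.exists()
    print("environment created:", created)
```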


To use the new environment in Jupyter, register it as a kernel (this requires the ipykernel package to be installed in the environment):

python -m ipykernel install --user --name=env_name

Sqlite3 database

GUI tool: https://sqlitebrowser.org/

Copy a table from one database to another

sqlite3 old.db ".dump mytable" | sqlite3 new.db
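The same copy can be done from Python with the built-in sqlite3 module. This is a minimal sketch under stated assumptions: copy_table is a hypothetical helper, the file and table names are placeholders, and it copies only the table definition and rows (not indexes or triggers).

```python
import sqlite3
import tempfile
from pathlib import Path


def copy_table(src_path: str, dst_path: str, table: str) -> int:
    """Copy one table (schema and rows) from src to dst; return the row count in dst."""
    dst = sqlite3.connect(dst_path)
    try:
        dst.execute("ATTACH DATABASE ? AS src", (src_path,))
        row = dst.execute(
            "SELECT sql FROM src.sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if row is None:
            raise ValueError(f"table {table!r} not found in {src_path}")
        dst.execute(row[0])  # recreate the table in dst with its original schema
        # Identifiers cannot be bound as parameters, so quote the name by hand.
        quoted = '"' + table.replace('"', '""') + '"'
        dst.execute(f"INSERT INTO main.{quoted} SELECT * FROM src.{quoted}")
        dst.commit()
        return dst.execute(f"SELECT COUNT(*) FROM main.{quoted}").fetchone()[0]
    finally:
        dst.close()


# Demo with throwaway databases in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    old_db, new_db = str(Path(tmp) / "old.db"), str(Path(tmp) / "new.db")
    con = sqlite3.connect(old_db)
    con.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, name TEXT)")
    con.executemany("INSERT INTO mytable (name) VALUES (?)", [("a",), ("b",)])
    con.commit()
    con.close()
    copied = copy_table(old_db, new_db, "mytable")
    print("rows copied:", copied)
```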


Profiling code in a Jupyter notebook

Using line_profiler: https://github.com/pyutils/line_profiler

First, follow the instructions at the link to install line_profiler. Then load it in a notebook cell:

%load_ext line_profiler

Assume you have a Python function f(x). You can time each line of the function with the following magic (x must be real data):

%lprun -f f f(x) # the function name f comes first, followed by the call f(x) with real data in x

Here "lp" stands for line_profiler. The per-line timing report is shown after the command finishes.
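To try this out, any function whose work is spread over several lines will do. The function and data below are purely hypothetical stand-ins for your own f and x; the magic itself is shown only as a comment because it runs in a notebook cell, not in plain Python.

```python
def f(x):
    """Toy workload: mean of squares, split into steps so each line gets its own timing."""
    squares = [v * v for v in x]
    total = sum(squares)
    return total / len(x)


x = list(range(100_000))  # stand-in for real data
result = f(x)
print(result)

# In a notebook cell, after `%load_ext line_profiler`:
#   %lprun -f f f(x)
```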

Running tasks as a system daemon through systemd

If you want to run a streamlit dashboard, you can run it inside a screen session. However, a better way is to run it as a system daemon (process) using systemd. Here is a nice tutorial on systemd (https://www.shubhamdipt.com/blog/how-to-create-a-systemd-service-in-linux/). The steps below add more detail. First go to the systemd unit directory and create a unit file there (its name must end in .service) with the content shown between the dashed lines.

cd /etc/systemd/system

--------------------------------------------------

[Unit]

Description=EPCAL Dashboard

[Service]

User=ubuntu

WorkingDirectory=/home/ubuntu/epcal/epcal_dashboard

ExecStart=/home/ubuntu/miniconda3/envs/py39/bin/streamlit run epcal_dashboard.py

Restart=always

[Install]

WantedBy=multi-user.target

------------------------------------------------
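Before installing a unit file, a rough sanity check can catch obvious slips. The sketch below parses the unit text with Python's configparser; check_unit is a hypothetical helper and its rules are deliberately simplistic (for example, it ignores the special prefixes systemd allows in ExecStart). systemd's own systemd-analyze verify is the authoritative check.

```python
import configparser

UNIT_TEXT = """\
[Unit]
Description=EPCAL Dashboard

[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu/epcal/epcal_dashboard
ExecStart=/home/ubuntu/miniconda3/envs/py39/bin/streamlit run epcal_dashboard.py
Restart=always

[Install]
WantedBy=multi-user.target
"""


def check_unit(text: str) -> list:
    """Return a list of problems found in the unit text (empty if none)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep the case of keys such as ExecStart
    parser.read_string(text)
    problems = []
    for section in ("Unit", "Service", "Install"):
        if not parser.has_section(section):
            problems.append(f"missing [{section}] section")
    exec_start = parser.get("Service", "ExecStart", fallback="")
    if not exec_start.startswith("/"):
        problems.append("ExecStart does not start with an absolute path")
    return problems


print(check_unit(UNIT_TEXT))
```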

The unit above works as is once the service is enabled and started with systemctl. If you also need to run additional commands (e.g., set up port forwarding with iptables), put them in a shell script in the working directory and have the service run that script instead. For example:

--------------------------------------------------

[Unit]

Description=EPCAL Dashboard

[Service]

User=ubuntu

WorkingDirectory=/home/ubuntu/epcal/epcal_dashboard

ExecStart=/bin/bash epcaldash_service.sh

Restart=always

[Install]

WantedBy=multi-user.target

------------------------------------------------

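where epcaldash_service.sh is as follows. Make the script executable with chmod +x epcaldash_service.sh. Because the script calls sudo, the ubuntu user must be able to run sudo without a password prompt.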

-------------------------

#!/bin/bash

sudo iptables -A PREROUTING -t nat -p tcp --dport 80 -j REDIRECT --to-ports 8501

/home/ubuntu/miniconda3/envs/py39/bin/streamlit run /home/ubuntu/epcal/epcal_dashboard/epcal_dashboard.py

---------------------------

------------------------------------------------------------------------------------------------------



File descriptor limit

Operating systems usually cap the number of open file descriptors per process. Sometimes the streamlit app exceeds this limit, which leads to errors. A practical fix is:

first, check the current limit (usually 1024) with: ulimit -n

second, check system-wide usage with: cat /proc/sys/fs/file-nr (the first number is the count of file handles currently allocated and the last is the system-wide maximum)

third, raise the limit by editing /etc/security/limits.conf (with sudo) and adding the following entries

* soft nofile 40000 

* hard nofile 40000

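The value 40000 is only an example and can be changed. The new limits apply to new login sessions, so log out and back in before checking again with ulimit -n. A service started by systemd does not read limits.conf; its limit is set with LimitNOFILE= in the [Service] section of the unit file.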
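From inside Python (for example, within the app itself) the limit in effect for the current process can be read with the standard-library resource module. A minimal sketch, assuming a Unix system; fd_report is a hypothetical helper, and the open-descriptor count relies on /proc, so it is reported as None where /proc is not available.

```python
import os
import resource


def fd_report() -> dict:
    """Return the soft and hard descriptor limits and the current open count."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    fd_dir = "/proc/self/fd"  # Linux-specific listing of this process's descriptors
    open_now = len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else None
    return {"soft": soft, "hard": hard, "open_now": open_now}


report = fd_report()
print(report)
```

The soft value is the per-process limit that ulimit -n reports for a shell; comparing it with open_now shows how close the process is to the limit.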

See these two posts for more:

https://serverfault.com/questions/235356/open-file-descriptor-limits-conf-setting-isnt-read-by-ulimit-even-when-pam-limi

https://serverfault.com/questions/48717/practical-maximum-open-file-descriptors-ulimit-n-for-a-high-volume-system

Some useful Git commands


# Clone only one branch

git clone --single-branch -b branch-name your_repo_url.git


# List all remote branches

git branch -r