Data Science Tips
Virtual environments under conda/mamba
Export the conda environment to a yml file without the prefix line:
conda env export | grep -v "^prefix: " > environment.yml
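The prefix line records the absolute path of the environment on the exporting machine, which is why it is dropped before sharing the file. Where grep is not available (for example on Windows), the same filtering could be sketched in Python. The function name strip_prefix and the sample text below are illustrative assumptions, not part of conda.

```python
def strip_prefix(env_yaml: str) -> str:
    """Return the export text without its top-level 'prefix: ' line."""
    kept = [line for line in env_yaml.splitlines()
            if not line.startswith("prefix: ")]
    return "\n".join(kept) + "\n"


# Made-up sample of what `conda env export` might print.
sample = (
    "name: py39\n"
    "channels:\n"
    "  - conda-forge\n"
    "dependencies:\n"
    "  - python=3.9\n"
    "prefix: /home/ubuntu/miniconda3/envs/py39\n"
)
cleaned = strip_prefix(sample)
```

Writing cleaned to environment.yml should give the same file as the grep pipeline above.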
To create a conda environment from that file:
conda env create -f environment.yml
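Note that Anaconda's default channel (defaults) is no longer free for commercial use in larger organizations, although the conda tool itself remains open source. To make conda-forge the preferred channel instead, run the following (see https://tenpy.readthedocs.io/en/v0.8.0/install/conda.html):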
conda config --add channels conda-forge
conda config --set channel_priority strict
Then remove the defaults channel:
conda config --remove channels defaults
To check the configured channels:
conda config --show channels
Virtual environments under pip
Write the packages installed in the current pip environment to a requirements file:
pip freeze > requirements.txt
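As a rough illustration of what ends up in requirements.txt, the sketch below lists the installed distributions as name==version pins using only the standard library. It is an approximation under stated assumptions, not pip's actual implementation: pip freeze also handles editable installs and other special cases. The function name frozen_requirements is hypothetical.

```python
from importlib import metadata


def frozen_requirements() -> list[str]:
    """List installed distributions as 'name==version', sorted by name."""
    pins = set()
    for dist in metadata.distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta else None
        if name:  # skip entries with broken or missing metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


lines = frozen_requirements()
```

Each entry in lines has the same name==version shape that pip install -r requirements.txt reads back.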
Create a new virtual environment and install the requirements into it:
python3 -m venv env_name
source env_name/bin/activate
pip install -r requirements.txt
To use the new environment as a kernel in Jupyter, one more step is needed (ipykernel must be installed in the environment first, e.g., with pip install ipykernel):
python -m ipykernel install --user --name=env_name
Sqlite3 database
GUI tool: https://sqlitebrowser.org/
Copy a table from one database to another:
sqlite3 old.db ".dump mytable" | sqlite3 new.db
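The same copy can be sketched from Python with the standard sqlite3 module, which may be convenient inside a notebook. This is a minimal sketch, assuming the target database does not already contain the table. The helper name copy_table is hypothetical, and unlike .dump it does not copy indexes or triggers.

```python
import sqlite3


def copy_table(src_path: str, dst_path: str, table: str) -> None:
    """Copy one table (schema and rows) from src_path into dst_path."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # Look up the original CREATE TABLE statement in the source database.
        row = src.execute(
            "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if row is None:
            raise ValueError(f"table {table!r} not found in {src_path}")
        dst.execute(row[0])  # recreate the table with the same schema
        rows = src.execute(f'SELECT * FROM "{table}"').fetchall()
        if rows:
            placeholders = ", ".join("?" * len(rows[0]))
            dst.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})', rows
            )
        dst.commit()
    finally:
        src.close()
        dst.close()
```

Calling copy_table("old.db", "new.db", "mytable") would mirror the shell command above. All rows are read into memory first, so for very large tables the command-line dump is probably the better choice.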
Profiling code in a Jupyter notebook
Using line_profiler: https://github.com/pyutils/line_profiler
First, follow the instructions at the link to install line_profiler. Then load it in a notebook cell:
%load_ext line_profiler
Assuming you have a Python function f(x), you can time each line of the function as follows (x must be real data, not a placeholder):
%lprun -f f f(x) # put the function name f after -f, followed by the call f(x) with x as real data
where "lp" stands for line_profiler. The per-line timing results are shown after the command runs.
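To make the call pattern concrete, here is a made-up function f and some real data x that could be defined in a notebook cell before profiling. Both are hypothetical examples chosen only so that each line does a distinct piece of work. The magic command itself appears as a comment because it only works inside IPython or Jupyter.

```python
# Hypothetical function and data, used only to illustrate the call pattern.
def f(x):
    squares = [v * v for v in x]  # line_profiler would time this line
    total = sum(squares)          # and this one separately
    return total / len(x)


x = list(range(100_000))  # "real data": an actual argument, not a placeholder

# In a notebook cell, after `%load_ext line_profiler`, one would run:
#   %lprun -f f f(x)
result = f(x)  # plain call, so the cell also runs outside IPython
```

The report should then show one row per source line of f, which makes it easy to see whether the list comprehension or the sum dominates.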
Running tasks as a system daemon through systemd
If you want to run a Streamlit dashboard, you can run it inside a screen session. However, a better way is to run it as a system daemon (service) using systemd. Here is a nice tutorial on using systemd (https://www.shubhamdipt.com/blog/how-to-create-a-systemd-service-in-linux/). I am adding more details here.
Log in to your EC2 instance and go to the systemd directory:
cd /etc/systemd/system
Create a file named epcaldash.service with the following content. WorkingDirectory is the directory that stores your Streamlit script. ExecStart is the command that runs the script. You must give the full path to the right streamlit executable (e.g., the one inside a virtual environment).
--------------------------------------------------
[Unit]
Description=EPCAL Dashboard
[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu/epcal/epcal_dashboard
ExecStart=/home/ubuntu/miniconda3/envs/py39/bin/streamlit run epcal_dashboard.py
Restart=always
[Install]
WantedBy=multi-user.target
------------------------------------------------
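Because systemd does not use your shell's PATH or an activated virtual environment, a forgotten full path in ExecStart is an easy mistake. The sketch below is a rough sanity check one might run on the unit text before copying it into /etc/systemd/system. It is not a systemd parser: it ignores line continuations, and some systemd versions also accept a bare executable name, which this check would flag. The names UNIT_TEXT and check_unit are hypothetical.

```python
import configparser
import posixpath
import shlex

UNIT_TEXT = """\
[Unit]
Description=EPCAL Dashboard
[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu/epcal/epcal_dashboard
ExecStart=/home/ubuntu/miniconda3/envs/py39/bin/streamlit run epcal_dashboard.py
Restart=always
[Install]
WantedBy=multi-user.target
"""


def check_unit(text: str) -> list[str]:
    """Return a list of problems found in the unit text (empty if none)."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(text)
    problems = []
    service = parser["Service"] if parser.has_section("Service") else {}
    exec_start = service.get("ExecStart", "")
    if not exec_start:
        problems.append("missing ExecStart")
    else:
        # Drop systemd's optional special prefixes such as "-" or "+".
        program = shlex.split(exec_start)[0].lstrip("-@:+!")
        if not posixpath.isabs(program):
            problems.append("ExecStart should begin with an absolute path")
    return problems


problems = check_unit(UNIT_TEXT)
```

For the unit file above, problems is an empty list. A line such as ExecStart=streamlit run epcal_dashboard.py would be reported.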
The above works directly. However, if you want to run additional commands (e.g., set up port forwarding through iptables), you may want to first create a shell script in your working directory and then have the service run that script. An example follows:
--------------------------------------------------
[Unit]
Description=EPCAL Dashboard
[Service]
User=ubuntu
WorkingDirectory=/home/ubuntu/epcal/epcal_dashboard
ExecStart=/bin/bash epcaldash_service.sh
Restart=always
[Install]
WantedBy=multi-user.target
------------------------------------------------
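where epcaldash_service.sh is as follows. Make it executable with chmod +x epcaldash_service.sh.
-------------------------
#!/bin/bash
# Redirect incoming traffic on port 80 to port 8501, where Streamlit listens by default.
# Note: -A appends a new rule on every start, so restarts add duplicate rules.
sudo iptables -A PREROUTING -t nat -p tcp --dport 80 -j REDIRECT --to-ports 8501
/home/ubuntu/miniconda3/envs/py39/bin/streamlit run /home/ubuntu/epcal/epcal_dashboard/epcal_dashboard.py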
---------------------------
------------------------------------------------------------------------------------------------------
Reload the service files to include the new service:
sudo systemctl daemon-reload
To start your service:
sudo systemctl start epcaldash.service
To check the status of your service:
sudo systemctl status epcaldash.service
To make your service start automatically on every reboot:
sudo systemctl enable epcaldash.service
To stop your service from starting automatically on reboot:
sudo systemctl disable epcaldash.service
To check the logs (replace service-name with the name of your service, e.g., epcaldash):
journalctl -u service-name.service
To check the most recent 1000 log entries:
journalctl -u service-name.service -n 1000
File descriptor limit
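Operating systems usually cap the number of file descriptors a process may have open. A Streamlit app can sometimes exceed this limit and fail with errors. A practical fix:
First, check the current per-process limit (usually 1024): ulimit -n
Second, check system-wide usage: cat /proc/sys/fs/file-nr (the three numbers are the allocated file handles, the allocated but unused handles, and the system-wide maximum)
Third, raise the limit by editing /etc/security/limits.conf (with sudo) and adding the following entries:
* soft nofile 40000
* hard nofile 40000
The value 40000 is only an example; choose one that suits your workload. These entries apply to new login sessions. A service started by systemd does not read limits.conf; its limit is set with LimitNOFILE= in the [Service] section of the unit file.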
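To see which limit a running Python process (such as the Streamlit app) actually got, the standard resource module can be queried from inside that process. This is a minimal sketch, assuming a Unix system. The helper names are hypothetical, and the open-descriptor count relies on /proc, so it is Linux-only.

```python
import os
import resource  # Unix-only standard-library module


def nofile_limits() -> tuple[int, int]:
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count() -> int:
    """Count this process's open descriptors via /proc; -1 if unavailable."""
    fd_dir = "/proc/self/fd"
    return len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else -1


soft, hard = nofile_limits()
print(f"soft={soft} hard={hard} open={open_fd_count()}")
```

If the printed soft limit is still 1024 after editing the configuration, the new setting has not reached that process. A process may also raise its own soft limit up to the hard limit with resource.setrlimit.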
See these two posts for more:
Some useful Git commands
# Clone only one branch
git clone --single-branch -b branch-name your_repo_url.git
# list all remote branches
git Check