Matthias Pietsch's code nuggets

Bits of handy code that I picked up from watching the Meister at work (Matthias Pietsch, savvy Sys Administrator of ISfN at UKE by day and cool matrix rider dude by night) and from his wonderfully copious, detailed explanations (some jotted down verbatim below!) -- collected here so you don't need to go a-diggin'

File contents and file handling

bzip up all files recursively in 2 subfolders (analysis, data) in place, skipping those already zipped

any binary symbols in a file?

delete all files more than 10 years old

remove orphan links

Resources and processes

what ports are occupied and associated processes

list your processes, sorted by increasing number of threads

check wasted resources

wasted resources: display number of threads that did nothing (did not even run a second of CPU time)

File contents and file handling

bzip up all files recursively in 2 subfolders (analysis, data) in place, skipping those already zipped

Note that bzip2 usually compresses better than gzip (i.e. gives you smaller compressed files), though it is slower. Add -v to the bzip2 call if you want it to report on each file.

find /projects/crunchie/toxo/{analysis,data} -type f ! -size -1k ! -name "*.bz2" ! -name "*.gz" ! -name "*.xz" ! -name "*.zip" -exec bzip2 -9 {} \;
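Before compressing anything in place, it can be worth previewing which files the command would touch. A minimal sketch, assuming a find that behaves like GNU find: the function name list_compress_candidates and the throwaway demo directory are made up for illustration, and the filter is the one used above with -print in place of -exec.

```shell
# Hypothetical helper: list the files the bzip2 command would compress,
# without compressing anything (-print replaces -exec bzip2 ...).
list_compress_candidates() {
    find "$@" -type f ! -size -1k \
        ! -name "*.bz2" ! -name "*.gz" ! -name "*.xz" ! -name "*.zip" -print
}

# Demo in a throwaway directory (all names below are illustrative).
demo=$(mktemp -d)
mkdir -p "$demo/analysis" "$demo/data"
head -c 2000 /dev/zero | tr '\0' 'a' > "$demo/analysis/notes.txt"  # candidate
printf 'x\n' > "$demo/data/old.csv.gz"   # skipped: already compressed
: > "$demo/data/empty.log"               # skipped: empty file
candidates=$(list_compress_candidates "$demo/analysis" "$demo/data")
printf '%s\n' "$candidates"
rm -rf "$demo"
```

Only notes.txt should be listed. Once the list looks right, swap -print back for -exec bzip2 -9 {} \; to do the real thing.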

any binary symbols in a file?

So handy for finding non-ASCII characters in Rmd or R scripts (which might keep RStudio from launching) or LaTeX/BibTeX files (which seem to pose problems for pdflatex in the bbl files created by biber).

cat <file> | LC_ALL=C tr -d "[:alnum:][:punct:][:space:]"
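To see what the filter leaves behind, and on which line the culprit sits, here is a small sketch on a throwaway sample file. The sample text is made up; the grep call is an addition that is not part of the original one-liner, and it assumes a grep that understands -a (as GNU grep does).

```shell
# Throwaway sample file: line 2 contains "é" encoded as UTF-8 (bytes c3 a9).
sample=$(mktemp)
printf 'plain ascii line\ncaf\303\251 au lait\n' > "$sample"

# Same filter as above; od shows the leftover bytes in hex.
leftover=$(LC_ALL=C tr -d '[:alnum:][:punct:][:space:]' < "$sample" \
    | od -An -tx1 | tr -d ' \n')
echo "leftover bytes: $leftover"

# Line numbers of the offending lines (-a: never treat the file as binary).
lines=$(LC_ALL=C grep -an '[^[:print:][:space:]]' "$sample" | cut -d: -f1)
echo "non-ASCII on line(s): $lines"
rm -f "$sample"
```

For this sample the leftover bytes are c3a9 and the offending line is line 2. An empty result from the tr filter means the file is plain ASCII.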

Why do some tools/apps have problems with non-ASCII characters and what is with all this charset stuff? Matthias explains:

tr (and a few other tools from coreutils) is not multi-byte character set compatible (and probably never will be). So, a character = a byte. Even simple German umlauts -- unless encoded in Latin-1 (aka ISO 8859-1) -- are a problem here: in UTF-8 the ä takes up two bytes, so byte-counting tools report 12 instead of 11. E.g. get the string length of stringlänge (11 characters):

$ echo -n stringlänge | wc -c

$ echo -n stringlänge | xxd -c1 | wc -l

or with tr:

$ echo -n stringlänge | tr -c '_' '_' | grep -Eo . | uniq -c | awk '{print $1}'

but sed works:

$ echo -n stringlänge | sed -e s/./_/g | wc -c
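The byte and character counts for stringlänge can also be checked without relying on any installed locale. A minimal sketch, assuming the word is UTF-8 encoded: it counts characters by dropping UTF-8 continuation bytes, which is a different technique from the sed call above. The variable names are made up.

```shell
# "stringlänge" written with explicit UTF-8 bytes (ä = \303\244), so the
# demo does not depend on the encoding of this file or of the terminal.
word=$(printf 'stringl\303\244nge')

# Bytes: what wc -c reports.
bytes=$(( $(printf '%s' "$word" | wc -c) ))

# Characters: drop UTF-8 continuation bytes (0x80-0xBF, octal 200-277),
# leaving exactly one byte per character, then count.
chars=$(( $(printf '%s' "$word" | LC_ALL=C tr -d '\200-\277' | wc -c) ))

echo "bytes=$bytes chars=$chars"
```

This prints bytes=12 chars=11: the extra byte is the second half of the ä.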

You can see why this happens, and how differently these tools react, when you look at what each one recognizes as an alphanumeric character -- here we replace every such character with an underscore, i.e. _

sed recognizes the umlaut with German locale and multibyte UTF-8 charset encoding:

$ echo -n stringlänge | LC_CTYPE=de_DE.UTF-8 sed -e s/[[:alnum:]]/_/g

___________

but not without the UTF-8 charset:

$ echo stringlänge | LC_CTYPE=de_DE sed -e s/[[:alnum:]]/_/g

________�___

or here, using the built-in default locale (C), which is not UTF-8 -- as confirmed here:

$ echo stringlänge | LC_CTYPE=C sed -e s/[[:alnum:]]/_/g

_______ä___

But for tr, the umlaut isn't recognized as a letter, regardless of whether we specify UTF-8 charset:

$ echo stringlänge | LC_CTYPE=de_DE.UTF-8 tr '[[:alnum:]]' _

_______ä___

$ echo stringlänge | LC_CTYPE=de_DE tr '[[:alnum:]]' _

________�___

$ echo stringlänge | LC_CTYPE=C tr '[[:alnum:]]' _

_______ä___

$ echo stringlänge | LC_CTYPE=de_DE.UTF-8 tr -c '^[[:alnum:]]\n' _

stringl__nge

because what tr actually sees is

$ echo stringlänge | LC_CTYPE=de_DE recode ..u8

stringlÃ¤nge

Since the 'Ã' byte (0xC3) gets replaced by '_' while the '¤' byte (0xA4) is left behind, the result is no longer a valid UTF-8 sequence, and you see the '�' in the output above for that orphaned byte.

But at least you get an error message (here from recode) when you do something illegal like trying to interpret something in Latin-1 as UTF-8:

$ echo stringlänge | LC_CTYPE=de_DE tr -c '^[[:alnum:]]\n' _ | recode ..u8

stringlrecode: Invalid input in step `CHAR..UTF-8'
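If recode is not installed, iconv can serve as a similar validity check. This is a swapped-in tool, not the one used above: a minimal sketch, assuming an iconv that exits non-zero on invalid input (as the glibc one does). The helper name is_valid_utf8 is made up.

```shell
# Hypothetical helper: succeed only if stdin is valid UTF-8.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 > /dev/null 2>&1
}

# Intact word: ä as the two bytes c3 a4.
if printf 'stringl\303\244nge\n' | is_valid_utf8; then
    intact=valid
else
    intact=invalid
fi

# Damaged word: the second byte of ä replaced by "_", as in the tr example.
if printf 'stringl\303_nge\n' | is_valid_utf8; then
    damaged=valid
else
    damaged=invalid
fi

echo "intact: $intact, damaged: $damaged"
```

The damaged word fails because 0xC3 announces a two-byte character and the byte that follows is not a continuation byte.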

delete all files more than 10 years old

USE VERY CAREFULLY -- better yet, list the matching files first (swap -exec rm {} \; for -print) before deleting anything

find -L "$somePath" -type f -mtime +$((10 * 365 + 2)) -exec rm {} \;
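A dry run of the age test, as a sketch on a throwaway directory. The file names are made up, and somePath is pointed at a temporary directory purely for the demo.

```shell
# Throwaway directory with one old and one fresh file (names are illustrative).
somePath=$(mktemp -d)
touch -t 200001010000 "$somePath/ancient.dat"   # mtime: 1 Jan 2000
touch "$somePath/fresh.dat"                     # mtime: now

# Dry run: same age test as the delete command, but -print instead of rm.
old_files=$(find -L "$somePath" -type f -mtime +$((10 * 365 + 2)) -print)
printf '%s\n' "$old_files"
rm -rf "$somePath"
```

Only ancient.dat is listed. The + 2 in the day count roughly accounts for the leap days in a ten-year span.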

remove orphan links

find -L path -type l -exec rm -f {} \;
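Why this only hits orphans: with -L, find follows every link it can, so the only entries that still test as type l are links whose target is gone. A small sketch on a throwaway directory, assuming GNU find; all names are made up, and -print stands in for rm.

```shell
demo=$(mktemp -d)
touch "$demo/target.txt"
ln -s "$demo/target.txt" "$demo/good_link"      # resolves
ln -s "$demo/missing.txt" "$demo/orphan_link"   # dangling

# With -L, find follows links; only links it cannot follow still test as
# type "l", so this lists the orphans and nothing else.
orphans=$(find -L "$demo" -type l -print)
printf '%s\n' "$orphans"
rm -rf "$demo"
```

Only orphan_link is reported; good_link is followed and seen as a regular file.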

Resources and processes

Under Windows, it is common practice to allocate a huge number of threads at the start because spawning processes (there is no such thing as forking) takes extremely long. Under Linux, you can fork processes, which is so extremely cheap that threads are actually implemented as processes (with relaxed access permissions to memory pages), and preallocating them at start is rarely useful or necessary.
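You can see the "threads are processes" idea on any Linux box: each thread of a process shows up as a task with its own ID under /proc. A minimal sketch, assuming a mounted Linux /proc; it looks at the current shell, which normally has just one thread.

```shell
# Linux only: every thread of a process is a task with its own ID,
# listed as a directory under /proc/<pid>/task.
pid=$$
threads=$(ls "/proc/$pid/task" | wc -l)
threads=$((threads))   # strip any padding from wc
echo "shell $pid has $threads thread(s)"

# The main thread carries the same ID as the process itself.
if [ -d "/proc/$pid/task/$pid" ]; then
    main_thread=yes
else
    main_thread=no
fi
echo "main thread listed under its own PID: $main_thread"
```

These task IDs are what ps reports in the LWP column, and their count is NLWP.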

what ports are occupied and associated processes

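Mnemonic: the Wu-Tang Clan Forever over the aeons. To see the PID and program name of each socket's owner as well, append -p (run it as root to see the processes of other users).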

netstat -wutaeon

list your processes, sorted by increasing number of threads

ps -U "$USER" -o user,pid,nlwp,%mem,%cpu,time,comm | sort -nk3,3
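If you would rather have the heaviest processes on top, a plain reverse sort would push the header line to the bottom. A sketch of one way around that: the helper name sort_by_threads_desc is made up, and the canned sample stands in for real ps output.

```shell
# Hypothetical helper: pass the header line through, then sort the remaining
# lines by column 3 (NLWP, the thread count) in decreasing order.
sort_by_threads_desc() {
    IFS= read -r header && printf '%s\n' "$header"
    sort -rnk3,3
}

# Canned sample standing in for real ps output (values are made up).
sample='USER PID NLWP COMMAND
alice 10 12 MATLAB
alice 11 1 bash
alice 12 4 R'
sorted=$(printf '%s\n' "$sample" | sort_by_threads_desc)
printf '%s\n' "$sorted"

# Real use (assumes procps ps):
#   ps -U "$USER" -o user,pid,nlwp,%mem,%cpu,time,comm | sort_by_threads_desc
```

The read call consumes only the first line from the pipe, so sort sees everything but the header.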

check wasted resources

e.g. find out what your MATLABs are actually doing over the next 15 minutes (strace attaches to the running processes and prints a system-call summary once the timeout ends), and look at the number of concurrently running threads (under Linux these are actually processes). Anything that exceeds the number of cores is just a waste of resources:

timeout 900s strace -fc $(pgrep -u <user> MATLAB | sed -re 's/^/-p /' | paste -sd' ')
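The $( ... ) part of that command turns a list of PIDs into repeated -p options for strace. A sketch of just that step, with two made-up PIDs standing in for the pgrep output:

```shell
# Two made-up PIDs stand in for the output of pgrep.
pids='101
202'

# sed prefixes each PID with "-p ", paste joins the lines with spaces.
strace_args=$(printf '%s\n' "$pids" | sed -e 's/^/-p /' | paste -sd' ' -)
printf '%s\n' "$strace_args"
```

This prints -p 101 -p 202, which is what strace expects for attaching to several processes at once.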

wasted resources: display number of threads that did nothing (did not even run a second of CPU time)

e.g. for MATLAB threads

ps -L -o user,pid,lwp,%mem,%cpu,time,cmd $(pgrep -u <user> MATLAB) | grep -E "\<00:00:00\>" | wc -l
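A slightly stricter variant of the count, as a sketch: instead of grepping the whole line, it compares only the TIME column, so a 00:00:00 that happens to appear in the command line is not counted. The helper name count_idle_threads and the canned sample are made up; the column layout is the one from the ps call above.

```shell
# Hypothetical helper: count threads whose TIME column (field 6 in the
# user,pid,lwp,%mem,%cpu,time,cmd layout) is exactly 00:00:00.
count_idle_threads() {
    awk 'NR > 1 && $6 == "00:00:00" { n++ } END { print n + 0 }'
}

# Canned sample standing in for real ps -L output (values are made up).
sample='USER PID LWP %MEM %CPU TIME CMD
alice 10 10 2.0 50.1 01:12:03 MATLAB -nodisplay
alice 10 11 2.0 0.0 00:00:00 MATLAB -nodisplay
alice 10 12 2.0 0.0 00:00:00 MATLAB -nodisplay
alice 10 13 2.0 3.2 00:00:41 MATLAB -nodisplay'
idle=$(printf '%s\n' "$sample" | count_idle_threads)
echo "idle threads: $idle"
```

For the sample this reports 2 idle threads out of 4. In real use, pipe the ps -L output into count_idle_threads in place of the grep and wc pair.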
