Bits of handy code that I picked up from watching the Meister at work (Matthias Pietsch, savvy Sys Administrator of ISfN at UKE by day and cool matrix rider dude by night) and from his wonderful copious detailed explanations (some jotted down verbatim below!) -- collected below so you don't need to go a-diggin'
Note that bzip2 usually compresses more efficiently than gzip (i.e. gives you smaller compressed files), though it takes longer to do so.
find /projects/crunchie/toxo/{analysis,data} -type f ! -size -1k ! -name "*.bz2" ! -name "*.gz" ! -name "*.xz" ! -name "*.zip" -exec bzip2 -9v {} \;   # the v (verbose) is optional
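Before letting bzip2 loose, it can help to see which files the command above would actually touch. A minimal sketch, assuming GNU find; the function name list_compress_candidates is made up for illustration, and it only prints, it does not compress anything:

```shell
# Hypothetical helper (not from the original notes): print the files that
# the find command above would hand to bzip2, without compressing them.
list_compress_candidates() {
    # "$@" = one or more directories to search.
    # With GNU find, "-size -1k" matches only empty files, so
    # "! -size -1k" skips them; the -name tests skip compressed files.
    find "$@" -type f ! -size -1k \
        ! -name "*.bz2" ! -name "*.gz" ! -name "*.xz" ! -name "*.zip" \
        -print
}
```

Usage would be something like `list_compress_candidates /projects/crunchie/toxo/{analysis,data} | less`; once the list looks right, run the real command.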
So handy for finding non-ASCII characters in Rmd or R scripts (which might keep RStudio from launching) or in LaTeX/BibTeX files (which seem to cause problems for pdflatex in the bbl files created by biber).
cat <file> | LC_ALL=C tr -d "[:alnum:][:punct:][:space:]"
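The tr one-liner prints whatever is left after removing the harmless characters, but it does not say where the offenders sit. A minimal sketch that reports line numbers instead, assuming GNU grep; find_non_ascii is a made-up helper name, not something from the original notes:

```shell
# Hypothetical helper: print "line-number:content" for every line that
# contains a byte which is neither printable ASCII nor whitespace.
find_non_ascii() {
    # LC_ALL=C makes grep work byte by byte; -a treats the file as text,
    # -n prefixes each hit with its line number.
    LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}
```

For example, `find_non_ascii report.Rmd` (file name purely illustrative) lists the lines to fix; no output and a non-zero exit status mean the file is clean.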
Why do some tools/apps have problems with non-ASCII characters and what is with all this charset stuff? Matthias explains:
tr (and a few other tools from coreutils) are not multi-byte character set compatible (and probably never will be). So, a character = a byte. Even simple German umlauts -- unless encoded in Latin1 (aka iso_8859-1) -- are a problem here, e.g. get string length of stringlänge (11 characters):
$ echo -n stringlänge | wc -c
12
or
$ echo -n stringlänge | xxd -c1 | wc -l
12
or with tr:
$ echo -n stringlänge | tr -c '_' '_' | egrep -o . | uniq -c | awk '{print $1}'
12
but sed works:
$ echo -n stringlänge | sed -e s/./_/g | wc -c
11
You can see why this happens, i.e. how differently these tools react, when you look at what each one recognizes as an alphanumeric character -- here we replace every such character with an underscore, i.e. _
sed recognizes the umlaut with German locale and multibyte UTF-8 charset encoding:
$ echo -n stringlänge | LC_CTYPE=de_DE.UTF-8 sed -e s/[[:alnum:]]/_/g
___________
but not without the UTF-8 charset:
$ echo stringlänge | LC_CTYPE=de_DE sed -e s/[[:alnum:]]/_/g
________�___
or here, using the built-in default locale (C), which is not UTF-8 -- as confirmed here:
$ echo stringlänge | LC_CTYPE=C sed -e s/[[:alnum:]]/_/g
_______ä___
But for tr, the umlaut isn't recognized as a letter, regardless of whether we specify UTF-8 charset:
$ echo stringlänge | LC_CTYPE=de_DE.UTF-8 tr '[[:alnum:]]' _
_______ä___
$ echo stringlänge | LC_CTYPE=de_DE tr '[[:alnum:]]' _
________�___
$ echo stringlänge | LC_CTYPE=C tr '[[:alnum:]]' _
_______ä___
$ echo stringlänge | LC_CTYPE=de_DE.UTF-8 tr -c '^[[:alnum:]]\n' _
stringl__nge
because what tr actually sees is
$ echo stringlänge | LC_CTYPE=de_DE recode ..u8
stringlÃ¤nge
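In Latin-1 the two bytes of the UTF-8 'ä' are read as two separate characters, 'Ã' and '¤'. Only 'Ã' counts as alphanumeric, so it becomes '_', and the byte left behind is not a valid UTF-8 sequence on its own, which is why you see the '�' in the output above.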
But at least you get an error message (here from recode) when you do something illegal, like trying to interpret something in Latin-1 as UTF-8:
$ echo stringlänge | LC_CTYPE=de_DE tr -c '^[[:alnum:]]\n' _ | recode ..u8
stringlrecode: Invalid input in step `CHAR..UTF-8'
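The byte-versus-character distinction above can be checked directly. A small sketch, assuming a POSIX shell with od and wc; the helper names are made up, and the ä is spelled as its two UTF-8 bytes (\303\244 in octal) so the snippet does not depend on how this page itself is encoded:

```shell
# Hypothetical helpers (names made up for this sketch).
# Number of bytes in 'stringlänge', independent of the locale.
byte_count() { printf 'stringl\303\244nge' | LC_ALL=C wc -c | tr -d ' '; }
# The two bytes of 'ä' as hex.
umlaut_bytes() { printf '\303\244' | od -An -tx1 | tr -d ' \n'; }

byte_count                              # 12
umlaut_bytes; echo                      # c3a4
printf 'stringl\303\244nge' | wc -m     # characters; 11 if the current locale is UTF-8
```

The first two results are what byte-oriented tools such as tr work with; the last line shows what a multibyte-aware tool counts instead.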
USE VERY CAREFULLY -- better yet, list all matching files first (replace the -exec part with -print) before deleting anything
find -L "$somePath" -mtime +$((10 * 365 + 2)) -exec rm {} \;
find -L path -type l -exec rm -f {} \;
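One way to follow the "list first" advice is to run the same find expressions with -print in place of the rm. A minimal sketch, assuming GNU find; old_files and broken_links are made-up helper names, the 10-year cutoff is copied from the command above, and the -type f test is an addition so that only regular files are listed:

```shell
# Hypothetical helper: list (not delete) regular files last modified
# more than about 10 years ago below the given directory.
old_files() {
    find -L "$1" -type f -mtime +$((10 * 365 + 2)) -print
}

# Hypothetical helper: list dangling symlinks. With -L, find follows
# links, so "-type l" only matches links whose target is missing.
broken_links() {
    find -L "$1" -type l -print
}
```

Only when the output of `old_files "$somePath"` or `broken_links path` looks right is it time to run the deleting variants.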
Under Windows, it is common practice to allocate a huge number of threads at the start because spawning processes (there is no such thing as forking) takes extremely long. Under Linux, you can fork processes, which is so cheap that threads are actually implemented as processes (with relaxed access permissions to memory pages), and preallocating them at the start is rarely useful or necessary.
Mnemonic: the Wu-Tang Clan Forever over the aeons
netstat -wutaeon
ps -U "$USER" -o user,pid,nlwp,%mem,%cpu,time,comm | sort -nk3,3
e.g. find out what your matlabs are actually doing over the next 15 minutes, and look at the number of concurrently running threads (under Linux these are actually processes). Anything that exceeds the number of cores is just a waste of resources:
timeout 900s strace -fc $(pgrep -u <user> MATLAB | sed -re 's/^/-p /' | paste -sd' ')
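To see what the command substitution hands to strace, the sed/paste part can be tried on its own. A small sketch, assuming GNU sed and paste as in the command above; printf stands in for pgrep, pid_args is a made-up name, and the PIDs 101 and 202 are invented:

```shell
# Hypothetical helper: turn one PID per line into "-p PID -p PID ...".
pid_args() {
    # prefix every PID with "-p ", then join all lines with single spaces
    sed -re 's/^/-p /' | paste -sd' '
}

# printf plays the role of pgrep here; 101 and 202 are made-up PIDs.
printf '101\n202\n' | pid_args    # -p 101 -p 202
```

So strace ends up being called as `strace -fc -p 101 -p 202`, i.e. attached to every matching process at once.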
e.g. count the MATLAB threads that have not used any CPU time so far:
ps -L -o user,pid,lwp,%mem,%cpu,time,cmd $(pgrep -u <user> MATLAB) | egrep "\<00:00:00\>" | wc -l
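The "more threads than cores" rule of thumb can be checked for a single process. A minimal sketch, assuming Linux with /proc mounted and nproc from coreutils; thread_count is a made-up helper name, and the current shell ($$) merely stands in for a MATLAB PID:

```shell
# Hypothetical helper: number of threads (light-weight processes) of a
# PID, counted from the entries in /proc/<pid>/task (Linux only).
thread_count() {
    ls "/proc/$1/task" | wc -l | tr -d ' '
}

pid=$$                     # stand-in; use a PID from pgrep in practice
threads=$(thread_count "$pid")
cores=$(nproc)
if [ "$threads" -gt "$cores" ]; then
    echo "PID $pid: $threads threads on $cores cores (more threads than cores)"
else
    echo "PID $pid: $threads threads on $cores cores"
fi
```

The count matches the nlwp column of the ps call further up, just without needing ps.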