Option 1:
cat file1name file2name | sort | uniq > outputfilename
this will sort in ascending order
to sort in descending order add the -r option to sort:
cat file1name file2name | sort -r | uniq > outputfilename
Option 2:
sort file1name file2name | uniq -u > diffLines
sort file1name file2name | uniq -d > duplicates
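A quick illustration of the difference between uniq -u and uniq -d (the file names and contents below are made up):

```shell
# toy input files, just for illustration
printf 'apple\nbanana\n'  > file1name
printf 'banana\ncherry\n' > file2name

# -u keeps lines that occur exactly once across both files
sort file1name file2name | uniq -u    # apple, cherry
# -d keeps lines that occur more than once
sort file1name file2name | uniq -d    # banana
```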
How to compare two files line by line?
comm -1 -2 <(sort first.txt) <(sort second.txt)
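comm compares two sorted files column by column; -1 -2 suppresses the lines unique to each file, leaving only the lines common to both. A small sketch with made-up contents (the <( ) process substitution needs bash or zsh):

```shell
printf 'a\nb\nc\n' > first.txt
printf 'b\nc\nd\n' > second.txt

# -1 drops lines only in first.txt, -2 drops lines only in second.txt,
# so only the lines present in both files remain
comm -1 -2 <(sort first.txt) <(sort second.txt)    # prints b and c
```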
Count the lines in a file (cut strips the filename from wc's output):
wc -l myfile.txt | cut -d' ' -f1
or capture the count in a variable:
count=`wc -l myfile.txt | cut -d' ' -f1`
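Alternatively, reading the file from stdin makes wc omit the filename, so no cut is needed (myfile.txt below is a throwaway example):

```shell
printf 'one\ntwo\nthree\n' > myfile.txt

# reading from stdin means wc prints no filename;
# the arithmetic expansion strips the padding some wc versions add
count=$(( $(wc -l < myfile.txt) ))
echo "$count"    # 3
```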
sed substitute command: match a regular expression and replace it with something else.
Changing "<www" to "<http://www" in the "mappingbased_properties_en.nt" file and saving the result as "mappingbased_properties_en_fixed.nt":
sed s/"<www"/"<http:\/\/www"/ mappingbased_properties_en.nt > mappingbased_properties_en_fixed.nt
Replace every "/" with "," (the trailing g makes the substitution global within each line):
sed -e 's/\//,/g' results.csv > results-all.csv
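A note on the trailing g: without it, sed replaces only the first match on each line; with it, every match. For example:

```shell
echo 'a/b/c' | sed 's/\//,/'     # only the first slash: a,b/c
echo 'a/b/c' | sed 's/\//,/g'    # every slash: a,b,c
```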
split -l 1000 file.nt
this will split the file into separate files of 1000 lines each (named xaa, xab, ...)
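For example, splitting a throwaway 2500-line file yields two 1000-line chunks and one with the remaining 500 lines:

```shell
seq 2500 > file.nt       # a made-up 2500-line file
split -l 1000 file.nt    # produces xaa, xab, xac
wc -l xaa xab xac        # 1000, 1000, 500
```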
add ">" at the end of each line ($ is a regex for the end of the line):
sed 's/$/>/' myfile
This will not modify the file. To modify the file add option -i:
sed -i 's/$/>/' myfile
OR
save it to another file:
sed 's/$/>/' myfile > anotherfile
add "<" at the beginning of each line (^ is a regex for the beginning of the line):
sed 's/^/</' myfile
This will not modify the file. To modify the file add option -i:
sed -i 's/^/</' myfile
OR
save it to another file:
sed 's/^/</' myfile > anotherfile
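The two substitutions can also be combined in one sed call, wrapping each line in <...> (myfile below is a throwaway example):

```shell
printf 'foo\nbar\n' > myfile

# two substitutions separated by ';': prepend '<', then append '>'
sed 's/^/</; s/$/>/' myfile    # <foo>, <bar>
```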
add an extension (here .n3) to every file in the dir:
for i in *.*; do mv "$i" "$i.n3"; done
merge content of file1 with all other files in your dir (e.g. add a fixed header with file1 content at the top of each file in your dir):
for i in *.n3; do cat file1 "$i" > "$i.withhead.n3"; done
convert ASCII escape sequences to UTF-8 in all .n3 files (ascii2uni comes with the uni2ascii package):
for i in *.n3; do ascii2uni "$i" > "$i.utf8.n3"; done
e.g. if you have a string with ...."..."..."..".....
and need to convert it to:
...."...'...'..".....
do this (the '\'' sequence ends the single-quoted string, emits a literal apostrophe, and reopens it):
for i in *.n3; do sed 's/\(".*\)"\(.*\)"\(.*"\)/\1'\''\2'\''\3/g' "$i" > "$i.quotesfixed.n3"; done
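A self-contained check of this quote-to-apostrophe substitution on a single sample string:

```shell
# the two inner double quotes become apostrophes
echo '"abc"def"ghi"' | sed 's/\(".*\)"\(.*\)"\(.*"\)/\1'\''\2'\''\3/g'
# "abc'def'ghi"
```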
replace every backslash with a space in all .n3 files:
for i in *.n3; do sed 's/\\/ /g' "$i" > "$i.bfixed.n3"; done
e.g. if you have a string with ...."..."..".....
and need to convert it to:
....".....".....
i.e. remove the middle one
for i in *.n3; do sed 's/\(".*\)"\(.*"\)/\1\2/g' "$i" > "$i.3qfixed.n3"; done
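The same idea on a single sample string, with a variant that deletes the middle quote outright:

```shell
# the middle double quote is removed
echo '"abc"def"' | sed 's/\(".*\)"\(.*"\)/\1\2/g'
# "abcdef"
```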
count the number of lines of each file with extension .txt in the specified dir:
find /thepathtothedir -maxdepth 1 -name "*.txt" -print0 | xargs -0 -n 1 wc -l
to put that list into a file use:
find . -maxdepth 1 -name "*.txt" -print0 | xargs -0 -n 1 wc -l > lines.txt
Count the files with 0 lines:
grep "^0 " lines.txt | wc -l
or with exactly 10 lines (the trailing space keeps 100, 1000, etc. from matching):
grep "^10 " lines.txt | wc -l
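A quick sanity check with a made-up lines.txt (grep -c is shorthand for grep | wc -l):

```shell
printf '0 empty.txt\n10 ten.txt\n100 big.txt\n' > lines.txt

grep -c "^0 "  lines.txt    # 1  (only the zero-line file)
grep -c "^10 " lines.txt    # 1  (the trailing space excludes 100)
grep -c "^10"  lines.txt    # 2  (without the space, 100 also matches)
```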
http://www.tutorialspoint.com/unix/unix-regular-expressions.htm
Input file: expected.constraints in a form:
name Person:0.9 Organisation:0.1
awk '{print $1,$2}' expected.constraints | tr ':' ' ' | sort -k 3 -rn | column -t | less
Output:
name Person 0.9
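Step by step on a one-line made-up input: awk keeps the first two columns, tr splits the label:score pair, and sort -rn orders by the numeric third column:

```shell
printf 'name Person:0.9 Organisation:0.1\n' > expected.constraints

awk '{print $1,$2}' expected.constraints | tr ':' ' ' | sort -k 3 -rn
# name Person 0.9
```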
It is very simple to read a CSV file using AWK, e.g.:
awk -F "\"*,\"*" '{print $1 "\t" $2 "\t" $3}' test.csv
where your test.csv file looks like this:
first,second,third
1,2,3
4,5,6
7,8,9
If instead the command prints only the first line:
first,second,third
it is very likely that you need to fix your line endings as they might be CR.
To check your line endings do:
file test.csv
If what you get is:
test.csv: ASCII text, with CR line terminators
you will need to remove CR line terminators.
You can use the dos2unix package for that. If you don't have it installed, you can get it using brew (Homebrew should not be run with sudo):
brew install dos2unix
Since the file has CR line terminators, convert it with mac2unix (part of the dos2unix package):
mac2unix test.csv
mac2unix: converting file test.csv to Unix format...
Now run your awk script again, and it will work.
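Putting it together, the AWK CSV reader on the sample file from above (output is tab-separated):

```shell
printf 'first,second,third\n1,2,3\n4,5,6\n' > test.csv

# -F "\"*,\"*" treats a comma with optional surrounding quotes as the separator
awk -F "\"*,\"*" '{print $1 "\t" $2 "\t" $3}' test.csv
```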
---
rewrite a TSV, replacing the first column with two zero columns and a fixed label:
awk -F'\t' 'BEGIN{OFS="\t"}{print 0,0,"somelabel",$2,$3}' input.tsv > output.tsv
---
awk -F'\t' '{print $3}' input.tsv |cut -d' ' -f 3-
Assume you have the format
0 label some text with words
The command above will print 'some text with words'.
To strip all spaces and tabs, use tr -d '[:blank:]'.
For example, in a file that reads
TOPIC#: CARS
the below command will extract CARS without any spaces:
topicid=`grep "TOPIC#:" "$i" | awk -F':' '{print $2}' | tr -d '[:blank:]' `
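A self-contained version of the TOPIC# extraction (the input file name below is made up; in the original loop "$i" would supply it):

```shell
printf 'TOPIC#: CARS\nsome other line\n' > topicfile.txt

# grep picks the line, awk takes everything after the ':',
# tr -d '[:blank:]' strips the leftover spaces and tabs
topicid=$(grep "TOPIC#:" topicfile.txt | awk -F':' '{print $2}' | tr -d '[:blank:]')
echo "$topicid"    # CARS
```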