Python is excellent at manipulating textual data, and it provides many built-in functions and string methods for the purpose. Let us look at some of them.
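Each character in a string has an index. Indexing starts at 0 from the left and at -1 from the right, as shown here for the string 'Hello Python' (the character at index 5 is a space):
   0   1   2   3   4   5   6   7   8   9  10  11
   H   e   l   l   o       P   y   t   h   o   n
 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1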
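As a small illustration of the index table, the sketch below reads single characters and slices from both ends of the same string. The variable names are chosen only for this example.

```python
greeting = 'Hello Python'

first_char = greeting[0]      # index 0 is the first character
last_char = greeting[-1]      # index -1 is the last character
second_word = greeting[6:]    # from index 6 up to the end
same_word = greeting[-6:]     # the last six characters, counted from the right

print('%s %s %s %s' % (first_char, last_char, second_word, same_word))
```

Positive and negative indices are two ways of naming the same positions, so greeting[6:] and greeting[-6:] give the same slice here.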
str1 = 'cell'
str1.replace('c', 's')   # Replace 'c' with 's', giving 'sell'
str1[:2]                 # The first two letters, 'ce'
str1[1:]                 # The letters after the first letter, 'ell'
str1.count('l')          # Count the occurrences of 'l', giving 2
str1.find('l')           # Index of the first 'l', giving 2
str1.lower()             # Convert all letters to lower case
str1.upper()             # Convert all letters to upper case
str1.title()             # Convert to title case, giving 'Cell'
str1.rjust(20)           # Right-justify the string in a field 20 characters wide
str1.ljust(20)           # Left-justify the string in a field 20 characters wide
Python also supports format strings, in which values are substituted into a string with the % operator. This is called the formatting operator. %s is used to insert a string and %d is used to insert an integer.
"Today is %s January, of %d" % ('first', 2013)
The regular expression module adds much more text-processing functionality to Python. It can be imported by using the command 'import re'.
import re

pattern = 'Year'
text = 'Happy New Year'

# Search for the pattern anywhere in the text
match = re.search(pattern, text)
s = match.start()
e = match.end()

print('Found "%s" in "%s" from %d to %d ("%s")' %
      (match.re.pattern, match.string, s, e, text[s:e]))
The output of the above program is
Found "Year" in "Happy New Year" from 10 to 14 ("Year")
Here match.re.pattern refers to the pattern that was searched for, and match.string refers to the string that was searched. The methods match.start() and match.end() give the start and end positions of the match within that string, so the matched text can easily be extracted with the slice text[s:e].
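One detail worth keeping in mind is that re.search returns None when the pattern is not found, so calling match.start() on the result would fail. The sketch below guards against that case; the helper name describe_match is hypothetical and used only for illustration.

```python
import re

def describe_match(pattern, text):
    """Return a short description of the first match, or a 'not found' message."""
    match = re.search(pattern, text)
    if match is None:
        # re.search returns None when the pattern does not occur in the text
        return 'No match for "%s"' % pattern
    s, e = match.start(), match.end()
    return 'Found "%s" from %d to %d' % (text[s:e], s, e)

print(describe_match('Year', 'Happy New Year'))
print(describe_match('Month', 'Happy New Year'))
```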
Now let us see how every occurrence of a pattern in a longer text can be found by using a Python regular expression. Try the following program:
import re

text = 'abbabababbbbaaaaa'
pattern = 'ab'

# re.findall returns every non-overlapping match as a string
for match in re.findall(pattern, text):
    print('Found "%s"' % match)
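re.findall returns only the matched text. If the position of each match is also needed, re.finditer can be used instead, since it yields match objects. The following is a minimal sketch on the same text.

```python
import re

text = 'abbabababbbbaaaaa'
pattern = 'ab'

# re.finditer yields match objects, so the start index of each match is available
positions = [match.start() for match in re.finditer(pattern, text)]
print('Found "%s" %d times, starting at %s' % (pattern, len(positions), positions))
```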
Here is another program, which finds every place where the percentage symbol (%) is present:
import re

data = 'Reservation is 30% not in 13%'

# \S+% means one or more non-whitespace characters followed by '%'
find_percentage = re.compile(r'\S+%')
print(find_percentage.findall(data))
In the above program, re.compile compiles the regular expression pattern into a regular expression object, which can then be used for matching with its match(), search() and findall() methods.
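A compiled pattern is convenient when the same expression has to be applied to many strings. The sketch below reuses one compiled object over a small list; the sample sentences are made up for this example.

```python
import re

find_percentage = re.compile(r'\S+%')

sentences = ['Reservation is 30% not in 13%', 'No numbers here', 'Up 5% today']

# The same compiled object is applied to each sentence in turn
results = [find_percentage.findall(sentence) for sentence in sentences]
print(results)
```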
import re

data = 'abc'  # Try different values for data: 'cde', 'fgh', etc.
Regex1 = re.compile('^(abc|cde|fgh|ijk|ml)+$')
print(Regex1.findall(data))
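The above program checks whether the string stored in data is made up entirely of one or more of the texts 'abc', 'cde', 'fgh', 'ijk' or 'ml'. The whole string must fit, because the pattern is anchored with ^ at the beginning and $ at the end.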
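To see how this pattern behaves, it can be tried on several values at once. The sketch below uses the same expression; the list of test values is an assumption made for illustration.

```python
import re

regex1 = re.compile('^(abc|cde|fgh|ijk|ml)+$')

outcomes = {}
for data in ['abc', 'abccde', 'mlml', 'abcx', 'xyz']:
    # With one group in the pattern, findall returns the text captured by
    # that group, which is the last repetition that matched
    outcomes[data] = regex1.findall(data)
    print('%-8s -> %s' % (data, outcomes[data]))
```

An empty list means that the string as a whole does not fit the pattern, as happens for 'abcx', where the extra 'x' prevents a match.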
The examples in this section use several regular expression functions: re.match, re.search, re.findall and re.compile. There are some differences between them. For a given string and pattern, re.match checks for a match only at the beginning of the string, whereas re.search checks for a match anywhere in the string. re.findall returns a list of all non-overlapping matches, as described in the documentation.
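The difference between these functions is easiest to see when they are applied to the same string. The following is a minimal sketch; the variable names are chosen only for this example.

```python
import re

text = 'Happy New Year'

at_start = re.match('New', text)        # None, because 'New' is not at the beginning
anywhere = re.search('New', text)       # a match object, because 'New' occurs later
all_found = re.findall('[A-Z]', text)   # a list with every capital letter

print('match: %s' % at_start)
print('search: found at index %d' % anywhere.start())
print('findall: %s' % all_found)
```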
The power of regular expressions is that complicated searching and string processing can be done in two or three lines of code. The most basic application is matching a single character, that is, finding whether a particular letter is present in a sentence. We can also match a particular set of letters by using square brackets (for example, [a-z]); this is called matching "character classes".
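Both cases can be sketched in a few lines. The sample sentence below is made up for this example.

```python
import re

sentence = 'Python 3 is fun'

has_y = re.search('y', sentence) is not None   # is the letter 'y' present?
lowercase = re.findall('[a-z]', sentence)      # every lowercase letter
digits = re.findall('[0-9]', sentence)         # every digit

print('%s %s %s' % (has_y, ''.join(lowercase), digits))
```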
We can match more complex patterns by using particular symbols and operators. Parts of a pattern can also be grouped with parentheses for better abstraction. The symbol for zero or more repetitions is "*", one or more is denoted by "+", and zero or one is indicated by "?".
import re

string = "hello11111world"

# The group repeats: optional letters d to w, then one or two digits
if re.search(r"hello([d-w]*\d\d?)+world", string):
    print("Match!")
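In the above program, we are interested only in the first and last parts of the string, 'hello' and 'world'. The group in the middle accepts one or more runs of optional letters (from d to w), each followed by one or two digits.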
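The three repetition symbols can be compared side by side on a small word list. This is a sketch with made-up words, in which each pattern differs only in the symbol placed after the letter 'a'.

```python
import re

words = ['ct', 'cat', 'caat', 'caaat']

zero_or_more = [w for w in words if re.match('ca*t$', w)]   # 'a' repeated 0 or more times
one_or_more = [w for w in words if re.match('ca+t$', w)]    # 'a' repeated 1 or more times
zero_or_one = [w for w in words if re.match('ca?t$', w)]    # 'a' present 0 or 1 time

print(zero_or_more)
print(one_or_more)
print(zero_or_one)
```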
Several methods can be used to validate an e-mail ID. Let us use a regular expression to check whether a particular e-mail ID is valid or not.
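import re

def validate(email):
    # Something before '@', then a domain name ending in 2-3 letters,
    # or an address ending in 1-3 digits, optionally inside square brackets
    if re.match("^.+\\@(\\[?)[a-zA-Z0-9\\-\\.]+\\.([a-zA-Z]{2,3}|[0-9]{1,3})(\\]?)$", email):
        return "This is a correct E-mail ID"
    return "This is not a correct E-mail ID"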
After running the program, try:
validate("dsfklsd@gmail.com")
Then try it with some other strings.
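A few strings can be checked in one go with a loop. The sketch below uses the same regular expression, wrapped in a hypothetical helper named is_valid_email that returns True or False instead of a message; the sample addresses are made up.

```python
import re

# Same pattern as in validate(); the names EMAIL_PATTERN and is_valid_email
# are assumptions made for this example
EMAIL_PATTERN = re.compile(
    "^.+\\@(\\[?)[a-zA-Z0-9\\-\\.]+\\.([a-zA-Z]{2,3}|[0-9]{1,3})(\\]?)$")

def is_valid_email(email):
    """Return True if the string fits the pattern, otherwise False."""
    return EMAIL_PATTERN.match(email) is not None

for candidate in ['dsfklsd@gmail.com', 'not-an-email',
                  'user@[192.168.0.1]', 'name@site.info']:
    print('%-20s %s' % (candidate, is_valid_email(candidate)))
```

The last address is rejected because the pattern allows only two or three letters after the final dot, which shows that such a pattern is a rough check rather than a complete validator.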
Regular expressions can also be used to find a particular string, or a pattern of text, in a huge text file. Linguists and bioinformaticians, who are regular users of regular expressions, can exploit the potential of Python and its regular expression module.
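As a rough sketch of such a search, the program below scans text line by line and records where a pattern occurs. To keep the example self-contained, io.StringIO stands in for a real file; the sample lines and the pattern 'ATG' are assumptions chosen for illustration.

```python
import io
import re

# io.StringIO behaves like an open text file; with a real file,
# open('some_file.txt') could be used in its place
sample_file = io.StringIO(u"ATGCGTTAG\nno sequence on this line\nGGATGAAATGA\n")

start_codon = re.compile('ATG')

hits = []
for line_number, line in enumerate(sample_file, 1):
    # finditer gives the position of every match on the current line
    for match in start_codon.finditer(line):
        hits.append((line_number, match.start()))

print(hits)
```

Reading one line at a time keeps memory use small, which matters when the file is very large.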
Regular expression    Usage
^                     To match the beginning of a string
$                     To match the end of a string
*                     To match a pattern repeated zero or more times
+                     To match a pattern repeated one or more times
\b                    To match a word boundary
\d                    To match any numeric digit
\D                    To match any non-numeric character
\s                    To match any whitespace character (blank space, tab, etc.)
\S                    To match any non-whitespace character
\w                    To match any alphanumeric character and the underscore
(a|b|c)               To match exactly one of a, b or c
{n}                   To match exactly n times
x{n,m}                To match character x at least n times, but not more than m times
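Several entries of the table can be tried on one sample string. The following is a minimal sketch; the sample text and the variable names are made up for this example.

```python
import re

text = 'Order 66 shipped on 2013-01-01'

starts_with_order = bool(re.search(r'^Order', text))   # ^ anchors at the beginning
ends_with_digit = bool(re.search(r'\d$', text))        # $ anchors at the end
numbers = re.findall(r'\d+', text)                     # + means one or more digits
year = re.findall(r'\b\d{4}\b', text)                  # {4} means exactly four digits
short_numbers = re.findall(r'\b\d{1,2}\b', text)       # {1,2} means one or two digits
words = re.findall(r'\w+', text)                       # runs of word characters

print(numbers)
print(year)
print(short_numbers)
print(words)
```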