Computer knowledge

OOP (Object Oriented Programming)

Regular expression syntax (in Python 3)

Translated from English into Chinese by Sophie hanfen Zang

This module provides regular expression matching operations similar to those found in Perl.

Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes). However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
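A minimal sketch of the str/bytes rule, using a made-up sample text ("order 42") purely for illustration: matching works when pattern and subject have the same type, and mixing them raises a TypeError.

```python
import re

# str pattern against str text, and bytes pattern against bytes text: both work.
text_match = re.search(r"\d+", "order 42")
bytes_match = re.search(rb"\d+", b"order 42")
print(text_match.group())   # the matched digits as a str
print(bytes_match.group())  # the matched digits as bytes

# Mixing a bytes pattern with a str subject is rejected with TypeError.
try:
    re.search(rb"\d+", "order 42")
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print("TypeError:", exc)
```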

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. Also, please note that any invalid escape sequence in a Python string literal now generates a DeprecationWarning (a SyntaxWarning since Python 3.12), and in the future this will become a SyntaxError. This happens even if the sequence is a valid escape sequence for a regular expression.

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
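The difference between raw and regular string literals can be checked directly. The following is only a small sketch; the Windows-style path is an invented sample value.

```python
import re

# A raw string keeps the backslash: two characters, '\' and 'n'.
assert len(r"\n") == 2
# A regular string literal turns \n into a single newline character.
assert len("\n") == 1

# Both spellings produce the same two-character pattern \\,
# which matches one literal backslash.
plain_pattern = "\\\\"
raw_pattern = r"\\"
assert plain_pattern == raw_pattern

path = "C:\\temp"  # the text C:\temp, containing exactly one backslash
m = re.search(raw_pattern, path)
print(m.span())    # position of the backslash inside path
```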

It is important to note that most regular expression operations are available as module-level functions and methods on compiled regular expressions. The functions are shortcuts that don’t require you to compile a regex object first, but miss some fine-tuning parameters.
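As an illustration of the two styles, the sketch below uses the pos argument of Pattern.search as one example of a fine-tuning parameter that the module-level function does not accept. The sample text is invented.

```python
import re

text = "id 7, id 42"

# Module-level shortcut: no explicit compile step.
shortcut = re.search(r"\d+", text)

# Compiled pattern: the method form also accepts a start position.
number = re.compile(r"\d+")
from_start = number.search(text)
from_pos = number.search(text, 5)  # begin scanning at index 5

print(shortcut.group(), from_start.group(), from_pos.group())
```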

See also: the third-party regex module, which has an API compatible with the standard library re module, but offers additional functionality and more thorough Unicode support.

Regular Expression Syntax

Repetition qualifiers (*, +, ?, {m,n}, etc) cannot be directly nested. This avoids ambiguity with the non-greedy modifier suffix ?, and with other modifiers in other implementations. To apply a second repetition to an inner repetition, parentheses may be used. For example, the expression (?:a{6})* matches any multiple of six 'a' characters.
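A quick check of the (?:a{6})* example, written as a sketch with fullmatch so that the whole string has to be consumed:

```python
import re

six_as = re.compile(r"(?:a{6})*")

# Lengths that are multiples of six (including zero) match completely.
matches_12 = six_as.fullmatch("a" * 12) is not None
matches_0 = six_as.fullmatch("") is not None
# Seven characters cannot be split into groups of six.
matches_7 = six_as.fullmatch("a" * 7) is not None

print(matches_12, matches_0, matches_7)
```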

The special characters are:

.

(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
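A small sketch of the dot with and without re.DOTALL, on an invented two-line string:

```python
import re

text = "ab\ncd"

# Default mode: '.' stops at the newline.
default_match = re.match(r".+", text)
# DOTALL: '.' also consumes the newline, so the match runs to the end.
dotall_match = re.match(r".+", text, re.DOTALL)

print(repr(default_match.group()), repr(dotall_match.group()))
```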

^

(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.
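The effect of MULTILINE on the caret can be sketched with findall on an invented three-line string:

```python
import re

text = "one\ntwo\nthree"

# Without MULTILINE, ^ anchors only at the very start of the string.
first_only = re.findall(r"^\w+", text)
# With MULTILINE, ^ also anchors right after each newline.
every_line = re.findall(r"^\w+", text, re.MULTILINE)

print(first_only, every_line)
```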

$

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
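The $ cases described above can be reproduced with the following sketch, which uses the same strings as the text:

```python
import re

text = "foo1\nfoo2\n"

# Normal mode: $ matches only at the end, or just before a final newline.
normal = re.search(r"foo.$", text).group()
# MULTILINE mode: $ also matches before every newline, so the first line wins.
multiline = re.search(r"foo.$", text, re.MULTILINE).group()

# A bare $ in 'foo\n' yields two empty matches: before the newline and at the end.
dollar_positions = [m.start() for m in re.finditer(r"$", "foo\n")]

print(normal, multiline, dollar_positions)
```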

*

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

+

Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

?

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.
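To compare *, + and ? side by side, the sketch below runs each pattern with fullmatch over a few invented candidate strings and records which ones are accepted.

```python
import re

candidates = ("a", "ab", "abbb")
accepted = {}

for pattern in (r"ab*", r"ab+", r"ab?"):
    # fullmatch requires the pattern to cover the entire candidate string.
    accepted[pattern] = [s for s in candidates if re.fullmatch(pattern, s)]

for pattern, strings in accepted.items():
    print(pattern, strings)
```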

*?, +?, ??

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.
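A short sketch of greedy versus non-greedy matching on the same '<a> b <c>' string; the findall call is an extra illustration that is not in the original text.

```python
import re

text = "<a> b <c>"

# Greedy: .* runs as far as possible, so the match ends at the last '>'.
greedy = re.match(r"<.*>", text).group()
# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.match(r"<.*?>", text).group()
# With the non-greedy form, each tag is found separately.
all_tags = re.findall(r"<.*?>", text)

print(greedy, lazy, all_tags)
```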

{m}

Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.

{m,n}

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match 'aaaab' or a thousand 'a' characters followed by a 'b', but not 'aaab'. The comma may not be omitted or the modifier would be confused with the previously described form.
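The counted forms {m} and {m,n} can be checked with a sketch like the following, which mirrors the examples in the text:

```python
import re

# {m}: exactly six 'a' characters, not five.
exactly_six = re.fullmatch(r"a{6}", "a" * 6) is not None
only_five = re.fullmatch(r"a{6}", "a" * 5) is not None

# {m,n}: greedy, so it takes the upper bound when enough input is available.
three_to_five = re.match(r"a{3,5}", "a" * 8).group()

# {m,}: no upper bound; at least four 'a' characters are required before 'b'.
four_plus = [s for s in ("aaab", "aaaab", "a" * 1000 + "b")
             if re.fullmatch(r"a{4,}b", s)]

print(exactly_six, only_five, three_to_five, len(four_plus))
```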

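{m,n}?

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.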
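As a sketch of how the non-greedy counted form {m,n}? differs from {m,n}, both can be run against the same six-character string:

```python
import re

text = "aaaaaa"

# Greedy: takes as many repetitions as allowed (up to five here).
greedy = re.match(r"a{3,5}", text).group()
# Non-greedy: settles for the minimum of three repetitions.
lazy = re.match(r"a{3,5}?", text).group()

print(len(greedy), len(lazy))
```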




Because backslashes in regular expressions could be mistaken for escape sequences (like \n), it is best to use Python’s raw string notation for regular expression patterns; otherwise pytest will warn with DeprecationWarning: invalid escape sequence. Just as format strings are prefixed with f, raw strings are prefixed with r. For instance, instead of "harvard\.edu", use r"harvard\.edu".
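To see why the dot is escaped in r"harvard\.edu", here is a brief sketch; the addresses are invented sample values. An unescaped dot would accept any character in that position.

```python
import re

escaped = r"harvard\.edu"    # \. matches only a literal dot
unescaped = r"harvard.edu"   # . matches any character except a newline

good = "user@harvard.edu"    # hypothetical address
bad = "user@harvard?edu"     # hypothetical malformed address

escaped_good = re.search(escaped, good) is not None
escaped_bad = re.search(escaped, bad) is not None
unescaped_bad = re.search(unescaped, bad) is not None

print(escaped_good, escaped_bad, unescaped_bad)
```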


Match.groups(default=None)

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.

For example:

>>> import re

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")

>>> m.groups()

('24', '1632')

Here \d stands for a decimal digit, and '+' means one or more repetitions of the preceding item; the r prefix marks a raw string, as covered earlier.


If we make the decimal place and everything after it optional, not all groups might participate in the match. These groups will default to None unless the default argument is given:

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")

>>> m.groups() # Second group defaults to None.

('24', None)

>>> m.groups('0') # Now, the second group defaults to '0'.

('24', '0')


Match.groupdict(default=None)

Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None. For example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")

>>> m.groupdict()

{'first_name': 'Malcolm', 'last_name': 'Reynolds'}
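The default argument of groupdict is not exercised by the example above. The sketch below makes the last name optional (a variation on the pattern, not part of the original example) to show a named group that does not participate in the match.

```python
import re

# The last name sits inside an optional non-capturing group.
pattern = r"(?P<first_name>\w+)(?: (?P<last_name>\w+))?"
m = re.match(pattern, "Malcolm")

without_default = m.groupdict()          # missing group reported as None
with_default = m.groupdict("unknown")    # missing group replaced by the default

print(without_default)
print(with_default)
```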


Match.start([group])

Match.end([group])

Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is

m.string[m.start(g):m.end(g)]


Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.
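The values quoted for 'b(c?)' against 'cba' can be verified with a sketch like this one. The second pattern, a(x)?, is an extra illustration of the -1 return value for a group that exists but did not contribute to the match.

```python
import re

m = re.search(r"b(c?)", "cba")
whole_span = (m.start(0), m.end(0))   # 'b' sits at index 1
group_span = (m.start(1), m.end(1))   # empty match right after 'b'

# Slicing with start/end gives the same text as m.group(g).
same_text = m.string[m.start(1):m.end(1)] == m.group(1)

# Asking for a group that is not in the pattern raises IndexError.
try:
    m.start(2)
    missing_group_error = False
except IndexError:
    missing_group_error = True

# Group 1 exists here but does not participate, so start() returns -1.
m2 = re.search(r"a(x)?", "cba")
unused_start = m2.start(1)

print(whole_span, group_span, same_text, missing_group_error, unused_start)
```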

An example that will remove remove_this from email addresses:

>>> email = "tony@tiremove_thisger.net"

>>> m = re.search("remove_this", email)

>>> email[:m.start()] + email[m.end():]

'tony@tiger.net'