Corpus Query Language

Short link to this page: http://bit.ly/MAELT_CQL

These notes explain how to build detailed searches of the British National Corpus (BNC) in the Word Sketch Engine. The BNC uses the CLAWS tag set, whereas most other corpora use the Penn Treebank tag set. The Sketch Engine now also has a version of the BNC tagged with Penn Treebank tags, which eliminates the "anomalies" described at the bottom of this page.

This page originally appeared here.

Operators

The following operators can be used in all fields: lemma, phrase, word form, CQL.

Capitals

If you type Doctor in lemma, you only get Doctor.

However, doctor in lemma gives all possibilities, upper and lower case.

To have doctor only, type it into word form and choose Match Case.

Corpus Query Language

In the Word Sketch Engine, queries can be generated automatically by typing words into the lemma, phrase, left context and other fields. Typing your own queries gives you greater control over what you search for.

Each element of a query is enclosed in square brackets, [ ], and stands for one word in the text. You can type a long string of elements.

Specific search items, usually words and tags, are enclosed in quotation marks: " "

Here is an example of a query that includes many of the elements illustrated below. It searches the BNC for the lemma "bias" followed by either "towards" or "toward", then one to three words of any kind, then a noun.

This is typed into the CQL field: [lemma = "bias"] [word = "towards|toward"] []{1,3}[tag= "NN."]
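
When many variants of a query like this are needed, the string can be assembled programmatically before it is pasted into the CQL field. The following is a minimal Python sketch, not a Sketch Engine feature: the helper names element and gap are hypothetical, and it assumes that spaces between elements make no difference to the query.

```python
def element(**conditions):
    """Build one bracketed element, e.g. element(lemma="bias")."""
    # Several conditions on the same word are joined with an ampersand.
    body = " & ".join(f'{attr} = "{value}"' for attr, value in conditions.items())
    return f"[{body}]"


def gap(minimum, maximum):
    """Build an 'any word' element repeated minimum..maximum times."""
    return f"[]{{{minimum},{maximum}}}"


query = " ".join([
    element(lemma="bias"),
    element(word="towards|toward"),
    gap(1, 3),
    element(tag="NN."),
])
print(query)
```

The printed string has the same elements, in the same order, as the query typed by hand above.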

[picture of the search result missing => see original page]

A word

Create a query searching for a particular word. Use lower case.

[word = "untoward"]

To search for more than one word, separate the alternatives with a vertical bar (|).

[word = "amid|amidst"]

[word = "struggle|battle|fight"]
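
The vertical bar behaves much like alternation in a regular expression. As a rough analogy, and assuming that the text between the quotation marks has to match the whole word, Python's re.fullmatch can mimic which words a query would accept. This is an illustration only, not a description of how the Sketch Engine works internally; the function name accepts and the word list are made up for the example.

```python
import re


def accepts(pattern, word):
    """Return True if the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None


words = ["amid", "amidst", "amidships", "mid"]
matched = [w for w in words if accepts("amid|amidst", w)]
print(matched)  # ['amid', 'amidst']
```

Under this assumption, longer words that merely contain one of the alternatives, such as amidships, are not returned.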

Part of speech

Create a query searching for a particular part of speech (POS). Use UPPER CASE.

Lists of POS tags can be found here: CLAWS, Penn Treebank

The general form of a tag query is:

[tag = " "]

Note: a full stop stands for any single character in a tag. All verb tags, for example, start with V.

The second character is B for the verb to be, H for to have, D for to do, M for modals and V for lexical verbs.

The third character is B for the base form, D for the past tense, N for the past participle, G for the -ing form, I for the infinitive and Z for the third person singular.

For example:

[tag = "V.."] searches for all verbs in all forms

[tag = "VV."] searches for all lexical verbs in all forms

[tag = "VD."] searches for all forms of the verb to do

[tag = "V.N"] searches for the past participle forms of all verbs
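
To see what the full stop does, the four patterns above can be tried against a short list of tags. This Python sketch assumes the full stop works as in ordinary regular expressions, with the pattern matching the whole tag. The list is a small hand-picked sample of CLAWS tags, not the full tag set, and the function name select is hypothetical.

```python
import re

# A small sample of CLAWS tags: seven verb tags, one noun, one adjective.
tags = ["VVB", "VVD", "VVN", "VBN", "VDZ", "VHD", "VM0", "NN1", "AJ0"]


def select(pattern):
    """Return the sample tags that the pattern matches in full."""
    return [t for t in tags if re.fullmatch(pattern, t)]


print(select("V.."))  # every verb tag in the sample
print(select("VV."))  # lexical verbs only
print(select("VD."))  # forms of to do
print(select("V.N"))  # past participles of any verb
```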

Lemma

Create a query searching for a particular lemma ...

[lemma = "impact"]

... or particular lemmas.

[lemma = "struggle|battle|fight"]

Combining elements

"impact" is both a noun and a verb. To search for the lemma with a specific POS, join the two conditions with an ampersand (&).

[lemma = "impact" & tag = "V.."]

[lemma = "criterion" & tag = "NN2"] finds the noun criterion in the plural

Be careful to:

    • put each search item between quotation marks

    • type words in lower case

    • type tags in upper case
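
The effect of the ampersand can be pictured as a test that a single word must pass on every condition at once. The Python sketch below models each word as a small dictionary and checks all conditions against it. The tokens are hand-made examples with CLAWS tags rather than real corpus output, and the function name matches is hypothetical.

```python
import re


def matches(token, **conditions):
    """True if every attribute condition matches the token in full."""
    return all(
        re.fullmatch(pattern, token[attr]) is not None
        for attr, pattern in conditions.items()
    )


# Hand-made sample tokens; tags follow CLAWS.
tokens = [
    {"word": "impact", "lemma": "impact", "tag": "NN1"},
    {"word": "impacted", "lemma": "impact", "tag": "VVD"},
    {"word": "criteria", "lemma": "criterion", "tag": "NN2"},
    {"word": "criterion", "lemma": "criterion", "tag": "NN1"},
]

verbs = [t["word"] for t in tokens if matches(t, lemma="impact", tag="V..")]
plurals = [t["word"] for t in tokens if matches(t, lemma="criterion", tag="NN2")]
print(verbs)    # ['impacted']
print(plurals)  # ['criteria']
```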

Strings of elements

What prepositions follow impact?

[lemma = "impact"] [tag = "PRP"]

What prepositions follow the noun impact?

[lemma = "impact" & tag = "N.."] [tag = "PRP"]

What prepositions follow these near synonyms?

[lemma = "struggle|battle|fight"] [tag = "PRP"]
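
A string of two elements asks for two words that stand directly next to each other. The Python sketch below imitates the first query on a few hand-tagged words and counts the prepositions that follow. The token list is invented for illustration and is not BNC data, and the function name following_prepositions is hypothetical.

```python
from collections import Counter

# Illustrative hand-tagged tokens as (word, lemma, tag); not BNC data.
tokens = [
    ("the", "the", "AT0"), ("impact", "impact", "NN1"), ("on", "on", "PRP"),
    ("prices", "price", "NN2"), ("will", "will", "VM0"),
    ("impact", "impact", "VVI"), ("upon", "upon", "PRP"),
    ("wages", "wage", "NN2"), ("and", "and", "CJC"),
    ("impacts", "impact", "NN2"), ("on", "on", "PRP"), ("jobs", "job", "NN2"),
]


def following_prepositions(lemma):
    """Count words tagged PRP that directly follow the given lemma."""
    counts = Counter()
    # Pair each token with its immediate right-hand neighbour.
    for (_, this_lemma, _), (next_word, _, next_tag) in zip(tokens, tokens[1:]):
        if this_lemma == lemma and next_tag == "PRP":
            counts[next_word] += 1
    return counts


print(following_prepositions("impact"))
```

Because the sketch matches on the lemma, it counts the noun and the verb uses of impact together, just as the first query does.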

How to allow space between elements

Other words often appear between your target elements. For example, nouns are often preceded by determiners and adjectives, phrasal and delexical verb groups often have other elements between their components, and all sorts of phrases and structures permit variation.

An empty pair of brackets allows any one word to appear in between.

[lemma = ""][] [lemma = "approach"]

The number between the braces {} indicates the number of words permitted in between. This query asks for exactly three words between make and success.

[lemma = "make"][]{3}[lemma ="success"]

Using {1,3} gives a range, from one to three. This query asks for one, two or three words between let and down.

[lemma = "let"][]{1,3}[word ="down"]

This query asks for one to five words separating whether and or not.

[word = "whether"][]{1,5}[word ="or"][word ="not"]

This query asks for one to three words between approach and a singular or plural noun, which is followed by an infinitive with to.

[lemma ="approach"] []{1,3}[tag="NN."] [tag = "TO0"][tag = "VVI"]
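
The range in braces sets both a lower and an upper limit on the gap, so {1,3} rejects neighbouring words as well as words that are too far apart. The Python sketch below imitates the let ... down query on three invented sentences. It compares plain strings and treats them as lemmas, and the function name find_gapped is hypothetical.

```python
def find_gapped(lemmas, first, last, minimum, maximum):
    """Return (start, end) index pairs where `first` and `last` are
    separated by minimum..maximum intervening words."""
    hits = []
    for i, lemma in enumerate(lemmas):
        if lemma != first:
            continue
        for gap in range(minimum, maximum + 1):
            j = i + gap + 1  # position of the word after the gap
            if j < len(lemmas) and lemmas[j] == last:
                hits.append((i, j))
    return hits


# Invented sample sentences, split into words.
a = "she let the whole team down".split()
b = "he let down his friends".split()
c = "they let every single one of us down".split()

print(find_gapped(a, "let", "down", 1, 3))  # [(1, 5)]
print(find_gapped(b, "let", "down", 1, 3))  # []
print(find_gapped(c, "let", "down", 1, 3))  # []
```

Sentence b is missed because nothing separates let and down, and sentence c because five words separate them. Widening the range to {0,5} would be one way to catch both, assuming a lower limit of zero is accepted.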

How to exclude elements

The exclamation mark preceding the equals sign means "does not equal". The following query will find fast as a noun, verb or adverb, but not as an adjective.

[lemma="fast" & tag != "AJ0"]

The next example finds dream followed by anything but about.

[lemma="dream"] [word !="about"]

The next example finds all forms of break followed by exactly five words and then smile when it is not a verb.

[lemma = "break"] []{5} [lemma="smile" & tag !="V.."]
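
An exclusion keeps every word that passes the other conditions and fails only the excluded one. As a small Python sketch of the fast query, the filter below keeps each reading of fast except the adjective. The tokens are hand-made examples with CLAWS tags, not corpus output, and the function name not_adjective is hypothetical.

```python
# Illustrative (word, lemma, tag) triples; tags follow CLAWS.
tokens = [
    ("fast", "fast", "AJ0"),    # adjective: a fast car
    ("fast", "fast", "AV0"),    # adverb: drive fast
    ("fast", "fast", "NN1"),    # noun: a three-day fast
    ("fasted", "fast", "VVD"),  # verb: they fasted
]


def not_adjective(token):
    """Mimic lemma = "fast" combined with tag != "AJ0"."""
    _, lemma, tag = token
    return lemma == "fast" and tag != "AJ0"


kept = [tag for _, _, tag in filter(not_adjective, tokens)]
print(kept)  # ['AV0', 'NN1', 'VVD']
```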

Searching for punctuation

Because some punctuation marks serve as query codes, they must be escaped with a backslash \. The first example here searches for which preceded by a comma. The second searches for which preceded by anything other than a comma.

[word = "\,"][word = "which"]

[word != "\,"][word = "which"]
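
When queries are built from a list of words, the backslash can be added automatically. The Python sketch below is one cautious way to do that, placing a backslash before every character that is not a letter or digit. The helper names escape_value and word_element are hypothetical, and escaping every such character is a simplifying assumption rather than a rule of the query language. The last two lines use Python's own regular expressions to show why an unescaped full stop is risky.

```python
import re


def escape_value(text):
    """Put a backslash before every character that is not a letter or digit."""
    return "".join(ch if ch.isalnum() else "\\" + ch for ch in text)


def word_element(text):
    """Build a word element with its value escaped."""
    return f'[word = "{escape_value(text)}"]'


print(word_element(","))      # [word = "\,"]
print(word_element("which"))  # [word = "which"]

# An unescaped full stop matches any character; an escaped one does not.
print(bool(re.fullmatch(".", "x")))    # True
print(bool(re.fullmatch(r"\.", "x")))  # False
```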

BNC Anomalies

One word equals two (and more)

You can search for don't as a lemma but it returns the noun form as in the do's and don'ts. To search for contractions, ..... The full list of these forms and their tags can be found at the BNC site.

Two words (and more) equal one

Many multi-word units (MWUs) have been tagged as single words, for example in case, every so often, out of touch with. This will affect your results if, for example, you search for the preposition preceding case or touch by using those words as key words. Try it and see!

The full list of these MWUs and their tags can be found at the BNC site.
