Introduction to Corpora and Concordancing and Data Driven Learning (DDL)
Query commands in Collins Cobuild Corpus Concordancer
|
Simple Queries |
Type in a single word. This node word will be shown in the
middle of the screen in a maximum of forty contexts. |
Ultimately
Nature
Absolve
Fellow
mob |
|
Adjacent words |
must be joined with + and without spaces. |
Building+block
Have+long+been
a+has+been
spoon+knife+fork |
|
Non-adjacent words |
put the maximum number of intervening words in front of the second word |
Dog+4bark
Take+2for
|
|
Inflected Forms |
show the lemma using @ |
Fall@
Have@+fall@+in+love+with |
|
Trailing wildcard |
show words that start with this, using * |
Adjust*
Ander*
Im*
Un* |
|
Word sets |
display all these words in the centre using | |
Although|though
Which|that
Home|house |
|
Part-of-speech tags |
search for a word as a particular POS, using / followed by the tag in CAPITAL LETTERS |
House/VERB
Holiday/VERB |
|
|
NOUN = any noun tag
VERB = any verb tag
NN = common noun
NNS = plural noun
JJ = adjective
RB = adverb
VB = base-form verb
VBN = past participle verb
VBG = -ing form verb
VBD = past tense verb |
|
Register |
A choice of three registers gives you some flexibility: American or
English English, spoken or written. |
|
Combining |
these can all be combined to form a great range of possible
searches. |
find passives: be@+VBN
discontinuous phrasal verbs: let@+2in+on
Or this:
· is|has+VBN
· would|should|could+5if
· if+5would|should|could |
Lemma = full set of inflected forms, e.g. fall, falls, fell, fallen, etc.
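The query commands above can be tried out on your own texts with a few lines of Python. What follows is only a rough sketch, not how the Cobuild concordancer itself works: the function names and the tiny LEMMAS table are invented for illustration, the text is assumed to be already split into lower-case words, and part-of-speech tags (which need a tagged corpus) are left out. It handles + for adjacent words, a number for intervening words, @ for lemmas, * for trailing wildcards and | for word sets.

```python
import re

# Hypothetical mini lemma table for illustration only; a real
# concordancer would use a full lemmatiser.
LEMMAS = {
    "fall": {"fall", "falls", "fell", "fallen", "falling"},
    "have": {"have", "has", "had", "having"},
}

def make_test(body):
    """Build a word test for one slot, e.g. 'fall@', 'adjust*' or 'home|house'."""
    tests = []
    for alt in body.split("|"):
        if alt.endswith("@"):          # lemma: any inflected form
            forms = LEMMAS.get(alt[:-1], {alt[:-1]})
            tests.append(lambda w, forms=forms: w in forms)
        elif alt.endswith("*"):        # trailing wildcard
            tests.append(lambda w, prefix=alt[:-1]: w.startswith(prefix))
        else:                          # plain word
            tests.append(lambda w, word=alt: w == word)
    return lambda w: any(test(w) for test in tests)

def parse_query(query):
    """Split a query on + into (max_gap, test) slots."""
    slots = []
    for part in query.lower().split("+"):
        m = re.fullmatch(r"(\d+)(.+)", part)
        gap, body = (int(m.group(1)), m.group(2)) if m else (0, part)
        slots.append((gap, make_test(body)))
    return slots

def match_rest(tokens, pos, slots):
    """True if the remaining slots match after the token at index pos."""
    if not slots:
        return True
    gap, test = slots[0]
    # Allow up to `gap` intervening words before the next slot.
    for nxt in range(pos + 1, min(pos + gap + 2, len(tokens))):
        if test(tokens[nxt]) and match_rest(tokens, nxt, slots[1:]):
            return True
    return False

def find_matches(tokens, query):
    """Return the index of the first word of every match."""
    slots = parse_query(query)
    first_test = slots[0][1]
    return [i for i, tok in enumerate(tokens)
            if first_test(tok) and match_rest(tokens, i, slots[1:])]

tokens = "the dog began to bark and he has fallen in love".split()
print(find_matches(tokens, "dog+4bark"))
print(find_matches(tokens, "have@+fall@+in+love"))
```

Both example queries find a match in the sample sentence, whereas dog+bark would not, because two words come between dog and bark.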
· A sample version of the Collins Cobuild Corpus Concordancer is at
http://titania.cobuild.collins.co.uk/form.html
This sample version of the concordancer gives a maximum of 40 lines per search.
· What can we say vs what do we say? (Hypothesis, evidence, conclusion.
Cognitive processing. Discovery learning.)
A basic premise of data driven learning
If we accept grammar as FACTS, PATTERNS and CHOICES, finding multiple examples
of them can provide meaningful teaching material and help the learner
develop a fuller view of the language item being studied by providing
multiple immediate contexts.
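To see what "multiple immediate contexts" look like, here is a minimal sketch of a key-word-in-context (KWIC) display in Python. It is an illustration only, not the Cobuild software: the function name kwic and its parameters are made up, the text is split crudely on spaces, and the 40-line limit simply mirrors the sampler.

```python
def kwic(text, node, width=30, max_lines=40):
    """Return concordance lines with the node word lined up in the middle."""
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        # Compare without surrounding punctuation and ignoring case.
        if tok.strip('.,;:!?"\'()').lower() == node.lower():
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}}  {tok}  {right}")
            if len(lines) == max_lines:
                break
    return lines

sample = "Prices fell fast. He is a fast runner and they fast during Lent."
for line in kwic(sample, "fast"):
    print(line)
```

Because the left-hand context is padded to a fixed width, the node word starts in the same column on every line, which is what makes patterns to its left and right easy to spot.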
Note:
Because the Cobuild sampler returns concordances of only 40 lines, the
results are short enough to read through without any further machine
processing. This is not without its advantages. Here
are some things you might like to research using the concordancer:
How words behave (or misbehave)
fast: as an adjective and adverb (and noun and verb)
record: as a noun and verb
base: as a noun, verb, adjective
more+adj vs more+adv
Passive
e.g. observing the difference between the get passive and the be passive
(using |, or doing separate searches)
Phrasal verbs
· let+2in+on
· wake+up+to: sometimes literal, sometimes figurative.
· to give up something: give@+up
· to give something up: give@+2up
Can you find these combinations in the concordancer?
For more on delexical verbs, see, for example,
http://www.netlanguages.com/demo/samples/level7/unit5/04_1.htm
American vs English
· Different from and different than.
· Dived or dove? Also, incidentally, dove/VERB vs dove/NOUN
· Momentarily, Smart, Fancy, Football
· For more on American and English English, try, for example,
http://www.americansc.org.uk/berube.htm
and
http://members.tripod.com/~Duermueller/ESL2.html
Usage
· Can "cannot" be written as two separate words? Yes, but is it? Should there
be an apostrophe in the 1970s?
· How differently are the words fact and facts used?
Spoken and Written English
· Are moreover and whereas used in speech, or do they belong to the
written language?
· would have thought: is this chunk used in written English?
· question tags
Gender
· he+fall@+in+love+with: what do men fall in love with?
· she+fall@+in+love+with: what do women fall in love with?
· the boss
· he looks vs she looks
· my … husband or boyfriend etc.: what adjectives turn up here?
· my … wife or girlfriend etc.
Collocations
Which adverbs describe smiling, running, studying?
What’s the difference between faulty and broken and defective?
eye vs eyes
degree vs degrees
The collocation span in the Cobuild Sampler is automatically set at four
words to the left and four to the right of the node word. The statistics
that appear with the lists of collocates are:
· Raw freq often picks out the obvious collocates ("post office" "side effect") but
you have no way of distinguishing these objectively from frequent
non-collocations (like "the effect" "an effect" "effect is" "effect it"
etc).
· MI (Mutual Information) will highlight the technical terms, oddities,
weirdos, totally fixed phrases, etc ("post mortem" "Laurens van der Post"
"post-menopausal" "prepaid post"/"post prepaid" "post-grad").
· T-score will get you significant collocates which have occurred frequently ("post
office" "Washington Post" "post-war", "by post" "the post").
Note:
If a collocate appears near the top of both the MI and t-score lists it is
clearly a humdinger of a collocate, rock-solid, typical, frequent,
strongly associated with its node word, recurrent, reliable, etc etc etc.
(This information comes from the end of a more detailed description of the
statistics which you can read by clicking on the column headings on the
collocations page in Cobuild on-line.)
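For anyone curious about how such scores behave, here is a small Python sketch. It uses the common textbook formulas for MI and t-score, which may differ in detail from what Cobuild actually computes (the formulas here ignore the size of the span, for instance). The function names are invented and all the frequencies in the example are made-up numbers, not Cobuild data.

```python
import math
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the words up to `span` places left and right of each node word."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

def mi_score(f_node, f_coll, f_joint, n):
    """MI: log2 of observed co-occurrences over those expected by chance."""
    expected = f_node * f_coll / n
    return math.log2(f_joint / expected)

def t_score(f_node, f_coll, f_joint, n):
    """T-score: observed minus expected, scaled by the square root of observed."""
    expected = f_node * f_coll / n
    return (f_joint - expected) / math.sqrt(f_joint)

# Hypothetical frequencies in a corpus of one million words,
# with the node word "post" occurring 1000 times.
N = 1_000_000
F_POST = 1000
examples = {            # collocate: (its own frequency, frequency near "post")
    "office": (2000, 200),
    "mortem": (20, 15),
    "the": (60000, 500),
}
for word, (f_coll, f_joint) in examples.items():
    mi = mi_score(F_POST, f_coll, f_joint, N)
    t = t_score(F_POST, f_coll, f_joint, N)
    print(f"{word:8} MI={mi:5.2f}  t={t:5.2f}")
```

With these invented figures the rare but tightly bound "mortem" gets the highest MI, while the frequent "the" and "office" get the highest t-scores, which is the contrast the lists above describe.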
Using the World Wide Web as your corpus.
The Cobuild sampler does not let us see the whole sentence, let alone the
larger context of its concordances, unlike non-sample software. When
studying discourse markers, for example regarding, as for, furthermore
etc., a larger context is generally desirable. You can search for such
items in a normal web search and get millions of hits, which you can
reduce by including some topic words such as environment tidal energy
to get your discourse markers within a genre. And instead of opening
the article by clicking on the blue underlined title, click on
cache and the search words will be highlighted (or highlit? Ask
a corpus).
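If you save the text of a web page, a few lines of Python will pull out the whole sentences that contain a discourse marker, giving you the larger context the sampler lacks. This is a minimal sketch under simple assumptions: the function name sentences_with is invented, sentences are split naively after ., ! or ?, and fetching the page itself is not shown.

```python
import re

def sentences_with(text, phrase):
    """Return the sentences of `text` that contain `phrase` as whole words."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

page_text = (
    "Tidal energy is clean. As for wind power, costs are falling. "
    "Furthermore, it is cheap! Nothing to see here."
)
for sentence in sentences_with(page_text, "as for"):
    print(sentence)
```

Abbreviations such as "etc." will fool the sentence splitter, so treat the output as a starting point for reading rather than a finished concordance.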
Some Resources
1. The home page of Tim Johns, of Data Driven Learning fame:
http://web.bham.ac.uk/johnstf/
2. Mike Scott
http://www.liv.ac.uk/~ms2928/homepage.html
co-authored Microconcord (a DOS concordancer with a 2 million word corpus, still
good and very fast) with Tim Johns, and then the more sophisticated Wordsmith
Tools (for Windows). Wordsmith Tools doesn't come with any corpus.
3. Tom Cobb
http://132.208.224.131/
The Compleat Lexical Tutor: various applications of corpus work, especially
for vocabulary teaching. An article by him about using concordance
software to provide learners with a rich language learning experience can
be found at
http://pages.infinit.net/jaguar3/lounge/concord/default.htm
4. Spaceless
http://www.spaceless.com/concord/
This concordancer takes the text of a web page and creates a list of
sentences that contain the search term.
5. The VLC concordancer.
http://vlc.polyu.edu.hk/scripts/concordance/WWWConcapp.htm
6. Corpora in the Teaching of Languages and Linguistics
by Tony McEnery and Andrew Wilson. This site contains the authors’ summary
of their book.
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus4/4fra1.htm
7. An article by Vance Stevens, Concordancing with Language Learners: Why? When? What?
http://www.ruf.rice.edu/~barlow/stevens.html