The Corpus del Español released in 2016 (Web-Dialects; CdE:New)
contains about two billion words of data, roughly 100 times
as much as the 1900s portion of the previous Corpus del
Español (History-Genres; CdE:Old). As a result, it provides much richer data on a
wide range of phenomena. The following are just a few examples.
Lexical
There are 422 verbs with a lemma frequency between 300 and 500 in CdE:New. The
following shows how many times these same verbs appear in CdE:Old.
Of the 422 verbs, nearly three out of four (74%) have ten tokens or fewer in CdE:Old, which is too few to support any reliable claims about them.
Only 7 of the 422 (about 2%) have 50 tokens or more.
Frequency CdE:Old | # verbs | % verbs | Examples |
50 tokens or more | 7 | 2% | mascullar, petrificar, rezongar |
11-49 tokens | 101 | 24% | guarnecer, crepitar, ahuecar |
1-10 tokens | 177 | 42% | fardar, precintar, trasuntar |
0 tokens | 136 | 32% | vandalizar, aperturar, erupcionar |
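The percentages in the table can be re-derived from the band counts. The following is a minimal sketch, not part of the corpus interface. The counts and the total of 422 are copied from the text above, while the names `bands`, `share`, and `ten_or_fewer` are introduced here purely for illustration.

```python
# Band counts copied from the table above (CdE:Old frequency of the CdE:New verbs).
TOTAL_VERBS = 422  # total stated in the text; used as the denominator

bands = {
    "50 tokens or more": 7,
    "11-49 tokens": 101,
    "1-10 tokens": 177,
    "0 tokens": 136,
}


def share(count, total=TOTAL_VERBS):
    """Return count as a whole-number percentage of total."""
    return round(100 * count / total)


# Percentage of verbs in each frequency band.
percentages = {band: share(n) for band, n in bands.items()}

# Verbs with ten tokens or fewer: the "1-10" and "0" bands combined.
ten_or_fewer = share(bands["1-10 tokens"] + bands["0 tokens"])

for band, n in bands.items():
    print(f"{band:<18} {n:>4} {percentages[band]:>3}%")
print(f"ten tokens or fewer: {ten_or_fewer}%")
```

The combined share for the two lowest bands is the 74% figure quoted in the text.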
Semantic
Without enough tokens of a given word, it is impossible to use its collocates
("nearby words") to say much about its meaning and usage. For example,
we have chosen (almost at random) a verb, noun, adjective, and adverb from
CdE:New to show how many different collocates occur with each word in CdE:New
and CdE:Old. A collocate is counted if it occurs at least three times as a
lemma within four words to the left or right of the node word. (You might need
to manually reset the SEC 1 value to just the 1900s for CdE:Old to get the
correct type count.) As we see, CdE:New provides much
better data for examining the meaning and usage of words.
lemma (PoS node:collocate) | CdE:New | CdE:Old |
taladrar (VERB : NOUN) | 169 | 1 |
bufanda (NOUN : NOUN) | 323 | 3 |
puñetero (ADJ : NOUN) | 296 | 1 |
intencionalmente (ADV : VERB) | 419 | 1 |
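The counting criterion described above (collocate lemmas of a given part of speech, within four words on either side of the node word, occurring at least three times) can be sketched as follows. This is only an illustration of the idea, not the corpus's actual implementation. The function `collocate_types`, the helper `parse`, the `lemma/POS` notation, and the toy sentences are all assumptions made up for this example. The sketch filters only the collocate's part of speech and does not check the part of speech of the node word.

```python
from collections import Counter


def collocate_types(tokens, node, collocate_pos, span=4, min_freq=3):
    """Count collocate lemmas with part of speech `collocate_pos` that occur
    within `span` tokens to the left or right of `node`, keeping only those
    seen at least `min_freq` times."""
    counts = Counter()
    for i, (lemma, _pos) in enumerate(tokens):
        if lemma != node:
            continue
        # Up to `span` tokens on each side, excluding the node itself.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(lem for lem, pos in window if pos == collocate_pos)
    return {lem: n for lem, n in counts.items() if n >= min_freq}


def parse(text):
    """Turn 'lemma/POS lemma/POS ...' into a list of (lemma, POS) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in text.split()]


# Invented toy data (hypothetical, not corpus output): three sentences with
# "lana" near "bufanda", and one sentence with "seda".
a = "el/DET bufanda/NOUN de/ADP lana/NOUN ser/VERB muy/ADV suave/ADJ ./PUNCT "
b = "el/DET bufanda/NOUN de/ADP seda/NOUN ser/VERB muy/ADV caro/ADJ ./PUNCT"
sample = parse(a * 3 + b)

result = collocate_types(sample, "bufanda", collocate_pos="NOUN")
print(result)       # collocates that pass the frequency threshold
print(len(result))  # number of collocate types
```

In the toy data, "seda" appears only once near the node word and is dropped by the threshold. The same thing happens to most collocates in a small corpus, which is why the type counts for CdE:Old are so low.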
Syntactic
Because CdE:New is about 100 times as large as the 1900s portion of CdE:Old,
it provides many more tokens for lower-frequency syntactic constructions. The
following shows the number of tokens in the two corpora for a number of
different constructions. (You might need to manually reset the SEC 1 value to
just the 1900s for CdE:Old to get the correct token count.)
CdE:New | CdE:Old | search string | explanation | example(s) |
591 | 12 | la|las [hacer] [v*] el|la|los|las | Accusative case for (FEM) agent in causative construction (see #68, #69, and #71) | la hizo ver el verdadero sentido |
852 | 5 | parecen que [v*3p*] | "Split subject raising" (see #64 and #65) | parecen que tienen un diseño moderno |
242 | 4 | [anhelar] *r.[v*] se | Clitic se does not climb with anhelar (see #61 and #70) | anhelaba sentirse menos exigida |
826 | 5 | para ella|ellas [vr*] | Lexical subject of infinitive (see #52) | es fácil para ella hacer esta danza |
42,887 | 207 | [estar] siendo [vps*] | Progressive + passive | las ciudades están siendo fragmentadas |
|
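To make the difference between the two corpora easier to compare across constructions, the token counts above can be turned into simple ratios. The following is a minimal sketch. The counts and search strings are copied from the table, while the names `constructions` and `ratios` are introduced here for illustration only. Ratios are rounded down to whole numbers.

```python
# (search string, tokens in CdE:New, tokens in CdE:Old), copied from the table.
constructions = [
    ("la|las [hacer] [v*] el|la|los|las", 591, 12),
    ("parecen que [v*3p*]", 852, 5),
    ("[anhelar] *r.[v*] se", 242, 4),
    ("para ella|ellas [vr*]", 826, 5),
    ("[estar] siendo [vps*]", 42887, 207),
]

# Whole-number ratio (rounded down) of CdE:New tokens to CdE:Old tokens.
ratios = {query: new // old for query, new, old in constructions}

for query, new, old in constructions:
    print(f"{query:<36} {new:>6} / {old:>3} = about {ratios[query]}x")
```

Every construction has at least 49 times as many tokens in CdE:New as in CdE:Old.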