el corpus del español


The Corpus del Español released in 2016 (Web-Dialects; CdE:New) contains about two billion words, roughly 100 times as much data as the 1900s portion of the previous Corpus del Español (History-Genres; CdE:Old). As a result, it provides much richer data on a wide range of phenomena. The following are just a few examples.

Lexical

There are 422 verbs with a lemma frequency between 300 and 500 in CdE:New. The following shows how many times these same verbs appear in CdE:Old. Nearly three out of four of them (74%) have ten tokens or fewer in CdE:Old, which is too few to support any reliable claim about the verbs. Only 7 of the 422 (about 2%) have 50 tokens or more.
 

Frequency in CdE:Old   # verbs   % verbs   Examples
50 tokens or more      7         2%        mascullar, petrificar, rezongar
11-49 tokens           101       24%       guarnecer, crepitar, ahuecar
1-10 tokens            177       42%       fardar, precintar, trasuntar
0 tokens               136       32%       vandalizar, aperturar, erupcionar
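
The percentages above follow directly from the verb counts. As a minimal sketch, the arithmetic can be reproduced in Python; the names `bands`, `band_shares`, and `TOTAL_VERBS` are illustrative and not part of the corpus interface.

```python
# Verb counts per CdE:Old frequency band, copied from the table above.
bands = {
    "50 tokens or more": 7,
    "11-49 tokens": 101,
    "1-10 tokens": 177,
    "0 tokens": 136,
}

# Number of verbs with a lemma frequency of 300-500 in CdE:New, per the text.
TOTAL_VERBS = 422


def band_shares(counts, total):
    """Return each band's share of `total`, rounded to a whole percentage."""
    return {band: round(100 * n / total) for band, n in counts.items()}


shares = band_shares(bands, TOTAL_VERBS)

# "Ten tokens or fewer" combines the 1-10 band and the 0 band.
ten_or_fewer = bands["1-10 tokens"] + bands["0 tokens"]
ten_or_fewer_pct = round(100 * ten_or_fewer / TOTAL_VERBS)

for band, pct in shares.items():
    print(f"{band}: {bands[band]} verbs ({pct}%)")
print(f"10 tokens or fewer: {ten_or_fewer} verbs ({ten_or_fewer_pct}%)")
```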

Semantic

Without enough tokens of a given word, it is impossible to use collocates ("nearby words") to say much about its meaning and usage. For example, we chose (almost at random) a verb, a noun, an adjective, and an adverb from CdE:New to show how many different collocates occur with each word in CdE:New and CdE:Old. A collocate is counted if it occurs at least three times as a lemma within four words to the left or right of the node word. (You might need to manually reset the SEC 1 value to just the 1900s for CdE:Old to get the correct type count.) As the counts show, CdE:New provides much better data for examining the meaning and usage of words.
 

Lemma              PoS (node : collocate)   CdE:New   CdE:Old
taladrar           VERB : NOUN              169       1
bufanda            NOUN : NOUN              323       3
puñetero           ADJ : NOUN               296       1
intencionalmente   ADV : VERB               419       1
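
The counting rule described above (collocate lemmas of a given part of speech, within four words on either side of the node word, occurring at least three times) can be sketched in Python. This is only an illustration under stated assumptions: the function `collocate_types`, the `(lemma, pos)` input format, and the toy data are hypothetical, and the corpus interface may count collocates differently.

```python
from collections import Counter


def collocate_types(tokens, node, collocate_pos, span=4, min_freq=3):
    """Count collocate lemmas of `node` in a list of (lemma, pos) pairs.

    A collocate is any lemma tagged `collocate_pos` that appears within
    `span` tokens to the left or right of an occurrence of `node`.
    Only lemmas seen at least `min_freq` times are returned.
    """
    counts = Counter()
    for i, (lemma, _pos) in enumerate(tokens):
        if lemma != node:
            continue
        # Window of up to `span` tokens on each side, excluding the node itself.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(l for l, p in window if p == collocate_pos)
    return {l: n for l, n in counts.items() if n >= min_freq}


# Toy, hand-made lemmatized text (hypothetical; not corpus data).
toy = [("taladrar", "VERB"), ("el", "DET"), ("pared", "NOUN")] * 3
toy += [("taladrar", "VERB"), ("un", "DET"), ("madera", "NOUN")]

result = collocate_types(toy, "taladrar", "NOUN")
print(result)       # collocate lemmas that pass the frequency threshold
print(len(result))  # number of distinct collocate types
```

The number of distinct collocates, as reported in the table, corresponds to `len(result)` in this sketch. With only a handful of tokens of the node word, few or no lemmas reach the threshold of three, which is why the CdE:Old counts are so low.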

Syntactic

Because CdE:New is about 100 times as large as the 1900s portion of CdE:Old, it provides many more tokens for lower-frequency syntactic constructions. The following shows the number of tokens in the two corpora for several constructions. (You might need to manually reset the SEC 1 value to just the 1900s for CdE:Old to get the correct count.)
 

CdE:New  CdE:Old  Search string                      Explanation                                                                        Example(s)
591      12       la|las [hacer] [v*] el|la|los|las  Accusative case for (FEM) agent in causative construction (see #68, #69, and #71)  la hizo ver el verdadero sentido
852      5        parecen que [v*3p*]                "Split subject raising" (see #64 and #65)                                          parecen que tienen un diseño moderno
242      4        [anhelar] *r.[v*] se               Clitic se does not climb with anhelar (see #61 and #70)                            anhelaba sentirse menos exigida
826      5        para ella|ellas [vr*]              Lexical subject of infinitive (see #52)                                            es fácil para ella hacer esta danza
42,887   207      [estar] siendo [vps*]              Progressive + passive                                                              las ciudades están siendo fragmentadas
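
To set these token counts against the roughly 100-fold difference in corpus size, the two columns can be divided row by row. The following is a minimal Python sketch using the counts from the table above; the names `rows` and `token_ratio` are illustrative and not part of the corpus interface.

```python
# (search string, tokens in CdE:New, tokens in CdE:Old), copied from the table.
rows = [
    ("la|las [hacer] [v*] el|la|los|las", 591, 12),
    ("parecen que [v*3p*]", 852, 5),
    ("[anhelar] *r.[v*] se", 242, 4),
    ("para ella|ellas [vr*]", 826, 5),
    ("[estar] siendo [vps*]", 42887, 207),
]


def token_ratio(new_count, old_count):
    """Return how many CdE:New tokens there are per CdE:Old token."""
    return new_count / old_count


ratios = {query: token_ratio(new, old) for query, new, old in rows}

for query, ratio in ratios.items():
    print(f"{query}: about {ratio:.0f} times as many tokens in CdE:New")
```

With so few tokens on the CdE:Old side, these ratios are rough indications only; a difference of one or two tokens in CdE:Old changes them substantially.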