el corpus del español


There are three sets of large corpora of Spanish that have been released in the last five years or so. The following are comparisons of these corpora with the Corpus del Español: Web / Dialects.
 

CORPES (Real Academia Española)

GOOD: The textual corpus for CORPES seems to be quite good, including some nice text categorization. There is more fiction than in either the CdE (BYU) or the very large corpora. The corpus has also been tagged and lemmatized quite well.

NOT SO GOOD: The corpus is only one-tenth the size of the CdE. Perhaps most seriously, it uses a fairly rudimentary web interface, which really limits what can be done with concordances, collocates, and frequency lists. In other words, the good textual data is "trapped" behind a poor interface and is inaccessible to end users.

See full comparison

Very large corpora like Sketch Engine and Corpora from the Web (COW)

GOOD: Size, size, and size. The web interfaces are also quite nice, especially the collocates-based "word sketches" (and comparisons between word sketches) in Sketch Engine. The CQP syntax also allows for very powerful searches.

NOT SO GOOD: The corpora were apparently created by people who didn't know any Spanish. The part-of-speech tagging and especially the lemmatization (i.e. assigning word forms to their "dictionary form", e.g. dice, dijo, diremos = decir) are very bad, making the corpora almost unusable for some purposes.
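
To make concrete why lemmatization quality matters for frequency data, here is a minimal Python sketch. It is an illustration only and does not reflect how any of these corpora are actually processed. The LEMMAS table and the lemma_frequencies function are hypothetical names introduced for this example, and the table covers only the three forms mentioned above.

```python
from collections import Counter

# Hypothetical form-to-lemma table (an assumption for illustration);
# a real lemmatizer would cover the whole lexicon and use context.
LEMMAS = {"dice": "decir", "dijo": "decir", "diremos": "decir"}


def lemma_frequencies(tokens, lemmas=LEMMAS):
    """Count tokens by lemma, falling back to the lowercased form itself."""
    counts = Counter()
    for token in tokens:
        form = token.lower()
        counts[lemmas.get(form, form)] += 1
    return counts


tokens = ["Dice", "que", "dijo", "que", "diremos", "algo"]

# With the table, the three verb forms are counted together under "decir".
with_lemmas = lemma_frequencies(tokens)

# With an empty table (standing in for failed lemmatization), each form is
# counted separately and a search for the lemma "decir" finds nothing.
without_lemmas = lemma_frequencies(tokens, lemmas={})
```

When forms are not mapped to the right lemma, searches by lemma and lemma-based frequency lists miss or scatter the relevant tokens, which is the problem described above.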

See full comparison
 

See also the previously published comparisons that used the older (historical / genres) Corpus del Español:

Finally, compare the new two-billion-word (web-dialects) corpus to the original (historical-genre) corpus, in terms of size and the data that is available from each corpus.