There are three sets of large corpora of Spanish that have been
released in the last five years or so. The following are comparisons
of these corpora with the Corpus del Español: Web / Dialects.
CORPES (Real Academia Española) |
GOOD: The textual
corpus for CORPES seems to be quite good, including some
nice text categorization. There is more fiction than in
either the CdE (BYU) or the very large corpora. The corpus
has also been tagged and lemmatized quite well. NOT SO
GOOD: The corpus is only one tenth the size of the CdE.
Perhaps most seriously, it uses a fairly rudimentary web
interface, which really limits what can be done with
concordances, collocates, and frequency lists. In other
words, the good textual data is "trapped" behind a
poor interface, and is inaccessible to end users. |
See full comparison |
Very large corpora like Sketch Engine and Corpora from the Web (COW) |
GOOD: Size, size,
and size. The web interfaces are also quite nice, especially
the collocates-based "word sketches" (and comparisons
between word sketches) in Sketch Engine. The CQP syntax also
allows for very powerful searches. NOT SO GOOD: The
corpora were apparently created by people who didn't know
any Spanish. The part-of-speech tagging and especially the
lemmatization (i.e. assigning word forms to their "dictionary
form", e.g. dice, dijo, diremos = decir) are
very bad, making the corpora almost unusable for some
purposes. |
See full comparison |
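To illustrate why lemmatization quality matters for frequency lists, here is a minimal Python sketch. It is not how any of the corpora above are actually lemmatized: the `LEMMAS` table and the function names are hypothetical, and only the three forms of decir come from the example above.

```python
from collections import Counter

# Hypothetical toy lemma table. Only these three forms come from the
# example in the text; a real lemmatizer covers the whole lexicon.
LEMMAS = {"dice": "decir", "dijo": "decir", "diremos": "decir"}


def lemmatize(token: str) -> str:
    """Return the dictionary form, or the lowercased token if unknown."""
    key = token.lower()
    return LEMMAS.get(key, key)


def lemma_frequencies(tokens):
    """Count tokens by lemma rather than by surface form."""
    return Counter(lemmatize(t) for t in tokens)


tokens = ["Dice", "que", "dijo", "que", "diremos"]
freqs = lemma_frequencies(tokens)
print(freqs["decir"])  # 3
```

If the lemmatizer fails to map these forms to decir, the three occurrences are counted as three unrelated words, and a search for the verb misses most of its uses.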
See also the previous comparisons that were published using the older
(historical / genres) Corpus del Español:
Finally, compare the new two-billion-word (web-dialects) corpus
to the original (historical-genre) corpus, in terms of
size and the data that is available from the
corpora.