el corpus del español

While the Corpus del Español (Web / Dialects) has about two billion words of data, there are much larger web-based corpora. For example, Sketch Engine has a 9.6 billion word corpus of Spanish, and Corpora from the Web (COW) has a Spanish corpus that is almost twice as large as our corpus. Why not just use these other corpora instead?

The reason is that size is not everything. Once a corpus is created, it has to be annotated for part of speech and lemma (e.g. dice, dijo, and diremos are all forms of the lemma decir). While it's easy to create a large corpus from the web for any language nowadays, it's much harder to annotate it accurately. And without good annotation, the corpus is almost unusable, at least for some purposes.

To see what types of problems result from inaccurate tagging and lemmatization, take a look at the following spreadsheet.

Spanish lemmas

This spreadsheet shows words starting with s- in the Sketch Engine corpus. (Since COW uses the same tagger and since it hasn't been corrected either, its output would be essentially the same. Search for some of the "lemmas" in these lists in COW, and you'll see that the same problems are there as well.) The spreadsheet groups words by lemma and part of speech (noun, verb, adjective, adverb), and it shows all lemmas that occur 20 times or more in the corpus. Potential "problem" words are highlighted in yellow.
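To make the layout of such a list concrete, here is a minimal sketch of how tagged tokens could be grouped by lemma and part of speech and then filtered at the 20-occurrence threshold. The function name, the (form, lemma, pos) tuple layout, and the sample data are assumptions made for illustration; they do not describe how Sketch Engine or COW actually build their lists.

```python
from collections import Counter

def lemma_frequencies(tagged_tokens, min_count=20):
    """Count (lemma, pos) pairs and keep those seen at least min_count times.

    tagged_tokens is an iterable of (form, lemma, pos) tuples, a hypothetical
    stand-in for whatever a tagger/lemmatizer emits.
    """
    counts = Counter((lemma, pos) for _form, lemma, pos in tagged_tokens)
    kept = [(lemma, pos, n) for (lemma, pos), n in counts.items() if n >= min_count]
    # Most frequent first; ties broken alphabetically by lemma, then by pos.
    kept.sort(key=lambda row: (-row[2], row[0], row[1]))
    return kept

# Invented sample: two forms correctly mapped to "decir", plus one bad "lemma".
sample = (
    [("dice", "decir", "VERB")] * 12
    + [("dijo", "decir", "VERB")] * 9
    + [("sómos", "sómos", "VERB")] * 3
)

for lemma, pos, n in lemma_frequencies(sample):
    print(lemma, pos, n)
```

Note that a frequency threshold alone does nothing to remove bad lemmas: any mis-lemmatized form that occurs often enough in a multi-billion-word corpus still makes it into the list.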

You will notice that the lists start out well. For example, the top ten verbs are ser, saber, seguir, salir, señalar, sentir, servir, solicitar, suponer, sacar -- all verbs. So far, so good. But down around word #1000, we find the following lemmas -- one after another: satifacer, siempore, sako, simone, sómos, seguió, sperar, substituído, supply, safó, sardinada, subiamos, subway, sobrescribe, soñabamos, secion, subredondear, santalucía, scripta, scuba, selecionada, sostenian, surfea, sarpado, satisfacion, sorpendido, suguiere, semibatir, september, seva. Virtually none of these "verbs" are really lemmas. Either they are forms (or near forms) of lemmas -- but not the actual lemmas (sómos, soñabamos, suguiere, substituído, subiamos, sostenian), or they are from another language (supply, subway, scuba, september), or they are just "weird" (simone, santalucía, seva).

And this is near the top of the list, where someone could have presumably corrected the first 1000 verbs or so -- had they known Spanish. Things get much stranger further down the list, e.g. around verb #3200: salienron, salomé, sangree, scarce, scrooge, sdfr, sebita, seeeeeeeeeee, separació, serásn, sexan, shay, shúper, silicone, simos, siome, ske, sommer, sorcerer, spaña, swear, self-care. None of these are verbal lemmas, and none of them have been corrected in any way.
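As a rough illustration of why automatic checks cannot replace this kind of review, consider a simple heuristic that uses the fact that Spanish infinitives end in -ar, -er, or -ir (optionally followed by reflexive -se). The sketch below is only an assumed first-pass filter with hypothetical function names, not a method used for any of the corpora discussed here.

```python
INFINITIVE_ENDINGS = ("ar", "er", "ir", "ír")

def looks_like_infinitive(lemma):
    """Return True if the string has the surface shape of a Spanish infinitive."""
    base = lemma.lower()
    if not base.isalpha():
        return False  # rejects hyphenated or otherwise non-alphabetic strings
    if base.endswith("se"):
        base = base[:-2]  # strip reflexive -se, e.g. sentirse -> sentir
    return len(base) >= 2 and base.endswith(INFINITIVE_ENDINGS)

def flag_suspect_verbs(lemmas):
    """Return the supposed verb lemmas that do not look like infinitives."""
    return [lemma for lemma in lemmas if not looks_like_infinitive(lemma)]

# Items taken from the lists quoted above.
candidates = ["ser", "saber", "sómos", "supply", "soñabamos",
              "satifacer", "september", "swear", "self-care"]
print(flag_suspect_verbs(candidates))
```

The heuristic flags sómos, supply, soñabamos, and self-care, but it lets satifacer, september, and swear through because they happen to end like infinitives. Catching those still requires someone who knows Spanish to look at each word in context.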

If you're going to create word frequency data or language learning tools, you need to carefully review thousands upon thousands of words -- looking at their context, fixing lemmas and part of speech, etc. Apparently none of this was done for these larger Spanish corpora and so they are -- as we have mentioned -- almost unusable for many purposes.

With our corpus, we have reviewed each and every lemma (for the top 40,000 lemmas in the corpus), to make sure that the lemma and the part of speech are correct. It's a lot of work, and it took several months to complete. But now that it's done, we believe that we have the only large (> one billion words) and reliable corpus of Spanish.