When we collected the two million web pages, we relied on Google's identification of the country for each page. That identification is less straightforward for sites without a country-specific domain, such as a .COM site (e.g. www.felicidad.com), and one might wonder how Google knew which country such a site came from.

To test how well Google did, we looked up a number of words and constructions from John Lipski's Latin American Spanish (supplemented by other resources on the Web) that are reported to be more common in a given country or region. The fact that the following words and phrases do appear much more frequently in the expected country suggests that Google's categorization is quite good.
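The check described above can be sketched roughly as follows. This is only an illustration, not the procedure actually used for the corpus: the function names, the two-letter country codes, and all of the counts are hypothetical. The point it shows is that raw hit counts have to be normalized by the amount of text from each country before they can be compared.

```python
def per_million(hits, corpus_sizes):
    """Convert raw hit counts to occurrences per million words, per country."""
    return {
        country: hits.get(country, 0) / size * 1_000_000
        for country, size in corpus_sizes.items()
    }


def top_country(hits, corpus_sizes):
    """Return the country with the highest normalized frequency."""
    rates = per_million(hits, corpus_sizes)
    return max(rates, key=rates.get)


# Hypothetical figures for illustration only (not taken from the corpus).
corpus_sizes = {"PR": 30_000_000, "MX": 250_000_000, "ES": 400_000_000}
zafacon_hits = {"PR": 300, "MX": 50, "ES": 500}

# Raw counts alone would point to ES, which has the most text overall.
raw_winner = max(zafacon_hits, key=zafacon_hits.get)

# After normalization the word is clearly most frequent in PR.
normalized_winner = top_country(zafacon_hits, corpus_sizes)
```

If the country that comes out on top after normalization is the one the dialect literature predicts, that is one small piece of evidence that the pages were assigned to the right country.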

Lexical

Caribbean
Puerto Rico: ay bendito, chavos, chiringa, mahones, habichuela (+DR), zafacón (+DR)
Cuba: guajiro, jimaguas, babalao, bitongo, pedir botella
Rep Dom: mangú, fucú, tutumpote, mangulina, mofongo (+PR)

México and Central America
México: ándale, híjole, órale, güero, (muy) padre, chamaco (CAm/Car), pinche (NOUN), popote, charola
Guatemala: huipil, canche, muchá, patojo, chafa (+HN), chirmol
El Salvador: cipote, chero, pupusa, cuilio, bayunco, piscucha
Honduras: catracho, papada
Nicaragua: chavalo, maje (+CAm), pinol, pinolillo, chigüín, vigorón, gallo pinto (+CR), idiay (+CR)
Panamá: fulo, chombo, guandul
Costa Rica: chinear, guila, chunche

South America
Colombia: cachaco, cachifo, verraquera, estar mamado, guandoca, biche
Venezuela: bojote, coroto, catire, gafo, macundales, arepa, cachapa, cambur, caraotas, jojoto
Ecuador: chumar, chulla, montuvio, omoto
Perú: anticucho, jebe, chupe, pisco, jora, chompa (+CL/EC), choclo (+CL/EC)
Bolivia: opa, colla, chuño, lagua
Chile: pololo*, pololear, achuntar, bencina, bacán, fome, huaso
Paraguay: ñembo, ñanduti, karai, yopará, mitai
Uruguay: tropero, hacer * sota, con fritas
Argentina: pibe, fiaca, morfar, falopa, sobre el pucho, falluto, cafishio

España: ordenador, aparcar, enfadar, gafas, zumo, chulo, guay, coger, bolígrafo, patata, melocotón, echar de menos, vale

Note that the corpus often shows that a word or phrase is more common across an entire region, rather than in just one specific country. For example, the following words are more frequent in Central America: chele, guaro, estar bolo, chimar, chingo, chompipe, tiste, molote, chichipate, barrilete, pisto (+HN/SV). The following are more frequent in Argentina and Uruguay: che !, laburo, lunfardo.
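One way to examine a regional pattern like this is to pool the counts for the countries in a region and compare the pooled rate with the rate everywhere else. The following is a minimal sketch under stated assumptions: the function name, the country codes, and every number are hypothetical and serve only to illustrate the arithmetic.

```python
def pooled_rate(hits, corpus_sizes, countries):
    """Occurrences per million words, pooled over a group of countries."""
    total_hits = sum(hits.get(c, 0) for c in countries)
    total_words = sum(corpus_sizes[c] for c in countries)
    return total_hits / total_words * 1_000_000


# Hypothetical figures for illustration only (not taken from the corpus).
corpus_sizes = {
    "AR": 170_000_000,
    "UY": 40_000_000,
    "MX": 250_000_000,
    "ES": 400_000_000,
}
laburo_hits = {"AR": 3_400, "UY": 800, "MX": 25, "ES": 40}

region = ["AR", "UY"]  # Argentina and Uruguay
elsewhere = [c for c in corpus_sizes if c not in region]

inside = pooled_rate(laburo_hits, corpus_sizes, region)
outside = pooled_rate(laburo_hits, corpus_sizes, elsewhere)

# A ratio well above 1 means the word is characteristic of the region as a whole.
ratio = inside / outside
```

Pooling makes it easier to see words that are spread fairly evenly across neighbouring countries and therefore never stand out for any single one of them.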

Syntactic and morphological

Of course the corpus can be used to look at syntactic and morphological differences between dialects as well. The following are just a few examples (with a short sample, and the country or zone in which it is most common):

qué tú VERB (¿qué tú quieres?): Carib
PREP SUBJ VERB (para ella entender): Carib
más nada .|, : Carib
ART POSS NOUN (una mi amiga): GT
mero VERB: GT
te [v*2s*] tu NOUN (te rompiste tu pierna): MX
vos sos (voseo): Cono Sur, CAm
tenéis (vosotros): ES
la|las GUSTAR (laísmo; la gusta el chocolate): ES
qué tan ADJ (¿qué tan importante es eso?): not ES
cuanto más VERB (ES) / por más que VERB / entre más VERB / mientras más VERB
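Patterns such as "qué tú VERB" mix literal words with part-of-speech slots. As a rough sketch of how such a pattern could be matched against tagged text (this is not the corpus's actual query engine, and the tag names, token format, and function names here are assumptions), each slot written in capitals is compared with the token's tag and everything else is compared with the word itself:

```python
def matches_at(tokens, start, pattern):
    """Check whether the pattern matches the tokens beginning at index start.

    tokens  -- list of (word, tag) pairs
    pattern -- list of slots; an all-capitals slot is a part-of-speech tag,
               any other slot is a literal word (compared case-insensitively)
    """
    if start + len(pattern) > len(tokens):
        return False
    for (word, tag), slot in zip(tokens[start:], pattern):
        if slot.isupper():
            if tag != slot:
                return False
        elif word.lower() != slot:
            return False
    return True


def find_pattern(tokens, pattern):
    """Return every index at which the pattern matches."""
    return [i for i in range(len(tokens)) if matches_at(tokens, i, pattern)]


# Hypothetical tagged sentences; the tag set is an assumption.
caribbean = [("¿", "PUNCT"), ("Qué", "PRON"), ("tú", "PRON"),
             ("quieres", "VERB"), ("?", "PUNCT")]
general = [("¿", "PUNCT"), ("Qué", "PRON"), ("quieres", "VERB"),
           ("tú", "PRON"), ("?", "PUNCT")]

pattern = ["qué", "tú", "VERB"]
caribbean_matches = find_pattern(caribbean, pattern)  # [1]
general_matches = find_pattern(general, pattern)      # []
```

Counting such matches per country and normalizing by the amount of text from each country would then give the kind of comparison summarized in the list above.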