The corpus is composed of 2.0 billion words in two million web pages from 21 Spanish-speaking countries. The web pages were collected in late 2015, using the following process:
1. The list of web pages was created by running hundreds of high-frequency n-grams from the Corpus del Español (e.g. de lo que, y no es) against Google to generate essentially "random" web pages (presumably there would be no AdSense entries or meaningful page rankings for phrases like de lo que).
2. We repeated this process for each of the 21 different countries (e.g. Mexico, Colombia, Peru, Spain), limiting the results to each country with the Google "Advanced Search" [Region] function (a small sketch of this country-restricted query generation is given after the numbered steps). The question, of course, is how well Google knows which country a page comes from if it isn't marked by a country-specific top-level domain (e.g. .cr for Costa Rica).
As Google explains, "we'll rely largely on the site's (1) country domain (.ca, .de, etc.). If an international domain (.com, .org, .eu, etc.) has been used, we'll rely on several signals, including (2) IP address, (3) location information on the page, (4) links to the page, and (5) any relevant information from Google Places."
For example, for a .com address (where there is no country-specific domain), Google will try to use the IP address (which shows where the server is physically located). But even if that fails, Google could still see that 95% of the visitors to the site come from Costa Rica, and that 95% of the links to that page are from Costa Rica (and remember that Google knows both of these things), and it would then guess that the site is probably from Costa Rica. It isn't perfect, but it's very, very good, as is shown in the results from the dialect-oriented searches.
3. In addition to doing 21 different sets of searches (one for each of the 21 countries) with "General" Google searches (all web pages), we also repeated the process with Google "Blog" searches (using the Advanced / Region searches in both cases). The blog searches return only blogs, while the "General" searches included some blogs as well.
4. We then downloaded all of the two
million unique web pages using
HTTrack.
5. After this, we ran all of the two million web pages through JusText to remove boilerplate material (e.g. headers, footers, sidebars); a minimal usage sketch is shown after these steps. (Thanks, Michael Bean, for helping to set this up.)
6. Finally, we used n-gram matching to eliminate the remaining duplicate texts (a simple sketch of this kind of matching also follows these steps). Even more difficult was the removal of duplicate "snippets" of text that appear on multiple web pages (e.g. legal notices, or information on the creator of a blog or a newspaper columnist), which JusText didn't eliminate. There are undoubtedly still some duplicate texts, and we are continuing to work on this.
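To make steps 1 and 2 more concrete, here is a minimal sketch of how country-restricted queries could be generated from high-frequency n-grams. It is not the actual collection script: the n-gram and country lists are illustrative, and the use of Google's Custom Search JSON API with its cr= country-restrict parameter is an assumption standing in for the manual "Advanced Search" [Region] searches described above.

```python
# Hypothetical sketch: country-restricted queries built from
# high-frequency n-grams (a stand-in for the manual Advanced Search /
# Region searches of steps 1-2).
import itertools
from urllib.parse import urlencode

NGRAMS = ["de lo que", "y no es"]           # illustrative n-grams
COUNTRIES = ["MX", "CO", "PE", "ES", "CR"]  # illustrative subset of the 21

def search_url(ngram, country_code, api_key="YOUR_KEY", cx="YOUR_CX"):
    """Build a Custom Search JSON API request URL for one n-gram/country pair.

    The cr= value (e.g. "countryMX") is the API's country-restrict
    parameter, used here as an approximation of the Region filter.
    """
    params = {
        "key": api_key,
        "cx": cx,
        "q": f'"{ngram}"',                  # quoted to search the exact phrase
        "cr": f"country{country_code}",
    }
    return "https://www.googleapis.com/customsearch/v1?" + urlencode(params)

if __name__ == "__main__":
    for country, ngram in itertools.product(COUNTRIES, NGRAMS):
        print(search_url(ngram, country))
```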
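For step 5, the kind of boilerplate removal described above can be reproduced with the justext Python package (the tool named in that step); the input file name and the choice of the Spanish stoplist here are illustrative assumptions, not details of the actual pipeline.

```python
# Minimal boilerplate-removal sketch with the justext package
# (pip install justext); the input file name is illustrative.
import justext

def clean_html(html_bytes):
    """Return the main text of a page, dropping headers, footers, and sidebars."""
    paragraphs = justext.justext(html_bytes, justext.get_stoplist("Spanish"))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

if __name__ == "__main__":
    with open("page.html", "rb") as f:
        print(clean_html(f.read()))
```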
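For the n-gram matching in step 6, one generic approach is to break each document into word n-grams ("shingles") and discard any document whose shingle set overlaps heavily with a document already kept. The sketch below illustrates that idea with 5-grams and an arbitrary Jaccard-overlap threshold; it is not the corpus's actual de-duplication code, and a real pipeline would index shingles rather than compare every pair of documents.

```python
# Generic near-duplicate detection via word n-gram (shingle) overlap.
# The n-gram size and similarity threshold are arbitrary choices.

def shingles(text, n=5):
    """Set of word n-grams in a text (lower-cased, whitespace-tokenized)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def deduplicate(texts, n=5, threshold=0.5):
    """Keep each text only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for text in texts:
        s = shingles(text, n)
        if all(jaccard(s, prev) < threshold for prev in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept
```

The same shingling idea, applied at the paragraph level rather than the document level, is one way to catch the repeated "snippets" (legal notices, author blurbs) mentioned in step 6.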
While the categorization by country is very good overall, there is one exception: the texts from the United States. The problem is that when Google didn't know what country a text (or domain) was from, it categorized the text as being from the United States (as a kind of "default"). So most of the texts (and domains) that are supposedly from the United States are probably from another country.