other uses of the data
I've often thought that this sort of collation of data could be very useful for language learners.
There are plenty of basic things that scanning corpora like this can turn up. You can detect collocations that exist in the target language (eg, "take" and "bath" form a collocation in English) and distinguish that sort of association from more conceptual linkages. For example, when "president" appears, you're likely to see more vocabulary related to countries, laws, government, debates and so on, as well as particular current events or issues. More or less what the article says about "spaghetti" appearing more often with "food" than with "shoe".
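As a rough illustration, here's a minimal Python sketch that surfaces collocations from a tokenised corpus using pointwise mutual information (PMI); the function name, thresholds, and the choice of PMI scoring are my own, not anything from the article:

```python
import math
from collections import Counter

def collocations(tokens, min_count=5, top_n=20):
    """Score adjacent word pairs by pointwise mutual information (PMI).
    High-PMI pairs like ("take", "bath") co-occur far more often than
    their individual frequencies alone would predict.

    tokens: a list of word strings from the corpus, in order.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)  # approximate normaliser for both counts
    scored = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue  # PMI is very noisy for rare pairs, so skip them
        # PMI = log( p(w1,w2) / (p(w1) * p(w2)) )
        pmi = math.log((c * total) / (unigrams[w1] * unigrams[w2]))
        scored.append((pmi, w1, w2))
    return sorted(scored, reverse=True)[:top_n]
```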
Besides grouping new vocabulary and presenting related words to be learned together, in context, a computer-aided learning tool could use the data in a lot more ways, eg:
- grade vocab (and reading material) by frequency, to avoid overloading the learner (a rough sketch of this follows the list)
- build up a profile of what a person knows (and how well), including both vocab/grammar patterns and general knowledge (eg, "Trump" is a "president")
- automatically generate all sorts of review/comprehension questions based on the material
- be a lot more user-directed, letting learners follow up on areas or reading material that they're more interested in
- maybe even generate synthetic reading/teaching/testing material using events/grammar/vocab/common knowledge that exists within the corpus (eg, simple sentences, stories, or scenarios)
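For that first bullet, assuming you already have a whole-corpus frequency table, grading could be as simple as measuring how much of a text falls outside the learner's presumed vocabulary. The function name and the 2000-word cutoff here are just illustrative:

```python
from collections import Counter

def grade_text(text_tokens, corpus_freq, known_rank=2000):
    """Estimate reading difficulty as the share of a text's words that
    fall outside the learner's assumed vocabulary, taken to be the
    corpus's `known_rank` most frequent words. Higher score = harder.

    corpus_freq: a Counter of word frequencies over the whole corpus.
    Returns (difficulty score, the ten most frequent unknown words).
    """
    ranked = [w for w, _ in corpus_freq.most_common()]
    known = set(ranked[:known_rank])
    unknown = [w for w in text_tokens if w not in known]
    score = len(unknown) / max(len(text_tokens), 1)
    return score, Counter(unknown).most_common(10)
```

The unknown-word list doubles as a pre-teaching list: exactly the vocabulary to present before handing the learner that text.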
Maybe it's too much to expect a machine learning system to do all of this unsupervised, but still, you could have it at least generate different kinds of material and use crowd-sourcing to weed out errors or re-train the thing. Lots of ways to have a hybrid human/computer system.
The other big use that I've often thought about is automatic classification of documents. I've got tons of PDF files downloaded from the net, but no actual filing system for them. One simple way of clustering similar documents together is to do a frequency analysis of the words in each document and then throw out the most common words in the language (like "it", "for", "and", "the", etc.). The remaining top ten words, say, should give a pretty good idea of the topic of that document. Basic statistical clustering like this should go a long way towards finding relevant/related documents on a given topic, but there seems to be so much more that could be done with AI/machine learning techniques.
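A bare-bones sketch of that idea, with a toy stopword list (a real one would need to be far longer) and a simple cosine-similarity measure for comparing documents, both of my own choosing:

```python
import re
from collections import Counter

# Toy stopword list; in practice you'd use a much longer one.
STOPWORDS = {"the", "and", "for", "it", "a", "of", "to", "in", "is",
             "that", "on", "with", "as", "this", "by", "are", "be"}

def top_terms(text, n=10):
    """Crude topic signature: the most frequent words in a document
    after dropping stopwords and very short words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words
                     if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(n)

def similarity(terms_a, terms_b):
    """Cosine similarity between two term-frequency signatures;
    documents on the same topic should share high-frequency
    content words and so score close to 1.0."""
    a, b = dict(terms_a), dict(terms_b)
    if not a or not b:
        return 0.0
    dot = sum(c * b.get(w, 0) for w, c in a.items())
    norm_a = sum(c * c for c in a.values()) ** 0.5
    norm_b = sum(c * c for c in b.values()) ** 0.5
    return dot / (norm_a * norm_b)
```

Running top_terms over every extracted PDF and then linking pairs whose similarity exceeds some threshold would give a crude but workable "related documents" grouping.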