Turnitin
How is this different than Turnitin? Other than it is free...
A group of Adelaide researchers has released an open-source tool that helps identify document authorship by comparing texts. While their own test cases – and therefore the headlines – concentrated on identifying the authors of historical documents, it seems to The Register that any number of modern uses of such a tool might …
Isn't the point that Turnitin identifies when texts are the same (it identifies when text A is pretty much the same as text B, probably because it's been ripped off) but that this software identifies the author of two different texts? In other words, it'll tell you if the same person (probably) wrote Twelfth Night and As You Like It, but not whether As You Like It by T Mangrove is the same as As You Like It by W Shakespeare.
Does anyone have any experience of working with Turnitin etc? Does it work?
In my experience, Turnitin is a steaming pile of horse manure that spews false positives.
Apart from describing my work as plagiarism for using the same page-numbering template in the header of a Word document as a student from Manchester University, it enjoys highlighting my use of three common words together as a form of intellectual theft.
Anyway, I believe they are different technologies. Turnitin checks for roughly the same content, structure, sentences etc. to establish originality whereas this project assesses the style of a text to see if it matches samples of an author's known texts to establish authorship.
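To make the "roughly the same content" side of that distinction concrete, here is a minimal sketch of overlap checking based on shared three-word sequences. It is not Turnitin's actual algorithm, which isn't described anywhere in this thread; the function names and sample sentences are made up for illustration.

```python
import re

def word_ngrams(text, n=3):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(a, b, n=3):
    """Jaccard similarity of the two texts' n-gram sets, from 0.0 to 1.0."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Illustrative inputs only.
original = "If music be the food of love, play on"
copied = "If music be the food of love, play on and on"
unrelated = "All the world's a stage, and all the men and women merely players"

print(overlap_score(original, copied))     # high: most trigrams are shared
print(overlap_score(original, unrelated))  # no trigrams in common
```

A score like this only says how much wording two texts share. It says nothing about who wrote either of them, and it will happily flag any stock three-word phrase, which fits the false positives described above.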
I have yet to be really convinced of the use of Turnitin. For most of the work it is asked to do in universities, there is always going to be a high level of correlation, because the answer you want, and the references to support it, are going to be the same as thousands of other answers on the same topic. I have found it useful three times for spotting essays that are (literally) nothing but stitched-together collections of copy-and-pasted sections from available web sources (but I'm well enough acquainted with my subject to have recognised many of them anyway), and for giving evidence to students that they *must* reference what they have used from elsewhere (though Turnitin doesn't recognise when something has been referenced, so a properly referenced piece of work and an improperly referenced one will get the same similarity score).
Where it might be useful is for more freehand topics, such as dissertations, but I haven't done that yet.
really great writers are masters at generating their own style ... especially in English, where someone who bothered to study it (rare, I admit) can choose from *at least* 3 sources to achieve an effect (Celtic, Saxon, Latin, French). It can be great fun to rewrite prose, changing Latin words to Saxon, or Saxon to French.
Gonna b using; grammar, and punctuation also... not just vocabulary. innit?
Even so, I agree that it's almost certainly utterly trivial to defeat, should one feel inclined. That doesn't necessarily make it completely useless. The kiddies would probably find it easier to do their own homework than carefully transcribe a friend's. I can also envisage plenty of filtering applications where the author wouldn't be attempting to conceal their identity.
"Generating their own style" is precisely what improves the accuracy of the classification models this sort of approach uses - or would be, if that characterization wasn't naive to the point of fallacy.[1]
The point of this sort of work, which is by no means new,[2] is to build a classifier for traits thought to be relatively invariant for a given writer. This gives you a feature vector which, to the extent the model is sensitive and accurate,[3] uniquely identifies an author with a given probability.
It's basically what I Write Like does.
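For anyone who wants to see the feature-vector idea in miniature, here is a rough sketch. It assumes function-word frequencies as the "relatively invariant" traits and cosine similarity as the comparison. Both are my choices for illustration, not the method used by the Adelaide tool or by I Write Like, and the ten-word list is far too small for real work.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny feature set; real studies use many more features.
FUNCTION_WORDS = ("the", "and", "of", "to", "a", "in", "that", "it", "but", "not")

def style_vector(text):
    """Relative frequency of each function word in text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty text
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def most_likely_author(unknown_text, known_samples):
    """Return the author whose sample is stylistically closest.

    known_samples maps an author name to a sample of that author's text.
    """
    target = style_vector(unknown_text)
    return max(
        known_samples,
        key=lambda name: cosine_similarity(target, style_vector(known_samples[name])),
    )
```

With samples this small the answer is noise, of course. The point is only that each author is reduced to a vector, and attribution becomes a nearest-neighbour question over those vectors.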
[1] Prose styles are epiphenomena heavily determined by culture, education, and personal experience; even writers who have expressly set out to create "new" styles (more common in poetry than in prose) can and have been shown to be substantially influenced by identifiable sources and full of intertextual references. This is true even of the most strikingly novel styles, as of the high modernists. Take a look at a scholarly edition of Joyce's Ulysses, for example. Really, I understand the widespread resistance to poststructuralist theory among the middlebrow, but has even structuralism failed to penetrate? I suppose it has.
[2] The heuristic identification of authorship by textual evidence is one of the oldest areas of textual studies. It was widely researched and practiced back in the days of the philologists around the turn of the twentieth century, and variations on it go back at least as far as the scholastics in the European tradition. Applying IT to the problem is also a well-established field.
[3] More precisely, the ability of the model to identify authorship with probability P is a function of the model's "recall" (the fraction of genuine matches the model finds, i.e. one minus the false-negative rate) and "precision" (the fraction of the matches the model reports that are actually correct). These are typically condensed into a single measure such as F1, the harmonic mean of the two, which weighs recall and precision equally.
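As a worked example of the measures in footnote 3, here is a short sketch using the standard definitions: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The counts in the usage line are invented purely to show the arithmetic.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from raw classification counts."""
    predicted = true_pos + false_pos  # everything the model flagged
    actual = true_pos + false_neg     # everything it should have flagged
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Invented counts: 8 correct attributions, 2 wrong ones, 4 missed.
precision, recall, f1 = precision_recall_f1(8, 2, 4)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, it is dragged towards whichever of the two is worse, so a model cannot score well by maximising one and ignoring the other.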
Paul can write a letter and, because he's a popular guy, someone may translate it for another audience. Fast forward a couple thousand years and the original may no longer be around, only the translated copy. That doesn't mean he's not responsible for the translated version.
I suspect that this would indeed pose a threat to anonymity on the Internet - it depends a bit how much data the tool needs to arrive at a sufficiently acceptable probability (having said that, if the TSA's use of probability is any measure, 1% is probably enough).
However, the real fun comes from defeating such analysis. I have no idea how that would be done, but it strikes me as an interesting exercise. Not for any nefarious reason (although it's easy to dream up some), just because :).
the technique which will preserve your anonymity and allow you to preserve all your sock puppets (at least for the time being) is to create your draft in your native language, mince it through one or more translators and then back into your native language. Correct the errors. Post. That's how I did the other posts on this page without anyone spotting me. Oops.
On a more semi-serious note, has anyone got around to running Shakespeare's texts through this software to see if Christopher Marlowe (or any other contenders) show up as suspects?
has anyone got around to running Shakespeare's texts through this software to see if Christopher Marlowe (or any other contenders) show up as suspects?
Your local research library should have shelves full of books of textual scholarship on Shakespeare. Using this new software - which may not even employ any novel approaches; I haven't read the original article - would be a drop in the bucket.
In any case, contributing to the debate on the authorship of Shakespearean apocrypha like Arden of Faversham would probably be more interesting, though with most of those works (as with many attributed to Shakespeare) the canonical versions are probably the work of multiple people.