
UMINF 16.06

Syntactic methods for topic-independent authorship attribution

The efficacy of syntactic features for topic-independent authorship attribution is evaluated, taking a feature set of frequencies of words and punctuation marks as baseline. The features are 'deep' in the sense that they are derived by parsing the subject texts, in contrast to 'shallow' syntactic features for which a part-of-speech analysis is enough. The experiments are conducted on a corpus of novels written around the year 1900 by 20 different authors, and cover two tasks. In the first task, text samples are taken from books by one author, and the goal is to pair samples from the same book. In the second task, text samples are taken from several authors, but only one sample from each book, and the goal is to pair samples from the same author. In the first task, the baseline feature set outperformed the syntax-based feature set, but for the second task, the outcome was the opposite. This suggests that, compared to lexical features such as vocabulary and punctuation, syntactic features are more robust to changes in topic.
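To make the baseline and the pairing tasks concrete, the following is a minimal sketch, not the report's implementation. It assumes a simple regular-expression tokenizer, relative frequencies of words and punctuation marks as features, and cosine similarity as the measure used to pair samples. The abstract specifies none of these choices, and the names `relative_frequencies`, `cosine_similarity`, and `most_similar` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed tokenizer: runs of letters are words, each punctuation mark is a token.
TOKEN_RE = re.compile(r'[A-Za-z]+|[.,;:!?"()-]')


def relative_frequencies(text):
    """Map each word or punctuation mark to its relative frequency in the text."""
    tokens = TOKEN_RE.findall(text.lower())
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {token: n / total for token, n in counts.items()}


def cosine_similarity(a, b):
    """Cosine similarity between two sparse frequency vectors stored as dicts."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(index, samples):
    """Return the index of the sample most similar to samples[index]."""
    vectors = [relative_frequencies(sample) for sample in samples]
    candidates = [j for j in range(len(samples)) if j != index]
    return max(
        candidates,
        key=lambda j: cosine_similarity(vectors[index], vectors[j]),
    )


if __name__ == "__main__":
    samples = [
        "the cat sat, and the cat slept.",
        "a dog ran; a dog barked!",
        "the cat ate, and the cat purred.",
    ]
    # Pair the first sample with its nearest neighbour among the others.
    print(most_similar(0, samples))
```

A syntax-based variant would keep the pairing step and replace `relative_frequencies` with counts of structures taken from parse trees, which requires a parser and is not sketched here.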


Keywords: Authorship attribution, syntactic features


Johanna Björklund and Niklas Zechner

Entry responsible: Johanna Björklund

Page Responsible: Frank Drewes