This project is a modification of the tool described in this paper: https://graphics.cs.wisc.edu/Papers/2014/AKVWG14/Preprint.pdf
We plan to apply it to the following datasets:
- Religious Texts: https://www.kaggle.com/metron/public-files-of-religious-and-spiritual-texts
- Wikipedia Articles: https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia
- Hillary Clinton Emails: https://www.kaggle.com/kaggle/hillary-clinton-emails
- News articles: https://www.kaggle.com/snapcrack/all-the-news
We are using LDA and Word2Vec for topic modelling and text comparison.
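
As a rough illustration of the text comparison step, the sketch below averages word vectors per document and compares two documents by cosine similarity. This is only one common way to use Word2Vec-style embeddings and may not match what this project ends up doing. The `TOY_VECTORS` table and the function names are hypothetical: the vectors are hand-written stand-ins for embeddings that a trained Word2Vec model would supply.

```python
import math

# Hypothetical 3-dimensional vectors standing in for trained Word2Vec
# embeddings. The values are invented for illustration only.
TOY_VECTORS = {
    "god": [0.9, 0.1, 0.0],
    "prayer": [0.8, 0.2, 0.1],
    "faith": [0.85, 0.15, 0.05],
    "email": [0.1, 0.9, 0.2],
    "server": [0.0, 0.8, 0.3],
    "meeting": [0.2, 0.7, 0.1],
}


def document_vector(tokens, vectors):
    """Average the vectors of all known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def compare_texts(text_a, text_b, vectors=TOY_VECTORS):
    """Similarity of two texts; 0.0 if either has no known words."""
    vec_a = document_vector(text_a.lower().split(), vectors)
    vec_b = document_vector(text_b.lower().split(), vectors)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine_similarity(vec_a, vec_b)


if __name__ == "__main__":
    print(compare_texts("god prayer", "faith prayer"))
    print(compare_texts("god prayer", "email server"))
```

With real data, the toy table would be replaced by vectors from a model trained on one of the datasets above, and the whitespace split by a proper tokenizer.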
The previous implementation can be found here: https://github.com/uwgraphics/SerendipSlim or you can visit the project website: http://vep.cs.wisc.edu/serendip/