ICWSM 2011 spinn3r
This work uses the ICWSM 2011 spinn3r dataset. According to its documentation, it contains a large number of social media posts and online blogs, as well as news articles. The dataset is very large (3 TB), so we expect to access it from the cluster. It was the subject of a Data Challenge, and we read the only paper that participated in that challenge. We found that there is room to improve on its analysis: the authors did not tackle the popular demands or the non-English-language data, and their sentiment analysis did not bring useful insights. To process the textual data, we use NLP and Information Retrieval methods to filter words and extract the most frequent topics.
More information is available on the SPINN3R website.
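
The word filtering and frequency counting mentioned above could look roughly like the following. This is only a minimal sketch using the Python standard library, not the pipeline actually run on the cluster: the stop-word list, the `tokenize` and `top_terms` helpers, and the sample posts are all hypothetical, and the tokenizer only handles ASCII letters, so it would not be suitable for the non-English part of the data.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# a real run would use a much fuller list, per language.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it", "on", "for"}


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_terms(posts, k=3):
    """Count non-stop-word tokens across all posts and return the k most frequent."""
    counts = Counter()
    for post in posts:
        counts.update(token for token in tokenize(post) if token not in STOP_WORDS)
    return counts.most_common(k)


if __name__ == "__main__":
    # Made-up example posts, not taken from the dataset.
    sample_posts = [
        "The election is on",
        "Election results and the election",
    ]
    print(top_terms(sample_posts, k=2))
```

Raw term counts are only a first approximation of "topics"; they are used here because they are the simplest way to illustrate the filtering step.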