Code & Datasets


SIGIR 2009 Paper

This is the partial dataset used in the paper “A classification-based approach to question answering in discussion boards,” in SIGIR ’09: Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, New York, NY, USA, 2009, pp. 171-178.


Warning: 1) This dataset is not maintained, so you will need to clean it and transform it into a more usable format. 2) The question detection data for the Ubuntu forum is missing. I will try my best to find it, but there is no guarantee that it can be recovered.


The dataset consists of 5 files:
1) ubuntu_threads: content of threads crawled from Ubuntu Forum
2) ubuntu_answer_label: labels indicating which level (post) is the answer for each thread
3) photograph_threads: content of threads crawled from Photograph Forum
4) photograph_answer_label: labels indicating which level (post) is the answer for each thread
5) photograph_question_label: labels indicating whether each thread is a question thread (1 represents a question thread and 2 represents a non-question thread)

Example format in “ubuntu_threads” and “photograph_threads”

THREAD:t-13471 (thread ID; corresponds to the thread IDs in the label files)
LEVEL:1 (the position of the post in the thread; it can also be treated as the post ID)
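
As a rough starting point for cleaning the thread files, the following Python sketch groups lines by the THREAD: and LEVEL: markers shown above. Only those two markers come from the dataset description; the function name parse_threads, the regular expressions, and the assumption that every other non-empty line is post text belonging to the most recent LEVEL: marker are illustrative guesses and may need adjusting against the actual files.

```python
import re

# Only the THREAD: and LEVEL: prefixes are taken from the dataset description.
# Everything else here (names, treatment of other lines) is an assumption.
THREAD_RE = re.compile(r"^THREAD:(\S+)")
LEVEL_RE = re.compile(r"^LEVEL:(\d+)")


def parse_threads(lines):
    """Group lines into {thread_id: {level: [text lines]}}."""
    threads = {}
    thread_id = None
    level = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = THREAD_RE.match(line)
        if match:
            # A new thread starts; reset the current post.
            thread_id = match.group(1)
            threads[thread_id] = {}
            level = None
            continue
        match = LEVEL_RE.match(line)
        if match and thread_id is not None:
            # A new post (level) starts within the current thread.
            level = int(match.group(1))
            threads[thread_id][level] = []
            continue
        # Assumed: any other non-empty line is text of the current post.
        if thread_id is not None and level is not None and line.strip():
            threads[thread_id][level].append(line)
    return threads
```

With a structure like this, an answer label for a thread can be looked up as threads[thread_id][answer_level], assuming the label files pair a thread ID with a level number.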