This document provides all the details needed to access the research collection reported in the paper: David Losada, Fabio Crestani. “A Test Collection for Research on Depression and Language Use”. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, 7th International Conference of the CLEF Association, CLEF 2016, Évora, Portugal, September 5-8, 2016. Bibtex: bibtex file

Any scientific publication derived from the use of this collection should explicitly refer to this CLEF 2016 paper.

The collection is available for research purposes under proper user agreements.

Data

The collection contains textual interactions (posts or comments) from 892 users. 137 subjects have explicitly declared that they have been diagnosed with depression, and the remaining 755 subjects form a control group. For each subject, a (usually long) history of writings (posts or comments from a social networking site) is available. This history is stored as an XML file (one per subject) with the following structure:

<INDIVIDUAL>
<ID> ... </ID>
<WRITING>
<TITLE> ...   </TITLE>
<DATE> ... </DATE>
<INFO> ... </INFO>
<TEXT> ...  </TEXT>
</WRITING>
<WRITING>
<TITLE> ... </TITLE>
<DATE> ... </DATE>
<INFO> ... </INFO>
<TEXT> ... </TEXT>
</WRITING>
....
</INDIVIDUAL>

ID: contains the anonymised id of the subject

TITLE: title of the post, if available (empty for comments)

DATE: date of the writing

INFO: additional info about the writing (source of the post/comment)

TEXT: body of the post or comment
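
As an illustration of how a subject file with the structure above might be read, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. It is not the authors' code. The sample document, the function name parse_subject, the subject id, the date format and the texts are all hypothetical placeholders; only the tag names come from the structure described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature subject file following the structure above.
# The id, dates, sources and texts are invented placeholders,
# not data from the collection; the date format is an assumption.
SAMPLE = """<INDIVIDUAL>
<ID> subject0000 </ID>
<WRITING>
<TITLE> An example post </TITLE>
<DATE> 2015-01-01 12:00:00 </DATE>
<INFO> example source </INFO>
<TEXT> Body of the post. </TEXT>
</WRITING>
<WRITING>
<TITLE>  </TITLE>
<DATE> 2015-01-02 08:30:00 </DATE>
<INFO> example source </INFO>
<TEXT> Body of a comment. </TEXT>
</WRITING>
</INDIVIDUAL>"""


def parse_subject(xml_text):
    """Return (subject_id, writings) for one subject's XML document."""
    root = ET.fromstring(xml_text)
    subject_id = (root.findtext("ID") or "").strip()
    writings = []
    for node in root.findall("WRITING"):
        # Missing or empty fields are normalised to the empty string.
        writings.append({
            field.lower(): (node.findtext(field) or "").strip()
            for field in ("TITLE", "DATE", "INFO", "TEXT")
        })
    return subject_id, writings


subject_id, writings = parse_subject(SAMPLE)
# Comments are the writings whose TITLE is empty.
comments = [w for w in writings if not w["title"]]
print(subject_id, len(writings), len(comments))
```

To read a real file, the same function could be called on the contents of the file (for instance, the result of open(path, encoding="utf-8").read(), where the path and the encoding are again assumptions).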

The CLEF 2016 paper cited above reports some initial experiments on an early risk detection task, together with baseline runs for a number of basic detection strategies. To facilitate comparison against these runs, we will also provide details on the training/test split of the collection: the training set contains 486 users (83 positive, 403 negative) and the test set contains 406 users (54 positive, 352 negative).
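
The split sizes quoted above can be cross-checked against the collection totals with a few lines of Python. This is only a sanity-check sketch: the dictionary layout and the function name split_summary are hypothetical, and the numbers are the ones stated in this document.

```python
# Subject counts per split, as stated in the text above.
SPLITS = {
    "training": {"positive": 83, "negative": 403},
    "test": {"positive": 54, "negative": 352},
}


def split_summary(splits):
    """Compute the size and the share of positive subjects of each split."""
    summary = {}
    for name, counts in splits.items():
        total = counts["positive"] + counts["negative"]
        summary[name] = {
            "total": total,
            "positive_rate": counts["positive"] / total,
        }
    return summary


summary = split_summary(SPLITS)
total_positive = sum(c["positive"] for c in SPLITS.values())
total_negative = sum(c["negative"] for c in SPLITS.values())

# The splits should add up to the collection totals given in the Data
# section: 137 depressed subjects, 755 controls, 892 users overall.
assert total_positive == 137
assert total_negative == 755
assert total_positive + total_negative == 892

for name, row in summary.items():
    print(f"{name}: {row['total']} users, {row['positive_rate']:.1%} positive")
```

Both splits are imbalanced, with the positive class making up roughly 17% of the training set and roughly 13% of the test set.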

User agreement

This collection can only be used for research purposes. If you are interested in accessing this data, please fill in the following user agreement and send it to david.losada@usc.es.

Our code

We will also be happy to share our (Python) code with other research teams. If you are interested in reproducing our early risk experiments, please feel free to contact us.