Data

The collection contains textual interactions (posts or comments) from 892 users. Of these, 137 subjects have explicitly declared that they have been diagnosed with depression, and the remaining 755 subjects form a control group. For each subject, a (usually long) history of writings (posts or comments from a social networking site) is available. Each history is stored as an XML file (one per subject) with the following structure:

<INDIVIDUAL>
<ID> ... </ID>
<WRITING>
<TITLE> ...   </TITLE>
<DATE> ... </DATE>
<INFO> ... </INFO>
<TEXT> ...  </TEXT>
</WRITING>
<WRITING>
<TITLE> ... </TITLE>
<DATE> ... </DATE>
<INFO> ... </INFO>
<TEXT> ... </TEXT>
</WRITING>
....
</INDIVIDUAL>

ID: contains the anonymised id of the subject

TITLE: title of the post, if available (for a comment, TITLE is empty)

DATE: date of the post or comment

INFO: additional info about the writing (source of the post/comment)

TEXT: body of the post or comment
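As an illustration of the structure above, here is a minimal Python sketch that reads one subject's history with the standard library. It is not the authors' code: the names `Writing`, `parse_subject` and `SAMPLE` are hypothetical, the sample values are made up, and the sketch assumes each file is well-formed XML that follows the layout shown above.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Writing:
    """One post or comment (hypothetical container, not part of the collection)."""
    title: str  # empty for comments
    date: str   # kept as a raw string; the date format is not assumed here
    info: str
    text: str


def _field(node: ET.Element, tag: str) -> str:
    # findtext returns None if the tag is missing and "" if it has no text.
    return (node.findtext(tag) or "").strip()


def parse_subject(xml_string: str) -> Tuple[str, List[Writing]]:
    """Return the subject id and the list of writings, in document order."""
    root = ET.fromstring(xml_string)  # root element is <INDIVIDUAL>
    subject_id = _field(root, "ID")
    writings = [
        Writing(
            title=_field(w, "TITLE"),
            date=_field(w, "DATE"),
            info=_field(w, "INFO"),
            text=_field(w, "TEXT"),
        )
        for w in root.findall("WRITING")
    ]
    return subject_id, writings


# Made-up sample that mirrors the structure; none of these values come from the data.
SAMPLE = """<INDIVIDUAL>
<ID> subject0001 </ID>
<WRITING>
<TITLE>  </TITLE>
<DATE> example-date-1 </DATE>
<INFO> example source </INFO>
<TEXT> A made-up comment. </TEXT>
</WRITING>
<WRITING>
<TITLE> A made-up title </TITLE>
<DATE> example-date-2 </DATE>
<INFO> example source </INFO>
<TEXT> A made-up post body. </TEXT>
</WRITING>
</INDIVIDUAL>"""

subject_id, writings = parse_subject(SAMPLE)
n_comments = sum(1 for w in writings if not w.title)  # empty TITLE marks a comment
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`. If a file turns out not to be strictly well-formed, a more tolerant parser would be needed.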

The CLEF 2016 paper cited above reports some initial experiments: an early risk detection task was performed, and baseline runs are provided for a number of basic detection strategies. To facilitate comparison against these runs, we will also provide details on the training/test split of the collection. The training set contains 486 users (83 positive and 403 negative) and the test set contains 406 users (54 positive and 352 negative).
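The split sizes quoted above can be cross-checked against the collection totals with a few lines of Python. This is only a sanity-check sketch: the counts are copied from the text, and the names `SPLITS` and `split_summary` are hypothetical.

```python
# Counts copied from the description above.
SPLITS = {
    "training": {"positive": 83, "negative": 403},
    "test": {"positive": 54, "negative": 352},
}


def split_summary(splits):
    """Return the total size and the fraction of positive subjects per split."""
    summary = {}
    for name, counts in splits.items():
        total = counts["positive"] + counts["negative"]
        summary[name] = {
            "total": total,
            "positive_rate": counts["positive"] / total,
        }
    return summary


summary = split_summary(SPLITS)

# The two splits together should account for all 892 users
# (137 depressed subjects and 755 controls).
all_users = sum(s["total"] for s in summary.values())
all_positive = sum(c["positive"] for c in SPLITS.values())
all_negative = sum(c["negative"] for c in SPLITS.values())
```

Both splits are imbalanced towards the control group, which is worth keeping in mind when comparing results against the baseline runs.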

User agreement

This collection can only be used for research purposes. If you are interested in accessing this data, please fill in the following user agreement and send it to david.losada@usc.es.

Our code

We will also be happy to share our (Python) code with other research teams. If you are interested in reproducing our early risk experiments, please feel free to contact us.

A Test Collection for Research on Depression and Language use
