This document provides all details needed to reproduce the experiments reported in the paper D. Losada, J. Parapar, A. Barreiro. “When to Stop Making Relevance Judgments? A Study of Stopping Methods for Building Information Retrieval Test Collections”. Journal of the Association for Information Science and Technology, 2018 (to appear).
Any scientific publication derived from the use of this software should explicitly refer to this publication.
Next, we explain the data used for experimentation and provide our R code, which implements all stopping methods.
We used the following TREC collections (http://trec.nist.gov): four adhoc (TREC5, TREC6, TREC7, TREC8) and three Clinical Decision Support Track (CT14, CT15, CT16).
NIST kindly provided the runs that contributed to the pools of these tasks (http://trec.nist.gov/data/intro_eng.html).
The pooled runs are archived by NIST within a password protected area. If you want to reproduce our experiments you need to request access to the protected area (follow the instructions given at http://trec.nist.gov/results.html).
TREC5
101 runs in the pool (77 adhoc + 24 other).
The 77 adhoc runs are: input.anu5aut1 input.anu5aut2 input.anu5man4 input.anu5man6 input.brkly15 input.brkly16 input.brkly17 input.brkly18 input.city96a1 input.city96a2 input.CLCLUS input.CLTHES input.colm1 input.colm4 input.Cor5A1se input.Cor5A2cr input.Cor5M1le input.Cor5M2rf input.Ctifr1 input.Ctifr2 input.DCU961 input.DCU962 input.DCU963 input.DCU964 input.DCU969 input.DCU96C input.DCU96D input.erliA1 input.ETHal1 input.ETHas1 input.ETHme1 input.fsclt3 input.fsclt4 input.genrl1 input.genrl2 input.genrl3 input.genrl4 input.glair4 input.gmu96au1 input.gmu96au2 input.gmu96ma1 input.gmu96ma2 input.ibmgd1 input.ibmgd2 input.ibmge1 input.ibmge2 input.ibms96a input.ibms96b input.INQ301 input.INQ302 input.KUSG2 input.KUSG3 input.LNaDesc1 input.LNaDesc2 input.LNmFull1 input.LNmFull2 input.mds001 input.mds002 input.mds003 input.Mercure-al input.Mercure-as input.MONASH input.pircsAAL input.pircsAAS input.pircsAM1 input.pircsAM2 input.sdmix1 input.sdmix2 input.umcpa1 input.uncis1 input.uncis2 input.UniNE7 input.UniNE8 input.uwgcx0 input.uwgcx1 input.vtwnA1 input.vtwnB1
The other 24 runs are: input.anu5mrg0 input.anu5mrg1 input.anu5mrg7 input.CLATMC input.CLATMN input.CLPHR0 input.CLPHR1 input.CLPHR2 input.fsclt3m input.genlp1 input.genlp2 input.genlp3 input.genlp4 input.MTRa961 input.sbase1 input.sbase2 input.UniNE0 input.UniNE9 input.xerox_nlp1 input.xerox_nlp2 input.xerox_nlp3 input.xerox_nlp4 input.xerox_nlp5 input.xerox_nlp6
50 queries: TREC topics #251-#300 (available at http://trec.nist.gov/data/topics_eng/topics.251-300.gz). The text of the queries is not needed for the pooling experiments; only the relevance judgments (qrels) are required, and they are available at http://trec.nist.gov/data/qrels_eng/qrels.251-300.parts1-5.tar.gz.
TREC6
46 runs in the pool (31 adhoc + 15 other).
The 31 adhoc runs are: input.aiatB1 input.anu6min1 input.att97ac input.Brkly23 input.city6al input.CLREL input.Cor6A3cll input.csiro97a1 input.DCU97lnt input.fsclt6 input.gerua1 input.glair64 input.gmu97au1 input.harris1 input.ibmg97b input.ibms97a input.INQ401 input.ispa1 input.iss97man input.jalbse0 input.jhuapln input.LNaShort input.mds601 input.Mercure2 input.nmsu2 input.nsasg1 input.pirc7Aa input.umcpa197 input.unc6ma input.uwmt6a0 input.VrtyAH6a
The other 15 runs are: input.Cor6HP1 input.Cor6HP2 input.Cor6HP3 input.DCU97HP input.genlp1 input.Gla6DS1 input.otc1 input.otc2 input.otc3 input.pirc7Ha input.pirc7Hd input.pirc7Ht input.uwmt6h0 input.uwmt6h1 input.uwmt6h2
50 queries: TREC topics #301-#350 (available at http://trec.nist.gov/data/topics_eng/topics.301-350.gz). The text of the queries is not needed for the pooling experiments; only the relevance judgments (qrels) are required, and they are available at http://trec.nist.gov/data/qrels_eng/qrels.trec6.adhoc.parts1-5.tar.gz.
TREC7
84 runs in the pool (77 adhoc + 7 other).
The 77 adhoc runs are: input.acsys7al input.acsys7mi input.AntHoc01 input.APL985LC input.APL985SC input.att98atdc input.att98atde input.bbn1 input.Brkly25 input.Brkly26 input.CLARIT98CLUS input.CLARIT98COMB input.Cor7A1clt input.Cor7A3rrf input.dsir07a01 input.dsir07a02 input.ETHAC0 input.ETHAR0 input.FLab7ad input.FLab7at input.fsclt7a input.fsclt7m input.fub98a input.fub98b input.gersh1 input.gersh2 input.harris1 input.ibmg98a input.ibmg98b input.ibms98a input.ibms98b input.ic98san3 input.ic98san4 input.iit98au1 input.iit98ma1 input.INQ501 input.INQ502 input.iowacuhk1 input.iowacuhk2 input.jalbse011 input.jalbse012 input.KD70000 input.KD71010s input.kslsV1 input.lanl981 input.LIArel2 input.LIAshort2 input.LNaTitDesc7 input.LNmanual7 input.mds98t input.mds98td input.MerAdRbtnd input.MerTetAdtnd input.nectitech input.nectitechdes input.nsasgrp3 input.nsasgrp4 input.nthu1 input.nthu2 input.nttdata7Al0 input.nttdata7Al2 input.ok7am input.ok7ax input.pirc8Aa2 input.pirc8Ad input.ScaiTrec7 input.t7miti1 input.tno7exp1 input.tno7tw4 input.umd98a1 input.umd98a2 input.unc7aal1 input.unc7aal2 input.uoftimgr input.uoftimgu input.uwmt7a1 input.uwmt7a2
The other 7 runs are: input.acsys7hp input.Cor7HP1 input.Cor7HP2 input.Cor7HP3 input.pirc8Ha input.uwmt7h1 input.uwmt7h2
50 queries: TREC topics #351-#400 (available at http://trec.nist.gov/data/topics_eng/topics.351-400.gz). The text of the queries is not needed for the pooling experiments; only the relevance judgments (qrels) are required, and they are available at http://trec.nist.gov/data/qrels_eng/qrels.trec7.adhoc.parts1-5.tar.gz.
TREC8
71 runs in the pool (all adhoc).
The runs are: input.1 input.8manexT3D1N0 input.acsys8alo input.acsys8amn input.AntHoc1 input.apl8c221 input.apl8n input.att99atdc input.att99atde input.cirtrc82 input.CL99SD input.CL99XT input.disco1 input.Dm8Nbn input.Dm8TFbn input.Flab8as input.Flab8atdn input.fub99a input.fub99tf input.GE8ATDN1 input.ibmg99a input.ibmg99b input.ibms99a input.ibms99b input.ic99dafb input.iit99au1 input.iit99ma1 input.INQ603 input.INQ604 input.isa25 input.isa50 input.kdd8ps16 input.kdd8qe01 input.kuadhoc input.mds08a3 input.mds08a4 input.Mer8Adtd1 input.Mer8Adtd2 input.MITSLStd input.MITSLStdn input.nttd8ale input.nttd8alx input.ok8alx input.ok8amxc input.orcl99man input.pir9Aatd input.pir9Attd input.plt8ah1 input.plt8ah2 input.READWARE input.READWARE2 input.ric8dpx input.ric8tpx input.Sab8A1 input.Sab8A2 input.Scai8Adhoc input.surfahi1 input.surfahi2 input.tno8d3 input.tno8d4 input.umd99a1 input.unc8al32 input.unc8al42 input.UniNET8Lg input.UniNET8St input.UT810 input.UT813 input.uwmt8a1 input.uwmt8a2 input.weaver1 input.weaver2
50 queries: TREC topics #401-#450 (available at http://trec.nist.gov/data/topics_eng/topics.401-450.gz). The text of the queries is not needed for the pooling experiments; only the relevance judgments (qrels) are required, and they are available at http://trec.nist.gov/data/qrels_eng/qrels.trec8.adhoc.parts1-5.tar.gz.
CT14
102 runs in the pool.
The runs are: input.atigeo1 input.atigeo2 input.atigeo3 input.atigeo4 input.atigeo5 input.bacon input.BiTeMSIBtex1 input.BiTeMSIBtex2 input.BiTeMSIBtex3 input.BiTeMSIBtex4 input.BiTeMSIBtex5 input.BM25 input.BM25EXP input.bolgogi input.DAIICTdqep input.DAIICTdqer8 input.DAIICTf input.DAIICTsqer8 input.DAIICTzf input.descript50ex input.ecnuBig input.ecnuIndex input.ecnuSmall input.ecnuWeight input.GuHNegProxL input.GuHSINeg input.GuHSINegL input.GuHSNegProxH input.GuHSNegProxL input.hltcoe5drf input.hltcoe5s input.hltcoe5srf input.hltcoewsrf input.icd input.icdqe input.InL2c1 input.InL2c1EXP input.IRGURUN1 input.IRGURUN2 input.IRGURUN3 input.IRGURUN4 input.IRGURUN5 input.KISTI01 input.KISTI02 input.KISTI03 input.KISTI04 input.KISTI05 input.manual input.MERCK1 input.MERCK2 input.MERCK3 input.mesh input.MIIjmab input.MIIjmboost input.MIIjmignore input.MIItfauto input.MIItfman input.myrun input.NOVASEARCH1 input.NOVASEARCH2 input.NOVASEARCH3 input.NOVASEARCH4 input.NOVASEARCH5 input.ohsuAbstDef input.ohsuBodyDef input.ohsuOrigAbst input.ohsuOrigBody input.origexp input.prise1 input.prna1 input.Run1BoWC input.Run2MeSHDi input.Run3MeSHDiCa input.Run4BoWDiCa input.Run5BoWDiCaS input.runSystem2 input.samgyupsal input.SNUMedinfo1 input.SNUMedinfo2 input.SNUMedinfo3 input.SNUMedinfo4 input.SNUMedinfo6 input.summary50ex input.tudorComb1 input.tudorComb2 input.tudorComb3 input.tudorComb4 input.tudorCombm input.TUW1 input.TUW2 input.TUW3 input.TUW4 input.TUW5 input.UDInfoCDS1 input.UDInfoCDS2 input.UDInfoCDS3 input.UDInfoCDS4 input.UDInfoCDS5 input.UTD0BL input.UTD1QE input.UTD2LDA input.UTD3W2VE
30 queries: #1-#30 (available at http://www.trec-cds.org/2014.html#topics). The text of the queries is not needed for the pooling experiments; only the relevance judgments (qrels) are required, and they are available at http://trec.nist.gov/data/clinical/qrels-treceval-2014.txt.
CT15
102 runs in the pool.
The runs are: input.artificial input.auto input.BtBase1 input.BtBase3 input.BtCleanAll4 input.CAMspud1 input.CAMspud3 input.CAMspud5 input.cbnu0 input.cbnu1 input.cbnu2 input.DAIICTrun1 input.DAIICTrun2 input.DuthMmMt16f input.DuthMmMt16s input.DuthStef input.ecnu1 input.ecnu2 input.ECNUPB input.EMSEasmer input.EMSElsi input.EMSErm3 input.EPBRNBM25R2 input.EPBRNRSVMR3 input.EPBRNTFIDFR1 input.FDUAuto1 input.FDUAuto2 input.FDUManual input.FORTHICSd0 input.FORTHICSd2 input.FORTHICSs0 input.FrameAFinal input.FusionAuto input.FusionMAll input.FusionManual input.GRIUMenRun1 input.GRIUMenRun2 input.HipocratAr1 input.HipocratAr2 input.HipocratAr3 input.hltcoe4srf input.hltcoe5srf input.hltcoewsrf input.hybrid input.KISTI001 input.KISTI002 input.KISTI003 input.lamdarun01 input.lamdarun02 input.lamdarun03 input.LIMSIrun1BoW input.LIMSIrun2MSH input.LIMSIrun3SmF input.manual input.NOVASEARCH1 input.NOVASEARCH2 input.NOVASEARCH3 input.nuuuuncDFML input.nuuuuncHAKT input.nuuuuncMDRUN input.PL2c10 input.PL2c28 input.PL2c6 input.PPR input.prna1 input.prna2 input.prna3 input.QFB input.run1 input.Run1DBpSimp input.run2 input.Run2DBpComb input.Run4HLM input.runindri input.runindriML input.SCIAILuceneA input.SH1 input.SIBTEX2CITIN input.SIBTEX3CTOUT input.SIBTEX5COMBO input.SNUMedinfo1 input.SNUMedinfo2 input.SNUMedinfo3 input.TUW1 input.TUW2 input.TUW3 input.udelArun1 input.udelArun2 input.udelArun3 input.UMLS input.utdhltrikcv input.utdhltril2r input.utdhltripar input.UWCPL2 input.UWCSolrBM25 input.UWCSolrTerr input.UWMUO1 input.UWMUO2 input.UWMUO3 input.wsuirdaa input.wsuirdma input.wsuirsaa
30 queries: #1-#30 (available at http://www.trec-cds.org/2015.html#topics). The text of the queries is not needed for the pooling experiments; only the relevance judgments (qrels) are required, and they are available at http://trec.nist.gov/data/clinical/qrels-treceval-2015.txt.
CT16
115 runs in the pool.
The runs are: input.AutoDes input.AutoNote input.AutoSummary input.AutoSummary1 input.cbnun1 input.cbnus1 input.cbnus2 input.CCNUDESR2 input.CCNUNOTER1 input.CCNUNOTER2 input.CCNUNOTER3 input.CCNUSUMR1 input.CSIROdSum input.CSIROmeta input.CSIROmnul input.CSIROnote input.CSIROsumm input.d2vCombIrit input.d2vDescIrit input.dacmmf input.DAdescTM input.DAnote input.DAnoteRoc input.DAnoteTM input.DAsummTM input.DDPHBo1CM input.DDPHBo1MWRe input.descUIOWAS2 input.DUTHaaRPF input.DUTHmaRPF input.DUTHsaRPF input.ECNUmanual input.ECNUrun1 input.ECNUrun3 input.ECNUrun4 input.ECNUrun5 input.ETHDescRR input.ETHNote input.ETHNoteRR input.ETHSumm input.ETHSummRR input.lsbn input.lssbd input.lssbn input.lssbs input.LucBase input.LucNote input.LucNoteFrame input.LucWeight input.LucWghtFrame input.ManualRun input.mayoad input.mayoan input.mayoas input.mayomd input.mayomn input.MRKPrfNote input.MRKSumCln input.MRKUmlsSolr input.MrkUmlsXgb input.nacmmf input.NDPHBo1C input.NDPHBo1CM input.nkuRun1 input.nkuRun2 input.nkuRun3 input.nkuRun4 input.nkuRun5 input.NLMrun1 input.NLMrun2 input.NLMrun3 input.NLMrun4 input.NLMrun5 input.NoteES input.noteUIOWAS1 input.prna1sum input.prna2desc input.prna3note input.prna4note input.prna5note input.RONE input.run1 input.run2 input.run3 input.run4 input.run5 input.sacmmf input.SDPHBo1NE input.SumClsRerank input.SumCmbRank input.SumES input.summUIOWAS3 input.UDelInfoCDS1 input.UDelInfoCDS2 input.UDelInfoCDS3 input.UDelInfoCDS4 input.UDelInfoCDS5 input.udelNB input.udelNRef input.udelSB input.udelSDI input.udelSRef input.UNTIIANA input.UNTIIANM input.UNTIIANMERG input.UNTIIASA input.UNTIIASMERG input.UWM0 input.UWM1 input.UWM2 input.WHUIRGroup1 input.WHUIRGroup2 input.WHUIRGroup4 input.WHUIRGroup5 input.WHUIRGroup6
30 queries: #1-#30 (available at http://www.trec-cds.org/2016.html#topics). The text of the queries is not needed for the pooling experiments; only the relevance judgments (qrels) are required, and they are available at http://trec.nist.gov/data/clinical/qrels-treceval-2016.txt.
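All of the run files above follow the standard TREC result format (query id, iteration label, document id, rank, score, run tag). As a minimal illustration, the R snippet below parses one such line into the six-column layout expected by the scripts in the next section; the line shown is invented for the example, not taken from a real run:

```r
# Illustrative only: parse a single TREC-format result line into the column
# layout used later by hedge.R (the values are made up, not real run data).
line <- "251 Q0 FT911-1234 1 12.87 runA"
df <- read.table(text = line, header = FALSE)
names(df) <- c("QUERY", "LABEL", "DOC_ID", "RANK", "SCORE", "RUN")
```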
This section provides the R code needed for experimentation.
The document adjudication strategy, Hedge, is implemented in hedge.R.
Furthermore, we provide another script, process_multiple_queries_jasist.R, which shows how to process multiple queries. Instructions about processing multiple queries are provided below.
Besides an auxiliary function to determine the relevance value of each document, hedge.R contains the following R function:
pooling_hedge. Implements judgment ordering by Hedge
inputs:
query: query whose pool will be ordered
pool_depth: maximum number of docs from each ranking that will be pooled
run_rankings: a list containing the rankings of all pooled runs.
It is a standard R list with as many entries as runs in the pool.
Each entry in the list contains the ranking of a run, which is stored as a
dataframe with the following column names: "QUERY","LABEL","DOC_ID","RANK","SCORE","RUN".
qrels: a data frame containing the qrels. The data frame has the following column names: "QUERY","DUMMY","DOC_ID","REL".
output:
a vector of DOC_IDs giving the order in which the documents must be judged
Example of usage (given two runs, "runA" and "runB", stored in your working folder, and a qrel file, "qrels"):

run_rankings <- list()

# read each run file and assign the expected column names
df <- read.table("runA", header = FALSE)
names(df) <- c("QUERY", "LABEL", "DOC_ID", "RANK", "SCORE", "RUN")
run_rankings[[1]] <- df

df <- read.table("runB", header = FALSE)
names(df) <- c("QUERY", "LABEL", "DOC_ID", "RANK", "SCORE", "RUN")
run_rankings[[2]] <- df

# read the qrels
qrels_df <- read.table("qrels", header = FALSE)
names(qrels_df) <- c("QUERY", "DUMMY", "DOC_ID", "REL")

# order the pool for query 251 with pool depth 100
listIDs <- pooling_hedge(251, 100, run_rankings, qrels_df)
Instructions for experimenting with multiple queries (the example below is included in the file process_multiple_queries_jasist.R).
The function process_multiple_queries processes all queries and dumps the judgment lists to output_file, the path of the file where the lists of judgments for all queries will be stored.
process_multiple_queries <- function(pool_folder, qrels_path, output_file)
{
  # read the qrel file into an R data frame with appropriate column names
  qrels_df <- read.table(qrels_path, header = FALSE)
  names(qrels_df) <- c("QUERY", "DUMMY", "DOC_ID", "REL")
  print(paste("Qrel file...", qrels_path, "...", nrow(qrels_df), " judgments."))
  # read the "input*" files from pool_folder and store them into a list of data frames (run_rankings)
  files <- list.files(path = pool_folder, pattern = "input")
  print(paste("Processing...", pool_folder, "...", length(files), " run files", sep = ""))
  run_rankings <- list()
  for (f in files) {
    filepath <- paste(pool_folder, f, sep = "/")
    df <- read.table(filepath, header = FALSE)
    names(df) <- c("QUERY", "LABEL", "DOC_ID", "RANK", "SCORE", "RUN")
    run_rankings[[length(run_rankings) + 1]] <- df
  } # files
  print(paste(length(run_rankings), "runs in the pool"))
  pool_depth <- 100
  queries <- unique(qrels_df$QUERY)
  judgments_lists <- list()
  iq <- 1
  for (q in queries)
  {
    judgments <- pooling_hedge(q, pool_depth, run_rankings, qrels_df)
    # store the judgment list into judgments_lists
    judgments_lists[[iq]] <- judgments
    iq <- iq + 1
  } # for q in queries
  save(judgments_lists, file = output_file)
}
Given the file with the judgment lists, different stopping methods are implemented in different R files:
stop_after_n_rels. Stops after judging n relevant docs
inputs:
judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
qrels_path: path to the qrel file
n: (integer) number of relevant docs required to stop the judgment process
outputs:
subqrel_file_path: path to the file where the resulting subqrels will be created.
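For intuition, the core rule can be sketched for a single query. In the sketch below, rels is a 0/1 vector with the relevance of each document in judgment order; stop_n_rels is a hypothetical helper written for illustration, not the code in the distributed files:

```r
# Sketch of the stop-after-n-relevant rule for one query (hypothetical helper).
# rels: 0/1 vector with the relevance of each document in judgment order.
stop_n_rels <- function(rels, n) {
  pos <- which(cumsum(rels) >= n)[1]     # first position reaching n relevant docs
  if (is.na(pos)) length(rels) else pos  # judge everything if n is never reached
}

stop_n_rels(c(1, 0, 1, 0, 1), 2)  # stops after the third judgment
```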
stop_after_n_judgments. Stops after judging n docs
inputs:
judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
qrels_path: path to the qrel file
n: (integer) number of docs required to stop the judgment process
outputs:
subqrel_file_path: path to the file where the resulting subqrels will be created.
stop_after_n_non_rels. Stops after judging n non-relevant docs
inputs:
judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
qrels_path: path to the qrel file
n: (integer) number of non-relevant docs required to stop the judgment process
outputs:
subqrel_file_path: path to the file where the resulting subqrels will be created.
stop_after_n_consecutive_non_rels. Stops after judging n consecutive non-relevant docs
inputs:
judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
qrels_path: path to the qrel file
n: (integer) number of consecutive non-relevant docs required to stop the judgment process
outputs:
subqrel_file_path: path to the file where the resulting subqrels will be created.
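The consecutive variant can be sketched analogously, tracking the current streak of non-relevant judgments; stop_n_consec_nonrels is again a hypothetical helper, not the distributed implementation:

```r
# Sketch of the stop-after-n-consecutive-non-relevant rule (hypothetical helper).
# rels: 0/1 vector with the relevance of each document in judgment order.
stop_n_consec_nonrels <- function(rels, n) {
  streak <- 0
  for (i in seq_along(rels)) {
    streak <- if (rels[i] == 0) streak + 1 else 0  # reset on a relevant doc
    if (streak >= n) return(i)
  }
  length(rels)  # the streak never occurs: judge the whole list
}

stop_n_consec_nonrels(c(1, 0, 0, 1, 0, 0, 0), 3)  # stops at position 7
```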
stop_after_pool_proportion. Stops after judging a given proportion of the pooled docs
inputs:
judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
qrels_path: path to the qrel file
proportion: (number in [0,1]) proportion of the pool to be judged
outputs:
subqrel_file_path: path to the file where the resulting subqrels will be created.
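The proportion-based rule reduces to simple arithmetic on the pool size; the helper below is a hypothetical illustration of that computation:

```r
# Sketch: number of judgments when stopping after a fixed proportion of the
# pool has been judged (hypothetical helper; pool_size is the size of the pool).
stop_pool_proportion <- function(pool_size, proportion) {
  ceiling(proportion * pool_size)
}

stop_pool_proportion(200, 0.25)  # 50 judgments
```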
The rest of the methods work with estimates of recall and, therefore, you must first invoke the training process that produces the fits and performance statistics of the training queries. This is implemented by stopping_fit_training_queries:
inputs:
judgement_list_path: path to the file with the judgment lists of the training collection (created by process_multiple_queries)
qrels_path: path to the qrel file of the training collection
outputs:
output_file_path: path to the file where the resulting fits and performance statistics will be stored
Given the file produced by stopping_fit_training_queries from the training data, the following methods implement other stopping methods:
stopping_stop_if_bearish_crossover.R. Stops following the moving-average method described in the paper.
inputs:
judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
qrels_path: path to the qrel file
x: size of the moving window (integer > 1)
training_curves_and_perfs_file: file with the training fits and performance statistics (created by stopping_fit_training_queries)
perf_code: perf_code=0 means that perf@n is computed as p@n, perf_code=1 means that perf@n is computed by AvgP
outputs:
subqrel_file_path: path to the file where the resulting subqrels will be created.
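The crossover idea can be illustrated independently of the fitted curves used in the paper: a short-term moving average of the performance estimates falling below a long-term one signals the stopping point. The sketch below uses simple trailing averages and invented window sizes; it is a simplification written for illustration, not the distributed implementation:

```r
# Trailing moving average of window w (simplified illustration).
moving_avg <- function(x, w) {
  sapply(seq_along(x), function(i) mean(x[max(1, i - w + 1):i]))
}

# First "bearish crossover": the short-term average falls below the long-term one.
first_bearish_crossover <- function(perf, short_w, long_w) {
  s <- moving_avg(perf, short_w)
  l <- moving_avg(perf, long_w)
  idx <- which(s < l)
  if (length(idx) == 0) length(perf) else idx[1]
}

first_bearish_crossover(c(1, 2, 3, 4, 3, 2, 1), 2, 4)  # crossover at position 6
```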
stopping_stop_if_fall_below_max.R. Stops following the stop_if_fall_below_max method described in the paper.
inputs:
judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
qrels_path: path to the qrel file
training_curves_and_perfs_file: file with the training fits and performance statistics (created by stopping_fit_training_queries)
p: parameter of the method (proportion under the max, p in [0,1])
perf_code: perf_code=0 means that perf@n is computed as p@n, perf_code=1 means that perf@n is computed by AvgP
outputs:
subqrel_file_path: path to the file where the resulting subqrels will be created.
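The underlying idea can be sketched in isolation: stop at the first point where the performance estimate drops below a proportion p of the maximum observed so far. This is an assumption-laden simplification for illustration (the paper's method works on fitted curves), not the distributed code:

```r
# Sketch of the fall-below-max idea (hypothetical helper).
# perf: vector of performance estimates in judgment order; p in [0,1].
stop_if_fall_below_max_sketch <- function(perf, p) {
  idx <- which(perf < p * cummax(perf))  # below p times the running maximum
  if (length(idx) == 0) length(perf) else idx[1]
}

stop_if_fall_below_max_sketch(c(1, 2, 3, 2, 1), 0.5)  # stops at position 5
```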
stopping_stop_if_no_better_expectations.R. Stops following the stop_if_no_better_expectations method described in the paper.
inputs:
judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
qrels_path: path to the qrel file
training_curves_and_perfs_file: file with the training fits and performance statistics (created by stopping_fit_training_queries)
perf_code: perf_code=0 means that perf@n is computed as p@n, perf_code=1 means that perf@n is computed by AvgP
outputs:
subqrel_file_path: path to the file where the resulting subqrels will be created.
© David E. Losada, 2018