This document provides all details needed to reproduce the experiments reported in the paper D. Losada, J. Parapar, A. Barreiro. “When to Stop Making Relevance Judgments? A Study of Stopping Methods for Building Information Retrieval Test Collections”. Journal of the Association for Information Science and Technology, 2018 (to appear).

Any scientific publication derived from the use of this software should explicitly refer to this publication.

Next, we explain the data used for experimentation and provide our R code, which implements all stopping methods.

Data

We used the following TREC collections (http://trec.nist.gov): four adhoc (TREC5, TREC6, TREC7, TREC8) and three Clinical Decision Support Track (CT14, CT15, CT16).

NIST kindly provided the runs that contributed to the pools of these tasks (http://trec.nist.gov/data/intro_eng.html).

The pooled runs are archived by NIST within a password-protected area. To reproduce our experiments, you need to request access to the protected area (follow the instructions given at http://trec.nist.gov/results.html).

R code

This section provides the R code needed for experimentation.

The document adjudication strategy, Hedge, is implemented in hedge.R.

Furthermore, we provide another script, process_multiple_queries_jasist.R, which shows how to process multiple queries. Instructions are provided in the Multiple queries section below.

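Besides an auxiliary function that determines the relevance value of each document, hedge.R contains the R function pooling_hedge, which is called in the multiple-query example below as pooling_hedge(q, pool_depth, run_rankings, qrels_df).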

Multiple queries

Instructions for experimenting with multiple queries (the example below is included in the file process_multiple_queries_jasist.R).

Steps:

  1. Store all pooled runs in a local folder (pool_folder).
  2. Store the official qrel file in another folder (qrels_path is the path to this file).
  3. Call process_multiple_queries(pool_folder, qrels_path, output_file).

(output_file is the path to the file where the list of judgments for all queries will be stored)
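
As a rough illustration of the input layout assumed by the steps above, the sketch below writes a toy qrel file and a toy run file to a temporary folder and reads them back with the same column names used in process_multiple_queries. The document IDs, the scores, the run tag and the file name input.toyrun are invented for illustration; real files come from NIST.

```r
# Toy inputs in the layout read by process_multiple_queries (all values are made up).
pool_folder <- file.path(tempdir(), "toy_pool")
dir.create(pool_folder, showWarnings = FALSE)
qrels_path <- file.path(tempdir(), "toy_qrels.txt")

# qrel lines: query id, unused field, document id, relevance value
writeLines(c("301 0 DOC-A 1",
             "301 0 DOC-B 0",
             "302 0 DOC-C 1"), qrels_path)

# run lines: query id, label, document id, rank, score, run tag
writeLines(c("301 Q0 DOC-A 1 9.5 toyrun",
             "301 Q0 DOC-B 2 7.1 toyrun",
             "302 Q0 DOC-C 1 8.0 toyrun"),
           file.path(pool_folder, "input.toyrun"))

# Read the qrel file with the column names used in the function.
qrels_df <- read.table(qrels_path, header = FALSE)
names(qrels_df) <- c("QUERY", "DUMMY", "DOC_ID", "REL")

# Read every file whose name matches "input" in the pool folder.
run_files <- list.files(path = pool_folder, pattern = "input")
run_df <- read.table(file.path(pool_folder, run_files[1]), header = FALSE)
names(run_df) <- c("QUERY", "LABEL", "DOC_ID", "RANK", "SCORE", "RUN")

print(qrels_df)
print(run_df)
```

Only files whose names contain "input" are picked up from pool_folder, so any other file stored there is ignored.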

The function process_multiple_queries processes all queries and dumps the judgment list to output_file.

process_multiple_queries <- function(pool_folder, qrels_path, output_file)
{
  # reads the qrel file into an R dataframe with appropriate column names
  qrels_df= read.table(qrels_path,header=FALSE)
  names(qrels_df)=c("QUERY","DUMMY","DOC_ID","REL")
  
  print(paste("Qrel file...",qrels_path,"...",nrow(qrels_df)," judgments."))
    
  # reads "input*" files from pool_folder and stores them into a list of data frames (run_rankings)
  files <- list.files(path=pool_folder, pattern = "input")
  print(paste("Processing...",pool_folder,"...",length(files)," run files",sep=""))
  
  run_rankings=list()

  for (f in files){
    filepath=paste(pool_folder,f,sep="/")  
    df = read.table(filepath,header=FALSE)
    names(df)=c("QUERY","LABEL","DOC_ID","RANK","SCORE","RUN")
    run_rankings[[length(run_rankings)+1]]=df
  } # files
  
  print(paste(length(run_rankings),"runs in the pool"))
  
  pool_depth=100
  
  queries= unique(qrels_df$QUERY)
  
  judgments_lists=list()  
  
  iq=1

  for (q in queries)
  {
    judgments = pooling_hedge(q, pool_depth, run_rankings,qrels_df)

    #store the judgment list into judgments_lists
    judgments_lists[[iq]] = judgments

    iq = iq +1
  } # for q in queries
  
  save(judgments_lists,file=output_file )

}
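
Because the function stores its result with save(), the judgment lists can later be restored with load(), which recreates the object under its original name, judgments_lists. Element i of the list corresponds to the i-th query in unique(qrels_df$QUERY). The sketch below demonstrates the round trip on a toy list; representing each judgment list as a character vector of document IDs is only an assumption, since the actual structure returned by pooling_hedge is defined in hedge.R.

```r
# Toy stand-in for the object written by process_multiple_queries
# (the element structure is hypothetical).
output_file <- tempfile(fileext = ".RData")
judgments_lists <- list(c("DOC-A", "DOC-B"), c("DOC-C"))
save(judgments_lists, file = output_file)

# Remove the object, then restore it from disk.
rm(judgments_lists)
restored_names <- load(output_file)  # load() returns the names of the restored objects

print(restored_names)
print(length(judgments_lists))       # one element per query
```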

Given the file with the judgment lists, different stopping methods are implemented in different R files:

  • stop_after_n_rels. Stops after judging n relevant docs

    inputs:
    
        judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
        qrels_path: path to the qrel file
        n: (integer) number of relevant docs required to stop the judgment process
    
    outputs:    
    
        subqrel_file_path: path to the file where the resulting subqrels will be created. 
  • stop_after_n_judgments. Stops after judging n docs

    inputs:
    
        judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
        qrels_path: path to the qrel file
        n: (integer) number of docs required to stop the judgment process
    
    outputs:    
    
        subqrel_file_path: path to the file where the resulting subqrels will be created. 
  • stop_after_n_non_rels. Stops after judging n non-relevant docs

    inputs:
    
        judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
        qrels_path: path to the qrel file
        n: (integer) number of non-relevant docs required to stop the judgment process
    
    outputs:    
    
        subqrel_file_path: path to the file where the resulting subqrels will be created. 
  • stop_after_n_consecutive_non_rels. Stops after judging n consecutive non-relevant docs

    inputs:
    
        judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
        qrels_path: path to the qrel file
        n: (integer) number of consecutive non-relevant docs required to stop the judgment process
    
    outputs:    
    
        subqrel_file_path: path to the file where the resulting subqrels will be created. 
  • stop_after_pool_proportion. Stops after judging a given proportion of the pooled docs

    inputs:
    
        judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
        qrels_path: path to the qrel file
        proportion: (number in [0,1]) proportion of the pool to be judged
    
    outputs:    
    
        subqrel_file_path: path to the file where the resulting subqrels will be created. 
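
To make the five rules above concrete, the sketch below computes the stopping position for a single query from a vector of relevance labels given in judgment order. It is an illustration only, not the code in the R files listed above. The function names (stop_pos_*), the 0/1 encoding of relevance, the fallback of judging the whole list when a rule never triggers, the use of ceiling() for the proportion, and treating the length of the judgment list as the pool size are all assumptions.

```r
# rels: 0/1 relevance of the documents in the order they were judged
# (hypothetical encoding; the real judgment lists come from process_multiple_queries).

# Position at which the n-th TRUE of `hits` occurs, or length(hits) if it never occurs.
nth_hit <- function(hits, n) {
  pos <- which(cumsum(hits) >= n)
  if (length(pos) == 0) length(hits) else pos[1]
}

stop_pos_n_rels      <- function(rels, n) nth_hit(rels == 1, n)
stop_pos_n_non_rels  <- function(rels, n) nth_hit(rels == 0, n)
stop_pos_n_judgments <- function(rels, n) min(n, length(rels))

# Stops at the first position that ends a run of n non-relevant documents.
stop_pos_n_consecutive_non_rels <- function(rels, n) {
  streak <- 0
  for (i in seq_along(rels)) {
    streak <- if (rels[i] == 0) streak + 1 else 0
    if (streak >= n) return(i)
  }
  length(rels)
}

# Judges the first `proportion` of the list, rounded up.
stop_pos_pool_proportion <- function(rels, proportion) {
  ceiling(proportion * length(rels))
}

rels <- c(1, 0, 1, 0, 0, 0, 1, 0)
stop_pos_n_rels(rels, 2)                  # 3
stop_pos_n_judgments(rels, 4)             # 4
stop_pos_n_non_rels(rels, 3)              # 5
stop_pos_n_consecutive_non_rels(rels, 3)  # 6
stop_pos_pool_proportion(rels, 0.5)       # 4
```

Under these assumptions, the documents judged for the query are the first pos entries of its judgment list, and the corresponding subqrels would contain only those documents.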

The rest of the methods rely on estimates of recall, so you must first run the training process that produces the fits and performance statistics of the training queries. This is implemented by stopping_fit_training_queries:

     inputs:
    
        judgement_list_path: path to the file with the judgment lists of the training collection (created by process_multiple_queries)
        qrels_path: path to the qrel file of the training collection
        
    outputs:    
        
        output_file_path: path to the file where the resulting fits and performance statistics will be stored

Given the file produced by stopping_fit_training_queries from the training data, the following scripts implement the remaining stopping methods:

  • stopping_stop_if_bearish_crossover.R. Stops following the moving-average method described in the paper.

    inputs:
    
        judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
        qrels_path: path to the qrel file
        x: size of the moving window (integer > 1)
        training_curves_and_perfs_file: file with the training fits and performance statistics (created by stopping_fit_training_queries)
        perf_code: perf_code=0 means that perf@n is computed as p@n; perf_code=1 means that perf@n is computed as AvgP
    
    outputs:    
    
        subqrel_file_path: path to the file where the resulting subqrels will be created. 
  • stopping_stop_if_fall_below_max.R. Stops following the stop_if_fall_below_max method described in the paper.

    inputs:
    
        judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
        qrels_path: path to the qrel file
        training_curves_and_perfs_file: file with the training fits and performance statistics (created by stopping_fit_training_queries)
        p: parameter of the method (proportion under the max, p in [0,1])
        perf_code: perf_code=0 means that perf@n is computed as p@n; perf_code=1 means that perf@n is computed as AvgP
    
    outputs:    
    
        subqrel_file_path: path to the file where the resulting subqrels will be created. 
  • stopping_stop_if_no_better_expectations. Stops following the stop_if_no_better_expectations method described in the paper.

    inputs:
    
        judgement_list_path: path to the file with the judgment lists (created by process_multiple_queries)
        qrels_path: path to the qrel file
        training_curves_and_perfs_file: file with the training fits and performance statistics (created by stopping_fit_training_queries)
        perf_code: perf_code=0 means that perf@n is computed as p@n; perf_code=1 means that perf@n is computed as AvgP
    
    outputs:    
    
        subqrel_file_path: path to the file where the resulting subqrels will be created. 
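
The perf_code parameter selects between two standard effectiveness measures, precision at n and average precision. As a reminder of what these measures compute, the sketch below evaluates both for one ranked list of binary relevance labels. The function names, the 0/1 encoding and the normalisation of average precision by a supplied number of relevant documents are assumptions made for illustration; how the scripts above compute perf@n is defined in the R files themselves.

```r
# ranked_rels: 0/1 relevance labels of a ranked list, best-ranked document first
# (hypothetical encoding).

# Fraction of relevant documents among the top n.
precision_at_n <- function(ranked_rels, n) {
  sum(head(ranked_rels, n)) / n
}

# Mean of the precision values at the ranks of the relevant documents,
# normalised by total_rel (by default, the relevant documents in the list).
average_precision <- function(ranked_rels, total_rel = sum(ranked_rels)) {
  if (total_rel == 0) return(0)
  hit_ranks <- which(ranked_rels == 1)
  sum(seq_along(hit_ranks) / hit_ranks) / total_rel
}

ranked_rels <- c(1, 0, 1, 0, 0)
precision_at_n(ranked_rels, 3)   # 2/3
average_precision(ranked_rels)   # (1/1 + 2/3) / 2 = 5/6
```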

© David E. Losada, 2018