Objectives: Complex searches are sometimes derived from tests against a predefined “validation set” of items, a robust but resource-intensive approach. One way to simplify the construction of such a validation set is to select items at random from the results of an initial “seed” search. Larger validation sets are more likely to reflect results from the database as a whole, but searcher effort increases with the number of results screened. A balance between search fidelity and usability is therefore required.
Methods: To determine the optimal balance between usability and search fidelity, two approaches were used. In the first, several simple searches were performed programmatically in PubMed using the NCBI E-utilities (eutils) API, and randomly selected results were extracted from each. These sets of random results were then subsampled into sets of different sizes, and each set was searched against a variety of test conditions. The proportion of each set matched by those conditions was then compared between differently sized sets. In the second approach, human participants were asked to visit a web application designed to show them randomly selected items from a PubMed search (matched with a selection rubric) and to sort those results manually into “good” and “bad” sets. The resulting sets of items for each rubric were compared between participants.
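The programmatic arm described above could be sketched roughly as follows. This is a minimal illustration, not the study's actual code: the search term, pool size, and subset sizes are hypothetical stand-ins, and the esearch URL is constructed but not fetched.

```python
# Sketch of the first study arm: build an NCBI eutils esearch query,
# then draw random subsets of several sizes from a pool of PMIDs.
# Search term, pool, and subset sizes are illustrative assumptions.
import random
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, retmax=500):
    """Build an esearch URL returning up to `retmax` PubMed PMIDs as JSON."""
    return EUTILS + "?" + urlencode(
        {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    )

def subsample(pmids, sizes, seed=0):
    """Draw one random subset of each requested size from a pool of PMIDs."""
    rng = random.Random(seed)
    return {n: rng.sample(pmids, n) for n in sizes}

# Stand-in pool of PMIDs (a real run would parse them from the esearch JSON).
pool = [str(30000000 + i) for i in range(500)]
sets = subsample(pool, sizes=[10, 25, 50], seed=42)
for n, subset in sets.items():
    print(n, len(subset))
```

Each subset would then be tested against the candidate search conditions, and the proportion matched compared across subset sizes.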
Results: In the first arm of the study, the proportion of items matched for each sample search/test condition varied between randomly selected sets of results. However, this variance decreased markedly as the size of the random set increased to 50. In the second arm, participants completed a total of 23 set-construction tasks. On average, participants took 34 minutes to complete each task (min: 13 minutes, max: 119 minutes) and screened 2.56 items for each “good” one (min: 1.15, max: 3.85). The number of “bad” items for a given question varied substantially between respondents, but testing showed statistically similar results in at least one case.
Conclusions: These results do not yet yield a single answer for the required size of a validation set, but they do suggest that this approach has promise. More research is needed.