Papers and Talks
2025
Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval 2025
The Anserini IR toolkit has come a long way since efforts began in 2015. Although the goals of the project (to bridge research and practice in information retrieval, and to provide reproducible, easy-to-use baselines) have remained constant, the world has...
2024
Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia 2024
This paper examines the integration of images into Wikipedia articles by evaluating image–text retrieval tasks in multimedia content creation, focusing on developing retrieval-augmented tools to enhance the creation of high-quality multimedia articles. Desp...
arXiv preprint arXiv:2408.01363 2024
Vision–Language Models (VLMs) have demonstrated success across diverse applications, yet their potential to assist in relevance judgments remains uncertain. This paper assesses the relevance estimation capabilities of VLMs, including CLIP, LLaVA, and GPT-4V...
Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval 2024
BEIR is a benchmark dataset originally designed for zero-shot evaluation of retrieval models across 18 different domain/task combinations. In recent years, we have witnessed the growing popularity of models based on representation learning, which naturally ...
2023
Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 2023
Neural retrievers have been shown to be effective for math-aware search. Their ability to cope with math symbol mismatches, to represent highly contextualized semantics, and to learn effective representations are critical to improving math information retri...
Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval 2023
This paper presents the AToMiC (Authoring Tools for Multimedia Content) dataset, designed to advance research in image/text cross-modal retrieval. While vision–language pretrained transformers have led to significant improvements in retrieval effectiveness...
arXiv preprint arXiv:2306.07471 2023
BEIR is a benchmark dataset for zero-shot evaluation of information retrieval models across 18 different domain/task combinations. In recent years, we have witnessed the growing popularity of a representation learning approach to building retrieval models, ...
arXiv preprint arXiv:2304.01019 2023
The advent of multilingual language models has generated a resurgence of interest in cross-lingual information retrieval (CLIR), which is the task of searching documents in one language with queries from another. However, the rapid pace of progress has led ...
Text REtrieval Conference 2023
This paper presents an exploration of evaluating image–text retrieval tasks designed for multimedia content creation, with a particular focus on the dynamic interplay among various modalities, including text and images. The study highlights the pivotal role...
Text REtrieval Conference 2023
2022
Findings of the Association for Computational Linguistics: EMNLP 2022 2022
With the recent success of dense retrieval methods based on bi-encoders, studies have applied this approach to various interesting downstream retrieval tasks with good efficiency and in-domain effectiveness. Recently, we have also seen the presence of dense...
Proceedings of the 2022 SIAM International Conference on Data Mining (SDM) 2022
Default analysis plays an essential role in financial markets because it narrows the information gap between borrowers and lenders. Of late, machine learning-based methods have found their way into default analysis and typically view it as a risk classificati...
2021
arXiv preprint arXiv:2112.09628 2021
Sparse lexical representation learning has demonstrated much progress in improving passage retrieval effectiveness in recent models such as DeepImpact, uniCOIL, and SPLADE. This paper describes a straightforward yet effective approach for sparsifying lexica...
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing 2021
This paper describes a compact and effective model for low-latency passage retrieval in conversational search based on learned dense representations. Prior to our work, the state-of-the-art approach uses a multi-stage pipeline comprising conversational quer...
ACM Transactions on Information Systems (TOIS) 2021
Conversational search plays a vital role in conversational information seeking. As queries in information seeking dialogues are ambiguous for traditional ad hoc information retrieval (IR) systems due to the coreference and omission resolution problems inher...
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021) 2021
We present an efficient training approach to text retrieval with dense representations that applies knowledge distillation using the ColBERT late-interaction ranking model. Specifically, we propose to transfer the knowledge from a bi-encoder teacher to a st...
Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021
Chatty Goose is an open-source Python conversational search framework that provides strong, reproducible reranking pipelines built on recent advances in neural models. The framework comprises extensible modular components that integrate with popular librari...
Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021
Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-...
Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021
Recently, much progress in natural language processing has been driven by deep contextualized representations pretrained on large corpora. Typically, fine-tuning these pretrained models for a specific downstream task is based on single-view learning,...
Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval 2021
A vital step towards the widespread adoption of neural retrieval models is their resource efficiency throughout the training, indexing, and query workflows. The neural IR community has made great advances in training effective dual-encoder dense retrieval (D...
arXiv preprint arXiv:2102.10073 2021
Pyserini is an easy-to-use Python toolkit that supports replicable IR research by providing effective first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance j...
2020
Proceedings of the 28th International Conference on Computational Linguistics 2020
While internalized “implicit knowledge” in pretrained transformers has led to fruitful progress in many natural language understanding tasks, how to most effectively elicit such knowledge remains an open question. Based on the text-to-text transfer transfor...
arXiv preprint arXiv:2010.11386 2020
We present an approach to ranking with dense representations that applies knowledge distillation to improve the recently proposed late-interaction ColBERT model. Specifically, we distill the knowledge from ColBERT’s expressive MaxSim operator for computing ...
arXiv preprint arXiv:2004.01909 2020
This paper presents an empirical study of conversational question reformulation (CQR) with sequence-to-sequence architectures and pretrained language models (PLMs). We leverage PLMs to address the strong token-to-token independence assumption made in the co...
arXiv preprint arXiv:2003.08380 2020
We applied the T5 sequence-to-sequence model to tackle the AI2 WinoGrande Challenge by decomposing each example into two input text strings, each containing a hypothesis, and using the probabilities assigned to the “entailment” token as a score of the hypot...
TREC 2020
This notebook describes our participation (h2oloo) in TREC CAsT 2020. We first illustrate our multi-stage pipeline for conversational search: sequence-to-sequence query reformulation followed by an ad hoc text ranking pipeline; then, detail our proposed met...
arXiv preprint arXiv:2005.02230 2020
Passage retrieval in a conversational context is essential for many downstream applications; it is however extremely challenging due to limited data resources. To address this problem, we present an effective multi-stage pipeline for passage ranking in conv...
2019
TREC 2019
In this paper, we present our methods, experimental analysis, and final submissions for the Conversational Assistance Track (CAsT) at TREC 2019. In addition to language understanding, extracting knowledge from historical dialogues (e.g., previous queries, sea...
2018
Proceedings of the 12th ACM Conference on Recommender Systems 2018
Recommender systems are vital ingredients for many e-commerce services. In the literature, two of the most popular approaches are based on factorization and graph-based models; the former approach captures user preferences by factorizing the observed direct...