Jheng-Hong (Matt) Yang

Hi, I'm Matt.

I’m Jheng-Hong (Matt) Yang, an engineer and researcher building agentic systems around data systems. My recent work explores how strong language models can search, inspect evidence, and reason more effectively, which led to Pi-serini, a minimal search agent for building reliable, cost-efficient deep-research workflows.

I previously pursued doctoral research in computer science at the University of Waterloo, where I worked on information retrieval, neural ranking, conversational search, and multimodal retrieval. Before that, I conducted research at Academia Sinica on recommender systems and data-driven modeling. My earlier training was in physics and electrical engineering: I received a B.S. in Electrophysics and an M.Sc. in Electrical Engineering from National Chiao Tung University, and then worked in TSMC's R&D division on semiconductor device modeling.

I am currently building Stencilzeit, which offers consulting, engineering, and technical services around data systems and agentic AI. I enjoy building systems that are secure, fast, scalable, delightful to use, and ethically designed to serve the greater good.

Feel free to reach out at jhyang [at] stencilzeit [dot] com with ideas or collaboration proposals.

Papers and Talks

2025

Gosling Grows Up: Retrieval with Learned Dense and Sparse Representations Using Anserini

The Anserini IR toolkit has come a long way since efforts began in 2015. Although the goals of the project (to bridge research and practice in information retrieval, and to provide reproducible, easy-to-use baselines) have remained constant, the world has...

2024

Retrieval Evaluation for Long-Form and Knowledge-Intensive Image–Text Article Composition

This paper examines the integration of images into Wikipedia articles by evaluating image–text retrieval tasks in multimedia content creation, focusing on developing retrieval-augmented tools to enhance the creation of high-quality multimedia articles. Desp...

Resources for Brewing BEIR: Reproducible Reference Models and Statistical Analyses

BEIR is a benchmark dataset originally designed for zero-shot evaluation of retrieval models across 18 different domain/task combinations. In recent years, we have witnessed the growing popularity of models based on representation learning, which naturally ...

2023

One Blade for One Purpose: Advancing Math Information Retrieval Using Hybrid Search

Neural retrievers have been shown to be effective for math-aware search. Their ability to cope with math symbol mismatches, to represent highly contextualized semantics, and to learn effective representations are critical to improving math information retri...
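
As a generic illustration of the hybrid-search idea (not the exact fusion recipe from the paper), the sketch below combines a sparse ranked list (e.g., BM25) with a dense one by min-max normalizing each list and taking a convex combination; the alpha weight and the normalization scheme are illustrative assumptions.

```python
from typing import Dict

def hybrid_scores(sparse: Dict[str, float],
                  dense: Dict[str, float],
                  alpha: float = 0.5) -> Dict[str, float]:
    """Fuse sparse (e.g., BM25) and dense scores via a convex
    combination after min-max normalizing each ranked list."""
    def normalize(scores: Dict[str, float]) -> Dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against all-equal scores
        return {doc: (s - lo) / span for doc, s in scores.items()}

    s, d = normalize(sparse), normalize(dense)
    # Union of candidates: a document may surface in only one list.
    return {doc: alpha * s.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0)
            for doc in set(s) | set(d)}
```

Pushing alpha toward 1 leans on exact symbol matching, while lower values lean on the contextualized semantics the neural retriever provides.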

AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation

This paper presents the AToMiC (Authoring Tools for Multimedia Content) dataset, designed to advance research in image/text cross-modal retrieval. While vision–language pretrained transformers have led to significant improvements in retrieval effectiveness...

Simple Yet Effective Neural Ranking and Reranking Baselines for Cross-Lingual Information Retrieval

The advent of multilingual language models has generated a resurgence of interest in cross-lingual information retrieval (CLIR), which is the task of searching documents in one language with queries from another. However, the rapid pace of progress has led ...

TREC 2023 AToMiC Overview

This paper presents an exploration of evaluating image–text retrieval tasks designed for multimedia content creation, with a particular focus on the dynamic interplay among various modalities, including text and images. The study highlights the pivotal role...

TREC 2023-H2Oloo in the Product Search Challenge

2022

Evaluating Token-Level and Passage-Level Dense Retrieval Models for Math Information Retrieval

With the recent success of dense retrieval methods based on bi-encoders, studies have applied this approach to various interesting downstream retrieval tasks with good efficiency and in-domain effectiveness. Recently, we have also seen the presence of dense...

Multiperiod Corporate Default Prediction Through Neural Parametric Family Learning

Default analysis plays an essential role in financial markets because it narrows the information gap between borrowers and lenders. Of late, machine learning-based methods have found their way to default analysis and typically view it as a risk classificati...

2021

Sparsifying Sparse Representations for Passage Retrieval by Top-k Masking

Sparse lexical representation learning has demonstrated much progress in improving passage retrieval effectiveness in recent models such as DeepImpact, uniCOIL, and SPLADE. This paper describes a straightforward yet effective approach for sparsifying lexica...
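
The core operation is compact enough to show directly. Here is a minimal PyTorch sketch (not the paper's training code) that keeps only the k largest term weights in each vocabulary-sized passage vector and zeroes out the rest:

```python
import torch

def topk_mask(weights: torch.Tensor, k: int) -> torch.Tensor:
    """Sparsify learned lexical representations: for each row of a
    [batch, vocab_size] weight matrix, keep the k largest entries
    and set every other term weight to zero."""
    topk = torch.topk(weights, k, dim=-1)
    mask = torch.zeros_like(weights)
    mask.scatter_(-1, topk.indices, 1.0)  # 1.0 at the top-k positions
    return weights * mask
```

Smaller k yields shorter posting lists, trading a little effectiveness for index size and query latency.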

Contextualized Query Embeddings for Conversational Search

This paper describes a compact and effective model for low-latency passage retrieval in conversational search based on learned dense representations. Prior to our work, the state-of-the-art approach uses a multi-stage pipeline comprising conversational quer...

Multi-Stage Conversational Passage Retrieval: An Approach to Fusing Term Importance Estimation and Neural Query Rewriting

Conversational search plays a vital role in conversational information seeking. As queries in information seeking dialogues are ambiguous for traditional ad hoc information retrieval (IR) systems due to the coreference and omission resolution problems inher...

In-Batch Negatives for Knowledge Distillation with Tightly-Coupled Teachers for Dense Retrieval

We present an efficient training approach to text retrieval with dense representations that applies knowledge distillation using the ColBERT late-interaction ranking model. Specifically, we propose to transfer the knowledge from a bi-encoder teacher to a st...

Chatty Goose: A Python Framework for Conversational Search

Chatty Goose is an open-source Python conversational search framework that provides strong, reproducible reranking pipelines built on recent advances in neural models. The framework comprises extensible modular components that integrate with popular librari...

Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. It aims to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-...
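
For a flavor of the toolkit, below is a minimal BM25 retrieval example. The import path and the prebuilt index name assume a recent Pyserini release; older versions expose SimpleSearcher from pyserini.search instead.

```python
from pyserini.search.lucene import LuceneSearcher

# Download (on first use) and open a prebuilt MS MARCO passage index.
searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')
hits = searcher.search('what is information retrieval', k=10)
for rank, hit in enumerate(hits, start=1):
    print(f'{rank:2} {hit.docid:12} {hit.score:.4f}')
```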

Text-to-Text Multi-View Learning for Passage Re-Ranking

Recently, much progress in natural language processing has been driven by deep contextualized representations pretrained on large corpora. Typically, the fine-tuning on these pretrained models for a specific downstream task is based on single-view learning,...

Efficiently Teaching an Effective Dense Retriever with Balanced Topic-Aware Sampling

A vital step towards the widespread adoption of neural retrieval models is their resource efficiency throughout the training, indexing and query workflows. The neural IR community made great advancements in training effective dual-encoder dense retrieval (D...
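
As a rough sketch of the topic-aware sampling idea (the cluster structure and batch composition below are illustrative assumptions, not the paper's implementation), each batch is drawn from a handful of query clusters so that in-batch negatives come from topically related queries rather than random ones:

```python
import random
from typing import Dict, Iterator, List

def topic_aware_batches(queries_by_cluster: Dict[int, List[str]],
                        batch_size: int = 32,
                        clusters_per_batch: int = 4) -> Iterator[List[str]]:
    """Compose batches from a few query clusters at a time; consumes
    the cluster lists in place until every query has been used."""
    per_cluster = max(1, batch_size // clusters_per_batch)
    while True:
        nonempty = [c for c, qs in queries_by_cluster.items() if qs]
        if not nonempty:
            return
        batch: List[str] = []
        chosen = random.sample(nonempty, min(clusters_per_batch, len(nonempty)))
        for c in chosen:
            for _ in range(min(per_cluster, len(queries_by_cluster[c]))):
                batch.append(queries_by_cluster[c].pop())
        yield batch
```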

2020

Designing Templates for Eliciting Commonsense Knowledge from Pretrained Sequence-to-Sequence Models

While internalized “implicit knowledge” in pretrained transformers has led to fruitful progress in many natural language understanding tasks, how to most effectively elicit such knowledge remains an open question. Based on the text-to-text transfer transfor...

Distilling Dense Representations for Ranking Using Tightly-Coupled Teachers

We present an approach to ranking with dense representations that applies knowledge distillation to improve the recently proposed late-interaction ColBERT model. Specifically, we distill the knowledge from ColBERT’s expressive MaxSim operator for computing ...
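
The MaxSim operator at the heart of this teacher is only a few lines. A minimal PyTorch sketch, assuming per-token embeddings that are already L2-normalized:

```python
import torch

def maxsim_score(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """ColBERT-style late interaction. q: [num_query_tokens, dim],
    d: [num_doc_tokens, dim]. For each query token, take its maximum
    dot product over document tokens, then sum over query tokens."""
    sim = q @ d.T                       # [num_query_tokens, num_doc_tokens]
    return sim.max(dim=1).values.sum()  # scalar relevance score
```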

Tackling WinoGrande Schemas

We applied the T5 sequence-to-sequence model to tackle the AI2 WinoGrande Challenge by decomposing each example into two input text strings, each containing a hypothesis, and using the probabilities assigned to the “entailment” token as a score of the hypot...
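
A hedged sketch of this scoring scheme with Hugging Face Transformers; the prompt format and the t5-base checkpoint below are illustrative assumptions, not the paper's exact configuration:

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('t5-base')
model = T5ForConditionalGeneration.from_pretrained('t5-base')

def entailment_score(premise: str, hypothesis: str) -> float:
    """Score a premise/hypothesis pair by the log-likelihood of T5
    generating the target string 'entailment'; higher is better."""
    inputs = tokenizer(f'mnli premise: {premise} hypothesis: {hypothesis}',
                       return_tensors='pt')
    labels = tokenizer('entailment', return_tensors='pt').input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss  # mean NLL per target token
    return -loss.item()
```

Each WinoGrande option fills the blank to form one hypothesis; the option whose pair scores higher is selected.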

TREC 2020 Notebook: CAsT Track

This notebook describes our participation (h2oloo) in TREC CAsT 2020. We first illustrate our multi-stage pipeline for conversational search: sequence-to-sequence query reformulation followed by an ad hoc text ranking pipeline; then, detail our proposed met...

Query Reformulation Using Query History for Passage Retrieval in Conversational Search

Passage retrieval in a conversational context is essential for many downstream applications; it is however extremely challenging due to limited data resources. To address this problem, we present an effective multi-stage pipeline for passage ranking in conv...

2019

Query and Answer Expansion from Conversation History

In this paper, we present our methods, experimental analysis, and final submissions for the Conversational Assistance Track (CAsT) at TREC 2019. In addition to language understanding, extracting knowledge from historical dialogues (e.g., previous queries, sea...

2018

HOP-Rec: High-Order Proximity for Implicit Recommendation

Best Paper Runner-Up

Recommender systems are vital ingredients for many e-commerce services. In the literature, two of the most popular approaches are based on factorization and graph-based models; the former approach captures user preferences by factorizing the observed direct...
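
A rough sketch of the random-walk sampling behind the high-order signal; the hop pattern and the order bookkeeping here are a simplified reading of the method, not the paper's code:

```python
import random
from typing import Dict, List, Tuple

def sample_high_order_positives(user: int,
                                user_items: Dict[int, List[int]],
                                item_users: Dict[int, List[int]],
                                max_order: int = 3) -> List[Tuple[int, int]]:
    """Walk the user-item bipartite graph: start from one of the user's
    observed items, then alternate item -> user -> item hops. Items
    reached at order k are treated as positives whose confidence should
    decay with k (e.g., weighted by 1/k in the ranking loss)."""
    if not user_items.get(user):
        return []
    item = random.choice(user_items[user])
    samples = [(item, 1)]  # order-1: a directly observed interaction
    for k in range(2, max_order + 1):
        user = random.choice(item_users[item])  # hop to a co-consuming user
        item = random.choice(user_items[user])  # hop to one of their items
        samples.append((item, k))
    return samples
```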

Projects