Trip+: Benchmarking Agents in Personalized Interactive Travel Planning

A personalized, multi-turn travel-planning benchmark for language agents


Code and data are available at junle-chen/trip-plus.

∗ ∗ ∗

Abstract

Interactive travel planning is a popular use case for language models: agents must manage evolving preferences and unexpected disruptions over many turns, making complex, profile-conditioned decisions. Existing benchmarks tend to evaluate feasibility, personalization, or interaction in relatively isolated settings. We introduce Trip+ to measure whether agents can plan travel holistically. Given traveler profiles and dynamic interactions, agents must generate and revise minute-level itineraries. End-to-end traveler experience is evaluated through an LLM-based simulator, enabling subjective metrics such as fatigue. Scenarios range from simple request resolutions to complex environment-driven replanning. Evaluating 18 language models, we find a consistent gap in experiential quality: models favor technically feasible but exhausting itineraries that diverge sharply from profiled traveler preferences.

**Figure 1.** Positioning travel-planning benchmarks along two axes: *personalization richness* (constraints → profiles → experience) and *interaction richness* (one-shot → targeted → diverse). Trip+ targets the joint frontier of rich personalization and diverse long-horizon interaction.

∗ ∗ ∗

Overview

As language agents move toward real-world products, travel planning has emerged as a representative task that goes well beyond one-shot execution. Itinerary design unfolds through multi-turn interaction: travelers refine preferences, introduce constraints, resolve conflicts, and react to changing conditions. This makes travel planning an ideal testbed for personalized agents — requiring itineraries that are simultaneously executable, profile-aligned, and consistent with accumulated user intent.

The underexplored frontier is the joint handling of rich personalization and diverse long-horizon interaction. Profile-aware benchmarks typically evaluate static preference matching and neglect stateful, cross-turn evaluation; interactive replanning benchmarks often focus on isolated patterns (asking a clarifying question, incorporating one piece of feedback, or a single replan). Trip+ unifies them by treating the active user state, profile-derived suitability rules, expected response mode, and the activity-level experience trace as jointly constructed oracle fields.

**Interactive Pipeline.** A browser-native presentation view for explaining how Trip+ constructs benchmark tasks, runs an agent planning loop, and evaluates each turn.

**Figure 2.** The overall design of Trip+. A fixed 40-city travel-data sandbox and 11 traveler profiles drive four diverse multi-turn interaction archetypes. At every turn the agent chooses a response mode (**Plan / Clarification / NoSolution**), and minute-level itineraries are checked by a four-layer verifier: itinerary feasibility, requirement satisfaction, profile-conditioned user simulation, and stateful multi-turn evaluation.

∗ ∗ ∗

Key Numbers

153 multi-turn instances
570 user turns
11 traveler profiles
40 cities in sandbox
11 domain-specific tools
4 interaction archetypes
18 LMs evaluated
7.2M in-city transit records

The sandbox additionally contains 309K train rows, 38K flight rows, and thousands of attractions, restaurants, hotels, subway stations, and weather-day records.

**Figure 3.** Dataset statistics: (a) instances per traveler profile (8–23 each), (b) 40-city sandbox data volume, and (c) expected response modes by interaction type. Planning dominates; clarification and no-solution cases appear mainly in request-resolution and long-horizon interactions.

∗ ∗ ∗

How Trip+ Is Built

Trip+ is constructed in three stages, using a state-first pipeline: each turn first gets a structured hidden state (state delta, expected response mode, evaluation target) and is only then rendered into a natural-language user utterance.
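The state-first idea can be illustrated with a small sketch. The field names (`state_delta`, `expected_mode`, `evaluation_target`) and the `render_utterance` helper are hypothetical, chosen to mirror the description above rather than the released data schema, and a fixed template stands in for whatever rendering step the benchmark actually uses.

```python
from dataclasses import dataclass, field
from enum import Enum


class ResponseMode(Enum):
    PLAN = "Plan"
    CLARIFICATION = "Clarification"
    NO_SOLUTION = "NoSolution"


@dataclass
class HiddenTurnState:
    """Structured state authored before any user text exists (hypothetical schema)."""
    state_delta: dict            # what changes this turn, e.g. {"budget": 3000}
    expected_mode: ResponseMode  # the response mode the oracle expects
    evaluation_target: list = field(default_factory=list)  # what this turn is scored on


def render_utterance(state: HiddenTurnState) -> str:
    """Toy renderer: turn the state delta into a natural-language user message."""
    changes = ", ".join(f"{key} to {value}" for key, value in state.state_delta.items())
    return f"Please update the plan: change {changes}."


# The hidden state is built first; the utterance is derived from it afterwards.
turn = HiddenTurnState(
    state_delta={"budget": 3000},
    expected_mode=ResponseMode.PLAN,
    evaluation_target=["budget"],
)
utterance = render_utterance(turn)
```

Because the utterance is derived from the state and not the other way around, the evaluator never has to parse user text to know what a turn was supposed to test.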

1 · Travel-data sandbox

A normalized travel-data sandbox over 40 diverse Chinese cities covering different destination types, seasons, and local conditions. It provides reproducible data for POIs, hotels, restaurants, weather, local mobility, and intercity transport, exposed through 11 OpenAI-compatible tools (train/flight query, hotel, attractions, restaurants, location search, road route, city transport plan, city weather).
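As a rough illustration of what an OpenAI-compatible tool looks like, the sketch below declares one tool in the function-calling format and checks a model's arguments against it. The tool name `query_train`, its parameters, and the `missing_arguments` helper are assumptions made for this example; the actual tool names and schemas are defined in the Trip+ repository.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
# The name and parameters are illustrative, not the benchmark's actual schema.
TRAIN_QUERY_TOOL = {
    "type": "function",
    "function": {
        "name": "query_train",
        "description": "Look up trains between two cities on a given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string", "description": "YYYY-MM-DD"},
            },
            "required": ["origin", "destination", "date"],
        },
    },
}


def missing_arguments(tool: dict, raw_arguments: str) -> list:
    """Return required parameters absent from a model's JSON-encoded arguments."""
    arguments = json.loads(raw_arguments)
    required = tool["function"]["parameters"]["required"]
    return [name for name in required if name not in arguments]


# A tool call that forgot the destination (city and date are placeholders).
missing = missing_arguments(TRAIN_QUERY_TOOL, '{"origin": "Beijing", "date": "2026-05-01"}')
```

A check like this is one simple way a sandbox could reject malformed calls before touching the underlying data.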

2 · Eleven traveler profiles

Each profile template varies long-term user context — party composition, budget sensitivity, mobility constraints, pace, interests, accommodation style, and transport preferences. An observable profile is given to the agent as long-term user memory; the same cues activate hidden profile-derived rule IDs used only for soft-preference scoring. Profiles include Backpacker, Honeymoon Couple, Three-Generation Family, Family with Child, Slow-Paced Senior, Cultural Explorer, Budget Student, Business Traveler, Food-First, Nature Lover, and Friend Group.

3 · Four multi-turn interaction archetypes

I · User-State Evolution

User needs change across turns — party composition, budget, schedule, added must-visit attractions, dietary restrictions. (4 turns)

II · Request Resolution

The agent must clarify or resolve under-specified, conflicting, or infeasible requests, choosing the correct response mode. (3 turns)

III · Environment-Driven Replanning

External disruptions — weather risk, crowding, traffic peaks, closures, availability changes — require itinerary revisions while preserving prior constraints. (3 turns)

IV · Long-Horizon Alignment

Multiple updates are handled in sequence while preserving all earlier commitments — the hardest scenario as constraints accumulate. (5 turns)

Three response modes

At each turn the agent must strategically navigate its action space:

Plan (generate itinerary). Return a complete minute-level itinerary (transport, lodging, meals, local movement, timing, and costs) for complete requests, normal updates, and tool-verifiable revisions.
Clarification (ask a question). Used only for unresolved blocking ambiguity: a missing edit target, a hard-constraint conflict, or a conflict with hard profile facts.
NoSolution (report infeasibility). Returned only when tool evidence proves hard constraints are unsatisfiable and the user explicitly asks for an impossibility judgment.

∗ ∗ ∗

Four-Layer Evaluation Protocol

Every turn is evaluated against its hidden state — the expected response mode, active hard requirements, profile-derived expectations, environment conditions, and items to preserve across turns. Evaluation proceeds in two stages: first the response-mode gate, then, for eligible plan turns, the remaining itinerary-level layers.
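The two-stage flow can be sketched as a small function: the response-mode gate runs first, and itinerary scoring runs only for turns where a plan was both expected and produced. The function name `evaluate_turn`, the result keys, and the `score_itinerary` callback are hypothetical and stand in for the benchmark's real evaluator.

```python
from enum import Enum
from typing import Callable, Optional


class ResponseMode(Enum):
    PLAN = "Plan"
    CLARIFICATION = "Clarification"
    NO_SOLUTION = "NoSolution"


def evaluate_turn(
    chosen: ResponseMode,
    expected: ResponseMode,
    itinerary: Optional[dict],
    score_itinerary: Callable[[dict], dict],
) -> dict:
    """Stage 1: response-mode gate. Stage 2: itinerary layers for eligible plan turns."""
    result = {"mode_error": chosen is not expected, "itinerary_scores": None}
    if result["mode_error"]:
        # A mismatch is recorded as a mode error and itinerary metrics are skipped.
        return result
    if expected is ResponseMode.PLAN:
        # Only correctly gated plan turns reach the itinerary-level layers.
        result["itinerary_scores"] = score_itinerary(itinerary)
    return result


def dummy_scorer(itinerary: dict) -> dict:
    """Placeholder for the feasibility, requirement, and simulation layers."""
    return {"feasibility": 1.0}


ok = evaluate_turn(ResponseMode.PLAN, ResponseMode.PLAN, {"days": []}, dummy_scorer)
wrong = evaluate_turn(ResponseMode.CLARIFICATION, ResponseMode.PLAN, None, dummy_scorer)
```

Gating first keeps a model from earning itinerary credit on a turn where it should have asked a question or reported infeasibility.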

Response-Mode Gating. Checks whether the chosen mode matches the expectation. A mismatch is recorded as a mode error and itinerary metrics are skipped.
Itinerary Feasibility. Deterministic atomic checks for structural completeness, entity grounding, temporal coherence, venue opening hours, supported transfers, and cost arithmetic — averaged over structure, evidence, and operability.
Requirement Satisfaction. Separately measures hard-constraint satisfaction (dates, destinations, party size, budget, required lodging/dining/transport) and soft-preference satisfaction (pace, walking tolerance, budget sensitivity, comfort, interests), both via deterministic rules.
Profile-Conditioned User Simulation. An LLM simulator replays the minute-level activity sequence from the traveler's perspective, scoring 1–5 with rationales across physical, schedule, environmental, budget comfort, and preference dimensions.
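Two of the deterministic feasibility checks named above, temporal coherence and cost arithmetic, are simple enough to sketch. The `Activity` fields, the minutes-since-midnight time representation, and the tolerance value are assumptions for illustration; the benchmark's real itinerary schema may differ.

```python
from dataclasses import dataclass


@dataclass
class Activity:
    """One itinerary entry; times are minutes since midnight (assumed representation)."""
    name: str
    start_min: int
    end_min: int
    cost: float


def temporal_coherence(activities: list) -> bool:
    """True if every activity has positive duration and none starts before the previous ends."""
    for activity in activities:
        if activity.end_min <= activity.start_min:
            return False
    for previous, current in zip(activities, activities[1:]):
        if current.start_min < previous.end_min:
            return False
    return True


def cost_arithmetic(activities: list, reported_total: float, tolerance: float = 0.01) -> bool:
    """True if the itinerary's reported total matches the sum of per-activity costs."""
    return abs(sum(a.cost for a in activities) - reported_total) <= tolerance


day = [
    Activity("museum", 9 * 60, 11 * 60, 60.0),
    Activity("lunch", 11 * 60 + 30, 12 * 60 + 30, 80.0),
]
overlapping = [
    Activity("museum", 9 * 60, 11 * 60, 60.0),
    Activity("lunch", 10 * 60 + 30, 11 * 60 + 30, 80.0),
]
```

Atomic checks like these return a plain pass or fail, which is what makes this layer reproducible without an LLM judge.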

A stateful multi-turn layer judges each response against accumulated state: request fulfillment (are new turn changes incorporated?) and intent preservation (do ongoing constraints, preferences, and environment conditions remain satisfied?). Reliability is supported by four profile-conditioned judges (Qwen, Claude, Gemini, GPT families, median-aggregated) and human verification over 50 sampled cases spanning 1,825 activities.
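Median aggregation across the four judges might look like the sketch below. It assumes each judge returns one 1–5 score per dimension and that aggregation happens per dimension; the dimension keys, judge labels, and scores are made up for the example.

```python
from statistics import median

# Dimension keys are assumed names for the five simulator dimensions.
DIMENSIONS = ("physical", "schedule", "environmental", "budget_comfort", "preference")


def aggregate_judges(judge_scores: dict) -> dict:
    """Take the per-dimension median over all judges' 1-5 scores."""
    return {
        dimension: median(scores[dimension] for scores in judge_scores.values())
        for dimension in DIMENSIONS
    }


# Made-up scores for a single itinerary, one dict per judge family.
scores = {
    "qwen":   {"physical": 2, "schedule": 3, "environmental": 4, "budget_comfort": 4, "preference": 3},
    "claude": {"physical": 3, "schedule": 3, "environmental": 4, "budget_comfort": 5, "preference": 3},
    "gemini": {"physical": 4, "schedule": 3, "environmental": 3, "budget_comfort": 4, "preference": 4},
    "gpt":    {"physical": 4, "schedule": 2, "environmental": 4, "budget_comfort": 4, "preference": 4},
}
aggregated = aggregate_judges(scores)
```

With an even number of judges, `statistics.median` returns the mean of the two middle values, so an aggregated score can land on a half point (here `physical` becomes 3.5). The median also limits the influence of any single outlying judge.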

**Figure 8.** Reliability and verification of the four user-simulation judges used for median aggregation. **(a)** Pairwise Spearman rank alignment across the Qwen, Gemini, GPT, and Claude judges; **(b)** overall ensemble reliability — Cronbach's α = 0.833 ("good reliability") across 4 judges and 18 models; **(c)** human rationale verification, where judge rationales agree with human review on ≥89.8% of cases per dimension (96.5% for preference satisfaction), over 50 sampled cases covering 1,825 activity-level evaluations. Reproduced from the paper appendix.
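For readers unfamiliar with Cronbach's α, the standard formula is α = k/(k−1) · (1 − Σ σᵢ² / σ_total²), where k is the number of raters, σᵢ² is the variance of rater i's scores, and σ_total² is the variance of the per-subject score sums. The sketch below implements that textbook formula with judges as raters; how the paper arranges its ratings (for example, which units serve as subjects) is not stated here, so the data layout and the scores are assumptions.

```python
from statistics import variance


def cronbach_alpha(ratings: list) -> float:
    """Cronbach's alpha for ratings[i][j] = score from rater i on subject j."""
    k = len(ratings)
    # Sum of each rater's sample variance across subjects.
    rater_variances = sum(variance(rater) for rater in ratings)
    # Sample variance of the per-subject totals across raters.
    totals = [sum(column) for column in zip(*ratings)]
    return (k / (k - 1)) * (1 - rater_variances / variance(totals))


# Made-up scores: three raters, four subjects.
alpha = cronbach_alpha([
    [1, 2, 3, 4],
    [2, 2, 3, 5],
    [1, 3, 3, 4],
])

# Raters in perfect agreement give the maximum value of 1.
perfect = cronbach_alpha([[1, 2, 3], [1, 2, 3]])
```

The function is undefined when every subject receives the same total (zero total variance), so real use would need a guard for that case.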

∗ ∗ ∗

Main Results

We evaluate 18 agentic models under the same lightweight OpenAI-compatible function-calling scaffold. Plan Avg. averages the four valid-plan quality metrics; Win(%) is the share of non-aggregate metrics on which a model ranks first. Bold = best, underline = second-best in each column.
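The two summary columns can be sketched as follows. The metric names, the choice of which four metrics feed Plan Avg., the tie handling (every tied model counts as first), and all numbers are assumptions for illustration; every metric is treated as higher-is-better.

```python
def plan_avg(metrics: dict, plan_metric_names: list) -> float:
    """Mean of the valid-plan quality metrics for one model."""
    return sum(metrics[name] for name in plan_metric_names) / len(plan_metric_names)


def win_rate(results: dict, metric_names: list) -> dict:
    """Percentage of metrics on which each model ranks first (ties count for all)."""
    wins = {model: 0 for model in results}
    for metric in metric_names:
        best = max(scores[metric] for scores in results.values())
        for model, scores in results.items():
            if scores[metric] == best:
                wins[model] += 1
    return {model: 100.0 * count / len(metric_names) for model, count in wins.items()}


# Made-up scores for two hypothetical models.
PLAN_METRICS = ["feasibility", "hard", "soft", "simulation"]
results = {
    "model_a": {"feasibility": 0.9, "hard": 0.8, "soft": 0.6, "simulation": 0.5},
    "model_b": {"feasibility": 0.8, "hard": 0.9, "soft": 0.5, "simulation": 0.4},
}
averages = {model: plan_avg(scores, PLAN_METRICS) for model, scores in results.items()}
wins = win_rate(results, PLAN_METRICS)
```

In this toy example `model_a` ranks first on three of four metrics and `model_b` on one. Win(%) excludes aggregate columns such as Plan Avg., so a model is not credited twice for the same underlying metrics.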

Two findings stand out. (1) Current LLM agents remain unreliable in realistic multi-turn planning: they still err in deciding whether to plan, clarify, or report infeasibility, and often fail to preserve earlier user needs. (2) Feasible itineraries are not necessarily user-aligned: even the strongest model scores only ~0.64 on soft preferences, and the best user-simulation score (GPT-5.4) is just 0.518. Satisfying explicit constraints is insufficient for matching implicit, evolving preferences.

∗ ∗ ∗

In-Depth Analysis

We focus on Gemini-3.1-Pro-Preview, the strongest overall model, to diagnose two remaining gaps: unreliable interaction and unsuitable plans.

**Figure 4.** Interaction reliability of Gemini-3.1-Pro-Preview across the four scenario types. The model handles early updates but struggles as requirements accumulate; clarification remains a bottleneck, and long-horizon alignment is hardest: at the ambiguous Turn 3, request fulfillment plummets to 0.41.

Takeaway 1. Better interaction requires revising plans reliably as constraints grow — unreliability comes mainly from state-consistent revision, not response-mode selection.

**Figure 5.** Component scores (left) and error rates (right) for valid plans. Hard constraints like dates, hotels, and party size score ≥0.95, but itinerary-level choices (transport, attractions, meals) are weaker. **Pace is the most systematic personalization failure**: comfort & pace scores only 0.39, and 99% of valid plans contain pacing burden. Traveler fatigue appears in 97% of plans, environmental exposure in 95%.

Takeaway 2. Better personalization depends on pacing and burden control — many feasible itineraries are still exhausting or environmentally unsuitable.

**Figure 6.** Turn-1 inference cost (avg. LLM / tool calls per first turn) vs. Plan Avg. for frontier models. There is no monotonic relationship: Gemini Pro achieves the highest Turn-1 Plan Avg. with moderate usage, while heavier-calling models (e.g. DS-V4, Kimi) do not pull ahead.

Takeaway 3. Better planning depends on effective evidence use, not more calls.

∗ ∗ ∗

How Trip+ Compares

Existing travel-planning benchmarks fall short of comprehensive evaluation. Profile-aware benchmarks evaluate static preference matching; interactive replanning benchmarks focus on isolated patterns. Trip+ is the only benchmark supporting grounding, profiles, interaction, fine-grained itineraries, feasibility, stateful evaluation, user simulation, and open sourcing together.

**Table 1.** Comparison of travel-planning benchmarks across task construction, evaluation, and resource dimensions (✓ supported · ~ partial / indirect · ✗ not supported). Trip+ is the only benchmark that jointly supports grounding, profiles, interaction, fine-grained itineraries, feasibility, stateful evaluation, user simulation, and open sourcing. Reproduced from the paper.

∗ ∗ ∗

Conclusion

Trip+ evaluates travel-planning agents on generating feasible, traveler-suitable, and intent-consistent itineraries under dynamic user needs and environments. Our evaluation reveals a clear gap: while current models satisfy basic feasibility and explicit constraints, they struggle with stateful revisions, user alignment, and effective evidence use. We hope Trip+ drives the development of agents that plan adaptive, profile-aligned experiences — rather than merely executable trips.

∗ ∗ ∗

BibTeX

@misc{chen2026tripbenchmarkingagentspersonalized,
      title={Trip+: Benchmarking Agents in Personalized Interactive Travel Planning},
      author={Junle Chen and Wei Chen and Yehong Xu and Zhengjun Huang and Yuqian Wu and Zhoujin Tian and Kai Wang and Lei Wang and Xiaofang Zhou},
      year={2026},
      eprint={2606.21169},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.21169},
}

arXiv preprint. Code and data are available at junle-chen/trip-plus.
