
Trip+: Benchmarking Agents in Personalized Interactive Travel Planning

A personalized, multi-turn travel-planning benchmark for language agents

Code and data are available at junle-chen/trip-plus.
∗   ∗   ∗

Abstract

Interactive travel planning is a popular use case for language models: agents must manage evolving preferences and unexpected disruptions over many turns, making complex, profile-conditioned decisions. Existing benchmarks tend to evaluate feasibility, personalization, or interaction in relatively isolated settings. We introduce Trip+ to measure whether agents can plan travel holistically. Given traveler profiles and dynamic interactions, agents must generate and revise minute-level itineraries. End-to-end traveler experience is evaluated through an LLM-based simulator, enabling subjective metrics such as fatigue. Scenarios range from simple request resolutions to complex environment-driven replanning. Evaluating 18 language models, we find a consistent gap in experiential quality: models favor technically feasible but exhausting itineraries that diverge sharply from profiled traveler preferences.

Positioning of travel-planning benchmarks
Figure 1. Positioning travel-planning benchmarks along two axes — personalization richness (constraints → profiles → experience) and interaction richness (one-shot → targeted → diverse). Trip+ targets the joint frontier of rich personalization and diverse long-horizon interaction.
∗   ∗   ∗

Overview

As language agents move toward real-world products, travel planning has emerged as a representative task that goes well beyond one-shot execution. Itinerary design unfolds through multi-turn interaction: travelers refine preferences, introduce constraints, resolve conflicts, and react to changing conditions. This makes travel planning an ideal testbed for personalized agents — requiring itineraries that are simultaneously executable, profile-aligned, and consistent with accumulated user intent.

The underexplored frontier is the joint handling of rich personalization and diverse long-horizon interaction. Profile-aware benchmarks typically evaluate static preference matching and neglect stateful, cross-turn evaluation; interactive replanning benchmarks often focus on isolated patterns (asking a clarifying question, incorporating one piece of feedback, or a single replan). Trip+ unifies them by treating the active user state, profile-derived suitability rules, expected response mode, and the activity-level experience trace as jointly constructed oracle fields.

How Trip+ Constructs, Plans, and Evaluates

[Interactive pipeline figure, three panels.]

1 · Benchmark Construction. Sandbox construction (40 cities, 11 tools; transport, hotels, POIs, meals, weather, mobility), traveler profiles (11 profiles), and task construction from a base query. Example base query: three-generation family, 4-day Hong Kong trip, must visit Disneyland, flight preferred, budget 10,000, family meals. The base query is expanded into four scenarios: I · User-State Evolution (4-turn scenario; budget, schedule, and must-visits evolve over turns), II · Request Resolution (3-turn scenario; resolve conflicts, ambiguity, or missing edit targets), III · Environment Replanning (3-turn scenario; replan after weather, closure, crowd, or traffic changes), and IV · Long-Horizon Alignment (5-turn scenario; preserve earlier intent while later turns accumulate).

2 · Agent Planning. Query input, sandbox tool calls (transport, mobility, POI, hotel, meal, weather), mode decision (Plan / Clarify / NoSolution), and minute-plan output (for example: 09:00 taxi from hotel to attraction, 09:30 Disneyland visit block, 12:10 walk to nearby restaurant, 14:00 revise for rain and fatigue). The turn state kept for evaluation records the response target, the active requirements (destination, date, POI, budget), and state changes (new requirement, weather change).

3 · Every-Turn Evaluation. (1) Response mode, gated first; (2) itinerary feasibility via deterministic checks (structure, entities, timing, transport, cost); (3) requirement satisfaction, hard (date, route, party, budget, places) and soft (pace, walking, interest, comfort); (4) user simulation, with 1 to 5 experience scores for physical, schedule, environment, budget, and preference. Stateful multi-turn evaluation adds request fulfillment (new or revised need solved) and intent preservation (prior constraints still kept), summarized as mode correctness, plan quality, and state reliability.

Interactive Pipeline. A browser-native presentation view for explaining how Trip+ constructs benchmark tasks, runs an agent planning loop, and evaluates each turn.
Overall design of Trip+
Figure 2. The overall design of Trip+. A fixed 40-city travel-data sandbox and 11 traveler profiles drive four diverse multi-turn interaction archetypes. At every turn the agent chooses a response mode (Plan / Clarification / NoSolution), and minute-level itineraries are checked by a four-layer verifier: itinerary feasibility, requirement satisfaction, profile-conditioned user simulation, and stateful multi-turn evaluation.
∗   ∗   ∗

Key Numbers

153 multi-turn instances
570 user turns
11 traveler profiles
40 cities in sandbox
11 domain-specific tools
4 interaction archetypes
18 LMs evaluated
7.2M in-city transit records

The sandbox additionally contains 309K train rows, 38K flight rows, and thousands of attractions, restaurants, hotels, subway stations, and weather-day records.

Dataset statistics of Trip+
Figure 3. Dataset statistics: (a) instances per traveler profile (8–23 each), (b) 40-city sandbox data volume, and (c) expected response modes by interaction type. Planning dominates; clarification and no-solution cases appear mainly in request-resolution and long-horizon interactions.
∗   ∗   ∗

How Trip+ Is Built

Trip+ is constructed in three stages, using a state-first pipeline: each turn first gets a structured hidden state (state delta, expected response mode, evaluation target) and is only then rendered into a natural-language user utterance.
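
The state-first idea can be illustrated with a small sketch. It assumes a hidden turn record holding the three fields named above and a simple "later turn wins" rule for folding state deltas into the active user state. The class name, field names, merge rule, and example values (a budget cut to 8,000, a fifth day) are hypothetical and are not taken from the Trip+ codebase.

```python
from dataclasses import dataclass
from enum import Enum


class ResponseMode(Enum):
    PLAN = "Plan"
    CLARIFICATION = "Clarification"
    NO_SOLUTION = "NoSolution"


@dataclass
class HiddenTurnState:
    """Structured state authored before the user utterance is written (hypothetical)."""
    state_delta: dict            # what changes this turn, e.g. {"budget": 8000}
    expected_mode: ResponseMode  # response mode the agent is expected to choose
    evaluation_target: str       # what the verifier should check this turn


def accumulate(base: dict, turns: list[HiddenTurnState]) -> dict:
    """Fold each turn's delta into the active user state; later turns override."""
    active = dict(base)
    for turn in turns:
        active.update(turn.state_delta)
    return active


# Base query modeled on the page's Hong Kong example; the deltas are illustrative.
base_query = {"destination": "Hong Kong", "days": 4, "budget": 10_000,
              "must_visit": ["Disneyland"]}
turns = [
    HiddenTurnState({"budget": 8_000}, ResponseMode.PLAN, "budget respected"),
    HiddenTurnState({"days": 5}, ResponseMode.PLAN, "schedule extended"),
]
active_state = accumulate(base_query, turns)
print(active_state)
```

In this sketch the natural-language utterance would be rendered from each `HiddenTurnState` afterwards, so the evaluation target never depends on how the utterance happens to be phrased.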

1 · Travel-data sandbox

The first stage is a normalized travel-data sandbox over 40 diverse Chinese cities covering different destination types, seasons, and local conditions. It provides reproducible data for POIs, hotels, restaurants, weather, local mobility, and intercity transport, exposed through 11 OpenAI-compatible tools (including train and flight queries, hotel, attraction, and restaurant lookup, location search, road routing, city transport planning, and city weather).
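
As a rough illustration of what an OpenAI-compatible tool looks like, the sketch below declares one tool in the standard function-calling schema and checks a model-issued call against it. The tool name `city_weather`, its parameters, and the `validate_call` helper are assumptions made for this example; the actual Trip+ tool signatures may differ.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling format.
# The name and parameters are illustrative, not the benchmark's real schema.
city_weather_tool = {
    "type": "function",
    "function": {
        "name": "city_weather",
        "description": "Look up the sandbox weather record for a city on a date.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
                "date": {"type": "string", "description": "Date as YYYY-MM-DD"},
            },
            "required": ["city", "date"],
        },
    },
}


def validate_call(tool: dict, arguments_json: str) -> dict:
    """Parse a model-issued call and check that required arguments are present."""
    args = json.loads(arguments_json)
    required = tool["function"]["parameters"]["required"]
    missing = [key for key in required if key not in args]
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return args


# Example call with illustrative argument values.
args = validate_call(city_weather_tool, '{"city": "Hong Kong", "date": "2026-05-01"}')
print(args)
```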

2 · Eleven traveler profiles

Each profile template varies long-term user context — party composition, budget sensitivity, mobility constraints, pace, interests, accommodation style, and transport preferences. An observable profile is given to the agent as long-term user memory; the same cues activate hidden profile-derived rule IDs used only for soft-preference scoring. Profiles include Backpacker, Honeymoon Couple, Three-Generation Family, Family with Child, Slow-Paced Senior, Cultural Explorer, Budget Student, Business Traveler, Food-First, Nature Lover, and Friend Group.

3 · Four multi-turn interaction archetypes

I · User-State Evolution

User needs change across turns — party composition, budget, schedule, added must-visit attractions, dietary restrictions. (4 turns)

II · Request Resolution

The agent must clarify or resolve under-specified, conflicting, or infeasible requests, choosing the correct response mode. (3 turns)

III · Environment-Driven Replanning

External disruptions — weather risk, crowding, traffic peaks, closures, availability changes — require itinerary revisions while preserving prior constraints. (3 turns)

IV · Long-Horizon Alignment

Multiple updates are handled in sequence while preserving all earlier commitments — the hardest scenario as constraints accumulate. (5 turns)

Three response modes

At each turn the agent must strategically navigate its action space:

Plan

Generate itinerary

Return a complete minute-level itinerary: transport, lodging, meals, local movement, timing, and costs — for complete requests, normal updates, and tool-verifiable revisions.

Clarification

Ask a question

Used only for unresolved blocking ambiguity: a missing edit target, a hard-constraint conflict, or a conflict with hard profile facts.

NoSolution

Report infeasibility

Returned only when tool evidence proves hard constraints are unsatisfiable and the user explicitly asks for an impossibility judgment.
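
The conditions for the three modes can be read as a small decision rule. The sketch below is one possible encoding, assuming three boolean signals that an agent (or an oracle) has already derived for the turn. The signal names and the priority order are assumptions for illustration, not the benchmark's reference logic.

```python
from dataclasses import dataclass


@dataclass
class TurnSignals:
    """Hypothetical per-turn signals; names are illustrative."""
    blocking_ambiguity: bool      # missing edit target or unresolved hard conflict
    proven_infeasible: bool       # tool evidence shows hard constraints cannot hold
    user_asked_feasibility: bool  # user explicitly requests an impossibility judgment


def choose_mode(signals: TurnSignals) -> str:
    """Pick a response mode from the turn signals."""
    # NoSolution needs both tool-backed infeasibility and an explicit user request.
    if signals.proven_infeasible and signals.user_asked_feasibility:
        return "NoSolution"
    # Clarification is reserved for ambiguity that blocks planning.
    if signals.blocking_ambiguity:
        return "Clarification"
    # Everything else is treated as a planning turn.
    return "Plan"


print(choose_mode(TurnSignals(False, False, False)))
print(choose_mode(TurnSignals(True, False, False)))
print(choose_mode(TurnSignals(False, True, True)))
```

Under this reading, a turn whose constraints are infeasible but where the user has not asked for an impossibility judgment does not qualify for NoSolution, which mirrors the "only when" wording above.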

∗   ∗   ∗

Four-Layer Evaluation Protocol

Every turn is evaluated against its hidden state — the expected response mode, active hard requirements, profile-derived expectations, environment conditions, and items to preserve across turns. Evaluation proceeds in two stages: first the response-mode gate, then, for eligible plan turns, the remaining itinerary-level layers.

  1. Response-Mode Gating. Checks whether the chosen mode matches the expectation. A mismatch is recorded as a mode error and itinerary metrics are skipped.
  2. Itinerary Feasibility. Deterministic atomic checks for structural completeness, entity grounding, temporal coherence, venue opening hours, supported transfers, and cost arithmetic — averaged over structure, evidence, and operability.
  3. Requirement Satisfaction. Separately measures hard-constraint satisfaction (dates, destinations, party size, budget, required lodging/dining/transport) and soft-preference satisfaction (pace, walking tolerance, budget sensitivity, comfort, interests), both via deterministic rules.
  4. Profile-Conditioned User Simulation. An LLM simulator replays the minute-level activity sequence from the traveler's perspective, scoring 1–5 with rationales across physical, schedule, environmental, budget comfort, and preference dimensions.
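
The gate-then-check flow in the list above can be sketched as follows. The example implements only two of the deterministic checks (temporal coherence and cost arithmetic) and averages them with equal weight. The activity fields, the tolerance, and the equal weighting are assumptions, and the activity times and costs are made up for illustration.

```python
from datetime import datetime


def minutes(hhmm: str) -> int:
    """Convert an HH:MM string to minutes after midnight."""
    parsed = datetime.strptime(hhmm, "%H:%M")
    return parsed.hour * 60 + parsed.minute


def temporally_coherent(activities: list[dict]) -> bool:
    """Each activity ends after it starts and does not overlap the next one."""
    spans = [(minutes(a["start"]), minutes(a["end"])) for a in activities]
    if any(start >= end for start, end in spans):
        return False
    return all(prev[1] <= nxt[0] for prev, nxt in zip(spans, spans[1:]))


def cost_consistent(activities: list[dict], stated_total: float, tol: float = 0.01) -> bool:
    """The stated total matches the sum of per-activity costs."""
    return abs(sum(a["cost"] for a in activities) - stated_total) <= tol


def evaluate_turn(expected_mode: str, chosen_mode: str,
                  activities: list[dict], stated_total: float) -> dict:
    """Apply the response-mode gate first, then itinerary checks for plan turns."""
    if chosen_mode != expected_mode:
        return {"mode_error": True}  # itinerary metrics are skipped
    if chosen_mode != "Plan":
        return {"mode_error": False}
    checks = [temporally_coherent(activities), cost_consistent(activities, stated_total)]
    return {"mode_error": False, "feasibility": sum(checks) / len(checks)}


# Illustrative minute-level plan; times and costs are invented for the example.
plan = [
    {"name": "taxi to attraction", "start": "09:00", "end": "09:30", "cost": 80.0},
    {"name": "Disneyland visit", "start": "09:30", "end": "12:10", "cost": 600.0},
    {"name": "lunch nearby", "start": "12:10", "end": "13:00", "cost": 200.0},
]
result = evaluate_turn("Plan", "Plan", plan, stated_total=880.0)
gated = evaluate_turn("Clarification", "Plan", plan, stated_total=880.0)
print(result, gated)
```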

A stateful multi-turn layer judges each response against accumulated state: request fulfillment (are new turn changes incorporated?) and intent preservation (do ongoing constraints, preferences, and environment conditions remain satisfied?). Reliability is supported by four profile-conditioned judges (Qwen, Claude, Gemini, GPT families, median-aggregated) and human verification over 50 sampled cases spanning 1,825 activities.
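
A minimal sketch of the two stateful scores, assuming requirements and plan facts can both be written as flat key-value pairs and compared by equality. Real constraints such as budgets are inequalities, so this is a deliberate simplification; the function names and example values are hypothetical.

```python
def fraction_satisfied(requirements: dict, plan_facts: dict) -> float:
    """Share of requirements whose value matches the plan (1.0 if none apply)."""
    if not requirements:
        return 1.0
    hits = sum(plan_facts.get(key) == value for key, value in requirements.items())
    return hits / len(requirements)


def stateful_scores(prior_state: dict, turn_delta: dict, plan_facts: dict) -> dict:
    """Score the new request and the carried-over intent separately."""
    # Carried-over intent is everything from earlier turns not overwritten this turn.
    preserved = {k: v for k, v in prior_state.items() if k not in turn_delta}
    return {
        "request_fulfillment": fraction_satisfied(turn_delta, plan_facts),
        "intent_preservation": fraction_satisfied(preserved, plan_facts),
    }


# Illustrative turn: the new budget cap is honored, but the plan silently
# switches from flight to train, breaking an earlier commitment.
prior = {"destination": "Hong Kong", "days": 4, "transport": "flight"}
delta = {"budget_cap": 8000}
plan_facts = {"destination": "Hong Kong", "days": 4,
              "transport": "train", "budget_cap": 8000}
scores = stateful_scores(prior, delta, plan_facts)
print(scores)
```

Keeping the two scores separate makes it visible when an agent satisfies the latest request at the cost of an earlier one.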

Reliability and verification of the four user-simulation judges
Figure 8. Reliability and verification of the four user-simulation judges used for median aggregation. (a) Pairwise Spearman rank alignment across the Qwen, Gemini, GPT, and Claude judges; (b) overall ensemble reliability — Cronbach's α = 0.833 ("good reliability") across 4 judges and 18 models; (c) human rationale verification, where judge rationales agree with human review on ≥89.8% of cases per dimension (96.5% for preference satisfaction), over 50 sampled cases covering 1,825 activity-level evaluations. Reproduced from the paper appendix.
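
For readers unfamiliar with the two statistics in this caption, the sketch below shows median aggregation across judges and the standard Cronbach's α formula, α = k/(k−1) · (1 − Σ item variances / variance of totals), with the judges treated as the k items. The judge names and scores are made up for illustration and do not reproduce the paper's numbers.

```python
from statistics import median, variance


def median_aggregate(judge_scores: dict[str, list[float]]) -> list[float]:
    """Per-case median across judges."""
    return [median(column) for column in zip(*judge_scores.values())]


def cronbach_alpha(judge_scores: dict[str, list[float]]) -> float:
    """Cronbach's alpha with judges as items and cases as observations."""
    rows = list(judge_scores.values())
    k = len(rows)
    item_variance = sum(variance(row) for row in rows)   # sample variances
    totals = [sum(column) for column in zip(*rows)]      # per-case sum over judges
    return k / (k - 1) * (1 - item_variance / variance(totals))


# Hypothetical 1-5 scores from four judges on five cases.
toy_scores = {
    "judge_a": [3, 4, 2, 5, 1],
    "judge_b": [3, 5, 2, 4, 1],
    "judge_c": [2, 4, 3, 5, 1],
    "judge_d": [3, 4, 2, 5, 2],
}
aggregated = median_aggregate(toy_scores)
alpha = cronbach_alpha(toy_scores)
print(aggregated, round(alpha, 3))
```

The median is less sensitive than the mean to a single judge that scores a case very differently from the other three, which is one plausible reason to prefer it for a four-judge ensemble.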
∗   ∗   ∗

Main Results

We evaluate 18 agentic models under the same lightweight OpenAI-compatible function-calling scaffold. Plan Avg. averages the four valid-plan quality metrics; Win(%) is the share of non-aggregate metrics on which a model ranks first. Bold = best, underline = second-best in each column.

Table 2: main results on Trip+
Table 2. Main results on Trip+, with models grouped into frontier and lightweight families. Plan Avg. averages the four valid-plan quality metrics; Win(%) is the share of non-aggregate metrics on which a model ranks first. Bold = best, underline = second-best in each column. Reproduced from the paper.
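
The two aggregate columns can be computed as in the sketch below. It assumes the four valid-plan quality metrics are feasibility, hard-constraint, soft-preference, and user-simulation scores, and that ties for first place go to the model listed first; both are assumptions for this example. The model names and numbers are invented and are not taken from Table 2.

```python
def plan_avg(metrics: dict[str, float], plan_keys: list[str]) -> float:
    """Unweighted mean of the valid-plan quality metrics."""
    return sum(metrics[key] for key in plan_keys) / len(plan_keys)


def win_rate(results: dict[str, dict[str, float]], metric_keys: list[str]) -> dict[str, float]:
    """Percentage of non-aggregate metrics on which each model ranks first."""
    wins = {model: 0 for model in results}
    for key in metric_keys:
        # Ties go to the model that appears first (an assumption of this sketch).
        best = max(results, key=lambda model: results[model][key])
        wins[best] += 1
    return {model: 100 * count / len(metric_keys) for model, count in wins.items()}


# Hypothetical metric names and scores for two placeholder models.
PLAN_KEYS = ["feasibility", "hard", "soft", "user_sim"]
toy_results = {
    "model_a": {"feasibility": 0.90, "hard": 0.80, "soft": 0.60, "user_sim": 0.50},
    "model_b": {"feasibility": 0.85, "hard": 0.90, "soft": 0.55, "user_sim": 0.45},
}
averages = {model: plan_avg(scores, PLAN_KEYS) for model, scores in toy_results.items()}
wins = win_rate(toy_results, PLAN_KEYS)
print(averages, wins)
```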

Two headline findings emerge. (1) Current LLM agents remain unreliable in realistic multi-turn planning: they still err in deciding whether to plan, clarify, or report infeasibility, and often fail to preserve earlier user needs. (2) Feasible itineraries are not necessarily user-aligned: even the strongest model scores only ~0.64 on soft preferences, and the best user-simulation score (GPT-5.4) is just 0.518. Satisfying explicit constraints is insufficient for matching implicit, evolving preferences.

∗   ∗   ∗

In-Depth Analysis

We focus on Gemini-3.1-Pro-Preview, the strongest overall model, to diagnose two remaining gaps: unreliable interaction and unsuitable plans.

Interaction reliability across scenarios
Figure 4. Interaction reliability of Gemini-3.1-Pro-Preview across the four scenario types. The model handles early updates but struggles as requirements accumulate; clarification remains a bottleneck, and long-horizon alignment is hardest — at the ambiguous Turn 3, request fulfillment plummets to 0.41.
Takeaway 1. Better interaction requires revising plans reliably as constraints grow: unreliability stems mainly from failures in state-consistent revision rather than from response-mode selection.
Performance scores and error analysis
Figure 5. Component scores (left) and error rates (right) for valid plans. Hard constraints such as dates, hotels, and party size score ≥0.95, but itinerary-level choices (transport, attractions, meals) are weaker. Pace is the most systematic personalization failure: comfort & pace scores only 0.39, and 99% of valid plans contain a pacing burden. Traveler fatigue appears in 97% of plans and environmental exposure in 95%.
Takeaway 2. Better personalization depends on pacing and burden control — many feasible itineraries are still exhausting or environmentally unsuitable.
Inference cost vs plan quality
Figure 6. Turn-1 inference cost (avg. LLM / tool calls per first turn) vs. Plan Avg. for frontier models. There is no monotonic relationship: Gemini Pro achieves the highest Turn-1 Plan Avg. with moderate usage, while heavier-calling models (e.g. DS-V4, Kimi) do not pull ahead.
Takeaway 3. Better planning depends on effective evidence use, not more calls.
∗   ∗   ∗

How Trip+ Compares

Existing travel-planning benchmarks fall short of comprehensive evaluation. Profile-aware benchmarks evaluate static preference matching; interactive replanning benchmarks focus on isolated patterns. Trip+ is the only benchmark supporting grounding, profiles, interaction, fine-grained itineraries, feasibility, stateful evaluation, user simulation, and open sourcing together.

Table 1: comparison of travel-planning benchmarks
Table 1. Comparison of travel-planning benchmarks across task construction, evaluation, and resource dimensions (✓ supported · ~ partial / indirect · ✗ not supported). Trip+ is the only benchmark that jointly supports grounding, profiles, interaction, fine-grained itineraries, feasibility, stateful evaluation, user simulation, and open sourcing. Reproduced from the paper.
∗   ∗   ∗

Conclusion

Trip+ evaluates travel-planning agents on generating feasible, traveler-suitable, and intent-consistent itineraries under dynamic user needs and environments. Our evaluation reveals a clear gap: while current models satisfy basic feasibility and explicit constraints, they struggle with stateful revisions, user alignment, and effective evidence use. We hope Trip+ drives the development of agents that plan adaptive, profile-aligned experiences — rather than merely executable trips.

∗   ∗   ∗

BibTeX

@misc{chen2026tripbenchmarkingagentspersonalized,
      title={Trip+: Benchmarking Agents in Personalized Interactive Travel Planning},
      author={Junle Chen and Wei Chen and Yehong Xu and Zhengjun Huang and Yuqian Wu and Zhoujin Tian and Kai Wang and Lei Wang and Xiaofang Zhou},
      year={2026},
      eprint={2606.21169},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2606.21169},
}

arXiv preprint.