podhmo/TanStack-Start-開発ライブラリとドキュメント.md

Last active December 2, 2025 11:23


Uploaded via Gist Uploader - 2025-12-02T11:11:03.268Z

article.md

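Build log: a documentation search foundation for TanStack Start development

🤔 I want to start developing with TanStack Start, but where are the information sources?

I decided to build with TanStack Start. This framework competes with Next.js and Remix and reportedly puts type safety first. It looks like I will need to understand the whole ecosystem.

First, let me sort out the libraries I need and the documentation I should read. Start is less a standalone tool than a bundle of the TanStack family.

Core (required)

TanStack Start: the framework itself. Provides SSR and Server Functions.
TanStack Router: the foundation. Start is effectively "Router with server features".
Vite: the build tool.

Standard Stack (recommended)

TanStack Query: async data management. Integrated with Router's loaders.
TanStack Form: form management. Strongly typed and a good fit for Server Functions.
Zod: schema validation. Used to define the types of Router's search params.

Backend / UI (recommended)

Drizzle ORM: database access. Lighter than Prisma, with a close fit to TypeScript.
Tailwind CSS: styling.

As for learning order, understanding Router comes first (it is about 80% of the work). Next come Start's server features, then peripheral libraries such as Query and Form. Reading the GitHub examples may be the fastest route.

💡 Let's build an MCP server for documentation search

Jumping back and forth between all of these docs during development is painful. So I want to build an MCP server[^1] for documentation search that lets me pull out information efficiently.

For that, I first need a list of documentation URLs to crawl. I will make it comprehensive and include the basics for React and TypeScript as well.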

[
  "https://react.dev/",
  "https://www.typescriptlang.org/docs/",
  "https://developer.mozilla.org/ja/docs/Web/API",
  "https://tanstack.com/start/latest/docs",
  "https://tanstack.com/router/latest/docs",
  "https://tanstack.com/query/latest/docs",
  "https://tanstack.com/form/latest/docs",
  "https://zod.dev/",
  "https://vite.dev/guide/",
  "https://vite.dev/config/"
]

MDN is included because Start's Server Functions follow the standard Web APIs (Request/Response). Note that these are not Node.js-specific APIs.
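
The URL list above can also serve as a crawl allowlist, so the crawler never wanders outside the intended docs roots. The following is a minimal sketch under that assumption; `CRAWL_TARGETS` and `in_scope` are hypothetical names, and only a subset of the list is shown.

```python
from urllib.parse import urlsplit

# A subset of the docs roots listed above, used here as crawl scopes.
CRAWL_TARGETS = [
    "https://react.dev/",
    "https://tanstack.com/start/latest/docs",
    "https://tanstack.com/router/latest/docs",
    "https://zod.dev/",
    "https://vite.dev/guide/",
]


def in_scope(url: str, targets: list[str] = CRAWL_TARGETS) -> bool:
    """Return True if url is on the same host and under the path of any target."""
    candidate = urlsplit(url)
    for target in targets:
        root = urlsplit(target)
        if candidate.scheme != root.scheme or candidate.netloc != root.netloc:
            continue
        root_path = root.path.rstrip("/")
        # Accept the root itself or anything below it, but not sibling
        # paths that merely share a prefix (e.g. /docs-old vs /docs).
        if candidate.path == root_path or candidate.path.startswith(root_path + "/"):
            return True
    return False
```

Comparing host and path separately, rather than doing a plain string prefix match, keeps look-alike paths and other TanStack libraries on the same domain out of the crawl.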

⚙️ Declarative scraping configuration maintained by collective knowledge

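Every documentation site has a different HTML structure. Writing per-site Python scripts to parse each one and convert it to Markdown carries too high a maintenance cost.

So I will take the approach of "moving the definition of each site's document structure out into a JSON config file". The config files can then be managed on GitHub, and when a site's structure changes, only the JSON needs fixing. It is, so to speak, a system maintained by collective knowledge.

I will use chrome-devtools-mcp[^2] to analyze the real site structure and generate this JSON config file.

A sketch of the tool interface I need:

tools = [
  "search_document",  # search and return a list of matching documents
  "read_document",    # return the content of the given document as Markdown
  "recommend_docs"    # suggest documents to read based on the current context
]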
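
To make the three tool names concrete, here is a minimal in-memory sketch of what they might return. The `Doc` and `DocTools` classes, the substring search, and the word-overlap ranking are all illustrative assumptions. A real server would sit behind an MCP SDK and a proper index, neither of which is shown here.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    title: str
    markdown: str


class DocTools:
    """In-memory stand-in for the three tools (hypothetical, for illustration)."""

    def __init__(self, docs: list[Doc]):
        self._docs = {d.url: d for d in docs}

    def search_document(self, query: str) -> list[dict]:
        """Return url/title pairs whose title or body contains the query."""
        q = query.lower()
        return [
            {"url": d.url, "title": d.title}
            for d in self._docs.values()
            if q in d.title.lower() or q in d.markdown.lower()
        ]

    def read_document(self, url: str) -> str:
        """Return the stored Markdown for one document."""
        return self._docs[url].markdown

    def recommend_docs(self, context: str, limit: int = 3) -> list[dict]:
        """Rank documents by how many context words they contain (naive)."""
        words = {w for w in context.lower().split() if len(w) > 2}
        scored = []
        for d in self._docs.values():
            text = (d.title + " " + d.markdown).lower()
            score = sum(1 for w in words if w in text)
            if score:
                scored.append((score, d))
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [{"url": d.url, "title": d.title} for _, d in scored[:limit]]
```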

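To generate this config file, I need to design the prompt that I will hand to Claude Code.

🧐 Designing the config-generation prompts: separating Read and Search

Simply passing a URL and saying "analyze this" is not enough. read_document (extraction) and search_document (discovery) need different information.

1. For Read (extracting a single article): settings for cleanly pulling out the article body.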

{
  "extraction": {
    "title": "h1.text-3xl",
    "content": "main article",
    "exclude": [".carbon-ads", "button[aria-label='Edit page']"]
  }
}

2. For Search/Recommend (site-wide navigation): settings for parsing the sidebar and the links to the next page.

{
  "navigation": {
    "sidebar_items": "nav#sidebar ul li a",
    "pagination_next": "a.pagination-next"
  }
}

3. For Pagination (sequential reading): to have the documents read in order, I also need a rule for moving to the "next article". The config has to distinguish single-page sites such as Zod from multi-page sites such as TanStack.

{
  "pagination": {
    "type": "linear_next_button",
    "next_link": "a.nav-next"
  }
}
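
A minimal sketch of how a crawler might consume a `linear_next_button` rule. The `walk_linear` name and the `next_url_of` callback are hypothetical. In a real crawler, the callback would fetch the page and resolve the `next_link` selector, while here it is left abstract. A single-page site falls out naturally as a walk of length one.

```python
from typing import Callable, Optional


def walk_linear(
    start_url: str,
    next_url_of: Callable[[str], Optional[str]],
    max_pages: int = 500,
) -> list[str]:
    """Follow "next" links from start_url and return pages in reading order.

    Stops when there is no next link, when a page repeats (cycle guard),
    or when max_pages is reached.
    """
    order: list[str] = []
    seen: set[str] = set()
    url: Optional[str] = start_url
    while url is not None and url not in seen and len(order) < max_pages:
        seen.add(url)
        order.append(url)
        url = next_url_of(url)
    return order
```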

Having each of these analyzed by a separate prompt and then merging the results into a single JSON config file looks like the way to go.
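
The merge step can be sketched as follows, assuming each prompt returns a dict with exactly one top-level section (`extraction`, `navigation`, or `pagination`). `merge_site_config` is a hypothetical helper, not part of any existing tool.

```python
def merge_site_config(domain: str, *partials: dict) -> dict:
    """Combine per-prompt results into one site config.

    Raises ValueError if two partial results define the same section,
    since silently overwriting one of them would hide a prompt mistake.
    """
    merged: dict = {"domain": domain}
    for partial in partials:
        for section, value in partial.items():
            if section in merged:
                raise ValueError(f"duplicate section: {section}")
            merged[section] = value
    return merged
```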

🚧 Expected difficulties and countermeasures: selector fragility and incomplete extraction

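This approach has several pitfalls.

1. Selector fragility: class names tied to styling, such as Tailwind CSS utility classes or hashed class names generated at build time, change frequently. -> Countermeasure: hold the selectors in an array and try them from the top, a "fallback strategy".

"title": [
  "h1.text-3xl",       // Specific (accurate but fragile)
  "article h1",        // Semantic (structural)
  "h1"                 // Generic (last resort)
]

2. SPA/CSR support: a plain HTTP GET may return empty content. -> Countermeasure: fetch the DOM after browser rendering, for example with Playwright.

3. Risk of not getting the full text: a weakness of the fallback strategy is that it can match an element holding only part of the article and stop there. For example, I want the whole of main but end up with only the leading .intro.

-> Countermeasure A: a "subtractive" approach. Aim the content selector at as large a container as possible (main, article), then trim ads and navigation with exclude.

-> Countermeasure B: validate by content length. If the matched element has too little text (for example, 200 characters or fewer), treat it as not being the full text and try the next selector. This logic is implemented on the MCP server side.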
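
The fallback strategy combined with the length check from countermeasure B could look roughly like this. It is a sketch, not the server's actual code: `extract_with_fallback` is a hypothetical name, and `select_text` stands in for whatever DOM library resolves a CSS selector to text.

```python
from typing import Callable, Optional, Union


def extract_with_fallback(
    selectors: Union[str, list[str]],
    select_text: Callable[[str], Optional[str]],
    min_chars: int = 0,
) -> Optional[tuple[str, str]]:
    """Try selectors in order and return (selector, text) for the first good hit.

    A hit whose stripped text is min_chars long or shorter is treated as a
    partial match and skipped, so the next selector gets a chance.
    """
    if isinstance(selectors, str):
        selectors = [selectors]  # accept the single-string form of the config too
    for selector in selectors:
        text = select_text(selector)
        if text is None:
            continue  # selector matched nothing
        text = text.strip()
        if len(text) <= min_chars:
            continue  # too short to be the full article
        return selector, text
    return None
```

Returning the selector that actually matched makes it easy to log when a config has fallen back to a generic selector and needs updating.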

The final JSON config would look something like this.

{
  "domain": "tanstack.com",
  "extraction": {
    // Ordered by reliability, not from the widest (parent) container down
    // To avoid picking up a partial element by mistake, the implementation checks the text length
    "content": [
      "div[class^='DocSearch-content']",
      "article",
      "main"
    ],
    "exclude": [
      ".carbon-ads",
      "button[aria-label='Edit this page']",
      "nav",
      "footer"
    ]
  },
  // ...navigation settings
}
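
The config sketch above uses `//` comments, which strict JSON parsers reject. If the real config files keep such comments, the loader has to strip them first. Below is a minimal sketch of that step, where `strip_line_comments` and `load_config` are hypothetical helpers. It tracks string state so that the `//` inside URLs survives. It does not remove trailing commas, so a comma left dangling before a comment would still need fixing in the file.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that appear outside of JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False          # this char was escaped; take it literally
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line, keeping the newline itself.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_config(text: str) -> dict:
    """Parse a JSON config that may contain // comments."""
    return json.loads(strip_line_comments(text))
```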

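Good. That completes the design of a robust documentation search system built on a "collective knowledge config file". As a first step, I will generate a config file from the TanStack Router documentation.


[^1]: Model Context Protocol. A standard for connecting AI models to external tools (databases, APIs, browsers, and so on).
[^2]: A tool that makes the browser's DevTools operable through MCP. With it, an AI can actually visit a web page and analyze its DOM structure.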

TanStack-Start-開発ライブラリとドキュメント.md

Conversation history

User: I want to develop with TanStack Start. Please work out which libraries I will need, and then organize which documentation I should read.

AI's thought process

Refining Initial Approach

I'm now zeroing in on the best way to introduce TanStack Start development. Initially, I needed to pin down the core goal - building an application with this framework. I'm focusing on defining the key areas to address and how to best explain them to a user who is new to the development environment.

Mapping Ecosystem Dependencies

I've been mapping the critical dependencies within the TanStack Start ecosystem. Now, I'm focusing on clarifying which libraries are essential and which are merely recommended, particularly for different parts of the application. Highlighting the type-safe nature of libraries such as TanStack Query and Form will be vital. The choice between ORMs is another point of consideration, as well as outlining the role of Tailwind CSS and popular UI libraries.

Prioritizing Library Grouping

I'm now zeroing in on a structured library breakdown, emphasizing the "must-have" core libraries for TanStack Start. The aim is to showcase how TanStack Router forms the framework's foundation, and to highlight essential components such as TanStack Query for handling asynchronous state. I'm prioritizing the presentation of these core libraries before diving into recommended "T3 Stack" equivalent libraries like TanStack Form or Tailwind.

Organizing Core Documentation

I'm presently structuring the documentation guide, prioritizing TanStack Router's docs as the critical starting point. Following that, I plan to cover TanStack Start's basics and server functions. I'm also considering highlighting crucial concepts, such as file-based routing and type-safe server functions. My target is to explain the essential concepts in a structured and professional tone, while also mentioning the still-emerging nature of TanStack Start.

Concluding Project Structure

I'm now putting the finishing touches on the project structure. This entails producing a concise summary, categorizing the libraries by core functionality, UI, and backend. I'll include links to the official documentation and develop a step-by-step learning path. I'm also ensuring that Vite's foundational role is clearly emphasized throughout the guide. Lastly, I'm highlighting the essential aspects of the framework: the router as the central component, server functions as the backend glue, and making sure to cover the underlying use of Vite. I will then translate all of this into Japanese.

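AI: So you want to develop with TanStack Start (hereafter "Start"). Great choice! Start is a full-stack React framework that puts **type safety** first. It competes with Next.js and Remix, but because it is built on the TanStack ecosystem (Router, Query, and so on), it integrates very closely with those libraries.

To help you get started, I have organized the "required and recommended libraries" and the "documentation to read".

1. Choosing the tech stack and libraries

TanStack Start is designed to be used in combination with the TanStack family rather than on its own.

A. Required libraries (Core)

These are central to how TanStack Start works, so you will almost always use them.

TanStack Start: the framework itself. Provides server-side rendering (SSR) and Server Functions.
TanStack Router: [Most important] The foundation of Start. Handles routing, data loading (Loaders), and search-parameter management. Start is effectively "TanStack Router with server features".
Vite: the build tool. Start runs as a Vite plugin.

B. Strongly recommended libraries (Standard Stack)

If you want to follow the TanStack philosophy (type safety), the following combination is the best practice.

TanStack Query (React Query):
- Role: fetching and caching asynchronous data, and synchronizing server state.
- Reason: it is integrated with Router's loaders and is the smoothest way to handle data in Start.
TanStack Form:
- Role: form management and validation.
- Reason: it is strongly typed and makes submitting data (Action) very easy when combined with Server Functions. (React Hook Form also works, but TanStack Form fits Start better.)
Zod (or Valibot):
- Role: schema validation.
- Reason: it is needed for validating API input and output, validating forms, and typing Router's search params.

C. Backend and UI (recommended)

Drizzle ORM:
- Role: database access.
- Reason: it is lighter than Prisma, fits TypeScript extremely well, and is popular among TanStack users.
Tailwind CSS:
- Role: styling.
- Reason: it is the de facto standard in modern React development and is easy to set up.

2. Documentation to read and learning order

TanStack Start is still new (parts of it are in beta), so you will need to move between several sets of documentation. Reading in the following order makes it easier to follow.

Step 1: Understand TanStack Router (top priority)

Mastering Start is 80% about understanding Router.

Documentation: TanStack Router Documentation
What to look at:
- File-Based Routing: how to define routes through file names.
- Loaders: the mechanism for fetching data before a page is displayed.
- Search Params: how to manage URL query parameters in a type-safe way with Zod.

Step 2: Set up TanStack Start and learn its server features

Documentation: TanStack Start Documentation
What to look at:
- Server Functions (createServerFn): how to write backend logic as functions without creating API routes (a concept close to Remix's Action/Loader).
- Deployment: configuration for deployment targets (Vercel, Netlify, Node.js, Cloudflare, and so on).

Step 3: Data fetching and form handling

TanStack Query: basic usage such as useSuspenseQuery.
TanStack Form: the flow of validation and submission.

3. Development flow (summary)

When you actually start developing, picture the following structure.

Create the project: npm create @tanstack/start@latest (or use an official starter template)
Create routes: create files under the app/routes/ directory (src/routes/ in newer templates), following TanStack Router's conventions.
Backend logic: use createServerFn to define functions that read data from and write data to the database (Drizzle).
- Key point: you do not need to create API endpoints (such as /api/users) by hand.
Connect the frontend: call the Server Functions you created directly from a Router loader or a Form onSubmit.

Advice

TanStack Start is under rapid development, so reading the official GitHub examples is often a faster route than the documentation alone.

GitHub Examples: TanStack Start Examples
- In particular, the basic and hackernews examples show the directory structure and how to write Server Functions.

Start by getting a firm grip on TanStack Router's concepts, especially Loaders and type-safe routing.

(Response end reason: STOP)

User: No, what I want to know is the documentation domains of each library.

I want to build an MCP server for documentation search. To do that, I want to build a feature that accesses each set of documentation and extracts its text. I want the list of documentation for that purpose.

AI's thought process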

Considering Data Domains

I've been focusing on the domain aspect of the MCP server, specifically gathering URLs for document searching. Currently, my thought process revolves around identifying and categorizing suitable data domains. I'm exploring different methods to source these URLs, focusing on relevance and coverage.

Clarifying Library Documentation

I'm now zeroing in on the specific documentation URLs required for the core TanStack libraries. I've re-evaluated my previous direction. My immediate task is gathering the accurate URLs for Start, Router, Query, and Form from the official tanstack.com domain. These URLs are crucial for text extraction, the next step, I will determine the documentation URLs.

Refining Documentation URLs

I'm making progress identifying documentation URLs for essential libraries. I've gathered base URLs, focusing on the /docs paths for optimal content extraction. I've added Vite, Zod, Drizzle ORM, and Tailwind CSS to the list. For TanStack, I'm noting the shared documentation structure, starting with Start.

Prioritizing URL Acquisition

I've just solidified my documentation URL list, focusing on key TanStack libraries. I've specifically targeted the /latest/docs structure to ensure the most current content. I am now adding TanStack Query. I'm also planning to incorporate a note regarding TanStack's use of versioning, which will be essential for the scraper to ensure it gets the latest documentation.

Gathering Final Documentation

I'm wrapping up the final documentation URL acquisition. I've added TanStack Form to the list, completing the TanStack core libraries. Vite, Zod, and Drizzle ORM are also included. I'm now focusing on a Markdown list format to present the documentation URLs. Next, I'll translate the list into Japanese.

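AI: Understood. You want **the list of documentation URLs to crawl and access** for text extraction and search by the MCP server.

I have organized the "base URLs" of the documentation that is referenced most often in TanStack Start development and is worth extracting. Use them to define the scope of your scraping.

1. TanStack Core (required)

These are the backbone of development and should be indexed with the highest priority. TanStack documentation URLs include a version segment (latest) in the path, so I recommend pinning it when crawling.

| Library | Docs base URL (Docs Root) | Notes |
| --- | --- | --- |
| TanStack Start | `https://tanstack.com/start/latest/docs` | The framework itself, server features |
| TanStack Router | `https://tanstack.com/router/latest/docs` | Routing, Loaders, type safety |

2. TanStack Ecosystem (standard stack)

Official libraries that are almost always adopted together with Start.

| Library | Docs base URL | Notes |
| --- | --- | --- |
| TanStack Query | `https://tanstack.com/query/latest/docs` | Async data management (React Query) |
| TanStack Form | `https://tanstack.com/form/latest/docs` | Form management, validation |

3. Surrounding tools (recommended stack)

Libraries that appear frequently in TanStack Start's recommended setup and documentation.

| Library | Docs base URL | Notes |
| --- | --- | --- |
| Vite | `https://vite.dev/guide/` | Build configuration, plugin API |
| Zod | `https://zod.dev/` | Schema validation (needed for defining URL parameters) |

4. Others (add as needed)

Depending on the project, adding the following backend and UI libraries to the search targets improves code-generation accuracy.

| Library | Docs base URL | Notes |
| --- | --- | --- |
| Drizzle ORM | `https://orm.drizzle.team/docs` | Database access (popular with Start users) |
| Tailwind CSS | `https://tailwindcss.com/docs` | Styling |

Hints for scraping and extraction

Some technical notes for building the MCP server.

Structure of the TanStack docs:
- All TanStack docs share the same structure.
- React-specific information is gathered under framework/react, as in https://tanstack.com/[LIBRARY]/latest/docs/framework/react/... Concentrating on this path reduces noise.
Structure of Zod:
- zod.dev packs its information into a single page (or very few pages). Its structure resembles README.md converted directly to HTML, so it may need to be fetched as a single page.
Structure of Vite:
- vite.dev is split into /guide/ (usage) and /config/ (configuration reference). If you also want to cover configuration errors, include /config/ as a target.

Set this list as the allowlist or crawl_targets of your MCP tool.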
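
As an illustration of the `framework/react` hint, a crawler could rank React-specific TanStack pages ahead of the rest. This is a sketch that assumes the URL shape `/<library>/<version>/docs/framework/react/...` described above. `is_react_doc` and `prioritize_react` are hypothetical helper names.

```python
from urllib.parse import urlsplit


def is_react_doc(url: str) -> bool:
    """True for tanstack.com pages under <library>/<version>/docs/framework/react/."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return (
        parts.netloc == "tanstack.com"
        and len(segments) >= 5
        and segments[2:5] == ["docs", "framework", "react"]
    )


def prioritize_react(urls: list[str]) -> list[str]:
    """Stable sort that places React-specific pages first."""
    return sorted(urls, key=lambda u: not is_react_doc(u))
```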

(Response end reason: STOP)

User: Hmm. Please include React and TypeScript as well.

AI's thought process

Identifying Key Domains

I'm currently focused on pinpointing the essential documentation domains needed for the initial document search and text extraction functionality. I've compiled a preliminary list and am now cross-referencing to ensure completeness and relevance to the TanStack Start framework. My aim is to be thorough in identifying all pertinent resources.

Refining Domain Scope

I've incorporated the crucial feedback on missing elements, adding React and TypeScript to my indexing focus. My attention is now on the optimal documentation URLs for React and TypeScript, selecting key paths within their official docs to maximize efficiency for the MCP server. I'm prioritizing direct relevance to TanStack Start's architecture and the project's foundational technologies. My goal is to streamline the document search.

Incorporating TypeScript Details

I realized my initial focus on React and TypeScript documentation overlooked crucial details about the importance of TypeScript's type system to TanStack Start. I've added specific paths like the TypeScript handbook, focusing on Concepts and Configuration. I also thought about including MDN, but given that generic JS/web API knowledge is usually presumed, I've kept it as optional.

Generating Focused URLs

I've finalized the core URLs for React and TypeScript, prioritizing /reference for React API details and the root TypeScript /docs for comprehensiveness. MDN's Web API docs are included as an optional resource. I've also established the best URLs for initial crawling. My current work now centers on organizing these resources into distinct groups based on relevance, crafting a clear format for presenting the final output to effectively serve the target user persona building an MCP server.

Clarifying Domain Requirements

I'm now revising my initial domain list, ensuring it directly addresses the request's core. Given the existing structure, I've simplified the URL set to focus exclusively on React and TypeScript documentation, streamlining the content for clarity and direct usability. For React, I'm using https://react.dev/ as the primary domain, and for TS, I've selected https://www.typescriptlang.org/docs/. This minimizes extraneous data while providing a straightforward, focused resource set. I recognize the value of MDN, especially for Web Standards, but will mention it as an extra.

AI: Understood. Here is **the complete documentation list for a TanStack Start development MCP server**. It covers the foundations, **React** and **TypeScript**, and also includes MDN, because TanStack Start follows the standard Web APIs (Request/Response).

Set these as your scraping targets.

1. Foundation

The language and library underneath the framework. React is essential for understanding Server Components and Suspense, and TypeScript for understanding advanced type definitions.

| Library / language | Docs base URL | What to extract |
| --- | --- | --- |
| React | `https://react.dev/` | `/reference/` (Hooks, Components API) is especially important. TanStack Start makes heavy use of the `use` hook and Suspense. |
| TypeScript | `https://www.typescriptlang.org/docs/` | The handbook and reference. `tsconfig` options are also search targets. |
| MDN Web Docs | `https://developer.mozilla.org/ja/docs/Web/API` | Important: Start's Server Functions use the standard `Request` / `Response` objects (not Node.js-specific APIs). Needed for checking the Fetch API specification. |

2. TanStack Core (required framework)

The documentation at the center of development.

| Library | Docs base URL |
| --- | --- |
| TanStack Start | `https://tanstack.com/start/latest/docs` |
| TanStack Router | `https://tanstack.com/router/latest/docs` |

3. TanStack Ecosystem (recommended libraries)

Standard libraries used together with Start.

| Library | Docs base URL |
| --- | --- |
| TanStack Query | `https://tanstack.com/query/latest/docs` |
| TanStack Form | `https://tanstack.com/form/latest/docs` |

4. Utilities and build (required tools)

Validation and build configuration.

| Library | Docs base URL |
| --- | --- |
| Zod | `https://zod.dev/` |
| Vite | `https://vite.dev/guide/`, `https://vite.dev/config/` (configuration reference) |

[Convenience] List for JSON configuration

The URL list as an array, ready to paste into the MCP server's config file (config.json or similar).

[
  "https://react.dev/",
  "https://www.typescriptlang.org/docs/",
  "https://developer.mozilla.org/ja/docs/Web/API",
  "https://tanstack.com/start/latest/docs",
  "https://tanstack.com/router/latest/docs",
  "https://tanstack.com/query/latest/docs",
  "https://tanstack.com/form/latest/docs",
  "https://zod.dev/",
  "https://vite.dev/guide/",
  "https://vite.dev/config/"
]

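Crawling and indexing this list gives you a knowledge base that can answer almost any TanStack Start development question about type definitions, component implementation, server communication, and build errors.

(Response end reason: STOP)

User: I want to use chrome-devtools-mcp to get the information needed to turn the docs into Markdown. Please come up with a prompt to hand to Claude Code or a similar tool.

The tools I need are the following three.

# Sketch of the ideal interface
tools = [
  "search_document",  # search and return a list of matching documents
  "read_document",    # return the content of the given document as Markdown
  "recommend_docs"    # suggest documents to read based on the current context
]

The concrete context is roughly as follows.

From code to configuration: JSON that encodes collective knowledge
How can we build a system that stays transparent while being maintained by collective knowledge? The thinking shifted from implementation in imperative code to definition through declarative configuration.

If the logic for parsing each documentation site's HTML structure and converting it to Markdown is written individually, for example as Python scripts, maintenance costs balloon and security concerns remain. The idea that emerged is to "move the definition of the document structure out into a JSON config file".

Concretely, the following information is defined in JSON for each library's documentation.

Target URL
XPath or CSS selectors for extracting the title and body
Rules for removing unwanted elements (navigation, ads, and so on)
How to follow pagination
What makes this approach novel is that users without programming skills can add or fix documentation sources as long as they understand CSS selectors. With a process for accepting Pull Requests against the config files on GitHub, the community can fix things immediately when a particular set of docs goes stale or changes structure. The system thereby evolves from an "opaque service" into a "collection of transparent rules".

AI's thought process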

Formulating the Core Prompt

I've been drilling down on the fundamental goal – crafting a prompt that will yield a valid JSON configuration. The primary focus is on how to communicate with Claude Code so it understands the request to construct a JSON file. Right now, I'm thinking about the most effective language to convey the specific structure and data required for scraping documentation, keeping in mind the MCP server context. I'm focusing on simplicity and clarity so that Claude doesn't get confused!

Refining the JSON Schema

I'm now focusing on a declarative JSON schema to drive content extraction. The core shift is from imperative code to JSON config, which outlines URLs, selectors, and cleanup steps for document scraping. My focus is on supporting React, TypeScript, MDN, TanStack (various parts), Zod, and Vite documentation sites. I'm also considering chrome-devtools-mcp as a potential tool to identify the correct CSS/XPath selectors.

Evolving the Prompt Structure

I'm now focusing on a more sophisticated prompt strategy. The aim is to make Claude Code the "Selector Engineer." It needs to analyze HTML, build a structured JSON config, and understand its purpose. I've drafted a JSON schema with "domain," "entry_point," "selectors," and "exclude" fields, and I'm refining the instructions for its role. I'm striving for a modular design, so I can adapt easily to other document scraping scenarios.

Optimizing the Process

My focus is now on streamlining the entire process, specifically the prompt design. I'm breaking down the task into smaller, manageable phases: defining the task, specifying target websites from the previous list, ensuring the use of chrome-devtools-mcp for DOM inspection, and clearly defining the desired JSON output. I'm also considering an iterative approach, either site-by-site or in manageable groups, to optimize efficiency. Furthermore, I'm addressing the nuances of the JSON format, ensuring it supports markdown conversion and handles element removal as specified.

Fine-tuning JSON Structure

I'm now zeroing in on the finer points of the JSON schema, particularly the selectors. I'm making sure it includes content_selector, which explicitly defines what to keep in the extracted content, and ignore_selectors, which specifies what to remove from within that content to ensure clean Markdown output. I'm focusing on providing very clear instructions so Claude understands how to use the browser tools to verify and refine these selectors.

Defining Output Format

The final piece of the prompt is now ready. I've polished the tone and integrated the "Collective Intelligence" context. Also, I've specifically instructed Claude to note the structural similarities of the TanStack sites to save time. I've now clarified the final output structure, which will include the prompt itself, followed by an explanation of its functionality. I've also added a JSON structure that makes the root_url and other associated elements clear.

Clarifying Selector Refinement

Now, I'm focusing on the "rules" field within the JSON output. I've realized I need to be more explicit with the selectors, ensuring main_content, title, and exclude are very precise. The example given provides a clear illustration, but I need to make sure the instructions are comprehensive enough so Claude understands what kind of values those fields should hold.

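AI: That architecture, "declarative scraping driven by a collective-knowledge config file (JSON)", is a very robust and maintainable design.

I have written a prompt for Claude Code (and chrome-devtools-mcp) that makes it **analyze the real site structure and generate that JSON config file**.

The prompt does more than pass URLs. It instructs the agent to **actually visit each page in a browser, verify the DOM, and identify the best CSS selectors**.

Prompt for Claude Code

Copy the following text and hand it to Claude Code.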

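````markdown
# Goal
I want to create a "configuration file (JSON)" for crawling web documentation and saving it as Markdown.
As a DOM analysis specialist, use `chrome-devtools-mcp` (or any available browser tool) to actually visit the specified documentation sites and identify the best CSS selectors for content extraction.

# Background: declarative configuration through collective knowledge
We are aiming for declarative definitions in JSON config files rather than imperative scraping code.
By defining the document structure (title, body, noise to remove) in JSON, we are building a system that can cope with site structure changes by updating only the JSON.

# Target documentation
Analyze the documentation for the following tech stack.

1. **React**: https://react.dev/reference/react
2. **TypeScript**: https://www.typescriptlang.org/docs/handbook/intro.html
3. **MDN Web Docs (Web API)**: https://developer.mozilla.org/ja/docs/Web/API
4. **TanStack Start**: https://tanstack.com/start/latest/docs/framework/react/overview
5. **TanStack Router**: https://tanstack.com/router/latest/docs/framework/react/overview
6. **TanStack Query**: https://tanstack.com/query/latest/docs/framework/react/overview
7. **TanStack Form**: https://tanstack.com/form/latest/docs/framework/react/overview
8. **Zod**: https://zod.dev/
9. **Vite**: https://vite.dev/guide/

# Deliverable (JSON format requirements)
Output a single JSON document containing the settings for all sites, following the JSON schema below.

```json
[
  {
    "name": "Library name (e.g. TanStack Start)",
    "base_url": "Root URL of the documentation",
    "match_pattern": "URL pattern this config applies to (e.g. https://tanstack.com/start/**)",
    "selectors": {
      "title": "Unique selector for the page title (e.g. h1)",
      "content": "Selector for the wrapper element around the whole body (e.g. article, main .content)",
      "exclude": [
        "Array of selectors for elements inside the body that should be removed",
        "Sidebar",
        "Ads",
        "Breadcrumbs",
        "In-page navigation (On this page, etc.)"
      ]
    },
    "pagination": {
      "next_link": "Selector for the link to the next page (if present)"
    }
  }
]
```

# Procedure

1. Actually visit each of the specified URLs.
2. Using DevTools or similar, find the **most robust and general CSS selectors** for identifying the following elements.
   - content: the body area, such as <main> or <article>, excluding the header and footer.
   - exclude: ads mixed into the body area (.carbon-ads, etc.), mobile menus, unneeded decoration in API type definitions, and so on.
3. The TanStack sites may have similar structures, but check each one individually and reflect any differences.
4. Check whether each site, such as Zod, keeps everything on one page or uses multiple pages, and decide whether a pagination setting is needed.
5. Generate the final JSON block.
````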


---

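### Intent and key points of this prompt

1.  **Forcing tool use (`chrome-devtools-mcp`)**:
    *   Rather than answering from its knowledge base alone, the agent is forced to "actually visit the page and look at the DOM". Even when class names have changed, or when the markup consists mostly of hard-to-target utility classes like Tailwind's (`div.flex.p-4`, etc.), this pushes it to look for semantic selectors such as `article` or `main[role="main"]`.

2.  **Emphasis on `exclude` (removal rules)**:
    *   When an LLM reads documentation, the worst enemy is noise such as navigation menus, ads, and "Was this article helpful?" buttons. Including settings in the JSON that remove these at the CSS selector level greatly improves the quality of the later Markdown conversion.

3.  **Similarity across the TanStack family**:
    *   The TanStack sites very likely use the same documentation generator, but class names may differ slightly. Deliberately instructing the agent to "check each one individually" lets the AI decide which parts can be shared and which cannot.

4.  **Specifying the JSON structure**:
    *   I defined in advance a schema that is easy for the "MCP server" you want to build to read. The generated JSON can therefore be saved as a config file as-is.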
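
On the consuming side, the `match_pattern` field could be used to pick the config for a given URL. The prompt does not define the pattern semantics, so this sketch assumes shell-style globs and uses the standard library's `fnmatchcase`, where `*` (and therefore `**`) also matches across `/`. `find_config` is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from typing import Optional


def find_config(url: str, configs: list[dict]) -> Optional[dict]:
    """Return the first config whose match_pattern glob matches url, else None."""
    for config in configs:
        if fnmatchcase(url, config["match_pattern"]):
            return config
    return None
```

Because the first match wins, more specific patterns should be listed before broader ones.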

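(Response end reason: STOP)

---

User:
Ah, I want a prompt where, once I specify a URL, it works out how the required tools should fetch their information. Rather than fetching everything at once, the idea is that I pass in the domain of any documentation and get the information for that site.

The prompt for search and the prompt for converting a specific document to Markdown are probably separate.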

---

<details>
<summary>AI's thought process</summary>

**Examining the Core Task**

I've been drilling down on the fundamental objective: crafting precise prompts for Claude Code. My focus is now on how best to utilize `chrome-devtools-mcp` to retrieve config data from document processing systems given a URL. I'm exploring different prompt strategies to ensure comprehensive data extraction.


**Developing Config Extraction Strategy**

I'm now focusing on generating the "Collective Intelligence JSON" config. The challenge is clear: turning a URL into structured data. My current thinking revolves around two distinct prompt types. Type A focuses on site navigation, identifying pages. Type B is all about extracting title and body, filtering out irrelevant content. The user's input highlights two distinct phases, Search/Discovery and Single Document config, that inform my prompt strategy.


**Formulating a Master Prompt**

I'm now zeroing in on a "Master Prompt" approach. Instead of separate modules, the idea is to handle a specific domain/URL with a single prompt. My focus is on delivering a streamlined process for the user. I've been considering this new structure, and how best to represent it to the user. This master prompt will need to cover title selection, main content identification, exclusion of irrelevant elements (ads, navigation), and potential navigation rules, if needed. I'm focusing on simplicity for the user.


**Designing Unified Prompts**

I'm now integrating the "Analyzer" prompt (Scenario 1) with the "Search/Navigation" aspect (Scenario 2). My focus is to create a single, versatile prompt that handles both data extraction and site traversal. I'm exploring how to embed logic for identifying search features (e.g., Algolia boxes) and crawling mechanisms (sitemaps, sidebars) within the same prompt. The goal is to provide a seamless, unified solution. The user will get a single prompt capable of addressing all needs.


**Synthesizing Prompt Architecture**

I'm now integrating the "Config Generator" and "Search Logic Analyzer" prompts into a cohesive system.  My focus is on creating a two-prompt workflow. The first handles immediate content extraction, while the second dynamically analyzes the site for search features or alternative traversal methods.  I'm exploring how best to leverage `search_document` and `read_document` functionalities for a smooth user experience.


**Defining Config Output Strategy**

I'm now focusing on the final output format. I'm prioritizing clarity and completeness, and moving towards defining the JSON schema that will act as the contract for the output. The current thought is to generate a comprehensive prompt, specifically tailored for scenarios requiring site navigation like this case, ensuring the prompt will handle potential fallback situations. The output must reliably get the sidebar link selector, along with code samples and the reason the design pattern is preferred.


**Refining JSON Output**

I'm now zeroing in on the JSON structure for the final output. The goal is to accommodate both scenarios: direct content extraction and site navigation. My plan is to expand the JSON to include separate sections for "selectors" (for direct content) and "navigation" (for finding related documents). This will ensure a robust and flexible output, catering to diverse use cases. I'm focusing on simplicity and clarity within this JSON schema.
</details>
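AI:
I see. You want a prompt that acts as a "Config Generator": **given any URL, it produces the settings (JSON) for scraping that site**.

As you guessed, the aspects to analyze differ according to the role of each MCP tool (`search_document` and `read_document`).

1.  **For Read (extraction)**: structural analysis of a single article's "title", "body", and "noise removal"
2.  **For Search (discovery)**: structural analysis of the whole site's "sidebar structure", "next page", and "sitemap"

I have written two specialized prompts for having Claude Code carry these out.

---

### Premise: the schema of the output JSON (the shared language)

Claude Code is asked to aim for this format.

```json
{
  "domain": "Target domain",
  "extraction": {
    "title": "CSS selector for the title element",
    "content": "CSS selector for the body wrapper",
    "exclude": ["Array of selectors for elements to remove"]
  },
  "navigation": {
    "sidebar_items": "Selector for the sidebar link elements",
    "pagination_next": "Selector for the Next button"
  }
}
```

### Prompt 1: generating the config for `read_document`
**(Structural analysis of a single article)**

Pass in one article page (e.g. https://tanstack.com/start/.../overview) and have the agent produce the settings for cleanly extracting the body from it.

**Input for Claude Code:**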

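````markdown
# Request: analyze the document structure and generate extraction rules

Target URL: [insert the URL to analyze here]

You are an expert in DOM analysis. Visit the URL above using `chrome-devtools-mcp` and define the best "extraction rules" for converting its content to Markdown.

## Analysis procedure
1. Open the specified URL and inspect the page's DOM structure.
2. Find the **most robust CSS selectors** for identifying the elements below.
   - Prefer attributes that survive structural changes, such as `article`, `main`, `h1`, `id` attributes, and `data-testid`, over unstable class names such as Tailwind utility classes.

## Items to investigate
1. **Title**: the article title (h1).
2. **Content**: the wrapper element enclosing only the "article body", excluding the header, footer, and sidebar.
3. **Exclude**: selectors for elements that are inside `Content` but are not needed in the Markdown (ads, "Edit this page" buttons, the table of contents (ToC), and so on).

## Output format
Output only the following JSON. No explanation is needed.

```json
{
  "extraction": {
    "title": "string (e.g. h1.text-3xl)",
    "content": "string (e.g. main article)",
    "exclude": [
      "string (e.g. .carbon-ads)",
      "string (e.g. button[aria-label='Edit page'])"
    ]
  }
}
```
````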


---

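### Prompt 2: generating the config for `search_document` / `recommend_docs`
**(Site-wide navigation analysis)**

Documentation sites almost always have a "sidebar" or "breadcrumbs" that link to all of the documents. Analyzing these makes search and crawling possible.

**Input for Claude Code:**
````markdown
# Request: analyze the documentation navigation structure

Target URL: [insert the URL to analyze here]

You are a web crawler designer. For the documentation site above, analyze the "navigation structure" needed to list the documents and to read through them in order.

## Analysis procedure
1. Visit the specified URL.
2. Identify the structure of links to other documents within the site.

## Items to investigate
1. **Sidebar Items**: a selector that identifies the individual link tags (`a` tags) in the documentation menu on the left (or at the top).
   - Target the clickable article links, not the category headings.
2. **Next Page**: a selector for the link at the bottom of the article that leads to the next page.

## Output format
Output only the following JSON. No explanation is needed.

```json
{
  "navigation": {
    "sidebar_items": "string (e.g. nav#sidebar ul li a)",
    "pagination_next": "string (e.g. a.pagination-next)"
  }
}
```
````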


---

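### Usage (workflow sketch)

The development flow looks like this.

1.  You decide "I want to add the TanStack Start documentation".
2.  Send `Prompt 1` + `https://tanstack.com/start/...` to **Claude Code**.
    *   → The `extraction` settings (JSON) come back.
3.  Send `Prompt 2` + `the same URL` to **Claude Code**.
    *   → The `navigation` settings (JSON) come back.
4.  Merge the returned JSON and append it to your "collective knowledge config file (config.json)".

#### Why split them?
*   **Better accuracy**: asking for "body extraction" and "link discovery" at the same time can overflow the context window or make the DOM analysis sloppy.
*   **Different uses**: implementing the `search_document` tool needs only the `navigation` information, and the `extraction` information is needed only once the user actually says "read this".

This approach also adapts easily to future changes such as "search is delegated to the Google Custom Search API, so the `navigation` settings are no longer needed".

(Response end reason: STOP)

---

User:
I may also need information on how to handle pagination.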

---

<details>
<summary>AI's thought process</summary>

**Refining the Approach**

I'm zeroing in on the best prompt strategy. Analyzing the request, I see the core need is extracting *pagination* details from docs, for Collective Intelligence setup. Now, I'm considering how to phrase the prompt so it's clear and effective, without being too verbose. I'm focusing on the user's specific goal and making sure the prompt reflects that concisely.


**Mapping Pagination Strategies**

I've been examining different pagination types prevalent in documentation, focusing on how a Chrome DevTools MCP server and LLM can crawl the content. I've narrowed it down to four main approaches: linear "next/previous" flow, list-based numbered pagination, infinite scrolling (unlikely), and most commonly sidebar navigation. The goal is to define specific prompts tailored to each to ensure full content extraction.


**Pinpointing Crawler Configuration**

I'm now identifying the crucial pieces of information the crawler needs: the "Next" button's CSS selector, the link's attribute (often `href`), and the overall pagination *strategy* (linear or list-based).  I'm drafting prompts, starting with a basic attempt and then getting increasingly specific, to cover various scenarios.  I'm iterating toward a robust "Config Generator" approach.


**Determining the Navigation Type**

I'm now prioritizing the critical distinction between "Next page in a list" and "Next chapter in a guide," essential for accurate content extraction.  I'm focusing on crafting prompts that can reliably identify the "Next" button's function and whether it's part of a pagination sequence or linear document flow. This helps determine the correct crawler configuration. Furthermore, I've defined the necessary JSON output format for this crucial "pagination" information.


**Integrating Single-Page Detection**

I've realized a simple "next" button prompt isn't sufficient. Now I'm integrating single-page detection, as sites like Zod have all content on one page, while others, like TanStack, are multi-page. This requires the LLM to analyze the structure and determine if it's SPA/anchor-based or multi-page, for correct crawler configuration. I'm adding a prompt to verify the existence of a "Next" button too.
</details>
AI:
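You're right. Especially when a crawler reads the documentation in order, a config that pinpoints the "link to the next page" (pagination) is essential.

Documentation sites mainly use two paging patterns:
1.  **Linear Navigation (sequential)**: The end of the article links to the next chapter, e.g. "Next: Introduction" (most documentation works this way).
2.  **List Pagination**: A listing such as a blog index shows "1, 2, 3... Next" (seen on changelogs and the like).

Here is a prompt that tells them apart and generates the config.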

---

### Prompt 3: `pagination` (sequential reading) analysis

This prompt extracts the rule for **"once this article has been read, where should the crawler go next?"**

**Input to Claude Code:**
````markdown
# Request: Analyze the documentation's pagination (sequential navigation) structure

Target URL: [insert the URL to analyze here]

You are a web crawler designer.
Analyze the navigation elements that logically lead to the "next chapter" or "next page" after the current document has been read.

## Procedure
1. Open the given URL and inspect the DOM, focusing on the area **near the bottom** of the page.
2. Look for link elements that carry a "Next" / "Previous" context.
   - The target is the footer navigation **right after the article body**, not the sidebar.

## What to investigate
1. **Next Link Selector**: A CSS selector for the `a` tag that leads to the next article.
   - Specify the `a` tag that has the `href` itself, not a wrapper `div`.
2. **Title Attribute**: If an element inside that link contains the next article's title, its selector (e.g. `span.title`).
3. **Strategy**: Decide whether the site changes URL per page (`multi_page`) or is self-contained on one page with only scroll/anchor navigation (`single_page`).
   - Hint: Zod and the like are `single_page`; TanStack Start and the like are `multi_page`.

## Output format
Output only the following JSON. No explanation is needed.

```json
{
  "pagination": {
    "type": "string (e.g. 'linear_next_button' | 'single_page_scroll')",
    "next_link": "string (e.g. a.pagination-nav__link--next)",
    "prev_link": "string (e.g. a.pagination-nav__link--prev)",
    "next_title_selector": "string (e.g. .pagination-nav__label)"
  }
}
```
````


---

### What the complete, merged config file looks like

Running these three prompts (extraction, navigation, pagination) gives your MCP server a JSON config like the one below. With this much information, search and answers can be considerably more precise.

**Example: finished config for TanStack Start (illustrative)**

```jsonc
{
  "domain": "tanstack.com",
  "match_pattern": "https://tanstack.com/start/**",

  // Result of Prompt 1 (Read)
  "extraction": {
    "title": "h1",
    "content": "div[class^='DocSearch-content']",
    "exclude": [".carbon-ads", "button[aria-label='Edit this page']", "nav.on-this-page"]
  },

  // Result of Prompt 2 (Search/Recommend)
  "navigation": {
    "sidebar_root": "nav#sidebar",
    "sidebar_items": "nav#sidebar ul li a"
  },

  // Result of Prompt 3 (Pagination)
  "pagination": {
    "type": "linear_next_button",
    "next_link": "a.nav-next",
    "prev_link": "a.nav-prev"
  }
}
```

### Development tips

- **For Zod (`single_page`)**: Running Prompt 3 should report that no Next link was found. In that case the MCP server needs logic that treats `navigation.sidebar_items` (the sidebar links) as the effective pagination (anchor links).
- **Using chrome-devtools-mcp**: Adding an instruction to Claude Code such as "The DOM structure may change, so always run `document.querySelectorAll` and confirm the selector matches exactly one element before writing the JSON" makes the result more reliable. (`document.querySelector` returns only the first match, so it cannot confirm uniqueness by itself.)
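
The single-page case can be detected from the sidebar links themselves. The following is a minimal sketch under stated assumptions: the function names `detect_strategy` and `section_anchors` are hypothetical, and the sidebar `href` values are assumed to have been collected already via `navigation.sidebar_items`.

```python
from urllib.parse import urldefrag, urljoin


def detect_strategy(page_url: str, sidebar_hrefs: list[str]) -> str:
    """Return "single_page" when every sidebar link points back into page_url
    (only the #fragment differs), otherwise "multi_page".

    An empty sidebar is reported as "multi_page" because nothing can be inferred.
    """
    base = urldefrag(page_url)[0]
    targets = {urldefrag(urljoin(page_url, href))[0] for href in sidebar_hrefs}
    return "single_page" if targets == {base} else "multi_page"


def section_anchors(page_url: str, sidebar_hrefs: list[str]) -> list[str]:
    """For a single-page site, list the fragment ids in sidebar order so the
    server can walk sections the way it would walk pages."""
    anchors: list[str] = []
    for href in sidebar_hrefs:
        fragment = urldefrag(urljoin(page_url, href))[1]
        if fragment and fragment not in anchors:
            anchors.append(fragment)
    return anchors
```

With this, the server can skip the Next-link lookup entirely when `detect_strategy` reports `single_page` and iterate over `section_anchors` instead.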

(Response finish reason: STOP)

User: Please list the difficulties that can be anticipated.

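<details>
<summary>AI's thought process</summary>

The user is asking me to "list the difficulties that can be anticipated". The context so far: they want to build an MCP server (with Markdown conversion) for searching documentation, to support TanStack Start development. The mechanism is "a JSON config file (collective intelligence) holds CSS selectors and the like, and logic written in Python or similar reads it and scrapes". The config JSON is to be generated semi-automatically with Claude Code + chrome-devtools-mcp.

I need to list the anticipated difficulties (technical and operational) of this approach.

Thinking it through:

Factors on the scraping target (documentation site) side
- SPA/hydration: A plain HTTP GET may return empty HTML and require JS execution (Docusaurus, Nextra and similar usually ship static HTML, but fully client-rendered SPAs exist too).
- DOM structure changes: A library upgrade changes CSS class names (especially Tailwind-style utility classes and hashed class names) or the structure.
- Dynamic class names: Class names generated by Styled Components or CSS Modules are unstable.
- Lazy loading: Elements appear only after scrolling; menus open only after a click.
- Anti-bot / rate limiting: Protection such as Cloudflare, and IP bans caused by heavy access.

Factors on the config generation (LLM/Claude Code) side
- Hallucination: Inventing selectors that do not exist.
- Lack of generality: Choosing selectors that work only on a specific page (e.g. heavy use of nth-child(3)).
- Maintenance: The JSON has to be updated whenever the site changes (collective intelligence or not, someone has to notice).

Factors on the MCP server / tool implementation side
- Markdown conversion quality: Code block languages are not captured, image handling, broken tables.
- Search accuracy: Plain text extraction leaves meaningless headers and footers as noise. Chunking is hard.
- A mix of "single-page" sites like Zod and "multi-page" sites like TanStack: The logic gets more complex.
- Infinite loops in pagination: Handling cycles (A -> B -> A).

Circumstances specific to the TanStack ecosystem
- Version management (v1, v2, latest). Handling URLs with and without a version.
- The TanStack docs look alike but may differ subtly.

Grouping these into categories and presenting them together with countermeasures would be helpful.

Outline:

Selector fragility (unstable DOM parsing and extraction)
- Tailwind/CSS Modules problem
- SPA/Shadow DOM problem

Access & performance (barriers to crawling and access)
- Anti-scraping / rate limiting
- Lazy loading / dynamic content

Structural variety
- Single page vs multi page
- The version-switching trap

LLM limitations (in config generation)
- Overfitting / hallucination
- Insufficient verification

OK, I'll write the answer from this.
</details>
AI: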

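The "collective-intelligence config file × MCP server" architecture is powerful, but once it is actually in operation it runs into several walls. The main difficulties to expect concern **long-term stability** and **keeping accuracy up**.

The major anticipated difficulties and their countermeasures are organized below.

### 1. Selector fragility

This is the most frequent problem. Today's websites change often.

**The Tailwind / CSS Modules wall:**
- Symptom: Hashed class names such as `class="css-1a2b3c"` can change with every build, and utility classes such as `class="flex p-4 bg-gray-100"` break easily on a design change.
- Difficulty: The LLM picks "a class that happens to work at that moment" when generating the config, and it stops working the following week.
- Countermeasure: Constrain the prompt strongly: "prefer `id`, `data-testid`, `role`, and `aria-label` attributes, and avoid Tailwind classes". Also recommend structural (parent-child) selectors such as `article > h1`.

**DOM A/B testing:**
- Symptom: Even at the same URL, the DOM differs slightly depending on when it is accessed or on the User-Agent (ad placement, presence of notification banners, and so on).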
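
Besides constraining the prompt, the server could lint the selectors it receives. The sketch below is only a heuristic: the regular expressions and the names `selector_warnings` and `order_by_stability` are assumptions for illustration, and they will produce some false positives (for example a hand-written class that happens to start with `text-`).

```python
import re

# Heuristic patterns; assumptions for illustration, not a complete list.
HASHED_CLASS = re.compile(r"\.(?:css|sc|jsx)-[A-Za-z0-9]{4,}")  # e.g. .css-1a2b3c
UTILITY_CLASS = re.compile(r"\.(?:flex|grid|[pm][xytblr]?-\d+|text-\w+|bg-\w+-\d+)\b")
POSITIONAL = re.compile(r":nth-(?:child|of-type)\(")


def selector_warnings(selector: str) -> list[str]:
    """Return the reasons a CSS selector is likely to break after a rebuild or redesign."""
    warnings = []
    if HASHED_CLASS.search(selector):
        warnings.append("hashed class name")
    if UTILITY_CLASS.search(selector):
        warnings.append("utility (Tailwind-style) class")
    if POSITIONAL.search(selector):
        warnings.append("positional pseudo-class")
    return warnings


def order_by_stability(selectors: list[str]) -> list[str]:
    """Put selectors with fewer warnings first; sorted() is stable, so ties keep their order."""
    return sorted(selectors, key=lambda s: len(selector_warnings(s)))
```

A generated config could be rejected, or sent back to Claude Code for another attempt, whenever its first-choice selector carries a warning.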

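### 2. Technical barriers to fetching content

More and more content cannot be fetched with a plain `fetch`.

**Client-Side Rendering (CSR) / SPA:**
- Symptom: `view-source:` shows only `<div id="root"></div>` with nothing inside; the content appears only after JS runs.
- Difficulty: Python's BeautifulSoup or lxml alone cannot extract it.
- Countermeasure: An architecture that goes through a headless browser with Playwright, Puppeteer, or Selenium and reads the rendered DOM becomes necessary (at the cost of more resources).

**Lazy loading:**
- Symptom: Content or code blocks near the bottom of the page are not in the DOM until the page is scrolled.
- Difficulty: Pagination links are not found, or the latter half of a long article is cut off.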

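### 3. Coping with the variety of documentation structures

"Documentation site" covers a wide range of structures.

**Single Page (Zod) vs Multi Page (TanStack):**
- Symptom: Zod has everything on one page and its sidebar consists of in-page links. TanStack navigates between pages.
- Difficulty: The crawler keeps looking for a "next page" and loops forever, or indexes the same content over and over on a page like Zod's.
- Countermeasure: Give the JSON config a `strategy: "single_page" | "multi_page"` flag and branch the processing logic on it.

**The versioning trap:**
- Symptom: You meant to crawl `tanstack.com/router/latest/...`, but following links leads into the v1 documentation.
- Difficulty: Old (deprecated) APIs get mixed into search results and cause the LLM to generate wrong code.
- Countermeasure: Implement strict URL filtering (a whitelist) based on `match_pattern`.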
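
Both countermeasures in this section (stopping loops and staying inside the whitelisted version) fit in a small crawl loop. This is a minimal sketch: `match_pattern` is the config key from above, while `crawl_linear`, the `get_next` callback (which would fetch a page and resolve its `pagination.next_link`), and the `max_pages` cap are assumptions for illustration. It uses the standard library's `fnmatch`, whose `*` also matches `/`, so `**` and `*` behave the same here.

```python
from fnmatch import fnmatchcase
from typing import Callable, Iterator, Optional


def is_allowed(url: str, match_pattern: str) -> bool:
    """Glob-style whitelist check against the config's match_pattern."""
    return fnmatchcase(url, match_pattern)


def crawl_linear(
    start_url: str,
    match_pattern: str,
    get_next: Callable[[str], Optional[str]],
    max_pages: int = 500,
) -> Iterator[str]:
    """Follow "next" links, stopping on a revisit (A -> B -> A), on a URL
    outside the whitelist (e.g. a jump from /latest/ to /v1/), or at max_pages."""
    visited: set[str] = set()
    url: Optional[str] = start_url
    while url and url not in visited and len(visited) < max_pages:
        if not is_allowed(url, match_pattern):
            break
        visited.add(url)
        yield url
        url = get_next(url)
```

For example, with a fake link table standing in for real page fetches:

```python
links = {"https://example.com/latest/a": "https://example.com/latest/b",
         "https://example.com/latest/b": "https://example.com/v1/c"}
pages = list(crawl_linear("https://example.com/latest/a", "https://example.com/latest/**", links.get))
# pages holds only the two /latest/ URLs; the /v1/ link ends the crawl.
```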

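### 4. Content conversion quality (Markdown conversion)

Pulling the text out of the HTML is not enough.

**Lost code block languages:**
- Symptom: Extracting text from `<pre><code>...</code></pre>` tends to lose whether it was `ts` or `bash`.
- Difficulty: When an LLM reads the code, accuracy drops without a language tag.
- Countermeasure: Logic is needed that parses attributes such as `class="language-tsx"` and restores a Markdown fenced code block (```tsx).

**Broken relative links:**
- Symptom: Links such as `href="../guide/setup"` stop working once the page has been converted to Markdown.
- Difficulty: When the LLM is told "see here for details", it cannot tell where the link points.
- Countermeasure: Replace every `href` with an absolute URL (`https://...`) at extraction time.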
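
Both conversions are small once the `class` attribute and the page URL are at hand. A minimal sketch, assuming the class attribute string and code text have already been pulled from the `<code>` element; the helper names `fenced_block` and `absolutize` and the `lang-` prefix variant are assumptions for illustration.

```python
import re
from urllib.parse import urljoin

# Matches class tokens such as "language-tsx" or "lang-bash".
LANG_CLASS = re.compile(r"(?:^|\s)(?:language|lang)-([\w+#-]+)")


def fenced_block(code: str, class_attr: str = "") -> str:
    """Wrap code in a Markdown fence, restoring the language from the class attribute."""
    match = LANG_CLASS.search(class_attr)
    lang = match.group(1) if match else ""
    fence = "`" * 3
    while fence in code:  # widen the fence if the code itself contains one
        fence += "`"
    return f"{fence}{lang}\n{code.rstrip()}\n{fence}"


def absolutize(href: str, page_url: str) -> str:
    """Resolve a relative href against the URL of the page it was found on."""
    return urljoin(page_url, href)
```

`urljoin` leaves already-absolute URLs untouched, so `absolutize` can be applied to every link without checking first.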

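### 5. Challenges in working with the LLM / Claude Code

These are problems that arise when config files are generated automatically.

**Hallucinated "non-existent selectors":**
- Symptom: The LLM fabricates a class such as `.sidebar-link` that does not exist, on the grounds that it is "commonly seen".
- Difficulty: The error goes unnoticed until the config is actually run.
- Countermeasure: Include a verification step in the prompt: "always run `document.querySelector(...)` to verify, and confirm the result is not null before writing the JSON".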
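
The same check can be repeated on the server before a generated config is accepted, so a hallucinated selector is caught even if the prompt-side verification was skipped. A minimal sketch using Beautiful Soup (third-party, `beautifulsoup4`), the library this log's own Python sketches use; the function name `validate_selectors` and the report shape are assumptions.

```python
from bs4 import BeautifulSoup


def validate_selectors(html: str, selectors: dict[str, list[str]]) -> dict[str, dict[str, int]]:
    """Count matches per selector on a sample page.

    0 means the selector matched nothing (possibly hallucinated);
    -1 means the selector could not be parsed at all.
    """
    soup = BeautifulSoup(html, "html.parser")
    report: dict[str, dict[str, int]] = {}
    for field, candidates in selectors.items():
        report[field] = {}
        for selector in candidates:
            try:
                report[field][selector] = len(soup.select(selector))
            except Exception:  # invalid selector syntax raised by the selector engine
                report[field][selector] = -1
    return report
```

Running it against a few sample pages per site, rather than one, also guards against selectors that only work on the page the LLM happened to inspect.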

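### Recommendations by way of conclusion

The two to watch most closely are **"2. SPA/CSR support"** and **"1. Selector fragility"**.

Realistic approaches to them:

- **Have a browser execution environment**: Build the DOM with Playwright or similar before parsing, rather than relying on a plain HTTP request.
- **Fallback strategy**: Let the config file hold several `content` selectors as an array (e.g. `["article", "main", ".content"]`) and try them from the top.

With these anticipated difficulties in mind, making the JSON schema a little richer up front (for example, turning `selectors` into arrays) reduces rework later.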

(Response finish reason: STOP)

User: It does look like the selectors need to be an array that is tried in order.

AI's thought process

Revising Sequential Execution

I've been meticulously analyzing the user's input, particularly their observation that the selector should be an array and executed sequentially. This sparked a deep dive into how to best structure the JSON schema to accommodate this crucial sequential execution requirement. I'm exploring different approaches, aiming to devise a strategy that is both elegant and efficient.

Developing Robust Scrapers

I'm now fully immersed in building a robust Model Context Protocol (MCP) server designed to scrape documentation for TanStack Start and related technologies. My primary aim is to mitigate the "Selector Fragility" issue I pinpointed earlier, and I'm currently using a collective intelligence approach. Leveraging JSON configurations generated by large language models, like Claude Code, I'm working to enhance reliability by adopting an array-of-selectors strategy where failure of one selector triggers a retry with the next. This has led to a slight increase in scraping logic complexity, but should result in a significant robustness gain.

Refining Scraper Logic

I've been meticulously refining the scraping logic to incorporate fallback and exclusion selector arrays. My focus is on ensuring the MCP server effectively processes these arrays, mirroring the behavior described. I'm thinking about how to frame this in a way that minimizes ambiguity. For instance, title/content extractions will use a "first match wins" approach and exclude extractions employ a "remove all matching" strategy. I've also started outlining the Python/JS pseudocode to handle selector array processing.

Implementing Selector Priorities

I'm now focusing on refining the proposed JSON schema. Simply transitioning from a string to an array for selectors isn't sufficient. To enhance clarity, I'm defining selector priorities with "Fallback" (first match wins) and "Aggregation" (apply all) strategies. I'm modifying prompts to elicit "Primary," "Secondary," and "Fallback" selectors, using the TanStack site as a concrete example to illustrate how selectors such as title might be specified using this method. I'm also planning to incorporate advice about distinguishing a "Fallback Strategy" from a "Union Strategy" when writing this guidance.

Distinguishing Selector Behavior

I'm currently honing in on the critical distinction between how we handle exclude selectors versus title/content selectors within the array structure. The array order now directly dictates priority: index 0 is the highest. But the exclude array needs distinct handling—it should apply all selectors listed, unlike title or content which use a "first match wins" approach. This difference is paramount for ensuring correct behavior within the scraper logic. I'm focusing on how to clearly articulate this difference in the schema and the related guidance.

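AI: Exactly. Holding selectors as an array (list) and trying them from the top (a fallback strategy) is a **standard technique** for making scraping more robust.

Even if a site update removes `id="article-title"`, the system keeps working as long as the runner-up `h1` is still there.

Let's update the JSON schema, the prompt, and the implementation logic to match this strategy.

### 1. Updated JSON schema

Change "a single string" to "an array in priority order". Note, however, that **only Exclude is "apply all"**.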

```jsonc
{
  "domain": "tanstack.com",
  "extraction": {
    // [Fallback strategy] Try from the top and use the first one found
    "title": [
      "h1.text-3xl",       // 1. Most specific (Tailwind class etc.)
      "article h1",        // 2. Structural (semantic)
      "main h1",           // 3. Generic
      "h1"                 // 4. Last resort
    ],
    // [Fallback strategy] Try from the top
    "content": [
      "div[class^='DocSearch-content']",
      "article",
      "main",
      ".markdown-body"
    ],
    // [Union strategy] Remove every element that matches any of these
    "exclude": [
      ".carbon-ads",
      "button[aria-label='Edit this page']",
      "nav",
      "footer"
    ]
  },
  "pagination": {
    // [Fallback strategy] Try from the top
    "next_link": [
      "a.pagination-next",
      "a[rel='next']",
      ".pagination-nav__item--next a"
    ]
  }
}
```

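### 2. Revised prompt (asking for multiple candidates)

Change the instruction to Claude Code from "give exactly one answer" to "answer with three tiers of candidates".

**Input to Claude Code (revised):**

````markdown
# Request: Define robust extraction rules (with fallbacks)

Target URL: [URL]

I want a scraping config that withstands changes to the website's structure, so for each element identify "multiple candidate selectors" in priority order.

## Requirements
For each element, find selectors at the following three levels and put them in an array.

1. **Specific (high precision)**: ID attributes, `data-testid`, particular class names, and the like, which are unlikely to match the wrong element.
2. **Semantic (structural)**: Selectors based on the HTML structure, such as `article > h1` or `main p`.
3. **Generic**: Selectors such as `h1` or `main` that tend to survive structural changes.

## Output format (JSON)

```json
{
  "extraction": {
    "title": [
      "string (Specific)",
      "string (Semantic)",
      "string (Generic)"
    ],
    "content": [
      "string (Specific)",
      "string (Generic)"
    ],
    "exclude": [
      "string (removal target 1)",
      "string (removal target 2)"
    ]
  },
  "pagination": {
    "next_link": [
      "string (Specific)",
      "string (Generic)"
    ]
  }
}
```
````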


---

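### 3. Implementation sketch on the MCP server side (Python)

The logic that takes the config file and actually parses the DOM looks like this.
The key point is to use `find_first_valid` (pick one) and `remove_all` (delete everything) for their separate purposes.

```python
from typing import Optional

from bs4 import BeautifulSoup, Tag


def remove_all(soup: BeautifulSoup, selectors: list[str]) -> None:
    """Union strategy: delete every element matched by any selector."""
    for selector in selectors:
        for tag in soup.select(selector):
            tag.decompose()


def find_first_valid(soup: BeautifulSoup, selectors: list[str]) -> Optional[Tag]:
    """Fallback strategy: try selectors from the top and return the first match."""
    for selector in selectors:
        element = soup.select_one(selector)
        if element is not None:
            return element  # found, so stop searching
    return None


def extract_content(html: str, config: dict) -> dict:
    soup = BeautifulSoup(html, 'html.parser')
    rules = config['extraction']

    # 1. Remove unwanted elements first (union strategy):
    #    anything matching a selector in the exclude list is deleted outright
    remove_all(soup, rules.get('exclude', []))

    # 2. Title and 3. body (fallback strategy): first matching selector wins
    title_element = find_first_valid(soup, rules.get('title', []))
    content_element = find_first_valid(soup, rules.get('content', []))

    return {
        "title": title_element.get_text(strip=True) if title_element else "No Title Found",
        # str() keeps the element's HTML markup for the later Markdown conversion
        "content": str(content_element) if content_element else ""
    }
```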

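### Benefits of this approach

- **Less frequent maintenance**: Even if the site is lightly redesigned, extraction does not fail as long as a "Generic" selector still matches.
- **Absorbing Claude Code's inference mistakes**: Even if a "Specific" selector proposed by the AI turns out to be dynamic (say, an ID that changes per page), the second, "Semantic" selector rescues it.

As a countermeasure to the anticipated difficulties, this array structure is very effective.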

(Response finish reason: STOP)

User: With this technique it seems ad handling can be ignored when converting HTML to Markdown, but there may be cases where the full text isn't captured.

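<details>
<summary>AI's thought process</summary>

Regarding the technique of falling back through an array of selectors, the user is concerned that "ad handling can perhaps be ignored, but the full text may not be captured".

This is an important point. Even if `content` is meant to select `article` or `main`, on some sites the "body" and the "API reference" live in separate `div`s, or a long article is split apart in the DOM. With a fallback (adopt only the first match, then `break`), there is a risk of **getting only part of the page**.

Example:

```html
<main>
  <div class="intro">...</div>
  <div class="api-specs">...</div> <!-- this one might be missed -->
</main>
```

With a config of `content: [".intro", "main"]`, `.intro` matches and processing stops, so `.api-specs` is never captured. Conversely, `content: ["main"]` alone may pull in an unwanted sidebar.

Possible patterns for handling this:

- Multiple Selection (Union): Concatenate every element matched by the selectors.
- Parent Selection: Specify a parent element that reliably wraps everything, and consistently trim it down with `exclude`.

How should the approach be revised to address the user's concern about "not capturing the full text"?

Possible countermeasures:

Clarify the intent of the selectors: should `content` sometimes be "Union (concatenate)" rather than "Fallback (pick one)"?
- No. Basically, identifying a single "wrapper around the whole thing" gives the best Markdown quality. Concatenating scattered elements tends to break the structure (heading levels and so on).
- So "take a large parent element and trim the noise with `exclude`" (the subtractive approach) is close to the right answer.
- But as the user fears, if the "parent element" is not found and a "child element" selector matches first by mistake, the result is a partial capture.
- → Countermeasure: The `content` selector array in the config file (JSON) should be ordered "most trustworthy scope first", not "wide (parent) to narrow (child)". Normally, though, finding the "right wrapper" is enough.

The problem is when a "wrong selector (a partial element)" ends up near the top of the list. Example: `content: ["#introduction", "article"]` captures only `#introduction` and stops.

Outline of the answer to the user's point about "not capturing the full text":
1. The point is valid. The weakness of the fallback strategy (stop at the first match) is that it can match "an element holding only part of the article" and stop there.
2. Countermeasure A: commit to the "subtractive" approach:
   - Aim the `content` selectors at the "big frame" (`main`, `article`) as far as possible.
   - In exchange, build out the `exclude` list to remove ads and navigation. This loses the least content.
3. Countermeasure B: validation (minimum character count check):
   - If the matched element has too little text (e.g. under about 200 characters), it may be "a part" rather than "the full text", so add logic that tries the next candidate.
4. Countermeasure C: "Union" support for content selectors (optional):
   - For sites whose DOM really is split, provide an option such as `"strategy": "merge"`.

Modern sites such as the TanStack docs have a solid `main` or `article`, so "Countermeasure A" should basically suffice. For a general-purpose MCP server, though, having "Countermeasure B" makes it smarter.

OK, I'll organize this into the answer.
</details>
AI: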

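A sharp observation. That is exactly the biggest risk of **"being satisfied with the first selector that matches and stopping there" (the fallback strategy)**.

If the config file had an order such as `content: [".intro-text", "main"]`, only the leading `.intro-text` would be captured and treated as "done" even though all of `main` was wanted, and the latter half of the article (API specs, sample code) could go missing.

To prevent this kind of loss, the logic and the config need a little reinforcement.

### Countermeasure 1: Commit to "subtraction" rather than "addition"

The most reliable way to prevent partial loss is to **always take the largest frame (the parent element) and trim away what is not needed**.

- Bad config: `content: [".section-a", ".section-b"]` (targets parts)
- Good config: `content: ["main", "article"]` (targets the whole frame) + `exclude: [".ad", ".sidebar"]`

The prompt must instruct that the `content` selector array be ordered by "trustworthy enclosing element (container)", not by "specificity".

### Countermeasure 2: Validation by "minimum content size" (sanity check)

On the MCP server implementation side, it is effective to add logic that treats "matched, but with far too little inside" as a false positive and **tries the next selector (skip & next)**.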

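```python
# Python implementation sketch (improved version)
from bs4 import BeautifulSoup

MIN_CONTENT_CHARS = 200  # example threshold: shorter matches are not accepted as the body


def extract_content(html: str, config: dict):
    soup = BeautifulSoup(html, 'html.parser')
    rules = config['extraction']

    # Try the candidates from the top
    final_element = None

    for selector in rules.get('content', []):
        element = soup.select_one(selector)
        if element is None:
            continue

        # [Added safeguard] Even if an element is found, treat it as a miss when it holds too little text
        text_length = len(element.get_text(strip=True))
        if text_length >= MIN_CONTENT_CHARS:
            final_element = element
            break
        print(f"Skipped selector '{selector}': text too short ({text_length} chars)")

    if final_element is None:
        return None  # no candidate held enough text

    # Exclusion step (exclude): delete noise inside the chosen container.
    # The original sketch elided this part; this is one possible completion.
    for selector in rules.get('exclude', []):
        for tag in final_element.select(selector):
            tag.decompose()

    return str(final_element)
```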

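Adding something like this "200-character rule" should, on its own, greatly reduce accidents where an ad banner `div` is picked up as the body by mistake.

### Countermeasure 3: Additional prompt instruction (container first)

When having Claude Code produce the JSON, stress firmly: "choose the wrapper that contains everything, not a partial element".

Prompt instructions to revise:

```markdown
## Rules for choosing the Content Selector
- **Never choose**: Selectors that point at "only the introduction", "only the title", or "only one particular section" of the article.
- **Prefer**: The **largest wrapper element** (`main`, `article`, `div.doc-container`, etc.) that contains every section of the article (including H2, H3, and code blocks).
- It is fine if noise (sidebar etc.) is included; the `exclude` rules remove it. Avoiding missed content is the top priority.
```

### Summary: a safe configuration

- Strategy: "take it wide (Content), then hollow out the inside (Exclude)"
- Implementation: avoid "partial capture" with a character count check

Set up this way, noise such as ads is removed by `exclude`, and the risk of missing body text is kept to a minimum.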

(Response finish reason: STOP)

### Metadata

```json
{
  "runSettings": {
    "temperature": 1,
    "model": "models/gemini-3-pro-preview",
    "topP": 0.95,
    "topK": 64,
    "maxOutputTokens": 65536,
    "safetySettings": [
      {
        "category": "HARM_CATEGORY_HARASSMENT",
        "threshold": "OFF"
      },
      {
        "category": "HARM_CATEGORY_HATE_SPEECH",
        "threshold": "OFF"
      },
      {
        "category": "HARM_CATEGORY_SEXUALLY_EXPLICIT",
        "threshold": "OFF"
      },
      {
        "category": "HARM_CATEGORY_DANGEROUS_CONTENT",
        "threshold": "OFF"
      }
    ],
    "enableCodeExecution": false,
    "enableSearchAsATool": false,
    "enableBrowseAsATool": false,
    "enableAutoFunctionResponse": false,
    "outputResolution": "1K",
    "thinkingLevel": "THINKING_HIGH"
  },
  "systemInstruction": {}
}
```

Author

podhmo commented Dec 2, 2025

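Rethinking the document extraction engine: collective intelligence vs. a general-purpose algorithm

🤔 Was "collective intelligence" really the best answer?

Until a moment ago, I had been settling on an approach that moves the document structure definitions out into a JSON config file (collective intelligence) and extracts content with CSS selectors.

Then I remembered "Mozilla Readability" ¹, the library at the core of Firefox's Reader View. It extracts the "body-like part" heuristically, with no CSS selectors or other configuration.

If that works, wouldn't it remove the need to keep painstakingly defining JSON like "this site's title is `h1.title`..."?

🔬 Examining Readability's approach

Readability works by walking the DOM tree and scoring text density, link density, class names (whether they contain article, body, content, and so on), element types, and the like, to find the most "article-like" element.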
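
To get a feel for that kind of scoring, here is a deliberately simplified toy in Python using only the standard library. It is not Readability's actual algorithm: the candidate tags, the bonus of 25, and the names `_Scorer` and `best_candidate` are all assumptions for illustration, and it assumes reasonably well-formed HTML.

```python
import re
from html.parser import HTMLParser

CANDIDATE_TAGS = {"div", "article", "section", "main"}
POSITIVE_HINT = re.compile(r"article|body|content", re.I)


class _Scorer(HTMLParser):
    """Scores container elements by text length, link density and naming hints."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open candidates: [label, text_len, link_text_len, bonus]
        self.link_depth = 0  # > 0 while inside an <a> element
        self.results = []    # (score, label) for every closed candidate

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.link_depth += 1
        if tag in CANDIDATE_TAGS:
            attr = dict(attrs)
            hint = f"{attr.get('class') or ''} {attr.get('id') or ''}".strip()
            bonus = 25 if tag in {"article", "main"} or POSITIVE_HINT.search(hint) else 0
            self.stack.append([f"{tag}.{hint}", 0, 0, bonus])

    def handle_endtag(self, tag):
        if tag == "a" and self.link_depth:
            self.link_depth -= 1
        if tag in CANDIDATE_TAGS and self.stack:
            label, text_len, link_len, bonus = self.stack.pop()
            link_density = link_len / text_len if text_len else 1.0
            self.results.append((text_len * (1 - link_density) + bonus, label))

    def handle_data(self, data):
        n = len(data.strip())
        for node in self.stack:  # text counts toward every open ancestor
            node[1] += n
            if self.link_depth:
                node[2] += n


def best_candidate(html: str) -> str:
    """Return the label ("tag.class-or-id") of the highest-scoring container."""
    scorer = _Scorer()
    scorer.feed(html)
    return max(scorer.results)[1] if scorer.results else ""
```

A link-only sidebar scores close to zero here because its link density is 1, while an `article` full of plain text keeps nearly all of its length as score.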

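Python has ports such as readability-lxml and breadability, and JavaScript (Node.js) can use the original @mozilla/readability. I'm not attached to a particular language, so switching to JS is fine if it gives better accuracy.

Comparison: Config-based vs. Heuristic-based

| Aspect | Config-based (CSS Selectors) | Heuristic-based (Readability) |
|---|---|---|
| Accuracy | Exact as long as the config is correct. Unwanted parts can be removed completely. | Mostly right, but part of the sidebar may remain or code blocks may disappear. |
| Maintenance | The JSON must be updated whenever the site structure changes. | Essentially maintenance-free. The algorithm may adapt even when the site structure changes. |
| Domain fit | Can target elements specific to technical documentation (API definitions, type information). | Optimized for blogs and news articles; may be weak on the unusual structures of technical documentation. |

Challenges specific to technical documentation

Technical documentation can be structured differently from a typical blog post.

- Importance of code blocks: Because Readability weighs "text density", it might judge an API reference that is mostly code to be "noise".
- Navigation leaking in: On SPA-style documentation sites (Docusaurus and the like), `<nav>` sometimes sits inside `<main>`, and Readability may fail to remove it.
- Multiple columns: With layouts such as API specs that put the explanation on the left and code on the right, one side may be discarded.

💡 Switching to a hybrid strategy

Relying entirely on Readability is risky, but going fully manual (config-based) makes maintenance painful.

So why not use both?

I'm revising the basic strategy to "hybrid".

1. First Pass (Config): If the config file has `extraction` rules, use them first. This is for cases that aim for "100% accuracy" or for unusual sites where Readability fails.
2. Fallback (Readability): Use Readability as the backup when there is no config or extraction fails.

This way, when I want to add the documentation of a minor library, I don't have to write JSON every time: "just register the URL and Readability will handle it somehow". And I only need to write JSON when I'm unhappy with the accuracy. That improves both the robustness and the usability of the system.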

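🔄 Updating the overall search system architecture

Based on the discussion so far, the MCP server logic is updated as follows.

Phase 1: Navigation analysis (Navigation)
This is the one part Readability cannot replace. The structure of "next page" links and the sidebar still needs the config file (or analysis by Claude Code). No change here.

Phase 2: Content extraction (Extraction)
This is the part to make flexible.

```javascript
// JS implementation sketch (Playwright + Readability)
const { Readability } = require('@mozilla/readability');
const { JSDOM } = require('jsdom');

async function extractContent(page, config) {
  // Use the same `extraction` section as the JSON schema discussed earlier
  const rules = config.extraction ?? {};

  // 1. Explicit exclusion from the config file (Exclude)
  // Removing obvious noise (ads etc.) before Readability runs improves accuracy
  if (rules.exclude?.length) {
    await page.evaluate((selectors) => {
      selectors.forEach(s => document.querySelectorAll(s).forEach(e => e.remove()));
    }, rules.exclude);
  }

  // 2. Extraction by config selectors: try in order, first match wins
  for (const selector of rules.content ?? []) {
    const element = await page.$(selector);
    if (element) return await element.innerHTML();
  }

  // 3. Automatic extraction with Readability (fallback)
  // Parse the rendered HTML with JSDOM and hand the document to Readability;
  // passing the page URL lets relative links resolve
  const dom = new JSDOM(await page.content(), { url: page.url() });
  const article = new Readability(dom.window.document).parse();

  return article ? article.content : null;
}
```

With this approach, the config file generated with chrome-devtools-mcp consists of a required `navigation` section and an optional `extraction` section (only when higher accuracy is wanted).

The "collective intelligence" can then concentrate on the harder parts: sites where Readability does not work well, and pagination analysis. That makes sense.

¹ The de facto standard algorithm for extracting the main text of web pages. Originally a bookmarklet developed by Arc90, it is now maintained by Mozilla. It strips extraneous decoration and converts the page into an easy-to-read form.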
