The documentToText function converts hypermedia documents, including all their embeds, to a plain text representation. It recursively resolves inline and block embeds, replacing inline embeds with the referenced document's name and block embeds with the embedded content.
Overview
Location: @shm/shared/document-to-text
Purpose: Generate plain text version of documents with resolved embeds
Use Cases:
Text fragment rendering with inline embeds resolved
Document search indexing
Export to plain text
Content preview generation
API
Function Signature
async function documentToText({
  documentId,
  grpcClient,
  options = {},
}: {
  documentId: UnpackedHypermediaId
  grpcClient: GRPCClient
  options?: DocumentToTextOptions // optional; defaults to {}
}): Promise<string>
Options
interface DocumentToTextOptions {
  maxDepth?: number // Maximum embed depth (default: 10)
  resolveInlineEmbeds?: boolean // Replace inline embeds with doc names (default: true)
  lineBreaks?: boolean // Add line breaks between blocks (default: true)
}
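As a minimal sketch of how the documented defaults could be applied, the snippet below fills in any option the caller leaves out. The helper name withDefaults is hypothetical and not part of @shm/shared; the interface is repeated only so the block is self-contained.

```typescript
interface DocumentToTextOptions {
  maxDepth?: number
  resolveInlineEmbeds?: boolean
  lineBreaks?: boolean
}

// Hypothetical helper: applies the default documented for each option.
function withDefaults(
  options: DocumentToTextOptions = {},
): Required<DocumentToTextOptions> {
  return {
    maxDepth: options.maxDepth ?? 10,
    resolveInlineEmbeds: options.resolveInlineEmbeds ?? true,
    lineBreaks: options.lineBreaks ?? true,
  }
}
```

Using ?? rather than object spread means an explicit undefined still falls back to the default, while an explicit false is kept.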
Features
1. Hierarchical Block Processing
Processes document blocks depth-first:
Paragraphs: Extract text content
Headings: Extract heading text
Code blocks: Include code content
Buttons: Extract button labels from attributes.name
Images/Videos/Files: Include captions
Embeds: Recursively fetch and include content
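The depth-first traversal above might look roughly like the following sketch. SketchNode is a hypothetical, simplified block shape (the real document types may differ), embeds are left out, and joining with a single space when lineBreaks is false is an assumption.

```typescript
// Hypothetical, simplified block shape; not the real @shm/shared types.
interface SketchNode {
  type: 'paragraph' | 'heading' | 'code' | 'button' | 'image'
  text?: string // paragraph, heading, code content, or media caption
  attributes?: {name?: string} // button label lives in attributes.name
  children?: SketchNode[]
}

function nodeText(node: SketchNode): string {
  // Buttons take their label from attributes.name; other types use text.
  if (node.type === 'button') return node.attributes?.name ?? ''
  return node.text ?? ''
}

function blocksToText(nodes: SketchNode[], lineBreaks = true): string {
  const parts: string[] = []
  const visit = (node: SketchNode): void => {
    const text = nodeText(node)
    if (text) parts.push(text)
    // Depth-first: a block's children are emitted before its next sibling.
    for (const child of node.children ?? []) visit(child)
  }
  nodes.forEach(visit)
  // Assumption: without line breaks, blocks are separated by one space.
  return parts.join(lineBreaks ? '\n' : ' ')
}
```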
2. Inline Embed Resolution
Replaces invisible character markers (U+FEFF) with document names:
Detects inline embed annotations
Fetches referenced document
Replaces marker with @DocumentName
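The replacement steps above can be sketched as follows. The annotation shape (InlineEmbedSketch), the link format, and the pre-fetched names map are all assumptions made for illustration; the real implementation fetches each referenced document through grpcClient.

```typescript
// Hypothetical annotation shape: each inline embed is assumed to occupy one
// U+FEFF character at index `start` and to carry the target document's link.
interface InlineEmbedSketch {
  start: number
  link: string
}

const EMBED_MARKER = '\uFEFF'

function resolveInlineMarkers(
  text: string,
  embeds: InlineEmbedSketch[],
  names: Map<string, string>, // link -> document name, assumed already fetched
): string {
  const byStart = new Map<number, InlineEmbedSketch>()
  for (const embed of embeds) byStart.set(embed.start, embed)
  let out = ''
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i)
    const embed = byStart.get(i)
    if (ch === EMBED_MARKER && embed) {
      const name = names.get(embed.link)
      // Unknown documents: drop the marker rather than emit a broken mention.
      out += name ? `@${name}` : ''
    } else {
      out += ch
    }
  }
  return out
}
```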
Example (the invisible U+FEFF marker sits between "out" and "this"):
"Check out this post!" → "Check out @Alice's Guide this post!"
3. Block Embed Resolution
Recursively fetches and includes embedded documents:
Full document embeds
Block-specific embeds (blockRef)
Block range embeds (blockRef with range)
4. Fragment Support
Handles blockRef and blockRange:
#blockId - Returns only that block's content
#blockId[start:end] - Returns only children within range
Respects parent-child relationships
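To make the two fragment forms concrete, here is a minimal parser sketch. The function name parseFragmentSketch and the returned shape are hypothetical, and only the #blockId and #blockId[start:end] syntax shown above is handled.

```typescript
interface FragmentSketch {
  blockId: string
  range?: {start: number; end: number}
}

// Hypothetical parser for the "#blockId" and "#blockId[start:end]" forms.
function parseFragmentSketch(fragment: string): FragmentSketch | null {
  const match = /^#?([^\[\]#]+)(?:\[(\d+):(\d+)\])?$/.exec(fragment)
  if (!match) return null
  const blockId = match[1]
  if (!blockId) return null
  const start = match[2]
  const end = match[3]
  // No range part: the fragment refers to the whole block.
  if (start === undefined || end === undefined) return {blockId}
  return {blockId, range: {start: Number(start), end: Number(end)}}
}
```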
5. Safety Features
Circular reference detection: Tracks visited documents
Depth limiting: Prevents infinite recursion
Error handling: Graceful fallbacks for missing content
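A minimal sketch of how these three guards can fit together is shown below. It is kept synchronous and reads from an in-memory map for brevity, whereas the real function is async and fetches through grpcClient. The block shapes, the function name renderWithGuards, and the choice to track only the current embed path (so the same document may appear twice as siblings) are assumptions.

```typescript
// Hypothetical, simplified shapes; not the real @shm/shared types.
type GuardBlock =
  | {type: 'paragraph'; text: string}
  | {type: 'embed'; target: string}

type GuardStore = Map<string, GuardBlock[]>

function renderWithGuards(
  docId: string,
  store: GuardStore,
  maxDepth = 10,
  depth = 0,
  visited: Set<string> = new Set<string>(),
): string {
  // Depth limiting: stop descending once the embed depth exceeds maxDepth.
  if (depth > maxDepth) return ''
  // Circular reference detection: skip documents already on the current path.
  if (visited.has(docId)) return ''
  const blocks = store.get(docId)
  // Graceful fallback: a missing document contributes nothing.
  if (!blocks) return ''
  visited.add(docId)
  const parts: string[] = []
  for (const block of blocks) {
    const text =
      block.type === 'paragraph'
        ? block.text
        : renderWithGuards(block.target, store, maxDepth, depth + 1, visited)
    if (text) parts.push(text)
  }
  // Leave the path so sibling embeds of the same document still render.
  visited.delete(docId)
  return parts.join('\n')
}
```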
Cross-Platform Integration
The function is available in both desktop and web apps through the document content context:
Desktop App
Direct access to grpcClient:
const {getDocumentText} = useDocContentContext()
const text = await getDocumentText(documentId, {
  lineBreaks: false,
  resolveInlineEmbeds: true,
})
Web App
API endpoint at /hm/api/document-text:
const {getDocumentText} = useDocContentContext()
// Same API, but fetches from server
const text = await getDocumentText(documentId, {
  maxDepth: 5,
  resolveInlineEmbeds: true,
})
Implementation Details
Architecture
Desktop:
Component → useDocContentContext() → documentToText(grpcClient) → Text
Web:
Component → useDocContentContext() → API /hm/api/document-text → documentToText(grpcClient) → Text
Key Files
frontend/packages/shared/src/document-to-text.ts - Core implementation
frontend/packages/shared/src/document-content-types.ts - Context interface
frontend/apps/desktop/src/pages/document-content-provider.tsx - Desktop provider
frontend/apps/web/app/doc-content-provider.tsx - Web provider
frontend/apps/web/app/routes/hm.api.document-text.tsx - Web API endpoint
Usage Examples
Basic Usage
import {documentToText, hmId} from '@shm/shared'
import {grpcClient} from './grpc-client'

const documentId = hmId('account123', {path: ['my-doc']})
const text = await documentToText({
  documentId,
  grpcClient,
  options: {},
})
console.log(text)
With Options
// Compact text without line breaks
const compactText = await documentToText({
  documentId,
  grpcClient,
  options: {
    lineBreaks: false,
    maxDepth: 5,
    resolveInlineEmbeds: true,
  },
})

// Without inline embed resolution (keep original text)
const rawText = await documentToText({
  documentId,
  grpcClient,
  options: {
    resolveInlineEmbeds: false,
  },
})
In React Components
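import {useEffect, useState} from 'react'

function MyComponent({docId}: {docId: UnpackedHypermediaId}) {
  const {getDocumentText} = useDocContentContext()
  const [text, setText] = useState('')
  useEffect(() => {
    // Ignore results that arrive after docId changes or the component unmounts
    let cancelled = false
    getDocumentText?.(docId, {lineBreaks: false})
      .then((result) => {
        if (!cancelled) setText(result)
      })
      .catch(console.error)
    return () => { cancelled = true }
  }, [docId, getDocumentText])
  return <pre>{text}</pre>
}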
Testing
17 comprehensive tests covering:
Basic text extraction
Inline embed resolution
Block embed processing
Nested structures
Circular reference detection
Max depth handling
BlockRef/BlockRange fragments
Button and heading extraction
LineBreaks option
Run tests:
NODE_ENV=test yarn workspace @shm/shared test run document-to-text
Performance Considerations
Caching: Consider caching results for frequently accessed documents
Depth limiting: Use maxDepth option for large document trees
Inline embeds: Disabling resolveInlineEmbeds skips the fetch of each referenced document, which reduces work for documents with many inline embeds
Async: Function is async and may take time for deep embed trees
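For the caching suggestion above, one possible approach is to memoize the in-flight promise per document key, as in the sketch below. withTextCache and its parameters are hypothetical names, load stands in for a call such as documentToText, and the sketch has no invalidation, so a real cache would also need to drop entries when a document changes.

```typescript
// Hypothetical wrapper: caches the promise returned by `load` per string key.
function withTextCache<K>(
  load: (key: K) => Promise<string>,
  keyOf: (key: K) => string,
): (key: K) => Promise<string> {
  const cache = new Map<string, Promise<string>>()
  return (key: K): Promise<string> => {
    const cacheKey = keyOf(key)
    let entry = cache.get(cacheKey)
    if (!entry) {
      // Store the promise itself so concurrent callers share one request.
      entry = load(key).catch((err: unknown): never => {
        // Do not keep failures: the next call retries.
        cache.delete(cacheKey)
        throw err
      })
      cache.set(cacheKey, entry)
    }
    return entry
  }
}
```

Caching the promise rather than the resolved string means two components asking for the same document at the same time trigger only one conversion.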
Related Documentation
Text Fragment Rendering
Document Blocks
Document Linking