Audio Recording
Learn how to record and send audio to Autessa agents using two different approaches: real-time voice mode for conversational AI and multimodal audio upload for asynchronous processing.
Overview
Autessa supports two audio input methods:
- Real-time Voice Mode: Stream audio continuously via WebSocket for conversational experiences with voiceModeEnabled=true
- Multimodal Audio: Record audio, upload to S3, and send as part of a multimodal request
Choose real-time voice mode for interactive conversations with low latency. Use multimodal audio when you need to process pre-recorded audio or don't require real-time interaction.
Real-time Voice Mode (WebSocket)
Real-time voice mode enables continuous bidirectional audio streaming for conversational AI experiences.
Connection Setup
Connect to the WebSocket endpoint with the voiceModeEnabled=true parameter:
const agentId = 123
const apiKey = 'your_api_key'
const wsUrl = `wss://api.autessa.com/ws/clients/agents/execute?authorization=${apiKey}&resourceId=${agentId}&voiceModeEnabled=true`
const websocket = new WebSocket(wsUrl)
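Because the API key and agent ID are interpolated straight into the query string, a key containing reserved characters such as & or = would corrupt the URL. A minimal sketch of a safer builder, assuming the same endpoint and parameter names shown above (buildVoiceUrl is a hypothetical helper, not part of the Autessa API):

```typescript
// Hypothetical helper: builds the voice-mode WebSocket URL shown above,
// percent-encoding each query parameter via URLSearchParams.
function buildVoiceUrl(apiKey: string, agentId: number): string {
  const params = new URLSearchParams({
    authorization: apiKey,
    resourceId: String(agentId),
    voiceModeEnabled: 'true'
  })
  return `wss://api.autessa.com/ws/clients/agents/execute?${params.toString()}`
}
```

The result can be passed to new WebSocket(...) exactly like the hand-built wsUrl.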
Voice Establishment Message
After the WebSocket connection opens, send a voice establishment message to configure audio format and agent parameters:
websocket.onopen = () => {
  const establishmentMessage = {
    messageType: 'VOICE',
    outputType: 'AUDIO', // or 'TEXT' for transcription only
    agentId: 123,
    audioFormat: {
      sampleRate: 16000,
      sampleSizeInBits: 32,
      channels: 1,
      signed: true,
      bigEndian: false
    },
    environmentVariables: {},
    promptTemplateVariables: {}
  }
  websocket.send(JSON.stringify(establishmentMessage))
  console.log('Voice mode established')
}
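The fields of the establishment message can be captured in a type so that mistakes such as a misspelled outputType surface at compile time. This is a sketch based only on the fields shown above; the interface and function names are hypothetical, and the real API may accept additional fields or values:

```typescript
interface AudioFormat {
  sampleRate: number
  sampleSizeInBits: number
  channels: number
  signed: boolean
  bigEndian: boolean
}

interface VoiceEstablishmentMessage {
  messageType: 'VOICE'
  outputType: 'AUDIO' | 'TEXT'
  agentId: number
  audioFormat: AudioFormat
  environmentVariables: Record<string, unknown>
  promptTemplateVariables: Record<string, unknown>
}

// Hypothetical builder that fills in the audio format used throughout this guide.
function buildEstablishmentMessage(
  agentId: number,
  outputType: 'AUDIO' | 'TEXT' = 'AUDIO',
  sampleRate: number = 16000
): VoiceEstablishmentMessage {
  return {
    messageType: 'VOICE',
    outputType,
    agentId,
    audioFormat: {
      sampleRate,
      sampleSizeInBits: 32, // Float32 samples
      channels: 1,
      signed: true,
      bigEndian: false
    },
    environmentVariables: {},
    promptTemplateVariables: {}
  }
}
```

The returned object is serialized with JSON.stringify before being sent over the socket.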
AudioRecorder Utility Class
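Here's a utility class for continuous audio recording with WebSocket streaming. It relies on ScriptProcessorNode, which is deprecated in favor of AudioWorklet but is still widely supported in browsers: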
export class AudioRecorder {
private sampleRate: number
private audioContext: AudioContext | null = null
private processor: ScriptProcessorNode | null = null
private source: MediaStreamAudioSourceNode | null = null
private stream: MediaStream | null = null
private isRecording: boolean = false
private websocket: WebSocket | null = null
private websocketReady: boolean = false
// Voice Activity Detection (threshold and debounce are public so callers can tune them)
vadThreshold: number = 0.01
vadDebounceMs: number = 300
private isSpeaking: boolean = false
private silenceStart: number = 0
private rmsBuffer: number[] = []
private readonly rmsBufferSize: number = 5
// Audio accumulation for JSON export
private accumulatedAudio: Float32Array[] = []
private isAccumulating: boolean = false
constructor(sampleRate: number = 16000) {
this.sampleRate = sampleRate
}
/**
* Start continuous recording with optional WebSocket streaming
*/
async startContinuousRecording(
websocketUrl: string | null = null,
agentConfig: {
outputType: string
agentId: number
environmentVariables?: Record<string, any>
promptTemplateVariables?: Record<string, any>
} | null = null
): Promise<void> {
try {
// Set up WebSocket if URL provided
if (websocketUrl) {
this.websocket = new WebSocket(websocketUrl)
this.websocketReady = false
await new Promise((resolve, reject) => {
this.websocket!.onopen = () => {
console.log('WebSocket connected')
// Send voice establishment message
if (agentConfig) {
const establishmentPayload = {
messageType: 'VOICE',
outputType: agentConfig.outputType,
agentId: agentConfig.agentId,
audioFormat: {
sampleRate: this.sampleRate,
sampleSizeInBits: 32,
channels: 1,
signed: true,
bigEndian: false
},
environmentVariables: agentConfig.environmentVariables || {},
promptTemplateVariables: agentConfig.promptTemplateVariables || {}
}
this.websocket!.send(JSON.stringify(establishmentPayload))
console.log('Sent voice establishment message')
} else {
this.websocketReady = true
}
resolve(undefined)
}
this.websocket!.onmessage = (event) => {
if (!this.websocketReady) {
console.log('WebSocket ready to receive audio')
this.websocketReady = true
}
this.handleWebSocketMessage(event)
}
this.websocket!.onerror = (error) => reject(error)
})
}
// Request microphone access
this.stream = await navigator.mediaDevices.getUserMedia({
audio: {
echoCancellation: true,
noiseSuppression: true,
autoGainControl: true
}
})
// Create audio context with specified sample rate
this.audioContext = new (window.AudioContext || (window as any).webkitAudioContext)({
sampleRate: this.sampleRate
})
this.source = this.audioContext.createMediaStreamSource(this.stream)
// Create processor with 4096 buffer size
const bufferSize = 4096
this.processor = this.audioContext.createScriptProcessor(bufferSize, 1, 1)
this.source.connect(this.processor)
this.processor.connect(this.audioContext.destination)
// Process audio in real-time
this.processor.onaudioprocess = (event) => {
const float32Samples = event.inputBuffer.getChannelData(0)
// Detect voice activity
this.detectVoiceActivity(float32Samples)
// Send to WebSocket if ready
if (this.websocket && this.websocketReady && this.isSpeaking) {
this.sendAudioViaWebSocket(float32Samples)
}
// Accumulate for JSON export if requested
if (this.isAccumulating) {
this.accumulatedAudio.push(new Float32Array(float32Samples))
}
}
this.isRecording = true
console.log('Continuous recording started')
} catch (error) {
console.error('Error starting recording:', error)
throw error
}
}
/**
* Detect voice activity using RMS energy
*/
private detectVoiceActivity(samples: Float32Array): void {
// Calculate RMS (Root Mean Square)
let sum = 0
for (let i = 0; i < samples.length; i++) {
sum += samples[i] * samples[i]
}
const rms = Math.sqrt(sum / samples.length)
// Add to smoothing buffer
this.rmsBuffer.push(rms)
if (this.rmsBuffer.length > this.rmsBufferSize) {
this.rmsBuffer.shift()
}
// Calculate smoothed RMS
const smoothedRms = this.rmsBuffer.reduce((a, b) => a + b, 0) / this.rmsBuffer.length
const now = Date.now()
if (smoothedRms > this.vadThreshold) {
// Voice detected
if (!this.isSpeaking) {
this.isSpeaking = true
this.onSpeechStart?.()
}
this.silenceStart = now
} else {
// Silence detected
if (this.isSpeaking && (now - this.silenceStart) > this.vadDebounceMs) {
this.isSpeaking = false
this.onSpeechEnd?.()
}
}
}
/**
* Send audio data via WebSocket as ArrayBuffer
*/
private sendAudioViaWebSocket(float32Array: Float32Array): void {
if (!this.websocket || this.websocket.readyState !== WebSocket.OPEN || !this.websocketReady) {
return
}
try {
// Send as raw ArrayBuffer (Float32Array buffer)
this.websocket.send(float32Array.buffer)
} catch (error) {
console.error('Error sending audio via WebSocket:', error)
}
}
/**
* Handle incoming WebSocket messages
*/
private handleWebSocketMessage(event: MessageEvent): void {
// Forward to the optional onMessage callback; fall back to logging when none is set
if (this.onMessage) {
this.onMessage(event)
} else {
console.log('WebSocket message received:', event.data)
}
}
/**
* Stop recording and clean up resources
*/
stopRecording(): void {
this.isRecording = false
this.isAccumulating = false
this.websocketReady = false
this.isSpeaking = false
if (this.processor) {
this.processor.disconnect()
this.processor = null
}
if (this.source) {
this.source.disconnect()
this.source = null
}
if (this.audioContext) {
this.audioContext.close()
this.audioContext = null
}
if (this.stream) {
this.stream.getTracks().forEach(track => track.stop())
this.stream = null
}
if (this.websocket) {
this.websocket.close()
this.websocket = null
}
this.accumulatedAudio = []
this.rmsBuffer = []
console.log('Recording stopped and cleaned up')
}
/**
* Check if currently recording
*/
isCurrentlyRecording(): boolean {
return this.isRecording
}
/**
* Check if voice is detected
*/
isCurrentlySpeaking(): boolean {
return this.isSpeaking
}
// Optional callbacks
onSpeechStart?: () => void
onSpeechEnd?: () => void
onMessage?: (event: MessageEvent) => void
}
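sendAudioViaWebSocket passes float32Array.buffer directly. That sends the samples in the platform's native byte order and, if the array is a view created with subarray, it sends the whole underlying buffer rather than just the view. Virtually all browsers run on little-endian hardware, so this matches bigEndian: false in practice. If you prefer to make the byte order and length explicit, a sketch along these lines could be used (float32ToLittleEndianBytes is a hypothetical helper, not part of the class above):

```typescript
// Hypothetical helper: copies samples into a fresh ArrayBuffer as
// little-endian 32-bit floats, regardless of platform byte order.
function float32ToLittleEndianBytes(samples: Float32Array): ArrayBuffer {
  const out = new ArrayBuffer(samples.length * 4)
  const view = new DataView(out)
  for (let i = 0; i < samples.length; i++) {
    view.setFloat32(i * 4, samples[i], true) // true = little-endian
  }
  return out
}
```

The returned buffer contains exactly samples.length * 4 bytes, so it is safe to send even when samples is a sub-view of a larger buffer.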
Usage Example: Real-time Voice Mode
// Initialize recorder
const recorder = new AudioRecorder(16000)
// Set up callbacks
recorder.onSpeechStart = () => {
console.log('User started speaking')
}
recorder.onSpeechEnd = () => {
console.log('User stopped speaking')
}
recorder.onMessage = (event) => {
try {
const response = JSON.parse(event.data)
console.log('Received:', response)
// Handle audio chunks - see Audio Playback documentation
} catch (e) {
// Binary audio data
}
}
// Start recording with WebSocket streaming
const agentId = 123
const apiKey = 'your_api_key'
const wsUrl = `wss://api.autessa.com/ws/clients/agents/execute?authorization=${apiKey}&resourceId=${agentId}&voiceModeEnabled=true`
await recorder.startContinuousRecording(wsUrl, {
outputType: 'AUDIO',
agentId: agentId,
environmentVariables: {},
promptTemplateVariables: {}
})
// Later: stop recording
recorder.stopRecording()
For handling received audio chunks, see the Audio Playback documentation.
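The onMessage callback above uses a failed JSON.parse to tell text from binary data. A slightly more explicit approach is to branch on the type of event.data. The sketch below only classifies a message; it makes no assumption about the response schema, which is not described in this guide, and parseAgentMessage is a hypothetical name. Note that browsers deliver binary WebSocket frames as Blob by default; setting websocket.binaryType = 'arraybuffer' makes them arrive as ArrayBuffer, which is what this sketch expects.

```typescript
type ParsedMessage =
  | { kind: 'json'; value: unknown }
  | { kind: 'text'; value: string }
  | { kind: 'binary'; bytes: ArrayBuffer }
  | { kind: 'unknown' }

// Hypothetical helper: classifies the payload of a WebSocket MessageEvent.
function parseAgentMessage(data: unknown): ParsedMessage {
  if (typeof data === 'string') {
    try {
      return { kind: 'json', value: JSON.parse(data) }
    } catch (e) {
      // Not valid JSON: keep the raw string
      return { kind: 'text', value: data }
    }
  }
  if (data instanceof ArrayBuffer) {
    return { kind: 'binary', bytes: data }
  }
  return { kind: 'unknown' }
}
```

Inside onMessage this would be called as parseAgentMessage(event.data), followed by a switch on the kind field.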
Multimodal Audio Mode (S3 Upload)
For non-real-time use cases, record audio, convert to WAV, upload to S3, and send the S3 URI in a multimodal request.
Step 1: Record and Accumulate Audio
class AudioAccumulator {
private sampleRate: number
private audioContext: AudioContext | null = null
private processor: ScriptProcessorNode | null = null
private source: MediaStreamAudioSourceNode | null = null
private stream: MediaStream | null = null
private isRecording: boolean = false
private chunks: Float32Array[] = []
constructor(sampleRate: number = 16000) {
this.sampleRate = sampleRate
}
async startRecording(): Promise<void> {
try {
this.stream = await navigator.mediaDevices.getUserMedia({ audio: true })
this.audioContext = new (window.AudioContext || (window as any).webkitAudioContext)({
sampleRate: this.sampleRate
})
this.source = this.audioContext.createMediaStreamSource(this.stream)
this.processor = this.audioContext.createScriptProcessor(4096, 1, 1)
this.source.connect(this.processor)
this.processor.connect(this.audioContext.destination)
this.processor.onaudioprocess = (event) => {
const samples = event.inputBuffer.getChannelData(0)
this.chunks.push(new Float32Array(samples))
}
this.isRecording = true
this.chunks = []
} catch (error) {
console.error('Error starting recording:', error)
throw error
}
}
stopRecording(): Float32Array {
this.isRecording = false
if (this.processor) this.processor.disconnect()
if (this.source) this.source.disconnect()
if (this.audioContext) this.audioContext.close()
if (this.stream) {
this.stream.getTracks().forEach(track => track.stop())
}
// Combine all chunks
const totalLength = this.chunks.reduce((sum, chunk) => sum + chunk.length, 0)
const combined = new Float32Array(totalLength)
let offset = 0
for (const chunk of this.chunks) {
combined.set(chunk, offset)
offset += chunk.length
}
return combined
}
}
Step 2: Convert to WAV Format
class WavConverter {
static float32ToWav(float32Array: Float32Array, sampleRate: number): Blob {
// Convert Float32 to Int16
const int16Array = new Int16Array(float32Array.length)
for (let i = 0; i < float32Array.length; i++) {
const s = Math.max(-1, Math.min(1, float32Array[i]))
int16Array[i] = s < 0 ? s * 0x8000 : s * 0x7FFF
}
const buffer = new ArrayBuffer(44 + int16Array.length * 2)
const view = new DataView(buffer)
// WAV header
const writeString = (offset: number, str: string) => {
for (let i = 0; i < str.length; i++) {
view.setUint8(offset + i, str.charCodeAt(i))
}
}
writeString(0, 'RIFF')
view.setUint32(4, 36 + int16Array.length * 2, true)
writeString(8, 'WAVE')
writeString(12, 'fmt ')
view.setUint32(16, 16, true)
view.setUint16(20, 1, true)
view.setUint16(22, 1, true)
view.setUint32(24, sampleRate, true)
view.setUint32(28, sampleRate * 2, true)
view.setUint16(32, 2, true)
view.setUint16(34, 16, true)
writeString(36, 'data')
view.setUint32(40, int16Array.length * 2, true)
// Write audio data
const wavData = new Uint8Array(buffer)
wavData.set(new Uint8Array(int16Array.buffer), 44)
return new Blob([wavData], { type: 'audio/wav' })
}
}
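To sanity-check a converted file before uploading, the header fields written above can be read back. The sketch below assumes the canonical 44-byte PCM header layout that float32ToWav produces (no extra chunks before 'data'); readWavHeader and WavInfo are hypothetical names.

```typescript
interface WavInfo {
  sampleRate: number
  channels: number
  bitsPerSample: number
  dataBytes: number
  durationSeconds: number
}

// Hypothetical helper: reads a canonical 44-byte PCM WAV header.
function readWavHeader(buffer: ArrayBuffer): WavInfo {
  if (buffer.byteLength < 44) {
    throw new Error('Buffer too small for a 44-byte WAV header')
  }
  const view = new DataView(buffer)
  // Read a 4-character ASCII tag at the given byte offset
  const tag = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3)
    )
  if (tag(0) !== 'RIFF' || tag(8) !== 'WAVE' || tag(36) !== 'data') {
    throw new Error('Not a canonical PCM WAV header')
  }
  const channels = view.getUint16(22, true)
  const sampleRate = view.getUint32(24, true)
  const bitsPerSample = view.getUint16(34, true)
  const dataBytes = view.getUint32(40, true)
  const bytesPerSecond = sampleRate * channels * (bitsPerSample / 8)
  if (bytesPerSecond === 0) {
    throw new Error('Invalid format fields in WAV header')
  }
  return {
    sampleRate,
    channels,
    bitsPerSample,
    dataBytes,
    durationSeconds: dataBytes / bytesPerSecond
  }
}
```

In the browser, the input can be obtained with await wavBlob.arrayBuffer(). For one second of 16 kHz mono audio the header should report 32000 data bytes.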
Step 3: Upload to S3
class AudioUploader {
private apiKey: string
private baseUrl: string
constructor(apiKey: string, baseUrl: string = 'https://api.autessa.com') {
this.apiKey = apiKey
this.baseUrl = baseUrl
}
async uploadAudio(agentId: number, wavBlob: Blob): Promise<string> {
// Step 1: Get presigned URL
const presignedResponse = await fetch(
`${this.baseUrl}/clients/agents/generate-audio-upload-link?resourceId=${agentId}`,
{
method: 'POST',
headers: {
'Authorization': this.apiKey,
'Content-Type': 'application/json'
}
}
)
if (!presignedResponse.ok) {
throw new Error('Failed to get presigned URL')
}
const { uploadUrl, s3Uri } = await presignedResponse.json()
// Step 2: Upload to S3
const uploadResponse = await fetch(uploadUrl, {
method: 'PUT',
body: wavBlob,
headers: {
'Content-Type': 'audio/wav'
}
})
if (!uploadResponse.ok) {
throw new Error('Failed to upload to S3')
}
// Step 3: Return S3 URI
return s3Uri
}
}
Step 4: Use in Multimodal Request
// Complete workflow
async function recordAndSendAudio(agentId: number, apiKey: string) {
// 1. Record audio
const accumulator = new AudioAccumulator(16000)
await accumulator.startRecording()
console.log('Recording... (press any key to stop)')
// ... wait for user to finish ...
const audioData = accumulator.stopRecording()
// 2. Convert to WAV
const wavBlob = WavConverter.float32ToWav(audioData, 16000)
// 3. Upload to S3
const uploader = new AudioUploader(apiKey)
const s3Uri = await uploader.uploadAudio(agentId, wavBlob)
console.log('Audio uploaded:', s3Uri)
// 4. Send multimodal request
const response = await fetch(
`https://api.autessa.com/clients/agents/execute?resourceId=${agentId}`,
{
method: 'POST',
headers: {
'Authorization': apiKey,
'Content-Type': 'application/json'
},
body: JSON.stringify({
agentId: agentId,
input: [
{
inputType: 'AUDIO',
s3Uri: s3Uri,
audioFormat: {
sampleRate: 16000,
sampleSizeInBits: 16,
channels: 1,
signed: true,
bigEndian: false
}
}
],
executionOutputMode: 'TEXT' // or 'AUDIO'
})
}
)
if (!response.ok) {
throw new Error(`Agent request failed with status ${response.status}`)
}
const result = await response.json()
console.log('Agent response:', result)
}
Audio Format Specifications
All audio in Autessa follows these specifications:
- sampleRate (number): 16000 Hz (16 kHz)
- channels (number): 1 (mono)
- sampleSizeInBits (number): 32-bit for recording (Float32), 16-bit for WAV export (Int16)
- signed (boolean): true
- bigEndian (boolean): false (little-endian)
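These values translate directly into data rates: 16000 samples/s × 1 channel × 4 bytes gives 64,000 bytes per second for the Float32 stream, and half that for 16-bit WAV. A 4096-sample buffer at 16 kHz covers 256 ms of audio. The sketch below does the same arithmetic; the function names are illustrative only.

```typescript
// Raw PCM byte rate for a given format.
function bytesPerSecond(sampleRate: number, channels: number, sampleSizeInBits: number): number {
  return sampleRate * channels * (sampleSizeInBits / 8)
}

// Duration in milliseconds covered by one buffer of samples (per channel).
function bufferDurationMs(samplesPerBuffer: number, sampleRate: number): number {
  return (samplesPerBuffer / sampleRate) * 1000
}

const streamRate = bytesPerSecond(16000, 1, 32)  // Float32 streaming
const wavRate = bytesPerSecond(16000, 1, 16)     // Int16 WAV export
const chunkMs = bufferDurationMs(4096, 16000)    // one ScriptProcessor buffer
console.log(streamRate, wavRate, chunkMs)        // 64000 32000 256
```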
Best Practices
Voice Activity Detection
The AudioRecorder class includes VAD to detect when the user is speaking:
- Threshold: Adjust vadThreshold (default: 0.01) for sensitivity
- Debounce: Set vadDebounceMs (default: 300 ms) to avoid flickering
- Callbacks: Use onSpeechStart and onSpeechEnd for UI updates
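One way to pick a threshold is to measure the RMS of a short stretch of background noise and set the threshold a few times above it. The sketch below illustrates the idea; suggestVadThreshold and its margin factor of 3 are assumptions made for illustration, not values recommended by Autessa.

```typescript
// RMS energy of one frame of samples in the range [-1, 1].
function computeRms(samples: Float32Array): number {
  if (samples.length === 0) return 0
  let sum = 0
  for (let i = 0; i < samples.length; i++) {
    sum += samples[i] * samples[i]
  }
  return Math.sqrt(sum / samples.length)
}

// Hypothetical calibration helper: returns a threshold equal to the mean
// noise-floor RMS times a margin, never lower than the 0.01 default.
function suggestVadThreshold(noiseFrames: Float32Array[], margin: number = 3): number {
  const fallback = 0.01
  if (noiseFrames.length === 0) return fallback
  let total = 0
  for (const frame of noiseFrames) {
    total += computeRms(frame)
  }
  const meanRms = total / noiseFrames.length
  return Math.max(fallback, meanRms * margin)
}
```

The frames could be collected from the first second or so of recording, before the user is prompted to speak.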
Memory Management
- Always call stopRecording() when done to free resources
- For long recordings, consider chunked processing
- Use blob URLs sparingly and revoke them after use
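As one possible reading of "chunked processing", recorded frames can be handed off in fixed-size batches instead of being held in memory until the recording ends. The class below is a hypothetical sketch of that idea and is not part of the recorder classes above; what happens to each flushed chunk (upload, encode, discard) is left to the caller.

```typescript
// Hypothetical helper: collects frames and flushes them to a callback
// once at least maxSamples samples have accumulated.
class ChunkedAudioBuffer {
  private frames: Float32Array[] = []
  private sampleCount: number = 0
  private readonly maxSamples: number
  private readonly onFlush: (chunk: Float32Array) => void

  constructor(maxSamples: number, onFlush: (chunk: Float32Array) => void) {
    this.maxSamples = maxSamples
    this.onFlush = onFlush
  }

  push(frame: Float32Array): void {
    // Copy the frame, since audio callbacks may reuse their buffers
    this.frames.push(new Float32Array(frame))
    this.sampleCount += frame.length
    if (this.sampleCount >= this.maxSamples) {
      this.flush()
    }
  }

  // Concatenate pending frames, reset state, and hand the chunk to the callback.
  flush(): void {
    if (this.sampleCount === 0) return
    const combined = new Float32Array(this.sampleCount)
    let offset = 0
    for (const frame of this.frames) {
      combined.set(frame, offset)
      offset += frame.length
    }
    this.frames = []
    this.sampleCount = 0
    this.onFlush(combined)
  }

  get pendingSamples(): number {
    return this.sampleCount
  }
}
```

At 16 kHz, a maxSamples of 160000 would cap the pending buffer at roughly ten seconds of audio (about 640 KB of Float32 data). Call flush() once more after stopping to emit the remainder.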
Error Handling
try {
  await recorder.startContinuousRecording(wsUrl, config)
} catch (error) {
  // getUserMedia rejects with a DOMException; narrow the type before reading .name
  const name = error instanceof DOMException ? error.name : ''
  if (name === 'NotAllowedError') {
    console.error('Microphone permission denied')
  } else if (name === 'NotFoundError') {
    console.error('No microphone found')
  } else {
    console.error('Failed to start recording:', error)
  }
}
WebSocket Reconnection
For production use, implement reconnection logic:
let reconnectAttempts = 0
const maxReconnectAttempts = 3
websocket.onclose = () => {
  if (reconnectAttempts < maxReconnectAttempts) {
    reconnectAttempts++
    setTimeout(() => {
      console.log(`Reconnecting... (${reconnectAttempts}/${maxReconnectAttempts})`)
      // Restart recording
    }, 1000 * reconnectAttempts)
  }
}
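The snippet above waits 1 s, 2 s, and 3 s between attempts (linear backoff). A common alternative is capped exponential backoff, which is a different technique from the one shown and is offered here only as a sketch; reconnectDelayMs and its default values are assumptions.

```typescript
// Hypothetical helper: delay before the given 1-based reconnect attempt,
// doubling each time up to capMs. Returns null once attempts are exhausted.
function reconnectDelayMs(
  attempt: number,
  maxAttempts: number = 3,
  baseMs: number = 1000,
  capMs: number = 10000
): number | null {
  if (attempt < 1 || attempt > maxAttempts) return null
  return Math.min(capMs, baseMs * Math.pow(2, attempt - 1))
}
```

In an onclose handler, increment the attempt counter, call reconnectDelayMs with it, and schedule the restart with setTimeout only when the result is not null. Resetting the counter after a successful onopen lets later disconnects start again from the shortest delay.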
Next Steps
- Learn how to play received audio in the Audio Playback guide
- See complete examples in the Agent API documentation
- Explore multimodal capabilities in the Agent API reference