Audio Recording

Learn how to record and send audio to Autessa agents using two different approaches: real-time voice mode for conversational AI and multimodal audio upload for asynchronous processing.

Overview

Autessa supports two audio input methods:

  1. Real-time Voice Mode: Stream audio continuously via WebSocket for conversational experiences with voiceModeEnabled=true
  2. Multimodal Audio: Record audio, upload to S3, and send as part of a multimodal request

Choose real-time voice mode for interactive conversations with low latency. Use multimodal audio when you need to process pre-recorded audio or don't require real-time interaction.


Real-time Voice Mode (WebSocket)

Real-time voice mode enables continuous bidirectional audio streaming for conversational AI experiences.

Connection Setup

Connect to the WebSocket endpoint with the voiceModeEnabled=true parameter:

const agentId = 123
const apiKey = 'your_api_key'
const wsUrl = `wss://api.autessa.com/ws/clients/agents/execute?authorization=${apiKey}&resourceId=${agentId}&voiceModeEnabled=true`

const websocket = new WebSocket(wsUrl)
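
The URL above interpolates the API key directly into the query string. If a key ever contains reserved characters such as & or =, the query would break. The following is a minimal sketch of a helper that URL-encodes each parameter; buildVoiceWsUrl is a hypothetical name used for illustration, not part of any Autessa SDK.

```typescript
// Hypothetical helper: builds the voice-mode WebSocket URL shown above,
// percent-encoding each query parameter value.
function buildVoiceWsUrl(apiKey: string, agentId: number): string {
  const query = [
    `authorization=${encodeURIComponent(apiKey)}`,
    `resourceId=${encodeURIComponent(String(agentId))}`,
    'voiceModeEnabled=true'
  ].join('&')
  return `wss://api.autessa.com/ws/clients/agents/execute?${query}`
}

const exampleUrl = buildVoiceWsUrl('your_api_key', 123)
console.log(exampleUrl)
```

For keys made only of letters, digits, and underscores, this produces the same URL as the template string above.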

Voice Establishment Message

After the WebSocket connection opens, send a voice establishment message to configure audio format and agent parameters:

websocket.onopen = () => {
  const establishmentMessage = {
    messageType: 'VOICE',
    outputType: 'AUDIO', // or 'TEXT' for transcription only
    agentId: 123,
    audioFormat: {
      sampleRate: 16000,
      sampleSizeInBits: 32,
      channels: 1,
      signed: true,
      bigEndian: false
    },
    environmentVariables: {},
    promptTemplateVariables: {}
  }

  websocket.send(JSON.stringify(establishmentMessage))
  console.log('Voice mode established')
}
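
To keep the establishment payload consistent wherever it is sent, it can help to give it a type. The sketch below mirrors the fields of the message above; the interface names and buildEstablishmentMessage are hypothetical and used only for illustration, and the field set is taken from this page rather than from a formal schema.

```typescript
type OutputType = 'AUDIO' | 'TEXT'

interface AudioFormat {
  sampleRate: number
  sampleSizeInBits: number
  channels: number
  signed: boolean
  bigEndian: boolean
}

interface VoiceEstablishmentMessage {
  messageType: 'VOICE'
  outputType: OutputType
  agentId: number
  audioFormat: AudioFormat
  environmentVariables: Record<string, unknown>
  promptTemplateVariables: Record<string, unknown>
}

// Hypothetical helper: returns the same payload shape as the example above.
function buildEstablishmentMessage(
  agentId: number,
  outputType: OutputType = 'AUDIO',
  sampleRate: number = 16000
): VoiceEstablishmentMessage {
  return {
    messageType: 'VOICE',
    outputType,
    agentId,
    audioFormat: {
      sampleRate,
      sampleSizeInBits: 32, // Float32 samples, as in the example above
      channels: 1,
      signed: true,
      bigEndian: false
    },
    environmentVariables: {},
    promptTemplateVariables: {}
  }
}

console.log(JSON.stringify(buildEstablishmentMessage(123, 'TEXT')))
```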

AudioRecorder Utility Class

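Here's a utility class for continuous audio recording with WebSocket streaming. It relies on ScriptProcessorNode, which is deprecated in favor of AudioWorklet but remains widely supported in browsers: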

export class AudioRecorder {
  private sampleRate: number
  private audioContext: AudioContext | null = null
  private processor: ScriptProcessorNode | null = null
  private source: MediaStreamAudioSourceNode | null = null
  private stream: MediaStream | null = null
  private isRecording: boolean = false
  private websocket: WebSocket | null = null
  private websocketReady: boolean = false

  // Voice Activity Detection
  private vadThreshold: number = 0.01
  private vadDebounceMs: number = 300
  private isSpeaking: boolean = false
  private silenceStart: number = 0
  private rmsBuffer: number[] = []
  private readonly rmsBufferSize: number = 5

  // Audio accumulation for JSON export
  private accumulatedAudio: Float32Array[] = []
  private isAccumulating: boolean = false

  constructor(sampleRate: number = 16000) {
    this.sampleRate = sampleRate
  }

  /**
   * Start continuous recording with optional WebSocket streaming
   */
  async startContinuousRecording(
    websocketUrl: string | null = null,
    agentConfig: {
      outputType: string
      agentId: number
      environmentVariables?: Record<string, any>
      promptTemplateVariables?: Record<string, any>
    } | null = null
  ): Promise<void> {
    try {
      // Set up WebSocket if URL provided
      if (websocketUrl) {
        this.websocket = new WebSocket(websocketUrl)
        this.websocketReady = false

        await new Promise<void>((resolve, reject) => {
          this.websocket!.onopen = () => {
            console.log('WebSocket connected')

            // Send voice establishment message
            if (agentConfig) {
              const establishmentPayload = {
                messageType: 'VOICE',
                outputType: agentConfig.outputType,
                agentId: agentConfig.agentId,
                audioFormat: {
                  sampleRate: this.sampleRate,
                  sampleSizeInBits: 32,
                  channels: 1,
                  signed: true,
                  bigEndian: false
                },
                environmentVariables: agentConfig.environmentVariables || {},
                promptTemplateVariables: agentConfig.promptTemplateVariables || {}
              }

              this.websocket!.send(JSON.stringify(establishmentPayload))
              console.log('Sent voice establishment message')
            } else {
              this.websocketReady = true
            }

            resolve()
          }

          this.websocket!.onmessage = (event) => {
            // The first server message marks the socket as ready for audio
            if (!this.websocketReady) {
              console.log('WebSocket ready to receive audio')
              this.websocketReady = true
            }
            this.handleWebSocketMessage(event)
          }

          this.websocket!.onerror = (error) => reject(error)
        })
      }

      // Request microphone access
      this.stream = await navigator.mediaDevices.getUserMedia({
        audio: {
          echoCancellation: true,
          noiseSuppression: true,
          autoGainControl: true
        }
      })

      // Create audio context with specified sample rate
      this.audioContext = new (window.AudioContext || (window as any).webkitAudioContext)({
        sampleRate: this.sampleRate
      })

      this.source = this.audioContext.createMediaStreamSource(this.stream)

      // Create processor with 4096 buffer size
      const bufferSize = 4096
      this.processor = this.audioContext.createScriptProcessor(bufferSize, 1, 1)

      this.source.connect(this.processor)
      this.processor.connect(this.audioContext.destination)

      // Process audio in real-time
      this.processor.onaudioprocess = (event) => {
        const float32Samples = event.inputBuffer.getChannelData(0)

        // Detect voice activity
        this.detectVoiceActivity(float32Samples)

        // Send to WebSocket if ready
        if (this.websocket && this.websocketReady && this.isSpeaking) {
          this.sendAudioViaWebSocket(float32Samples)
        }

        // Accumulate for JSON export if requested
        if (this.isAccumulating) {
          this.accumulatedAudio.push(new Float32Array(float32Samples))
        }
      }

      this.isRecording = true
      console.log('Continuous recording started')

    } catch (error) {
      console.error('Error starting recording:', error)
      throw error
    }
  }

  /**
   * Detect voice activity using RMS energy
   */
  private detectVoiceActivity(samples: Float32Array): void {
    // Calculate RMS (Root Mean Square)
    let sum = 0
    for (let i = 0; i < samples.length; i++) {
      sum += samples[i] * samples[i]
    }
    const rms = Math.sqrt(sum / samples.length)

    // Add to smoothing buffer
    this.rmsBuffer.push(rms)
    if (this.rmsBuffer.length > this.rmsBufferSize) {
      this.rmsBuffer.shift()
    }

    // Calculate smoothed RMS
    const smoothedRms = this.rmsBuffer.reduce((a, b) => a + b, 0) / this.rmsBuffer.length

    const now = Date.now()

    if (smoothedRms > this.vadThreshold) {
      // Voice detected
      if (!this.isSpeaking) {
        this.isSpeaking = true
        this.onSpeechStart?.()
      }
      this.silenceStart = now
    } else {
      // Silence detected
      if (this.isSpeaking && (now - this.silenceStart) > this.vadDebounceMs) {
        this.isSpeaking = false
        this.onSpeechEnd?.()
      }
    }
  }

  /**
   * Send audio data via WebSocket as ArrayBuffer
   */
  private sendAudioViaWebSocket(float32Array: Float32Array): void {
    if (!this.websocket || this.websocket.readyState !== WebSocket.OPEN || !this.websocketReady) {
      return
    }

    try {
      // Send as raw ArrayBuffer (Float32Array buffer)
      this.websocket.send(float32Array.buffer)
    } catch (error) {
      console.error('Error sending audio via WebSocket:', error)
    }
  }

  /**
   * Handle incoming WebSocket messages by forwarding them to the onMessage callback
   */
  private handleWebSocketMessage(event: MessageEvent): void {
    if (this.onMessage) {
      this.onMessage(event)
    } else {
      // No callback registered: log the raw message
      console.log('WebSocket message received:', event.data)
    }
  }

  /**
   * Stop recording and clean up resources
   */
  stopRecording(): void {
    this.isRecording = false
    this.isAccumulating = false
    this.websocketReady = false
    this.isSpeaking = false

    if (this.processor) {
      this.processor.disconnect()
      this.processor = null
    }

    if (this.source) {
      this.source.disconnect()
      this.source = null
    }

    if (this.audioContext) {
      this.audioContext.close()
      this.audioContext = null
    }

    if (this.stream) {
      this.stream.getTracks().forEach(track => track.stop())
      this.stream = null
    }

    if (this.websocket) {
      this.websocket.close()
      this.websocket = null
    }

    this.accumulatedAudio = []
    this.rmsBuffer = []
    console.log('Recording stopped and cleaned up')
  }

  /**
   * Check if currently recording
   */
  isCurrentlyRecording(): boolean {
    return this.isRecording
  }

  /**
   * Check if voice is detected
   */
  isCurrentlySpeaking(): boolean {
    return this.isSpeaking
  }

  // Optional callbacks
  onSpeechStart?: () => void
  onSpeechEnd?: () => void
  onMessage?: (event: MessageEvent) => void
}
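
The class sends float32Array.buffer, which is the entire underlying ArrayBuffer. That is fine when the typed array covers the whole buffer, but if the samples are ever a subarray view, the extra bytes outside the view would be sent too. The sketch below shows one way to extract exactly the bytes a view covers; exactBytes is a hypothetical helper name. In a browser, WebSocket.send also accepts a typed array directly and sends only the view's bytes.

```typescript
// Hypothetical helper: copies only the bytes covered by the view,
// ignoring any other data in the underlying buffer.
function exactBytes(samples: Float32Array) {
  return samples.buffer.slice(
    samples.byteOffset,
    samples.byteOffset + samples.byteLength
  )
}

const whole = new Float32Array([0.1, 0.2, 0.3, 0.4])
const view = whole.subarray(1, 3) // shares memory with `whole`

console.log(view.buffer.byteLength)      // size of the full underlying buffer
console.log(exactBytes(view).byteLength) // size of just the two viewed samples
```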

Usage Example: Real-time Voice Mode

// Initialize recorder
const recorder = new AudioRecorder(16000)

// Set up callbacks
recorder.onSpeechStart = () => {
  console.log('User started speaking')
}

recorder.onSpeechEnd = () => {
  console.log('User stopped speaking')
}

recorder.onMessage = (event) => {
  try {
    const response = JSON.parse(event.data)
    console.log('Received:', response)
    // Handle audio chunks - see Audio Playback documentation
  } catch (e) {
    // Binary audio data
  }
}

// Start recording with WebSocket streaming
const agentId = 123
const apiKey = 'your_api_key'
const wsUrl = `wss://api.autessa.com/ws/clients/agents/execute?authorization=${apiKey}&resourceId=${agentId}&voiceModeEnabled=true`

await recorder.startContinuousRecording(wsUrl, {
  outputType: 'AUDIO',
  agentId: agentId,
  environmentVariables: {},
  promptTemplateVariables: {}
})

// Later: stop recording
recorder.stopRecording()

For handling received audio chunks, see the Audio Playback documentation.
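
The usage example relies on JSON.parse throwing to tell text from binary data. A more explicit alternative is to branch on the type of event.data first. The sketch below is an illustration only: parseSocketData and the ParsedMessage type are hypothetical names, and the actual shape of server messages is not specified on this page. In browsers, binary frames arrive as Blob unless websocket.binaryType is set to 'arraybuffer'.

```typescript
type ParsedMessage =
  | { kind: 'json'; payload: unknown }
  | { kind: 'text'; text: string }
  | { kind: 'binary'; bytes: ArrayBuffer }

// Hypothetical helper: classifies a WebSocket message body by its runtime type.
function parseSocketData(data: unknown): ParsedMessage | null {
  if (typeof data === 'string') {
    try {
      return { kind: 'json', payload: JSON.parse(data) }
    } catch {
      // Not valid JSON: hand back the raw text
      return { kind: 'text', text: data }
    }
  }
  if (data instanceof ArrayBuffer) {
    return { kind: 'binary', bytes: data }
  }
  return null // e.g. a Blob, when binaryType was left at its default
}

console.log(parseSocketData('{"status":"ok"}'))
console.log(parseSocketData(new ArrayBuffer(8)))
```

With this in place, an onMessage callback could call parseSocketData(event.data) and switch on the kind field.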


Multimodal Audio Mode (S3 Upload)

For non-real-time use cases, record audio, convert to WAV, upload to S3, and send the S3 URI in a multimodal request.

Step 1: Record and Accumulate Audio

class AudioAccumulator {
  private sampleRate: number
  private audioContext: AudioContext | null = null
  private processor: ScriptProcessorNode | null = null
  private source: MediaStreamAudioSourceNode | null = null
  private stream: MediaStream | null = null
  private isRecording: boolean = false
  private chunks: Float32Array[] = []

  constructor(sampleRate: number = 16000) {
    this.sampleRate = sampleRate
  }

  async startRecording(): Promise<void> {
    try {
      this.stream = await navigator.mediaDevices.getUserMedia({ audio: true })
      this.audioContext = new (window.AudioContext || (window as any).webkitAudioContext)({
        sampleRate: this.sampleRate
      })

      this.source = this.audioContext.createMediaStreamSource(this.stream)
      this.processor = this.audioContext.createScriptProcessor(4096, 1, 1)

      this.source.connect(this.processor)
      this.processor.connect(this.audioContext.destination)

      this.processor.onaudioprocess = (event) => {
        // Copy the samples: the browser may reuse the input buffer
        const samples = event.inputBuffer.getChannelData(0)
        this.chunks.push(new Float32Array(samples))
      }

      this.isRecording = true
      this.chunks = []
    } catch (error) {
      console.error('Error starting recording:', error)
      throw error
    }
  }

  stopRecording(): Float32Array {
    this.isRecording = false

    if (this.processor) this.processor.disconnect()
    if (this.source) this.source.disconnect()
    if (this.audioContext) this.audioContext.close()
    if (this.stream) {
      this.stream.getTracks().forEach(track => track.stop())
    }

    // Combine all chunks
    const totalLength = this.chunks.reduce((sum, chunk) => sum + chunk.length, 0)
    const combined = new Float32Array(totalLength)

    let offset = 0
    for (const chunk of this.chunks) {
      combined.set(chunk, offset)
      offset += chunk.length
    }

    return combined
  }
}

Step 2: Convert to WAV Format

class WavConverter {
  static float32ToWav(float32Array: Float32Array, sampleRate: number): Blob {
    // Convert Float32 [-1, 1] to signed 16-bit PCM
    const int16Array = new Int16Array(float32Array.length)
    for (let i = 0; i < float32Array.length; i++) {
      const s = Math.max(-1, Math.min(1, float32Array[i]))
      int16Array[i] = s < 0 ? s * 0x8000 : s * 0x7FFF
    }

    const buffer = new ArrayBuffer(44 + int16Array.length * 2)
    const view = new DataView(buffer)

    // WAV header
    const writeString = (offset: number, str: string) => {
      for (let i = 0; i < str.length; i++) {
        view.setUint8(offset + i, str.charCodeAt(i))
      }
    }

    writeString(0, 'RIFF')
    view.setUint32(4, 36 + int16Array.length * 2, true) // RIFF chunk size
    writeString(8, 'WAVE')
    writeString(12, 'fmt ')
    view.setUint32(16, 16, true)             // fmt chunk size
    view.setUint16(20, 1, true)              // audio format: 1 = PCM
    view.setUint16(22, 1, true)              // channels: mono
    view.setUint32(24, sampleRate, true)     // sample rate
    view.setUint32(28, sampleRate * 2, true) // byte rate (mono, 2 bytes per sample)
    view.setUint16(32, 2, true)              // block align (bytes per frame)
    view.setUint16(34, 16, true)             // bits per sample
    writeString(36, 'data')
    view.setUint32(40, int16Array.length * 2, true) // data chunk size in bytes

    // Write audio data (copies host byte order; WAV expects little-endian,
    // which matches virtually all browser platforms)
    const wavData = new Uint8Array(buffer)
    wavData.set(new Uint8Array(int16Array.buffer), 44)

    return new Blob([wavData], { type: 'audio/wav' })
  }
}
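
Before uploading, it can be useful to sanity-check the generated file by reading the header back. The sketch below assumes the canonical 44-byte PCM header written by the converter above (it does not handle WAV files with extra chunks); readWavHeader and WavHeaderInfo are hypothetical names used for illustration.

```typescript
interface WavHeaderInfo {
  channels: number
  sampleRate: number
  bitsPerSample: number
  dataBytes: number
  durationSeconds: number
}

// Hypothetical helper: reads the fields of a canonical 44-byte WAV header.
function readWavHeader(buffer: ArrayBuffer): WavHeaderInfo {
  if (buffer.byteLength < 44) throw new Error('Buffer too small for a WAV header')
  const view = new DataView(buffer)
  const tag = (offset: number): string =>
    String.fromCharCode(
      view.getUint8(offset),
      view.getUint8(offset + 1),
      view.getUint8(offset + 2),
      view.getUint8(offset + 3)
    )
  if (tag(0) !== 'RIFF' || tag(8) !== 'WAVE') {
    throw new Error('Not a RIFF/WAVE file')
  }
  const channels = view.getUint16(22, true)
  const sampleRate = view.getUint32(24, true)
  const bitsPerSample = view.getUint16(34, true)
  const dataBytes = view.getUint32(40, true)
  const bytesPerSecond = sampleRate * channels * (bitsPerSample / 8)
  return {
    channels,
    sampleRate,
    bitsPerSample,
    dataBytes,
    durationSeconds: dataBytes / bytesPerSecond
  }
}
```

In a browser, the input could come from await wavBlob.arrayBuffer() on the Blob returned by the converter.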

Step 3: Upload to S3

class AudioUploader {
  private apiKey: string
  private baseUrl: string

  constructor(apiKey: string, baseUrl: string = 'https://api.autessa.com') {
    this.apiKey = apiKey
    this.baseUrl = baseUrl
  }

  async uploadAudio(agentId: number, wavBlob: Blob): Promise<string> {
    // Step 1: Get presigned URL
    const presignedResponse = await fetch(
      `${this.baseUrl}/clients/agents/generate-audio-upload-link?resourceId=${agentId}`,
      {
        method: 'POST',
        headers: {
          'Authorization': this.apiKey,
          'Content-Type': 'application/json'
        }
      }
    )

    if (!presignedResponse.ok) {
      throw new Error('Failed to get presigned URL')
    }

    const { uploadUrl, s3Uri } = await presignedResponse.json()

    // Step 2: Upload to S3
    const uploadResponse = await fetch(uploadUrl, {
      method: 'PUT',
      body: wavBlob,
      headers: {
        'Content-Type': 'audio/wav'
      }
    })

    if (!uploadResponse.ok) {
      throw new Error('Failed to upload to S3')
    }

    // Step 3: Return S3 URI
    return s3Uri
  }
}

Step 4: Use in Multimodal Request

// Complete workflow
async function recordAndSendAudio(agentId: number, apiKey: string) {
  // 1. Record audio
  const accumulator = new AudioAccumulator(16000)
  await accumulator.startRecording()

  console.log('Recording... (press any key to stop)')
  // ... wait for user to finish ...

  const audioData = accumulator.stopRecording()

  // 2. Convert to WAV
  const wavBlob = WavConverter.float32ToWav(audioData, 16000)

  // 3. Upload to S3
  const uploader = new AudioUploader(apiKey)
  const s3Uri = await uploader.uploadAudio(agentId, wavBlob)

  console.log('Audio uploaded:', s3Uri)

  // 4. Send multimodal request
  const response = await fetch(
    `https://api.autessa.com/clients/agents/execute?resourceId=${agentId}`,
    {
      method: 'POST',
      headers: {
        'Authorization': apiKey,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        agentId: agentId,
        input: [
          {
            inputType: 'AUDIO',
            s3Uri: s3Uri,
            audioFormat: {
              sampleRate: 16000,
              sampleSizeInBits: 16, // the WAV export is 16-bit PCM
              channels: 1,
              signed: true,
              bigEndian: false
            }
          }
        ],
        executionOutputMode: 'TEXT' // or 'AUDIO'
      })
    }
  )

  // Surface HTTP errors instead of parsing an error body as a result
  if (!response.ok) {
    throw new Error(`Agent request failed with status ${response.status}`)
  }

  const result = await response.json()
  console.log('Agent response:', result)
}

Audio Format Specifications

All audio in Autessa follows these specifications:

  • sampleRate (number): 16000 Hz (16 kHz)
  • channels (number): 1 (mono)
  • sampleSizeInBits (number): 32-bit for recording (Float32), 16-bit for WAV export (Int16)
  • signed (boolean): true
  • bigEndian (boolean): false (little-endian)
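
These values determine how much data each mode produces. As a rough worked example using only the numbers on this page (16 kHz, mono, 32-bit streaming versus 16-bit WAV, and the 4096-sample processor buffer used in the recorder classes), the raw data rate and per-chunk duration can be computed as follows. The function names are hypothetical.

```typescript
interface PcmFormat {
  sampleRate: number
  sampleSizeInBits: number
  channels: number
}

// Raw (uncompressed) data rate in bytes per second.
function bytesPerSecond(format: PcmFormat): number {
  return format.sampleRate * (format.sampleSizeInBits / 8) * format.channels
}

// Duration in milliseconds covered by one processing buffer.
function chunkDurationMs(samplesPerChunk: number, sampleRate: number): number {
  return (samplesPerChunk * 1000) / sampleRate
}

const streamingFormat: PcmFormat = { sampleRate: 16000, sampleSizeInBits: 32, channels: 1 }
const wavExportFormat: PcmFormat = { sampleRate: 16000, sampleSizeInBits: 16, channels: 1 }

console.log(bytesPerSecond(streamingFormat)) // 64000 bytes/s for Float32 streaming
console.log(bytesPerSecond(wavExportFormat)) // 32000 bytes/s for 16-bit WAV
console.log(chunkDurationMs(4096, 16000))    // 256 ms of audio per 4096-sample buffer
```

So one minute of 16-bit WAV audio is about 1.92 MB before the 44-byte header, and the same minute held in memory as Float32 is about 3.84 MB.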


Best Practices

Voice Activity Detection

The AudioRecorder class includes VAD to detect when the user is speaking:

  • Threshold: Adjust vadThreshold (default: 0.01) for sensitivity
  • Debounce: Set vadDebounceMs (default: 300ms) to avoid flickering
  • Callbacks: Use onSpeechStart and onSpeechEnd for UI updates
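
To experiment with threshold and debounce values outside the browser, the core of the detection logic can be isolated into a small, testable form. The sketch below is a simplified version of the approach in AudioRecorder: it omits the RMS smoothing buffer and takes the timestamp as a parameter instead of calling Date.now(). SimpleVad and rmsLevel are hypothetical names.

```typescript
// Root-mean-square level of a block of samples in the range [-1, 1].
function rmsLevel(samples: Float32Array): number {
  if (samples.length === 0) return 0
  const sumSquares = samples.reduce((acc, s) => acc + s * s, 0)
  return Math.sqrt(sumSquares / samples.length)
}

// Simplified threshold-plus-debounce detector (no smoothing buffer).
class SimpleVad {
  private threshold: number
  private debounceMs: number
  private speaking = false
  private lastVoiceMs = 0

  constructor(threshold: number = 0.01, debounceMs: number = 300) {
    this.threshold = threshold
    this.debounceMs = debounceMs
  }

  // Feed one level reading with its timestamp; returns the speaking state.
  update(level: number, nowMs: number): boolean {
    if (level > this.threshold) {
      this.speaking = true
      this.lastVoiceMs = nowMs
    } else if (this.speaking && nowMs - this.lastVoiceMs > this.debounceMs) {
      // Quiet for longer than the debounce window: speech has ended
      this.speaking = false
    }
    return this.speaking
  }
}

const vad = new SimpleVad()
console.log(vad.update(rmsLevel(new Float32Array([0.5, -0.5])), 0)) // loud block
console.log(vad.update(0.001, 200)) // quiet, but still inside the debounce window
console.log(vad.update(0.001, 400)) // quiet past the debounce window
```

A lower threshold picks up quieter speech but also more background noise; a longer debounce tolerates pauses between words at the cost of a slower end-of-speech signal.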

Memory Management

  • Always call stopRecording() when done to free resources
  • For long recordings, consider chunked processing
  • Use blob URLs sparingly and revoke them after use

Error Handling

try {
  await recorder.startContinuousRecording(wsUrl, config)
} catch (error) {
  // getUserMedia rejects with a DOMException; read its name defensively
  const name = error instanceof Error ? error.name : ''
  if (name === 'NotAllowedError') {
    console.error('Microphone permission denied')
  } else if (name === 'NotFoundError') {
    console.error('No microphone found')
  } else {
    console.error('Failed to start recording:', error)
  }
}
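
If the same checks are needed in several places, for example to show a message in the UI, they can be pulled into one function. This is a sketch only: describeRecordingError is a hypothetical helper and the message strings are placeholders. NotReadableError is another name getUserMedia can reject with, typically when the device is busy or fails at the hardware level.

```typescript
// Hypothetical helper: maps a thrown value to a user-facing message.
function describeRecordingError(error: unknown): string {
  const name = error instanceof Error ? error.name : ''
  switch (name) {
    case 'NotAllowedError':
      return 'Microphone permission denied'
    case 'NotFoundError':
      return 'No microphone found'
    case 'NotReadableError':
      return 'Microphone is unavailable or in use'
    default:
      return 'Failed to start recording'
  }
}

// Simulate the error shape a browser would produce
const denied = Object.assign(new Error('denied'), { name: 'NotAllowedError' })
console.log(describeRecordingError(denied))
```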

WebSocket Reconnection

For production use, implement reconnection logic:

let reconnectAttempts = 0
const maxReconnectAttempts = 3

websocket.onclose = () => {
  if (reconnectAttempts < maxReconnectAttempts) {
    reconnectAttempts++
    setTimeout(() => {
      console.log(`Reconnecting... (${reconnectAttempts}/${maxReconnectAttempts})`)
      // Restart recording
    }, 1000 * reconnectAttempts)
  }
}
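
The snippet above waits 1, 2, then 3 seconds (a linear backoff). A common alternative is exponential backoff with a cap, which spaces retries out faster when the server stays unreachable. The following sketch shows only the delay calculation; reconnectDelayMs and its default values are assumptions for illustration, not settings recommended by Autessa.

```typescript
// Hypothetical helper: exponential backoff delay for the given attempt (1-based),
// doubling from baseMs and never exceeding maxMs.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 1000,
  maxMs: number = 10000
): number {
  const safeAttempt = Math.max(1, Math.floor(attempt))
  return Math.min(maxMs, baseMs * 2 ** (safeAttempt - 1))
}

for (let attempt = 1; attempt <= 5; attempt++) {
  console.log(`attempt ${attempt}: wait ${reconnectDelayMs(attempt)} ms`)
}
```

The result can replace the 1000 * reconnectAttempts expression passed to setTimeout.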

Next Steps

  • Learn how to play received audio in the Audio Playback guide
  • See complete examples in the Agent API documentation
  • Explore multimodal capabilities in the Agent API reference