Audio Recording

Learn how to record and send audio to Autessa agents using two different approaches: real-time voice mode for conversational AI and multimodal audio upload for asynchronous processing.

Overview

Autessa supports two audio input methods:

Real-time Voice Mode: Stream audio continuously over a WebSocket opened with voiceModeEnabled=true, for conversational experiences
Multimodal Audio: Record audio, convert it to WAV, upload it to S3, and send the S3 URI as part of a multimodal request

Choose real-time voice mode for interactive conversations with low latency. Use multimodal audio when you need to process pre-recorded audio or don't require real-time interaction.

Real-time Voice Mode (WebSocket)

Real-time voice mode enables continuous bidirectional audio streaming for conversational AI experiences.

Connection Setup

Connect to the WebSocket endpoint with the voiceModeEnabled=true parameter:

const agentId = 123
const apiKey = 'your_api_key'
const wsUrl = `wss://api.autessa.com/ws/clients/agents/execute?authorization=${apiKey}&resourceId=${agentId}&voiceModeEnabled=true`

const websocket = new WebSocket(wsUrl)
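
Because the API key and agent ID are interpolated directly into the query string, a value containing reserved characters such as & or a space would corrupt the URL. The following is a minimal sketch of a safer way to assemble it, assuming the same endpoint and parameter names shown above. The helper name buildVoiceWsUrl is hypothetical and not part of any Autessa SDK.

```typescript
// Hypothetical helper: builds the voice-mode WebSocket URL shown above,
// percent-encoding each query value so reserved characters cannot break it.
function buildVoiceWsUrl(
  apiKey: string,
  agentId: number,
  baseUrl: string = 'wss://api.autessa.com/ws/clients/agents/execute'
): string {
  const params: Array<[string, string]> = [
    ['authorization', apiKey],
    ['resourceId', String(agentId)],
    ['voiceModeEnabled', 'true']
  ]
  const query = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join('&')
  return `${baseUrl}?${query}`
}

const exampleUrl = buildVoiceWsUrl('key with spaces&symbols', 123)
console.log(exampleUrl)
```

The result can be passed to new WebSocket(...) exactly like wsUrl above.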

Voice Establishment Message

After the WebSocket connection opens, send a voice establishment message to configure audio format and agent parameters:

websocket.onopen = () => {
  const establishmentMessage = {
    messageType: 'VOICE',
    outputType: 'AUDIO',  // or 'TEXT' for transcription only
    agentId: 123,
    audioFormat: {
      sampleRate: 16000,
      sampleSizeInBits: 32,
      channels: 1,
      signed: true,
      bigEndian: false
    },
    environmentVariables: {},
    promptTemplateVariables: {}
  }

  websocket.send(JSON.stringify(establishmentMessage))
  console.log('Voice mode established')
}
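
In TypeScript projects it can help to give the establishment message a type, so that a misspelled field or an unsupported outputType is caught at compile time. The sketch below mirrors the field names from the message above. The interface names and the buildEstablishmentMessage helper are illustrative assumptions, not an official schema.

```typescript
// Field names mirror the establishment message above; the type names
// themselves are illustrative and not part of any Autessa SDK.
interface AudioFormat {
  sampleRate: number
  sampleSizeInBits: number
  channels: number
  signed: boolean
  bigEndian: boolean
}

interface VoiceEstablishmentMessage {
  messageType: 'VOICE'
  outputType: 'AUDIO' | 'TEXT'
  agentId: number
  audioFormat: AudioFormat
  environmentVariables: Record<string, unknown>
  promptTemplateVariables: Record<string, unknown>
}

// Hypothetical builder that fills in the Float32 mono format used in this guide.
function buildEstablishmentMessage(
  agentId: number,
  outputType: 'AUDIO' | 'TEXT' = 'AUDIO',
  sampleRate: number = 16000
): VoiceEstablishmentMessage {
  return {
    messageType: 'VOICE',
    outputType,
    agentId,
    audioFormat: {
      sampleRate,
      sampleSizeInBits: 32, // Float32 samples, as in the example above
      channels: 1,
      signed: true,
      bigEndian: false
    },
    environmentVariables: {},
    promptTemplateVariables: {}
  }
}

const establishment = buildEstablishmentMessage(123, 'TEXT')
console.log(JSON.stringify(establishment, null, 2))
```

The serialized object is what websocket.send(JSON.stringify(...)) would transmit in the onopen handler.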

AudioRecorder Utility Class

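Here's a utility class for continuous audio recording with WebSocket streaming. It relies on ScriptProcessorNode, which is deprecated in favor of AudioWorklet but still widely supported in browsers: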

export class AudioRecorder {
  private sampleRate: number
  private audioContext: AudioContext | null = null
  private processor: ScriptProcessorNode | null = null
  private source: MediaStreamAudioSourceNode | null = null
  private stream: MediaStream | null = null
  private isRecording: boolean = false
  private websocket: WebSocket | null = null
  private websocketReady: boolean = false

  // Voice Activity Detection
  private vadThreshold: number = 0.01
  private vadDebounceMs: number = 300
  private isSpeaking: boolean = false
  private silenceStart: number = 0
  private rmsBuffer: number[] = []
  private readonly rmsBufferSize: number = 5

  // Audio accumulation for JSON export
  private accumulatedAudio: Float32Array[] = []
  private isAccumulating: boolean = false

  constructor(sampleRate: number = 16000) {
    this.sampleRate = sampleRate
  }

  /**
   * Start continuous recording with optional WebSocket streaming
   */
  async startContinuousRecording(
    websocketUrl: string | null = null,
    agentConfig: {
      outputType: string
      agentId: number
      environmentVariables?: Record<string, any>
      promptTemplateVariables?: Record<string, any>
    } | null = null
  ): Promise<void> {
    try {
      // Set up WebSocket if URL provided
      if (websocketUrl) {
        this.websocket = new WebSocket(websocketUrl)
        this.websocketReady = false

        await new Promise((resolve, reject) => {
          this.websocket!.onopen = () => {
            console.log('WebSocket connected')

            // Send voice establishment message
            if (agentConfig) {
              const establishmentPayload = {
                messageType: 'VOICE',
                outputType: agentConfig.outputType,
                agentId: agentConfig.agentId,
                audioFormat: {
                  sampleRate: this.sampleRate,
                  sampleSizeInBits: 32,
                  channels: 1,
                  signed: true,
                  bigEndian: false
                },
                environmentVariables: agentConfig.environmentVariables || {},
                promptTemplateVariables: agentConfig.promptTemplateVariables || {}
              }

              this.websocket!.send(JSON.stringify(establishmentPayload))
              console.log('Sent voice establishment message')
            } else {
              this.websocketReady = true
            }

            resolve(undefined)
          }

          this.websocket!.onmessage = (event) => {
            if (!this.websocketReady) {
              console.log('WebSocket ready to receive audio')
              this.websocketReady = true
            }
            this.handleWebSocketMessage(event)
          }

          this.websocket!.onerror = (error) => reject(error)
        })
      }

      // Request microphone access
      this.stream = await navigator.mediaDevices.getUserMedia({
        audio: {
          echoCancellation: true,
          noiseSuppression: true,
          autoGainControl: true
        }
      })

      // Create audio context with specified sample rate
      this.audioContext = new (window.AudioContext || (window as any).webkitAudioContext)({
        sampleRate: this.sampleRate
      })

      this.source = this.audioContext.createMediaStreamSource(this.stream)

      // Create processor with 4096 buffer size
      const bufferSize = 4096
      this.processor = this.audioContext.createScriptProcessor(bufferSize, 1, 1)

      this.source.connect(this.processor)
      this.processor.connect(this.audioContext.destination)

      // Process audio in real-time
      this.processor.onaudioprocess = (event) => {
        const float32Samples = event.inputBuffer.getChannelData(0)

        // Detect voice activity
        this.detectVoiceActivity(float32Samples)

        // Send to WebSocket if ready
        if (this.websocket && this.websocketReady && this.isSpeaking) {
          this.sendAudioViaWebSocket(float32Samples)
        }

        // Accumulate for JSON export if requested
        if (this.isAccumulating) {
          this.accumulatedAudio.push(new Float32Array(float32Samples))
        }
      }

      this.isRecording = true
      console.log('Continuous recording started')

    } catch (error) {
      console.error('Error starting recording:', error)
      throw error
    }
  }

  /**
   * Detect voice activity using RMS energy
   */
  private detectVoiceActivity(samples: Float32Array): void {
    // Calculate RMS (Root Mean Square)
    let sum = 0
    for (let i = 0; i < samples.length; i++) {
      sum += samples[i] * samples[i]
    }
    const rms = Math.sqrt(sum / samples.length)

    // Add to smoothing buffer
    this.rmsBuffer.push(rms)
    if (this.rmsBuffer.length > this.rmsBufferSize) {
      this.rmsBuffer.shift()
    }

    // Calculate smoothed RMS
    const smoothedRms = this.rmsBuffer.reduce((a, b) => a + b, 0) / this.rmsBuffer.length

    const now = Date.now()

    if (smoothedRms > this.vadThreshold) {
      // Voice detected
      if (!this.isSpeaking) {
        this.isSpeaking = true
        this.onSpeechStart?.()
      }
      this.silenceStart = now
    } else {
      // Silence detected
      if (this.isSpeaking && (now - this.silenceStart) > this.vadDebounceMs) {
        this.isSpeaking = false
        this.onSpeechEnd?.()
      }
    }
  }

  /**
   * Send audio data via WebSocket as ArrayBuffer
   */
  private sendAudioViaWebSocket(float32Array: Float32Array): void {
    if (!this.websocket || this.websocket.readyState !== WebSocket.OPEN || !this.websocketReady) {
      return
    }

    try {
      // Send as raw ArrayBuffer (Float32Array buffer)
      this.websocket.send(float32Array.buffer)
    } catch (error) {
      console.error('Error sending audio via WebSocket:', error)
    }
  }

  /**
   * Handle incoming WebSocket messages by forwarding them to the
   * optional onMessage callback
   */
  private handleWebSocketMessage(event: MessageEvent): void {
    if (this.onMessage) {
      this.onMessage(event)
    } else {
      console.log('WebSocket message received:', event.data)
    }
  }

  /**
   * Stop recording and clean up resources
   */
  stopRecording(): void {
    this.isRecording = false
    this.isAccumulating = false
    this.websocketReady = false
    this.isSpeaking = false

    if (this.processor) {
      this.processor.disconnect()
      this.processor = null
    }

    if (this.source) {
      this.source.disconnect()
      this.source = null
    }

    if (this.audioContext) {
      this.audioContext.close()
      this.audioContext = null
    }

    if (this.stream) {
      this.stream.getTracks().forEach(track => track.stop())
      this.stream = null
    }

    if (this.websocket) {
      this.websocket.close()
      this.websocket = null
    }

    this.accumulatedAudio = []
    this.rmsBuffer = []
    console.log('Recording stopped and cleaned up')
  }

  /**
   * Check if currently recording
   */
  isCurrentlyRecording(): boolean {
    return this.isRecording
  }

  /**
   * Check if voice is detected
   */
  isCurrentlySpeaking(): boolean {
    return this.isSpeaking
  }

  // Optional callbacks
  onSpeechStart?: () => void
  onSpeechEnd?: () => void
  onMessage?: (event: MessageEvent) => void
}

Usage Example: Real-time Voice Mode

// Initialize recorder
const recorder = new AudioRecorder(16000)

// Set up callbacks
recorder.onSpeechStart = () => {
  console.log('User started speaking')
}

recorder.onSpeechEnd = () => {
  console.log('User stopped speaking')
}

recorder.onMessage = (event) => {
  try {
    const response = JSON.parse(event.data)
    console.log('Received:', response)
    // Handle audio chunks - see Audio Playback documentation
  } catch (e) {
    // Binary audio data
  }
}

// Start recording with WebSocket streaming
const agentId = 123
const apiKey = 'your_api_key'
const wsUrl = `wss://api.autessa.com/ws/clients/agents/execute?authorization=${apiKey}&resourceId=${agentId}&voiceModeEnabled=true`

await recorder.startContinuousRecording(wsUrl, {
  outputType: 'AUDIO',
  agentId: agentId,
  environmentVariables: {},
  promptTemplateVariables: {}
})

// Later: stop recording
recorder.stopRecording()

For handling received audio chunks, see the Audio Playback documentation.
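
The onMessage callback in the usage example treats a JSON.parse failure as the sign of binary data. Checking the type of event.data first is more direct. The sketch below assumes that JSON responses arrive as text frames and audio arrives as binary frames, which this guide does not confirm, and that websocket.binaryType has been set to 'arraybuffer' so binary frames are delivered as ArrayBuffer rather than Blob. The names IncomingMessage and classifyMessage are hypothetical.

```typescript
// Hypothetical classification of a WebSocket payload (the value of event.data).
type IncomingMessage =
  | { kind: 'json'; payload: unknown }
  | { kind: 'text'; text: string }
  | { kind: 'binary'; bytes: ArrayBuffer }
  | { kind: 'unknown' }

function classifyMessage(data: unknown): IncomingMessage {
  if (typeof data === 'string') {
    try {
      return { kind: 'json', payload: JSON.parse(data) }
    } catch {
      // A text frame that is not valid JSON
      return { kind: 'text', text: data }
    }
  }
  if (data instanceof ArrayBuffer) {
    // Requires websocket.binaryType = 'arraybuffer'
    return { kind: 'binary', bytes: data }
  }
  return { kind: 'unknown' }
}

console.log(classifyMessage('{"status":"ok"}').kind)
console.log(classifyMessage(new ArrayBuffer(16)).kind)
```

Inside the callback this would be used as classifyMessage(event.data), followed by a switch on the kind field.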

Multimodal Audio Mode (S3 Upload)

For non-real-time use cases, record audio, convert to WAV, upload to S3, and send the S3 URI in a multimodal request.

Step 1: Record and Accumulate Audio

class AudioAccumulator {
  private sampleRate: number
  private audioContext: AudioContext | null = null
  private processor: ScriptProcessorNode | null = null
  private source: MediaStreamAudioSourceNode | null = null
  private stream: MediaStream | null = null
  private isRecording: boolean = false
  private chunks: Float32Array[] = []

  constructor(sampleRate: number = 16000) {
    this.sampleRate = sampleRate
  }

  async startRecording(): Promise<void> {
    try {
      this.stream = await navigator.mediaDevices.getUserMedia({ audio: true })
      this.audioContext = new (window.AudioContext || (window as any).webkitAudioContext)({
        sampleRate: this.sampleRate
      })

      this.source = this.audioContext.createMediaStreamSource(this.stream)
      this.processor = this.audioContext.createScriptProcessor(4096, 1, 1)

      this.source.connect(this.processor)
      this.processor.connect(this.audioContext.destination)

      // Clear chunks from any previous recording before new audio arrives
      this.chunks = []

      this.processor.onaudioprocess = (event) => {
        const samples = event.inputBuffer.getChannelData(0)
        this.chunks.push(new Float32Array(samples))
      }

      this.isRecording = true
    } catch (error) {
      console.error('Error starting recording:', error)
      throw error
    }
  }

  stopRecording(): Float32Array {
    this.isRecording = false

    if (this.processor) this.processor.disconnect()
    if (this.source) this.source.disconnect()
    if (this.audioContext) this.audioContext.close()
    if (this.stream) {
      this.stream.getTracks().forEach(track => track.stop())
    }

    // Combine all chunks
    const totalLength = this.chunks.reduce((sum, chunk) => sum + chunk.length, 0)
    const combined = new Float32Array(totalLength)

    let offset = 0
    for (const chunk of this.chunks) {
      combined.set(chunk, offset)
      offset += chunk.length
    }

    return combined
  }
}

Step 2: Convert to WAV Format

class WavConverter {
  static float32ToWav(float32Array: Float32Array, sampleRate: number): Blob {
    // Convert Float32 to Int16
    const int16Array = new Int16Array(float32Array.length)
    for (let i = 0; i < float32Array.length; i++) {
      const s = Math.max(-1, Math.min(1, float32Array[i]))
      int16Array[i] = s < 0 ? s * 0x8000 : s * 0x7FFF
    }

    const buffer = new ArrayBuffer(44 + int16Array.length * 2)
    const view = new DataView(buffer)

    // WAV header
    const writeString = (offset: number, str: string) => {
      for (let i = 0; i < str.length; i++) {
        view.setUint8(offset + i, str.charCodeAt(i))
      }
    }

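    writeString(0, 'RIFF')
    view.setUint32(4, 36 + int16Array.length * 2, true)  // RIFF chunk size: file size minus 8
    writeString(8, 'WAVE')
    writeString(12, 'fmt ')
    view.setUint32(16, 16, true)              // fmt chunk size (16 for PCM)
    view.setUint16(20, 1, true)               // audio format: 1 = PCM
    view.setUint16(22, 1, true)               // channels: mono
    view.setUint32(24, sampleRate, true)      // sample rate
    view.setUint32(28, sampleRate * 2, true)  // byte rate: sampleRate * block align
    view.setUint16(32, 2, true)               // block align: channels * bytes per sample
    view.setUint16(34, 16, true)              // bits per sample
    writeString(36, 'data')
    view.setUint32(40, int16Array.length * 2, true)  // data chunk size in bytes

    // Write samples through the DataView so they are little-endian on any host
    for (let i = 0; i < int16Array.length; i++) {
      view.setInt16(44 + i * 2, int16Array[i], true)
    }

    return new Blob([buffer], { type: 'audio/wav' })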
  }
}
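
Before uploading, it can be worth reading the header back to confirm that it matches the format you intend to declare in the request. The sketch below parses the canonical 44-byte PCM layout that the converter above writes. The names WavHeader and readWavHeader are hypothetical, and the parser deliberately does not handle WAV files that carry extra chunks before the data chunk.

```typescript
interface WavHeader {
  audioFormat: number
  channels: number
  sampleRate: number
  byteRate: number
  blockAlign: number
  bitsPerSample: number
  dataBytes: number
}

// Reads the fixed 44-byte PCM header layout; all multi-byte fields are little-endian.
function readWavHeader(buffer: ArrayBuffer): WavHeader {
  if (buffer.byteLength < 44) {
    throw new Error('Buffer too small for a WAV header')
  }
  const view = new DataView(buffer)
  const tag = (offset: number): string => {
    let s = ''
    for (let i = 0; i < 4; i++) s += String.fromCharCode(view.getUint8(offset + i))
    return s
  }
  if (tag(0) !== 'RIFF' || tag(8) !== 'WAVE' || tag(12) !== 'fmt ' || tag(36) !== 'data') {
    throw new Error('Not a canonical 44-byte-header WAV file')
  }
  return {
    audioFormat: view.getUint16(20, true),
    channels: view.getUint16(22, true),
    sampleRate: view.getUint32(24, true),
    byteRate: view.getUint32(28, true),
    blockAlign: view.getUint16(32, true),
    bitsPerSample: view.getUint16(34, true),
    dataBytes: view.getUint32(40, true)
  }
}

// Demo: a header followed by four silent 16-bit mono samples at 16 kHz
const demoWav = new ArrayBuffer(44 + 8)
const demoView = new DataView(demoWav)
const putTag = (offset: number, s: string): void => {
  for (let i = 0; i < s.length; i++) demoView.setUint8(offset + i, s.charCodeAt(i))
}
putTag(0, 'RIFF'); demoView.setUint32(4, 36 + 8, true); putTag(8, 'WAVE'); putTag(12, 'fmt ')
demoView.setUint32(16, 16, true); demoView.setUint16(20, 1, true); demoView.setUint16(22, 1, true)
demoView.setUint32(24, 16000, true); demoView.setUint32(28, 32000, true)
demoView.setUint16(32, 2, true); demoView.setUint16(34, 16, true)
putTag(36, 'data'); demoView.setUint32(40, 8, true)

const parsedHeader = readWavHeader(demoWav)
console.log(parsedHeader)
```

With a Blob produced by WavConverter.float32ToWav, the input would come from await wavBlob.arrayBuffer().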

Step 3: Upload to S3

class AudioUploader {
  private apiKey: string
  private baseUrl: string

  constructor(apiKey: string, baseUrl: string = 'https://api.autessa.com') {
    this.apiKey = apiKey
    this.baseUrl = baseUrl
  }

  async uploadAudio(agentId: number, wavBlob: Blob): Promise<string> {
    // Step 1: Get presigned URL
    const presignedResponse = await fetch(
      `${this.baseUrl}/clients/agents/generate-audio-upload-link?resourceId=${agentId}`,
      {
        method: 'POST',
        headers: {
          'Authorization': this.apiKey,
          'Content-Type': 'application/json'
        }
      }
    )

    if (!presignedResponse.ok) {
      throw new Error(`Failed to get presigned URL (status ${presignedResponse.status})`)
    }

    const { uploadUrl, s3Uri } = await presignedResponse.json()

    // Step 2: Upload to S3
    const uploadResponse = await fetch(uploadUrl, {
      method: 'PUT',
      body: wavBlob,
      headers: {
        'Content-Type': 'audio/wav'
      }
    })

    if (!uploadResponse.ok) {
      throw new Error(`Failed to upload to S3 (status ${uploadResponse.status})`)
    }

    // Step 3: Return S3 URI
    return s3Uri
  }
}

Step 4: Use in Multimodal Request

// Complete workflow
async function recordAndSendAudio(agentId: number, apiKey: string) {
  // 1. Record audio
  const accumulator = new AudioAccumulator(16000)
  await accumulator.startRecording()

  console.log('Recording... (press any key to stop)')
  // ... wait for user to finish ...

  const audioData = accumulator.stopRecording()

  // 2. Convert to WAV
  const wavBlob = WavConverter.float32ToWav(audioData, 16000)

  // 3. Upload to S3
  const uploader = new AudioUploader(apiKey)
  const s3Uri = await uploader.uploadAudio(agentId, wavBlob)

  console.log('Audio uploaded:', s3Uri)

  // 4. Send multimodal request
  const response = await fetch(
    `https://api.autessa.com/clients/agents/execute?resourceId=${agentId}`,
    {
      method: 'POST',
      headers: {
        'Authorization': apiKey,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        agentId: agentId,
        input: [
          {
            inputType: 'AUDIO',
            s3Uri: s3Uri,
            audioFormat: {
              sampleRate: 16000,
              sampleSizeInBits: 16,
              channels: 1,
              signed: true,
              bigEndian: false
            }
          }
        ],
        executionOutputMode: 'TEXT' // or 'AUDIO'
      })
    }
  )

  if (!response.ok) {
    throw new Error(`Agent request failed (status ${response.status})`)
  }

  const result = await response.json()
  console.log('Agent response:', result)
}

Audio Format Specifications

All audio in Autessa follows these specifications:

sampleRate (number): 16000 Hz (16 kHz)
channels (number): 1 (mono)
sampleSizeInBits (number): 32-bit for recording (Float32), 16-bit for WAV export (Int16)
signed (boolean): true
bigEndian (boolean): false (little-endian)
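
These values determine how much data each mode produces. As a worked example of the arithmetic, the sketch below computes the byte rate for the 32-bit streaming format and the 16-bit WAV format, the duration of one 4096-frame processor buffer, and the size of a one-minute WAV file with a 44-byte header. The helper names are illustrative.

```typescript
interface PcmFormat {
  sampleRate: number
  sampleSizeInBits: number
  channels: number
}

// Bytes of raw PCM produced per second of audio
function bytesPerSecond(format: PcmFormat): number {
  return format.sampleRate * (format.sampleSizeInBits / 8) * format.channels
}

// Duration in milliseconds of one processing buffer of the given frame count
function chunkDurationMs(framesPerChunk: number, sampleRate: number): number {
  return (framesPerChunk * 1000) / sampleRate
}

// Size of a WAV file with the canonical 44-byte header
function wavFileBytes(seconds: number, format: PcmFormat): number {
  return 44 + seconds * bytesPerSecond(format)
}

const streamingFormat: PcmFormat = { sampleRate: 16000, sampleSizeInBits: 32, channels: 1 }
const wavFormat: PcmFormat = { sampleRate: 16000, sampleSizeInBits: 16, channels: 1 }

console.log(bytesPerSecond(streamingFormat)) // 64000
console.log(bytesPerSecond(wavFormat))       // 32000
console.log(chunkDurationMs(4096, 16000))    // 256
console.log(wavFileBytes(60, wavFormat))     // 1920044
```

In other words, each 4096-frame buffer sent in voice mode carries about a quarter of a second of audio in 16384 bytes, and a minute of recorded WAV audio is just under 2 MB.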

Best Practices

Voice Activity Detection

The AudioRecorder class includes VAD to detect when the user is speaking:

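Threshold: vadThreshold (default: 0.01) sets sensitivity as a smoothed RMS level; lower values trigger on quieter input
Debounce: vadDebounceMs (default: 300 ms) is how long silence must last before speech counts as ended, which avoids flickering
Both are private fields in the class as written, so adjust them in the class itself or expose them through the constructor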
Callbacks: Use onSpeechStart and onSpeechEnd for UI updates
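
A fixed threshold of 0.01 may be too sensitive in a noisy room or too strict with a quiet microphone. One possible approach, shown here as a sketch rather than as part of AudioRecorder, is to measure the RMS of a few frames of ambient noise and set the threshold to a multiple of that level. The calibrateThreshold helper, its margin of 3, and its floor of 0.005 are all assumptions chosen for illustration.

```typescript
// Root mean square energy of one frame, the same measure the recorder uses
function rms(samples: Float32Array): number {
  if (samples.length === 0) return 0
  let sum = 0
  for (let i = 0; i < samples.length; i++) {
    sum += samples[i] * samples[i]
  }
  return Math.sqrt(sum / samples.length)
}

// Hypothetical calibration: threshold = margin x mean RMS of frames captured
// while nobody is speaking, never lower than the given floor.
function calibrateThreshold(
  silenceFrames: Float32Array[],
  margin: number = 3,
  floor: number = 0.005
): number {
  if (silenceFrames.length === 0) return floor
  let total = 0
  for (const frame of silenceFrames) {
    total += rms(frame)
  }
  return Math.max(floor, (total / silenceFrames.length) * margin)
}

// Synthetic frames stand in for microphone input in this demo
function constantFrame(value: number, length: number): Float32Array {
  const frame = new Float32Array(length)
  for (let i = 0; i < length; i++) frame[i] = value
  return frame
}

const ambientFrames = [constantFrame(0.002, 4096), constantFrame(0.003, 4096)]
const calibrated = calibrateThreshold(ambientFrames)
const speechFrame = new Float32Array([0.5, -0.5, 0.5, -0.5])

console.log('threshold:', calibrated)
console.log('speech detected:', rms(speechFrame) > calibrated)
```

In a real page the ambient frames would be collected from onaudioprocess during the first moments after the microphone opens.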

Memory Management

Always call stopRecording() when done to free resources
For long recordings, consider chunked processing
Use blob URLs sparingly and revoke them after use
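
As one way to keep memory bounded during long recordings, the sketch below caps the amount of buffered audio and discards the oldest chunks once the cap is exceeded. This is an illustrative policy, not something AudioRecorder or AudioAccumulator does. An alternative with no data loss would be to flush and upload each segment when the cap is reached. The class name BoundedChunkBuffer is hypothetical.

```typescript
// Hypothetical buffer that keeps at most maxSeconds of the most recent audio
class BoundedChunkBuffer {
  private chunks: Float32Array[] = []
  private totalSamples: number = 0
  private readonly sampleRate: number
  private readonly maxSamples: number

  constructor(sampleRate: number, maxSeconds: number) {
    this.sampleRate = sampleRate
    this.maxSamples = Math.floor(sampleRate * maxSeconds)
  }

  push(chunk: Float32Array): void {
    // Copy, because the audio callback may reuse its buffer
    this.chunks.push(new Float32Array(chunk))
    this.totalSamples += chunk.length

    // Drop the oldest chunks until the buffer fits, always keeping the newest
    while (this.totalSamples > this.maxSamples && this.chunks.length > 1) {
      const dropped = this.chunks.shift()
      if (!dropped) break
      this.totalSamples -= dropped.length
    }
  }

  bufferedSeconds(): number {
    return this.totalSamples / this.sampleRate
  }

  // Return everything buffered as one array and reset the buffer
  flush(): Float32Array {
    const combined = new Float32Array(this.totalSamples)
    let offset = 0
    for (const chunk of this.chunks) {
      combined.set(chunk, offset)
      offset += chunk.length
    }
    this.chunks = []
    this.totalSamples = 0
    return combined
  }
}

const recentAudio = new BoundedChunkBuffer(16000, 1)
for (let i = 0; i < 5; i++) {
  recentAudio.push(new Float32Array(4096))
}
console.log(recentAudio.bufferedSeconds())
const flushedAudio = recentAudio.flush()
console.log(flushedAudio.length) // 12288
```

Five 4096-sample chunks exceed the one-second cap of 16000 samples, so the two oldest are dropped and three chunks remain.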

Error Handling

try {
  await recorder.startContinuousRecording(wsUrl, config)
} catch (error) {
  // getUserMedia rejects with a DOMException whose name identifies the cause
  const errorName = error instanceof Error ? error.name : ''
  if (errorName === 'NotAllowedError') {
    console.error('Microphone permission denied')
  } else if (errorName === 'NotFoundError') {
    console.error('No microphone found')
  } else {
    console.error('Failed to start recording:', error)
  }
}

WebSocket Reconnection

For production use, implement reconnection logic:

let reconnectAttempts = 0
const maxReconnectAttempts = 3

websocket.onclose = () => {
  if (reconnectAttempts < maxReconnectAttempts) {
    reconnectAttempts++
    setTimeout(() => {
      console.log(`Reconnecting... (${reconnectAttempts}/${maxReconnectAttempts})`)
      // Restart recording
    }, 1000 * reconnectAttempts)
  }
}
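
The handler above waits 1000 ms times the attempt number. A common alternative is exponential backoff with a cap and some random jitter, so that many clients do not reconnect at the same instant. The sketch below is one such variant, not the method this guide prescribes. The function name reconnectDelayMs and its default values are assumptions.

```typescript
// Hypothetical helper: exponential backoff with a cap and "equal jitter".
// Half of the capped delay is fixed and the other half is randomized.
// The jitter source is injectable so the function can be tested deterministically.
function reconnectDelayMs(
  attempt: number,
  baseMs: number = 1000,
  maxMs: number = 30000,
  jitter: () => number = Math.random
): number {
  const safeAttempt = Math.max(1, attempt)
  const exponential = Math.min(maxMs, baseMs * Math.pow(2, safeAttempt - 1))
  const half = exponential / 2
  return Math.floor(half + jitter() * half)
}

for (let attempt = 1; attempt <= 5; attempt++) {
  console.log(`attempt ${attempt}: wait ${reconnectDelayMs(attempt)} ms`)
}
```

Inside the onclose handler, the setTimeout delay would become reconnectDelayMs(reconnectAttempts). Resetting reconnectAttempts to 0 once a new connection opens keeps a later disconnect from starting at the longest delay.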

Next Steps

Learn how to play received audio in the Audio Playback guide
See complete examples in the Agent API documentation
Explore multimodal capabilities in the Agent API reference
