Most robot voice interfaces rely on automatic speech recognition (ASR) and treat the transcribed text as the full meaning of an utterance. In real human communication, however, how something is said carries critical information: hesitation, uncertainty, and lack of decisiveness can signal that a command should not be executed yet. This project targets that missing layer by teaching a robot to detect “uncertain vs. decisive” speech from audio alone, enabling safer and more natural interaction.
In this 10-week Bachelor project, the student will build an audio-only model that estimates a continuous “hesitation/decisiveness” score (and a calibrated confidence) from short spoken commands. Training will start from the public CREMA-D dataset to learn transferable prosodic and paralinguistic cues; the model will then be fine-tuned on a small in-lab command dataset recorded with our microphone setup and speaking styles. The focus is not generic emotion labeling, but a decision-relevant signal that remains robust across speakers and noise levels while running in real time.
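To make the target signal concrete, here is a minimal, hand-crafted sketch of how prosodic cues such as pausing and energy variability could be mapped to a hesitation score. The feature choices, weights, and thresholds are illustrative placeholders; in the actual project a model trained on CREMA-D and fine-tuned on in-lab data would replace this hand-tuned logistic combination.

```python
import numpy as np

def prosody_features(y, sr, frame_len=400, hop=160, silence_db=-35.0):
    """Compute two simple prosodic proxies from a mono waveform:
    the fraction of near-silent frames (pauses) and the spread of
    frame energies in dB (a rough variability proxy)."""
    n_frames = 1 + max(0, len(y) - frame_len) // hop
    rms = np.array([
        np.sqrt(np.mean(y[i * hop : i * hop + frame_len] ** 2) + 1e-12)
        for i in range(n_frames)
    ])
    db = 20.0 * np.log10(rms / (rms.max() + 1e-12) + 1e-12)
    pause_ratio = float(np.mean(db < silence_db))  # fraction of silent frames
    energy_var = float(np.std(db))                 # energy spread in dB
    return pause_ratio, energy_var

def hesitation_score(pause_ratio, energy_var, w=(4.0, 0.05), b=-2.0):
    """Logistic combination of cues -> score in [0, 1].
    The weights here are placeholders, not learned values."""
    z = w[0] * pause_ratio + w[1] * energy_var + b
    return float(1.0 / (1.0 + np.exp(-z)))

# Illustration: the same 1-second tone, fluent vs. interrupted by pauses.
sr = 16000
t = np.arange(sr) / sr
fluent = 0.5 * np.sin(2 * np.pi * 220 * t)
hesitant = fluent.copy()
hesitant[4000:6000] = 0.0   # inserted pauses
hesitant[9000:12000] = 0.0
```

The point of the sketch is the interface, not the features: the module consumes a short audio snippet and returns a bounded score, so the downstream control loop never needs to see raw audio or a transcript.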
The final step is integration as a lightweight module in a voice control loop (e.g., ROS): when hesitation is high or confidence is low, the system triggers a clarification question and waits for a clear confirmation; when speech is decisive, the command is executed normally. Deliverables include a reproducible training/fine-tuning pipeline, a simple data collection and labeling protocol for the in-lab command dataset, quantitative evaluation (robustness, latency, calibration), and a short demonstration on a robot or ROS-based mockup showing “same words, different certainty → different system behavior.”