Detect objects
Your vision service is configured and running, but you need to do something useful with the results. This how-to shows you how to retrieve detections programmatically, filter them to reduce noise, and extract the information you need to build real applications – counting objects, triggering actions when something appears, or feeding positions into a control loop.
Concepts
What a detection contains
A detection is a single recognized object in an image. Each detection includes:
| Field | Type | Description |
|---|---|---|
| class_name | String | The label assigned by the model (for example, "person", "dog", "stop-sign") |
| confidence | Float (0.0-1.0) | How confident the model is in this detection |
| x_min | Integer | Left edge of the bounding box in pixels |
| y_min | Integer | Top edge of the bounding box in pixels |
| x_max | Integer | Right edge of the bounding box in pixels |
| y_max | Integer | Bottom edge of the bounding box in pixels |
The bounding box coordinates use the image coordinate system: (0,0) is the top-left corner, x increases to the right, y increases downward. The bounding box is axis-aligned (not rotated).
Detections also include normalized coordinates (x_min_normalized, y_min_normalized, x_max_normalized, y_max_normalized) as floats between 0.0 and 1.0, representing positions relative to image dimensions. These are useful when you need resolution-independent coordinates.
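Converting normalized coordinates back to pixels only requires the image dimensions. A minimal sketch (the helper name is illustrative, not part of the SDK):

```python
def normalized_to_pixels(x_norm: float, y_norm: float,
                         width: int, height: int) -> tuple[int, int]:
    """Convert resolution-independent coordinates to pixel coordinates."""
    return (round(x_norm * width), round(y_norm * height))

# A point at (0.25, 0.5) in a 640x480 image:
x, y = normalized_to_pixels(0.25, 0.5, 640, 480)
print(x, y)  # 160 240
```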
A single call to the detection API can return zero, one, or many detections. The number depends on how many objects the model finds in the frame.
GetDetections vs GetDetectionsFromCamera
The vision service provides two methods for getting detections:
- GetDetectionsFromCamera takes a camera name and handles everything: it captures an image from the camera, runs it through the model, and returns detections. This is the simplest approach and the one you should use in most cases.
- GetDetections takes an image you already have and runs it through the model. Use this when you need to run detection on an image you captured separately, an image from a file, or an image you have preprocessed.
Confidence thresholds
Every detection has a confidence score between 0.0 and 1.0. A score of 0.95 means the model is very confident; a score of 0.3 means it is guessing. Models often produce many low-confidence detections that are noise rather than real objects.
Choosing the right threshold depends on your application:
- Safety-critical applications (such as detecting people before a robot moves): use a low threshold (0.3-0.5) to avoid missing real objects, and accept some false positives.
- Counting or logging applications: use a higher threshold (0.7-0.9) to reduce noise and get accurate counts.
- Start at 0.5 and adjust based on what you observe.
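To get a feel for how the threshold changes what survives filtering, you can sweep several values over a batch of confidence scores. A small sketch with made-up scores:

```python
# Made-up confidence scores from a single frame:
sample_confidences = [0.95, 0.81, 0.64, 0.52, 0.33, 0.12]

for threshold in (0.3, 0.5, 0.7, 0.9):
    kept = [c for c in sample_confidences if c >= threshold]
    print(f"threshold {threshold}: kept {len(kept)} of {len(sample_confidences)}")
```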
The color_detector alternative
Not every detection task requires a trained ML model. Viam includes a built-in color_detector vision service model that detects regions of a specified color. This is useful for simple tasks like finding a red ball or detecting a blue marker. It requires no model training, no GPU, and minimal configuration.
Steps
1. Get detections from a camera
The simplest way to get detections is to let the vision service capture an image and run the model in one call.
```python
import asyncio

from viam.robot.client import RobotClient
from viam.services.vision import VisionClient


async def main():
    opts = RobotClient.Options.with_api_key(
        api_key="YOUR-API-KEY",
        api_key_id="YOUR-API-KEY-ID"
    )
    robot = await RobotClient.at_address("YOUR-MACHINE-ADDRESS", opts)

    detector = VisionClient.from_robot(robot, "my-detector")

    # Get detections directly from the camera
    detections = await detector.get_detections_from_camera("my-camera")

    for d in detections:
        print(f"{d.class_name}: {d.confidence:.2f} "
              f"at ({d.x_min},{d.y_min})-({d.x_max},{d.y_max})")

    await robot.close()


if __name__ == "__main__":
    asyncio.run(main())
```
```go
package main

import (
	"context"
	"fmt"

	"go.viam.com/rdk/logging"
	"go.viam.com/rdk/robot/client"
	"go.viam.com/rdk/services/vision"
	"go.viam.com/utils/rpc"
)

func main() {
	ctx := context.Background()
	logger := logging.NewLogger("detect")

	machine, err := client.New(ctx, "YOUR-MACHINE-ADDRESS", logger,
		client.WithDialOptions(rpc.WithEntityCredentials(
			"YOUR-API-KEY-ID",
			rpc.Credentials{
				Type:    rpc.CredentialsTypeAPIKey,
				Payload: "YOUR-API-KEY",
			})),
	)
	if err != nil {
		logger.Fatal(err)
	}
	defer machine.Close(ctx)

	detector, err := vision.FromProvider(machine, "my-detector")
	if err != nil {
		logger.Fatal(err)
	}

	detections, err := detector.DetectionsFromCamera(ctx, "my-camera", nil)
	if err != nil {
		logger.Fatal(err)
	}

	for _, d := range detections {
		fmt.Printf("%s: %.2f at (%d,%d)-(%d,%d)\n",
			d.Label(), d.Score(),
			d.BoundingBox().Min.X, d.BoundingBox().Min.Y,
			d.BoundingBox().Max.X, d.BoundingBox().Max.Y)
	}
}
```
2. Get detections from an existing image
If you already have an image – from a file, from a previous capture, or from a different camera – you can run detection on it directly.
```python
from viam.components.camera import Camera
from viam.services.vision import VisionClient

camera = Camera.from_robot(robot, "my-camera")
detector = VisionClient.from_robot(robot, "my-detector")

# Capture images from the camera
images, _ = await camera.get_images()

# Run detection on the first image
detections = await detector.get_detections(images[0])

for d in detections:
    print(f"{d.class_name}: {d.confidence:.2f}")
```
```go
cam, err := camera.FromProvider(machine, "my-camera")
if err != nil {
	logger.Fatal(err)
}

detector, err := vision.FromProvider(machine, "my-detector")
if err != nil {
	logger.Fatal(err)
}

// Capture an image first
images, _, err := cam.Images(ctx, nil, nil)
if err != nil {
	logger.Fatal(err)
}

img, err := images[0].Image(ctx)
if err != nil {
	logger.Fatal(err)
}

// Run detection on the captured image
detections, err := detector.Detections(ctx, img, nil)
if err != nil {
	logger.Fatal(err)
}

for _, d := range detections {
	fmt.Printf("%s: %.2f\n", d.Label(), d.Score())
}
```
3. Filter detections by confidence
In practice, you almost always want to filter out low-confidence detections. Apply a threshold before processing results.
```python
CONFIDENCE_THRESHOLD = 0.7

detections = await detector.get_detections_from_camera("my-camera")

# Filter to only high-confidence detections
confident_detections = [
    d for d in detections
    if d.confidence >= CONFIDENCE_THRESHOLD
]

print(f"{len(confident_detections)} of {len(detections)} "
      f"detections above {CONFIDENCE_THRESHOLD} threshold")
for d in confident_detections:
    print(f"  {d.class_name}: {d.confidence:.2f}")
```
```go
// The Detection type comes from "go.viam.com/rdk/vision/objectdetection".
confidenceThreshold := 0.7

detections, err := detector.DetectionsFromCamera(ctx, "my-camera", nil)
if err != nil {
	logger.Fatal(err)
}

total := len(detections)
var confident []objectdetection.Detection
for _, d := range detections {
	if d.Score() >= confidenceThreshold {
		confident = append(confident, d)
	}
}

fmt.Printf("%d of %d detections above %.1f threshold\n",
	len(confident), total, confidenceThreshold)
for _, d := range confident {
	fmt.Printf("  %s: %.2f\n", d.Label(), d.Score())
}
```
4. Filter detections by class
When your model detects multiple object types, you may only care about specific classes.
```python
TARGET_CLASSES = {"person", "dog"}

detections = await detector.get_detections_from_camera("my-camera")

targets = [
    d for d in detections
    if d.class_name in TARGET_CLASSES and d.confidence >= 0.6
]

for d in targets:
    width = d.x_max - d.x_min
    height = d.y_max - d.y_min
    print(f"{d.class_name}: {d.confidence:.2f}, "
          f"size {width}x{height} pixels")
```
```go
targetClasses := map[string]bool{"person": true, "dog": true}

detections, err := detector.DetectionsFromCamera(ctx, "my-camera", nil)
if err != nil {
	logger.Fatal(err)
}

for _, d := range detections {
	if targetClasses[d.Label()] && d.Score() >= 0.6 {
		bb := d.BoundingBox()
		width := bb.Max.X - bb.Min.X
		height := bb.Max.Y - bb.Min.Y
		fmt.Printf("%s: %.2f, size %dx%d pixels\n",
			d.Label(), d.Score(), width, height)
	}
}
```
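If you plan to feed a detection's position into a control loop, a common first step is picking the largest box and computing its center offset from the image center. A sketch over plain pixel coordinates (the tuples here are stand-ins for the SDK's detection objects):

```python
def box_center(x_min, y_min, x_max, y_max):
    """Center of an axis-aligned bounding box, in pixels."""
    return ((x_min + x_max) // 2, (y_min + y_max) // 2)

def largest_box(boxes):
    """Pick the box with the largest area; boxes are (x_min, y_min, x_max, y_max)."""
    return max(boxes, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))

boxes = [(10, 10, 50, 50), (100, 80, 300, 240)]  # stand-in detections
cx, cy = box_center(*largest_box(boxes))

# Horizontal error relative to the center of a 640-pixel-wide frame,
# e.g. for steering a pan motor toward the object:
error_x = cx - 640 // 2
print(cx, cy, error_x)  # 200 160 -120
```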
5. Run detections in a loop
Most real applications need continuous detection, not a single snapshot. Run detections in a loop with a short delay to avoid overwhelming the system.
```python
import asyncio
import time

CONFIDENCE_THRESHOLD = 0.7

detector = VisionClient.from_robot(robot, "my-detector")

while True:
    start = time.time()
    detections = await detector.get_detections_from_camera("my-camera")
    confident = [d for d in detections if d.confidence >= CONFIDENCE_THRESHOLD]
    elapsed = time.time() - start

    if confident:
        names = [f"{d.class_name}({d.confidence:.2f})" for d in confident]
        print(f"[{elapsed:.2f}s] Detected: {', '.join(names)}")
    else:
        print(f"[{elapsed:.2f}s] No detections")

    await asyncio.sleep(0.1)
```
```go
// Requires "time" and "go.viam.com/rdk/vision/objectdetection".
confidenceThreshold := 0.7

detector, err := vision.FromProvider(machine, "my-detector")
if err != nil {
	logger.Fatal(err)
}

for {
	start := time.Now()
	detections, err := detector.DetectionsFromCamera(ctx, "my-camera", nil)
	if err != nil {
		logger.Error(err)
		time.Sleep(time.Second)
		continue
	}
	elapsed := time.Since(start)

	var confident []objectdetection.Detection
	for _, d := range detections {
		if d.Score() >= confidenceThreshold {
			confident = append(confident, d)
		}
	}

	if len(confident) > 0 {
		for _, d := range confident {
			fmt.Printf("[%v] %s: %.2f\n", elapsed, d.Label(), d.Score())
		}
	} else {
		fmt.Printf("[%v] No detections\n", elapsed)
	}

	time.Sleep(100 * time.Millisecond)
}
```
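A common pattern on top of a loop like this is triggering an action only when an object first appears, rather than on every frame it remains visible. A minimal edge-trigger sketch over per-frame sets of class names (the helper and data here are illustrative):

```python
def appearances(frames):
    """Yield (frame_index, class_name) each time a class newly appears."""
    previous = set()
    for i, classes in enumerate(frames):
        for name in classes - previous:
            yield (i, name)
        previous = set(classes)

# Per-frame class names, e.g. built from filtered detections in the loop above:
frames = [set(), {"person"}, {"person"}, {"person", "dog"}, {"dog"}]
print(list(appearances(frames)))
# [(1, 'person'), (3, 'dog')]
```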
6. Use the color_detector (no ML model needed)
For simple color-based detection, configure a vision service with the color_detector model instead of mlmodel. This requires no trained model.
```json
{
  "name": "red-detector",
  "api": "rdk:service:vision",
  "model": "color_detector",
  "attributes": {
    "detect_color": "#FF0000",
    "hue_tolerance_pct": 0.1,
    "segment_size_px": 200
  }
}
```
| Attribute | Type | Required | Description |
|---|---|---|---|
| detect_color | string | Required | Target color in hex format (for example, "#FF0000"). Cannot be black, white, or grayscale. |
| hue_tolerance_pct | float | Required | How much the hue can vary from the target (must be > 0.0 and <= 1.0). A value of 0.1 means 10% tolerance. |
| segment_size_px | int | Required | Minimum number of pixels a detected region must contain to count as a detection. |
| saturation_cutoff_pct | float | Optional | Minimum saturation for a pixel to be considered a match. Default: 0.2. Increase to reject washed-out colors. |
| value_cutoff_pct | float | Optional | Minimum brightness value for a pixel to be considered a match. Default: 0.3. Increase to reject dark regions. |
| label | string | Optional | The label to assign to detections. If omitted, detections have no class name. |
| camera_name | string | Optional | Default camera to use with GetDetectionsFromCamera if no camera name is specified in the request. |
The detection API works identically whether you use mlmodel or color_detector. Your code does not need to change.
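When tuning hue_tolerance_pct, it can help to see where a hex color sits on the hue circle, since (as the attribute name suggests) the tolerance is expressed as a fraction of the hue range. A sketch using only the standard library:

```python
import colorsys

def hex_to_hue(hex_color: str) -> float:
    """Hue of a hex color as a fraction of the hue circle (0.0-1.0)."""
    r, g, b = (int(hex_color.lstrip("#")[i:i + 2], 16) / 255 for i in (0, 2, 4))
    return colorsys.rgb_to_hsv(r, g, b)[0]

print(f"{hex_to_hue('#FF0000'):.2f}")  # 0.00 (red)
print(f"{hex_to_hue('#00FF00'):.2f}")  # 0.33 (green)
```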
7. Get everything in one call with CaptureAllFromCamera
If you need the image and its detections (and optionally classifications) together, use CaptureAllFromCamera instead of separate calls. This is more efficient when you need multiple result types and ensures the detections correspond exactly to the returned image.
```python
from viam.services.vision import VisionClient

detector = VisionClient.from_robot(robot, "my-detector")

result = await detector.capture_all_from_camera(
    "my-camera",
    return_image=True,
    return_detections=True,
    return_classifications=True
)

image = result.image                      # The captured image
detections = result.detections            # Detections for that exact image
classifications = result.classifications  # Classifications too
```
```go
// viscapture is imported from "go.viam.com/rdk/vision/viscapture".
captOpts := viscapture.CaptureOptions{
	ReturnImage:           true,
	ReturnDetections:      true,
	ReturnClassifications: true,
}

result, err := detector.CaptureAllFromCamera(
	context.Background(), "my-camera", captOpts, nil)
if err != nil {
	logger.Fatal(err)
}

img := result.Image
detections := result.Detections
classifications := result.Classifications
```
This is particularly useful for data logging, visualization, or any case where you need to correlate results with the exact frame that produced them.
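For data logging, one approach is to serialize each frame's results into a single record so the detections stay tied to the frame that produced them. A sketch with stand-in values (the record's field names are illustrative):

```python
import json
import time

def detection_record(camera_name, detections):
    """Bundle detections with a timestamp into one JSON log line."""
    return json.dumps({
        "camera": camera_name,
        "timestamp": time.time(),
        "detections": [
            {"class": c, "confidence": conf, "box": box}
            for c, conf, box in detections
        ],
    })

# Stand-ins for (class_name, confidence, (x_min, y_min, x_max, y_max)):
line = detection_record("my-camera", [("person", 0.91, (10, 20, 110, 220))])
print(line)
```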
Try It
- Run the detection loop from step 5. Point your camera at objects your model recognizes and observe the output.
- Adjust the confidence threshold and notice how it affects the number of detections. Try 0.3, 0.5, 0.7, and 0.9.
- Add class filtering to focus on a specific object type.
- If you do not have a trained model, configure a color_detector and point the camera at a brightly colored object.
What’s Next
- Classify Objects – use whole-image classification instead of per-object bounding boxes.
- Track Objects Across Frames – maintain persistent identities for detected objects as they move between frames.
- Localize Objects in 3D – combine 2D detections with depth data to get real-world 3D positions.