# Dataset

Manage training datasets

## Create Dataset

 - [POST /tailored-gen/datasets](https://docs.bria.ai/tailored-generation/dataset/create-dataset.md): Create a new dataset.

Training Version:
The training_version parameter determines the dataset's compatibility and structure:
* Legacy (max, light, 3.2, 2.3): Uses text captions and caption_prefix. Compatible with each other.
* FIBO (fibo): Uses JSON structured data (visual_schema). caption_prefix is null.

Defaults & constraints:
* training_version defaults to max (will change to fibo in future updates).
* training_version is immutable after creation.

Project Compatibility & Automatic Assignment:
When creating a dataset, the system validates compatibility with the parent Project:

1. Automatic Assignment: If the Project's training_version is null, it will automatically inherit the training_version of this new dataset. The Project will then be locked to this version family.

2. Validation: If the Project already has a training_version set:
   * FIBO Projects: Can only contain fibo datasets.
   * Legacy Projects: Can contain any legacy dataset (max, light, 3.2, 2.3).
   * Mixing Forbidden: You cannot create a fibo dataset in a legacy project, or vice versa.
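
The compatibility rules above can be sketched as a small client-side check. This is only an illustration of the documented rules, not the service's implementation; the function names `version_family` and `resolve_project_version` are hypothetical.

```python
from typing import Optional

# Version families as listed in this section.
LEGACY_VERSIONS = {"max", "light", "3.2", "2.3"}
FIBO_VERSIONS = {"fibo"}


def version_family(training_version: str) -> str:
    """Return 'legacy' or 'fibo' for a known training_version."""
    if training_version in LEGACY_VERSIONS:
        return "legacy"
    if training_version in FIBO_VERSIONS:
        return "fibo"
    raise ValueError(f"unknown training_version: {training_version!r}")


def resolve_project_version(project_version: Optional[str], dataset_version: str) -> str:
    """Return the project's training_version after the dataset is created.

    Raises ValueError if the dataset would mix legacy and fibo families.
    """
    dataset_family = version_family(dataset_version)
    if project_version is None:
        # Automatic assignment: the project inherits the dataset's version.
        return dataset_version
    if version_family(project_version) != dataset_family:
        raise ValueError(
            f"cannot create a {dataset_version!r} dataset in a "
            f"{project_version!r} project"
        )
    # Validation passed: the project keeps its existing version.
    return project_version
```

Running a check like this before calling the create endpoint lets a client report a family mismatch without a round trip.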

Completion Requirements:
* Legacy: Minimum 1 image required to mark as completed.
* FIBO: Minimum 5 images required to mark as completed.

Upload types:
* Basic upload type: Supports up to 200 images, uploaded as individual image files.
* Advanced upload type: Supports up to 5000 images, uploaded as a single ZIP file.
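
As a rough sketch, the completion minimums and upload-type limits listed above can be encoded as lookup tables. The helper names are hypothetical, and the server remains the authority on these limits.

```python
# Limits taken from this section; helper names are illustrative only.
MIN_IMAGES_TO_COMPLETE = {"legacy": 1, "fibo": 5}
MAX_IMAGES_BY_UPLOAD_TYPE = {"basic": 200, "advanced": 5000}


def can_mark_completed(training_version: str, image_count: int) -> bool:
    """True if the dataset has enough images to be marked as completed."""
    family = "fibo" if training_version == "fibo" else "legacy"
    return image_count >= MIN_IMAGES_TO_COMPLETE[family]


def fits_upload_type(upload_type: str, image_count: int) -> bool:
    """True if image_count is within the limit for the given upload type."""
    return image_count <= MAX_IMAGES_BY_UPLOAD_TYPE[upload_type]
```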

## Get Datasets

 - [GET /tailored-gen/datasets](https://docs.bria.ai/tailored-generation/dataset/get-datasets.md): Retrieve a list of all datasets. If there are no datasets, returns an empty array.

## Get Datasets by Project

 - [GET /tailored-gen/projects/{project_id}/datasets](https://docs.bria.ai/tailored-generation/dataset/get-datasets-by-project.md): Retrieve all datasets for a specific project

## Get Dataset by ID

 - [GET /tailored-gen/datasets/{dataset_id}](https://docs.bria.ai/tailored-generation/dataset/get-dataset-by-id.md): Retrieve a specific dataset

## Update Dataset

 - [PUT /tailored-gen/datasets/{dataset_id}](https://docs.bria.ai/tailored-generation/dataset/paths/~1tailored-gen~1datasets~1%7Bdataset_id%7D/put.md): Update a dataset.

FIBO vs Legacy Behavior:
* If training_version is 'fibo':
  * You CAN update visual_schema (only when status is draft).
  * You CANNOT update caption_prefix.
* If training_version is 'max', 'light', '3.2', or '2.3':
  * You CAN update caption_prefix (only when status is draft).
  * You CANNOT update visual_schema.
  
Completion Requirements:
To set status to completed:
* FIBO Datasets: Must have at least 5 images.
* Legacy Datasets: Must have at least 1 image.

Note: training_version cannot be updated.
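
The update rules above reduce to a single question: which caption-related field, if any, may be changed for a given dataset. A minimal sketch, assuming the status values `draft` and `completed` used in this document (the function name is hypothetical):

```python
from typing import Optional

LEGACY_VERSIONS = {"max", "light", "3.2", "2.3"}


def updatable_caption_field(training_version: str, status: str) -> Optional[str]:
    """Return the field that may be updated, or None if neither may be."""
    if training_version != "fibo" and training_version not in LEGACY_VERSIONS:
        raise ValueError(f"unknown training_version: {training_version!r}")
    if status != "draft":
        # Both fields are editable only while the dataset is a draft.
        return None
    return "visual_schema" if training_version == "fibo" else "caption_prefix"
```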

## Delete Dataset

 - [DELETE /tailored-gen/datasets/{dataset_id}](https://docs.bria.ai/tailored-generation/dataset/delete-dataset.md): Delete a specific dataset. Deletes all associated images.

## Clone Dataset As Draft

 - [POST /tailored-gen/datasets/{dataset_id}/clone](https://docs.bria.ai/tailored-generation/dataset/clone-dataset.md): Create a new draft dataset based on an existing one. This is useful when you want to reuse a dataset for another training with some modifications (a variation). Inheritance: The cloned dataset inherits the training_version (and visual_schema if applicable) from the source dataset.

## Upload Image files

 - [POST /tailored-gen/datasets/{dataset_id}/images](https://docs.bria.ai/tailored-generation/dataset/upload-image.md): Upload a new image to a dataset.

Image Requirements:
- Recommended minimum resolution: 1024x1024 pixels for best quality
  - By default, smaller images (down to 256x256) will be automatically upscaled to meet this threshold (increase_resolution=true)
  - To strictly enforce the 1024x1024 minimum, set increase_resolution=false
- Supported formats: jpg, jpeg, png, webp
- Preferably use original high-quality assets

Dataset Guidelines:
- Recommended: 5-50 images for optimal results with the Max or Fibo training version; 15-100 images with the Light training version
- Maximum supported: 200 images
- Ensure consistency in style, structure, and visual elements
- Balance diversity in content (poses, scenes, objects) while maintaining consistency in key elements (style, colors, theme)
- Note: Larger datasets may introduce more variety, which can reduce overall consistency

For optimal training (especially for characters/objects):
- Subject should occupy most of the image area
- Minimize unnecessary margins around the subject
- Transparent backgrounds will be converted to black
- For character datasets: include diverse poses, environments, attires, and interactions

Captions and Generation:
- For Legacy models:
  - Each image receives an automatic caption that continues from the dataset's caption prefix
  - Default caption prefix is recommended for initial training
- Captions can be modified to include domain-specific terms
- Both captions and prefix influence training and future generations
- Focus on essential elements rather than extensive details

Constraints:
- Available only for datasets with the "basic" upload type. Use the images/bulk-upload endpoint for advanced datasets
- Dataset must have at least 5 images
- Dataset cannot exceed 200 images
- Cannot upload to a completed dataset

This API endpoint supports content moderation via an optional parameter that can prevent processing if input images contain inappropriate content - the first blocked input image will fail the entire request.

## Get Images

 - [GET /tailored-gen/datasets/{dataset_id}/images](https://docs.bria.ai/tailored-generation/dataset/get-images.md): Retrieve all images in a specific dataset. If there are no images, returns an empty array.

## Regenerate All Captions

 - [PUT /tailored-gen/datasets/{dataset_id}/images](https://docs.bria.ai/tailored-generation/dataset/regenerate-all-captions.md): Regenerate captions for all images in a dataset. Run this after updating the visual schema or caption_prefix so that every caption is fully compatible with the new value.

This is an asynchronous operation. After calling this endpoint, poll Get Dataset by ID until captions_update_status changes to 'completed'.
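
A minimal polling sketch for this asynchronous flow follows. It takes a caller-supplied `fetch_dataset` callable (hypothetical) that performs the Get Dataset by ID request and returns the parsed response; the interval and timeout defaults are arbitrary choices, not documented values.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_captions(
    fetch_dataset: Callable[[], Mapping[str, Any]],
    interval_s: float = 5.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Poll until captions_update_status is 'completed' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        dataset = fetch_dataset()
        if dataset.get("captions_update_status") == "completed":
            return dataset
        if time.monotonic() >= deadline:
            raise TimeoutError("caption regeneration did not complete in time")
        # Wait before the next Get Dataset by ID call.
        sleep(interval_s)
```

Injecting `fetch_dataset` and `sleep` keeps the loop independent of any particular HTTP client and easy to exercise in tests.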

## Advanced image upload

 - [POST /tailored-gen/datasets/{dataset_id}/images/bulk-upload](https://docs.bria.ai/tailored-generation/dataset/bulk-upload-images.md): Efficiently upload a large volume of images (up to 5000) from a ZIP file to an advanced dataset.

FIBO Specific Behavior:
* Upload without Schema: You CAN initiate a bulk upload to a FIBO dataset even if visual_schema is null. 
* Captioning: If the schema is missing, images will be uploaded but caption generation will be skipped. You must call Regenerate All Captions after defining the schema.
* ZIP Content: Should contain images only. Text files are ignored.

Legacy Specific Behavior:
* ZIP Content: Can contain images and optional .txt caption files.
* Captioning: Uses automatic_captioning flag or provided text files.

General:
* Asynchronous operation; status can be retrieved via GET /tailored-gen/datasets/{dataset_id}/images/bulk-upload/status.
* Supported for 'advanced' upload type datasets only.
* This endpoint is for bulk upload and does not support the increase_resolution parameter.
* Images that fail validation (e.g., unsupported format, wrong dimensions, missing captions) will be skipped and included in the failure report.
* The request fails if the dataset is not empty, if another bulk upload is in progress, or if any previous bulk upload attempt took place.

Image Requirements:
* Supported formats: jpg, jpeg, png, webp.
* Minimum dimensions: 1024 x 1024 pixels.
* Total size limit: 5 GB zip file.
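
Because a bulk upload can be attempted only once per dataset, it may be worth inspecting the ZIP locally first. The sketch below classifies archive entries by extension and checks the archive size; it does not check image dimensions. The function name is hypothetical, and reading "5 GB" as 5 * 10^9 bytes is a conservative assumption.

```python
import io
import zipfile
from pathlib import PurePosixPath
from typing import BinaryIO, Dict, List

ALLOWED_IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
MAX_ZIP_BYTES = 5 * 10**9  # assumption: decimal gigabytes


def preflight_zip(fileobj: BinaryIO, is_fibo: bool) -> Dict[str, List[str]]:
    """Group ZIP entries into images, caption files, and ignored files."""
    fileobj.seek(0, io.SEEK_END)
    size = fileobj.tell()
    fileobj.seek(0)
    if size > MAX_ZIP_BYTES:
        raise ValueError(f"archive is {size} bytes, above the 5 GB limit")

    report: Dict[str, List[str]] = {"images": [], "captions": [], "ignored": []}
    with zipfile.ZipFile(fileobj) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            suffix = PurePosixPath(info.filename).suffix.lower()
            if suffix in ALLOWED_IMAGE_EXTENSIONS:
                report["images"].append(info.filename)
            elif suffix == ".txt" and not is_fibo:
                # Legacy datasets accept optional .txt caption files.
                report["captions"].append(info.filename)
            else:
                # Includes .txt files in FIBO uploads, which are ignored.
                report["ignored"].append(info.filename)
    return report
```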

## Get Image by ID

 - [GET /tailored-gen/datasets/{dataset_id}/images/{image_id}](https://docs.bria.ai/tailored-generation/dataset/get-image.md): Retrieve full image information including caption (which naturally continues the dataset's caption_prefix), caption source (automatic/manual/unknown), image name, URL and thumbnail URL, dataset ID, and timestamps.

## Update Image Caption

 - [PUT /tailored-gen/datasets/{dataset_id}/images/{image_id}](https://docs.bria.ai/tailored-generation/dataset/update-image-caption.md): Update the caption of a specific image. There are two mutually exclusive ways to update a caption:

1. Provide a new caption text:
   * Use the caption parameter
   * This will set caption_source to "manual"
   * Reflects a human-written caption

2. Request automatic caption regeneration:
   * Set regenerate_caption to true
   * This will set caption_source to "automatic"
   * A new caption will be generated automatically based on the image and caption_prefix (or visual_schema for FIBO)

FIBO Dataset Validation:
If the image belongs to a dataset with training_version = fibo:
* The caption string must contain a valid JSON structure.
* The JSON content is validated against a pre-defined caption schema.
* The system will attempt to auto-correct minor structural issues.
* If the JSON is invalid or structurally incorrect beyond repair, a 400 error is returned.

Constraints:
* Cannot update captions in a completed dataset
* Cannot provide both caption and regenerate_caption in the same request
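
The two mutually exclusive modes can be sketched as a request-body builder. The `caption` and `regenerate_caption` names come from this section; the function itself is hypothetical. For FIBO the sketch only checks that the caption parses as JSON, since validation against the caption schema happens server-side.

```python
import json
from typing import Any, Dict, Optional


def build_caption_update(
    caption: Optional[str] = None,
    regenerate_caption: bool = False,
    training_version: str = "max",
) -> Dict[str, Any]:
    """Build the body for Update Image Caption, enforcing exactly one mode."""
    if (caption is not None) == regenerate_caption:
        raise ValueError("provide exactly one of caption or regenerate_caption=True")
    if regenerate_caption:
        return {"regenerate_caption": True}
    if training_version == "fibo":
        try:
            # FIBO captions must contain valid JSON.
            json.loads(caption)
        except json.JSONDecodeError as exc:
            raise ValueError("FIBO captions must be valid JSON") from exc
    return {"caption": caption}
```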

## Delete Image

 - [DELETE /tailored-gen/datasets/{dataset_id}/images/{image_id}](https://docs.bria.ai/tailored-generation/dataset/delete-image.md): Permanently remove an image from a dataset. This will also delete the image files and associated thumbnails. 

Constraints:
* Cannot delete images from completed datasets

## Get Bulk Upload Status

 - [GET /tailored-gen/datasets/{dataset_id}/images/bulk-upload/status](https://docs.bria.ai/tailored-generation/dataset/get-bulk-upload-status.md): Retrieve the status and progress of a bulk image upload job.

## Generate Visual Schema

 - [POST /tailored-gen/generate_visual_schema](https://docs.bria.ai/tailored-generation/dataset/generate-visual-schema.md): Generates a structured JSON visual schema (backbone) based on the provided sample images.

This endpoint is required for datasets intended for the FIBO training version.
The visual schema represents mutual characteristics (style, IP, colors, etc.) across training images and is used for:
1. Caption generation during image upload.
2. Prompt translation (user text → structured prompt) during generation.

Usage:
- Provide 5-10 representative images of your style/IP.
- The returned visual_schema string must be added to your dataset using the PUT /tailored-gen/datasets/{dataset_id} endpoint.

This API endpoint supports content moderation via an optional parameter.

## Refine Structured Prompt

 - [POST /tailored-gen/refine_structured_prompt](https://docs.bria.ai/tailored-generation/dataset/refine-json.md): Refines a Structured Prompt object (such as a Visual Schema or an Image Caption) based on user instructions.

Access Control & Validation:
* Dataset Ownership: Requires a valid dataset_id to verify that the API token belongs to the organization owning the dataset.
* Draft Status: The referenced dataset must be in draft mode. Refinement is disabled for completed datasets.

Use Cases:
1. Refine Visual Schema: Input the initial schema generated by Bria and instructions like "Make the style description more detailed".
2. Refine Image Caption: Input a specific image's caption and instructions like "Fix the description of the hair color".

The endpoint uses a VLM/LLM to process the input JSON and instructions, returning a valid, modified JSON structure that preserves the required format.

## Generate Caption Prefix

 - [POST /tailored-gen/generate_prefix](https://docs.bria.ai/tailored-generation/dataset/generate-prefix.md): Generates a caption prefix based on the provided images.  


This is currently supported only when ip_type is 'stylized_scene', 'defined_character', or 'object_variants'.


Usage Scenarios:
1. Before uploading visuals to a new dataset  
  - This use case applies when creating a new dataset.  
  - In the first step, you can create the dataset entity in parallel while calling this endpoint.  
  - Randomly sample 1-6 images from the input images provided for training. If there are 6 or more images, provide exactly 6 for the best results.  
  - Once you receive the prefix, update the dataset using the Update Dataset endpoint.  
  - Then, proceed with uploading images to the dataset.

2. To regenerate a new prefix (even if previously generated)  
  - This allows users to select the prefix they prefer.  
  - Randomly sample 1-6 images from the input images provided for training. If there are 6 or more images, provide exactly 6 for the best results.  
  - Update the dataset with the new prefix.  
  - Then, use the Regenerate All Captions endpoint to ensure all images in the dataset get updated captions.
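
The sampling step in both scenarios can be sketched as follows: pick 6 images at random when at least 6 are available, otherwise use all of them. The function name is hypothetical, and the optional `rng` argument exists only to make the selection reproducible.

```python
import random
from typing import List, Optional, Sequence


def sample_for_prefix(image_paths: Sequence[str], rng: Optional[random.Random] = None) -> List[str]:
    """Randomly choose up to 6 training images to send to Generate Caption Prefix."""
    if not image_paths:
        raise ValueError("at least one image is required")
    rng = rng or random.Random()
    # Exactly 6 when available, otherwise every image (1-5).
    return rng.sample(list(image_paths), min(6, len(image_paths)))
```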


If any image fails validation, the request will fail.  


This API endpoint supports content moderation via an optional parameter that can prevent processing if input images contain inappropriate content - the first blocked input image will fail the entire request.

## Download Advanced Dataset

 - [GET /datasets/{dataset_id}/download](https://docs.bria.ai/tailored-generation/dataset/download-dataset.md): Enables users to download an advanced dataset.
The response includes a pre-signed URL for downloading the dataset, details about the base model used, 
and the prompt prefix applied during training.