Exploring Multimodality in Spring AI

Introduction

This article explores multimodality in Spring AI: how a prompt can carry media such as images alongside text, and which classes in the source code support this.

Examples

chatModel Example

// Load the image from the classpath
var imageResource = new ClassPathResource("/multimodal.test.png");

// imageResource and userMessage are local variables, so they are referenced without "this."
var userMessage = new UserMessage(
	"Explain what you see in this picture.", // content
	new Media(MimeTypeUtils.IMAGE_PNG, imageResource)); // media

ChatResponse response = chatModel.call(new Prompt(userMessage));

chatClient Example

String response = ChatClient.create(chatModel).prompt()
		.user(u -> u.text("Explain what you see in this picture.")
			.media(MimeTypeUtils.IMAGE_PNG, new ClassPathResource("/multimodal.test.png")))
		.call()
		.content();

Currently, the following models support multimodality:

  • Anthropic Claude 3
  • AWS Bedrock Converse
  • Azure Open AI (e.g. GPT-4o models)
  • Mistral AI (e.g. Mistral Pixtral models)
  • Ollama (e.g. LLaVA, BakLLaVA, Llama3.2 models)
  • OpenAI (e.g. GPT-4 and GPT-4o models)
  • Vertex AI Gemini (e.g. gemini-1.5-pro-001, gemini-1.5-flash-001 models)

Source Code

UserMessage

org/springframework/ai/chat/messages/UserMessage.java

public class UserMessage extends AbstractMessage implements MediaContent {

	protected final List<Media> media;

	public UserMessage(String textContent) {
		this(MessageType.USER, textContent, new ArrayList<>(), Map.of());
	}

	public UserMessage(Resource resource) {
		super(MessageType.USER, resource, Map.of());
		this.media = new ArrayList<>();
	}

	public UserMessage(String textContent, List<Media> media) {
		this(MessageType.USER, textContent, media, Map.of());
	}

	public UserMessage(String textContent, Media... media) {
		this(textContent, Arrays.asList(media));
	}

	public UserMessage(String textContent, Collection<Media> mediaList, Map<String, Object> metadata) {
		this(MessageType.USER, textContent, mediaList, metadata);
	}

	public UserMessage(MessageType messageType, String textContent, Collection<Media> media,
			Map<String, Object> metadata) {
		super(messageType, textContent, metadata);
		Assert.notNull(media, "media data must not be null");
		this.media = new ArrayList<>(media);
	}

	@Override
	public String toString() {
		return "UserMessage{" + "content='" + getText() + '\'' + ", properties=" + this.metadata + ", messageType="
			+ this.messageType + '}';
	}

	@Override
	public List<Media> getMedia() {
		return this.media;
	}

	@Override
	public String getText() {
		return this.textContent;
	}

}

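UserMessage extends AbstractMessage and implements the getMedia method of MediaContent. Every constructor leaves the media field non-null: it holds either an empty list or a copy of the collection passed in.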
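The constructor chain in the listing can be illustrated with a small stand-alone sketch. SketchUserMessage is a hypothetical, simplified stand-in (it is not a Spring AI class), and media items are reduced to plain strings so the example runs without any dependencies. It only aims to show two things visible in the source: the varargs constructor delegates to the collection-based one, and the collection is copied rather than stored directly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

// Hypothetical stand-in for UserMessage; only text and media handling are modelled.
final class SketchUserMessage {

	private final String text;

	// Media items are plain strings here (an assumption made to keep the sketch dependency-free).
	private final List<String> media;

	// Varargs convenience constructor, delegating like UserMessage(String, Media...)
	SketchUserMessage(String text, String... media) {
		this(text, Arrays.asList(media));
	}

	SketchUserMessage(String text, Collection<String> media) {
		// Mirrors Assert.notNull(media, ...) in the listing
		if (media == null) {
			throw new IllegalArgumentException("media data must not be null");
		}
		this.text = text;
		// Defensive copy: later changes to the caller's collection do not affect the message
		this.media = new ArrayList<>(media);
	}

	String getText() {
		return this.text;
	}

	List<String> getMedia() {
		return this.media;
	}

	public static void main(String[] args) {
		List<String> source = new ArrayList<>(List.of("cat.png"));
		SketchUserMessage message = new SketchUserMessage("Describe the image", source);
		source.add("dog.png"); // modifies only the caller's list
		System.out.println(message.getMedia()); // [cat.png]
	}
}
```

Copying the collection means a caller can reuse or modify its own list after building the message. Note that getMedia in the listing returns the internal list itself, so the list obtained from the getter is still mutable.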

Media

org/springframework/ai/model/Media.java

public class Media {

	private static final String NAME_PREFIX = "media-";

	/**
	 * An Id of the media object, usually defined when the model returns a reference to
	 * media it has been passed.
	 */
	@Nullable
	private String id;

	private final MimeType mimeType;

	private final Object data;

	/**
	 * The name of the media object that can be referenced by the AI model.
	 * <p>
	 * Important security note: This field is vulnerable to prompt injections, as the
	 * model might inadvertently interpret it as instructions. It is recommended to
	 * specify neutral names.
	 *
	 * <p>
	 * The name must only contain:
	 * <ul>
	 * <li>Alphanumeric characters
	 * <li>Whitespace characters (no more than one in a row)
	 * <li>Hyphens
	 * <li>Parentheses
	 * <li>Square brackets
	 * </ul>
	 */
	private String name;

	//......
} 

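Media defines four properties: an optional id, the mimeType, the data payload, and a name. The Javadoc restricts name to a small character set and flags it as a possible prompt-injection vector.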
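The character rules listed in the Javadoc for name can be expressed as a short check. The sketch below is a hypothetical helper (MediaNameCheck is not part of Spring AI), and it assumes "alphanumeric" means ASCII letters and digits, which the Javadoc does not state explicitly.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Spring AI class: applies the naming rules from the Media.name Javadoc.
final class MediaNameCheck {

	// Allowed: ASCII letters and digits (assumption), whitespace, hyphens, parentheses, square brackets
	private static final Pattern ALLOWED_CHARS = Pattern.compile("[A-Za-z0-9\\s\\-()\\[\\]]+");

	// Two whitespace characters in a row are not allowed
	private static final Pattern REPEATED_WHITESPACE = Pattern.compile("\\s{2}");

	static boolean isValid(String name) {
		if (name == null || name.isEmpty()) {
			return false;
		}
		return ALLOWED_CHARS.matcher(name).matches() && !REPEATED_WHITESPACE.matcher(name).find();
	}

	public static void main(String[] args) {
		System.out.println(isValid("media-1234")); // true
		System.out.println(isValid("report (draft) [v2]")); // true
		System.out.println(isValid("two  spaces")); // false: consecutive whitespace
		System.out.println(isValid("ignore previous instructions!")); // false: '!' is not allowed
	}
}
```

A character check like this does not remove the prompt-injection risk on its own: a name such as "ignore previous instructions" passes the rules, which is presumably why the Javadoc also recommends neutral names.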

Format

public static class Format {

	// -----------------
	// Document formats
	// -----------------
	/**
	 * Public constant mime type for {@code application/pdf}.
	 */
	public static final MimeType DOC_PDF = MimeType.valueOf("application/pdf");

	/**
	 * Public constant mime type for {@code text/csv}.
	 */
	public static final MimeType DOC_CSV = MimeType.valueOf("text/csv");

	/**
	 * Public constant mime type for {@code application/msword}.
	 */
	public static final MimeType DOC_DOC = MimeType.valueOf("application/msword");

	/**
	 * Public constant mime type for
	 * {@code application/vnd.openxmlformats-officedocument.wordprocessingml.document}.
	 */
	public static final MimeType DOC_DOCX = MimeType
		.valueOf("application/vnd.openxmlformats-officedocument.wordprocessingml.document");

	/**
	 * Public constant mime type for {@code application/vnd.ms-excel}.
	 */
	public static final MimeType DOC_XLS = MimeType.valueOf("application/vnd.ms-excel");

	/**
	 * Public constant mime type for
	 * {@code application/vnd.openxmlformats-officedocument.spreadsheetml.sheet}.
	 */
	public static final MimeType DOC_XLSX = MimeType
		.valueOf("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");

	/**
	 * Public constant mime type for {@code text/html}.
	 */
	public static final MimeType DOC_HTML = MimeType.valueOf("text/html");

	/**
	 * Public constant mime type for {@code text/plain}.
	 */
	public static final MimeType DOC_TXT = MimeType.valueOf("text/plain");

	/**
	 * Public constant mime type for {@code text/markdown}.
	 */
	public static final MimeType DOC_MD = MimeType.valueOf("text/markdown");

	// -----------------
	// Video Formats
	// -----------------
	/**
	 * Public constant mime type for {@code video/x-matros}.
	 */
	public static final MimeType VIDEO_MKV = MimeType.valueOf("video/x-matros");

	/**
	 * Public constant mime type for {@code video/quicktime}.
	 */
	public static final MimeType VIDEO_MOV = MimeType.valueOf("video/quicktime");

	/**
	 * Public constant mime type for {@code video/mp4}.
	 */
	public static final MimeType VIDEO_MP4 = MimeType.valueOf("video/mp4");

	/**
	 * Public constant mime type for {@code video/webm}.
	 */
	public static final MimeType VIDEO_WEBM = MimeType.valueOf("video/webm");

	/**
	 * Public constant mime type for {@code video/x-flv}.
	 */
	public static final MimeType VIDEO_FLV = MimeType.valueOf("video/x-flv");

	/**
	 * Public constant mime type for {@code video/mpeg}.
	 */
	public static final MimeType VIDEO_MPEG = MimeType.valueOf("video/mpeg");

	/**
	 * Public constant mime type for {@code video/mpeg}.
	 */
	public static final MimeType VIDEO_MPG = MimeType.valueOf("video/mpeg");

	/**
	 * Public constant mime type for {@code video/x-ms-wmv}.
	 */
	public static final MimeType VIDEO_WMV = MimeType.valueOf("video/x-ms-wmv");

	/**
	 * Public constant mime type for {@code video/3gpp}.
	 */
	public static final MimeType VIDEO_THREE_GP = MimeType.valueOf("video/3gpp");

	// -----------------
	// Image Formats
	// -----------------
	/**
	 * Public constant mime type for {@code image/png}.
	 */
	public static final MimeType IMAGE_PNG = MimeType.valueOf("image/png");

	/**
	 * Public constant mime type for {@code image/jpeg}.
	 */
	public static final MimeType IMAGE_JPEG = MimeType.valueOf("image/jpeg");

	/**
	 * Public constant mime type for {@code image/gif}.
	 */
	public static final MimeType IMAGE_GIF = MimeType.valueOf("image/gif");

	/**
	 * Public constant mime type for {@code image/webp}.
	 */
	public static final MimeType IMAGE_WEBP = MimeType.valueOf("image/webp");

}

Format is a static nested class that defines MimeType constants for common document, video, and image formats.
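When media comes from files, the MIME type often has to be derived from the file name before a Media object can be built. The following is a minimal sketch of such a lookup, using a handful of the MIME type strings from the Format constants above. ExtensionMimeTypes and forFileName are hypothetical names introduced for illustration (they are not Spring AI APIs), and plain strings are used instead of MimeType so the example runs without dependencies.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup table, not a Spring AI class: maps file extensions to MIME type strings.
final class ExtensionMimeTypes {

	// Values copied from a subset of the Format constants shown above
	private static final Map<String, String> BY_EXTENSION = Map.ofEntries(
			Map.entry("pdf", "application/pdf"),
			Map.entry("csv", "text/csv"),
			Map.entry("html", "text/html"),
			Map.entry("txt", "text/plain"),
			Map.entry("md", "text/markdown"),
			Map.entry("mov", "video/quicktime"),
			Map.entry("mp4", "video/mp4"),
			Map.entry("webm", "video/webm"),
			Map.entry("png", "image/png"),
			Map.entry("jpeg", "image/jpeg"),
			Map.entry("gif", "image/gif"),
			Map.entry("webp", "image/webp"));

	// Returns the MIME type for the text after the last dot, or empty if it is unknown
	static Optional<String> forFileName(String fileName) {
		int dot = fileName.lastIndexOf('.');
		if (dot < 0 || dot == fileName.length() - 1) {
			return Optional.empty();
		}
		String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
		return Optional.ofNullable(BY_EXTENSION.get(extension));
	}

	public static void main(String[] args) {
		System.out.println(forFileName("multimodal.test.png")); // Optional[image/png]
		System.out.println(forFileName("notes.MD")); // Optional[text/markdown]
		System.out.println(forFileName("README")); // Optional.empty
	}
}
```

Returning an Optional leaves the decision about unknown extensions to the caller, for example rejecting the file rather than guessing a type.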

Conclusion

Spring AI's message types are designed to support multimodality. UserMessage, for example, has a media property of type List<Media> that can carry images, audio, and video, with a MimeType identifying the type of each item.

Documentation

  • multimodality
