
You can try Apple’s lightning-fast video captioning model right from your browser

A few months ago, Apple released FastVLM, a Visual Language Model (VLM) that offered near-instant high-resolution image processing. Now, you can take it for a spin, provided you have an Apple Silicon-powered Mac. Here’s how.

When we first covered FastVLM, we explained that it leverages MLX, Apple’s own open machine learning framework designed for Apple Silicon, to deliver video captioning up to 85 times faster than comparable models, while being more than three times smaller.

Since then, Apple has continued work on the project, which is now available on Hugging Face as well as GitHub. On Hugging Face, you can load the lighter version, FastVLM-0.5B, right in your browser and check it out for yourself.
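If you’d rather poke at the checkpoint outside the browser, it can be pulled down with the usual Hugging Face transformers calls. Below is a minimal loading sketch, assuming the apple/FastVLM-0.5B repo id and that the checkpoint ships its own modeling code behind trust_remote_code, as the model card indicates:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "apple/FastVLM-0.5B"  # assumed Hugging Face repo id

# The checkpoint bundles custom modeling code, so trust_remote_code is required.
tok = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # half precision keeps the 0.5B weights light
    device_map="auto",          # picks Apple Silicon's "mps" backend when available
    trust_remote_code=True,
)
```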

Depending on your hardware, it may take a bit to load. It took a couple of minutes on my 16GB M2 Pro MacBook Pro. But as soon as it loaded, the model started to accurately describe my appearance, the room behind me, different expressions, and objects I would bring into view.

In the bottom left corner, you can adjust the prompt the model takes into account as it updates the caption live, or you can pick from a few suggestions, such as the ones below (a sketch of how a prompt reaches the model follows the list):

  • Describe what you see in one sentence.
  • What is the color of my shirt?
  • Identify any text or written content visible.
  • What emotions or actions are being portrayed?
  • Name the object I am holding in my hand.
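Under the hood, each of these prompts is just a chat turn with an image placeholder spliced into it. Continuing from the loading sketch above, here is a rough, LLaVA-style pass over a single frame; the -200 image-token id, the get_vision_tower() accessor, and the images= keyword are assumptions taken from the model card’s bundled code, so treat the plumbing as illustrative rather than definitive:

```python
import torch
from PIL import Image

IMAGE_TOKEN_INDEX = -200  # assumed sentinel id the bundled model code looks for

prompt = "Describe what you see in one sentence."
messages = [{"role": "user", "content": f"<image>\n{prompt}"}]
rendered = tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Tokenize the text around the <image> placeholder, then splice in the sentinel id.
pre, post = rendered.split("<image>", 1)
pre_ids = tok(pre, return_tensors="pt", add_special_tokens=False).input_ids
post_ids = tok(post, return_tensors="pt", add_special_tokens=False).input_ids
image_tok = torch.tensor([[IMAGE_TOKEN_INDEX]], dtype=pre_ids.dtype)
input_ids = torch.cat([pre_ids, image_tok, post_ids], dim=1).to(model.device)

# Preprocess the frame with the model's own image processor (assumed accessor).
img = Image.open("frame.jpg").convert("RGB")
px = model.get_vision_tower().image_processor(images=img, return_tensors="pt")["pixel_values"]
px = px.to(model.device, dtype=model.dtype)

with torch.no_grad():
    out = model.generate(inputs=input_ids, images=px, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```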

If you feel like taking things further, you can try using a virtual camera app to feed video to the tool and watch it instantly describe scene after scene in detail, to the point where it becomes hard to keep up with the output. Of course, real use cases would look different, but it does underscore how fast and accurate the model can be.
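If you’d rather script that experiment than install a virtual camera app, a simple capture loop gets you most of the way there. In this sketch, caption() is a hypothetical helper standing in for the tokenize/splice/generate steps above:

```python
import time

import cv2  # pip install opencv-python
from PIL import Image

def caption(image: Image.Image, prompt: str) -> str:
    """Hypothetical helper: wraps the tokenize/splice/generate steps sketched above."""
    raise NotImplementedError

cap = cv2.VideoCapture(0)  # 0 = default camera; a virtual camera appears at another index
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV yields BGR; PIL expects RGB
        print(caption(Image.fromarray(rgb), "Describe what you see in one sentence."))
        time.sleep(1.0)  # throttle to roughly one caption per second
finally:
    cap.release()
```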

What is particularly interesting about this experiment is that it runs locally in the browser, meaning no data ever leaves the device, and it can even work offline. That makes it a natural fit for wearables and assistive technology, where small footprints and low latency are paramount.

It’s worth noting that the demo runs on the lighter 0.5-billion-parameter model, while the FastVLM family also includes larger and more capable variants with 1.5 billion and 7 billion parameters. With the bigger models, caption quality could improve even further, although running them directly in the browser would likely be a no-go.

Did you test it out? Share your thoughts in the comments.


Author

Marcus Mendes

Marcus Mendes is a Brazilian tech podcaster and journalist who has been closely following Apple since the mid-2000s.

He began covering Apple news in Brazilian media in 2012 and later broadened his focus to the wider tech industry, hosting a daily podcast for seven years.