The Challenge
Manually transcribing restaurant menus or supplier pricelists from images into structured data is a slow, error-prone, and tedious task. The source images are often low-quality, taken at odd angles, and feature complex, multi-column layouts that cause traditional OCR tools to fail. A more intelligent, automated solution was needed to handle the visual complexity and variety of these real-world documents.
The Technical Solution
An end-to-end processing application was developed to turn messy menu photos into structured, ready-to-use data. The system uses a multi-stage AI pipeline that combines object detection, OCR, and a vision-language model to overcome the limitations of standard OCR.
- 1. Semantic Menu Sectioning with YOLO: The core insight was that processing an entire complex menu at once was unreliable. To solve this, a custom YOLO object detection model was trained to first identify and crop logical sections from the menu (e.g., "Appetizers," "Main Courses"). This model was trained on a public dataset of over 2,000 real-world menu images, which I collected and annotated. The dataset is available on Roboflow Universe.
- 2. Hybrid Vision-Text Extraction: Each cropped menu section was then processed individually for maximum accuracy.
  - Azure OCR performed an initial text extraction on the image patch.
  - Both the image patch and the raw OCR text were then fed to GPT-4 Vision. Providing both the visual context (the image) and the text improved the model's ability to correctly interpret items and prices, especially in noisy images.
- 3. Structuring and Aggregation: GPT-4 Vision was prompted to return a structured JSON object for each menu section, containing the item name, price, and category. The system then aggregated the results from all sections into a single, complete JSON representation of the entire menu.
- The Application: This pipeline was integrated into an internal web tool where a user could upload a menu image and, after a few moments of processing, download a structured Excel file, removing the need for manual data entry.
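The three pipeline steps above could be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the project's actual code: `Box`, `clamp_box`, `build_prompt`, `parse_section`, and `process_menu` are hypothetical names, the JSON shape in the prompt is an assumed schema, and `run_ocr` and `ask_model` are placeholder callables standing in for the Azure OCR and GPT-4 Vision calls, whose real client code is not shown here.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Box:
    """A detected menu section in pixel coordinates (hypothetical shape)."""
    x1: int
    y1: int
    x2: int
    y2: int


def clamp_box(box: Box, width: int, height: int) -> Box:
    """Clip a detection to the image bounds so cropping stays in range."""
    return Box(
        max(0, box.x1), max(0, box.y1),
        min(width, box.x2), min(height, box.y2),
    )


def build_prompt(ocr_text: str) -> str:
    """Compose the text half of the hybrid request.

    The image patch itself would be sent alongside this prompt.
    The JSON schema below is an assumption made for illustration.
    """
    return (
        "Extract every menu item from the attached image. "
        "Raw OCR text is provided as a hint and may contain errors.\n"
        'Return JSON: {"category": str, '
        '"items": [{"name": str, "price": number}]}\n\n'
        f"OCR text:\n{ocr_text}"
    )


def parse_section(reply: str) -> list[dict]:
    """Turn one section's JSON reply into flat rows; skip malformed replies."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, dict):
        return []
    category = data.get("category", "")
    return [
        {"category": category, "name": item["name"], "price": item["price"]}
        for item in data.get("items", [])
        if isinstance(item, dict) and "name" in item and "price" in item
    ]


def process_menu(
    boxes: list[Box],
    width: int,
    height: int,
    run_ocr: Callable[[Box], str],         # placeholder for the OCR call
    ask_model: Callable[[Box, str], str],  # placeholder for the vision model call
) -> list[dict]:
    """Run each detected section through OCR and the model, then aggregate."""
    rows: list[dict] = []
    # Top-to-bottom, then left-to-right: a simplification that may not
    # match the true reading order of every multi-column menu.
    for box in sorted(boxes, key=lambda b: (b.y1, b.x1)):
        patch = clamp_box(box, width, height)
        ocr_text = run_ocr(patch)
        reply = ask_model(patch, build_prompt(ocr_text))
        rows.extend(parse_section(reply))
    return rows
```

Flat rows with `category`, `name`, and `price` keys map directly onto spreadsheet columns, which is one plausible way to feed the Excel export. Injecting the OCR and model calls as functions also keeps the aggregation logic testable without network access.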
Results and Impact
This project automated a highly manual workflow by breaking the problem into stages instead of relying on a single model.
- Eliminated Manual Data Entry: The primary goal was achieved, freeing staff from the tedious task of typing out menus by hand.
- Higher Accuracy: The section-based, hybrid vision-text approach proved more accurate than running a single AI model on the entire image, particularly on complex layouts and poor-quality images.
- Practical, User-Focused Tool: The final output was a ready-to-use Excel file, fitting seamlessly into the existing business workflow.