building mobile fruit recognition for an educational app
deploying machine learning models to mobile devices takes more than high accuracy on a validation set. here's how i built an educational android app that detects 21 types of fruit with 96% validation accuracy while running smoothly on mid-range devices.
the problem space
during my final year research project, i identified a gap in educational technology for elementary students. traditional methods of teaching fruit recognition—picture books and flashcards—lacked the interactivity that digital-native children expect. meanwhile, most educational apps were either too simple (static images) or too complex (requiring constant internet connectivity).
the challenge was clear: build a mobile app that could recognize fruit in real-time, offline, on devices that elementary schools and parents could actually afford. this meant optimizing for constraints that academic papers often ignore—battery life, storage space, inference speed, and user experience for 7-year-olds.
architecture decisions
the first critical decision was choosing the right model architecture. i evaluated three candidates: resnet50, efficientnet-b0, and mobilenetv2. while resnet and efficientnet offered higher theoretical accuracy, mobilenetv2 provided the optimal balance for mobile deployment.
here's why mobilenetv2 won:
- model size - 3.5mb after int8 quantization vs. 25mb+ for alternatives
- inference speed - 500ms per frame on mid-range android devices
- accuracy trade-off - 96.28% validation accuracy, only 2-3% lower than heavier models
- tensorflow lite support - first-class optimization for mobile deployment
- depthwise separable convolutions - reduced computational overhead without sacrificing feature extraction quality
for an educational app where response time directly impacts user engagement, the 2-3% accuracy trade-off was worth the 5x speed improvement and 7x size reduction.
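to make the depthwise separable convolution point concrete, here is a minimal sketch of the weight-count arithmetic. the layer shape (3x3 kernel, 128 channels in and out) is an illustrative assumption, not a layer taken from the trained model, and biases are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Illustrative layer shape (assumption, not from the trained model).
k, c_in, c_out = 3, 128, 128
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 1))  # 147456 17536 8.4
```

for this one hypothetical layer the separable form needs roughly 8x fewer weights, which is the kind of saving that keeps the whole network small.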
system architecture overview
the complete pipeline from dataset to android deployment involves multiple stages, each with specific optimization decisions. the architecture prioritizes mobile constraints at every stage—from dataset augmentation strategies to quantization for deployment. each decision trades theoretical accuracy for practical usability on resource-constrained devices.
the data pipeline begins with 126k images split 80/20 for training and validation. training data flows through imagedatagenerator with rotation, shift, zoom, and brightness augmentation, while validation data receives only normalization to the [-1, 1] range. both streams feed into the mobilenetv2 base model with frozen imagenet weights, followed by global average pooling, batch normalization, and two dense layers (512 and 256 neurons) with dropout regularization before the final 21-class softmax output.
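the layer stack described above could be sketched in keras roughly as follows. this is not the project's actual code: the dropout rate, the relu activations, and the function name `build_model` are assumptions, since the article only names the layer types and sizes.

```python
NUM_CLASSES = 21
INPUT_SHAPE = (224, 224, 3)
DENSE_UNITS = (512, 256)
DROPOUT_RATE = 0.3  # assumption: the article does not state the dropout rate


def build_model():
    """Frozen MobileNetV2 base plus the classification head described above."""
    # Imported here so the constants can be inspected without TensorFlow installed.
    import tensorflow as tf

    layers = tf.keras.layers
    base = tf.keras.applications.MobileNetV2(
        input_shape=INPUT_SHAPE, include_top=False, weights="imagenet"
    )
    base.trainable = False  # frozen ImageNet weights for the first training phase

    inputs = tf.keras.Input(shape=INPUT_SHAPE)
    x = base(inputs, training=False)  # keep the base's batch-norm statistics fixed
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.BatchNormalization()(x)
    for units in DENSE_UNITS:
        x = layers.Dense(units, activation="relu")(x)  # relu is an assumption
        x = layers.Dropout(DROPOUT_RATE)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```

calling `build_model()` downloads the imagenet weights on first use, so it needs tensorflow installed and network access.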
training uses the adam optimizer with gradient clipping and several callbacks: early stopping, learning rate reduction, model checkpointing, and tensorboard logging. after initial training converges at 27 epochs, fine-tuning unfreezes the last 30 mobilenetv2 layers with a reduced learning rate for additional refinement. the resulting 14mb keras model then undergoes int8 quantization using a representative dataset, compressing to 3.5mb (a 75% reduction) while staying within about 0.3% of the 96.28% validation accuracy.
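a hedged sketch of that training setup is below. the learning rates, patience values, clipping norm, and loss are assumptions (the article gives none of them), and the helper names are hypothetical. only the "last 30 layers" figure and the list of callbacks come from the text.

```python
INITIAL_LR = 1e-3     # assumption: the starting learning rate is not stated
FINE_TUNE_LR = 1e-5   # assumption: "reduced learning rate" is not quantified
UNFREEZE_LAST_N = 30  # from the article


def layers_to_unfreeze(num_layers: int, last_n: int = UNFREEZE_LAST_N) -> range:
    """Indices of the trailing base-model layers made trainable for fine-tuning."""
    return range(max(num_layers - last_n, 0), num_layers)


def make_callbacks(checkpoint_path: str, log_dir: str):
    """The four callbacks named above; patience and factor values are assumptions."""
    import tensorflow as tf

    cb = tf.keras.callbacks
    return [
        cb.EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
        cb.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
        cb.ModelCheckpoint(checkpoint_path, monitor="val_accuracy", save_best_only=True),
        cb.TensorBoard(log_dir=log_dir),
    ]


def compile_model(model, learning_rate: float):
    """Adam with gradient clipping; clipnorm=1.0 is an assumed value."""
    import tensorflow as tf

    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate, clipnorm=1.0),
        loss="categorical_crossentropy",
        metrics=["accuracy", tf.keras.metrics.TopKCategoricalAccuracy(k=3)],
    )


def start_fine_tuning(model, base):
    """Unfreeze only the last layers of the base, then recompile at a lower rate."""
    base.trainable = True
    trainable = set(layers_to_unfreeze(len(base.layers)))
    for index, layer in enumerate(base.layers):
        layer.trainable = index in trainable
    compile_model(model, FINE_TUNE_LR)  # recompile so the trainable flags take effect
```

the usual order would be `compile_model(model, INITIAL_LR)`, a first `model.fit(...)` with the callbacks, then `start_fine_tuning(model, base)` and a second `fit`.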
deployment to android integrates the tflite model with camerax for real-time capture, preprocessing (resize + normalize), and inference at roughly 500ms per frame. predictions that exceed the 70% confidence threshold trigger ui display via jetpack compose, completing the end-to-end pipeline from raw camera input to educational content delivery.
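the normalize step is simple enough to show directly. this sketch uses python only to illustrate the arithmetic (the app itself runs on android), assumes 8-bit rgb input, and uses hypothetical function names. it matches the [-1, 1] convention that mobilenetv2 preprocessing expects.

```python
def normalize_pixel(value: float) -> float:
    """Map an 8-bit channel value in [0, 255] to the range [-1.0, 1.0]."""
    return value / 127.5 - 1.0


def normalize_image(pixels):
    """Normalize an image given as rows of (r, g, b) tuples."""
    return [[tuple(normalize_pixel(c) for c in px) for px in row] for row in pixels]


# A tiny 1 x 2 "image": one black pixel and one white pixel.
print(normalize_image([[(0, 0, 0), (255, 255, 255)]]))
# [[(-1.0, -1.0, -1.0), (1.0, 1.0, 1.0)]]
```

the same scaling has to be applied on-device as during training, otherwise the model sees inputs from a different distribution.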
dataset engineering
model performance starts with data quality. i curated a dataset of 126,219 images across 21 fruit classes—apples, bananas, mangoes, dragon fruit, and 17 others common in indonesia.
the dataset distribution was intentionally imbalanced to reflect real-world usage. apples (22,529 images) and pears (16,623 images) received more samples because they have higher visual variance—different colors, sizes, and varieties. meanwhile, durian (1,026 images) needed fewer samples due to its distinctive spiky texture that's easy to recognize.
key dataset decisions:
- 80/20 train-validation split - 100,920 training images, 25,299 validation images
- augmentation strategy - rotation, brightness adjustment, horizontal flip. deliberately avoided vertical flip since fruit orientation matters
- preprocessing pipeline - resize to 224x224, normalize pixels to [-1, 1] range
- class weighting - applied inverse frequency weighting to prevent model bias toward overrepresented classes
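one common way to compute inverse frequency weights is total / (classes x count), sketched below. the exact formula used in the project is not stated, so treat this as an assumption. only the three class counts mentioned earlier are included, which means the resulting numbers are illustrative rather than the real training weights.

```python
def inverse_frequency_weights(counts: dict[str, int]) -> dict[str, float]:
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * n) for name, n in counts.items()}


# Only these three counts appear in the article; the other 18 classes are omitted.
counts = {"apple": 22_529, "pear": 16_623, "durian": 1_026}
weights = inverse_frequency_weights(counts)

for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

rare classes such as durian get a weight well above 1 and frequent ones such as apple fall below 1. in keras, weights like these are passed to `model.fit` through the `class_weight` argument, keyed by class index rather than name.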
training and optimization
training took 27 epochs with early stopping based on validation loss. the model converged smoothly—training accuracy increased from 62.93% (epoch 1) to 95.65% (epoch 27), while validation accuracy peaked at 96.28% on epoch 26.
the small gap between training and validation accuracy (95.65% vs 96.28%) indicated good generalization without overfitting. validation accuracy sitting slightly above training accuracy is expected here, since augmentation and dropout apply only during training. top-3 accuracy reached 99.14%, meaning the correct fruit appeared among the top 3 predictions about 99% of the time, which is critical for building user confidence even when the top prediction isn't perfect.
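for readers unfamiliar with the metric, here is a minimal sketch of how top-k accuracy is computed. the probabilities and labels are toy values over 4 classes, not outputs from the fruit model.

```python
def top_k_accuracy(probabilities, labels, k=3):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for probs, label in zip(probabilities, labels):
        ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(labels)


# Toy predictions over 4 classes (not real model output).
probs = [
    [0.70, 0.20, 0.05, 0.05],  # true label 0: correct at top-1
    [0.10, 0.50, 0.30, 0.10],  # true label 2: ranked 2nd, so only a top-3 hit
    [0.40, 0.30, 0.20, 0.10],  # true label 3: ranked last, a miss even at k=3
    [0.05, 0.15, 0.20, 0.60],  # true label 3: correct at top-1
]
labels = [0, 2, 3, 3]

print(top_k_accuracy(probs, labels, k=1))  # 0.5
print(top_k_accuracy(probs, labels, k=3))  # 0.75
```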
post-training optimization involved int8 quantization, which reduced model size from 14mb to 3.5mb (75% reduction) with only 0.3% accuracy loss. this compression was essential for app distribution—users won't download a 50mb+ app for a simple educational tool.
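a sketch of post-training int8 quantization with the tensorflow lite converter follows. the function names and the shape of `representative_images` are assumptions. the article does not say whether model input and output were also converted to integers, so this version leaves them at the converter's default.

```python
def size_reduction(before_mb: float, after_mb: float) -> float:
    """Fractional size reduction, e.g. 0.75 for a 75% smaller file."""
    return (before_mb - after_mb) / before_mb


def quantize_to_int8(model, representative_images):
    """Convert a Keras model to an int8-quantized TFLite flatbuffer (bytes)."""
    # Imported here so size_reduction can be used without TensorFlow installed.
    import tensorflow as tf

    def representative_dataset():
        # Each sample is assumed to be a float32 array shaped (1, 224, 224, 3),
        # preprocessed exactly as during training.
        for image in representative_images:
            yield [image]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Restrict the model to int8 kernels; conversion fails if an op lacks one.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    return converter.convert()


print(size_reduction(14.0, 3.5))  # 0.75
```

the returned bytes can be written to a `.tflite` file and bundled in the app's assets. the representative dataset only needs a modest sample of real images, since it is used to calibrate activation ranges rather than to train.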
mobile implementation challenges
deploying the model to android revealed constraints that don't show up in jupyter notebooks. the app needed to handle camera permissions, real-time inference, ui responsiveness, and battery efficiency simultaneously.
technical implementation stack:
- jetpack compose - modern declarative ui for building child-friendly interfaces
- camerax - consistent camera api across android versions and manufacturers
- tensorflow lite interpreter - optimized ml runtime for mobile devices
- kotlin coroutines - non-blocking inference to prevent ui freezing
the inference pipeline runs on a background thread, throttled to one inference every 500ms to prevent excessive battery drain. when the confidence score exceeds 70%, the app displays a modal with the fruit name, confidence percentage, and educational content (vitamin content, fun facts, health benefits).
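the throttling and threshold logic can be sketched independently of android. the app itself is written in kotlin, so the python below only illustrates the control flow. the class and function names are hypothetical, and only the 500ms interval and 70% threshold come from the text.

```python
import time

CONFIDENCE_THRESHOLD = 0.70  # from the article
MIN_INTERVAL_S = 0.5         # from the article: 500ms throttling


class InferenceThrottle:
    """Allows at most one inference per interval; other frames are skipped."""

    def __init__(self, min_interval_s=MIN_INTERVAL_S, clock=time.monotonic):
        self._min_interval = min_interval_s
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_run = None

    def should_run(self) -> bool:
        now = self._clock()
        if self._last_run is not None and now - self._last_run < self._min_interval:
            return False
        self._last_run = now
        return True


def pick_result(scores: dict[str, float], threshold: float = CONFIDENCE_THRESHOLD):
    """Return (label, confidence) for the best class, or None if below threshold."""
    label = max(scores, key=scores.get)
    confidence = scores[label]
    return (label, confidence) if confidence > threshold else None


print(pick_result({"mango": 0.78, "papaya": 0.15}))   # ('mango', 0.78)
print(pick_result({"mango": 0.55, "avocado": 0.45}))  # None
```

returning `None` for low-confidence frames means the ui simply shows nothing, which is less confusing for a young user than a wrong label.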
real-world validation
lab accuracy means nothing if the app fails in actual use. i conducted field testing at two retail locations—midifresh alfa tower and aeon mall alam sutera—to validate performance under real lighting conditions and fruit variations.
results from the 16 fruit types tested (13 are listed below):
- perfect detection (100%) - banana, pear, pineapple, melon, salak. distinctive textures and shapes
- high confidence (95-99%) - orange (97%), strawberry (98%), durian (98%), tomato (97%)
- moderate confidence (78-90%) - mango (78%), papaya (78%), avocado (80%), apple (82%)
the lower scores for mango, papaya, and avocado revealed an important limitation: smooth, reflective surfaces with similar green-to-orange gradients confused the model. this makes sense—even humans struggle to distinguish unripe mango from avocado at a glance.
viewpoint sensitivity was another discovery. detection accuracy dropped significantly when photographing fruit from the side rather than top-down. this is a known challenge in computer vision—viewpoint invariance requires either multi-view training data or 3d augmentation techniques.
designing for children
technical performance is only half the equation. the app needed to be usable by 7-year-olds with minimal reading ability and short attention spans.
ux decisions based on child development research:
- single-screen interface - no complex navigation. camera preview fills the entire screen
- instant feedback - detection results appear within 500ms to maintain engagement
- large fonts (18sp+) - readable without squinting on small screens
- high contrast colors - bright, saturated colors that appeal to children
- auto-hide modals - result cards disappear after 3 seconds to prevent confusion
- fun facts - "apples float in water because they contain air!" keeps learning playful
user acceptance testing with 14 second-grade students at sd kristen 04 eben haezer salatiga validated these decisions. 100% found the app enjoyable, 92.9% (13 of 14) found it easy to use, and 100% said they learned fruit names and benefits. more importantly, observing children use the app revealed unexpected behaviors: they got excited when detecting manggis (mangosteen) after learning it's called the "queen of fruits," and associated pineapple texture with spongebob squarepants.
intellectual property protection
beyond technical implementation, i registered the app with indonesia's directorate general of intellectual property (hki). this wasn't just bureaucratic paperwork—it demonstrated understanding that software is intellectual property with commercial potential.
the hki registration process required comprehensive documentation: technical architecture, source code samples, user manual, and copyright transfer agreements. completing this process taught me how to position software as a product, not just a project.
lessons learned
building mobile ml systems requires different thinking than academic research or web applications. here's what mattered most:
- constraints drive architecture - mobilenetv2 wasn't the most accurate model, but it was the right model for the constraints
- quantization is non-negotiable - 75% size reduction with minimal accuracy loss makes deployment feasible
- field testing reveals truth - lab accuracy doesn't predict real-world performance under varied lighting and viewpoints
- ux matters as much as accuracy - 96% accuracy means nothing if children can't figure out how to use the app
- domain knowledge is critical - understanding child cognitive development shaped every design decision
future improvements
if i were to iterate on this project, three areas would get priority. first, multi-view training with 3d augmentation to handle viewpoint variance. second, trying the efficientformer architecture, which recent research reports can reach mobilenet-level speed with higher accuracy. third, adding voice feedback for pre-readers and gamification elements (point systems, achievement badges) to increase long-term engagement.
the technical foundation is solid. the model generalizes well, the app runs smoothly, and users find it valuable. these improvements would push it from "functional educational tool" toward a more polished learning product.
why this matters
this project taught me that building ml systems isn't just about maximizing accuracy metrics. it's about understanding constraints, making informed trade-offs, validating with real users, and shipping software that actually works in the real world.
the gap between "96% validation accuracy" and "children successfully learning fruit names" is filled with engineering decisions that academic papers rarely discuss. that's the work that matters.