# GELab-Zero: Android automation framework for multimodal LLMs
## News

- 🎁 [Coming Soon...]
- 🎁 [2025-12-12] MCP-Server ready:
  - Step 1: Start the MCP server to support multi-device management and task distribution: `python mcp_server/detailed_gelab_mcp_server.py`
  - Step 2: Import the MCP tools in Chatbox.
- 🎁 [2025-12] We thank the following projects and authors for providing quantization tools & tutorials: GGUF_v1, GGUF_v2, EXL3, Tutorials_CN, Tutorials_EN
- 🎁 [2025-11] We release a lightweight 4B model on Hugging Face and Model Scope.
- 🎁 [2025-11] We release the tasks from the AndroidDaily benchmark.
- 🎁 [2025-11] We release the current GELab-Zero engineering infrastructure.
- 🎁 [2025-10] Our research paper on GELab-Engine is accepted by NeurIPS 2025.

## 📑 Table of Contents

- 📖 Background
- 🎥 Application Demonstrations
- 📊 AndroidDaily
- 🏆 Open Benchmark
- 🚀 Installation & Quick Start
- 📝 Citation
- 📧 Contact

## 📖 Background

As AI experiences continue to penetrate consumer-grade terminal devices, mobile Agent research is at a critical juncture transitioning from "feasibility verification" to "large-scale application."
GUI-based solutions have emerged as the most practical approach at the current stage for handling complex mobile ecosystems and scaling Agent capabilities: they work with any app and require no adaptation from app vendors. However, because mobile application ecosystems are highly fragmented, getting GUI Agents to work across different brands and device models raises many engineering challenges: multi-device ADB connections, dependency installation, permission configuration, inference service deployment, and task recording and replay. Agent developers and MCP users therefore spend substantial effort on engineering infrastructure, which makes it hard to focus on strategic innovation.

To address this challenge, we are open-sourcing GELab-Zero to accelerate the innovation and deployment of GUI Agents. It consists of two main components:

- A plug-and-play, complete inference engineering infrastructure that handles all the heavy lifting
- A 4B GUI Agent model capable of running on a local computer

It provides a one-click launch experience similar to open-source GUI Agent MCP, can be deployed entirely locally, and puts the entire inference pipeline under your complete control. Specific capabilities include:

- Local Deployment: Supports 4B-scale models running on consumer-grade hardware, balancing low latency with privacy.
- One-click Launch: Provides a unified deployment pipeline that automatically handles environment dependencies and device management.
- Task Distribution: Can distribute tasks to multiple phones while recording interaction trajectories for observability and reproducibility.
- Three Agent Modes: Covers multiple working modes, including ReAct loops, multi-agent collaboration, and scheduled tasks.

These capabilities enable GELab-Zero to flexibly handle complex task flows in real-world scenarios and provide a solid foundation for future extensions.
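To make the multi-device ADB and task-distribution ideas above concrete, here is a minimal sketch of how connected devices could be discovered and tasks spread across them. It is not GELab-Zero's actual implementation: the function names `parse_adb_devices`, `list_connected_devices`, and `assign_round_robin` are hypothetical, and round-robin assignment is an assumed policy chosen only for illustration. The only external fact relied on is the standard `adb devices` output, where each ready device appears as a serial followed by the state `device`.

```python
import subprocess
from itertools import cycle


def parse_adb_devices(output: str) -> list[str]:
    """Return serials of devices that `adb devices` reports as ready."""
    serials = []
    for line in output.splitlines():
        parts = line.split()
        # Ready devices look like "<serial>\tdevice"; header lines and
        # "unauthorized"/"offline" entries are skipped.
        if len(parts) >= 2 and parts[1] == "device":
            serials.append(parts[0])
    return serials


def list_connected_devices() -> list[str]:
    """Run `adb devices` and return the serials of ready devices."""
    result = subprocess.run(
        ["adb", "devices"], capture_output=True, text=True, check=True
    )
    return parse_adb_devices(result.stdout)


def assign_round_robin(tasks: list[str], serials: list[str]) -> dict[str, list[str]]:
    """Spread tasks over devices in turn (hypothetical policy)."""
    if not serials:
        raise ValueError("no ready Android devices found")
    assignment: dict[str, list[str]] = {serial: [] for serial in serials}
    for task, serial in zip(tasks, cycle(serials)):
        assignment[serial].append(task)
    return assignment
```

With a phone attached, `assign_round_robin(tasks, list_connected_devices())` would give a per-device task list that a runner could then execute and log one trajectory at a time.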
For Agent developers, this infrastructure enables rapid testing of new ideas and strategies and validation of interaction approaches; for enterprise users, it can be reused directly to integrate MCP capabilities into products quickly.

## 🎥 Application Demonstrations

Each demonstration below has an accompanying demo video.

- Recommendation - Sci-Fi Movies. Task: Help me find any good recent sci-fi movies
- Recommendation - Travel Destination. Task: Help me find a place where I can take my kids on the weekend
- Practical Task - Claim Subsidy. Task: Claim meal vouchers on the enterprise welfare platform
- Practical Task - Metro Line Query. Task: Check if Metro Line 1 is operating normally, then navigate to the nearest entrance of Line 1 metro station
- Complex Task - Multi-Item Shopping. Task: Go to the nearest Hema Fresh Store on Ele.me and purchase: Red strawberries 300g, Peruvian Bianca blueberries 125g (18mm diameter), seasonal fresh yellow potatoes 500g, sweet baby pumpkin 750g, Hema large grain shrimp sliders, 2 bottles of Hema pure black soy milk 300ml, Little Prince macadamia nut cocoa crisp 120g, Hema spinach noodles, Hema five-spice beef, 5 bags of Haohuan snail Liuzhou river snail rice noodles (extra spicy extra smelly) 400g, m&m's milk chocolate beans 100g
- Complex Task - Information Retrieval. Task: Search for 'how to learn financial management' on Zhihu and view the first answer with over 10k likes
- Complex Task - Conditional Search. Task: Find a pair of white canvas shoes in size 37 on Taobao, priced under 100 yuan, then add the first item that meets the criteria to favorites
- Complex Task - Online Quiz. Task: Go to Baicizhan and help me complete the vocabulary learning task

## 📊 AndroidDaily: A Self-Built Benchmark Close to Daily Life

Current mainstream benchmarks mostly focus on productivity
applications (such as email), but users' daily high-frequency usage is dominated by lifestyle service applications (such as food delivery, ride-hailing, social media, and payments), and these scenarios better reflect the practical value of current GUI Agents. To this end, we propose AndroidDaily: a multi-dimensional dynamic benchmark for the real world. We focus on empirical analysis of six core dimensions of modern life (food, transportation, shopping, housing, information consumption, entertainment), prioritizing the popular applications that dominate these categories. As a result, tasks in the benchmark produce real-world outcomes (such as transaction payments and service bookings) and tightly link online actions to offline services.

To balance evaluation comprehensiveness and execution efficiency, AndroidDaily adopts two evaluation modes:

### Static Testing

Contains 3146 actions. Each test case provides a task description and step-by-step screenshots, and the Agent must predict the action type and action value (such as click coordinates or input text) for each step; the mode primarily evaluates numerical accuracy. This method requires no complex engineering infrastructure and enables rapid, cost-effective, large-scale model iteration and testing.

The action type distribution in static testing is as follows (total 3146 actions):

- CLICK: 1354 times - Click operations
- COMPLETE: 410 times - Task completion
- AWAKE: 528 times - App activation
- TYPE: 371 times - Text input
- INFO: 305 times - Information query
- WAIT: 85 times - Wait operations
- SLIDE: 93 times - Slide operations

AndroidDaily Static Benchmark Results

| Model | Accuracy |
| --- | --- |
| GPT-4o | 0.196 |
| Gemini-2.5-pro-thinking | 0.366 |
| UI-TARS-1.5 | 0.470 |
| GELab-Zero-4B-preview | 0.734 |

### End-to-End Benchmark

Contains 235 tasks. Conducted in a fully functional test environment (such as real devices or emulators), the Agent needs to autonomously execute tasks from start to finish, with overall task success rate as the evaluation metric.
This setup has the highest ecological validity and most directly reflects the Agent's overall capabilities in complex environments.

The scenario distribution in the end-to-end benchmark is as follows:

- Transportation: 78 tasks (33.19%) - Ride-hailing, navigation, public transit, etc.
- Shopping: 61 tasks (25.96%) - E-commerce shopping, payment, order management, etc.
- Social Communication: 43 tasks (18.30%) - Messaging, social interactions, etc.
- Content Consumption: 37 tasks (15.74%) - News reading, video watching, content bookmarking, etc.
- Local Services: 16 tasks (6.81%) - Food delivery, on-site services, etc.

Typical tasks include ride-hailing, shopping, message sending, content bookmarking, and food delivery ordering. GELab-Zero-4B-preview achieves a 75.86% success rate on AndroidWorld testing, demonstrating excellent performance on complex mobile tasks.

## 🏆 Open Benchmark

We evaluated the GELab-Zero-4B-preview model on multiple open-source benchmarks covering GUI understanding, localization, and interaction. The comparison results with other open-source models are shown below:

The benchmark results show that GELab-Zero-4B-preview performs strongly across multiple open-source benchmarks, with particularly good results in real mobile scenarios (AndroidWorld), which indicates strong capability in practical applications.

## 🚀 Installation & Quick Start

End-to-end inference requires just a few simple steps:

1. Set up the LLM inference environment (ollama or vllm)
2. Set up the Android device execution environment (adb configuration) and enable developer mode
3. Set up the Agent runtime environment (gelab-zero one-click deployment script)
4. Set up the trajectory visualization environment (optional)

The third-party infrastructure dependencies mentioned above are mature, so setup should be straightforward. We assume you have Python 3.12+ installed and are comfortable with basic command-line operations.
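The prerequisites named above (Python 3.12+, `adb`, and either ollama or vllm) can be checked before starting with a small preflight script. This is only a sketch under stated assumptions, not part of the GELab-Zero deployment script: the function name `preflight` is hypothetical, and it assumes that `adb` and at least one of the `ollama` or `vllm` commands are available on the local `PATH`, which will not hold if the inference service runs on another machine or in a container.

```python
import shutil
import sys

MIN_PYTHON = (3, 12)  # minimum version assumed by the setup steps


def preflight(version_info=sys.version_info, which=shutil.which,
              required_tools=("adb",), llm_runtimes=("ollama", "vllm")):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Compare only (major, minor) against the required minimum.
    if tuple(version_info[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ required, "
            f"found {version_info[0]}.{version_info[1]}"
        )
    # Every required tool must be on PATH.
    for tool in required_tools:
        if which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    # Assumption: one locally installed LLM runtime is enough.
    if not any(which(runtime) is not None for runtime in llm_runtimes):
        problems.append(
            "no LLM runtime found on PATH (looked for: "
            + ", ".join(llm_runtimes) + ")"
        )
    return problems
```

Calling `preflight()` with no arguments inspects the current interpreter and `PATH`; printing each returned string gives a quick checklist of what still needs installing.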
If you have not installed a Python environment yet, please refer to Step 0 for installation.

### Step 0: Python Environment Setup

If Python 3.12+ is not yet installed, follow the steps below. We recommend Miniforge for installing and managing Python environments because it is commercially friendly and cross-platform. Official website: https://github.com/conda-forge/miniforge

**Windows users (PowerShell is required):**

Download and manually install Miniforge, following the Install section at https://github.com/conda-forge/miniforge. During installation, check the option to add Conda to the PATH environment variable so that Conda can be activated properly.

After installation, activate Conda. Open PowerShell and enter the following commands:

    # Activate Conda in PowerShell
    conda init powershell
    # Allow Conda scripts to run on PowerShell startup
    Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Activation succeeded if "(base)" appears at the beginning of the latest line in the terminal.

We recommend VS Code for code execution and debugging. Download and install it from the official website: https://code.visualstudio.com/

**macOS and Linux users:**

Download and install Miniforge from the command line:

    curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
    bash Miniforge3-$(uname)-$(uname -m).sh

After installation, create and activate a new Python environment:

    conda create -n gelab-zero python=3.12 -y
    conda activate gelab-zero

### Step 1: LLM Inference Environment Setup

We have verified two mainstream local LLM inference deployment methods: ollama and vllm. Individual users are recommended to use ollama, while enterprise users and those with some technical background can choose vllm for more stable inference