A futuristic scene of three Japanese cars—Prius, Fit, and Note—driving through a smart city, with glowing circuit diagrams and the word “AI” in the background. This symbolizes the launch of our AI-powered car model recognition project.
🔍 Why Collect Images from Used Car Sites?
To train an AI to accurately identify car models, a large and diverse set of real-world images is essential. That’s why I focused on used car listing websites.
Reasons:
- Numerous real-world photos are available per car model
- Variety in angle, background, and lighting conditions
- Searching by model name returns photos that are easier to treat as labeled data than results from a generic image search
Thanks to this approach, I was able to gather practical, realistic data for training purposes.
⚙️ Technologies and Setup
- Language: Python 3.x
- Scraping Tools: Selenium + webdriver-manager
- Browser Automation: Chrome (headless mode)
- Image Downloading: urllib.request
- Preprocessing: File size-based filtering of junk images
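As a rough illustration of the setup listed above, here is a minimal sketch of starting headless Chrome through Selenium and webdriver-manager. The helper names `build_chrome_args` and `make_driver`, the window size, and the extra flags are assumptions for illustration, not the exact configuration used in this project.

```python
def build_chrome_args(headless=True):
    """Return Chrome command-line flags for a scraping session."""
    args = ["--window-size=1280,1024", "--disable-gpu"]
    if headless:
        # "--headless=new" selects Chrome's newer headless mode;
        # older Chrome builds may need plain "--headless" instead.
        args.append("--headless=new")
    return args


def make_driver(headless=True):
    """Start Chrome with a driver binary fetched by webdriver-manager."""
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    for arg in build_chrome_args(headless):
        options.add_argument(arg)
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```

Calling `driver = make_driver()` should give a browser session ready for `driver.get(...)`; remember to call `driver.quit()` when the run finishes.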
🧠 Centralized Car List Management
I kept each car's Japanese name, English name, and search keyword together in a single list:
car_list = [
    {"jp_name": "トヨタ プリウス", "en_name": "Toyota Prius", "keyword": "トヨタ プリウス site:example.com"},
    {"jp_name": "ホンダ フィット", "en_name": "Honda Fit", "keyword": "ホンダ フィット site:example.com"},
    ...
]
Note: Replace “example.com” with the actual domain of the used car site you’re scraping.
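One way a central list like this can pay off is by driving both the search and the output folder layout from the same entries. The sketch below assumes a hypothetical `dataset/` root folder and two hypothetical helpers, `folder_name` and `plan_jobs`; neither is part of the original script.

```python
import re
from pathlib import Path

car_list = [
    {"jp_name": "トヨタ プリウス", "en_name": "Toyota Prius",
     "keyword": "トヨタ プリウス site:example.com"},
    {"jp_name": "ホンダ フィット", "en_name": "Honda Fit",
     "keyword": "ホンダ フィット site:example.com"},
]


def folder_name(en_name):
    """Turn 'Toyota Prius' into a filesystem-friendly 'toyota_prius'."""
    return re.sub(r"[^a-z0-9]+", "_", en_name.lower()).strip("_")


def plan_jobs(cars, root="dataset"):
    """Pair each search keyword with the folder its images should go to."""
    return [(car["keyword"], Path(root) / folder_name(car["en_name"]))
            for car in cars]
```

Using the English name for folders keeps paths ASCII-only, while the Japanese keyword is still what gets typed into the site search.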
📸 Image Collection Workflow
- Access the target used car site
- Search for each car model and scroll through results
- Extract image URLs
- Download images with Python
- Filter out small (e.g. banner) images
Example:
import os
# Files under about 5 KB are usually banners or icons, not car photos
if os.path.getsize(filepath) < 5000:
    os.remove(filepath)
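The download and size-filter steps of the workflow can be sketched roughly as follows. The function names, the `User-Agent` header, and the timeout are assumptions for illustration; only `urllib.request` and the 5000-byte threshold come from the setup described here.

```python
import os
import urllib.request


def download_image(url, filepath, timeout=10):
    """Fetch one image URL and write the bytes to filepath."""
    # Assumption: some sites reject requests without a browser-like User-Agent.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = response.read()
    with open(filepath, "wb") as f:
        f.write(data)


def remove_small_files(directory, min_bytes=5000):
    """Delete files smaller than min_bytes and return the names removed."""
    removed = []
    for name in sorted(os.listdir(directory)):
        filepath = os.path.join(directory, name)
        if os.path.isfile(filepath) and os.path.getsize(filepath) < min_bytes:
            os.remove(filepath)
            removed.append(name)
    return removed
```

Running the filter once per folder after all downloads finish, rather than after each file, keeps the download loop simple and makes the threshold easy to tune.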
🧼 Filtering Out Unusable Images
Beyond the file size filter, I used a separate script to remove:
- Images without cars
- Blurry or irrelevant images
This ensured the dataset remained clean and useful for training.
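The separate cleanup script is not shown in this post. As one possible approach to the blur check only, the sketch below scores sharpness with the variance of a Laplacian filter, a common heuristic. It works on a plain 2D list of grayscale values so it runs without extra libraries; the threshold of 100 is an assumed starting point that would need tuning on real photos, and detecting images without cars would require a separate detector.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over a 2D list of grayscale values."""
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1]
                   + gray[y][x + 1] - 4 * gray[y][x])
            responses.append(lap)
    if not responses:
        return 0.0  # image too small to have interior pixels
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def is_blurry(gray, threshold=100.0):
    """Flag an image as blurry when its edge response is weak."""
    return laplacian_variance(gray) < threshold
```

Sharp photos have many strong edges, so their Laplacian values vary widely; blurred photos produce values clustered near zero and fall under the threshold.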
💡 Optimization Highlights
- Adjustable max_images setting
- Looped pagination to collect more samples
- Random wait times to reduce bot detection risk
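These three ideas fit naturally into a single collection loop. In the sketch below, `fetch_page` is a hypothetical callable that returns the image URLs found on one results page (in practice it would wrap the Selenium logic), and the page limit and 1 to 3 second wait range are assumed values.

```python
import random
import time


def collect_image_urls(fetch_page, max_images=100, max_pages=20,
                       wait_range=(1.0, 3.0), sleep=time.sleep):
    """Walk result pages until max_images unique URLs are collected."""
    urls = []
    seen = set()
    for page in range(1, max_pages + 1):
        page_urls = fetch_page(page)  # hypothetical: image URLs for one page
        if not page_urls:
            break  # no more results
        for url in page_urls:
            if url not in seen:
                seen.add(url)
                urls.append(url)
                if len(urls) >= max_images:
                    return urls
        # Pause for a random interval so requests are not evenly spaced.
        sleep(random.uniform(*wait_range))
    return urls
```

Passing `sleep` as a parameter makes the loop easy to try out without real delays, for example `collect_image_urls(fake_fetch, sleep=lambda s: None)`.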
📂 What’s Next?
- Integrate YOLO for automatic car body cropping
- Automate dataset split into train/val/test for PyTorch
- Expand into an app that shows the car model along with its new and used prices
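The planned train/val/test split could start from something like the sketch below. The 80/10/10 ratios, the fixed seed, and the `split_dataset` name are assumptions, since this step has not been built yet.

```python
import random


def split_dataset(paths, ratios=(0.8, 0.1, 0.1), seed=42):
    """Shuffle paths reproducibly and cut them into train/val/test lists."""
    shuffled = sorted(paths)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder goes to test
    }
```

Applying the split separately to each car model's folder should keep every model represented in all three sets.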
📝 Final Thoughts
Used car websites are a powerful source of high-quality training data. With this scraping script, I was able to collect relevant images efficiently and reliably. In the next post, I’ll explain how I trained a ResNet model to classify these car models with high accuracy.