Car Recognition Project

【Tech Insight】Scraping Car Model Images from Used Car Sites with Python【Development Log #3】

A futuristic scene of three Japanese cars—Prius, Fit, and Note—driving through a smart city, with glowing circuit diagrams and the word “AI” in the background. This symbolizes the launch of our AI-powered car model recognition project.


🔍 Why Collect Images from Used Car Sites?

To train an AI to accurately identify car models, a large and diverse set of real-world images is essential. That’s why I focused on used car listing websites.

Reasons:

  • Numerous real-world photos are available per car model
  • Variety in angle, background, and lighting conditions
  • Easier to treat as labeled data than generic image searches

Thanks to this approach, I was able to gather practical, realistic data for training purposes.


⚙️ Technologies and Setup

  • Language: Python 3.x
  • Scraping Tools: Selenium + webdriver-manager
  • Browser Automation: Chrome (headless mode)
  • Image Downloading: urllib.request
  • Preprocessing: File size-based filtering of junk images
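
For reference, here is a minimal sketch of how the headless Chrome driver can be set up with Selenium 4 and webdriver-manager; the exact options (e.g. "--headless=new") may need adjusting for your Chrome and Selenium versions.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = Options()
options.add_argument("--headless=new")           # run Chrome without a visible window
options.add_argument("--window-size=1920,1080")

# webdriver-manager downloads a matching chromedriver automatically
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()),
    options=options,
)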

🧠 Centralized Car List Management

I managed car names and search keywords in both Japanese and English like this:

car_list = [
    {"jp_name": "トヨタ プリウス", "en_name": "Toyota Prius", "keyword": "トヨタ プリウス site:example.com"},
    {"jp_name": "ホンダ フィット", "en_name": "Honda Fit", "keyword": "ホンダ フィット site:example.com"},
    ...
]

※ Replace “example.com” with the actual domain of the used car site you’re scraping.
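
For reference, here is a small sketch of how this list can drive the rest of the pipeline; the images/ directory layout is just an illustrative assumption.

import os

for car in car_list:
    # Use the English name for folder names to avoid encoding issues
    save_dir = os.path.join("images", car["en_name"].replace(" ", "_"))
    os.makedirs(save_dir, exist_ok=True)
    print(f'Searching "{car["keyword"]}" -> saving to {save_dir}')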


📸 Image Collection Workflow

  1. Access the target used car site
  2. Search for each car model and scroll through results
  3. Extract image URLs
  4. Download images with Python
  5. Filter out small (e.g. banner) images
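
Here is a rough sketch of steps 2-4 for a single model; the scroll count and the generic "img" selector are placeholders that depend on the markup of the target site.

import os
import time
import urllib.request

from selenium.webdriver.common.by import By

def collect_image_urls(driver, search_url, scrolls=5):
    # Open the search results page and scroll so lazily loaded thumbnails appear
    driver.get(search_url)
    time.sleep(2)
    for _ in range(scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)
    # "img" is deliberately generic; a real site needs a more specific selector
    elements = driver.find_elements(By.CSS_SELECTOR, "img")
    return [e.get_attribute("src") for e in elements if e.get_attribute("src")]

def download_images(urls, save_dir, max_images=100):
    # Step 4: save up to max_images files into the model's folder
    os.makedirs(save_dir, exist_ok=True)
    for i, url in enumerate(urls[:max_images]):
        urllib.request.urlretrieve(url, os.path.join(save_dir, f"{i:04d}.jpg"))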

Example of the size-based filter from step 5:

# Files under about 5 KB are almost always banners or icons rather than car photos
if os.path.getsize(filepath) < 5000:
    os.remove(filepath)


🧼 Filtering Out Unusable Images

In addition to filtering by file size, I also used a separate script to remove:

  • Images without cars
  • Blurry or irrelevant images

This ensured the dataset remained clean and useful for training.
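
The cleanup script itself isn't shown here, but as one possible approach, blur can be estimated with OpenCV's variance-of-Laplacian measure (the threshold and folder name below are assumptions that need tuning per dataset); catching images with no car at all would additionally require an object detector such as the YOLO model planned below.

import os
import cv2  # opencv-python

def is_too_blurry(filepath, threshold=100.0):
    # Low Laplacian variance usually indicates a blurry (or unreadable) image
    image = cv2.imread(filepath)
    if image is None:
        return True  # unreadable or corrupt file
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < threshold

folder = "images/Toyota_Prius"  # example folder from the earlier sketch
for name in os.listdir(folder):
    path = os.path.join(folder, name)
    if is_too_blurry(path):
        os.remove(path)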


💡 Optimization Highlights

  • Adjustable max_images setting
  • Looped pagination to collect more samples
  • Random wait times to reduce bot detection risk
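
For illustration, a rough fragment of how the random wait and pagination loop can fit together, assuming the driver and max_images variables from earlier; the "next page" selector and the collect_image_urls_on_page() helper are hypothetical placeholders.

import random
import time

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

collected = []
while len(collected) < max_images:
    collected.extend(collect_image_urls_on_page(driver))  # hypothetical helper

    # Random wait between pages to look less like a bot
    time.sleep(random.uniform(2.0, 5.0))

    try:
        driver.find_element(By.CSS_SELECTOR, "a.next-page").click()  # placeholder selector
    except NoSuchElementException:
        break  # no more result pages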

📂 What’s Next?

  • Integrate YOLO for automatic car body cropping
  • Automate dataset split into train/val/test for PyTorch
  • Expand into an app that shows the identified car model along with its new and used prices
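
As a rough idea of the planned split step, something along these lines could shuffle each class folder and copy files into train/val/test subdirectories (the 80/10/10 ratio is only an example).

import os
import random
import shutil

def split_dataset(src_dir, dst_dir, ratios=(0.8, 0.1, 0.1), seed=42):
    random.seed(seed)
    for class_name in os.listdir(src_dir):
        files = sorted(os.listdir(os.path.join(src_dir, class_name)))
        random.shuffle(files)
        n_train = int(len(files) * ratios[0])
        n_val = int(len(files) * ratios[1])
        splits = {
            "train": files[:n_train],
            "val": files[n_train:n_train + n_val],
            "test": files[n_train + n_val:],
        }
        for split, names in splits.items():
            out_dir = os.path.join(dst_dir, split, class_name)
            os.makedirs(out_dir, exist_ok=True)
            for name in names:
                shutil.copy2(os.path.join(src_dir, class_name, name), os.path.join(out_dir, name))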

📝 Final Thoughts

Used car websites are a powerful source of high-quality training data. With this scraping script, I was able to collect relevant images efficiently and reliably. In the next post, I’ll explain how I trained a ResNet model to classify these car models with high accuracy.