# You don't necessarily need expensive contracts or paid APIs to scrape the entire internet. More of that functionality than you'd expect is already open source on GitHub...
Canonical: https://social-archive.org/yena/HJtO7utDFv
Original URL: https://x.com/AI_Caffeine/status/2071435612012564913
Author: AI 카페인 ☕️
Platform: x
## Content
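You don't necessarily need expensive contracts or paid APIs to scrape the entire internet. More of that functionality than you'd expect is already open source on GitHub 👀

With just these 10 projects, you can cover essentially all of the following areas:

- Web crawling
- JS rendering
- Browser automation
- Document / Markdown cleanup
- Mobile app automation
- Even browser fingerprint spoofing

Think of it as, quite literally, an `internet data extraction toolbox`.

10 worth bookmarking:

- firecrawl: turns websites into structured data that is easy for AI to read. https://github.com/firecrawl/firecrawl
- crawl4ai: quickly converts web pages into LLM-friendly Markdown. https://github.com/unclecode/crawl4ai
- browser-use: collects data by operating a browser the way a person would. https://github.com/browser-use/browser-use
- crawlee: pro-grade crawling framework with proxies, retries, and queues built in. https://github.com/apify/crawlee
- scrapy: long-proven, industrial-grade crawling framework. https://github.com/scrapy/scrapy
- markitdown: open source from Microsoft. Converts web / PDF / Office files to Markdown. https://github.com/microsoft/markitdown
- Scrapling: tends to be strong at handling complex pages. https://github.com/D4Vinci/Scrapling
- scrcpy: controls an Android phone directly, extending the workflow to mobile. https://github.com/Genymobile/scrcpy
- AutoScraper: give it a few examples and it learns the rest of the pattern automatically. https://github.com/alirezamika/autoscraper
- curl-impersonate: a tool for making requests match a browser's characteristics more closely. https://github.com/lwthiker/curl-impersonate

Data extraction features that cost hundreds to thousands of dollars a month already have plenty of open-source alternatives on GitHub.

#AI #AICaffeine #github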
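
To give a rough idea of what the "web page to Markdown" step offered by several of the tools above involves, here is a minimal sketch using only the Python standard library. It does not use the API of firecrawl, crawl4ai, or markitdown; the class `MarkdownSketch`, the function `html_to_markdown`, and the sample HTML are all hypothetical names made up for illustration, and the real projects handle far more cases (JS rendering, tables, boilerplate removal).

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, invented for this sketch.
SAMPLE_HTML = """<html><head><style>p{color:red}</style></head>
<body>
<h1>Open-source scraping</h1>
<p>See <a href="https://github.com/scrapy/scrapy">scrapy</a> for details.</p>
<ul><li>crawling</li><li>JS rendering</li></ul>
<script>track()</script>
</body></html>"""


class MarkdownSketch(HTMLParser):
    """Toy HTML-to-Markdown converter (illustrative only, not any listed tool's API)."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []      # Markdown fragments collected so far
        self.skip_depth = 0  # > 0 while inside <script> or <style>
        self.href = None     # link target of the <a> tag currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag in ("ul", "ol"):
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        # Drop script/style bodies; collapse whitespace runs in visible text.
        if not self.skip_depth:
            self.parts.append(re.sub(r"\s+", " ", data))

    def markdown(self):
        # Trim each line, then squeeze runs of blank lines down to one.
        lines = [line.strip() for line in "".join(self.parts).split("\n")]
        return re.sub(r"\n{3,}", "\n\n", "\n".join(lines)).strip()


def html_to_markdown(html):
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    return parser.markdown()


if __name__ == "__main__":
    print(html_to_markdown(SAMPLE_HTML))
```

In this sketch, headings, paragraphs, links, and list items are mapped to their Markdown equivalents, while `<script>` and `<style>` contents are dropped. That is roughly the kind of cleanup that makes a page easier for an LLM to read.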