AI 카페인 ☕️ on x


Scraping the entire internet

doesn't have to mean pricey contracts or paid APIs.

More of it than you'd think

is already out there as open source on GitHub 👀

With just these 10 projects,

you can basically cover all of the following areas:

Quite literally,

think of it as a toolbox for internet data extraction.

10 worth bookmarking:

firecrawl

Turns websites into structured data that AI can read easily

https://github.com/firecrawl/firecrawl
crawl4ai

Quickly converts web pages into LLM-friendly Markdown

https://github.com/unclecode/crawl4ai
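
To make "LLM-friendly Markdown" concrete, here's a rough stdlib-only sketch of the idea. This is NOT crawl4ai's API; `MarkdownSketch` and `html_to_markdown` are made-up names for illustration, and a real tool handles far more (tables, images, JavaScript-rendered pages).

```python
from html.parser import HTMLParser


class MarkdownSketch(HTMLParser):
    """Toy HTML-to-Markdown converter (hypothetical helper, stdlib only)."""

    SKIP = {"script", "style", "nav", "footer"}        # tags whose content is dropped
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.lines = []        # finished Markdown blocks
        self.buf = []          # text pieces of the block being built
        self.prefix = ""       # Markdown prefix of the current block
        self.skip_depth = 0    # > 0 while inside a skipped tag
        self.href = None       # link target while inside <a>

    def _flush(self):
        # Collapse whitespace and emit the current block, if any.
        text = " ".join("".join(self.buf).split())
        if text:
            self.lines.append(self.prefix + text)
        self.buf, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag in self.HEADINGS:
            self._flush()
            self.prefix = self.HEADINGS[tag]
        elif tag == "li":
            self._flush()
            self.prefix = "- "
        elif tag == "p":
            self._flush()
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.buf.append("[")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "a":
            self.buf.append(f"]({self.href or ''})")
            self.href = None
        elif tag in self.HEADINGS or tag in ("p", "li"):
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buf.append(data)


def html_to_markdown(html: str) -> str:
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser._flush()            # emit any trailing text
    return "\n\n".join(parser.lines)
```

Feed it a page's HTML string and you get headings, paragraphs, list items, and links back as Markdown, with navigation and script noise stripped out.
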
browser-use

Collects data by driving a browser the way a person would

https://github.com/browser-use/browser-use
crawlee

Pro-grade crawling framework with proxies, retries, and queues built in

https://github.com/apify/crawlee
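
If "retries and queues" sounds abstract, this is roughly the pattern such frameworks take care of for you. A minimal sketch only, not Crawlee's API: `crawl` and its `fetch` parameter are hypothetical names, and proxy rotation is left out entirely.

```python
import time
from collections import deque


def crawl(start_urls, fetch, max_retries=3, base_delay=0.0):
    """Process a URL queue, re-queueing failed URLs up to max_retries attempts.

    `fetch` is any callable url -> (body, list_of_new_urls); it is a
    hypothetical stand-in for a real HTTP client.
    """
    queue = deque((url, 0) for url in start_urls)   # (url, attempts so far)
    seen = set(start_urls)
    results, failed = {}, []

    while queue:
        url, attempts = queue.popleft()
        try:
            body, links = fetch(url)
        except Exception:
            if attempts + 1 < max_retries:
                time.sleep(base_delay * 2 ** attempts)   # exponential backoff
                queue.append((url, attempts + 1))        # retry later
            else:
                failed.append(url)                       # give up on this URL
            continue
        results[url] = body
        for link in links:
            if link not in seen:      # de-duplicate before enqueueing
                seen.add(link)
                queue.append((link, 0))
    return results, failed
```

Pass in any function that fetches one URL and returns its body plus the links it found; the loop handles ordering, de-duplication, and giving up after repeated failures.
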
scrapy

Long-proven, industrial-strength crawling framework

https://github.com/scrapy/scrapy
markitdown

Open source from Microsoft. Converts web pages, PDFs, and Office files to Markdown

https://github.com/microsoft/markitdown
Scrapling

Holds up well on complex pages

https://github.com/D4Vinci/Scrapling
scrcpy

Controls an Android phone directly, so mobile-side work can be hooked in too

https://github.com/Genymobile/scrcpy
AutoScraper

Give it a few examples and it learns the rest of the pattern automatically

https://github.com/alirezamika/autoscraper
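
The "learn from an example" trick can be sketched in a few lines: find where the example text sits in the HTML tree, then grab everything else at the same spot. This is a toy illustration of the general idea, not AutoScraper's actual algorithm or API; `learn_and_extract` is a made-up name.

```python
from html.parser import HTMLParser


class _PathCollector(HTMLParser):
    """Records every text node together with its tag/class path."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}   # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []     # current path, e.g. ["ul", "li.item", "a.title"]
        self.nodes = []     # (path tuple, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def learn_and_extract(html, example):
    """Find the path of `example`, then return all texts sharing that path."""
    collector = _PathCollector()
    collector.feed(html)
    collector.close()
    paths = {path for path, text in collector.nodes if text == example}
    return [text for path, text in collector.nodes if path in paths]
```

Give it a product listing and one product title, and it returns every title on the page, because they all share the same tag/class path.
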
curl-impersonate

Tool for making requests match a real browser's characteristics more closely

https://github.com/lwthiker/curl-impersonate

Data extraction features that cost hundreds to thousands of dollars a month

already have plenty of open-source alternatives on GitHub.
