<aside>
💡 Try it out at newsapi.rellort.dev
</aside>
What’s this
- We needed an API that allows Timelines to query for news articles. Nothing on the market offered all of the following:
- pre-embedded articles,
- date range queries,
- no rate limiting, and
- cheap queries.
- In the end, we built an ETL pipeline that focuses on reliability, fault tolerance, and low cost.
- The data pipeline costs $25/month to run, and currently has ~120k articles (and growing) indexed and ready to be searched.
Architecture
Extract
- Serverless scrapers for 15 different news sources are deployed on Vercel.
- The extract script is scheduled to run every hour on each source. It calls the serverless scraper function that scrapes articles in parallel.
- Articles are fed into a message queue to ensure no data loss if the transform stage fails.
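The extract step can be sketched roughly as follows. This is a minimal illustration, not the actual extract script: `scrape` is a hypothetical callable standing in for the HTTP call to a serverless scraper, and the standard-library `queue.Queue` stands in for the real message queue, which the writeup does not name.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from queue import Queue
from typing import Callable, Iterable

Article = dict  # hypothetical shape, e.g. {"url": ..., "title": ..., "text": ...}


def extract(
    sources: Iterable[str],
    scrape: Callable[[str], list[Article]],
    out_queue: Queue,
    max_workers: int = 8,
) -> dict[str, str]:
    """Scrape all sources in parallel and enqueue every article found.

    Returns a map of source -> error message for the sources that failed,
    so one broken scraper does not block the others.
    """
    failures: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per source; `scrape` is assumed to wrap the scraper call.
        futures = {pool.submit(scrape, source): source for source in sources}
        for future in as_completed(futures):
            source = futures[future]
            try:
                articles = future.result()
            except Exception as exc:
                failures[source] = str(exc)
                continue
            # Hand articles to the queue so the transform stage can fail
            # and retry later without losing them.
            for article in articles:
                out_queue.put(article)
    return failures
```

Collecting failures per source, instead of raising on the first one, means a single unreachable news site only delays its own articles until the next hourly run.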
Transform
- The transform script embeds the articles using all-MiniLM-L6-v2, which provides a good balance between performance and hardware requirements/speed.
- The embedded articles are fed into a message queue to ensure no data loss if the load stage fails.
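A minimal sketch of the transform step, under a few assumptions: the embedder is injected as a plain callable, `load_minilm` shows one way to back it with the `sentence-transformers` package, and in-process queues again stand in for the real message queue. The `text` and `embedding` field names are hypothetical. On failure, the batch is put back on the input queue, which mimics a broker redelivering unacknowledged messages.

```python
from queue import Empty, Queue
from typing import Callable, Sequence

Embedder = Callable[[list[str]], Sequence[Sequence[float]]]


def load_minilm() -> Embedder:
    """Return an embedder backed by all-MiniLM-L6-v2 (384-dimensional vectors)."""
    # Imported lazily so the rest of the sketch runs without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    return lambda texts: model.encode(texts).tolist()


def transform_batch(
    in_queue: Queue, out_queue: Queue, embed: Embedder, batch_size: int = 32
) -> int:
    """Embed up to `batch_size` queued articles and forward them.

    Returns the number of articles processed.
    """
    batch = []
    while len(batch) < batch_size:
        try:
            batch.append(in_queue.get_nowait())
        except Empty:
            break
    if not batch:
        return 0
    try:
        vectors = embed([article["text"] for article in batch])
    except Exception:
        # Requeue the whole batch so nothing is lost if embedding fails.
        for article in batch:
            in_queue.put(article)
        raise
    for article, vector in zip(batch, vectors):
        out_queue.put({**article, "embedding": list(vector)})
    return len(batch)
```

Embedding in batches keeps the model call efficient, and forwarding to the output queue only after the whole batch succeeds keeps retries simple.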
Load
- Finally, the load script stores the articles in a Meilisearch instance. Meilisearch is a lightweight full-text search engine.
- Meilisearch was originally designed for fast client-facing search, but it fits our scale and speed requirements well, and its lightweight nature helps us save on cost.
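The load step and a date range query might look like the sketch below. The index name `articles`, the field names, and the choice to derive the document ID from a hash of the URL are all assumptions for illustration. Hashing the URL gives an ID that satisfies Meilisearch's primary-key character rules and makes reloading a redelivered article an overwrite instead of a duplicate. Dates are stored as Unix timestamps because Meilisearch range filters compare numeric values.

```python
import hashlib
from datetime import datetime, timezone


def to_document(article: dict) -> dict:
    """Convert an embedded article into a Meilisearch-ready document."""
    published = datetime.fromisoformat(article["published_at"])
    if published.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        published = published.replace(tzinfo=timezone.utc)
    return {
        "id": hashlib.sha256(article["url"].encode("utf-8")).hexdigest(),
        "url": article["url"],
        "title": article["title"],
        "published_at": int(published.timestamp()),
        "embedding": article["embedding"],  # hypothetical field name
    }


def date_range_filter(start: datetime, end: datetime) -> str:
    """Build a Meilisearch filter expression for an inclusive date range."""
    return (
        f"published_at >= {int(start.timestamp())} "
        f"AND published_at <= {int(end.timestamp())}"
    )


def load(documents: list[dict], url: str, api_key: str) -> None:
    """Send documents to Meilisearch (requires the `meilisearch` package)."""
    import meilisearch  # imported lazily; needs a running instance

    index = meilisearch.Client(url, api_key).index("articles")
    # `published_at` must be filterable before range filters work.
    index.update_filterable_attributes(["published_at"])
    index.add_documents(documents)
```

A search restricted to a date range could then be issued with the Python client as `index.search("election", {"filter": date_range_filter(start, end)})`, where `start` and `end` are timezone-aware `datetime` values.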