Stack: Python, Streamlit, httpx, Selectolax, Trafilatura, Pydantic, Tenacity, requests-cache, Pillow, Ollama / Hugging Face
Optional Vision: Qwen2-VL / LLaVA (captions)
A hybrid scraper that extracts article metadata + concise summaries and samples a few images per page (optionally captioned by a small local vision model). Runs fully offline via Ollama and ships with a one-click GUI. Designed for strict resource control (tokens, images, bytes, concurrency) so it stays fast on laptop hardware.
Competitor & product monitoring: track pricing/features and key visuals; export JSON to BI.
News & research briefs: multi-source summaries with a couple of contextual images for internal updates.
Lead & account enrichment (lightweight): website facts + leadership mentions + brand imagery to append to CRM.
Content QA & SEO checks: detect missing titles/meta/alt-text; surface issues as a compact report.
Incident & trend scanning: rapid synthesis during events; snapshots with a few images.
Knowledge base refresh: normalize blog/announcements into a consistent schema.
URLs textarea + Run
Result cards: Title, Author, Published, Summary, up to 3 images
Download JSON button
Pure LLM scraping is costly and slow; pure rule-based extraction breaks on messy pages. I needed a hybrid that is reliable, resource-capped, private, and easy to launch.
Fetch + Retry: httpx + exponential backoff; polite headers.
Parse fast first: Selectolax + Trafilatura for clean text/title/author/date.
LLM “structuring” pass: compact JSON → url, title, author, summary, published.
Image sampling: ≤ 3 images (≥ 256 px), ≤ 2 MB each, saved as JPEG.
Optional vision captions: tiny local model (e.g., qwen2:vl-2b-instruct), ≤ 15 words per image.
GUI + one-click: Streamlit app; start_scraper.cmd creates venv, installs deps, opens browser.
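The retry policy in the fetch step can be sketched in plain Python. The project itself uses Tenacity's decorator around an httpx call; the helper below, its parameters, and the injectable `sleep` are illustrative stand-ins for that configuration, not the actual implementation:

```python
import time

def fetch_with_retry(fetch, url, attempts=4, base_delay=0.5, sleep=time.sleep):
    """Retry fetch(url) with exponential backoff: 0.5s, 1s, 2s, ...

    `fetch`, `attempts`, and `base_delay` are hypothetical stand-ins for
    the Tenacity retry configuration used in the real pipeline.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * (2 ** attempt))  # exponential backoff
```

Passing `sleep` explicitly keeps the backoff testable; in the real app Tenacity's `wait_exponential` plays this role.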
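The image-sampling caps above can be expressed as a small filter. This is a pure-Python sketch: the candidate tuple format is an assumption, and in the real pipeline the dimensions come from Pillow and the kept images are re-encoded as JPEG; whether "≥ 256 px" applies to the shorter side is also an assumption here.

```python
MAX_IMAGES = 3
MIN_DIMENSION = 256            # px, applied to the shorter side (assumption)
MAX_BYTES = 2 * 1024 * 1024    # 2 MB per image

def sample_images(candidates):
    """Keep at most 3 images that are >= 256 px and <= 2 MB.

    `candidates` is an iterable of (url, width, height, size_bytes)
    tuples -- a hypothetical shape; the real pipeline reads dimensions
    with Pillow and saves the survivors as JPEG.
    """
    kept = []
    for url, width, height, size in candidates:
        if min(width, height) < MIN_DIMENSION or size > MAX_BYTES:
            continue  # too small or too heavy: skip
        kept.append(url)
        if len(kept) == MAX_IMAGES:
            break  # hard cap keeps pages with many images cheap
    return kept
```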
Python: httpx[http2], Selectolax, Trafilatura, Pydantic v2, Tenacity, requests-cache, Pillow
UI: Streamlit (local web app)
LLM: Ollama (Gemma 3, DeepSeek R1 8B, Llama 3.1 8B) or Hugging Face Inference
Vision (optional): Qwen2-VL 2B / LLaVA 7B (4-bit), outputs capped
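The per-URL record that the LLM structuring pass emits (url, title, author, summary, published) can be sketched as a schema. The project uses a Pydantic v2 model for validation; the stdlib dataclass below is only an illustration of the same shape, and the `to_json` helper mirrors what the Download JSON button exports:

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Article:
    """One result card; fields mirror the compact JSON from the LLM pass.

    A stdlib sketch -- the real project validates this with Pydantic v2.
    """
    url: str
    title: Optional[str] = None
    author: Optional[str] = None
    summary: Optional[str] = None
    published: Optional[str] = None  # ISO-style date string when available

def to_json(articles):
    """Serialize records the way a Download JSON export would."""
    return json.dumps([asdict(a) for a in articles], indent=2)
```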
Repository: Link
In this project I used GPT-2, DistilBERT, and TextBlob to analyze sentiment in news data and generate text from it. Despite hardware limitations, I kept the models manageable, optimized performance, and analyzed the outputs to draw useful insights.
The work drew on Python programming, data analysis, machine learning, and API integration, and included a critical look at outcomes and future directions: integrating more data sources, refining the visualizations, and improving the text-generation techniques.
This project shows how I apply machine learning to practical business problems. The focus was less on cutting-edge algorithms and more on deriving actionable insights from a large volume of tech-news data, which meant managing sizeable datasets and working through natural-language-processing challenges. It reflects my ability to turn raw data into meaningful knowledge and my interest in building data- and AI-driven solutions.
In short, it demonstrates that I can tackle complex, real-world challenges with machine learning and data analysis.
References: Hugging Face, OpenAI, TextBlob by Steven Loria, Nvidia CUDA, and related key publications in the field.
2023
The first video shows how to use the dashboard. It can pull data from different sources, such as news sites and social media, and it works in multiple languages. TextBlob is used to analyze the sentiment of the text, and the whole flow is straightforward to use.
The second video adds GPT-2 to the dashboard. In addition to the sentiment analysis above, it can now generate text that reads as if written by a human, showing how the dashboard gains capability as more complex models are plugged in.