A multilingual RAG pipeline for the ETH News corpus
This semester project demonstrates an end‑to‑end, multilingual Retrieval‑Augmented Generation system.
Starting from raw HTML articles, the pipeline parses and cleans the documents, enriches them with metadata, chunks and indexes them, retrieves relevant passages, and evaluates the generated answers.
For full details and deliverables, please refer to the official Project Requirements.pdf provided in the course materials.
Everything can be executed locally or in Google Colab.
Notebook: Step_1.ipynb
Goal: transform bilingual ETH News HTML files into validated JSON ready for chunking and indexing.
Key points
| Task | Script | Highlights |
|---|---|---|
| HTML parsing | step_1_BeautifulSoup.py, step_1_Docling.py, step_1_hybrid.py | Compares BeautifulSoup (fast tag-stripping), Docling (layout-aware), and a hybrid of both |
| Advanced cleaning & metadata | step_1_2_advanced_cleaning_and_metadata.py | Removes disclaimers/repeats; adds language, NER, keywords, summaries |
| Validation | step_1_3_validation_filter.py | Drops empty docs, final sanity checks |
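The three parsers trade speed for layout fidelity. As an illustration of the BeautifulSoup-style approach, here is a minimal sketch; the directory layout, tag selectors, and doc_id scheme are assumptions, not the exact logic of step_1_BeautifulSoup.py:

```python
# Minimal sketch of a BeautifulSoup-based parser (illustrative only).
# Assumes articles live in data/ as .html files; the selectors below are
# guesses, the real step_1_BeautifulSoup.py may use different ones.
import hashlib
import json
from pathlib import Path

from bs4 import BeautifulSoup


def parse_article(html_path: Path) -> dict:
    soup = BeautifulSoup(html_path.read_text(encoding="utf-8"), "html.parser")
    # Drop non-content tags before extracting text.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    paragraphs = [p for p in paragraphs if p]  # keep only non-empty paragraphs
    return {
        "doc_id": hashlib.md5(html_path.name.encode()).hexdigest(),
        "filename": html_path.name,
        "paragraphs_original": paragraphs,
    }


if __name__ == "__main__":
    out_dir = Path("data_cleaned/BS")
    out_dir.mkdir(parents=True, exist_ok=True)
    for html_file in Path("data").glob("*.html"):
        doc = parse_article(html_file)
        (out_dir / f"{html_file.stem}.json").write_text(
            json.dumps(doc, ensure_ascii=False, indent=2), encoding="utf-8"
        )
```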
Output (one JSON per article):
{
"doc_id": "9b7f…",
"filename": "example.html",
"domain": "ethz.ch",
"language": "de",
"date": "2023-05-01",
"source": "ETH News",
"paragraphs_original": [ … ],
"paragraphs_cleaned": [ … ],
"named_entities": [ … ],
"keywords": [ … ],
"summary": "…",
"text_stats": { "char_count": 864, "word_count": 128, "paragraph_count": 1 }
}
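The language, named_entities, and keywords fields above are produced by the advanced cleaning step. A rough sketch of how they could be derived with the libraries listed in the installation section (lingua, spaCy, YAKE); the function name and parameters are illustrative, not taken from step_1_2_advanced_cleaning_and_metadata.py:

```python
# Illustrative metadata enrichment; not the exact logic of
# step_1_2_advanced_cleaning_and_metadata.py.
import spacy
import yake
from lingua import Language, LanguageDetectorBuilder

detector = LanguageDetectorBuilder.from_languages(
    Language.ENGLISH, Language.GERMAN, Language.FRENCH, Language.ITALIAN
).build()
nlp_models = {"en": "en_core_web_sm", "de": "de_core_news_sm",
              "fr": "fr_core_news_sm", "it": "it_core_news_sm"}


def enrich(doc: dict) -> dict:
    text = " ".join(doc["paragraphs_cleaned"])
    # Language detection with lingua.
    lang = detector.detect_language_of(text)
    doc["language"] = lang.iso_code_639_1.name.lower() if lang else "unknown"
    # Named entities via the language-specific spaCy model.
    nlp = spacy.load(nlp_models.get(doc["language"], "en_core_web_sm"))
    doc["named_entities"] = [(ent.text, ent.label_) for ent in nlp(text).ents]
    # Keywords via YAKE (unsupervised, language-aware).
    extractor = yake.KeywordExtractor(lan=doc["language"], top=10)
    doc["keywords"] = [kw for kw, _score in extractor.extract_keywords(text)]
    return doc
```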
Notebook: Step_2_1.ipynb
Goal: create document chunks, build multiple indices, and benchmark retrieval quality.
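As a rough illustration of the chunking step, here is a minimal word-window chunker with overlap; the chunk size and overlap are placeholder values, and the actual chunking strategy and index builds live in Step_2_1.ipynb:

```python
# Illustrative fixed-size chunker with overlap; the real chunking
# parameters and index construction are defined in Step_2_1.ipynb.
def chunk_document(doc: dict, chunk_size: int = 200, overlap: int = 50) -> list[dict]:
    words = " ".join(doc["paragraphs_cleaned"]).split()
    chunks, start = [], 0
    while start < len(words):
        window = words[start:start + chunk_size]
        chunks.append({
            "doc_id": doc["doc_id"],
            "chunk_id": f"{doc['doc_id']}_{len(chunks)}",
            "language": doc["language"],
            "text": " ".join(window),
        })
        if start + chunk_size >= len(words):
            break
        start += chunk_size - overlap  # slide the window, keeping `overlap` words
    return chunks
```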
Notebook: Step_2_2.ipynb
Pipeline:
Notebook: Step_3.ipynb
Average semantic F1 = 0.36; human graders rate clarity highest and relevance slightly lower.
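As an illustration of how an answer-level F1 can be computed, here is a simple token-overlap variant; this is an assumption about the metric, and the semantic F1 reported in Step_3.ipynb may instead rely on embedding similarity:

```python
# Token-overlap F1 between a generated answer and a reference answer.
# Illustrative only; Step_3.ipynb's semantic F1 may be computed differently.
from collections import Counter


def answer_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(answer_f1("ETH Zurich was founded in 1855",
                "The university was founded in 1855"))
```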
# 1️⃣ Create and activate venv
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
# 2️⃣ Install core libraries
pip install -r requirements.txt # contains bs4, docling, spacy, nltk, etc.
# …or install the core libraries individually
pip install bs4 docling dateparser yake lingua-language-detector spacy nltk
# 3️⃣ Download spaCy models and NLTK data
python -m spacy download en_core_web_sm
python -m spacy download de_core_news_sm
python -m spacy download fr_core_news_sm
python -m spacy download it_core_news_sm
python -m nltk.downloader punkt
python -m nltk.downloader punkt_tab
# 4️⃣ Parse & clean HTML (choose one parser or run all)
python Code/step_1_BeautifulSoup.py data/ data_cleaned/BS
python Code/step_1_Docling.py data/ data_cleaned/D
python Code/step_1_hybrid.py data/ data_cleaned/BSD
# 5️⃣ Advanced cleaning & validation
python Code/step_1_2_advanced_cleaning_and_metadata.py data_cleaned/BSD data_cleaned/BSD_advanced
python Code/step_1_3_validation_filter.py data_cleaned/BSD_advanced data_cleaned/BSD_validated
# 6️⃣ Explore the notebooks 🔬
jupyter lab
(Google Colab users can open each notebook and run all cells; mount Google Drive and point the data paths at it.)
benchmark/ # Q&A datasets & relevance labels
Code/ # Stand‑alone Python scripts
data/ # Raw HTML (not committed)
data_cleaned/ # Parsed & validated JSON
notebooks/
├─ Step_1.ipynb
├─ Step_2_1.ipynb
├─ Step_2_2.ipynb
└─ Step_3.ipynb
pictures/ # All figures used in the README
Note: Copy the ETH News dataset into data/ before running Step 1.