Open-Source Projects
Here's a collection of projects I've worked on over the years. These represent some of my best (public) work and areas where I've focused my skills and passion.
Maestrale
An Italian 7B LLM that we built on top of Mistral 7B. The work covered continued pre-training, fine-tuning, and alignment, along with dataset creation and curation. The model is available on Hugging Face.
huggingface/maestrale
Gazzetta Ufficiale
A 1.4M-example Italian-language dataset for training LLMs. It is based on the Gazzetta Ufficiale, the official journal of the Italian Republic.
huggingface/gazzetta-ufficiale
Paper: Two New Datasets for Italian-Language Abstractive Text Summarization
A paper introducing two new datasets for Italian-language abstractive text summarization. We also trained summarization models on these datasets that were state of the art at the time, and released them on Hugging Face.
mdpi.com/2078-2489/13/5/228
ipt-350m/125m
Two small pre-trained LLMs for Italian, using ALiBi and Flash Attention and trained with llm-foundry. These are toy models for testing and development.
huggingface/ipt-350m
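For readers unfamiliar with ALiBi, mentioned in the ipt entry above: it replaces positional embeddings with a per-head linear penalty on attention scores that grows with the distance between query and key. Below is a minimal pure-Python sketch of that bias, following the formulation in the ALiBi paper for a power-of-two number of heads. It is not the llm-foundry implementation, and the function names `alibi_slopes` and `alibi_bias` are illustrative assumptions.

```python
import math


def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes: a geometric sequence starting at 2^(-8 / num_heads).

    Assumption: this sketch only covers power-of-two head counts.
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    ratio = 2.0 ** (-8.0 / num_heads)
    return [ratio ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads: int, seq_len: int) -> list[list[list[float]]]:
    """Causal bias[h][i][j], added to attention scores before softmax.

    For key position j <= query position i the bias is -slope_h * (i - j);
    future positions (j > i) are masked with -inf.
    """
    bias = []
    for slope in alibi_slopes(num_heads):
        head = [
            [-slope * (i - j) if j <= i else -math.inf for j in range(seq_len)]
            for i in range(seq_len)
        ]
        bias.append(head)
    return bias


slopes = alibi_slopes(8)        # first head has slope 0.5, last has 2^-8
bias = alibi_bias(8, seq_len=4)
```

Heads with larger slopes attend mostly to nearby tokens, while heads with small slopes keep a longer effective context.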
Lecture at Deep Learning with PyTorch - Università degli studi dell'Insubria
A lecture on diffusion models from the ground up, including the theory and an implementation from scratch in PyTorch.
github/diffusion-lecture-it
MMLU-Pro-ita
Italian translation of the MMLU-Pro dataset. The dataset contains 12k complex questions across various disciplines to evaluate Italian LLMs.
huggingface/MMLU-Pro-ita
Italian Sentence Transformers
I developed the first public Italian sentence-transformers models available on Hugging Face. Although now outdated, they were highly effective at the time, particularly in a retrieval-augmented generation (RAG/LFQA) pipeline that used Faiss to generate answers from biomedical texts. That system was built with it5 by Gabriele Sarti.
sentence-BERTino
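The retrieval step of a pipeline like the one described above can be sketched as a nearest-neighbour search over sentence embeddings. The example below is only an illustration under stated assumptions: the vectors are hand-written stand-ins for real sentence-transformers embeddings, and a brute-force cosine-similarity scan stands in for a Faiss index. The names `normalize` and `top_k` are hypothetical.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query: list[float], passages: list[list[float]], k: int = 2):
    """Return (passage_index, cosine_similarity) for the k closest passages."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(passages):
        p = normalize(vec)
        # Inner product of unit vectors equals cosine similarity.
        scored.append((idx, sum(a * b for a, b in zip(q, p))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings"; a real encoder produces hundreds of dimensions.
passages = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 1.0, 0.0],
]
hits = top_k([1.0, 0.05, 0.0], passages, k=2)
retrieved_ids = [idx for idx, _ in hits]
```

In a full pipeline, the retrieved passages would then be passed as context to a generative model that writes the answer.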