Open-Source Projects

Here's a collection of projects I've worked on over the years. These represent some of my best (public) work and areas where I've focused my skills and passion.

  • Maestrale

    An Italian 7B LLM that we built on top of Mistral 7B. The work covered continued pre-training, fine-tuning, and alignment, along with dataset creation and curation. The model is available on Hugging Face.

    huggingface/maestrale

  • Gazzetta Ufficiale

    A 1.4M-example Italian-language dataset for training LLMs. It is based on the Gazzetta Ufficiale, the official journal of the Italian Republic.

    huggingface/gazzetta-ufficiale

  • Paper: Two New Datasets for Italian-Language Abstractive Text Summarization

    A paper introducing two new datasets for Italian-language abstractive text summarization. We also trained models on these datasets that were state of the art at the time and released them on Hugging Face.

    mdpi.com/2078-2489/13/5/228

  • ipt-350m/125m

    Two small pre-trained LLMs for Italian, with ALiBi and Flash Attention, trained with llm-foundry. These are toy models for testing and development.

    huggingface/ipt-350m

  • Lecture at Deep Learning with PyTorch - Università degli studi dell'Insubria

    A lecture on diffusion models from the ground up, including the theory and an implementation from scratch in PyTorch.

    github/diffusion-lecture-it

  • MMLU-Pro-ita

    Italian translation of the MMLU-Pro dataset. The dataset contains 12k complex questions across various disciplines to evaluate Italian LLMs.

    huggingface/MMLU-Pro-ita

  • Italian Sentence Transformers

    The first public Italian sentence-transformers models available on Hugging Face. Although now outdated, they were highly effective at the time, particularly in a Retrieval-Augmented Generation (RAG/LFQA) pipeline that used Faiss to generate answers from biomedical texts. That pipeline was built with it5 by Gabriele Sarti.

    sentence-BERTino