
UMINF 25.08

Bridging AI and Privacy: Solutions for High-Dimensional Data and Foundation Models

The widespread adoption of machine learning (ML) across many domains has enabled the extraction of meaningful insights from complex, large-scale datasets. However, recent research has revealed that ML models are vulnerable to a range of privacy attacks that can expose sensitive information about individuals in the training data. With regulatory frameworks such as the General Data Protection Regulation (GDPR) enforcing strict requirements on data sharing, the need for privacy-preserving solutions has become increasingly critical. As the world becomes increasingly digital, massive volumes of data are generated, often in high-dimensional spaces where the number of attributes matches or exceeds the number of samples. ML models are extensively used to process such data, making it critical to protect both the data and the models from privacy attacks. Traditional anonymization techniques such as k-anonymity and differential privacy often fall short on high-dimensional datasets: as the dimensionality of the data increases, data points become increasingly sparse in the feature space, making it difficult to find clusters of similar records.

This thesis therefore proposes a set of privacy-preserving methodologies tailored to high-dimensional data and large-scale foundation models. We begin by exploring manifold learning techniques that project high-dimensional data into a lower-dimensional latent space while preserving the intrinsic geometric structure of the original data. This transformation enhances the effectiveness of anonymization while maintaining data utility. Building on this, we present a novel hybrid privacy model that integrates the strengths of k-anonymity with differential privacy, enabling robust anonymization that preserves both privacy and the underlying data structure.
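
A pipeline of this general shape, manifold projection followed by k-member micro-aggregation with differentially private noise, can be sketched as follows. This is only an illustrative sketch using scikit-learn's Isomap and KMeans, not the thesis's actual algorithm; the choices of `k`, `epsilon`, and the sensitivity estimate are assumptions made for the example.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.cluster import KMeans

def anonymize(X, k=5, n_components=2, epsilon=1.0, seed=None):
    """Sketch: manifold projection, then k-member micro-aggregation
    with Laplace noise on the cluster centroids (illustrative only)."""
    rng = np.random.default_rng(seed)
    # 1) Project to a low-dimensional latent space, preserving geometry.
    Z = Isomap(n_components=n_components).fit_transform(X)
    # 2) Group records into clusters of roughly k similar points.
    n_clusters = max(1, len(Z) // k)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(Z)
    # 3) Replace each record by its noisy cluster centroid.
    out = np.empty_like(Z)
    for c in range(n_clusters):
        members = Z[labels == c]
        centroid = members.mean(axis=0)
        # Laplace mechanism: a crude per-coordinate sensitivity bound
        # for the mean (assumed for this sketch, not derived rigorously).
        sensitivity = (Z.max(axis=0) - Z.min(axis=0)) / max(len(members), 1)
        out[labels == c] = centroid + rng.laplace(scale=sensitivity / epsilon)
    return out

# Usage: 100 samples in a 50-dimensional feature space.
X = np.random.default_rng(0).normal(size=(100, 50))
X_anon = anonymize(X, k=5, seed=1)
print(X_anon.shape)  # (100, 2)
```

After this step every released record is a noisy group centroid, so no row corresponds one-to-one with an original individual in the latent space.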
We further investigate synthetic data generation as a privacy-preserving alternative to using sensitive data, leveraging advanced generative models such as GANs and VAEs to produce high-quality synthetic datasets. To enhance the quality of the generated data, we propose techniques that preserve the intrinsic structure of the original high-dimensional data and incorporate prior domain knowledge to guide the generation process. We rigorously evaluate the synthetic data in terms of statistical fidelity, privacy risk, ML utility, and distributional coverage through detailed visualizations. We then address high dimensionality and privacy concerns in the context of large-scale foundation models, proposing two model compression strategies, based on knowledge distillation and pruning, that effectively reduce the number of model parameters while preserving performance and enhancing the privacy of the system. Collectively, the thesis contributes to building privacy-aware AI systems by developing practical solutions that address the complex interplay between high dimensionality and privacy models.
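
On the compression side, knowledge distillation trains a small student model to match a large teacher's softened output distribution. The sketch below computes the standard temperature-scaled distillation loss with NumPy; the logits and temperature are made-up values for illustration, not results from the thesis.

```python
import numpy as np

def softmax(logits, T=1.0):
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradients stay comparable across temperatures."""
    p = softmax(teacher_logits, T)              # soft teacher targets
    log_q = np.log(softmax(student_logits, T))  # student log-probabilities
    return float((p * (np.log(p) - log_q)).sum(axis=-1).mean() * T**2)

# Illustrative logits for a batch of 2 examples over 3 classes.
teacher = np.array([[4.0, 1.0, 0.5], [0.2, 3.5, 0.1]])
student = np.array([[3.0, 1.5, 0.2], [0.5, 3.0, 0.3]])
loss = distillation_loss(student, teacher, T=2.0)
print(loss)
```

A higher temperature flattens the teacher's distribution, exposing the relative probabilities of incorrect classes ("dark knowledge") that a hard-label objective would discard; in practice this loss is mixed with an ordinary cross-entropy term on the true labels.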

Keywords

Privacy, Manifold Learning, k-Anonymity, Differential Privacy, Synthetic Data Generation, Language Models, Model Compression

Authors

Entry responsible: Sonakshi Garg

Page Responsible: Frank Drewes
2025-06-29