NRNB Issue-238: Using Foundation Models for Gene Regulatory Network Inference

Pull Request

A short description of the goals of the project

The primary goal of this project was to establish a framework for evaluating the performance of foundation models, specifically scGPT, in generating embeddings suitable for gene regulatory network (GRN) inference. By leveraging the BEELINE benchmark, we aimed to create a standardized approach for assessing the effectiveness of current and future foundational models in the context of GRN inference.

What I Did

Implementing a pipeline to use scGPT for generating embeddings from single-cell RNA sequencing data.
Integrating these embeddings within the BEELINE benchmark to evaluate their effectiveness in GRN inference.
Comparing the performance of scGPT-generated embeddings against traditional methods used in GRN inference.
Using metrics and visualization tools to quantify and illustrate the performance of scGPT in this context.

The Current State

All the goals mentioned in the project have been achieved. scGPT, a transformer-based foundation model trained on single-cell RNA sequencing data enhances understanding of cellular biology and improve scRNA-seq analysis. scGPT is trained on a large corpus of scRNA-seq from CELLxGENE collection. The exact size of the pre-training dataset is around 33 million cells as explicitly mentioned in the current research. From which 97% was used for training and 3% for testing.

The Beeline framework has been updated.
Previously, it used only expression data.
Now, it can use embeddings data as well.
Model files and an embeddings generation pipeline have been added to Beeline.
Users can activate the embeddings feature with a simple flag: --use_embeddings
This flag allows the pipeline to generate and use embeddings data for GRN inference.

Results

Results showed that gene regulatory networks (GRNs) inferred using embeddings generated by scGPT consistently outperformed those derived from raw gene expression data. This trend was observed across multiple datasets, including human and mouse single-cell RNA sequencing data from various cell types. The improved performance was evident across different GRN inference algorithms and when compared against multiple types of ground truth networks. These findings suggest that single-cell foundation models can generate embeddings that enhance the accuracy of GRN inference methods.

Fig 1: Performance of three GRN inference models, namely, GENIE3, GRNBOOST2, PIDC using sc-GPT generated embedding and gene expression data seperately. Here, along x-axis we have the gene expression dataset and the reference network and along y-axis we have the corresponding AUROC score.

Fig 2: Performance of three GRN inference models, namely, GENIE3, GRNBOOST2, PIDC using sc-GPT generated embedding and gene expression data seperately. Here, along x-axis we have the gene expression datasets and the reference network and along y-axis we have the corresponding AUPRC score.

Fig 3: Performance of three GRN inference models, namely, GENIE3, GRNBOOST2, PIDC using sc-GPT generated embedding and gene expression data seperately. Here, along x-axis we have the gene expression dataset and the reference network and along y-axis we have the corresponding Early Precision Recall score.

What's Left to Do

While significant progress has been made, there are several areas for progress that still need work:

Work on EPr issue is required. According to the BEELINE research paper, the EPr(Early Precision Recall) metric was turning out to be between 0-7.5, whereas from our computation, the score was turning out to be less that 0.2 in all gene embeddings and gene expression evaluations.
EPr evaluation Pipeline also requires similar edits and addtions as the Auc pipeline did, so that using the flag --use_embeddings can generate and save EPr scores without the need of config file editing.
In case of hHepHepG2-ChIP-seq-network, hHepNon-Specific-ChIP-seq-network, hHepSTRING-network Datasets, when used for GRN inference are producing nan values in case of PIDC algorithms in case of gene expression data. This is when the preprocessing is done as defined in the paper, using TFs(Transcription factor genes) + 500 genes based on highest variance scores. But when only top 500 genes are chosen without considering any TFs, we get values other than nan for PIDC.

Potential Future Direction

Expanding the evaluation to include other foundation models like scBERT and Geneformer for a comprehensive comparison.
Fine-tuning the scGPT model specifically for GRN inference tasks to potentially improve performance.
Exploring the interpretability of the embeddings generated by scGPT to gain insights into the biological relevance of the learned representations.
Extending the framework to assess performance across diverse datasets and cell types.

Changes Included in the Pull Request

The code merged is as follows, firstly some changes were made to the BLRunner.py script in order to incorporate the –use_embeddings flag, so that when the following command is used:

python BLRunner.py –config config-files/config.yaml –use_embeddings

The change made in this file was:

In the get_parser() function:

parser.add_argument('--use_embeddings', action='store_true',
        help='Use embeddings for expression data')

In the main() function:

for idx in range(len(evaluation.runners)):
        evaluation.runners[idx].use_embeddings = opts.use_embeddings

Following this, some additions were made in the runner.py file in the BLRun class:

In the Runner class:

self.use_embeddings = params.get('use_embeddings', False)
    self.embeddings_file = None

Function added in the Runner class:

def generate_embeddings(self):
        embed_script_path = Path(__file__).resolve().parent / "generate_embeds.py"
        
        if not embed_script_path.exists():
            raise FileNotFoundError(f"Embeddings script not found at {embed_script_path}")
        
        expr_filename = Path(self.exprData).stem
        new_input_dir = Path(self.inputDir) / f"processed_{expr_filename}"
        new_input_dir.mkdir(parents=True, exist_ok=True)
        
        self.embeddings_file = new_input_dir / "EmbeddingsData.csv"
        
        print(f"Using input file: {self.exprData}")
        print(f"Running embedding generation script at: {embed_script_path}")
        print(f"Embeddings will be saved to: {self.embeddings_file}")
        
        command = [
            "python",
            str(embed_script_path),
            "--input", str(Path(self.inputDir) / self.exprData),
        ]
        
        try:
            subprocess.run(command, check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            print("Embeddings generated successfully.")
            
            generated_file = Path(self.inputDir) / "EmbeddingsData.csv"
            if generated_file.exists():
                shutil.move(str(generated_file), str(self.embeddings_file))
                print(f"Embeddings moved to {self.embeddings_file}")
            
            ref_network_file = Path(self.inputDir) / "refNetwork.csv"
            if ref_network_file.exists():
                shutil.copy(str(ref_network_file), str(new_input_dir))
                print(f"refNetwork.csv copied to {new_input_dir}")
            else:
                print("refNetwork.csv not found in the input directory.")
            
            self.inputDir = new_input_dir
            self.exprData = "EmbeddingsData.csv"
            
        except subprocess.CalledProcessError as e:
            print(f"Error generating embeddings: {e}")
            print(f"Script output: {e.output.decode()}")
            print(f"Script error: {e.stderr.decode()}")
            raise

This will ensure that the generate_embeds.py script is called within the same BLRun class if the flag is used, and the right input and output directories are passed on. The generate_embeds.py file is another addition I have made entirely:

import subprocess
    import os
    import argparse

    def generate_embeddings(input_file):
        input_dir = os.path.dirname(os.path.abspath(input_file))
        print("------>", input_dir)
        input_filename = os.path.basename(input_file)
        
        output_file = os.path.join(input_dir, "EmbeddingsData.csv")

        cmd = [
            "docker", "run", "--rm",
            "-v", f"{input_dir}:/input", 
            "scgpt_human",  
            "--input", f"/input/{input_filename}", 
            "--model_dir", "/app" 
        ]
        
        try:
            subprocess.run(cmd, check=True)
            print(f"Embeddings generated successfully. Output saved to {output_file}")
        except subprocess.CalledProcessError as e:
            print(f"Error generating embeddings: {e}")
            raise

    def main():
        parser = argparse.ArgumentParser(description='Generate gene embeddings using scGPT model')
        parser.add_argument('--input', required=True, help='Path to input expression data CSV file')
        
        args = parser.parse_args()
        
        generate_embeddings(args.input)

    if __name__ == "__main__":
        main()

The pipeline automatically generates embeddings and uses them for GRN Inference. The whole pipeline here uses a generate_embeddings.py and model files located in another addition made here in a directory named Model. The model files and generate_embeddings script have been entirely dockerized.

generate_embeddings.py script:

import pandas as pd
    import json
    import warnings
    import torch
    from scgpt.tokenizer.gene_tokenizer import GeneVocab
    from scgpt.model import TransformerModel
    from scgpt.utils import set_seed
    import os
    import argparse

    warnings.filterwarnings('ignore')
    set_seed(42)

    def initialize_model(args_file, model_file, vocab_file):
        if not all(os.path.exists(f) for f in [args_file, model_file, vocab_file]):
            raise FileNotFoundError(f"Required model files not found: {args_file, model_file, vocab_file}")

        vocab = GeneVocab.from_file(vocab_file)
        special_tokens = ["", "", ""]
        for s in special_tokens:
            if s not in vocab, vocab.append_token(s)
        
        with open(args_file, "r") as f:
            model_configs = json.load(f)
        
        ntokens = len(vocab)
        model = TransformerModel(
            ntokens,
            model_configs["embsize"],
            model_configs["nheads"],
            model_configs["d_hid"],
            model_configs["nlayers"],
            vocab=vocab,
            pad_value=model_configs.get("pad_value", -2),
            n_input_bins=model_configs.get("n_bins", 51),
        )
        
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        state_dict = torch.load(model_file, map_location=device)
        
        model.load_state_dict(state_dict, strict=False)
        model.to(device)
        model.eval()
        
        return model, vocab, device

    def generate_and_save_embeddings(file_path, model, vocab, device, output_file):
        expression_data = pd.read_csv(file_path, index_col=0)
        gene_names = expression_data.index.tolist()
        tokenized_genes = []
        for gene in gene_names:
            upper_gene = gene.upper()
            if upper_gene in vocab:
                tokenized_genes.append(vocab[upper_gene])
            else:
                tokenized_genes.append(vocab[""])  # Use  token for unknown genes

        gene_ids = torch.tensor(tokenized_genes, dtype=torch.long).to(device)
        with torch.no_grad():
            gene_embeddings = model.encoder(gene_ids)
        gene_embeddings = gene_embeddings.detach().cpu().numpy()
        
        gene_embeddings_dict = {gene: gene_embeddings[i] for i, gene in enumerate(gene_names)}
        embeddings_df = pd.DataFrame(gene_embeddings_dict).T
        embeddings_df.to_csv(output_file)
        print(f'Saved gene embeddings for {len(gene_embeddings_dict)} genes to {output_file}')

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description='Generate gene embeddings')
        parser.add_argument('--input', required=True, help='Path to input expression data CSV file')
        parser.add_argument('--model_dir', default='/app', help='Directory containing model files')
        args = parser.parse_args()

        # Generate output path based on input path
        input_dir = os.path.dirname(args.input)
        input_filename = os.path.basename(args.input)
        output_filename = "EmbeddingsData.csv"
        output_path = os.path.join(input_dir, output_filename)
        
        print("Debug Information:")
        print(f"Input file path: {args.input}")
        print(f"Output file path: {output_path}")
        print(f"Model directory: {args.model_dir}")

        args_file = os.path.join(args.model_dir, "args.json")
        model_file = os.path.join(args.model_dir, "best_model.pt")
        vocab_file = os.path.join(args.model_dir, "vocab.json")

        try:
            model, vocab, device = initialize_model(args_file, model_file, vocab_file)
            generate_and_save_embeddings(args.input, model, vocab, device, output_path)
        except FileNotFoundError as e:
            print(f"Error: {e}")
            print("Please ensure that the model files (args.json, best_model.pt, vocab.json) are in the correct directory.")
            print(f"Current model directory: {args.model_dir}")
            exit(1)

To follow along with the changes made in the BLRun package, some changes were also required in BLEval package to ensure automated and smooth performance of the entire Evaluation pipeline for both Expression and Embedding Data. To start off, the firstly changes were required in the __init__.py file this time, since there was no isolated runner.py file that ran all the rest of the evaluators.

In the InputSettings() class, we start off with the following addition. This helps us in case of generating scores for embeddings, by having all the path routes defined in one place


    self.use_embeddings = use_embeddings

    if use_embeddings:
        self.adjust_paths_for_embeddings()

    def adjust_paths_for_embeddings(self):
    for dataset in self.datasets:
        processed_path = 'processed_ExpressionData'
        dataset['exprData'] = str(Path(processed_path) / 'EmbeddingsData.csv')
        dataset['refNetwork'] = str(Path(processed_path) / 'refNetwork.csv')

    def get_true_edges_path(self, dataset_name):
        dataset = next((d for d in self.datasets if d['name'] == dataset_name), None)
        if dataset is None:
            raise ValueError(f"Dataset with name {dataset_name} not found in datasets list.")
        return self.datadir / dataset_name / 'processed_ExpressionData' / dataset['trueEdges']

Similarly I defined the OutputSettings() class:


            if use_embeddings:
            self.adjust_paths_for_embeddings()

    def adjust_paths_for_embeddings(self):
        self.base_dir = self.base_dir / 'processed_ExpressionData'

    def get_output_path(self, dataset_name, file_name):
        dataset = next((d for d in self.datasets if d['name'] == dataset_name), None)
        if dataset is None:
            raise ValueError(f"Dataset with name {dataset_name} not found in datasets list.")
        
        # Construct the output path based on the required structure
        output_path = self.base_dir / dataset_name / 'processed_ExpressionData' / f"{dataset_name}-{file_name}"
        return output_path

Similarly more such path changes were added in the config parse(),__parse_input_settings(),__parse_output_settings(), in the ConfigParser() class to use this path by calling on the InputSettings() class.

Then we move on to the PRROC() function in the computeDGAUC.py file, in which we add an if statement to check if --use_embeddings flag is used, in which a different input and output path needs to be followed:


    if inputSettings.use_embeddings == True:     
        true_edges_path = inputSettings.get_true_edges_path(dataDict['name'])
        trueEdgesDF = pd.read_csv(true_edges_path, sep=',', header=0, index_col=None)
        outDir = Path("outputs") / inputSettings.datadir.relative_to("inputs") / dataDict['name'] / 'processed_ExpressionData'
    else:
        trueEdgesDF = pd.read_csv(str(inputSettings.datadir)+'/'+ dataDict['name'] +
                                '/' +dataDict['trueEdges'],
                                sep = ',', 
                                header = 0, index_col = None)
        outDir = Path("outputs") / inputSettings.datadir.relative_to("inputs") / dataDict['name']

Finally, the main file we run in the evaluation pipeline is the BLEvaluator.py some changes introduced here ensured that the --use_embeddings flag is parsed and the changes we added earlier ensure the pipeline runs smoothly in both cases:

A small addition in the get_parser():


    parser.add_argument('--use_embeddings', action="store_true", default=False,
    help="Use processed expression data embeddings for evaluation.")

An edit in the main() to have the right output directory to print save the results to:


    if opts.auc:
    print('\n\nComputing areas under ROC and PR curves...')

    AUPRC, AUROC = evalSummarizer.computeAUC()

    # Construct the output directory based on InputSettings
    # This ensures that the path mirrors the structure used in InputSettings
    dataset_name = str(evalSummarizer.output_settings.output_prefix)  # Assuming this holds the dataset name, e.g., 'GSD'
    if opts.use_embeddings:
        # Use the processed path if embeddings are used
        outDir = Path("outputs") / evalSummarizer.input_settings.datadir.relative_to("inputs") / dataset_name / "processed_ExpressionData"
        # Ensure that outDir exists
        outDir.mkdir(parents=True, exist_ok=True)
    else:
        # Use the regular path structure
        outDir = Path("outputs") / evalSummarizer.input_settings.datadir.relative_to("inputs") / dataset_name
        # Ensure that outDir exists
        outDir.mkdir(parents=True, exist_ok=True) 
    # Save the output files in the correct directory
    AUPRC.to_csv(outDir / "AUPRC.csv")
    AUROC.to_csv(outDir / "AUROC.csv")

Another additional change, that I will pass on as a solution to this issue, as a part of the original PR. It was essential to fast pace the evaluation pipeline, due to the large size of datasets(>1000 Genes) with a small change in the computeScores() function in the computeDGAUC.py file in the BLEval folder.

# Get unique genes
    unique_genes = np.unique(trueEdgesDF.loc[:, ['Gene1', 'Gene2']])
        
    # Convert DataFrames to dictionaries for faster lookup
    true_edges = set(map(tuple, trueEdgesDF[['Gene1', 'Gene2']].values))
    pred_edges = dict(zip(map(tuple, predEdgeDF[['Gene1', 'Gene2']].values), 
                        predEdgeDF['EdgeWeight']))

    TrueEdgeDict = {}
    PredEdgeDict = {}

    if directed:
        edge_generator = product(unique_genes, repeat=2) if selfEdges else permutations(unique_genes, r=2)
    else:
        edge_generator = combinations_with_replacement(unique_genes, r=2) if selfEdges else combinations(unique_genes, r=2)

    for edge in edge_generator:
        key = '|'.join(edge)
        
        # Compute TrueEdgeDict
        TrueEdgeDict[key] = int(edge in true_edges or (not directed and edge[::-1] in true_edges))
        
        # Compute PredEdgeDict
        if directed:
            PredEdgeDict[key] = abs(pred_edges.get(edge, 0))
        else:
            PredEdgeDict[key] = max(abs(pred_edges.get(edge, 0)), abs(pred_edges.get(edge[::-1], 0)))

The computeScores function has been optimized for improved performance and efficiency. Key enhancements include:

Efficient Data Structures: Converted input DataFrames to sets and dictionaries for faster edge lookups.
Streamlined Edge Generation: Unified the logic for generating possible edges in both directed and undirected cases.
On-the-Fly Dictionary Creation: Eliminated the need for pre-initializing dictionaries with all possible edges.
Concise Code Structure: Reduced code repetition, particularly in handling directed and undirected cases.
Performance Implications: Expected significant speed improvements, especially for large genetic networks.
Scalability: Better suited for handling large-scale genetic network data.

Important Things I Learned During the Project

Adapting language model architectures to biological data required a deep understanding of both machine learning and genomics.
Balancing computational efficiency with model performance was crucial, especially when dealing with large-scale single-cell datasets.
Interpreting the biological significance of embeddings generated by foundation models proved to be a complex task, highlighting the need for interdisciplinary approaches.
The importance of rigorous benchmarking in computational biology, as demonstrated by the BEELINE framework, became evident throughout the project.
Collaborating with researchers from diverse backgrounds enhanced the project's scope and impact, emphasizing the value of interdisciplinary work in bioinformatics.

References

BEELINE: Benchmarking gEnE reguLatory network Inference from siNgle-cEll transcriptomic data. Link to Publication and Github
scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI. Link to Publication and Github