PyMethylProcess – a preprocessing workflow for DNA methylation

Byte-sized content for the busy scientist


Fynd Science Analytics

Introducing PyMethylProcess

There’s an eternal debate among data scientists about whether R or Python should reign supreme. In reality, there shouldn’t be a required trade-off, and tools should be built for both user sets. That’s where PyMethylProcess comes in.

PyMethylProcess is a Python tool developed to port over some of the most widely used R tools for DNA methylation analysis. Instead of reinventing the wheel, Levy et al. took advantage of the best that both Python and R have to offer and built a tool that lets users do just that. Below you will find the tutorial on how to use PyMethylProcess. It was originally published in the PyMethylProcess GitHub repository and is reproduced here for consolidation's sake.

Steps involved in using the PyMethylProcess tool for DNA methylation analysis

  1. Downloading the Data

  2. Formatting the Sample Sheet

  3. Running the Preprocessing Pipeline

  4. Accessing, Reading and Writing MethylationArray Data

  5. Final Preprocessing Steps

  6. Visualizing the Data

  7. Splitting the Data into Training, Testing, and Validation Sets

  8. Cell Type Deconvolution and Age Prediction

  9. Running a Machine Learning Analysis

1. Downloading the Data with PyMethylProcess

Let’s assume that you’ve already followed all of the installation instructions on the README page.

First we're going to download the IDATs of a GEO dataset along with some covariate information. Normally, you'd have to run a few commands in R using GEOquery or a similar package. Now all you have to do is:

pymethyl-preprocess download_geo -g GSE87571 -o geo_idats/

The -g option specifies a particular GEO dataset, in this case GSE87571. By default, the IDAT files are stored in a directory called geo_idats (set here with -o).

Once the data is downloaded, you should find a list of IDAT files in that directory along with a CSV file containing the clinical covariates. Let's take a quick look at the clinical covariate CSV file output from this step (you can also obtain just the covariates, with less time and bandwidth, using the download_pheno CWL tool):


It’s easy to see that there are a lot of columns we don’t need, and some that we would like to change. See the next step for formatting the data.

Note: This repository is currently oriented toward IDAT preprocessing. Josh is working on a higher-throughput workflow for stored unmethylated and methylated intensity tab/comma-separated value files, as available for many studies on GEO. He may add this component as an experimental part of this pipeline soon; it will take you from these matrix files to the MethylationArray object introduced below.

2. Formatting the Sample Sheet

We now need to format this cumbersome clinical CSV file so that it is easy to interpret and use in downstream steps. To do this, PyMethylProcess only requires two commands and a file that maps the column names from the table we just saw to cleaner names.

First, let’s see what header names need changing. The key columns/info in the pheno CSV file are geo_accession, gender:ch1, tissue:ch1, “disease state:ch1” and age:ch1. We want the pheno CSV to only contain this information, and maybe change the column names to AccNum, Sex, Tissue, disease, and Age respectively. We introduce this mapping in a TSV file (tab-separated because some of the column names have spaces):

nano include_col.txt

In include_col.txt, add the following mappings, separated by tabs (the alignment below uses spaces only for display):

gender:ch1       Sex
tissue:ch1       Tissue
age:ch1          Age
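Because literal tabs are easy to lose when copy-pasting, one way to guarantee them is to write the mapping file from Python (a small convenience sketch, not part of PyMethylProcess):

```python
# Write the column-name mapping with literal tab separators,
# avoiding the copy-pasted-spaces pitfall.
mapping = {
    "gender:ch1": "Sex",
    "tissue:ch1": "Tissue",
    "age:ch1": "Age",
}

with open("include_col.txt", "w") as fh:
    for old, new in mapping.items():
        fh.write(f"{old}\t{new}\n")
```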

We add this file to the command below:

pymethyl-preprocess create_sample_sheet -is ./geo_idats/GSE87571_clinical_info.csv -s geo -i geo_idats/ -os geo_idats/samplesheet.csv -d "disease state:ch1" -c include_col.txt
mkdir backup_clinical && mv ./geo_idats/GSE87571_clinical_info.csv backup_clinical

The first command formats the CSV to match our aims above. Here we've specified "disease state:ch1" with the -d option. -s geo selects GEO formatting, which takes the geo_accession column into account and renames it accordingly. The command also scans the IDAT directory (-i) and adds a Basename column for minfi and meffil to use during parallelized preprocessing. -is and -os specify the input and output CSVs. At most one CSV can exist in the geo_idats directory (the CSV can live in another directory and still reference the IDATs).

The command below makes sure the sex column, if present, is properly encoded:

pymethyl-preprocess meffil_encode -is geo_idats/samplesheet.csv -os geo_idats/samplesheet.csv

Now we can look at what we’ve generated: geo_idats/samplesheet.csv

3. Running the PyMethylProcess Preprocessing Pipeline

Now we have a samplesheet.csv pheno CSV file and a directory of IDATs.

We can run the PyMethylProcess preprocessing pipeline by running:

pymethyl-preprocess preprocess_pipeline -i geo_idats/ -p minfi -noob -qc

This preprocesses the IDATs in geo_idats using the minfi (-p) option with noob normalization (-noob), and with -qc it only stores the RGSet objects once loaded (the normalization has not been performed yet). Omitting the -qc option executes the entire pipeline at once, but stopping at QC can be useful for debugging or for setting threshold parameters.

To finish the qc/norm process, run:

pymethyl-preprocess preprocess_pipeline -i geo_idats/ -p minfi -noob -u

Which loads the saved QC/RGSet(s) via the -u option.

The preprocessing pipeline comes with other options to remove, or fill with NA values, CpGs or samples that fail detection p-value and bead-number thresholds. The meffil and enmix options (see docs) allow you to specify multiple cores to preprocess the data. minfi does not offer this, but as a workaround you can split the pheno CSV file into batches (planned as a built-in feature) and use:

pymethyl-preprocess split_preprocess_input_by_subtype -h # similar to a groupby statement
pymethyl-preprocess batch_deploy_preprocess -h # runs the preprocessing command for each new pheno csv from the above step
pymethyl-preprocess combine_methylation_arrays -h # combines and merges the resulting objects

These commands preprocess batches of IDATs in parallel. See the help documentation for more detail.
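Conceptually, the split step is a group-by on one pheno column; a minimal stdlib sketch with toy rows (the Tissue key here is just an example):

```python
from collections import defaultdict

def split_by_subtype(rows, key):
    """Group pheno rows into batches by the value of one column."""
    batches = defaultdict(list)
    for row in rows:
        batches[row[key]].append(row)
    return dict(batches)

# Toy pheno rows; each batch can then be preprocessed independently
# and the resulting MethylationArrays combined afterwards.
pheno = [
    {"Basename": "idat_1", "Tissue": "Blood"},
    {"Basename": "idat_2", "Tissue": "Saliva"},
    {"Basename": "idat_3", "Tissue": "Blood"},
]
batches = split_by_subtype(pheno, "Tissue")
```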

4. Accessing, Reading and Writing MethylationArray Data

The output from the preprocessing pipeline is a .pkl file, i.e., a Python pickle file. This file type is akin to the .RData file you would get from saving R data: it stores Python objects that can be loaded later. Saved in this file is a Python dictionary with keys pheno and beta, holding a phenotype pandas DataFrame and a beta-value DataFrame respectively. In both DataFrames, samples make up the rows/indices; CpGs make up the columns of the beta matrix, while covariates and sample info make up the columns of the phenotype DataFrame. These two DataFrames are used to construct a MethylationArray object, stored in preprocess_outputs/ (or combined_outputs/ when batching/splitting up data).
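As a toy illustration of what such a pickle holds, here is a round trip with stdlib pickle, using plain dicts as stand-ins for the pandas DataFrames (sample and CpG names are made up):

```python
import pickle

# Toy stand-ins for the pheno and beta DataFrames: samples are rows.
methyl_dict = {
    "pheno": {"sample_1": {"Age": 34, "Sex": "F"},
              "sample_2": {"Age": 51, "Sex": "M"}},
    "beta":  {"sample_1": {"cg00000029": 0.61, "cg00000108": 0.12},
              "sample_2": {"cg00000029": 0.58, "cg00000108": 0.15}},
}

payload = pickle.dumps(methyl_dict)   # what write_pickle() serializes
restored = pickle.loads(payload)      # what from_pickle() reads back
```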

Let’s open up this object by running Python interactively:

>>> from pymethylprocess.MethylationDataTypes import MethylationArray
>>> methyl_arr = MethylationArray.from_pickle('preprocess_outputs/methyl_array.pkl')
>>> methyl_arr.pheno
>>> methyl_arr.beta

The outputs of the last two commands are not shown, but they demonstrate that the MethylationArray object stores pheno and beta values as data frames, and that the data can be loaded this way. This object has many methods, not all covered here; consult the help docs for more information.

To write an object to a .pkl file (‘some_pickle_file.pkl’) for python or CSVs (pheno.csv and beta.csv in some directory ‘some_output_directory/’) for loading in R, use:

>>> methyl_arr.write_csvs('some_output_directory/')
>>> methyl_arr.write_pickle('some_pickle_file.pkl')

5. Final PyMethylProcess Preprocessing Steps

Finally, once the data is in a MethylationArray object, you can remove sex probes as such:

pymethyl-utils remove_sex -i preprocess_outputs/methyl_array.pkl

See the help docs for more information, but it takes a list of sex probes supplied by meffil and removes them from the beta attribute of the MethylationArray object. Unless the output is specified, it will store the new MethylationArray in the “autosomal/” directory.
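Conceptually, the removal is a column drop on the beta matrix; a minimal sketch with toy data (the probe IDs below are hypothetical; the real list comes from meffil):

```python
# Hypothetical sex-probe IDs; in PyMethylProcess the list is supplied by meffil.
sex_probes = {"cg00050873", "cg00212031"}

# Toy beta matrix: samples as rows, CpGs as columns.
beta = {
    "sample_1": {"cg00050873": 0.90, "cg00000029": 0.61},
    "sample_2": {"cg00212031": 0.80, "cg00000029": 0.58},
}

# Keep only autosomal probes.
autosomal_beta = {
    sample: {cpg: v for cpg, v in cpgs.items() if cpg not in sex_probes}
    for sample, cpgs in beta.items()
}
```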

Here, in preparation for imputation, you can report on the degree of missingness overall, and in the rows and columns of the beta matrix through this command:

pymethyl-preprocess na_report -i autosomal/methyl_array.pkl -o na_report/

It will output a few plots in na_report/ and print a summary directly to the terminal.

Now, we are ready to impute the missing data.

pymethyl-preprocess imputation_pipeline -i ./autosomal/methyl_array.pkl -s fancyimpute -m KNN -k 15 -st 0.05 -ct 0.05

This imputation pipeline will first remove samples (-st) and CpGs (-ct) that exceed 5% missingness. Then it will impute any remaining missing values using k-nearest neighbors (-m KNN) from the fancyimpute package (-s), with 15 neighbors (-k). Feature selection is also available as an option at this stage, but we save it for a downstream step. Results are stored in "imputed_outputs/".
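The thresholding step amounts to computing a missing-value fraction per sample and per CpG; a minimal stdlib sketch (toy matrix, None marking missing; the KNN imputation itself is left to fancyimpute):

```python
def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

# Toy beta matrix: rows are samples, columns are CpGs.
samples = ["s1", "s2", "s3"]
cpgs = ["cg1", "cg2", "cg3"]
beta = {
    "s1": {"cg1": 0.2, "cg2": 0.5, "cg3": 0.7},
    "s2": {"cg1": 0.3, "cg2": 0.4, "cg3": 0.6},
    "s3": {"cg1": 0.4, "cg2": None, "cg3": 0.5},
}

threshold = 0.05  # analogous to -st / -ct: drop anything above 5% missingness
keep_samples = [s for s in samples
                if missing_fraction(list(beta[s].values())) <= threshold]
keep_cpgs = [c for c in cpgs
             if missing_fraction([beta[s][c] for s in samples]) <= threshold]
```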

Feature Selection

Now we can perform feature selection using mean absolute deviation (MAD): we keep the CpGs whose beta values deviate most, on average, from their mean methylation across samples.

pymethyl-preprocess feature_select -n 300000

By default, the algorithm expects the MethylationArray to be in "imputed_outputs/" and outputs a MethylationArray with 300k CpGs to a pickle file in "final_outputs/". Other feature selection methods are being developed. Alternatively, one could first split the data:

pymethyl-utils train_test_val_split -tp .8 -vp .125

and then, with the data in training, testing, and validation sets, select the top CpGs from the training set alone, in case there is any dataset bias. We assume here that there is not.
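The MAD criterion itself can be sketched in a few lines (toy beta values; the real command operates on the full matrix):

```python
def mad(values):
    """Mean absolute deviation from the mean."""
    m = sum(values) / len(values)
    return sum(abs(v - m) for v in values) / len(values)

# Toy beta values per CpG, across three samples.
beta_by_cpg = {
    "cg_flat":     [0.50, 0.50, 0.51],  # barely varies -> low MAD
    "cg_variable": [0.10, 0.90, 0.50],  # varies a lot -> high MAD
    "cg_mid":      [0.30, 0.50, 0.40],
}

n = 2  # analogous to feature_select -n
top_cpgs = sorted(beta_by_cpg,
                  key=lambda c: mad(beta_by_cpg[c]),
                  reverse=True)[:n]
```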

With the preprocessing out of the way, we can perform additional analyses. Please note that you can perform some of the more classical analyses with a MethylationArray from the “imputed_outputs/” folder if not seeking to select features.

6. Visualizing the Data

Here, we can visualize the MethylationArray object with the command:

mkdir visualizations
pymethyl-visualize transform_plot -o visualizations/umap_embed.html -c disease -nn 8 

This takes data from final_outputs/methyl_array.pkl (another input can be specified with -i), finds a 3D UMAP embedding using 8 nearest neighbors (-nn), and colors the points by the disease column of the phenotype data. The output (-o) is an interactive Plotly visualization, "visualizations/umap_embed.html", which the user can explore.

You can overlay the age on each of the datapoints using -c Age.

7. Splitting the Data into Training, Testing, and Validation Sets

Splitting the data in this way is key for machine learning analyses. The training set updates the parameters of the model, the validation set checks that the model generalizes to unseen data and guides hyperparameter optimization, and the testing set is held out for final evaluation. We can split the MethylationArray from the final_outputs/ directory into smaller MethylationArrays accordingly:

pymethyl-utils train_test_val_split -tp .8 -vp .125

which creates 3 new MethylationArrays (70% training, 10% validation, and 20% testing) and outputs them as such:

ls train_val_test_sets/
-> test_methyl_array.pkl train_methyl_array.pkl  val_methyl_array.pkl
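The 70/10/20 figures follow from the two flags: -tp 0.8 holds out 20% for testing, and -vp 0.125 carves the validation set out of the remaining training portion. A quick check of that arithmetic (assuming -vp is taken as a fraction of the training portion, which matches the numbers above):

```python
n = 1000             # total samples
tp, vp = 0.8, 0.125  # the -tp and -vp values used above

n_test = round(n * (1 - tp))    # 20% held out for testing
n_trainval = n - n_test
n_val = round(n_trainval * vp)  # 12.5% of 80% = 10% overall
n_train = n_trainval - n_val    # 70% remains for training
```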

If you wanted to combine them again, you can add them to a MethylationArrays object and use the .combine() method.

8. Cell Type Deconvolution and Age Prediction

PyMethylProcess also offers quick access to subroutines for cell-type deconvolution and age prediction.

Age Prediction

Here, we take the MethylationArray that was constructed before feature selection and run age / age-acceleration estimation using the EpiTOC, Horvath, and Hannum clocks.

pymethyl-utils est_age -a epitoc -a horvath -a hannum -ac Age -i imputed_outputs/test_methyl_array.pkl

The default output is to this filename: “age_estimation/output_age_estimations.csv”.

If you added the true ages from the MethylationArray to the saved CSV file, you could score the accuracy of these clocks (please submit an issue if you need help doing this):

pymethyl-utils rate_regression -i age_estimation/age_results.csv -c1 Horvath.Est -c2 Age
pymethyl-utils rate_regression -i age_estimation/age_results.csv -c1 Hannum.Est -c2 Age
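Scoring a clock against true ages boils down to a regression fit; here is a stdlib R² sketch with made-up ages and clock estimates:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy true ages and hypothetical clock estimates.
true_age = [25, 40, 55, 70]
clock_est = [27, 38, 57, 69]

score = r_squared(true_age, clock_est)  # close to 1 for a good clock
```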

Cell Type Deconvolution

Cell-type estimates can be derived from one of the original QC sets created with the -qc option of the preprocessing pipeline (created automatically if not specified). The preprocessing library creates RGSets when running minfi or ENmix, and QCObjects when running meffil. The directory that contains them must be specified with the -ro option so the tool knows where to look. The analysis uses either estimateCellCounts2 (-a IDOL), the original estimateCellCounts (-a minfi), or a meffil-provided algorithm (-a meffil). All employ the Houseman deconvolution method. You can also specify which library of CpGs to use (-l): IDOLOptimizedCpGs or IDOLOptimizedCpGs450klegacy if using estimateCellCounts2, and a reference (-r) blood set if using meffil.

pymethyl-utils ref_estimate_cell_counts -ro geo_idats/ -a IDOL

Estimates are dumped to added_cell_counts/cell_type_estimates.csv. The output may be in long format, which you can change to wide format using the following Python commands:

>>> import pandas as pd
>>> df = pd.read_csv('added_cell_counts/cell_type_estimates.csv')
>>> df.pivot(index='Var1', columns='Var2', values='Freq').to_csv('added_cell_counts/cell_types_adjusted.csv')
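The reshape is a standard long-to-wide pivot; in stdlib terms it amounts to this (toy rows mirroring the Var1/Var2/Freq columns):

```python
# Long format: one row per (sample, cell type, estimate).
long_rows = [
    ("sample_1", "CD4T", 0.12),
    ("sample_1", "NK",   0.05),
    ("sample_2", "CD4T", 0.10),
    ("sample_2", "NK",   0.07),
]

# Wide format: one row per sample, one column per cell type.
wide = {}
for var1, var2, freq in long_rows:
    wide.setdefault(var1, {})[var2] = freq
```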

9. Running a Machine Learning Analysis with PyMethylProcess

Here, we'll run a quick machine learning analysis on this age dataset. In particular, we'll focus on two applications: clustering and age prediction. As examples, we'll show that similar ages cluster together and build a model that predicts a person's age.

First, if you want the analysis to run faster (at the cost of some accuracy), prune some of the CpGs; here we keep 25,000 of them while retaining only 35% of the samples:

>>> from pymethylprocess.MethylationDataTypes import MethylationArray
>>> sample_p = 0.35
>>> methyl_array = MethylationArray.from_pickle("train_methyl_array.pkl")
>>> methyl_array.subsample("Age", frac=None, n_samples=int(methyl_array.pheno.shape[0]*sample_p)).write_pickle("train_methyl_array_subsampled.pkl")
>>> methyl_array = MethylationArray.from_pickle("val_methyl_array.pkl")
>>> methyl_array.subsample("Age", frac=None, n_samples=int(methyl_array.pheno.shape[0]*sample_p)).write_pickle("val_methyl_array_subsampled.pkl")
>>> methyl_array = MethylationArray.from_pickle("test_methyl_array.pkl")
>>> methyl_array.subsample("Age", frac=None, n_samples=int(methyl_array.pheno.shape[0]*sample_p)).write_pickle("test_methyl_array_subsampled.pkl")

Then, back in the shell, select features across the three sets:

pymethyl-utils feature_select_train_val_test -n 25000

The previous step is not necessary but is a neat trick, enacted via the .subsample() method of MethylationArray.


Then, we’d want to UMAP transform the data (make sure to load the saved MethylationArrays again as train_methyl_array, val_methyl_array, and test_methyl_array respectively):

import pandas as pd
from umap import UMAP

# Fit the reducer on the training data, then apply it to the other sets.
umap = UMAP(n_components=100)
train_methyl_array.beta = pd.DataFrame(umap.fit_transform(train_methyl_array.beta.values), index=train_methyl_array.return_idx())
val_methyl_array.beta = pd.DataFrame(umap.transform(val_methyl_array.beta.values), index=val_methyl_array.return_idx())
test_methyl_array.beta = pd.DataFrame(umap.transform(test_methyl_array.beta.values), index=test_methyl_array.return_idx())

We just replaced the CpG columns of the beta values with 100 UMAP components, using a dimensionality reduction model trained on the training data.

UMAP works well in conjunction with HDBSCAN, so let's cluster with HDBSCAN (install the hdbscan package if you don't already have it). We'll run this on our training data:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from seaborn import cubehelix_palette
from hdbscan import HDBSCAN

def reduce_plot(data, labels, legend_title):
    # Plot the first two embedding components, colored by the supplied labels.
    plot_df = data.iloc[:, :2].copy()
    plot_df.columns = ['z1', 'z2']
    plot_df[legend_title] = labels
    sns.scatterplot(x='z1', y='z2', hue=legend_title, palette=cubehelix_palette(as_cmap=True), data=plot_df)
    plt.show()

model = HDBSCAN(algorithm='best')
train_predicted_clusters = model.fit_predict(train_methyl_array.beta.astype(np.float64))
reduce_plot(train_methyl_array.beta, train_methyl_array.pheno['Age'].values, 'Age')
reduce_plot(train_methyl_array.beta, train_predicted_clusters, 'Cluster')
[Figure: UMAP embedding of the training data, colored by age and by predicted cluster]

It's easy to see some correspondence between age and cluster, and some age ranges appear to correspond to particular clusters. In HDBSCAN, points labeled -1 are noise, and we would want to filter them out before making future predictions.
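Dropping the noise points is a one-liner, since HDBSCAN marks them with label -1 (toy labels below):

```python
# Toy cluster labels as HDBSCAN's fit_predict would return them (-1 = noise).
labels = [0, 0, 1, -1, 1, -1, 2]
samples = [f"s{i}" for i in range(len(labels))]

# Keep only samples assigned to a real cluster.
clustered = [(s, lab) for s, lab in zip(samples, labels) if lab != -1]
```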


Supervised Analysis

After an unsupervised analysis, we may want to try something more supervised and actually predict age in the other two sets. This can be accomplished either by setting up your own pipeline on top of the convenient data setup, or by calling the general_machine_learning part of the package. The same functionality is available on the command line via the pymethyl-basic-ml command (use -h for help). Josh illustrates how to set up the model but leaves running it as a task for the user:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Fit on the training set, then predict age in the validation and test sets.
model = RandomForestRegressor()
model.fit(train_methyl_array.beta, train_methyl_array.pheno['Age'])

y_pred = {}
for name, arr in [('val', val_methyl_array), ('test', test_methyl_array)]:
    y_pred[name] = model.predict(arr.beta)
    print(name, r2_score(arr.pheno['Age'], y_pred[name]))

A reminder that this is a severely limited dataset, and limited to only 100 UMAP components. With the right data and right model, even utilizing all of the CpGs, predictive accuracy should skyrocket. One could also concatenate the validation data to the training data and train using cross-validation. The machine learning module adds the convenience of being able to input a MethylationArray object and run hyperparameter scans.

Feature requests are welcome.

We can also get Gini importances that indicate how important each UMAP component was to the prediction. If we did away with the UMAP embedding and ran the model in the same manner, we'd get one such value per CpG.



Here's what Josh found when running the model (again, tuning hyperparameters via grid search or another hyperparameter optimization method may be essential for maximizing the validation set's score):

[Figure: PyMethylProcess validation set scores]

PyMethylProcess Abstract


Performing highly parallelized preprocessing of methylation array data using Python can accelerate data preparation for downstream methylation analyses, including large-scale production-ready machine learning pipelines. We present a highly reproducible, scalable pipeline (PyMethylProcess) that can be quickly set up and deployed through Docker and PIP.

Availability and Implementation

Project Home Page: Available on PyPI (pymethylprocess), Docker (joshualevy44/pymethylprocess).

PyMethylProcess Citation

Joshua J. Levy, Alexander J. Titus, Lucas A. Salas, Brock C. Christensen (2019) PyMethylProcess – convenient high-throughput preprocessing workflow for DNA methylation data. Bioinformatics. DOI: 10.1093/bioinformatics/btz594. PMID: 31368477.


