SumStats

class pysumstats.SumStats(path, phenotype=None, gwas_n=None, column_names=None, data=None, low_ram=False, tmpdir='sumstats_temporary', **kwargs)

Class for summary statistics of a single GWAS.

Parameters:

path (str) – Path to the file containing summary statistics. Should be a CSV or tab-delimited text file (.gz supported).
phenotype (str) – Phenotype name
gwas_n (int) – Optional N subjects in the GWAS, for N-based meta analysis (if there is no N column in the summary statistics)
column_names (dict) – Optional dictionary of column names, if these are not automatically recognised. Keys should be: [‘rsid’, ‘chr’, ‘bp’, ‘ea’, ‘oa’, ‘maf’, ‘b’, ‘se’, ‘p’, ‘hwe’, ‘info’, ‘n’, ‘eaf’, ‘oaf’]
data (dict) – Dataset for the new SumStats object. In general, do not specify this.
low_ram (bool) – Whether to use the low_ram option for this SumStats object. Use this only when running into MemoryErrors. Enabling this option reads and writes data from local storage rather than RAM. It saves a lot of RAM but significantly decreases processing speed.
tmpdir (str) – Which directory to store the temporary files if low_ram is enabled.
kwargs – other keyword arguments to be passed to pandas.read_csv() method
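
As an illustration of the constructor parameters above, here is a minimal sketch that writes a tiny tab-delimited file and prepares a column_names dictionary for it. The file name, the headers (SNP, CHR, POS and so on), the data values, and the helper names write_example_file and load_sumstats are all hypothetical. The sketch also assumes that column_names maps each standard key to the header used in the file, which the parameter description suggests but does not state outright.

```python
import csv
import os
import tempfile

# Hypothetical headers used in the input file, keyed by the standard names
# that column_names expects (assumed direction of the mapping).
COLUMN_NAMES = {
    "rsid": "SNP",
    "chr": "CHR",
    "bp": "POS",
    "ea": "A1",
    "oa": "A2",
    "b": "BETA",
    "se": "SE",
    "p": "PVAL",
}


def write_example_file(path):
    """Write a tiny tab-delimited summary statistics file (made-up values)."""
    rows = [
        ["rs1", 1, 1000, "A", "G", 0.02, 0.01, 0.04],
        ["rs2", 1, 2000, "C", "T", -0.01, 0.01, 0.30],
    ]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(COLUMN_NAMES.values())
        writer.writerows(rows)
    return path


def load_sumstats(path, phenotype, gwas_n):
    """Build a SumStats object; requires pysumstats to be installed."""
    # Imported lazily so the rest of the sketch runs without pysumstats.
    from pysumstats import SumStats

    return SumStats(path, phenotype=phenotype, gwas_n=gwas_n,
                    column_names=COLUMN_NAMES)


example_path = write_example_file(
    os.path.join(tempfile.mkdtemp(), "example_gwas.txt"))
with open(example_path) as handle:
    header = handle.readline().rstrip("\n").split("\t")
```

With pysumstats installed, load_sumstats(example_path, 'height', 50000) would pass these arguments to the constructor; the phenotype name and sample size here are placeholders.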

close()

Close the connection to the HDF5 file if low_ram is enabled.

Returns:	None

copy()

Returns:	a deep copy of the existing object

describe(columns=None, per_chromosome=False)

Get a summary of the data.

Parameters:

columns (list) – List of column names to print a summary for (default: [‘b’, ‘se’, ‘p’])
per_chromosome (bool) – Enable to return a list of summary dataframes, one per chromosome

Returns:

pd.DataFrame, or list

groupby(*args, **kwargs)

Compatibility function to create a pandas grouped object.

Parameters:

args – arguments to be passed to the pandas groupby function
kwargs – keyword arguments to be passed to the pandas groupby function

Returns:

a full grouped pandas dataframe object

head(n=10, n_chromosomes=1, **kwargs)

Prints (n_chromosomes) dataframes with the first n rows.

Parameters:

n (int) – number of rows to show
n_chromosomes (int) – number of chromosomes to show
kwargs – keyword arguments to be passed to the pandas head function

Returns:

None

manhattan(**kwargs)

Generate a Manhattan plot from this SumStats data.

Parameters:	kwargs – keyword arguments to be passed to `pysumstats.plot.manhattan()`
Returns:	None, or (fig, ax)

merge(other, how='inner', low_memory=False)

Merge with other SumStats object(s).

Parameters:

other (`pysumstats.SumStats` or list) – Other SumStats object, or list of other SumStats objects.
how (str) – Type of merge.
low_memory (bool) – Enable to use a more RAM-efficient merging method (WARNING: still untested)

Returns:

`pysumstats.plot.MergedSumStats` object
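
The following sketch shows one way merge() might be combined with qc() (documented further down): apply the same cutoffs to every object, then merge them. The helper name qc_then_merge and the cutoff values 0.01 and 0.8 are illustrative assumptions, not recommendations from the library. Because qc() is documented to return None, the sketch assumes it filters each object in place.

```python
def qc_then_merge(first, others, maf=0.01, info=0.8, how="inner"):
    """Apply the same QC cutoffs to every object, then merge them.

    `first` is a SumStats object and `others` is a list of SumStats objects.
    qc() is documented to return None, so it is called for its side effect
    (assumed to be in-place filtering).
    """
    for sumstats in [first, *others]:
        sumstats.qc(maf=maf, info=info)
    # merge() accepts a single SumStats object or a list of them.
    return first.merge(others, how=how)
```

The helper only relies on the qc() and merge() signatures listed on this page, so it works with any objects that provide those two methods.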

plot_all(dest='.', prefix='SumStatsPlots', kwargs={})

Runs all attached plot functions

Parameters:

dest (str) – Folder to save resulting files to. File names will be: {prefix}_{plottype}_{YEAR-MONTH-DAY}.png
prefix (str) – prefix to use when saving files.
kwargs (dict) – Nested dictionary of other keyword arguments to be passed to each function (keys of the top-level dictionary should be function names). Use the ‘all’ key in the top-level dictionary to pass keyword arguments to every function.

Returns:

None
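
To make the nested kwargs dictionary and the output file names concrete, here is a minimal sketch. The helper names kwargs_for and plot_file_name, and the option names shared_option and manhattan_option, are hypothetical placeholders rather than arguments the plot functions are known to accept. The sketch assumes that function-specific entries take precedence over entries under ‘all’ and that the date is zero-padded; neither is stated in the description above.

```python
from datetime import date


def kwargs_for(plot_kwargs, function_name):
    """Combine the 'all' entry with the entry for one plot function."""
    combined = dict(plot_kwargs.get("all", {}))
    # Assumption: function-specific values win over values under 'all'.
    combined.update(plot_kwargs.get(function_name, {}))
    return combined


def plot_file_name(prefix, plottype, day):
    """Format {prefix}_{plottype}_{YEAR-MONTH-DAY}.png (zero-padding assumed)."""
    return f"{prefix}_{plottype}_{day:%Y-%m-%d}.png"


# Placeholder option names; substitute arguments the plot functions accept.
plot_kwargs = {
    "all": {"shared_option": 1},
    "manhattan": {"manhattan_option": 2},
}
```

A dictionary shaped like plot_kwargs would be passed as plot_all(dest='plots', kwargs=plot_kwargs), where 'plots' is an example folder name.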

pzplot(**kwargs)

Generate a PZ-plot from this SumStats data.

Parameters:	kwargs – keyword arguments to be passed to `pysumstats.plot.pzplot()`
Returns:	None, or (fig, ax)

qc(maf=None, hwe=None, info=None, **kwargs)

Basic GWAS quality control function.

Parameters:

maf (float or None) – Minor allele frequency cutoff, will drop SNPs where MAF < cutoff. Default: 0.1
hwe (float or None) – Hardy-Weinberg Equilibrium cutoff, will drop SNPs where HWE < cutoff, if specified and HWE column is present in the data.
info (float or None) – Imputation quality cutoff, will drop SNPs where Info < cutoff, if specified and Info column is present in the data.
kwargs – Other columns to filter on. Each keyword should be a column name; SNPs will be dropped where the value < argument.

Returns:

None
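
The cutoff rule described above (drop a SNP when its value is below the cutoff, and skip a cutoff that is not specified or whose column is absent) can be sketched in plain Python. This is an illustration of the rule under those stated assumptions, not the library's implementation; the helper name passes_qc and the SNP records are made up.

```python
def passes_qc(snp, **cutoffs):
    """Return True if the SNP survives every cutoff that applies to it.

    A cutoff is skipped when it is None or when the SNP has no such column,
    following the "if specified and column is present" wording above.
    """
    for column, cutoff in cutoffs.items():
        if cutoff is None or column not in snp:
            continue
        if snp[column] < cutoff:
            return False
    return True


# Made-up SNP records for illustration.
snps = [
    {"rsid": "rs1", "maf": 0.20, "info": 0.95},
    {"rsid": "rs2", "maf": 0.004, "info": 0.99},  # fails the maf cutoff
    {"rsid": "rs3", "maf": 0.30, "info": 0.40},   # fails the info cutoff
    {"rsid": "rs4", "maf": 0.15},                 # no info value, cutoff skipped
]
kept = [snp["rsid"] for snp in snps if passes_qc(snp, maf=0.01, info=0.8)]
```

Extra keyword arguments behave the same way as maf, hwe and info in this sketch, which matches the description of kwargs as additional columns to filter on.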

qqplot(**kwargs)

Generate a QQ-plot from this SumStats data.

Parameters:	kwargs – keyword arguments to be passed to `pysumstats.plot.qqplot()`
Returns:	None, or (fig, ax)

reset_index()

Reset the index of the data.

Returns:	None

save(path, per_chromosome=False, per_phenotype=False, phenotype=None, **kwargs)

Save the data held in this object to local storage.

Parameters:

path (str) – Relative or full path to the target file to store the data or object in. Paths ending in .pickle will save a pickled version of the full object; note that with low_ram enabled this will not store the data. When per_phenotype is specified, add {} to the path where the phenotype name should go; if {} is not in the string, the filename will be prefixed with the phenotype name.
per_chromosome (bool) – Whether to save separate files for each chromosome.
per_phenotype – Set to True to create a separate file for each phenotype in MergedSumStats objects.
phenotype (str) – Only save a file for a specific phenotype in MergedSumStats objects.
kwargs – keyword arguments to be passed to the pandas to_csv() function.

Returns:

None
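
The {} placeholder rule for per_phenotype paths can be sketched as follows. The helper name phenotype_path is hypothetical, and joining the phenotype name to the filename with an underscore is an assumption: the description only says the filename is prefixed with the phenotype name.

```python
import posixpath


def phenotype_path(path, phenotype):
    """Return the per-phenotype target path described for save()."""
    if "{}" in path:
        # The placeholder marks where the phenotype name goes.
        return path.replace("{}", phenotype)
    # No placeholder: prefix the file name (not the directory) with the name.
    # Assumption: joined with an underscore; the documentation does not say.
    directory, filename = posixpath.split(path)
    return posixpath.join(directory, f"{phenotype}_{filename}")
```

For example, both 'out/{}_merged.csv' and 'out/merged.csv' resolve to 'out/height_merged.csv' for a phenotype named 'height' under this sketch.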

sort_values(by, inplace=True, **kwargs)

Sorts values in the dataframe. Note: Sorting by chromosome (chr) will have no effect, as the data is already structured by chromosome.

Parameters:

by (str) – label of the column to sort values by
inplace (bool) – Whether to sort values within the existing object rather than return a sorted object. Currently only in-place sorting is supported.
kwargs – Other keyword arguments to be passed to the pandas sort_values function

Returns:

None

tail(n=10, n_chromosomes=1, **kwargs)

Prints (n_chromosomes) dataframes with the last n rows.

Parameters:

n (int) – number of rows to show
n_chromosomes (int) – number of chromosomes to show
kwargs – keyword arguments to be passed to the pandas tail function

Returns:

None