SumStats

class pysumstats.SumStats(path, phenotype=None, gwas_n=None, column_names=None, data=None, low_ram=False, tmpdir='sumstats_temporary', **kwargs)

Class for summary statistics of a single GWAS.

Parameters:
  • path (str) – Path to the file containing summary statistics. Should be a csv, or tab-delimited txt-file (.gz supported).
  • phenotype (str) – Phenotype name
  • gwas_n (int) – Optional N subjects in the GWAS, for N-based meta analysis (if there is no N column in the summary statistics)
  • column_names (dict) – Optional dictionary of column names, if these are not automatically recognised. Keys should be: [‘rsid’, ‘chr’, ‘bp’, ‘ea’, ‘oa’, ‘maf’, ‘b’, ‘se’, ‘p’, ‘hwe’, ‘info’, ‘n’, ‘eaf’, ‘oaf’]
  • data (dict) – Dataset for the new SumStats object. In general, do not specify this.
  • low_ram (bool) – Whether to use the low_ram option for this SumStats object. Use this only when running into MemoryErrors. Enabling this option will read/write data from local storage rather than RAM. This saves a lot of RAM, but significantly decreases processing speed.
  • tmpdir (str) – Which directory to store the temporary files if low_ram is enabled.
  • kwargs – other keyword arguments to be passed to pandas.read_csv() method
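As a minimal sketch of the column_names argument, the dictionary below maps the recognised keys listed above to the header names found in a file. The header names ('SNP', 'CHR', and so on) and the helper check_column_names are hypothetical, and the direction of the mapping (recognised key to file header) is an assumption based on the parameter description.

```python
# Keys accepted by column_names, as listed in the parameter description.
RECOGNISED_KEYS = ['rsid', 'chr', 'bp', 'ea', 'oa', 'maf', 'b', 'se', 'p',
                   'hwe', 'info', 'n', 'eaf', 'oaf']


def check_column_names(column_names):
    """Return the keys in column_names that are not recognised (hypothetical helper)."""
    return sorted(k for k in column_names if k not in RECOGNISED_KEYS)


# Hypothetical header names from a summary statistics file.
column_names = {'rsid': 'SNP', 'chr': 'CHR', 'bp': 'POS', 'b': 'BETA', 'p': 'PVAL'}
unknown = check_column_names(column_names)
print(unknown)
```

A dictionary that passes this check would then be supplied as the column_names argument when constructing the SumStats object.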
close()

Close the connection to the HDF5 file if low_ram is enabled.

Returns:None
copy()
Returns:a deep copy of the existing object
describe(columns=None, per_chromosome=False)

Get a summary of the data.

Parameters:
  • columns (list) – List of column names to print summary for (default: [‘b’, ‘se’, ‘p’])
  • per_chromosome (bool) – Enable to return a list of summary dataframes per chromosome
Returns:

pd.DataFrame, or list

groupby(*args, **kwargs)

Compatibility function to create pandas grouped object

Parameters:
  • args – arguments to be passed to pandas groupby function
  • kwargs – keyword arguments to be passed to pandas groupby function
Returns:

a full grouped pandas dataframe object

head(n=10, n_chromosomes=1, **kwargs)

Prints (n_chromosomes) dataframes with the first n rows.

Parameters:
  • n (int) – number of rows to show
  • n_chromosomes (int) – number of chromosomes to show.
  • kwargs – keyword arguments to be passed to pandas head function
Returns:

None

manhattan(**kwargs)

Generate a manhattan plot using this sumstats data

Parameters:kwargs – keyword arguments to be passed to pysumstats.plot.manhattan()
Returns:None, or (fig, ax)
merge(other, how='inner', low_memory=False)

Merge with other SumStats object(s).

Parameters:
  • other (pysumstats.SumStats or list) – Other SumStats object, or list of other SumStats objects.
  • how (str) – Type of merge.
  • low_memory (bool) – Enable to use a more RAM-efficient merging method (WARNING: still untested)
Returns:

pysumstats.MergedSumStats object

plot_all(dest='.', prefix='SumStatsPlots', kwargs={})

Runs all attached plot functions

Parameters:
  • dest (str) – Folder to save resulting files to. File names will be: {prefix}_{plottype}_{YEAR-MONTH-DAY}.png
  • prefix (str) – prefix to use when saving files.
  • kwargs (dict) – Nested dictionary of other keyword arguments to be passed to each function (keys of the top-level dictionary should be function names). Use the ‘all’ key in the top-level dictionary to pass keyword arguments to every function.
Returns:

None
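The nested kwargs dictionary and the file naming pattern described above can be sketched as follows. This is an illustration only, not the library's implementation: resolve_kwargs and output_name are hypothetical helpers, the option names ('dpi', 'title') are placeholders, and both the precedence of function-specific values over 'all' and the zero-padded date format are assumptions.

```python
from datetime import date


def resolve_kwargs(kwargs, function_name):
    """Combine the 'all' entry with the entry for one plot function.

    Assumption: function-specific values override values given under 'all'.
    """
    merged = dict(kwargs.get('all', {}))
    merged.update(kwargs.get(function_name, {}))
    return merged


def output_name(prefix, plottype, day):
    """Build a file name following {prefix}_{plottype}_{YEAR-MONTH-DAY}.png."""
    return '{}_{}_{}.png'.format(prefix, plottype, day.strftime('%Y-%m-%d'))


# Hypothetical keyword arguments; the option names are illustrative only.
plot_kwargs = {
    'all': {'dpi': 300},
    'manhattan': {'title': 'Height GWAS'},
    'qqplot': {'dpi': 150},
}

manhattan_kwargs = resolve_kwargs(plot_kwargs, 'manhattan')
qq_kwargs = resolve_kwargs(plot_kwargs, 'qqplot')
example_name = output_name('SumStatsPlots', 'manhattan', date(2020, 1, 31))
```

A dictionary shaped like plot_kwargs is what the kwargs parameter of plot_all() expects: one top-level key per plot function, plus an optional ‘all’ key.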

pzplot(**kwargs)

Generate a PZ-plot using this sumstats data

Parameters:kwargs – keyword arguments to be passed to pysumstats.plot.pzplot()
Returns:None, or (fig, ax)
qc(maf=None, hwe=None, info=None, **kwargs)

Basic GWAS quality control function.

Parameters:
  • maf (float or None) – Minor allele frequency cutoff, will drop SNPs where MAF < cutoff. Default: 0.1
  • hwe (float or None) – Hardy-Weinberg Equilibrium cutoff, will drop SNPs where HWE < cutoff, if specified and HWE column is present in the data.
  • info (float or None) – Imputation quality cutoff, will drop SNPs where Info < cutoff, if specified and Info column is present in the data.
  • kwargs – Other columns to filter on. Each keyword should be a column name; SNPs will be dropped where the value < argument.
Returns:

None
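The filtering rule described above (drop a SNP when its value falls below the cutoff for any specified column) can be sketched in plain Python. This is not the library's implementation: qc_filter is a hypothetical helper that operates on a list of dictionaries and returns a new list, whereas qc() works on the object's own data and returns None. The example values are made up.

```python
def qc_filter(rows, **cutoffs):
    """Keep rows whose value is >= the cutoff for every column given.

    Cutoffs set to None, and columns missing from a row, are skipped.
    """
    kept = []
    for row in rows:
        drop = any(
            cutoff is not None and column in row and row[column] < cutoff
            for column, cutoff in cutoffs.items()
        )
        if not drop:
            kept.append(row)
    return kept


# Made-up SNPs for illustration.
snps = [
    {'rsid': 'rs1', 'maf': 0.25, 'info': 0.95},
    {'rsid': 'rs2', 'maf': 0.005, 'info': 0.99},
    {'rsid': 'rs3', 'maf': 0.40, 'info': 0.60},
]
passed = qc_filter(snps, maf=0.01, info=0.8, hwe=None)
```

Here rs2 fails the MAF cutoff and rs3 fails the Info cutoff, while the hwe cutoff is ignored because it is None and no row has an hwe value.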

qqplot(**kwargs)

Generate a QQ-plot using this sumstats data

Parameters:kwargs – keyword arguments to be passed to pysumstats.plot.qqplot()
Returns:None, or (fig, ax)
reset_index()

Reset the index of the data.

Returns:None
save(path, per_chromosome=False, per_phenotype=False, phenotype=None, **kwargs)

Save the data held in this object to local storage.

Parameters:
  • path (str) – Relative or full path to the target file to store the data or object in. Paths ending in .pickle will save a pickled version of the full object. Note that with low_ram enabled this will not store the data. When per_phenotype is specified, add {} to the path where the phenotype name should be. If {} is not in the string, the filename will be prefixed with the phenotype name.
  • per_chromosome (bool) – Whether to save separate files for each chromosome.
  • per_phenotype – Set to True to create a separate file for each phenotype in MergedSumStats objects.
  • phenotype (str) – Only save a file for a specific phenotype in MergedSumStats objects.
  • kwargs – keyword arguments to be passed to pandas to_csv() function.
Returns:

None
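The path handling described for per_phenotype can be sketched as below. phenotype_path is a hypothetical helper, not part of the library, and the underscore used when prefixing the file name is an assumption; the documentation only states that the file name is prefixed with the phenotype name.

```python
from pathlib import PurePosixPath


def phenotype_path(path, phenotype):
    """Return the target path for one phenotype.

    Assumption: when '{}' is absent, the phenotype name and an underscore
    are prepended to the file name; the real separator may differ.
    """
    if '{}' in path:
        return path.format(phenotype)
    target = PurePosixPath(path)
    return str(target.with_name('{}_{}'.format(phenotype, target.name)))


# Hypothetical paths and phenotype name.
with_placeholder = phenotype_path('results/{}_sumstats.csv', 'height')
without_placeholder = phenotype_path('results/sumstats.csv', 'height')
```

Using the {} placeholder gives explicit control over where the phenotype name lands in the file name, which is useful when the name should not come first.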

sort_values(by, inplace=True, **kwargs)

Sorts values in the dataframe. Note: Sorting by chromosome (chr) will have no effect, as the data is already structured by chromosome.

Parameters:
  • by (str) – label of the column to sort values by
  • inplace (bool) – Whether to return the sorted object or sort values within existing object. (Currently only inplace sorting is supported)
  • kwargs – Other keyword arguments to be passed to pandas sort_values function
Returns:

None

tail(n=10, n_chromosomes=1, **kwargs)

Prints (n_chromosomes) dataframes with the last n rows.

Parameters:
  • n (int) – number of rows to show
  • n_chromosomes (int) – number of chromosomes to show.
  • kwargs – keyword arguments to be passed to pandas tail function
Returns:

None