Klab Network Generation Utility


A copy of the KNet coherent network generation utility can be obtained here for Windows, Macintosh and Linux (Intel with GTK).

Overview: This utility finds pairs of genes with similar behavior across a gene expression dataset. It is intended as a comparative tool: run it on N datasets that have the same number of biological samples (input columns) and measured genes (input rows), then compare the resulting network sizes (number of output rows).

Instructions




(A) Set the lower limit to the coefficient of variation. A higher value here reduces the stringency with which a pair of genes must correspond in their behavior in order to be added to the list of covariant genes. Adjust this value to produce a manageable network size for the number of samples N in the dataset; as N increases, it will often need to be increased.
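To make the effect of this setting concrete, here is a minimal Python sketch of one way a pair test of this kind could work. It is not KNet's actual code: the function names are hypothetical, and it assumes that the similarity measure is the coefficient of variation of the per-sample expression ratios and that a pair is kept when that value does not exceed the setting (consistent with higher values being less stringent). The signal values are invented for illustration.

```python
import statistics

def ratio_cv(gene_a, gene_b):
    """Coefficient of variation of the per-sample ratios gene_a / gene_b.

    Assumption: similarity is judged on these ratios; KNet's exact
    measure is not documented here.
    """
    ratios = [a / b for a, b in zip(gene_a, gene_b)]
    return statistics.pstdev(ratios) / statistics.mean(ratios)

def is_pair(gene_a, gene_b, cv_limit=0.09):
    """Keep the pair when the ratio CV does not exceed the setting (assumed rule)."""
    return ratio_cv(gene_a, gene_b) <= cv_limit

# Invented signals for three genes across four samples.
gene_x = [400.0, 800.0, 1200.0, 600.0]
gene_y = [205.0, 390.0, 610.0, 300.0]    # tracks gene_x at about half the signal
gene_z = [900.0, 350.0, 500.0, 1100.0]   # unrelated pattern

pair_xy = is_pair(gene_x, gene_y)
pair_xz = is_pair(gene_x, gene_z)
print(pair_xy, pair_xz)
```

Under these assumptions, raising `cv_limit` admits pairs whose ratios wander more from sample to sample, which is why the network grows as the value is increased.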

(B) Set the minimum signal for a gene's expression to be considered. This value will change with array platform or sequencing technology and should be considerably above the background. The aim here is to prevent genes that are simply sitting at background from being included in the network.
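The minimum-signal filter can be pictured with the short Python sketch below. It is an illustration only: `passes_min_signal` is a hypothetical helper, the second gene ID and all signal values are invented, and the rule that every sample must reach the threshold is an assumption (the utility may instead test a mean or a subset of samples).

```python
def passes_min_signal(signals, min_signal=300.0):
    """Return True when the gene is above background.

    Assumption: every sample must be at or above the threshold.
    """
    return all(value >= min_signal for value in signals)

# Invented example rows: gene ID -> signal per sample.
rows = {
    "PRM1_934673_s_at": [850.0, 920.0, 780.0],
    "GENE2_hypothetical_id": [40.0, 55.0, 610.0],  # mostly at background
}

kept = [gene_id for gene_id, signals in rows.items() if passes_min_signal(signals)]
print(kept)
```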

(C) Decide whether or not to filter outliers. In any larger dataset (12 samples or more) there may be spurious signal readings in one or more sample columns. These can arise from many sources, and a single spurious reading will often stop a gene from pairing with other similarly variable genes. To mitigate the effect of such noise, checking this box removes the highest and lowest expression ratios from the dataset before the similarity measure is calculated. Leave this box unchecked for small datasets (fewer than around six samples), where it is likely to increase false positives; it is useful for larger datasets.
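The effect of the outlier filter can be seen in a small Python sketch. This is an assumption-laden illustration rather than the utility's implementation: `trim_extremes` and `cv` are hypothetical helpers, the sketch assumes exactly one highest and one lowest ratio are dropped per gene pair, and the ratios are invented.

```python
import statistics

def trim_extremes(ratios):
    """Drop the single highest and single lowest ratio (assumed behavior)."""
    if len(ratios) < 3:
        return list(ratios)  # nothing sensible to trim
    return sorted(ratios)[1:-1]

def cv(values):
    """Coefficient of variation: population standard deviation over mean."""
    return statistics.pstdev(values) / statistics.mean(values)

# Invented ratios for one gene pair across 12 samples; the last is spurious.
ratios = [2.0, 2.1, 1.9, 2.05, 1.95, 2.0, 2.1, 1.9, 2.0, 2.05, 1.95, 9.5]
trimmed = trim_extremes(ratios)

print(round(cv(ratios), 3), round(cv(trimmed), 3))
```

With twelve samples the single spurious reading dominates the untrimmed variation, while the trimmed ratios are tight. With only a handful of samples, removing two of them leaves too few points to tell real agreement from chance, which matches the advice to leave the box unchecked for small datasets.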

(D) Select this to estimate the size of the stable background network. This network consists of the genes that are simply stably expressed, and it is estimated by randomly re-assigning signal between samples. The pairs returned are therefore primarily genes that are merely co-stable at the selected threshold limits. This can be useful in partitioning the network into co-stable genes and co-variant genes.
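The idea of re-assigning signal between samples can be sketched in Python as follows. How KNet performs the randomization is not documented here, so this sketch assumes each gene's values are permuted independently across the sample columns; `shuffle_samples` is a hypothetical name and the data are invented.

```python
import random

def shuffle_samples(matrix, seed=0):
    """Permute each gene's signals across samples, independently per gene.

    Assumption: this is one plausible reading of "randomly re-assigning
    signal between samples"; the utility may randomize differently.
    """
    rng = random.Random(seed)
    shuffled = {}
    for gene_id, signals in matrix.items():
        permuted = list(signals)
        rng.shuffle(permuted)
        shuffled[gene_id] = permuted
    return shuffled

# Invented matrix: gene ID -> signal per sample.
matrix = {
    "geneA": [400.0, 800.0, 1200.0, 600.0],
    "geneB": [510.0, 500.0, 495.0, 505.0],   # stably expressed
}
background = shuffle_samples(matrix)
print(background)
```

Shuffling breaks any sample-specific co-variation, so pairs that still pass the thresholds afterwards are mostly genes whose signal barely changes at all.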

(E) Click the begin button to start the analysis. The input is expected to be a folder of one or more expression files (see file format below). The output should be an empty folder into which one result file per input file will be placed. Each result file lists the pairs of genes that could be networked together, along with additional data on their similarity. The output folder should always be different from the input folder.



File Formats



(A) Input Format. The input is a tab-delimited expression table (matrix) of genes (rows) and samples (columns). The first column should contain a gene identifier, which should ideally be unique (consider concatenating an array probe ID with a gene name to create a human-readable unique ID, e.g. PRM1_934673_s_at). No header row is expected. See an example input file here. There is currently a limit of 50 sample columns and just over 80,000 gene rows in the input. Please use the contact information below if you find this overly restrictive.
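Before running the utility it can help to check a file against the format described above. The Python sketch below is a hypothetical helper, not part of KNet: it checks for tab-delimited numeric rows, a consistent column count, repeated IDs, and the 50-column limit. The exact row limit is only given as "just over 80,000", so the sketch treats 80,000 as an approximate warning level.

```python
import csv
import os
import tempfile

MAX_SAMPLES = 50           # documented limit on sample columns
APPROX_MAX_GENES = 80_000  # documented only as "just over 80,000"

def check_expression_table(path):
    """Return a list of human-readable problems found in an input file."""
    problems = []
    seen_ids = set()
    n_columns = None
    n_rows = 0
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for line_no, row in enumerate(reader, start=1):
            if not row:
                continue  # ignore blank lines
            n_rows += 1
            gene_id, values = row[0], row[1:]
            if gene_id in seen_ids:
                problems.append(f"line {line_no}: duplicate ID {gene_id}")
            seen_ids.add(gene_id)
            if n_columns is None:
                n_columns = len(values)
                if n_columns > MAX_SAMPLES:
                    problems.append(f"{n_columns} sample columns exceeds {MAX_SAMPLES}")
            elif len(values) != n_columns:
                problems.append(f"line {line_no}: expected {n_columns} values, found {len(values)}")
            try:
                [float(value) for value in values]
            except ValueError:
                # A header row would also be reported here, since none is expected.
                problems.append(f"line {line_no}: non-numeric signal value")
    if n_rows > APPROX_MAX_GENES:
        problems.append(f"{n_rows} gene rows may exceed the row limit")
    return problems

def _write(lines):
    """Write invented example lines to a temporary file and return its path."""
    handle = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
    with handle:
        handle.write("\n".join(lines) + "\n")
    return handle.name

good = _write(["PRM1_934673_s_at\t850\t920\t780", "GENE2_id\t410\t395\t430"])
bad = _write(["GENE2_id\t410\t395", "GENE2_id\t410\tNA\t12"])
good_problems = check_expression_table(good)
bad_problems = check_expression_table(bad)
os.remove(good)
os.remove(bad)
print(good_problems, bad_problems)
```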

(B) Output Format. For each input file a corresponding tab-delimited output file will be generated, named with the input file name prefixed by "fold_analysis". The first two columns in the output file contain the IDs of the genes that show similar behavior. Column three gives their average ratio and column four a measure of variance. The remaining columns give the expression ratios for each sample in the dataset. For the input example above the output folder here should be generated with settings (A) 0.09 (B) 300 (C) checked (D) unchecked. Analysis is not multi-threaded, so it is fairly slow and should take around 30 minutes on a regular 2GHz desktop. To parallelize, simply split the input folder into N sub-folders, where N is your number of real/virtual cores, and run N instances of the application together.
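The folder split suggested for parallel runs can be scripted. The Python sketch below is one possible approach under stated assumptions: `split_input_folder` and the `part_1`, `part_2`, ... folder names are hypothetical, files are copied rather than moved so the original folder stays intact, and the demonstration uses invented file names in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path

def split_input_folder(input_dir, dest_dir, n_parts):
    """Copy the files in input_dir round-robin into n_parts sub-folders of dest_dir."""
    files = sorted(p for p in Path(input_dir).iterdir() if p.is_file())
    parts = []
    for index in range(n_parts):
        part = Path(dest_dir) / f"part_{index + 1}"
        part.mkdir(parents=True, exist_ok=True)
        parts.append(part)
    for index, path in enumerate(files):
        shutil.copy2(path, parts[index % n_parts] / path.name)
    return parts

# Demonstration with five invented, empty expression files and N = 2.
with tempfile.TemporaryDirectory() as workspace:
    source = Path(workspace) / "input"
    source.mkdir()
    for number in range(5):
        (source / f"dataset_{number}.txt").touch()
    parts = split_input_folder(source, Path(workspace) / "split", n_parts=2)
    counts = [len(list(part.iterdir())) for part in parts]

print(counts)
```

In real use, `n_parts` could be taken from `os.cpu_count()`. Since the output is expected to be an empty folder separate from the input, giving each instance its own output folder seems the safest arrangement.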