The DNA Sequencing Core Data Systems

The DNA Sequencing Core uses a mixture of Unix workstations and Windows systems to automate its data handling. Analysis machines for ABI's automated DNA Sequencers are on Windows 2000 or XP. The Unix workstations (running Red Hat Linux) provide efficient database service and strong networking capabilities.

Last Modified 23-Jul-07, RHL

Overview:

Client Sample Submission

Clients enter their samples into the waiting queue by filling out forms on our web site. Our database contains complete information about the samples, as well as the client, their email address, billing information, etc. Samples are assigned sample numbers for tracking purposes, and the client writes those on the tubes before delivering them to the Sample Clerk. The web pages accommodate the submission of a wide variety of sample types and sample set sizes, from individual microfuge tubes to hundreds of plates.

Core Technicians

Core technicians can view the sample queue and can choose from it the samples to be included on their next gel. In preparation for a run, information about the selected samples is downloaded from the database computer directly into the data collection computers attached to each sequencer; no manual typing of "sample sheet" information is necessary.

Proprietary ABI software reads this run information, then collects data from the gels and analyzes the samples. At the end of analysis, the ABI programs generate two data files for each sample analyzed: an ASCII file containing the sequence, and a proprietary ABI file with the chromatographic trace data. These files are manually examined by the Core's sequencing technicians to verify that the run proceeded normally, and to write helpful comments to our clients. The files are then moved back into the database computer for subsequent distribution.

Data Distribution

The Unix database computers automatically handle data distribution. The flat ASCII files (".seq" files) are automatically emailed back to the client. Chromatogram files are automatically transferred into designated FTP directories, where they are accessible to the client or anyone who has the client's password. Note that this FTP server has been customized so that the passwords it uses are not Unix "shell" passwords; using shell passwords would be a dangerous security risk.

The same data distribution scheme can be used to maintain separate data areas where further processing can occur. Possible processing might include assembly, BLAST, multiple alignment and phylogenetic analysis.

Quality Assessment

The Unix system is also used to assess the quality of the data produced. The program 'phred' is run on all lanes coming out of the DNA Sequencing Core, and locally-written software extracts statistics on each run, such as the signal strength, phred-20 read length and the average phred Q value for a run. The Core staff can view summaries of the quality of data as a function of date, instrument, technician, client, DNA type etc. At the same time, automated checks are performed to identify instances of capillary cross-talk, errors of plate rotation/swapping, and to identify any standards that failed or that generated anomalous sequences.
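As a rough illustration of the per-read statistics mentioned above, the following Python sketch computes a phred-20 read length and an average Q value from a list of per-base quality scores. It assumes that "phred-20 read length" means the number of bases with Q of 20 or more; the Core's own software may define it differently. The function names and the example read are hypothetical.

```python
from statistics import mean


def phred20_length(quals, threshold=20):
    """Count bases whose phred quality is at or above the threshold.

    Assumption: this is what "phred-20 read length" means here.
    """
    return sum(1 for q in quals if q >= threshold)


def average_q(quals):
    """Mean phred quality over the read; 0.0 for an empty read."""
    return mean(quals) if quals else 0.0


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


# Made-up quality values for a very short read, for illustration only.
read = [8, 15, 22, 35, 40, 41, 38, 19, 25, 10]
q20 = phred20_length(read)
avg = average_q(read)
```

A phred value of 20 corresponds to an estimated error probability of 1 in 100 for that base call, which is why it is a common cutoff for usable sequence.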

Automated Billing

Billing records are generated for each sample upon completion, and at the end of each month, the system collates those records and generates billing invoices to be sent to the University's accounting offices. Billing is extremely flexible; the system can grant discounts (or surcharges) in any amount, can track how much discount is accorded a person or a group of people, and can cut off the discounts when a cap is reached (caps can be on individual discounts or on an entire group's discount).
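The capped-discount logic described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Core's billing code: the class name `DiscountPool`, the function `apply_discount`, and the dollar amounts are all hypothetical, and the sketch covers discounts only (a rate between 0 and 1), not surcharges.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscountPool:
    """Tracks discount dollars already granted against a cap (None = uncapped)."""
    cap: Optional[float]
    used: float = 0.0

    def remaining(self) -> float:
        if self.cap is None:
            return float("inf")
        return max(self.cap - self.used, 0.0)


def apply_discount(price, rate, person, group):
    """Return (amount_charged, discount_granted) for one sample.

    The discount is limited by whichever cap, individual or group,
    has less room left; both running totals are then updated.
    """
    wanted = price * rate
    granted = min(wanted, person.remaining(), group.remaining())
    person.used += granted
    group.used += granted
    return price - granted, granted


# Hypothetical caps: 10 per person, 15 for the whole group.
group = DiscountPool(cap=15.0)
first = apply_discount(20.0, 0.5, DiscountPool(cap=10.0), group)
second = apply_discount(20.0, 0.5, DiscountPool(cap=10.0), group)
third = apply_discount(20.0, 0.5, DiscountPool(cap=10.0), group)
```

In this example the second person receives only part of the discount and the third receives none, because the group cap is exhausted before the individual caps are.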

Administrative Support

Core administrators have numerous support tools that allow fast, efficient monitoring of the facility's productivity and data quality. Quality Control reports can be generated for specific samples, specific clients, specific technicians, specific dates or specific sequencers. Reports can also be generated on billing records, discounts awarded, departments, etc. Customized data queries of considerable complexity and power further extend the capabilities of the system.

Client Support

Principal Investigators have control over who submits samples (i.e. who spends their money), and can obtain informational reports on who in their lab is doing the sequencing and how much they are spending. PIs can enter or delete account numbers, and can grant narrow spending permission on specific accounts to specific lab members. That way, different grants are accessible only to the lab members who legitimately work on that grant, and the funds are used only for work appropriate to that grant.
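One way to picture the per-account spending permissions described above is as a mapping from lab members to the accounts they may charge. The Python sketch below is only an in-memory illustration; the real system keeps these records in the database, and the class name `LabAccounts`, its methods, and the account labels are hypothetical.

```python
class LabAccounts:
    """Hypothetical in-memory model of one PI's accounts and spending permissions."""

    def __init__(self, pi_name):
        self.pi_name = pi_name
        self.accounts = set()      # account numbers the PI has entered
        self.permissions = {}      # lab member -> set of permitted accounts

    def add_account(self, account):
        self.accounts.add(account)

    def delete_account(self, account):
        # Deleting an account also revokes every permission that referred to it.
        self.accounts.discard(account)
        for allowed in self.permissions.values():
            allowed.discard(account)

    def grant(self, member, account):
        if account not in self.accounts:
            raise KeyError(f"unknown account: {account}")
        self.permissions.setdefault(member, set()).add(account)

    def may_charge(self, member, account):
        return account in self.permissions.get(member, set())


lab = LabAccounts("Dr. Example")
lab.add_account("GRANT-A")
lab.add_account("GRANT-B")
lab.grant("alice", "GRANT-A")   # alice works only on grant A
lab.grant("bob", "GRANT-B")     # bob works only on grant B
```

With this structure, a submission by "alice" against "GRANT-B" would be refused, which matches the goal of keeping each grant's funds restricted to the people who work on it.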

Departmental Admin Support

Departmental administrators can get access to their PIs' sequencing records as well, in case they need to audit or block expenditures. Each PI 'belongs' to a department, and each department has one password that grants access to billing records. Departmental Admins with the password can even disable account numbers that are being misused by the PI.

Description of components:

Dell Compute Cluster running Red Hat Linux

The primary data system in the DNA Sequencing Core is a 10-processor system (each processor a dual-core 3 GHz Xeon) running as a high-performance computer cluster under the Red Hat Enterprise Linux ES v3 operating system. This machine runs a Sybase database, web server, FTP server, cross-platform file server, phred/phrap/consed and the GelEdit/GelDone suite of programs.

Sybase database system

The core of the sequencing lab operation is a Sybase relational database server. This is a commercial database package used to store information on samples, gels, clients, departments, and Core services. Mechanisms in the database engine help preserve data integrity and provide ways to recover from errors with a minimum of lost data. Data are entered into the database primarily via web CGI scripts written in Perl, Tcl/Tk, Unix shell scripts and C.

Web Server components

The Core's World Wide Web server is an Apache server. This allows Core technicians or clients to access our facility from virtually any desktop computer, regardless of make or model.

Principal Investigators are responsible for managing who can spend their money, and what accounts may be used for sequencing. For each PI, the data system keeps records on the PIs themselves (name, address and phone number, departmental affiliation), accounts to which sequencing can be charged, laboratory members who can request sequencing, and any Center affiliations that enable them to get discounts. Permission to spend money from different accounts can be individually granted to different lab members, so that different projects (accounts, personnel) are easily kept separate. Summaries of billing records are available to the PI as well, including itemized descriptions of all samples sequenced by that lab since the inception of the data system.

The PI's tasks (adding/deleting lab members or accounts) require a password. This is stored in the database along with the PI's other personal information. The PI usually chooses to give the password to members of the lab who then can manage the lab's Seq Core records, or access the Core's FTP server to obtain chromatogram files.

Lab Members submit samples via the web site, at which time each sample is assigned a tracking number that is written directly on the sample tube. Also using the web site, clients can check on the status of samples previously submitted, can edit or delete samples they submitted (as long as processing is not yet underway) and can access data from analyzed samples (see FTP server, below).

Core Technicians also use web-based interfaces for some chores. The relevant pages are password-protected to prevent unauthorized access. A large variety of tasks are supported, including database queries for samples, PIs, lab users, gels, lanes and billing records. Specific query types for these data run the gamut of possibilities, and are constantly being updated to support the tasks most frequently needed for daily operations. Note, however, that the main gel editing application is separate (see 'GelEdit', below).

The Core Director has access to all the functions used by the technicians, and also has some global control tools to regulate the operation of the Core. Master switches are available on the web to enable or disable sample submission, data downloads, external (non-UM) sample submission, all discounts, 'RUSH' requests, etc. The Director can quickly provide clients with status information for the Core, such as turnaround time and information notices. Finally, various settings can be adjusted to 'tune' aspects of the operation, such as the rate at which external (non-UM) samples are processed.

Departmental Administrators need access to the Sequencing Core database as well. They are responsible for auditing the Statements Of Account (SOAs) for each account, and must be able to access the billing records in order to confirm their validity. Consequently, Admins also have passwords into the database system, and with these are able to get billing histories on accounts under their control. They are not allowed to see billing information for accounts that have not been assigned to them.

GelEdit

Core technicians create gel runs through the program GelEdit. This is an X-windows-based program written in Tcl/Tk and WiSH, and allows the technicians to view the samples in the queue and to select the ones to be included on their next gel run. Each gel is given a unique Gel number, and samples selected to be run on that gel are assigned a specific lane number. After selecting the sequencer onto which the samples will be loaded, GelEdit can automatically generate a data file with information for samples to be included on the next gel, in a format appropriate for the target sequencer. A cross-platform file server is used to transfer this file from the Unix computer where it is generated, to the dedicated data collection computer where the file is to be used (Macintosh or Windows NT, depending on the sequencer).
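The lane assignment and sample-sheet generation described above might look roughly like the Python sketch below. GelEdit itself is written in Tcl/Tk, and the real file formats are specific to each sequencer; the tab-delimited layout, the column names, the 96-lane default, and the gel and sample numbers here are assumptions made purely for illustration.

```python
import csv
import io


def build_gel(gel_number, sample_numbers, lanes_available=96):
    """Assign consecutive lane numbers (starting at 1) to the selected samples."""
    if len(sample_numbers) > lanes_available:
        raise ValueError("more samples selected than lanes on the gel")
    return [
        {"gel": gel_number, "lane": lane, "sample": sample}
        for lane, sample in enumerate(sample_numbers, start=1)
    ]


def sample_sheet(rows):
    """Render the lane assignments as tab-delimited text (hypothetical format)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["gel", "lane", "sample"],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up gel number and sample tracking numbers.
rows = build_gel(1234, [50101, 50102, 50107])
sheet = sample_sheet(rows)
```

The resulting text would then be written to the shared file server so that the data collection computer can read it, replacing manual entry of the sample sheet.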

GelDone

GelDone is the program that returns the results to investigators after analysis is complete. This is a web-based system that provides for streamlined tagging of results as good/bad, addition of optional comments for any individual lanes, and management of repeat requests. Upon completion, the program determines who owns each file and distributes the files as appropriate: the flat ASCII files are emailed to the owner, along with any comments specific to that lane, and the chromatogram files are moved onto the Core's custom FTP server, into password-protected directories for each PI.
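The routing step described above (sequence text to email, chromatogram to the owner's FTP directory) can be sketched in Python as a function that builds a distribution plan without performing any I/O. Everything specific here is an assumption for illustration: the `FTP_ROOT` path, the dictionary keys, the file names, and the lab and email identifiers are hypothetical, and the actual GelDone program may be organized quite differently.

```python
from pathlib import PurePosixPath

FTP_ROOT = PurePosixPath("/ftp/labs")  # hypothetical location of the per-PI directories


def plan_distribution(lanes, owners):
    """Decide where each lane's files should go.

    lanes:  list of dicts with 'sample', 'seq_file', 'trace_file', optional 'comment'
    owners: mapping of sample number -> (pi_directory, email_address)
    Returns (emails, moves): messages to send and (source, destination) file moves.
    """
    emails, moves = [], []
    for lane in lanes:
        pi_directory, email = owners[lane["sample"]]
        # The flat ASCII sequence goes out by email with any lane-specific comment.
        emails.append({
            "to": email,
            "attach": lane["seq_file"],
            "comment": lane.get("comment", ""),
        })
        # The chromatogram goes into the owning PI's FTP directory.
        trace_name = PurePosixPath(lane["trace_file"]).name
        moves.append((lane["trace_file"], str(FTP_ROOT / pi_directory / trace_name)))
    return emails, moves


lanes = [
    {"sample": 50101, "seq_file": "run/50101.seq", "trace_file": "run/50101.ab1",
     "comment": "Signal weak after base 500."},
    {"sample": 50102, "seq_file": "run/50102.seq", "trace_file": "run/50102.ab1"},
]
owners = {
    50101: ("smith_lab", "smith@example.edu"),
    50102: ("jones_lab", "jones@example.edu"),
}
emails, moves = plan_distribution(lanes, owners)
```

Separating the plan from the actions makes it easy to review what will be sent before anything is emailed or moved.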

Cross-platform File Server

Data are transferred between Unix, Windows and Mac computers by using an appliance server system, currently a SNAPserver (http://www.snapserver.com). This allows the technicians to transfer Sample Sheet files generated on the Unix system into the dedicated data collection computers on the sequencers (Windows), and conversely allows the output data files from the data collection systems to be transferred back to the server for subsequent distribution via GelDone (described above).

FTP server

An FTP server allows users to download the chromatogram files that were generated by the sequencers and placed on the server during the 'GelDone' phase. Access to the FTP server requires the lab password. For enhanced system security, the FTP daemon allows only read access and does not recognize shell passwords. Data are kept on the FTP server for two weeks, after which a crontab process removes them.
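The two-week retention rule lends itself to a small scheduled cleanup job. The Python sketch below shows one way such a job could work, assuming that a file's age is judged by its modification time; the function names are hypothetical, and the demonstration runs in a temporary directory rather than a real FTP tree.

```python
import os
import tempfile
import time
from pathlib import Path

TWO_WEEKS = 14 * 24 * 60 * 60  # retention period in seconds


def expired_files(root, now=None, max_age=TWO_WEEKS):
    """List regular files under root whose modification time is older than max_age."""
    now = time.time() if now is None else now
    return [
        path for path in Path(root).rglob("*")
        if path.is_file() and now - path.stat().st_mtime > max_age
    ]


def purge(root, now=None, max_age=TWO_WEEKS):
    """Delete expired files and return the paths that were removed."""
    removed = expired_files(root, now, max_age)
    for path in removed:
        path.unlink()
    return removed


# Demonstration in a throwaway directory, not a real FTP tree.
with tempfile.TemporaryDirectory() as tmp:
    lab_dir = Path(tmp) / "example_lab"
    lab_dir.mkdir()
    old_file = lab_dir / "old_trace.ab1"
    new_file = lab_dir / "new_trace.ab1"
    old_file.write_bytes(b"")
    new_file.write_bytes(b"")
    # Backdate one file by three weeks so that it counts as expired.
    three_weeks_ago = time.time() - 21 * 24 * 60 * 60
    os.utime(old_file, (three_weeks_ago, three_weeks_ago))
    removed_names = [p.name for p in purge(tmp)]
    surviving_names = [p.name for p in lab_dir.iterdir()]
```

A script of this kind would typically be run once a day from cron, so that no file outlives the retention period by more than a day.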

Miscellaneous scripts

A very complete suite of administrative tasks is handled via the Administrative web pages. The most important of these perform the following functions:

  • Database Records: Core employees can view and/or update records on clients, samples, departments, discounts, 'gels' (sample sets), sequencers, and outcomes. These views are often available as a point-and-click link in other reports described below.

  • Center discounts: Discounts can be granted to PIs according to their affiliation with various Research Centers (e.g. P30, P60 grants). Such discounts are often "capped", either in the total amount offered to all center members, in the amount each center member may use, or both. Web-based routines can confer center discounts on various PIs or accounts. Additional scripts written in Tcl are used to track the use of these discounts, to enable or disable them, and to generate reports on their usage.

  • Itemized sample listings: When vouchers are generated each month, a script is run that prints out itemized listings for each investigator, summarizing the samples their lab submitted for sequencing.

  • Quality Assessment: Phil Green's program 'phred' (www.phrap.com) is run on any newly-generated data. Subsequently, the programs 'pscan' and 'pstat' (R. Lyons) summarize the phred-20 read length and average phred Q score for each run, and save these statistics in our database. Web-based and script-based routines then slice through the results, giving summary quality information suitable for assessing performance of instruments, people and reagents.
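To illustrate how stored run statistics can be "sliced" by instrument or by person, as described in the Quality Assessment item above, here is a minimal Python sketch that groups runs by an arbitrary field and averages the quality figures. The field names, the `summarize` function, and the sample records are hypothetical; the real summaries are produced from the database by web-based and script-based routines.

```python
from collections import defaultdict
from statistics import mean


def summarize(runs, key):
    """Group run records by the given field and average their quality statistics."""
    groups = defaultdict(list)
    for run in runs:
        groups[run[key]].append(run)
    return {
        value: {
            "runs": len(members),
            "mean_q20_length": mean(r["q20_length"] for r in members),
            "mean_q": mean(r["avg_q"] for r in members),
        }
        for value, members in groups.items()
    }


# Made-up records standing in for rows retrieved from the database.
runs = [
    {"instrument": "seq-1", "technician": "tech-A", "q20_length": 700, "avg_q": 40.0},
    {"instrument": "seq-1", "technician": "tech-B", "q20_length": 600, "avg_q": 36.0},
    {"instrument": "seq-2", "technician": "tech-A", "q20_length": 500, "avg_q": 30.0},
]
by_instrument = summarize(runs, "instrument")
by_technician = summarize(runs, "technician")
```

The same function serves for any grouping field, so a summary by client, date or DNA type needs only a different `key` argument.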

     

Example Screen Shots:

What we can see about each sample:


A view of a 'gel' (sample set) that's complete, with quality statistics:


A quality control screen showing success/failure for 40 'gels':


The same sample set, viewed as %-success for each plate position (A01, A02 etc):


Part of the Master Control screen:



Questions or comments may be addressed to Dr. Robert Lyons, the Core Director.
