The DNA Sequencing Core Data Systems
The DNA Sequencing Core uses a mixture
of Unix workstations and Windows systems to automate its data handling. Analysis machines
for ABI's automated DNA Sequencers are on Windows 2000 or XP. The Unix workstations (running
Red Hat Linux) provide efficient database service and strong networking capabilities.
Last Modified 23-Jul-07, RHL
- Client Sample Submission
Clients enter their samples into the waiting queue by filling out forms on our web site.
Our database contains complete information about the samples, as well as
the client, their email address, billing information, etc. Samples are assigned sample
numbers for tracking purposes, and the client writes those on the tubes before delivering
them to the Sample Clerk. The web pages accommodate the submission of a wide variety
of sample types and sample set sizes, from individual microfuge tubes to hundreds of plates.
- Core Technicians
Core technicians can view the sample queue and can choose from it the samples to be
included on their next gel. In preparation for a run, information about the selected
samples is downloaded from the database computer directly into the data collection
computers attached to each sequencer; no manual typing of "sample sheet" information is required.
Proprietary ABI software reads this run information, then collects data from the gels and
analyzes the samples. At the end of analysis, the ABI programs generate two data files
for each sample analyzed: an ASCII file containing the sequence, and a proprietary ABI
file with the chromatographic trace data. These files are manually examined by the Core's
sequencing technicians to verify that the run proceeded normally, and to write helpful
comments to our clients. The files are then moved back into the database computer for
- Data Distribution
The Unix database computers automatically handle data distribution. The flat ASCII files
(".seq" files) are automatically emailed back to the client. Chromatogram files are
automatically transferred into designated FTP directories, where they are accessible to
the client or anyone who has the client's password. Note that this FTP server has been
customized so that the passwords it uses are not Unix "shell" passwords, which would be
a dangerous security risk.
The same data distribution scheme can be used to maintain separate data areas
where further processing can occur. Possible processing might include assembly, BLAST,
multiple alignment and phylogenetic analysis.
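The routing rule described above (sequence text by email, chromatograms to a per-lab FTP directory) could be sketched roughly as follows. This is an illustrative sketch only, not the Core's actual code: the function name plan_distribution, the ".ab1"/".abi" extensions for chromatogram files, and the directory layout are all assumptions; only the ".seq" extension comes from the text.

```python
from pathlib import PurePosixPath

# ".seq" is named in the text; the chromatogram extensions are assumed.
SEQUENCE_EXT = ".seq"
CHROMATOGRAM_EXTS = {".ab1", ".abi"}


def plan_distribution(filename, client_email, ftp_root, lab_id):
    """Return a (method, destination) pair for one result file.

    Hypothetical helper: it only decides where a file should go;
    it does not send mail or copy anything.
    """
    path = PurePosixPath(filename)
    suffix = path.suffix.lower()
    if suffix == SEQUENCE_EXT:
        # Flat ASCII sequence files are emailed to the client.
        return ("email", client_email)
    if suffix in CHROMATOGRAM_EXTS:
        # Chromatograms go to the lab's password-protected FTP directory.
        return ("ftp", str(PurePosixPath(ftp_root) / lab_id / path.name))
    # Anything unrecognized is held for a technician to look at.
    return ("hold", filename)
```

A separate data area for further processing (assembly, BLAST and so on) would fit the same pattern by returning an additional destination for the same file.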
- Quality Assessment
The Unix system is also used to assess the quality of the data produced. The program 'phred'
is run on all lanes coming out of the DNA Sequencing Core, and locally-written software
extracts statistics on each run, such as the signal strength, phred-20 read length and the average phred
Q value for a run. The Core staff can view summaries of the quality of data as a
function of date, instrument, technician, client, DNA type etc. At the same time,
automated checks are performed to identify instances of capillary cross-talk, errors
of plate rotation/swapping, and to identify any standards that failed or that generated
- Automated Billing
Billing records are generated for each sample upon completion, and at the end of each
month, the system collates those records and generates billing invoices to be sent to
the University's accounting offices. Billing is extremely flexible; the system can
grant discounts (or surcharges) in any amount, can track how much discount is accorded
a person or a group of people, and can cut off the discounts when a cap is reached (caps
can be on individual discounts or on an entire group's discount).
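The interaction between discounts and caps can be illustrated with a small Python sketch. It assumes, hypothetically, that a discount is a fractional rate applied to each charge and that each cap is a running dollar total; the function name and parameters are invented for illustration and are not taken from the Core's billing code.

```python
def apply_discount(charge, rate, used_individual, individual_cap,
                   used_group, group_cap):
    """Return (discount, amount_due) for a single charge.

    rate is the fractional discount; a negative rate acts as a surcharge.
    used_* is the discount already consumed; a cap of None means "no cap".
    All names here are hypothetical.
    """
    discount = charge * rate
    if discount > 0:
        # A discount may not exceed what remains under either cap.
        for used, cap in ((used_individual, individual_cap),
                          (used_group, group_cap)):
            if cap is not None:
                discount = min(discount, max(cap - used, 0.0))
    return discount, charge - discount
```

For example, a 50% discount on a 100.00 charge with only 30.00 left under the individual cap yields a 30.00 discount; once a cap is exhausted the discount drops to zero and the full charge is billed.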
- Administrative Support
Core administrators have numerous support tools that allow fast, efficient monitoring
of the facility's productivity and data quality. Quality Control reports can be generated
for specific samples, specific clients, specific technicians, specific dates or specific
sequencers. Reports can also be generated on billing records, discounts awarded, departments,
etc. Customized data queries of considerable complexity and power further extend the
capabilities of the system.
- Client Support
Principal Investigators have control over who submits samples (i.e. who spends their money),
and can obtain informational reports on who in their lab is doing the sequencing and how much
they are spending. PIs can enter or delete account numbers, and can grant narrow spending
permission on specific accounts to specific lab members. That way, different grants are
accessible only to the lab members who legitimately work on that grant, and the funds are
used only for work appropriate to that grant.
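The per-account permissions described above can be pictured as a mapping from account numbers to the lab members authorized to charge them. The following is a minimal sketch under that assumption; the class and method names are hypothetical, and it keeps everything in memory rather than in the Core's database.

```python
class LabAccounts:
    """Which lab members a PI has authorized on which accounts (hypothetical)."""

    def __init__(self):
        self._grants = {}  # account number -> set of authorized member names

    def add_account(self, account):
        self._grants.setdefault(account, set())

    def delete_account(self, account):
        # Deleting an account also removes every permission granted on it.
        self._grants.pop(account, None)

    def grant(self, account, member):
        if account not in self._grants:
            raise KeyError(f"unknown account: {account}")
        self._grants[account].add(member)

    def can_charge(self, account, member):
        # A member may spend only from accounts explicitly granted to them.
        return member in self._grants.get(account, set())
```

Under this model, a sample submission would be accepted only when can_charge returns True for the submitting lab member and the chosen account.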
- Departmental Admin Support
Departmental administrators can get access to their PIs' sequencing records as well,
in case they need to audit or block expenditures. Each PI 'belongs' to a department, and
each department has one password that grants access to billing records. Departmental Admins
with the password can even disable account numbers that are being misused by the PI.
Description of components:
Dell Compute Cluster running Red Hat Linux
The primary data system in the DNA Sequencing Core is a 10-processor system (each
a dual-core 3 GHz Xeon) running as a high-performance computer cluster
under the Red Hat Enterprise ES V3 operating system. This machine runs a Sybase
database, web server, FTP server, cross-platform file server, phred/phrap/consed
and the GelEdit/GelDone suite of programs.
Sybase database system
The core of the sequencing lab operation is a Sybase relational database server.
This is a commercial database package used to store information on samples, gels, clients,
departments, and Core services. Mechanisms in the database engine help preserve data integrity
and provide ways to recover from errors with a minimum of lost data. Data are entered into the
database primarily via web CGI scripts written in Perl, Tcl/Tk, Unix shell script and C.
Web Server components
The Core's World Wide Web server is an Apache server.
This allows Core technicians or clients to access our facility from virtually any desktop
computer, regardless of make or model.
Principal Investigators are responsible for managing who can spend
their money, and what accounts may be used for sequencing. For each PI, the data
system keeps records on the PIs themselves (name, address and phone number,
departmental affiliation), accounts to which sequencing can be charged, laboratory
members who can request sequencing, and any Center affiliations that enable them to
get discounts. Permission to spend money from different accounts can be individually
granted to different lab members, so that different projects (accounts, personnel)
are easily kept separate. Summaries of billing records are available to the PI as well,
including itemized descriptions of all samples sequenced by that lab since the
inception of the data system.
The PI's tasks (adding/deleting lab members or accounts) require a password. This
is stored in the database along with the PI's other personal information. The PI
usually chooses to give the password to members of the lab who then can manage
the lab's Seq Core records, or access the Core's FTP server to obtain chromatogram files.
Lab Members submit samples via the web site, at which time each
sample is assigned a tracking number that is written directly on the sample tube.
Also using the web site, clients can check on the status of samples previously
submitted, can edit or delete samples they submitted (as long as processing is not
yet underway) and can access data from analyzed samples (see FTP server, below).
Core Technicians also use web-based interfaces for some chores.
The relevant pages are password-protected to prevent unauthorized access. A large
variety of tasks are supported, including database queries for samples, PIs, lab users,
gels, lanes and billing records. Specific query types for these data run the gamut
of possibilities, and are constantly being updated to support the tasks most
frequently needed for daily operations. Note, however, that the main gel editing
application is separate (see 'GelEdit', below).
The Core Director has access to all the functions used by the technicians,
of course, but also has some global control tools to regulate the operation of the Core.
Master switches are available on the web to enable or disable sample submission, data
downloads, external (non-UM) sample submission, all discounts, 'RUSH' requests etc.
The Director can quickly provide clients with status information for the Core, such
as turnaround time and information notices. Finally, various settings are available
to 'tune' such parameters as the rate at which external (non-UM) samples are processed.
Departmental Administrators need access to the Sequencing Core database
as well. They are responsible for auditing the Statements Of Account (SOAs) for each
account, and must be able to access the billing records in order to confirm their
validity. Consequently, Admins also have passwords into the database system, and with
these are able to get billing histories on accounts under their control. They are not
allowed to see billing information for accounts that have not been assigned to them.
Core technicians create gel runs through the program GelEdit. This is an
X-windows-based program written in Tcl/Tk and WiSH that allows the technicians to view
the samples in the queue and to select the ones to be included on their next gel run.
Each gel is given a unique Gel number, and samples selected to be run on that gel are
assigned a specific lane number. After selecting the sequencer onto which the samples
will be loaded, GelEdit can automatically generate a data file with
information for samples to be included on the next gel, in a format appropriate for the
target sequencer. A cross-platform file server is used to transfer this file from the
Unix computer where it is generated, to the dedicated data collection computer where
the file is to be used (Macintosh or Windows NT, depending on the sequencer).
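The lane-assignment step can be sketched in a few lines of Python. GelEdit itself is written in Tcl/Tk, and the real sample sheet formats are specific to each sequencer, so the tab-delimited layout, column names and function name below are assumptions made purely for illustration.

```python
import csv
import io


def build_sample_sheet(gel_number, sample_ids, lanes_per_gel):
    """Assign lanes 1..N to the selected samples and return the sheet text.

    Hypothetical format: one header row, then one tab-delimited row per lane.
    """
    if len(sample_ids) > lanes_per_gel:
        raise ValueError("more samples selected than lanes on this sequencer")
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["gel", "lane", "sample"])
    # Lanes are numbered in the order the technician selected the samples.
    for lane, sample_id in enumerate(sample_ids, start=1):
        writer.writerow([gel_number, lane, sample_id])
    return out.getvalue()
```

The returned text would then be written to a file on the cross-platform file server, where the data collection computer can read it.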
GelDone is the program that returns the results to
investigators after analysis is complete. This is a web-based system that provides for
streamlined tagging of results as good/bad, addition of optional comments
for any individual lanes, and management of repeat requests. Upon completion, the
program determines who owns each file and distributes the files as appropriate:
the flat ASCII files are emailed to the owner, along with any comments specific to
that lane. Chromatogram files are moved onto the Core's custom FTP server, into
password-protected directories for each PI.
Cross-platform File Server
Data are transferred between Unix, Windows and Mac computers by using an appliance
server system, currently a SNAPserver (http://www.snapserver.com). This allows the
technicians to transfer Sample Sheet files generated on the Unix system into the
dedicated data collection computers on the sequencers (Windows), and conversely allows
the output data files from the data collection systems to be transferred back to the
server for subsequent distribution via GelDone (described above).
An FTP server allows users to download the chromatogram files that are generated by the
sequencers and placed on the Core's FTP server during the 'GelDone' phase.
Access to the FTP server requires the lab password. The FTP daemon allows only
read access, and does not recognize shell passwords, for enhanced system security.
Data are kept on the FTP server for two weeks, after which a crontab process removes them.
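The two-week retention policy on the FTP server could be implemented by a small script run from cron. The sketch below is one possible approach, not the Core's actual cleanup job: the function names are hypothetical, and file age is judged by modification time, which is an assumption.

```python
import os
import time

TWO_WEEKS = 14 * 24 * 60 * 60  # retention period in seconds


def expired_files(root, now=None, max_age=TWO_WEEKS):
    """Return paths under root whose modification time is older than max_age."""
    now = time.time() if now is None else now
    expired = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if now - os.path.getmtime(path) > max_age:
                expired.append(path)
    return expired


def purge(root, now=None):
    """Delete expired files under root and return the list of removed paths."""
    removed = expired_files(root, now)
    for path in removed:
        os.remove(path)
    return removed
```

Collecting the list before deleting keeps the directory walk and the removals separate, which makes it easy to log or dry-run the cleanup first.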
A comprehensive suite of administrative tasks is handled via the Administrative
web pages. The most important of these perform the following functions:
Core employees can view and/or update records on clients, samples, departments, discounts,
'gels' (sample sets), sequencers, and outcomes. These views are often available as a
point-and-click link in other reports described below.
Discounts can be granted to PIs according
to their affiliation with various Research Centers (e.g. P30, P60 grants). Such discounts
are often "capped", in total amount offered to all center members and/or in the amount
each center member may use. Web-based routines can confer center discounts on various
PIs or accounts. Additional scripts written in Tcl are used to track the use of these
discounts, to enable or disable them, and to generate reports on their usage.
Itemized sample listings:
When vouchers are generated each month, a script is run that prints out itemized listings
for each investigator, summarizing the samples their lab submitted for sequencing.
Phil Green's program 'phred' (www.phrap.com) is run on any newly-generated data.
Subsequently, the programs 'pscan' and 'pstat' (R. Lyons) summarize the phred-20 read length
and average phred Q score for each run, and save these statistics in our database. Web-based
and script-based routines then slice through the results, giving summary quality information
suitable for assessing performance of instruments, people and reagents.
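To make the two statistics concrete, here is a small Python sketch. It is not the code of 'pscan' or 'pstat'; it assumes that the phred-20 read length is the number of bases in a read with quality of at least 20, and that the per-base quality values have already been parsed into a list of integers. The function names are hypothetical.

```python
def phred20_read_length(qualities, threshold=20):
    """Count the bases whose phred quality meets the threshold (assumed definition)."""
    return sum(1 for q in qualities if q >= threshold)


def average_q(qualities):
    """Mean phred quality over all bases in the read; 0.0 for an empty read."""
    return sum(qualities) / len(qualities) if qualities else 0.0


def error_probability(q):
    """Convert a phred score to an error probability: Q = -10 * log10(P)."""
    return 10 ** (-q / 10)
```

A phred score of 20 corresponds to an estimated error probability of 1 in 100, which is why it is a common threshold for counting usable bases. Per-run figures would be obtained by aggregating these per-read values over all lanes in the run.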