Skip to content

Latest commit

 

History

History
 
 

src

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
This file describes the most commonly used modules,
and the code conventions used throughout the code base in
and under this directory.

SOURCE TREE ORGANIZATION

The major source subdirectories of this source code are:
o - lib - General purpose library routines, some with a biological bent,
    many just generally useful for computing.  
o - inc - Interfaces to the library modules.
o - utils - Command line utility programs. Like the library a mix of
    bioinformatically motivated, and general purpose.  
o - hg - Stuff developed for the Human Genome Project and it's successors.
    Much of the code in this directory requires MySQL.
o - hg/lib - Human Genome Project specific libraries.
o - hg/inc - Interfaces to the same libraries
o - hg/hgTracks - The part of the UCSC Genome Browser that displays 
    annotation tracks graphically.
o - hg/hgc - The part of the Genome Browser that responds to a click
    on an item in a track.
o - hg/hgTrackUi - The part of the Genome Browser that allows users to configure
    a particular track.
o - hg/hgTables - The UCSC Table Browser
o - jkOwnLib - Libraries that support blat, isPcr, gfClient, gfServer. 
In general each program, either command line, or web CGI based, has its source in
a different subdirectory.  For simple programs, like what is in utils, these often
just have a single C module that is linked with the libraries.  For more complex
programs, such as the hgTracks CGI, there may be multiple C source modules in the dir.

COMMONLY USED LIBRARY MODULES

o - common  - String handling, singly-linked list handling. 
    Other basic stuff every other module uses.
o - hash - Simple but effective hash table routines.
o - linefile - Line oriented file input, on some systems
    much faster than fgets().
o - dystring - Dynamically sized strings in C.
o - cheapcgi - Parses out cgi variables for scripts called
    from web pages.
o - htmshell - Helps generate HTML output for scripts that
    are called from web pages or just want to make web
    pages.
o - htmlPage - Read html pages, programatically submit html forms.
o - memgfx - Creates a 256 color image in memory which
    can be drawn on, then saved as a .GIF file which
    can be encorperated into a web page.
o - dnautils and dnaseq - Simple utilities on DNA.
o - fa - Read/write fasta format files.
o - basicBed - Functions for working with BED format files.
o - psl - Functions for working with PSL (blat) format files.
o - twoBit - Functions for working with twoBit DNA files.
o - bPlusTree - Create/user B+ Tree indexes, the backbone of 
    many databases.
o - udc - URL Data Cache - code to locally cache remote files.

CODE CONVENTIONS

INDENTATION AND SPACING:

The code follows an indentation convention that is a bit
unusual for C.  Opening and closing braces are on
a line by themselves and are indented at the same
level as the block they enclose:
    if (someTest)
	{
	doSomething();
	doSomethingElse();
	}
Each block of code is indented by 4 from the previous block.
As per Unix standard practice, tab stops are set to 8, not 4
as is the common practice in Windows, so some care must be
taken when using tabs for indenting.  

Tabs continue to be a problem for the programmer even in 2018.
Currently our makefiles require tabs, while our python code forbids
them. The C code can go either way so long as tabs are treated
as advancing to the next multiple-of-eight column. Please consult local
users of your favorite editor for help configuring it with these
indentation and tab standards.

Lines should be no more than 100 characters wide.  Lines that are 
longer than this are broken and indented at least 8 spaces
more than the original line to indicate the line continuation.
Where possible simplifying techniques should be applied to the code 
in preference to using line continuations, since line continuations
obscure the logic conveyed in the indentation of the program.

Line continuations may be unavoidable when calling functions with long
parameter lists.  In most other situations lines can be shortened 
in better ways than line continuations.  Complex expressions can be 
broken into parts that are assigned to intermediate variables.  Long 
variable names can be revisited and sometimes shortened. Deep indenting 
can be avoided by simplifying logic and by moving blocks into their own 
functions. These are just some ways of avoiding long lines.

NAMES

Symbol names generally begin with a lower-case letter.  The second 
and subsequent words in a name begin with a capital letter 
to help visually separate the words.  Abbreviation of words 
is strongly discouraged.  Words of five letters and less should
generally not be abbreviated. If a word is abbreviated in 
general it is abbreviated to the first three letters:
   tabSeparatedFile -> tabSepFile
In some cases, for local variables abbreviating
to a single letter for each word is ok:
   tabSeparatedFile -> tsf
In complex cases you may treat the abbreviation itself as a word, and 
only the first letter is capitalized.
   genscanTabSeparatedFile -> genscanTsf
Numbers are considered words.  You would
represent "chromosome 22 annotations"
as "chromosome22Annotations" or "chr22Ann."
Note the capitalized 'A" after the 22.  Since both numbers and
single letter words (or abbreviations) disrupt the visual flow
of the word separation by capitalization, it is better to avoid
these except at the end of the name.

These naming rules apply to variables, constants, functions, fields,
and structures.  They generally are used for file names, database tables,
database columns, and C macros as well, though there is a bit less
consistency there in the existing code base.

Variables that are global should begin with the small letter "g."  This
is a relatively recent convention, and is not so widely used, but when
maintaining code or writing new code it would be good to adopt it.
Global variable set from the command line instead should begin with
the letters "cl."  This is actually a somewhat older but still not 
universal convention.

ERROR HANDLING AND MEMORY ALLOCATION

Another convention is that errors are reported
at a fairly low level, and the programs simply
print an error message and abort using errAbort.  If 
you need to catch errors underneath you see the
file errAbort.h and install an "abort handler".

Memory is generally allocated through "needMem"
(which aborts on failure to allocate) and the
macros "AllocVar" and "AllocArray".  This 
memory is initially set to zero, and the programs
very much depend on this fact.

COMMENTING 

Every module should have a comment at the start of
a file that explains concisely what the module
does.  Explanations of algorithms also belong
at the top of the file in most cases. Comments can
be of the /*  */ or the // form.  Structures should be 
commented following the pattern of this example:

struct dyString
/* Dynamically resizable string that you can do formatted
 * output to. */
    {
    struct dyString *next;      /* Next in list. */
    char *string;               /* Current buffer. */
    int bufSize;                /* Size of buffer. */
    int stringSize;             /* Size of string. */
    };

That is, there is a comment describing the overall purpose
of the object between the struct name, and the opening brace,
and there is a short comment by each field.  In many cases
these may not say much more than well-chosen field names,
but that's ok. 

Almost any structure with more than three or four
elements includes a "next" pointer as its first
member, so that it can be part of a singly-linked
list.  There's a whole set of routines (see
common.c and common.h) which work on singly-linked
lists where the next field comes first. Their
names all start with "sl."

Functions which work on a structure by convention begin with
the name of the structure, simulating an object-oriented
coding style.  In general these functions are all grouped
in a file, in this case in dyString.c.  Static functions in
this file need not have the prefix, though they may.  Functions
have a comment between their prototype and the opening brace
as in this example:

char dyStringAppendC(struct dyString *ds, char c)
// Append char to end of string. 
{
char *s;
if (ds->stringSize >= ds->bufSize)
     dyStringExpandBuf(ds, ds->bufSize+256);
s = ds->string + ds->stringSize++;
*s++ = c;
*s = 0;
return c;
}

For short functions like this, the opening comment may be the only
comment.  Longer functions should be broken into logical 'paragraphs'
with a comment at the start of each paragraph and blank lines
between paragraphs as in this example:

struct twoBit *twoBitFromDnaSeq(struct dnaSeq *seq, boolean doMask)
/* Convert dnaSeq representation in memory to twoBit representation.
 * If doMask is true interpret lower-case letters as masked. */
{
/* Allocate structure and fill in name and size fields. */
struct twoBit *twoBit;
AllocVar(twoBit);
int ubyteSize = packedSize(seq->size);
UBYTE *pt = AllocArray(twoBit->data, ubyteSize);
twoBit->name = cloneString(seq->name);
twoBit->size = seq->size;
    
/* Convert to 4-bases per byte representation. */
char *dna = seq->dna;
int i, end;
end = seq->size - 4;
for (i=0; i<end; i += 4)
    {
    *pt++ = packDna4(dna+i);
    }

/* Take care of conversion of last few bases, padding arbitrarily with 'T'. */
DNA last4[4];   
last4[0] = last4[1] = last4[2] = last4[3] = 'T';
memcpy(last4, dna+i, seq->size-i);
*pt = packDna4(last4);

/* Deal with blocks of N, saving end points of blocks. */
twoBit->nBlockCount = countBlocksOfN(dna, seq->size);
if (twoBit->nBlockCount > 0)
    {
    AllocArray(twoBit->nStarts, twoBit->nBlockCount);
    AllocArray(twoBit->nSizes, twoBit->nBlockCount);
    storeBlocksOfN(dna, seq->size, twoBit->nStarts, twoBit->nSizes);
    }

/* Deal with masking, saving end points of blocks. */
if (doMask)
    {
    twoBit->maskBlockCount = countBlocksOfLower(dna, seq->size);
    if (twoBit->maskBlockCount > 0)
        {
        AllocArray(twoBit->maskStarts, twoBit->maskBlockCount);
        AllocArray(twoBit->maskSizes, twoBit->maskBlockCount);
        storeBlocksOfLower(dna, seq->size,
                twoBit->maskStarts, twoBit->maskSizes);
        }
    }
return twoBit;
}

Though code paragraphs help make long functions readable, in general
smaller functions are preferred. It is rare that a function longer than 
100 lines couldn't be improved by moving some blocks of code into new 
functions or simplifying what the function is trying to do.

STRUCTURE OF A TYPICAL C MODULE

To avoid having to declare function prototypes, C modules are generally
ordered with the lowest level functions written before higher level
functions. In particular, if a module includes a main() routine, then
it is the last function in the module. 

If a structure is broadly used in a module, it is declared near the start
of the module, just after the module opening comment and any includes.  
This is followed by broadly used module local (static) variables.  Less 
broadly used structs and variables may be grouped with the functions they 
are used with.

If a module is used by other modules, it will be represented in a header 
file.  In the majority of cases one .h file corresponds to one .c file.
Typically the opening comment is duplicated in .h and .c files, as are
the public structure and function declarations and opening comments. 

In general we try, with mixed success, to keep modules less than 2000 lines.
Sadly many of the Genome Browser specific modules are currently quite long.
On the bright side the vast majority of the library modules are reasonably
sized.

PREVENTING STRING OVERFLOW

One weakness of C in the string handling.  It is very easy using standard C 
library functions like sprintf and strcat to write past the end of the
character array that holds a string.  For this reason instead of sprintf
we use the "safef" function, which takes an additional parameter, the size
of the character array.  So
   char buffer[50];
   sprintf(buf, "My name is %s", name);
becomes
   char buffer[50];
   safef(buf, sizeof(buf), "My name is %s", name);
Instead of just silently overflowing the buffer and crashing cryptically
in many cases if the string is too long, say "Sahar Barjesteh van Waalwijk van Doorn-Khosrovani"
which is actually a real scientists name!

PREVENTING SQL-INJECTION

In order to prevent SQL-Injection (sqli), we use primarily
a special function called sqlSafef() to construct properly
escaped SQL strings.  

The main article about preventing sqli is found here on genomewiki:

http://genomewiki.ucsc.edu/index.php/Sql_injection_protection

There are several other related and supporting 
functions to defeat sqli.  The function reference is found here:

http://genomewiki.ucsc.edu/index.php/Sql-injection_safe_functions

CREATING NEW PROGRAMS

By convention most of our command line programs follow a very simple structure.  They are in 
a directory by themselves which initially will just contain a .c file and a makefile.  It
is easiest to use the program called "newProg" in our source tree to set this up.  It'll create
a proper makefile, which is not simple, but is rarely done enough you're likely to forget it.
It also creates a skeleton for the C program including a usage message.