Tuesday, November 5, 2013

BIOINFORMATICS PART 2



PERL


Perl is an interpreted language optimized for scanning arbitrary text files, extracting information from those text files, and printing reports based on that information. It's also a good language for many system management tasks. The language is intended to be practical (easy to use, efficient, complete) rather than beautiful (tiny, elegant, minimal). It combines (in the author's opinion, anyway) some of the best features of C, sed, awk, and sh, so people familiar with those languages should have little difficulty with it. Expression syntax corresponds quite closely to C expression syntax. Unlike most Unix utilities, Perl does not arbitrarily limit the size of your data -- if you've got the memory, Perl can slurp in your whole file as a single string. Recursion is of unlimited depth. And the hash tables used by associative arrays grow as necessary to prevent degraded performance. Perl uses sophisticated pattern matching techniques to scan large amounts of data very quickly.

Features

The overall structure of Perl derives broadly from C. Perl is procedural in nature, with variables, expressions, assignment statements, brace-delimited blocks, control structures, and subroutines.
Perl also takes features from shell programming. All variables are marked with leading sigils, which unambiguously identify the data type (for example, scalar, array, hash) of the variable in context. Importantly, sigils allow variables to be interpolated directly into strings. Perl has many built-in functions that provide tools often used in shell programming (although many of these tools are implemented by programs external to the shell), such as sorting and calling on operating-system facilities.
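As a minimal sketch of these features (the file name notes.txt and the word-counting task are invented for illustration), the following short Perl script uses sigils, a hash, a regular expression, and string interpolation to tally the words in a text file:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $file = 'notes.txt';                # hypothetical input file
    open my $fh, '<', $file or die "Cannot open $file: $!";

    my %count;                             # hash: word => occurrences
    while (my $line = <$fh>) {
        # extract words with a regular expression
        while ($line =~ /([A-Za-z]+)/g) {
            $count{ lc $1 }++;             # % and $ sigils mark hash and scalar context
        }
    }
    close $fh;

    # sigils let variables be interpolated directly into strings
    for my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
        print "$word: $count{$word}\n";
    }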

Design

The design of Perl can be understood as a response to three broad trends in the computer industry: falling hardware costs, rising labor costs, and improvements in compiler technology. Many earlier computer languages, such as Fortran and C, aimed to make efficient use of expensive computer hardware. In contrast, Perl is designed to make efficient use of expensive computer programmers.
Perl has many features that ease the task of the programmer at the expense of greater CPU and memory requirements. These include automatic memory management; dynamic typing; strings, lists, and hashes; regular expressions; introspection; and an eval() function. Perl follows the theory of "no built-in limits",[42] an idea similar to the Zero One Infinity rule.

Applications

Perl has many and varied applications, compounded by the availability of many standard and third-party modules.
Perl has chiefly been used to write CGI scripts: large projects written in Perl include cPanel, Slash, Bugzilla, RT, TWiki, and Movable Type; high-traffic websites that use Perl extensively include Amazon.com, bbc.co.uk, Priceline.com, Craigslist,[51] IMDb,[52] LiveJournal, Slashdot and Ticketmaster. It is also an optional component of the popular LAMP technology stack for web development, in lieu of PHP or Python.
Perl is often used as a glue language, tying together systems and interfaces that were not specifically designed to interoperate, and for "data munging".

Implementation

Perl is implemented as a core interpreter, written in C, together with a large collection of modules, written in Perl and C. As of 2010, the stable version (5.12.3) is 14.2 MB when packaged in a tar file and gzip compressed.[54] The interpreter is 150,000 lines of C code and compiles to a 1 MB executable on typical machine architectures. Alternatively, the interpreter can be compiled to a link library and embedded in other programs. There are nearly 500 modules in the distribution, comprising 200,000 lines of Perl and an additional 350,000 lines of C code.

Availability

Perl is dual licensed under both the Artistic License and the GNU General Public License. Distributions are available for most operating systems. It is particularly prevalent on Unix and Unix-like systems, but it has been ported to most modern (and many obsolete) platforms. With only six reported exceptions, Perl can be compiled from source code on all POSIX-compliant, or otherwise Unix-compatible, platforms.

Windows

Users of Microsoft Windows typically install one of the native binary distributions of Perl for Win32, most commonly Strawberry Perl or ActivePerl. Compiling Perl from source code under Windows is possible, but most installations lack the requisite C compiler and build tools. This also makes it difficult to install modules from the CPAN, particularly those that are partially written in C.
ActivePerl is a closed source distribution from ActiveState that has regular releases that track the core Perl releases.[65] The distribution also includes the Perl package manager (PPM),[66] a popular tool for installing, removing, upgrading, and managing the use of common Perl modules.
Strawberry Perl is an open source distribution for Windows. It has had regular, quarterly releases since January 2008, including new modules as feedback and requests come in. Strawberry Perl aims to be able to install modules like standard Perl distributions on other platforms, including compiling XS modules.

Boolean Search


Boolean search allows you to combine words and phrases using the words AND, OR, NOT, and NEAR (otherwise known as Boolean operators) to limit, widen, or define your search. Most Internet search engines and Web directories default to these Boolean search parameters.

 

Boolean Search Operators

  • The Boolean search operator AND is equal to the "+" symbol.
  • The Boolean search operator NOT is equal to the "-" symbol.
  • The Boolean search operator OR is the default setting of many search engines; meaning, the engine will automatically return pages that contain any of the words you type in.
  • The Boolean search operator NEAR is equal to putting a search query in quotes, i.e., "sponge bob squarepants". You're essentially telling the search engine that you want all of these words, in this specific order, as this specific phrase.
Examples: 

Using AND narrows a search by combining terms; it will retrieve documents that use both of the search terms you specify, as in this example:
  • Portland AND Oregon
Using OR broadens a search to include results that contain either of the words you type in.
  • liberal OR democrat
Using NOT will narrow a search by excluding certain search terms.
  • Oregon NOT travel
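The same three operators are easy to mimic when filtering text programmatically. A minimal Perl sketch (the document titles are made up) expressing AND, OR, and NOT as grep filters:

    use strict;
    use warnings;

    my @docs = (
        'Travel guide to Portland Oregon',
        'Oregon state parks',
        'Portland Maine lobster festival',
    );

    my @and = grep { /Portland/i && /Oregon/i } @docs;   # AND: both terms
    my @or  = grep { /Portland/i || /Oregon/i } @docs;   # OR: either term
    my @not = grep { /Oregon/i && !/Travel/i } @docs;    # NOT: exclude a term

    print "AND: @and\n";   # Travel guide to Portland Oregon
    print "NOT: @not\n";   # Oregon state parks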

 

Fuzzy Search


Fuzzy searching can be used for approximate string matching. Other algorithms for approximate string searching exist (e.g., Soundex), but those aren't as easy to implement.
A word list (a List<string> in the original implementation) is used for searching, and therefore it's quite easy to search a database.
The algorithm uses the Levenshtein distance to determine how closely a string from the word list matches the word to be found.

 

Searching


In the search process, the Levenshtein distance is computed for each word in the word list, and from this distance a score. This score represents how well the strings match. The input argument fuzziness determines how much the strings may differ.
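A minimal Perl sketch of this filtering step (the word list and the fuzziness threshold of 2 edits are illustrative assumptions; the distance function is the classic dynamic-programming formulation):

    use strict;
    use warnings;
    use List::Util qw(min);

    # classic dynamic-programming Levenshtein distance
    sub levenshtein {
        my ($s, $t) = @_;
        my @d = (0 .. length $t);                  # row for the empty prefix of $s
        for my $i (1 .. length $s) {
            my @next = ($i);
            for my $j (1 .. length $t) {
                my $cost = substr($s, $i-1, 1) eq substr($t, $j-1, 1) ? 0 : 1;
                push @next, min($d[$j] + 1,        # deletion
                                $next[$j-1] + 1,   # insertion
                                $d[$j-1] + $cost); # substitution
            }
            @d = @next;
        }
        return $d[-1];
    }

    # keep words whose distance stays within the fuzziness threshold
    my $query     = 'bacteria';
    my $fuzziness = 2;                             # maximum edits allowed (assumed)
    my @matches   = grep { levenshtein($query, $_) <= $fuzziness }
                    qw(bacterium bacteria bavteria protein);
    print "@matches\n";                            # bacterium bacteria bavteria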
Unlike the other searches, which try to find exact matches for the search terms you enter, the fuzzy search looks for approximate matches. This allows you to search for names when you are uncertain of the correct spelling. This type of search works best with names, so that is how it is used here.
The fuzzy search uses variations on a method commonly called "soundex", which looks for names that sound approximately like the name you entered. Since the computer can't actually hear how the names sound, this method is not exact, and you may get results that seem odd (searching for "Smith", for example, will return the expected matches, but will also return "Sandie" and "Santo"). Each variation on the basic soundex method (there are several) will give slightly different results, so if one is not satisfactory, try another.
The fuzzy search is available as part of the Census, Directory, Tax and Global Name searches.

 

How the Fuzzy Search Methods Work


The Fuzzy Search currently implements two varieties of soundex-type search: the original Soundex method, and a newer method called Metaphone, which has two variations.

Soundex

Soundex is a phonetic algorithm for indexing names by their sound when pronounced in English. The basic aim is for names with the same pronunciation to be encoded to the same string so that matching can occur despite minor differences in spelling. Soundex is the most widely known of all phonetic algorithms and is often used (incorrectly) as a synonym for "phonetic algorithm".
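A simplified Perl sketch of the encoding rules (the full standard also has special handling for H and W that this version skips; the CPAN module Text::Soundex provides a complete implementation):

    use strict;
    use warnings;

    sub soundex {
        my $name = uc shift;
        $name =~ tr/A-Z//cd;                    # keep letters only
        return '' unless length $name;
        my %code = (
            B=>1, F=>1, P=>1, V=>1,
            C=>2, G=>2, J=>2, K=>2, Q=>2, S=>2, X=>2, Z=>2,
            D=>3, T=>3, L=>4, M=>5, N=>5, R=>6,
        );
        my $out  = substr $name, 0, 1;          # keep the first letter
        my $last = $code{$out} // 0;
        for my $ch (split //, substr $name, 1) {
            my $c = $code{$ch} // 0;            # vowels and H/W/Y code to 0
            $out .= $c if $c && $c != $last;    # drop zeros and repeated codes
            $last = $c;
        }
        return substr $out . '000', 0, 4;       # pad or truncate to 4 characters
    }

    # Smith, Sandie, and Santo all encode to S530, matching the
    # odd-looking results described in the fuzzy search section above
    print join(' ', map soundex($_), qw(Smith Sandie Santo)), "\n";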

Metaphone

Metaphone was developed in 1990 by Lawrence Philips as a response to deficiencies in the Soundex algorithm. In 2000 Philips modified his original algorithm with additional heuristics, and called the result "double metaphone". Double metaphone will give better results than the original metaphone in some cases, but they often both give identical results.


DNA sequencing


DNA sequencing is the process of determining the precise order of nucleotides within a DNA molecule. It includes any method or technology that is used to determine the order of the four bases—adenine, guanine, cytosine, and thymine—in a strand of DNA. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.
Knowledge of DNA sequences has become indispensable for basic biological research, and in numerous applied fields such as diagnostics, biotechnology, forensic biology, and biological systematics. The rapid speed of sequencing attained with modern DNA sequencing technology has been instrumental in the sequencing of complete DNA sequences, or genomes, of numerous types and species of life, including the human genome and the complete DNA sequences of many other animal, plant, and microbial species.

 

Use of sequencing

DNA sequencing may be used to determine the sequence of individual genes, larger genetic regions (i.e. clusters of genes or operons), full chromosomes or entire genomes. Depending on the methods used, sequencing may provide the order of nucleotides in DNA or RNA isolated from cells of animals, plants, bacteria, archaea, or virtually any other source of genetic information. The resulting sequences may be used by researchers in molecular biology or genetics to further scientific progress or may be used by medical personnel to make treatment decisions or aid in genetic counseling.

 

Basic method

 

Maxam-Gilbert sequencing

 Allan Maxam and Walter Gilbert published a DNA sequencing method in 1977 based on chemical modification of DNA and subsequent cleavage at specific bases.[7] Also known as chemical sequencing, this method allowed purified samples of double-stranded DNA to be used without further cloning. This method's use of radioactive labeling and its technical complexity discouraged extensive use after refinements in the Sanger methods had been made.
Maxam-Gilbert sequencing requires radioactive labeling at one 5' end of the DNA and purification of the DNA fragment to be sequenced. Chemical treatment then generates breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions (G, A+G, C, C+T). The concentration of the modifying chemicals is controlled to introduce on average one modification per DNA molecule. Thus a series of labeled fragments is generated, from the radiolabeled end to the first "cut" site in each molecule. The fragments in the four reactions are electrophoresed side by side in denaturing acrylamide gels for size separation. To visualize the fragments, the gel is exposed to X-ray film for autoradiography, yielding a series of dark bands each corresponding to a radiolabeled DNA fragment, from which the sequence may be inferred.[7]

Chain-termination methods

The chain-termination method developed by Frederick Sanger and coworkers in 1977 soon became the method of choice, owing to its relative ease and reliability.[6][22] When invented, the chain-terminator method used fewer toxic chemicals and lower amounts of radioactivity than the Maxam and Gilbert method. Because of its comparative ease, the Sanger method was soon automated and was the method used in the first generation of DNA sequencers.

 

Advanced methods and de novo sequencing


Genomic DNA is fragmented into random pieces and cloned as a bacterial library. DNA from individual bacterial clones is sequenced and the sequence is assembled by using overlapping DNA regions.
Large-scale sequencing often aims at sequencing very long DNA pieces, such as whole chromosomes, although large-scale sequencing can also be used to generate very large numbers of short sequences, such as found in phage display. For longer targets such as chromosomes, common approaches consist of cutting (with restriction enzymes) or shearing (with mechanical forces) large DNA fragments into shorter DNA fragments. The fragmented DNA may then be cloned into a DNA vector and amplified in a bacterial host such as Escherichia coli. Short DNA fragments purified from individual bacterial colonies are individually sequenced and assembled electronically into one long, contiguous sequence.
The term "de novo sequencing" specifically refers to methods used to determine the sequence of DNA with no previously known sequence. De novo translates from Latin as "from the beginning". Gaps in the assembled sequence may be filled by primer walking. The different strategies have different tradeoffs in speed and accuracy; shotgun methods are often used for sequencing large genomes, but its assembly is complex and difficult, particularly with sequence repeats often causing gaps in genome assembly.

Shotgun sequencing

Shotgun sequencing is a sequencing method designed for analysis of DNA sequences longer than 1000 base pairs, up to and including entire chromosomes. This method requires the target DNA to be broken into random fragments. After sequencing individual fragments, the sequences can be reassembled on the basis of their overlapping regions.[28]
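A toy Perl sketch of the reassembly idea (the fragments are invented, and real assemblers are far more sophisticated): find the longest suffix of the growing assembly that matches a prefix of the next fragment, then append the rest.

    use strict;
    use warnings;

    # longest suffix of $left that is also a prefix of $right
    sub overlap_len {
        my ($left, $right) = @_;
        my $max = length($left) < length($right) ? length($left) : length($right);
        for my $len (reverse 1 .. $max) {
            return $len if substr($left, -$len) eq substr($right, 0, $len);
        }
        return 0;
    }

    # toy fragments covering the sequence ATGGCGTACGTT
    my @frags    = ('ATGGCGTA', 'GCGTACGT', 'TACGTT');
    my $assembly = shift @frags;
    for my $frag (@frags) {
        my $len = overlap_len($assembly, $frag);
        $assembly .= substr($frag, $len);        # append the non-overlapping tail
    }
    print "$assembly\n";                         # ATGGCGTACGTT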

Bridge PCR

Another method for in vitro clonal amplification is bridge PCR, in which fragments are amplified upon primers attached to a solid surface [16][29][30] and form "DNA colonies" or "DNA clusters". This method is used in the Illumina Genome Analyzer sequencers. Single-molecule methods, such as that developed by Stephen Quake's laboratory (later commercialized by Helicos) are an exception: they use bright fluorophores and laser excitation to detect base addition events from individual DNA molecules fixed to a surface, eliminating the need for molecular amplification.


3D structure visualization tools


This list of software for protein structure visualization is a compilation of bioinformatics software used to view protein structures. Such tools are commonly used in molecular biology and bioinformatics. For example, you can look at your protein in the VMD viewer. You need a PDB file to view the 3-dimensional structure.

Building a protein structure is not enough; you have to visualize the final protein tertiary structure to analyze the result. There are many good software packages for visualizing protein structure.
  1. BioBlender
  2. Visual Molecular Dynamics : can be used for modeling, visualization, and analysis of biological systems such as proteins, nucleic acids, and lipid bilayer assemblies. VMD has many other functions besides protein structure visualization.
  3. Jmol
  4. PyMOL
  5. RasMol
  6. UCSF Chimera :  can be used for 'interactive visualization and analysis of molecular structures and related data, including density maps, supramolecular assemblies, sequence alignments, docking results, trajectories, and conformational ensembles'
  7. Friend
  8. VisProt3DS 
  9. Polyview-3D : A web server and a good starting point if you don't want to install any software on your computer.
  10. Swiss pdb viewer : My favorite one. Very light and works very efficiently. You can also use Swiss pdb viewer along with SWISS-MODEL for protein homology modeling.

Single-nucleotide polymorphism


A single-nucleotide polymorphism (SNP, pronounced snip; plural snips) is a DNA sequence variation occurring when a single nucleotide (A, T, C, or G) in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes in a human. For example, two sequenced DNA fragments from different individuals, AAGCCTA and AAGCTTA, contain a difference in a single nucleotide. In this case we say that there are two alleles. Almost all common SNPs have only two alleles. The genomic distribution of SNPs is not homogeneous; SNPs usually occur more frequently in non-coding regions than in coding regions or, in general, where natural selection is acting and fixating the allele of the SNP that constitutes the most favorable genetic adaptation.[1] Other factors, like genetic recombination and mutation rate, can also determine SNP density.
These genetic variations between individuals (particularly in non-coding parts of the genome) are exploited in DNA fingerprinting, which is used in forensic science. Also, these genetic variations underlie differences in our susceptibility to disease. The severity of illness and the way our body responds to treatments are also manifestations of genetic variations. For example, a single base mutation in the APOE (apolipoprotein E) gene is associated with a higher risk for Alzheimer disease.
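A small Perl sketch that compares the two example fragments above and reports the variant position:

    use strict;
    use warnings;

    my $seq1 = 'AAGCCTA';
    my $seq2 = 'AAGCTTA';

    # report every position where the two aligned sequences differ
    for my $i (0 .. length($seq1) - 1) {
        my ($x, $y) = (substr($seq1, $i, 1), substr($seq2, $i, 1));
        printf "SNP at position %d: %s/%s\n", $i + 1, $x, $y if $x ne $y;
    }
    # prints: SNP at position 5: C/T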
Types of SNPs
  • Non-coding region
  • Coding region
    • Synonymous
    • Nonsynonymous
      • Missense
      • Nonsense
Single-nucleotide polymorphisms may fall within coding sequences of genes, non-coding regions of genes, or in the intergenic regions (regions between genes). SNPs within a coding sequence do not necessarily change the amino acid sequence of the protein that is produced, due to the degeneracy of the genetic code (for example, GAA and GAG both encode glutamic acid, so a change between them at the third codon position is synonymous).
SNPs in the coding region are of two types: synonymous and nonsynonymous SNPs. Synonymous SNPs do not affect the protein sequence, while nonsynonymous SNPs change the amino acid sequence of the protein. The nonsynonymous SNPs are of two types: missense and nonsense.
SNPs that are not in protein-coding regions may still affect gene splicing, transcription factor binding, messenger RNA degradation, or the sequence of non-coding RNA. Gene expression affected by this type of SNP is referred to as an eSNP (expression SNP) and may be upstream or downstream from the gene.

 

Use and importance

Variations in the DNA sequences of humans can affect how humans develop diseases and respond to pathogens, chemicals, drugs, vaccines, and other agents. SNPs are also critical for personalized medicine.[4] However, their greatest importance in biomedical research is for comparing regions of the genome between cohorts (such as with matched cohorts with and without a disease) in genome-wide association studies.
The study of SNPs is also important in crop and livestock breeding programs. See SNP genotyping for details on the various methods used to identify SNPs.
SNPs are usually biallelic and thus easily assayed.[5] A single SNP may cause a Mendelian disease. For complex diseases, SNPs do not usually function individually; rather, they work in coordination with other SNPs to manifest a disease condition, as has been seen in osteoporosis.

 

Applications for SNP Genotyping


Plant and Animal Genetics
 Following Mendel’s rule of inheritance, single-nucleotide polymorphisms (SNPs) are evolutionarily conserved and therefore, are useful in plant and animal marker-assisted breeding programs. Using SNP genotyping, selective breeding is accelerated by allowing traits to be identified and selected prior to growing the organism to maturity – saving time and money. Large-scale quality control testing also benefits from SNP genotyping, using these simple markers as a “genetic fingerprint” to trace samples as well as estimate the purity of a population.
Human Genetics
SNPs are used in large scale epidemiological studies to identify specific variations in genes that influence susceptibility to disease and response to medication or treatment. Once identified, SNPs may be used to diagnose individuals carrying the gene of interest, which provides critical information necessary to personalize medical care. Human identity testing also may be achieved using the “genetic fingerprint” provided by SNP genotyping.
Applications of SNPs
NGS and SNP genotyping technologies have made SNPs the most widely used marker for genetic studies in plant species such as Arabidopsis [121] and rice [122]. SNPs can help to decipher breeding pedigree, to identify genomic divergence of species to elucidate speciation and evolution, and to associate genomic variations to phenotypic traits [85]. The ease of SNP development, reasonable genotyping costs, and the sheer number of SNPs present within a collection of individuals allow an assortment of applications that can have a tremendous impact on basic and applied research in plant species.
SNPs in Genetic Mapping
A genetic map refers to the arrangement of traits, genes, and markers relative to each other as measured by their recombination frequency. Genetic maps are essential tools in molecular breeding for plant genetic improvement as they enable gene localization, map-based cloning, and the identification of QTL [123]. SNPs have greatly facilitated the production of much higher density maps than previous marker systems. SNPs discovered using RNA-Seq and expressed sequence tags (ESTs) have the added advantage of being gene specific [124]. Their high abundance and rapidly improving genotyping technologies make SNPs an ideal marker type for generating new genetic maps as well as saturating existing maps created with other markers. Most SNPs are biallelic thereby having a lower polymorphism information content (PIC) value as compared to most other marker types which are often multiallelic [125]. The limited information associated with their biallelic nature is greatly compensated by their high frequency, and a map of 700–900 SNPs has been found to be equivalent to a map of 300–400 simple sequence repeat (SSR) markers [125]. SNP-based linkage maps have been constructed in many economically important species such as rice [126], cotton [91] and Brassica [127]. The identification of candidate genes for flowering time in Brassica [127] and maize [128] are practical examples of gene discovery through SNP-based genetic maps.
Genome-Wide Association Mapping
Association mapping (AM) panels provide a better resolution, consider numerous alleles, and may provide faster marker-trait association than biparental populations [129]. AM, often referred to as linkage disequilibrium (LD) mapping, relies on the nonrandom association between markers and traits [130]. LD can vary greatly across a genome. In low LD regions, high marker saturation is required to detect marker-trait association, hence the need for densely saturated maps. In general, GWASs require 10,000–100,000 markers applied to a collection of genotypes representing a broad genetic basis [130].
Evolutionary Studies
The main advantage of SNPs is unquestionably their large numbers. As with all marker systems the researcher must be aware of ascertainment biases that exist in the panel of SNPs being used. These biases exist because SNPs are often developed from examining a small group of individuals and selecting the markers that maximize the amount of polymorphism that can be detected in the population used. This results in a collection of markers that sample only a fraction of the diversity that exists in the species but that are nevertheless used to infer relatedness and determine genetic distance for whole populations. Ideally, a set of SNP markers randomly distributed throughout the genome would be developed for each population studied. GBS moves us closer to this goal by incorporating simultaneous discovery of SNPs and genotyping of individuals. With this approach genome sample bias remains but can be mitigated by careful restriction enzyme selection.

XML


Extensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It is defined in the XML 1.0 Specification[3] produced by the W3C, and several other related specifications,[4] all gratis open standards.[5]
The design goals of XML emphasize simplicity, generality, and usability over the Internet.[6] It is a textual data format with strong support via Unicode for the languages of the world. Although the design of XML focuses on documents, it is widely used for the representation of arbitrary data structures, for example in web services.
Many application programming interfaces (APIs) have been developed to aid software developers with processing XML data, and several schema systems exist to aid in the definition of XML-based languages.
XML is a markup language much like HTML:
  • XML was designed to carry data, not to display data
  • XML tags are not predefined. You must define your own tags
  • XML is designed to be self-descriptive
  • XML is a W3C Recommendation

XML Separates Data from HTML

If you need to display dynamic data in your HTML document, it will take a lot of work to edit the HTML each time the data changes.
With XML, data can be stored in separate XML files. This way you can concentrate on using HTML/CSS for display and layout, and be sure that changes in the underlying data will not require any changes to the HTML.
With a few lines of JavaScript code, you can read an external XML file and update the data content of your web page.
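The same separation works outside the browser. A minimal Perl sketch using the CPAN module XML::LibXML (the file genes.xml and its element names are invented for illustration):

    use strict;
    use warnings;
    use XML::LibXML;

    # genes.xml is a hypothetical data file, e.g.:
    # <genes><gene symbol="APOE">apolipoprotein E</gene></genes>
    my $dom = XML::LibXML->load_xml(location => 'genes.xml');

    for my $gene ($dom->findnodes('//gene')) {
        my $symbol = $gene->getAttribute('symbol');
        my $name   = $gene->textContent;
        print "$symbol: $name\n";
    }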

XML Simplifies Data Sharing

In the real world, computer systems and databases contain data in incompatible formats.
XML data is stored in plain text format. This provides a software- and hardware-independent way of storing data.
This makes it much easier to create data that can be shared by different applications.

XML Simplifies Data Transport

One of the most time-consuming challenges for developers is to exchange data between incompatible systems over the Internet.
Exchanging data as XML greatly reduces this complexity, since the data can be read by different incompatible applications.

XML Simplifies Platform Changes

Upgrading to new systems (hardware or software platforms) is always time-consuming. Large amounts of data must be converted, and incompatible data is often lost.
XML data is stored in text format. This makes it easier to expand or upgrade to new operating systems, new applications, or new browsers, without losing data.

XML Makes Your Data More Available

Different applications can access your data, not only in HTML pages, but also from XML data sources.
With XML, your data can be available to all kinds of "reading machines" (handheld computers, voice machines, news feeds, etc.), making it more available to blind people or people with other disabilities.

XML is Used to Create New Internet Languages

A lot of new Internet languages are created with XML.
Here are some examples:
  • XHTML 
  • WSDL for describing available web services
  • WAP and WML as markup languages for handheld devices
  • RSS languages for news feeds
  • RDF and OWL for describing resources and ontology
  • SMIL for describing multimedia for the web

 

Flat File


A flat file database is a database that stores data in a plain text file. Each line of the text file holds one record, with fields separated by delimiters, such as commas or tabs. While it uses a simple structure, a flat file database cannot contain multiple tables like a relational database can. Fortunately, most database programs such as Microsoft Access and FileMaker Pro can import flat file databases and use them in a larger relational database.
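A minimal Perl sketch of reading such a file (the tab-delimited layout and the file name records.txt are assumptions):

    use strict;
    use warnings;

    # records.txt: one record per line, tab-separated fields id, name, email
    open my $fh, '<', 'records.txt' or die "Cannot open records.txt: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($id, $name, $email) = split /\t/, $line;
        print "record $id: $name <$email>\n";
    }
    close $fh;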
Flat file is also a type of computer file system that stores all data in a single directory. There are no folders or paths used to organize the data. While this is a simple way to store files, a flat file system becomes increasingly inefficient as more data is added. The original Macintosh computer used this kind of file system, creatively called the Macintosh File System (MFS). However, it was soon replaced by the more efficient Hierarchical File System (HFS), which was based on a directory structure.
Flat files are used not only as data storage tools in DB and CMS systems, but also as data transfer tools to remote servers (in which case they become known as information streams).
In recent years, this latter implementation has been replaced with XML files, which not only contain but also describe the data. Those still using Flat Files to transfer information are mainframes employing specific procedures which are too expensive to modify.
One criticism often raised against the XML format as a way to perform mass data transfer operations is that its file size is significantly larger than that of flat files, which are generally reduced to the bare minimum. The solution to this problem is XML file compression (a solution that applies equally well to flat files), which has nowadays gained EXI standardization (Efficient XML Interchange, often used by mobile devices).

 

PubMed


PubMed is a free database accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics. The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval.
PubMed is the National Library of Medicine's search service that provides access to:
MEDLINE: an international bibliographic database of over 4,000 biomedical journals from 1966 to the present.
PreMEDLINE: new records are added daily; will appear with the tag [MEDLINE record in progress].
HealthSTAR: a health administration and technology/research database; also includes full text of practice guidelines.
PubMed also provides links to molecular biology databases of DNA sequences, population data, and genomes.

Subject Searching
PubMed may be searched by entering keywords or phrases into the text box. You may enter one or more terms and press the enter key or click Go. PubMed will search multiple words as a phrase if it recognizes the terms. Otherwise, PubMed will search the words separately and combine with AND. PubMed will also automatically try to map your term to a MeSH heading.
Example: vitamin c common cold is searched as vitamin c AND common cold
PubMed does not perform adjacency searching; instead it uses a Phrase List against which to match terms. Enclosing a term in quotation marks forces PubMed to check a second Phrase List. If the term is still not found, the words are combined with AND.
Example: "pressure point "
When using keyword searching, you may sometimes want to truncate (*) words to allow for variant word endings.
Example: bacteri* retrieves bacteria, bacterium, bacterial, etc.

 

Characteristics

Standard searches

Simple searches on PubMed can be carried out by entering key aspects of a subject into PubMed's search window.
PubMed translates this initial search formulation and automatically adds field names, relevant MeSH (Medical Subject Headings) terms, synonyms, Boolean operators, and 'nests' the resulting terms appropriately, enhancing the search formulation significantly, in particular by routinely combining (using the OR operator) textwords and MeSH terms.

Comprehensive searches

For comprehensive, optimal searches in PubMed, it is necessary to have a thorough understanding of its core component, MEDLINE, and especially of the MeSH (Medical Subject Headings) controlled vocabulary used to index MEDLINE articles. They may also require complex search strategies, use of field names (tags), proper use of limits and other features, and are best carried out by PubMed search specialists or librarians,[10] who are able to select the right type of search and carefully adjust it for recall and precision.[11]

Journal Article Parameters

When a journal article is indexed, numerous article parameters are extracted and stored as structured information. Such parameters are: Article Type (MeSH terms, e.g., "Clinical Trial"), Secondary identifiers, Keywords (MeSH terms), Language, Country of the Journal or publication history (e-publication date, print journal publication date).

Publication Type: Clinical queries/systematic reviews

Publication type parameter enables many special features. A special feature of PubMed is its "Clinical Queries" section, where "Clinical Categories", "Systematic Reviews", and "Medical Genetics" subjects can be searched, with study-type 'filters' automatically applied to identify substantial, robust studies.[12] As these 'clinical queries' can generate small sets of robust studies with considerable precision, it has been suggested that this PubMed section can be used as a 'point-of-care' resource.[13]

Bioinformatics and its applications


Bioinformatics is the branch of science which applies information technology and computer science to the field of molecular biology. Paulien Hogeweg coined the term bioinformatics in 1979 for the study of information processes in biological systems. The science of bioinformatics develops algorithms and biological computer software to analyze and record data related to biology, for example data on genes, proteins, drug ingredients, and metabolic pathways. Because biological data is generated in raw form, there is a need for repositories in which the data can be stored, organized, and manipulated. Biological software and databases give scientists this ability, so that data can be extracted from the databases easily and put to use.

Applications:-

Bioinformatics joins mathematics, statistics, computer science, and information technology to solve complex biological problems. These problems are usually at the molecular level and cannot be solved by other means. This interesting field of science has many applications and research areas where it can be applied.

Sequence Analysis:-
The application of sequence analysis determines those genes which encode regulatory sequences or peptides by using the information of sequencing. For sequence analysis, there are many powerful tools and computers which perform the duty of analyzing the genome of various organisms. These tools also detect DNA mutations in an organism and identify sequences which are related. Shotgun sequence techniques are also used for sequence analysis of numerous fragments of DNA. Special software is used to detect overlapping fragments and to assemble them.
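As a tiny illustration of this kind of scanning (the sequence is made up), Perl's pattern matching makes it easy to compute base composition or to locate a motif such as the EcoRI restriction site GAATTC:

    use strict;
    use warnings;

    my $dna = 'ATGGAATTCGCGGAATTCTTAA';    # made-up sequence

    # GC content via the tr/// counting idiom
    my $gc = ($dna =~ tr/GC//);
    printf "GC content: %.1f%%\n", 100 * $gc / length $dna;

    # locate every occurrence of the EcoRI site GAATTC
    while ($dna =~ /GAATTC/g) {
        printf "EcoRI site at position %d\n", pos($dna) - 5;
    }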

Prediction of Protein Structure:-
It is easy to determine the primary structure of a protein, in the form of the amino acid sequence encoded on the DNA molecule, but it is difficult to determine the secondary, tertiary, or quaternary structures of proteins. For this purpose, either crystallography is used, or the tools of bioinformatics can be used to determine the complex protein structures.

Genome Annotation:-
In genome annotation, genomes are marked up to identify regulatory sequences and protein-coding regions. It is a very important part of the human genome project, as it determines the regulatory sequences.

Comparative Genomics:-
Comparative genomics is the branch of bioinformatics which determines the genomic structure and function relation between different biological species. For this purpose, intergenomic maps are constructed which enable the scientists to trace the processes of evolution that occur in genomes of different species. These maps contain the information about the point mutations as well as the information about the duplication of large chromosomal segments.

Health and Drug discovery:-
The tools of bioinformatics are also helpful in drug discovery, diagnosis, and disease management. Complete sequencing of the human genome has enabled scientists to make medicines and drugs which can target more than 500 genes. Different computational tools and drug targets have made drug delivery easy and specific, because now only those cells which are diseased or mutated can be targeted. It is also easier to determine the molecular basis of a disease.

 

Entrez


The Entrez Global Query Cross-Database Search System is a powerful federated search engine, or web portal that allows users to search many discrete health sciences databases at the National Center for Biotechnology Information (NCBI) website.[1] The NCBI is a part of the National Library of Medicine (NLM), which is itself a department of the National Institutes of Health (NIH), which in turn is a part of the United States Department of Health and Human Services. "Entrez" also happens to be the second person plural (or formal) form of the French verb "entrer (to enter)", meaning the invitation "Come in!".
Entrez Global Query is an integrated search and retrieval system that provides access to all databases simultaneously with a single query string and user interface. Entrez can efficiently retrieve related sequences, structures, and references. The Entrez system can provide views of gene and protein sequences and chromosome maps.
The Entrez front page provides, by default, access to the global query. All databases indexed by Entrez can be searched via a single query string, supporting boolean operators and search term tags to limit parts of the search statement to particular fields. This returns a unified results page, that shows the number of hits for the search in each of the databases, which are also links to actual search results for that particular database.
Entrez also provides a similar interface for searching each particular database and for refining search results. The Limits feature allows the user to narrow a search using a web-forms interface. The History feature gives a numbered list of recently performed queries. Results of previous queries can be referred to by number and combined via Boolean operators. Search results can be saved temporarily in a Clipboard. Users with a MyNCBI account can save queries indefinitely and can also choose to receive e-mail updates with new search results for saved queries of most databases. Entrez is widely used in biotechnology and bioinformatics as a life-science search engine.

 

Databases

Entrez searches the following databases:
  • PubMed: biomedical literature citations and abstracts, including Medline - articles from (mainly medical) journals, often including abstracts. Links to PubMed Central and other full-text resources are provided for articles from the 1990s.
  • PubMed Central: free, full-text journal articles
  • Site Search: NCBI web and FTP web sites
  • Books: online books
  • Online Mendelian Inheritance in Man (OMIM)
  • Online Mendelian Inheritance in Animals (OMIA)
  • Nucleotide: sequence database (GenBank)
  • Protein: sequence database
  • Genome: whole genome sequences and mapping
  • Structure: three-dimensional macromolecular structures
  • Taxonomy: organisms in GenBank Taxonomy
  • SNP: single nucleotide polymorphism
  • Gene: gene-centered information
  • HomoloGene: eukaryotic homology groups
  • PubChem Compound: unique small molecule chemical structures
  • PubChem Substance: deposited chemical substance records
  • Genome Project: genome project information
  • UniGene: gene-oriented clusters of transcript sequences
  • CDD: conserved protein domain database
  • UniSTS: markers and mapping data
  • PopSet: population study data sets (epidemiology)
  • GEO Profiles: expression and molecular abundance profiles
  • GEO DataSets: experimental sets of GEO data
  • Sequence read archive: high-throughput sequencing data
  • Cancer Chromosomes: cytogenetic databases
  • PubChem BioAssay: bioactivity screens of chemical substances
  • GENSAT: gene expression atlas of mouse central nervous system[2]
  • Probe: sequence-specific reagents
  • NLM Catalog: NLM bibliographic data for over 1.2 million journals, books, audiovisuals, computer software, electronic resources, and other materials resident in LocatorPlus (updated every weekday).

Access

In addition to using the search engine forms to query the data in Entrez, NCBI provides the Entrez Programming Utilities (eUtils) for more direct access to query results. The eUtils are accessed by posting specially formed URLs to the NCBI server and parsing the XML response. There is also an eUtils SOAP interface.
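A minimal sketch of that pattern in Perl (LWP::Simple fetches the URL; the query term is illustrative, and the response is scanned with a simple regex here rather than a full XML parser):

    use strict;
    use warnings;
    use LWP::Simple qw(get);

    # ESearch: look up PubMed IDs (PMIDs) matching a query term
    my $base  = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
    my $query = 'apolipoprotein+E+alzheimer';
    my $xml   = get("${base}esearch.fcgi?db=pubmed&term=$query&retmax=5")
        or die 'eUtils request failed';

    # naive extraction of the <Id> elements from the XML response
    my @ids = $xml =~ m{<Id>(\d+)</Id>}g;
    print "PMIDs: @ids\n";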

Entrez



Entrez is the text-based search and retrieval system used at NCBI for all of the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, OMIM, and many others. Entrez is at once an indexing and retrieval system, a collection of data from many sources, and an organizing principle for biomedical information.

Some textbooks are also available online through the Entrez system.

Entrez Nodes Represent Data

An Entrez “node” is a collection of data that is grouped together and indexed together. It is usually referred to as an Entrez database. In the first version of Entrez, there were three nodes: published articles, nucleotide sequences, and protein sequences. Each node represents specific data objects of the same type, e.g., protein sequences, which are each given a unique ID (UID) within that logical Entrez Proteins node. Records in a node may come from a single source (e.g., all published articles are from PubMed) or many sources (e.g., proteins are from translated GenBank sequences, SWISS-PROT, or PIR).


Sequence Retrieval System (SRS)


SRS is a generic bioinformatics data integration software system. Developed initially in the early 1990s as an academic project at the European Molecular Biology Laboratory (EMBL), the system has evolved into a commercial product and is currently sold under license as a stand-alone software product.

SRS uses proprietary parsing techniques largely based on context-free grammars to parse and index flat-file data. A similar system combined with DOM-based processing rules is used to parse and index XML-formatted data. A relational database connector can be used to integrate data stored in relational database systems. SRS provides a unique common interface for accessing heterogeneous data sources, bypassing complexities related to the actual format and storage mechanism of the data. SRS can exploit textual references between different databases and pull together data from disparate sources into a unified view.

SRS is designed from the ground up with extensibility and flexibility in mind, in order to cope with the ever-changing list of databases and formats in the bioinformatics world. SRS relies on a mix of database configuration via meta-definitions and hand-crafted parsers to integrate a wide range of database distributions. These meta-definitions are regularly updated and are also available for extension and modification to all users.
A number of similar commercial systems have been developed that replicate the basic functionality of SRS.

Protein Information Resource


The Protein Information Resource (PIR) is an integrated public bioinformatics resource that supports genomic and proteomic research and scientific studies.
PIR was established in 1984 by the National Biomedical Research Foundation (NBRF) as a resource to assist researchers in the identification and interpretation of protein sequence information.
PIR offers a wide variety of resources mainly oriented to assist the propagation and standardization of protein annotation: PIRSF,[8] iProClass, iProLINK.

KEGG


KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of online databases dealing with genomes, enzymatic pathways, and biological chemicals.
According to the developers, KEGG is a "computer representation" of the biological system.[3] The KEGG database can be utilized for modeling and simulation, and for browsing and retrieval of data. It is a part of the systems biology approach.
KEGG maintains five main databases:[1]
  • KEGG Atlas
  • KEGG Pathway
  • KEGG Genes
  • KEGG Ligand
  • KEGG BRITE

 

DNA Data Bank of Japan


The DNA Data Bank of Japan (DDBJ) is a biological database that collects DNA sequences. It exchanges its data with European Molecular Biology Laboratory at the European Bioinformatics Institute and with GenBank at the National Center for Biotechnology Information on a daily basis. Thus these three databanks contain the same data at any given time.
Although DDBJ mainly receives its data from Japanese researchers, it can accept data from contributors from any other country.

National Center for Biotechnology Information


The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health.
The NCBI houses a series of databases relevant to biotechnology and biomedicine. Major databases include GenBank for DNA sequences and PubMed, a bibliographic database for the biomedical literature. All these databases are available online through the Entrez search engine.

The NCBI has had responsibility for making available the GenBank DNA sequence database since 1992.[1] GenBank coordinates with individual laboratories and other sequence databases such as those of the European Molecular Biology Laboratory (EMBL) and the DNA Data Bank of Japan (DDBJ).

Expressed sequence tag


An expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence.[1] ESTs may be used to identify gene transcripts, and are instrumental in gene discovery and gene sequence determination.
An EST results from one-shot sequencing of a cloned cDNA. The cDNAs used for EST generation are typically individual clones from a cDNA library. The resulting sequence is a relatively low quality fragment whose length is limited by current technology to approximately 500 to 800 nucleotides. Because these clones consist of DNA that is complementary to mRNA, the ESTs represent portions of expressed genes. They may be represented in databases as either cDNA/mRNA sequence or as the reverse complement of the mRNA, the template strand.
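Computing the reverse complement mentioned here is a short job in Perl; a minimal sketch:

    use strict;
    use warnings;

    # reverse the string, then swap complementary bases with tr///
    sub revcomp {
        my $seq = reverse shift;
        $seq =~ tr/ACGTacgt/TGCAtgca/;
        return $seq;
    }

    print revcomp('AAGCCTA'), "\n";   # TAGGCTT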


FASTA


FASTA is a DNA and protein sequence alignment software package.
FASTA takes a given nucleotide or amino-acid sequence and searches a corresponding sequence database by using local sequence alignment to find matches of similar database sequences.
The FASTA program follows a largely heuristic method which contributes to the high speed of its execution. It initially observes the pattern of word hits, word-to-word matches of a given length, and marks potential matches before performing a more time-consuming optimized search using a Smith-Waterman type of algorithm.
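The program also lent its name to the ubiquitous FASTA file format, in which each record is a ">" header line followed by sequence lines. A minimal Perl reader (the file name proteins.fasta is assumed):

    use strict;
    use warnings;

    # read a FASTA file into a hash of id => sequence
    open my $fh, '<', 'proteins.fasta' or die "Cannot open proteins.fasta: $!";
    my (%seq, $id);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>(\S+)/) {          # header line: capture the identifier
            $id = $1;
        } elsif (defined $id) {
            $seq{$id} .= $line;            # append sequence lines to the record
        }
    }
    close $fh;
    printf "%s: %d residues\n", $_, length $seq{$_} for sort keys %seq;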

 

BLAST



In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold.

Uses of BLAST

BLAST can be used for several purposes. These include identifying species, locating domains, establishing phylogeny, DNA mapping, and comparison.
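In practice BLAST searches are often scripted. A hedged sketch driving the NCBI BLAST+ command-line program blastn from Perl (the query file and a locally installed 'nt' database are assumptions; -outfmt 6 requests tabular output):

    use strict;
    use warnings;

    # run blastn; query.fasta and the 'nt' database must exist locally
    my $status = system('blastn',
        '-query',  'query.fasta',
        '-db',     'nt',
        '-out',    'hits.tsv',
        '-outfmt', '6',
    );
    die "blastn failed: $?" if $status != 0;

    # tabular columns begin: query id, subject id, percent identity, ...
    open my $fh, '<', 'hits.tsv' or die $!;
    while (my $line = <$fh>) {
        my ($query, $subject, $ident) = (split /\t/, $line)[0, 1, 2];
        print "$query hits $subject at $ident% identity\n";
    }
    close $fh;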


Identification of Gene Expression Patterns


Gene expression characterization information is important in gaining a better understanding of how multiple genes react in different conditions or over time. This ultimately can provide better insights into the controlling functionality of different genes in certain situations.
Gene expression is dynamic, and the same gene may act in different ways under different circumstances.
Researchers often use laboratory techniques such as a Northern blot or serial analysis of gene expression (SAGE). Both of these techniques make it possible to identify which genes are turned on and which are turned off within cells. Subsequently, this information can be used to help determine what circumstances trigger expression of various genes.
Both Northern blots and SAGE analyses work by measuring levels of mRNA, the intermediary between DNA and protein. Remember, in order to activate a gene, a cell must first copy the DNA sequence of that gene into a piece of mRNA known as a transcript. Thus, by determining which mRNA transcripts are present in a cell, scientists can determine which genes are expressed in that cell at different stages of development and under different environmental conditions.

 

Northern blots: What are they, and how do they work?

The quantity of mRNA transcript for a single gene directly reflects how much transcription of that gene has occurred. Tracking of that quantity will therefore indicate how vigorously a gene is transcribed, or expressed. To visualize differences in the quantity of mRNA produced by different groups of cells or at different times, researchers often use the method known as a Northern blot. For this method, researchers must first isolate mRNA from a biological sample by exposing the cells within it to a protease, which is an enzyme that breaks down cell membranes and releases the genetic material in the cells. Next, the mRNA is separated from the DNA, proteins, lipids, and other cellular contents. The different fragments of mRNA are then separated from one another via gel electrophoresis (a technique that separates molecules by passing an electrical current through a gel medium containing the molecules) and transferred to a filter or other solid support using a technique known as blotting. To identify the mRNA transcripts produced by a particular gene, the researchers next incubate the sample with a short piece of single-stranded RNA or DNA (also known as a probe) that is labeled with a radioactive molecule. Designed to be complementary to mRNA from the gene of interest, the probe will bind to this sequence. Later, when the filter is placed against X-ray film, the radioactivity in the probe will expose the film, thereby making marks on it. The intensity of the resulting marks, called bands, tells researchers how much mRNA was in the sample, which is a direct indicator of how strongly the gene of interest is expressed.

 

How does SAGE work?

SAGE identifies and counts the mRNA transcripts in a cell with the help of short snippets of the genetic code, called tags. These tags, which are a maximum of 14 nucleotides long, enable researchers to match an mRNA transcript with the specific gene that produced it. In most cases, each tag contains enough information to uniquely identify a transcript. The name "serial analysis" refers to the fact that tags are read sequentially as a continuous string of information.
The basic steps of the SAGE technique are outlined below.

Capturing mRNA

To begin a SAGE analysis, researchers must first separate the mRNA in a sample from the other cellular contents. To do this, they attach long strips of thymine nucleotides to tiny magnetic beads. When researchers flush the contents of a cell over the beads, these thymine strips form complementary base pairs with the poly-A tails of the mRNA molecules. Thus, when the flushing process is complete, the mRNA transcripts from the sample are captured because they are attached to the magnetic beads, while the other contents of the cells flush past the beads and are discarded.

Rewriting mRNA into cDNA

mRNA is more fragile than DNA, which makes it difficult to handle and analyze. To solve this problem, researchers often convert mRNA samples into complementary DNA sequences, or cDNA. This is done by reversing the natural process a cell uses to make mRNA from DNA, a method known as reverse transcription. The reverse transcription process doesn't use DNA polymerase or RNA polymerase; instead, it employs a special enzyme called reverse transcriptase. This enzyme makes cDNA sequences that are complementary to each mRNA transcript, essentially creating a converted form of the same sequence. This new single-stranded cDNA is then converted into a double-stranded cDNA molecule.

Cutting tags from each cDNA

To begin the next portion of SAGE, the researchers use a cutting enzyme to slice off short segments of nucleotides, called tags, at designated positions in each cDNA molecule. Next, two tags from each cDNA are combined into a single unit. These tags then become the representatives of the genes they came from, acting as unique stand-in identifiers. Without having to process the entire mRNA sequence thereafter, scientists can use these shorter tag sequences to keep track of whether a specific gene was expressed as mRNA.

Linking tags together in chains for sequencing

After the different tags have been made from each mRNA sequence, they are next linked together into long chains called concatemers. These concatemers therefore contain representatives of mRNAs from a group of genes. Linking the tags together in a concatemer is important, because it means that researchers will later be able to read thousands of tags at once during the analysis portion of the SAGE procedure.

Copying and reading the chains

Although the researchers now possess concatemers representing the genes expressed in the sample, they need multiple copies of these concatemers if they wish to run the molecules through a sequencing machine. Thus, just before sequencing, the concatemers are inserted into bacteria, and through their own replication process these bacteria make millions of copies of each concatemer chain. This step increases the volume of material, and it therefore ensures that there is a baseline amount of material necessary for a sequencing machine. After that, researchers use a sequencing machine to decode and read the long string of nucleotides in each chain.

Identifying and counting the tags

Finally, a computer processes the data from the sequencing machine and compiles a list of tags. By comparing the tags to a sequence database, the researchers can identify the mRNA (and ultimately the gene) that each tag came from. By subsequently counting the number of times each tag is observed, the researchers can also estimate the degree to which a particular gene is expressed: the more often a tag appears, the greater the level of gene expression.
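The counting step is exactly the kind of bookkeeping a hash handles well. A minimal Perl sketch (the tag list is a stand-in for real sequencer output):

    use strict;
    use warnings;

    # toy list of 14-bp tags standing in for sequencer output
    my @tags = qw(CATGGCTAAGCCTA CATGTTAGGACCAT CATGGCTAAGCCTA);

    # count how many times each tag is observed
    my %count;
    $count{$_}++ for @tags;

    # more observations of a tag imply higher expression of its gene
    for my $tag (sort { $count{$b} <=> $count{$a} } keys %count) {
        print "$tag: seen $count{$tag} time(s)\n";
    }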

Phylogenetic Analysis


Phylogenetic methods can be used for many purposes, including analysis of morphological and several kinds of molecular data. We concentrate here on the analysis of DNA and protein sequences, which supports:
  • Comparisons of more than two sequences
  • Analysis of gene families, including functional predictions
  • Estimation of evolutionary relationships among organisms
The basic concepts of phylogenetic analysis are quite easy to understand, but understanding what the results of the analysis mean, and avoiding errors of analysis, can be quite difficult.
