Text Manipulation

Data comes from many different sources in many different formats. The formats most commonly dealt with are delimited text and XML.

Delimited text

  • Can be processed with piped shell commands (or KNIME)

Piped Shell Commands

  • Start the data flow
    • cat filename (or zcat for .gz files)
  • Extract relevant columns
    • cut -f col1,col2,col3
    • cut splits on tabs by default (use -d to set another delimiter)
    • cut cannot reorder columns (the specified columns are always output in their original order)
  • Split file by number of rows
    • split -l lines inputfile outputprefix
  • Join rows from two files on a common field (the default join field is the first)
    • join file1 file2
    • join is sensitive to field order - it is simplest when the join field is first in both files; otherwise select it with -1 FIELD and -2 FIELD
    • both files must be sorted on the join field - use sort -k 3b,3 where 3 is the join field
    • (you'd think these tools would be better designed…)
  • Search and replace
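    • sed 'y/\t/|/'
      • single character replacement - replaces tabs with pipes (the \t escape is a GNU sed extension; the pipe needs no backslash)
    • sed -r 's/regexp/replacement/'
      • -r enables extended regular expressions (GNU sed also accepts -E)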
    • Example Use Case: Strip RefSeq versions from accessions
      • sed -r 's/([A-Z][A-Z]_[0-9]+)\.[0-9]+/\1/g'
      • the dot must be escaped (\.), otherwise it matches any character
    • Example Use Case: Remove quotes
      • sed -e 's/^"//' -e 's/"$//'
  • Find lines in common between two files
    • comm file1 file2
    • both files must be sorted
    • output has three columns: lines unique to file1, lines unique to file2, and lines common to both
    • use -1, -2, -3 to suppress the respective column (e.g. comm -12 prints only the common lines)
  • Sorting
    • sort [-r] [-n] [-k POS1[,POS2]] [-t sep] [-T dir]
    • -r : reverse
    • -n : numeric
    • -k : use fields POS1 through POS2 as the key. If POS2 is omitted, the key runs from POS1 to the end of the line
    • -t : use sep as the field separator (instead of the default, the transition from non-blank to blank)
    • -T : use dir as directory for temporary files
  • Tab-formatted unique line counts
    • uniq -c | perl -p -e 's/^[[:blank:]]*([[:digit:]]+)[[:blank:]]*(.*)/$2\t$1/'
    • uniq only collapses adjacent duplicate lines, so sort the input first
  • Convert uppercase text to lowercase
    • tr '[:upper:]' '[:lower:]'
  • Use less to view the output, or redirect it to a file
  • Use column -t | less -S to view a delimited file with aligned columns
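
Several of the commands above can be chained. The following is a minimal sketch, not a definitive recipe: the file names (genes.tsv, scores.tsv) and their contents are made up for illustration, and it assumes a sed that accepts -E (equivalent to -r in GNU sed). It strips RefSeq versions, sorts both files on the join field, joins them, and lists the accessions common to both.

```shell
#!/bin/sh
# Use the C locale so that sort, join and comm agree on ordering.
LC_ALL=C
export LC_ALL

tab=$(printf '\t')
workdir=$(mktemp -d)

# Hypothetical sample data: accession <TAB> gene symbol
printf 'NM_000546.6\tTP53\nNM_007294.4\tBRCA1\nNM_000059.4\tBRCA2\n' > "$workdir/genes.tsv"
# Hypothetical sample data: accession (no version) <TAB> score
printf 'NM_007294\t0.9\nNM_000546\t0.5\n' > "$workdir/scores.tsv"

# Strip the version suffix, then sort on the join field (field 1).
sed -E 's/([A-Z][A-Z]_[0-9]+)\.[0-9]+/\1/g' "$workdir/genes.tsv" |
    sort -t "$tab" -k 1,1 > "$workdir/genes.sorted"
sort -t "$tab" -k 1,1 "$workdir/scores.tsv" > "$workdir/scores.sorted"

# Join on the first field of both files, using tab as the separator.
joined=$(join -t "$tab" "$workdir/genes.sorted" "$workdir/scores.sorted")
printf '%s\n' "$joined"

# Accessions present in both files; comm also needs sorted input.
cut -f 1 "$workdir/genes.sorted" > "$workdir/genes.ids"
cut -f 1 "$workdir/scores.sorted" > "$workdir/scores.ids"
common=$(comm -12 "$workdir/genes.ids" "$workdir/scores.ids")
printf '%s\n' "$common"

rm -rf "$workdir"
```

Only rows whose accession appears in both files survive the join, so the BRCA2 row, which has no score, is dropped.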

For long jobs, use

  • nohup nice cmd &

to run it in the background at reduced priority and keep it running if the terminal connection is lost.
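
As a rough sketch of this pattern, the example below starts a stand-in job under nohup and nice and sends its output to a log file. The job itself (sleep followed by echo) and the log file are placeholders for illustration. The wait call is only there so the example can show the job finishing; in real use you would simply log out and check the log later.

```shell
#!/bin/sh
# Hypothetical log file; in practice pick a path you can find again later.
log=$(mktemp)

# Stand-in for a long-running command. Redirecting stdout and stderr
# keeps nohup from writing to nohup.out in the current directory.
nohup nice sh -c 'sleep 1; echo done' > "$log" 2>&1 &
job_pid=$!
echo "started job $job_pid, logging to $log"

# For demonstration only: block until the background job exits.
wait "$job_pid"
status=$?
cat "$log"
```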

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-Share Alike 2.5 License.