Learning Python for Forensics

Strings and Unicode

Strings are a data type that can contain any character, including alphanumeric characters, symbols, and other Unicode characters. Given how much information can be stored as a string, it is no surprise that strings are one of the most common data types. Strings appear when reading arguments at the command line, accepting user input, reading data from files, and writing output. To begin, let us look at how we can define a string in Python.

There are three ways to create a string: with single quotes, with double quotes, or with the built-in str() constructor. Note that there is no difference between single- and double-quoted strings. Having multiple ways to create a string is advantageous, as it allows us to include quote characters inside a string without ending it. For example, in the 'I hate when people use "air-quotes"!' string, we use the single quotes to demarcate the beginning and end of the main string. The double quotes inside the string will not cause any issues with the Python interpreter. Let's verify with the type() function that both single and double quotes create the same type of object:

>>> type('Hello World!')
<class 'str'>
>>> type("Foo Bar 1234")
<class 'str'>
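
As a brief sketch of the three approaches side by side (the variable names here are made up for illustration), the following snippet builds strings with each quote style and with str(), which also converts other objects, such as integers, to their string form:

```python
# Illustrative names only; none of these come from the text above.
single = 'I hate when people use "air-quotes"!'
double = "It's fine to use an apostrophe here."
converted = str(1234)  # str() converts other objects to strings

# All three are the same type of object.
print(type(single), type(double), type(converted))

# The quote style does not change the resulting value.
print('Foo Bar' == "Foo Bar")
```

Choosing the quote style that does not appear inside the text usually avoids the need for escaping.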

As we saw with comments, a block string can be defined by three single or double quotes to create multi-line strings. The only difference is whether we do something with the block-quoted value or not:

>>> """This is also a string"""
'This is also a string'
>>> '''it
... can span
... several lines'''
'it\ncan span\nseveral lines'

The \n character in the returned line signifies a line feed or a new line. The interpreter displays these newline characters as \n, though when the string is written to a file or printed to the console, a new line is created. The \n character is one of the common escape characters in Python. Escape characters are denoted by a backslash followed by a specific character. Other common escape characters include \t for horizontal tabs, \r for carriage returns, and \', \", and \\ for literal single quotes, double quotes, and backslashes, among others. Escaping these characters lets us use them literally, without triggering their special meaning in Python.
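
To make the difference between the displayed and the printed form concrete, here is a minimal sketch. The record and path values are hypothetical and chosen only to exercise the escape characters described above:

```python
# A hypothetical tab-separated record with a trailing newline.
record = 'name\tsize\n'

# repr() shows the escape sequences; print() interprets them.
print(repr(record))
print(record, end='')

# Each escape sequence counts as a single character in the string.
print(len(record))  # 4 + 1 + 4 + 1 = 10

# Escaped backslashes and quotes are stored as literal characters.
path = 'C:\\Users\\evidence'
print(path)
quoted = 'It\'s "quoted"'
print(quoted)
```

Note that the escape sequence is only a way of writing the character in source code; the string itself holds a single tab, newline, or backslash.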

We can also use the add (+) or multiply (*) operators with strings. The add operator is used to concatenate strings together, and the multiply operator will repeat the provided string values:

>>> 'Hello' + ' ' + 'World'
'Hello World'
>>> "Are we there yet? " * 3
'Are we there yet? Are we there yet? Are we there yet? '
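
The two operators combine naturally. The sketch below, which uses made-up report values, builds a title by concatenation and an underline by repetition. It also shows that a number has to be converted with str() before it can be concatenated to a string:

```python
# Hypothetical values for a small report header.
title = 'Case' + ' ' + 'Summary'
underline = '-' * len(title)

print(title)
print(underline)

# Concatenation only works between strings, so convert numbers first.
count = 3
message = 'Files found: ' + str(count)
print(message)
```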

Let's look at some common methods we use with strings. We can remove characters from the beginning or end of a string using the strip() method. The strip() method takes the characters we want to remove as its input; if none are given, it removes whitespace by default. Similarly, the replace() method takes two inputs: the substring to replace and what to replace it with. The major difference between these two methods is that strip() only looks at the beginning and end of a string, whereas replace() acts on every occurrence:

# This will remove colons (`:`) from the beginning and end of the line
>>> ':HelloWorld:'.strip(':')
'HelloWorld'

# This will remove the colon (`:`) from the line and place a
# space (` `) in its place
>>> 'Hello:World'.replace(':', ' ')
'Hello World'
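
A slightly longer sketch may help contrast the two methods. The sample values are hypothetical. One detail worth noticing is that the argument to strip() is treated as a set of characters to remove, not as an exact prefix or suffix:

```python
# strip() with no argument removes leading and trailing whitespace.
padded = '  evidence.txt \n'
print(repr(padded.strip()))

# The argument is a set of characters, not a fixed prefix or suffix.
wrapped = '::-Hello:World-::'
trimmed = wrapped.strip(':-')
print(trimmed)

# The inner colon survives strip() but not replace().
replaced = trimmed.replace(':', ' ')
print(replaced)
```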

We can check whether a character or substring is in a string using the in operator. Or we can be more specific and check whether a string startswith() or endswith() a given substring (you know a language is easy to understand when you can create sensible sentences out of method names). These checks return the Boolean objects True or False:

>>> 'a' in 'Chapter 2'
True
>>> 'Chapter 1'.startswith('Chapter')
True
>>> 'Chapter 1'.endswith('1')
True
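
These checks are often used to filter a list of values. The following sketch assumes a hypothetical list of file names; it relies on the fact that startswith() and endswith() also accept a tuple of candidates and that all three checks are case-sensitive:

```python
# Hypothetical file names, used only for illustration.
names = ['report.docx', 'IMG_0001.jpg', 'IMG_0002.png', 'notes.txt']

# endswith() accepts a tuple of suffixes and matches any of them.
images = [n for n in names if n.endswith(('.jpg', '.png'))]
print(images)

# The checks are case-sensitive; normalize the case first if needed.
exact = 'img' in 'IMG_0001.jpg'
normalized = 'img' in 'IMG_0001.jpg'.lower()
print(exact, normalized)
print('IMG_0001.jpg'.lower().startswith('img'))
```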

We can quickly split a string into a list based on some delimiter with the split() method. This is helpful for converting delimited data into a list. For example, comma-separated values (CSV) data is separated by commas and could be split on that value:

>>> print("Hello, World!".split(','))
['Hello', ' World!']
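
Notice the leading space in the second element above: split() removes only the delimiter itself. A minimal sketch of one way to clean up the pieces, using a made-up input line, combines split() with strip():

```python
# A hypothetical comma-separated line.
line = 'alice, 42, analyst'

fields = line.split(',')
print(fields)

# Strip each field to drop the space left after each comma.
cleaned = [field.strip() for field in fields]
print(cleaned)

# join() is the inverse: it builds one string from a list of strings.
rejoined = ','.join(cleaned)
print(rejoined)
```

This simple approach assumes that no field contains a comma of its own; data with quoted fields is better handled by Python's csv module.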

Formatting parameters can be used on strings to manipulate them and convert them based on provided values. With the .format() method, we can insert values into strings, pad numbers, and display patterns with simple formatting. This chapter will highlight a few examples of the .format() method, and we will introduce more complex features of it throughout this book. The .format() method replaces each pair of curly brackets with the provided values, in order.

This is the most basic operation for inserting values into a string dynamically:

>>> "{} {} {} {}".format("Formatted", "strings", "are", "easy!")
'Formatted strings are easy!'

Our second example displays some of the expressions we can use to manipulate a string. Inside the curly brackets, we place a colon, which indicates that we are going to specify a format for interpretation. Following this colon, we specify that at least six characters should be printed. If the supplied input is shorter than six characters, it is padded with leading zeroes. Lastly, the d character specifies that the input will be displayed as a base-10 (decimal) integer:

>>> "{:06d}".format(42)
'000042'
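
The sketch below, which uses hypothetical item numbers, applies the same format specification to several values. It illustrates that the width is a minimum rather than a limit, and that one call can fill several formatted fields:

```python
# Hypothetical evidence item numbers.
for number in (7, 42, 123456):
    print("{:06d}".format(number))

# The width is a minimum, so longer values are never truncated.
long_value = "{:06d}".format(1234567)
print(long_value)

# Several fields can be formatted in a single call.
label = "item-{:03d}-of-{:03d}".format(5, 120)
print(label)
```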

Our last example demonstrates how we can easily print a string of 20 equals signs. We state that our fill character is the equals symbol, follow it with the caret (to center the value in the output), and then give the total width of the output. Because the value we format is an empty string, the entire width is filled with the fill character. By providing this format string, we can quickly create visual separators in our outputs:

>>> "{:=^20}".format('')
'===================='

While we will introduce more advanced features of the .format() method later in this book, the site https://pyformat.info/ is a great resource for learning more about the capabilities of Python's string formatting.
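
Returning to the fill-and-align example, here is a brief sketch of two variations. The title text and widths are arbitrary choices for illustration: a non-empty value is centered inside the fill characters, and the < and > symbols align the value to the left or right instead:

```python
# Center a (hypothetical) section title in a 20-character line of '='.
header = "{:=^20}".format(' Results ')
print(header)

# '<' and '>' align the value to the left or right instead of centering.
left = "{:-<10}".format('abc')
right = "{:->10}".format('abc')
print(left)
print(right)
```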