Framework - pykdump/ directory

Program developers import everything as:

from pykdump.API import *

This document describes the submodules included in the pykdump package, in particular the logical separation of functionality between submodules.

Directory Structure

pykdump/
        __init__.py         - check versions of Python code and C-module
        API.py              - main module
        ASCII_Art.py        - ASCII art for tree-like structures
        datatypes.py        - mapping C types to Python types
        dlkmload.py         - loading DLKM debuginfo
        Generic.py          - generic useful subroutines/classes
        highlevel.py        - high-level readers for data
        logging.py          - log messages to be printed as summary
        lowlevel.py         - low-level readers for data
        memocaches.py       - memoization of types and data
        tparser.py          - parse pieces of text in C syntax
        vmcorearch.py       - get vmcore arch specific info

Module Interdependencies

In general, it is a bad idea for two modules to import each other, as this can lead to circular imports. (Searching the web for 'python circular imports' turns up many articles describing what can go wrong.) The simplest remedy is to combine two such modules into a single one, but this is not always desirable:

  • The resulting file will be bigger, more difficult to read and understand.

  • It is not a good programming practice to put completely unrelated classes/subroutines into a single source file; it makes maintenance more difficult. If two developers work on logically different things, but source code is in one file, this is not convenient for commits and merges.

It is OK to have some interdependencies if they are limited and implemented properly.

API.py

This is the main module used by program developers. It does several things:

  • imports all needed subroutines/classes/variables from other modules so that there is no need to do this manually

  • parses global options (such as the timeout value for crash commands) and processes them as needed. These options are stripped from the argument list passed to programs.

  • reinitializes logging before each program run and prints a summary of logged messages when the program exits
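
The option stripping described above can be sketched with the standard argparse module. This is only an illustration, not the actual API.py code: the function name split_global_options and the option name --timeout are assumptions made for this example.

```python
import argparse


def split_global_options(argv):
    """Separate framework-wide options from the program's own arguments."""
    parser = argparse.ArgumentParser(add_help=False)
    # '--timeout' is a hypothetical option name, used for illustration only
    parser.add_argument("--timeout", type=int, default=120)
    # parse_known_args() returns the recognized options and the leftovers
    global_opts, program_args = parser.parse_known_args(argv)
    return global_opts, program_args


opts, rest = split_global_options(["--timeout", "300", "-v", "tasks"])
print(opts.timeout)   # value consumed by the framework
print(rest)           # arguments left for the program itself
```

With this approach the program only sees the arguments that the framework did not recognize.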

__init__.py

This module contains the version number of the Python API and specifies the minimum version of the C-module needed for this API to work.

If you add a new subroutine to the C-module and your updated Python code uses it, that Python code will no longer work with an old C-module.

This is unimportant for end-users - those who rely on the binary mpykdump.so module. But developers might pull new commits and try to run Python code without rebuilding the C-module, and this can create problems.

To check for such problems, we compare the C-module version (specified in C-sources) and Python-API version (specified in __init__.py).
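
A version comparison of this kind might look like the sketch below. The function name check_c_module and the dotted version-string format are assumptions for this example; the real check in __init__.py may be organized differently.

```python
def check_c_module(c_version, minimal_version):
    """Raise RuntimeError if the C-module is older than the API requires.

    Both arguments are dotted version strings such as "3.2.1"
    (an assumed format, chosen for illustration).
    """
    c_tuple = tuple(int(part) for part in c_version.split("."))
    min_tuple = tuple(int(part) for part in minimal_version.split("."))
    # Tuples compare element by element, so (3, 10, 0) > (3, 9, 2),
    # which a plain string comparison would get wrong
    if c_tuple < min_tuple:
        raise RuntimeError(
            "C-module {} is too old, need at least {}".format(
                c_version, minimal_version))
    return c_tuple


check_c_module("3.10.0", "3.9.2")   # passes silently
```

Converting to integer tuples before comparing avoids the pitfall of comparing "3.10.0" and "3.9.2" as strings.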

datatypes.py

This module defines classes used to represent information about types, structs, enumerations, etc. to be used by high-level subroutines.

We extract symbolic data using Python bindings to GDB internals (implemented in the crash module, written in C) and then we need to convert this information to objects suitable for Python.

This module mainly defines classes and some auxiliary subroutines; conversion from GDB data to instances of these classes is implemented in lowlevel.py.

dlkmload.py

To access kernel symbols/structs defined in DLKMs, we need to load their debuginfo on demand.

Depending on your distribution, these files can have different suffixes:

  • .ko.debug

  • .o.debug

  • .ko

  • .o

and they can be located in different directories. In addition, loading DLKM debuginfo might invalidate PyKdump caches (e.g. if a struct with the same name is present both in the DLKM and kernel).
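
Searching for such a file could be sketched as follows. The function name find_debuginfo and the search order are assumptions for illustration and do not reproduce the actual dlkmload.py logic.

```python
import os

# Suffixes listed above, tried in this order (the order is an assumption)
DEBUG_SUFFIXES = (".ko.debug", ".o.debug", ".ko", ".o")


def find_debuginfo(modname, directories):
    """Return the path of the first matching debuginfo file, or None."""
    for topdir in directories:
        # Walk each directory tree, as the files may be nested deeply
        for root, _subdirs, files in os.walk(topdir):
            for suffix in DEBUG_SUFFIXES:
                candidate = modname + suffix
                if candidate in files:
                    return os.path.join(root, candidate)
    return None
```

A caller would pass the module name and a list of candidate directories, and fall back to reporting an error when None is returned.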

This module provides a number of helper subroutines to load/unload these debuginfos for DLKMs.

Generic.py

Useful subroutines not directly related to vmcore analysis: lazy evaluation, containers, registering handlers for module-level debugging.

highlevel.py

File highlevel.py contains code that will be used by developers of programs. It is imported by API.py.

There are several logical groups of subroutines:

  • read data at a specific address (both virtual and physical)

  • read a specific global symbol

  • subroutines to work with lists

  • obtaining information about structs (e.g. member offsets)

  • executing built-in GDB and crash commands

logging.py

If your program produces lots of output, it is difficult to quickly find the important things. It usually makes sense to display a summary of all "important" findings (such as critical errors) after the end of normal output, when the program exits.

logging.py implements the PyLog class. It is a singleton, so doing:

pylog = PyLog()

in any of your own modules will use the same underlying data. Logging is reinitialized every time you start a program, and on program exit a summary is displayed.
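
The singleton behaviour can be illustrated with a minimal sketch. SummaryLog below is a hypothetical stand-in written for this example; it is not the real PyLog class, and its method names are assumptions.

```python
class SummaryLog:
    """Hypothetical stand-in for a singleton logger (not the real PyLog)."""
    _instance = None

    def __new__(cls):
        # Create the shared instance only once; later calls return it
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.messages = []
        return cls._instance

    def error(self, msg):
        self.messages.append(("ERROR", msg))

    def reset(self):
        """Forget earlier messages, e.g. before a new program run."""
        self.messages.clear()

    def summary(self):
        return ["{}: {}".format(level, msg) for level, msg in self.messages]


log_a = SummaryLog()
log_b = SummaryLog()          # same object as log_a
log_a.error("corrupted list detected")
print(log_b.summary())        # the message logged through log_a is visible
```

Because every call to the constructor returns the same object, messages logged in one module show up in the summary produced from any other module.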

lowlevel.py

This module contains code to construct instances of classes defined in datatypes.py. This module is rather low-level, used internally by the framework but not by developers of programs. The contents of this module are used by highlevel.py. So highlevel.py imports from lowlevel.py but not vice-versa.

For objects representing the contents of a struct/union, we need to implement struct field access/dereference. struct/union fields can be of different types, so to implement such access we need to analyze the type of each field and use an appropriate subroutine. In the PyKdump source code, such subroutines are called readers. During analysis of a specific struct/union type, we create and store readers for each field, so that they will be used for all structs of this type (results of analysis are cached).

Readers are implemented as closures, to preserve information about extra specifiers of the field type. For example, for arrays the reader needs to take into account array dimensions. The reader factory functions currently are:

  • ptrReader - reading pointers

  • suReader - reading structs/unions

  • ti_boolReader - reading booleans

  • ti_enumReader - reading enumerations

  • ti_intReader - reading all integer types
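
The closure technique can be sketched as below. The factory make_int_reader is a hypothetical, simplified example (little-endian data, one-dimensional arrays only) and is not the code of any of the factories listed above.

```python
import struct


def make_int_reader(size, signed=True, dims=None):
    """Return a reader for an integer field (hypothetical sketch).

    size  - integer size in bytes (1, 2, 4 or 8)
    dims  - number of elements for a 1-D array field, None for a scalar
    """
    fmt_char = {1: "b", 2: "h", 4: "i", 8: "q"}[size]
    if not signed:
        fmt_char = fmt_char.upper()
    count = dims if dims else 1
    fmt = "<{}{}".format(count, fmt_char)   # little-endian is assumed here

    def reader(buf, offset):
        # 'fmt' and 'dims' are remembered by the closure
        values = struct.unpack_from(fmt, buf, offset)
        return list(values) if dims else values[0]

    return reader


# A fake struct image: one signed int followed by unsigned short[3]
buf = struct.pack("<i3H", -5, 1, 2, 3)
read_int = make_int_reader(4)
read_arr = make_int_reader(2, signed=False, dims=3)
print(read_int(buf, 0), read_arr(buf, 4))
```

The type analysis is done once, when the reader is created; each later field access only calls the stored closure.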

memocaches.py

Memoization classes and decorators, subroutines for cache maintenance.

Some operations are quite CPU-expensive - for example, obtaining and analyzing symbolic info about structs/unions. To improve performance, it makes sense to cache the results, so that we would not repeat the expensive computations again and again.

Another group of CPU-intensive operations is related to executing built-in crash commands (e.g. kmem -s). Once again, it makes sense to cache the results.

Caching depends on whether we are running a live session or using a vmcore; if we are using a live kernel, some things change with time.

Loading DLKM debuginfo might change struct definitions, so some caches should be invalidated after such operations.
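
A memoizing decorator with explicit invalidation might be sketched like this. The names memoize, purge_caches and struct_size are assumptions for the example and are not taken from memocaches.py.

```python
import functools

_all_caches = []      # every cache created by @memoize is registered here


def memoize(func):
    """Cache results of func keyed by its positional arguments."""
    cache = {}
    _all_caches.append(cache)

    @functools.wraps(func)
    def wrapper(*args):
        if args not in cache:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


def purge_caches():
    """Invalidate all caches, e.g. after loading DLKM debuginfo."""
    for cache in _all_caches:
        cache.clear()


calls = []            # records how often the expensive function really runs


@memoize
def struct_size(name):
    calls.append(name)
    return len(name) * 8      # stand-in for an expensive computation


struct_size("task_struct")
struct_size("task_struct")    # served from the cache
purge_caches()
struct_size("task_struct")    # recomputed after invalidation
print(len(calls))
```

Keeping a registry of caches makes it possible to invalidate all of them in one place when struct definitions may have changed.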

ASCII_Art.py

ASCII art for displaying tree-like structures.

tparser.py

In some cases, we cannot extract the needed information from debuginfo. This module implements simple parsers for C-text, so that we can copy a definition from kernel sources (C) and convert it to the format used by PyKdump. In particular, we can copy a block of #define statements and convert it to a dictionary.
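
Converting a block of #define statements to a dictionary can be sketched with a regular expression. The function defines_to_dict and the sample macro names are hypothetical; this simplified version handles only plain integer constants and does not reproduce the tparser.py implementation.

```python
import re

# Matches '#define NAME VALUE'; function-like macros do not match
_DEFINE_RE = re.compile(r"^\s*#\s*define\s+(\w+)\s+(\S+)")


def defines_to_dict(text):
    """Convert simple integer #define lines to a {name: value} dict."""
    result = {}
    for line in text.splitlines():
        match = _DEFINE_RE.match(line)
        if not match:
            continue
        name, value = match.groups()
        try:
            # Base 0 lets int() accept both decimal and 0x... literals
            result[name] = int(value, 0)
        except ValueError:
            continue          # skip values that are not plain integers
    return result


sample = """
#define FLAG_A   0x01
#define FLAG_B   0x02   /* trailing comment */
#define FLAG_C   16
#define MAX(a, b) ((a) > (b) ? (a) : (b))
"""
print(defines_to_dict(sample))
```

Lines that are not simple integer definitions, such as the function-like macro in the sample, are silently ignored in this sketch.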

vmcorearch.py

C language definitions for integers are rather ambiguous; for example, the size of long int can be either 4 bytes or 8 bytes. This depends on the hardware used to run Linux, and we need this information to be able to read integers (and pointers) properly.
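
The effect of the integer size can be seen in a small sketch. The helper read_long is hypothetical; it decodes a signed long from a byte buffer given the size reported for the architecture.

```python
def read_long(buf, offset, long_size, little_endian=True):
    """Decode a signed 'long' of long_size bytes from buf (sketch)."""
    raw = buf[offset:offset + long_size]
    byteorder = "little" if little_endian else "big"
    return int.from_bytes(raw, byteorder, signed=True)


# The same eight bytes, interpreted with two different sizes of long
memory = bytes([0xFF] * 4 + [0x00] * 4)
print(read_long(memory, 0, 4))   # 4-byte long
print(read_long(memory, 0, 8))   # 8-byte long
```

The same bytes decode to -1 with a 4-byte long and to 4294967295 with an 8-byte long, which is why the size must be known before any integer is read.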

This module extracts from a vmcore some basic data needed to perform analysis:

  • arch-specific data (integer sizes, etc.)

  • HZ, PAGESIZE, PAGE_CACHE_SHIFT, CPUS

  • kernel revision and directory of the vmcore

  • standard directories used for DLKM debuginfo search

  • whether this is a vmcore or we are running on a live kernel