API Reference

This page is automatically generated from the source code docstrings.

kegger.kegg_tools

clean_entry(entry)

Standardizes and cleans raw KEGG record fields.

This internal helper takes a dictionary of raw tags and values and applies specific parsing logic based on the KEGG field type (e.g., splitting GENE identifiers from their descriptions or parsing PATHWAY_MAP strings).

Parameters:

entry (dict, required): A dictionary where keys are KEGG tags (e.g., 'GENE') and values are lists of raw string lines.

Returns:

dict: A cleaned dictionary where values are strings or lists of strings, structured for easier data analysis.

Note

Special handling is applied to 'GENE' fields to separate gene IDs from their associated 'ORTHOLOG' identifiers.
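The GENE handling can be pictured in isolation. The following is a minimal standalone sketch, not the library's code path: the helper name `split_gene_lines` and the sample lines are made up for illustration. It splits each line on the first run of whitespace, so the second list receives the rest of the line.

```python
def split_gene_lines(lines):
    """Split 'ID  description' lines into parallel lists.

    Hypothetical helper for illustration only; a line with no
    description contributes an empty string to the second list.
    """
    gene_ids, descriptions = [], []
    for line in lines:
        parts = line.split(None, 1)  # split on the first run of whitespace only
        if not parts:
            continue  # skip blank lines
        gene_ids.append(parts[0])
        descriptions.append(parts[1].strip() if len(parts) > 1 else "")
    return gene_ids, descriptions


# Made-up sample lines in the general shape of a KEGG GENE field.
sample = [
    "gene_0001  example protein A [KO:K00001]",
    "gene_0002  example protein B [KO:K00002]",
]
gene_ids, descriptions = split_gene_lines(sample)
print(gene_ids)
print(descriptions)
```

Collecting the two lists in an explicit loop keeps them the same length even when a line has an identifier but no description.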

Source code in src/kegger/kegg_tools.py
def clean_entry(entry: dict) -> dict:
    """
        Standardizes and cleans raw KEGG record fields.

        This internal helper takes a dictionary of raw tags and values and applies
        specific parsing logic based on the KEGG field type (e.g., splitting GENE
        identifiers from their descriptions or parsing PATHWAY_MAP strings).

        Args:
            entry (dict): A dictionary where keys are KEGG tags (e.g., 'GENE')
                and values are lists of raw string lines.

        Returns:
            dict: A cleaned dictionary where values are strings or lists
                of strings, structured for easier data analysis.

        Note:
            Special handling is applied to 'GENE' fields to separate gene IDs
            from their associated 'ORTHOLOG' identifiers.
    """
    cleaned_entry = defaultdict(list)
    for tag, value in entry.items():
        if tag == "ENTRY":
            cleaned_entry[tag] = value[0].split()
        elif tag in ("NAME", "ORGANISM"):
            cleaned_entry[tag] = value[0].strip()
        elif tag == "GENE":
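            # Split each line on the first run of whitespace; a line without a
            # description yields an empty string instead of breaking the unpacking.
            pairs = [v.split(None, 1) for v in value if v.strip()]
            cleaned_entry[tag] = [p[0] for p in pairs]
            cleaned_entry["ORTHOLOG"] = [p[1] if len(p) > 1 else "" for p in pairs]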
        elif tag in ["REL_PATHWAY"]:
            cleaned_entry[tag] = [v.strip() for v in value if v.strip()]
        elif tag == "PATHWAY_MAP":
            cleaned_entry[tag] = value[0].split("  ")
        elif tag in ("PATHWAY", "GENES", "REACTION", "MODULE"):
            for v in value:
                cleaned_entry[tag].append(v.strip())
        else:
            cleaned_entry[tag] = value[0].strip()

    return cleaned_entry

genes_to_pathways(org)

Retrieves the mapping between genes and their associated pathways for an organism.

Queries the KEGG 'link' endpoint to produce a many-to-many map of pathways and gene identifiers. This is useful for enrichment analysis or finding all genes within a specific biological process.

Parameters:

org (str, required): The KEGG organism code (e.g., 'shn' or 'eco').

Returns:

pd.DataFrame: A two-column DataFrame:
- 'pathid': The KEGG pathway identifier (e.g., 'shn00010').
- 'gene': The specific gene identifier (e.g., 'Shewana3_0001').

Example

df_map = genes_to_pathways('shn')

To find all genes in a specific pathway:

glycolysis = df_map[df_map['pathid'] == 'shn00010']
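To make the many-to-many shape concrete without a network call, here is a small standard-library sketch. It assumes the response is two tab-separated columns carrying 'path:' and organism prefixes, as described above. The sample text and the function name `build_lookups` are illustrative assumptions, not part of kegger.

```python
import csv
import io
from collections import defaultdict

# Made-up text in the two-column, tab-delimited shape described above.
sample_response = (
    "path:shn00010\tshn:gene_0001\n"
    "path:shn00010\tshn:gene_0002\n"
    "path:shn00020\tshn:gene_0002\n"
)


def build_lookups(text, org):
    """Return (pathway -> genes, gene -> pathways) dictionaries."""
    genes_by_pathway = defaultdict(list)
    pathways_by_gene = defaultdict(list)
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) != 2:
            continue  # ignore blank or malformed lines
        pathid = row[0].removeprefix("path:")
        gene = row[1].removeprefix(f"{org}:")
        genes_by_pathway[pathid].append(gene)
        pathways_by_gene[gene].append(pathid)
    return genes_by_pathway, pathways_by_gene


genes_by_pathway, pathways_by_gene = build_lookups(sample_response, "shn")
print(genes_by_pathway["shn00010"])   # genes assigned to one pathway
print(pathways_by_gene["gene_0002"])  # pathways that share one gene
```

A gene can appear under several pathways and a pathway under several genes, which is why both directions are kept as lists.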

Source code in src/kegger/kegg_tools.py
def genes_to_pathways(org: str) -> pd.DataFrame:
    """
    Retrieves the mapping between genes and their associated pathways for an organism.

    Queries the KEGG 'link' endpoint to produce a many-to-many map of pathways
    and gene identifiers. This is useful for enrichment analysis or finding
    all genes within a specific biological process.

    Args:
        org: The KEGG organism code (e.g., 'shn' or 'eco').

    Returns:
        pd.DataFrame: A two-column DataFrame:
            - 'pathid': The KEGG pathway identifier (e.g., 'shn00010').
            - 'gene': The specific gene identifier (e.g., 'Shewana3_0001').

    Example:
        >>> df_map = genes_to_pathways('shn')
        >>> # To find all genes in a specific pathway:
        >>> glycolysis = df_map[df_map['pathid'] == 'shn00010']
    """
    url = f'https://rest.kegg.jp/link/{org}/pathway'
    response = get_url(url)
    record = io.StringIO(response)
    col_names = ["pathid", "gene"]
    df = pd.read_csv(record, sep="\t", header=None, names=col_names)
    # Removing the 'org' and 'path:' prefixes
    df["pathid"] = df["pathid"].str.replace("path:", "", regex=False)
    df["gene"] = df["gene"].str.replace(f"{org}:", "", regex=False)
    return df

get_org(org)

Retrieve and parse a KEGG organism genome list.

Connects to the KEGG REST API 'list' endpoint to retrieve gene-level metadata and converts the tab-delimited response into a structured DataFrame.

Parameters

org : str
    The three- or four-letter KEGG organism identifier.

Returns

df : pandas.DataFrame
    DataFrame with the following columns:
    - gene: KEGG gene identifier (e.g., 'Shewana3_0001')
    - feature: Biological category (e.g., 'CDS', 'RNA')
    - position: Chromosomal coordinates
    - annotation: Functional description/gene name
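The four-column layout can be explored offline with a short standard-library sketch. The sample rows, the `COLUMNS` constant, and the function name `parse_gene_list` are assumptions made for illustration. The real response may differ in detail.

```python
import csv
import io
from collections import Counter

COLUMNS = ["gene", "feature", "position", "annotation"]

# Made-up rows in the four-column, tab-delimited shape described above.
sample_response = (
    "shn:gene_0001\tCDS\t1..1386\texample protein A\n"
    "shn:gene_0002\tCDS\tcomplement(1400..2500)\texample protein B\n"
    "shn:gene_0003\ttRNA\t2600..2675\texample tRNA\n"
)


def parse_gene_list(text, org):
    """Turn tab-delimited text into a list of dicts keyed by COLUMNS."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = dict(zip(COLUMNS, fields))
        row["gene"] = row["gene"].removeprefix(f"{org}:")  # drop the organism prefix
        rows.append(row)
    return rows


rows = parse_gene_list(sample_response, "shn")
feature_counts = Counter(row["feature"] for row in rows)
print(feature_counts)
```

Counting the feature column like this is a quick sanity check on how many coding and RNA genes were returned.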

Source code in src/kegger/kegg_tools.py
def get_org(org: str) -> pd.DataFrame:
    """
    Retrieve and parse a KEGG organism genome list.

    Connects to the KEGG REST API 'list' endpoint to retrieve gene-level
    metadata and converts the tab-delimited response into a structured DataFrame.

    Parameters
    ----------
    org : str
        The three- or four-letter KEGG organism identifier.

    Returns
    -------
    df : pandas.DataFrame
        DataFrame with the following columns:
        - gene: KEGG gene identifier (e.g., 'Shewana3_0001')
        - feature: Biological category (e.g., 'CDS', 'RNA')
        - position: Chromosomal coordinates
        - annotation: Functional description/gene name
    """
    url = f"https://rest.kegg.jp/list/{org}"
    response = requests.get(url)
    record = io.StringIO(response.text)
    cols = ["gene", "feature", "position", "annotation"]
    df = pd.read_csv(record, sep="\t", header=None, names=cols)
    # Removing the 'org' prefix
    df["gene"] = df["gene"].str.replace(f"{org}:", "", regex=False)
    return df

initialize_kegger(cache_path=None, expire_days=30)

Sets up a persistent local cache for KEGG API requests.

This function initializes a SQLite database to store API responses. Subsequent calls to the same KEGG URL will pull data from the local cache instead of the internet, significantly speeding up data processing and reducing server load.

Parameters:

cache_path (str | None, default None): The filename or path for the SQLite cache. If None, "kegg_cache" is used (which creates 'kegg_cache.sqlite').

expire_days (int, default 30): How many days a cached response remains valid before a fresh request is forced.

Note

If a cache file already exists at the specified path, this function will automatically load and reuse it.
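The expiry rule can be pictured with a small standard-library sketch. It is not how requests_cache is implemented; the class name `ExpiringCache`, the table layout, and the `now` parameter (which makes time controllable in the example) are all assumptions made for illustration.

```python
import sqlite3
import time


class ExpiringCache:
    """Tiny URL -> text cache with a time-to-live, for illustration only."""

    def __init__(self, path=":memory:", expire_days=30):
        self.ttl_seconds = expire_days * 24 * 60 * 60
        self.conn = sqlite3.connect(path)
        # Reuses the table if the database file already exists.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS responses "
            "(url TEXT PRIMARY KEY, body TEXT, stored_at REAL)"
        )

    def get(self, url, now=None):
        """Return the cached body, or None if missing or older than the TTL."""
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT body, stored_at FROM responses WHERE url = ?", (url,)
        ).fetchone()
        if row is None or now - row[1] > self.ttl_seconds:
            return None
        return row[0]

    def put(self, url, body, now=None):
        """Store or overwrite the body for a URL with the current timestamp."""
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO responses VALUES (?, ?, ?)", (url, body, now)
        )
        self.conn.commit()


url = "https://rest.kegg.jp/list/pathway/eco"
cache = ExpiringCache(expire_days=30)
cache.put(url, "cached text", now=0)
fresh = cache.get(url, now=10)            # well inside the 30-day window
stale = cache.get(url, now=31 * 86400)    # 31 days later: treated as expired
print(fresh, stale)
```

A caller would fall back to a real HTTP request whenever `get` returns None and then call `put` with the new response.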

Source code in src/kegger/kegg_tools.py
def initialize_kegger(cache_path: str | None = None, expire_days: int = 30):
    """
        Sets up a persistent local cache for KEGG API requests.

        This function initializes a SQLite database to store API responses. Subsequent
        calls to the same KEGG URL will pull data from the local cache instead of
        the internet, significantly speeding up data processing and reducing
        server load.

        Args:
            cache_path (str | None): The filename or path for the SQLite cache.
                Defaults to "kegg_cache" (which creates 'kegg_cache.sqlite').
            expire_days (int): How many days a cached response remains valid
                before a fresh request is forced. Defaults to 30.

        Note:
            If a cache file already exists at the specified path, this function
            will automatically load and reuse it.
    """
    if cache_path is None:
        cache_path = "kegg_cache"
    # requests_cache takes the expiry as 'expire_after'.
    requests_cache.install_cache(cache_path,
                                 backend="sqlite",
                                 expire_after=timedelta(days=expire_days))

kegg_parser(request_text)

Parses a raw KEGG REST API response into a structured dictionary.

This function reads the flat-file format used by KEGG, identifying tags (like ENTRY, NAME, PATHWAY) and capturing their associated data. It utilizes a temporary file for memory-efficient processing of large records.

Parameters:

request_text (str, required): The raw text response from a KEGG REST API call.

Returns:

dict: A processed dictionary containing the parsed and cleaned fields of the KEGG record.
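The tag-and-continuation logic of the flat-file format can be sketched without temporary files. The version below is a simplified illustration that assumes a 12-character tag column; the function name `parse_flat_record` and the sample record are made up, and no field cleaning is applied.

```python
def parse_flat_record(text, tag_width=12):
    """Group the lines of one flat-file record under their tags.

    Hypothetical helper: returns {tag: [raw values]} and stops at '///'.
    """
    record = {}
    current_tag = None
    for line in text.splitlines():
        if line.startswith("///"):
            break  # end-of-record marker
        tag = line[:tag_width].strip()
        value = line[tag_width:].strip()
        if tag:
            current_tag = tag
            record[current_tag] = [value]
        elif current_tag is not None:
            # A blank tag column marks a continuation of the previous tag.
            record[current_tag].append(value)
    return record


# Made-up record: tags sit in the first 12 columns, values start at column 13.
sample_record = "\n".join([
    "ENTRY".ljust(12) + "xyz00010  Pathway",
    "NAME".ljust(12) + "Example pathway",
    "GENE".ljust(12) + "gene_0001  example protein A",
    " " * 12 + "gene_0002  example protein B",
    "///",
])
record = parse_flat_record(sample_record)
print(record["GENE"])
```

Each tag maps to a list so that multi-line fields such as GENE keep one entry per source line.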

Source code in src/kegger/kegg_tools.py
def kegg_parser(request_text: str) -> dict:
    """
        Parses a raw KEGG REST API response into a structured dictionary.

        This function reads the flat-file format used by KEGG, identifying tags
        (like ENTRY, NAME, PATHWAY) and capturing their associated data. It
        utilizes a temporary file for memory-efficient processing of large records.

        Args:
            request_text (str): The raw text response from a KEGG REST API call.

        Returns:
            dict: A processed dictionary containing the parsed and cleaned
                fields of the KEGG record.
    """
    res = io.StringIO(request_text)
    with tempfile.NamedTemporaryFile(delete=False, mode="w+", encoding="utf-8") as file_path:
        shutil.copyfileobj(res, file_path)
        file_path_name = file_path.name

    current_key = None
    saved_rec = dict()

    try:
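        # Read back with the same encoding the temp file was written in.
        with open(file_path_name, "r", encoding="utf-8") as entry_file:
            for line in entry_file:
                if line.startswith("///"):
                    break
                tag = line[:12].strip()
                value = line[12:].strip()
                if tag:
                    current_key = tag
                    saved_rec[current_key] = [value]
                elif current_key is not None:
                    # Continuation line; ignore any that precede the first tag.
                    saved_rec[current_key].append(value)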
        cleaned_recs = clean_entry(saved_rec)
    finally:
        # Putting this in 'finally' ensures the temp file is deleted
        # even if the parsing logic above crashes.
        if os.path.exists(file_path_name):
            os.remove(file_path_name)

    return cleaned_recs

list_all_pathways(org)

Retrieves a list of all KEGG pathways for a specific organism.

This function queries the KEGG REST API to find every metabolic and signaling pathway associated with the provided organism code. It automatically cleans the 'path:' prefix from the results to simplify downstream data merging.

Parameters:

org (str, required): The 3-4 letter KEGG organism code (e.g., 'eco' for E. coli, 'hsa' for humans, or 'mmu' for mouse).

Returns:

pd.DataFrame: A DataFrame with two columns:
- 'pathid': The unique KEGG pathway identifier (e.g., 'eco00010').
- 'description': The human-readable name of the pathway.

Example

import kegger
pathways = kegger.list_all_pathways('eco')
print(pathways.head())
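Stripping the 'path:' prefix matters when pathway names are joined to other tables keyed by the bare identifier. The sketch below illustrates that join with plain dictionaries. The sample text, the link pairs, and the function names are assumptions for illustration, and the prefix is assumed to be present in the raw text as described above.

```python
import csv
import io

# Made-up pathway list in 'identifier<TAB>description' form.
sample_pathway_list = (
    "path:eco00010\tExample pathway A\n"
    "path:eco00020\tExample pathway B\n"
)

# Made-up (pathid, gene) pairs that already use bare identifiers.
sample_links = [
    ("eco00010", "gene_0001"),
    ("eco00020", "gene_0001"),
    ("eco00030", "gene_0002"),  # no matching description in the list above
]


def pathway_descriptions(text):
    """Map bare pathway identifiers to their descriptions."""
    names = {}
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) == 2:
            names[row[0].removeprefix("path:")] = row[1]
    return names


def annotate_links(links, names):
    """Attach a description to each (pathid, gene) pair; '' when unknown."""
    return [(gene, pathid, names.get(pathid, "")) for pathid, gene in links]


names = pathway_descriptions(sample_pathway_list)
annotated = annotate_links(sample_links, names)
for gene, pathid, description in annotated:
    print(gene, pathid, description)
```

Without the prefix removal, the keys 'path:eco00010' and 'eco00010' would not match and every description lookup would come back empty.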

Source code in src/kegger/kegg_tools.py
def list_all_pathways(org: str) -> pd.DataFrame:
    """
        Retrieves a list of all KEGG pathways for a specific organism.

        This function queries the KEGG REST API to find every metabolic and signaling
        pathway associated with the provided organism code. It automatically cleans
        the 'path:' prefix from the results to simplify downstream data merging.

        Args:
            org (str): The 3-4 letter KEGG organism code (e.g., 'eco' for E. coli,
                'hsa' for humans, or 'mmu' for mouse).

        Returns:
            pd.DataFrame: A DataFrame with two columns:
                - 'pathid': The unique KEGG pathway identifier (e.g., 'eco00010').
                - 'description': The human-readable name of the pathway.

        Example:
            >>> import kegger
            >>> pathways = kegger.list_all_pathways('eco')
            >>> print(pathways.head())
    """
    url = f'https://rest.kegg.jp/list/pathway/{org}'
    response = get_url(url)
    record = io.StringIO(response)
    col_names = ["pathid", "description"]
    df = pd.read_csv(record, sep="\t", header=None, names=col_names)
    df["pathid"] = df["pathid"].str.replace("path:", "", regex=False)
    return df