unix - split a huge .gz file by line

I'm sure someone has had the below need: what is a quick way of splitting a huge .gz file by line? The underlying text file has 120 million rows. I don't have enough disk space to gunzip the entire file at once, so I was wondering if someone knows of a bash/perl script or tool that could split the file (either the .gz or the inner .txt) into three 40-million-line files, i.e. calling it like:



    bash splitter.sh hugefile.txt.gz 40000000 1
would get lines 1 to 40 mn
    bash splitter.sh hugefile.txt.gz 40000000 2
would get lines 40 mn to 80 mn
    bash splitter.sh hugefile.txt.gz 40000000 3
would get lines 80 mn to 120 mn


Is perhaps doing a series of these a solution, or would the gunzip -c require enough space for the entire file to be unzipped (i.e. the original problem)? gunzip -c hugefile.txt.gz | head -n 40000000



Note: I can't get extra disk.



Thanks!

linux perl bash shell unix

asked Jan 23 '12 at 11:21 – toop

migrated from stackoverflow.com Jan 23 '12 at 11:28
This question came from our site for professional and enthusiast programmers.

  • Do you want the resulting files to be gzipped again?

    – user95605
    Jan 23 '12 at 11:25











  • You can use gunzip in a pipe. The rest can be done with head and tail

    – Ingo
    Jan 23 '12 at 11:25











  • @Tichodroma - no, I don't need them gzipped again. But I couldn't store all the split text files at once, so I would like to get the first split, do stuff with it, then delete the first split and get the 2nd split, etc., finally removing the original .gz

    – toop
    Jan 23 '12 at 11:42






  • @toop: Thanks for the clarification. Note that it's generally better to edit your question if you want to clarify it, rather than put it into a comment; that way everyone will see it.

    – sleske
    Jan 23 '12 at 12:06











  • The accepted answer is good if you only want a fraction of the chunks and do not know them in advance. If you want to generate all the chunks at once, the solutions based on split will be a lot faster, O(N) instead of O(N²).

    – b0fh
    Aug 13 '14 at 15:27
7 Answers


















11














How to do this best depends on what you want:




  • Do you want to extract a single part of the large file?

  • Or do you want to create all the parts in one go?




If you want a single part of the file, your idea to use gunzip and head is right. You can use:



gunzip -c hugefile.txt.gz | head -n 4000000


That would output the first 4000000 lines on standard out - you probably want to append another pipe to actually do something with the data.



To get the other parts, you'd use a combination of head and tail, like:



gunzip -c hugefile.txt.gz | head -n 8000000 | tail -n 4000000


to get the second block.




Is perhaps doing a series of these a solution or would the gunzip -c
require enough space for the entire file to be unzipped




No, the gunzip -c does not require any disk space - it decompresses as a stream in memory and writes the result to stdout.





If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.
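
Putting this together with the asker's process-then-delete workflow from the comments, a minimal sketch of the per-chunk loop, assuming three 40-million-line chunks as in the question and with do_something as a hypothetical placeholder for the actual processing (note the O(N²) caveat from the question comments: the compressed stream is re-read for every chunk):

    for i in 1 2 3; do
        # extract chunk i by re-reading the compressed stream each time
        gunzip -c hugefile.txt.gz | head -n $((i * 40000000)) | tail -n 40000000 > "chunk$i.txt"
        do_something "chunk$i.txt"   # hypothetical processing step
        rm "chunk$i.txt"             # free the disk space before the next chunk
    done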

answered Jan 23 '12 at 11:29, edited Sep 7 '16 at 7:19 – sleske

  • From a performance view: does gzip actually unzip the whole file? Or is it able to "magically" know that only 4 mn lines are needed?

    – Alois Mahdal
    Mar 22 '12 at 12:57






  • @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

    – sleske
    Mar 22 '12 at 15:26











  • But if you are interested, you should really ask this as a separate question.

    – sleske
    Mar 22 '12 at 15:35



















19














Pipe to split, using either gunzip -c or zcat to open the file:



gunzip -c bigfile.gz | split -l 400000


Add output specifications to the split command.
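
For example, a sketch using standard GNU split options (-d requests numeric suffixes, - names stdin as the input, and the final argument sets the output prefix; 40000000 matches the chunk size from the question):

    gunzip -c bigfile.gz | split -l 40000000 -d - hugefile_part_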

answered Jan 23 '12 at 16:41 – jim mcnamara

  • This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

    – b0fh
    Aug 13 '14 at 15:29











  • @b0fh: Yes, you are right. Upvoted, and referenced in my answer :-).

    – sleske
    Sep 7 '16 at 7:20











  • Best answer for sure.

    – Stephen Blum
    Mar 7 '18 at 21:27











  • What are the output specs so that the outputs are .gz files themselves?

    – Quetzalcoatl
    Oct 26 '18 at 5:53



















6














As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.



zcat hugefile.txt.gz | head -n 40000000
zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000
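
Wrapped up as the splitter.sh interface the question asks for, a minimal sketch along the same lines (the script name and argument order are the hypothetical ones from the question; the chunk is streamed to stdout):

    #!/usr/bin/env bash
    # Usage: splitter.sh FILE.gz LINES_PER_CHUNK CHUNK_NUMBER
    file=$1; lines=$2; chunk=$3
    start=$(( (chunk - 1) * lines + 1 ))    # first line of the requested chunk
    zcat "$file" | tail -n +"$start" | head -n "$lines"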

answered Jan 23 '12 at 11:33 – zgpmax

    4














    I'd consider using split.




    split a file into pieces

    answered Jan 23 '12 at 11:24 – Michael Krelin - hacker, edited Feb 20 '12 at 0:22 by Tom Wijsman

      2














      Here's a python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the space necessary in memory for holding the filenames and the current line, plus a little overhead.

      #!/usr/bin/env python
      import gzip, bz2
      import os
      import fnmatch
      import argparse

      def gen_find(filepat, top):
          # walk the directory tree and yield paths matching the glob pattern
          for path, dirlist, filelist in os.walk(top):
              for name in fnmatch.filter(filelist, filepat):
                  yield os.path.join(path, name)

      def gen_open(filenames):
          # open each file with the right decompressor based on its extension
          for name in filenames:
              if name.endswith(".gz"):
                  yield gzip.open(name)
              elif name.endswith(".bz2"):
                  yield bz2.BZ2File(name)
              else:
                  yield open(name)

      def gen_cat(sources):
          # chain the open files into one stream of lines
          for s in sources:
              for item in s:
                  yield item

      def main(filepat, searchDir):
          fileNames = gen_find(filepat, searchDir)
          fileHandles = gen_open(fileNames)
          fileLines = gen_cat(fileHandles)
          for line in fileLines:
              print line

      if __name__ == '__main__':
          parser = argparse.ArgumentParser(description='Search globbed files line by line')
          parser.add_argument('filepat', type=str, default='*', help='Glob pattern to match')
          parser.add_argument('searchDir', type=str, default='.', help='Directory to search')
          args = parser.parse_args()
          main(args.filepat, args.searchDir)

      The print line command will send every line to stdout, so you can redirect to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the python script and you won't need to leave chunks of the file lying around.
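
      For instance, a sketch of chaining the script's output into split (assuming the script is saved as gencat.py, a hypothetical name; the first argument is a shell-style glob handled by fnmatch):

          python gencat.py '*.gz' /path/to/dir | split -l 40000000 -d - chunk_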

      answered Jan 23 '12 at 13:39 – Spencer Rathbun

        2














        Here's a perl program that can be used to read stdin, and split the lines, piping each clump to a separate command that can use a shell variable $SPLIT to route it to a different destination. For your case, it would be invoked with



        zcat hugefile.txt.gz | perl xsplit.pl 40000000 'cat > tmp$SPLIT.txt; do_something tmp$SPLIT.txt; rm tmp$SPLIT.txt'



        Sorry the command-line processing is a little kludgy but you get the idea.



        #!/usr/bin/perl -w
        #####
        # xsplit.pl: like xargs but instead of clumping input into each command's args, clumps it into each command's input.
        # Usage: perl xsplit.pl LINES 'COMMAND'
        # where: 'COMMAND' can include shell variable expansions and can use $SPLIT, e.g.
        # 'cat > tmp$SPLIT.txt'
        # or:
        # 'gzip > tmp$SPLIT.gz'
        #####
        use strict;

        sub pipeHandler {
            my $sig = shift @_;
            print "Caught SIGPIPE: $sig\n";
            exit(1);
        }
        $SIG{PIPE} = \&pipeHandler;

        my $LINES = shift;
        die "LINES must be a positive number\n" if ($LINES <= 0);
        my $COMMAND = shift || die "second argument should be COMMAND\n";

        my $line_number = 0;

        while (<STDIN>) {
            if ($line_number % $LINES == 0) {
                close OUTFILE;
                my $split = $ENV{SPLIT} = sprintf("%05d", $line_number / $LINES + 1);
                print "$split\n";
                my $command = $COMMAND;
                open (OUTFILE, "| $command") or die "failed to write to command '$command'\n";
            }
            print OUTFILE $_;
            $line_number++;
        }

        exit 0;




































          1














          Directly split .gz file to .gz files:



          zcat bigfile.gz | split -l 400000 --filter='gzip > $FILE.gz'


          I think this is what the OP wanted, because he doesn't have much space.
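
          To then handle the chunks one at a time, a sketch (with split's default output names the compressed chunks come out as xaa.gz, xab.gz, ...; do_something is a hypothetical placeholder for the actual processing):

              for f in x*.gz; do
                  gunzip -c "$f" | do_something   # stream one chunk into the processing step
                  rm "$f"                         # free the space before the next chunk
              done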






























            Your Answer








            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "3"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fsuperuser.com%2fquestions%2f381394%2funix-split-a-huge-gz-file-by-line%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            7 Answers
            7






            active

            oldest

            votes








            7 Answers
            7






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            11














            How to do this best depends on what you want:




            • Do you want to extract a single part of the large file?

            • Or do you want to create all the parts in one go?




            If you want a single part of the file, your idea to use gunzip and head is right. You can use:



            gunzip -c hugefile.txt.gz | head -n 4000000


            That would output the first 4000000 lines on standard out - you probably want to append another pipe to actually do something with the data.



            To get the other parts, you'd use a combination of head and tail, like:



            gunzip -c hugefile.txt.gz | head -n 8000000 |tail -n 4000000


            to get the second block.




            Is perhaps doing a series of these a solution or would the gunzip -c
            require enough space for the entire file to be unzipped




            No, the gunzip -c does not require any disk space - it does everything in memory, then streams it out to stdout.





            If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.






            share|improve this answer


























            • From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

              – Alois Mahdal
              Mar 22 '12 at 12:57






            • 3





              @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

              – sleske
              Mar 22 '12 at 15:26











            • But if you are interested, you should really ask this as a separate question.

              – sleske
              Mar 22 '12 at 15:35
















            11














            How to do this best depends on what you want:




            • Do you want to extract a single part of the large file?

            • Or do you want to create all the parts in one go?




            If you want a single part of the file, your idea to use gunzip and head is right. You can use:



            gunzip -c hugefile.txt.gz | head -n 4000000


            That would output the first 4000000 lines on standard out - you probably want to append another pipe to actually do something with the data.



            To get the other parts, you'd use a combination of head and tail, like:



            gunzip -c hugefile.txt.gz | head -n 8000000 |tail -n 4000000


            to get the second block.




            Is perhaps doing a series of these a solution or would the gunzip -c
            require enough space for the entire file to be unzipped




            No, the gunzip -c does not require any disk space - it does everything in memory, then streams it out to stdout.





            If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.






            share|improve this answer


























            • From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

              – Alois Mahdal
              Mar 22 '12 at 12:57






            • 3





              @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

              – sleske
              Mar 22 '12 at 15:26











            • But if you are interested, you should really ask this as a separate question.

              – sleske
              Mar 22 '12 at 15:35














            11












            11








            11







            How to do this best depends on what you want:




            • Do you want to extract a single part of the large file?

            • Or do you want to create all the parts in one go?




            If you want a single part of the file, your idea to use gunzip and head is right. You can use:



            gunzip -c hugefile.txt.gz | head -n 4000000


            That would output the first 4000000 lines on standard out - you probably want to append another pipe to actually do something with the data.



            To get the other parts, you'd use a combination of head and tail, like:



            gunzip -c hugefile.txt.gz | head -n 8000000 |tail -n 4000000


            to get the second block.




            Is perhaps doing a series of these a solution or would the gunzip -c
            require enough space for the entire file to be unzipped




            No, the gunzip -c does not require any disk space - it does everything in memory, then streams it out to stdout.





            If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.






            share|improve this answer















            How to do this best depends on what you want:




            • Do you want to extract a single part of the large file?

            • Or do you want to create all the parts in one go?




            If you want a single part of the file, your idea to use gunzip and head is right. You can use:



            gunzip -c hugefile.txt.gz | head -n 4000000


            That would output the first 4000000 lines on standard out - you probably want to append another pipe to actually do something with the data.



            To get the other parts, you'd use a combination of head and tail, like:



            gunzip -c hugefile.txt.gz | head -n 8000000 |tail -n 4000000


            to get the second block.




            Is perhaps doing a series of these a solution or would the gunzip -c
            require enough space for the entire file to be unzipped




            No, the gunzip -c does not require any disk space - it does everything in memory, then streams it out to stdout.





            If you want to create all the parts in one go, it is more efficient to create them all with a single command, because then the input file is only read once. One good solution is to use split; see jim mcnamara's answer for details.







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Sep 7 '16 at 7:19

























            answered Jan 23 '12 at 11:29









            sleskesleske

            18k85383




            18k85383













            • From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

              – Alois Mahdal
              Mar 22 '12 at 12:57






            • 3





              @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

              – sleske
              Mar 22 '12 at 15:26











            • But if you are interested, you should really ask this as a separate question.

              – sleske
              Mar 22 '12 at 15:35



















            • From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

              – Alois Mahdal
              Mar 22 '12 at 12:57






            • 3





              @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

              – sleske
              Mar 22 '12 at 15:26











            • But if you are interested, you should really ask this as a separate question.

              – sleske
              Mar 22 '12 at 15:35

















            From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

            – Alois Mahdal
            Mar 22 '12 at 12:57





            From performance view: does gzip actually unzip whole file? Or is it able to "magically" know that only 4mn lines are needed?

            – Alois Mahdal
            Mar 22 '12 at 12:57




            3




            3





            @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

            – sleske
            Mar 22 '12 at 15:26





            @AloisMahdal: Actually, that would be a good separate question :-). Short version: gzip does not know about the limit (which comes from a different process). If head is used, head will exit when it has received enough, and this will propagate to gzip (via SIGPIPE, see Wikipedia). For tail this is not possible, so yes, gzip will decompress everything.

            – sleske
            Mar 22 '12 at 15:26













            But if you are interested, you should really ask this as a separate question.

            – sleske
            Mar 22 '12 at 15:35





            But if you are interested, you should really ask this as a separate question.

            – sleske
            Mar 22 '12 at 15:35













            19














            pipe to split use either gunzip -c or zcat to open the file



            gunzip -c bigfile.gz | split -l 400000


            Add output specifications to the split command.






            share|improve this answer



















            • 2





              This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

              – b0fh
              Aug 13 '14 at 15:29











            • @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

              – sleske
              Sep 7 '16 at 7:20











            • Best answer for sure.

              – Stephen Blum
              Mar 7 '18 at 21:27











            • what are the output specs so that the outputs are .gz files themselves?

              – Quetzalcoatl
              Oct 26 '18 at 5:53
















            19














            pipe to split use either gunzip -c or zcat to open the file



            gunzip -c bigfile.gz | split -l 400000


            Add output specifications to the split command.






            share|improve this answer



















            • 2





              This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

              – b0fh
              Aug 13 '14 at 15:29











            • @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

              – sleske
              Sep 7 '16 at 7:20











            • Best answer for sure.

              – Stephen Blum
              Mar 7 '18 at 21:27











            • what are the output specs so that the outputs are .gz files themselves?

              – Quetzalcoatl
              Oct 26 '18 at 5:53














            19












            19








            19







            pipe to split use either gunzip -c or zcat to open the file



            gunzip -c bigfile.gz | split -l 400000


            Add output specifications to the split command.






            share|improve this answer













            pipe to split use either gunzip -c or zcat to open the file



            gunzip -c bigfile.gz | split -l 400000


            Add output specifications to the split command.







            share|improve this answer












            share|improve this answer



            share|improve this answer










            answered Jan 23 '12 at 16:41









            jim mcnamarajim mcnamara

            75947




            75947








            • 2





              This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

              – b0fh
              Aug 13 '14 at 15:29











            • @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

              – sleske
              Sep 7 '16 at 7:20











            • Best answer for sure.

              – Stephen Blum
              Mar 7 '18 at 21:27











            • what are the output specs so that the outputs are .gz files themselves?

              – Quetzalcoatl
              Oct 26 '18 at 5:53














            • 2





              This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

              – b0fh
              Aug 13 '14 at 15:29











            • @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

              – sleske
              Sep 7 '16 at 7:20











            • Best answer for sure.

              – Stephen Blum
              Mar 7 '18 at 21:27











            • what are the output specs so that the outputs are .gz files themselves?

              – Quetzalcoatl
              Oct 26 '18 at 5:53








            2




            2





            This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

            – b0fh
            Aug 13 '14 at 15:29





            This is massively more efficient than the accepted answer, unless you only require a fraction of the split chunks. Please upvote.

            – b0fh
            Aug 13 '14 at 15:29













            @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

            – sleske
            Sep 7 '16 at 7:20





            @b0fh: Yes, your are right. Upvoted, and referenced in my answer :-).

            – sleske
            Sep 7 '16 at 7:20













            Best answer for sure.

            – Stephen Blum
            Mar 7 '18 at 21:27





            Best answer for sure.

            – Stephen Blum
            Mar 7 '18 at 21:27













            what are the output specs so that the outputs are .gz files themselves?

            – Quetzalcoatl
            Oct 26 '18 at 5:53





            what are the output specs so that the outputs are .gz files themselves?

            – Quetzalcoatl
            Oct 26 '18 at 5:53











            6














            As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.



            zcat hugefile.txt.gz | head -n 40000000
            zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
            zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000





            share|improve this answer




























              6














              As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.



              zcat hugefile.txt.gz | head -n 40000000
              zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
              zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000





              share|improve this answer


























                6












                6








                6







                As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.



                zcat hugefile.txt.gz | head -n 40000000
                zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
                zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000





                share|improve this answer













                As you are working on a (non-rewindable) stream, you will want to use the '+N' form of tail to get lines starting from line N onwards.



                zcat hugefile.txt.gz | head -n 40000000
                zcat hugefile.txt.gz | tail -n +40000001 | head -n 40000000
                zcat hugefile.txt.gz | tail -n +80000001 | head -n 40000000






                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Jan 23 '12 at 11:33









                zgpmaxzgpmax

                1692




                1692























                    4














                    I'd consider using split.




                    split a file into pieces







                    share|improve this answer






























                      4














                      I'd consider using split.




                      split a file into pieces







                      share|improve this answer




























                        4












                        4








                        4







                        I'd consider using split.




                        split a file into pieces







                        share|improve this answer















                        I'd consider using split.




                        split a file into pieces








                        share|improve this answer














                        share|improve this answer



                        share|improve this answer








                        edited Feb 20 '12 at 0:22









                        Tom Wijsman

                        50.4k24164247




                        50.4k24164247










                        answered Jan 23 '12 at 11:24









                        Michael Krelin - hackerMichael Krelin - hacker

                        58636




                        58636























                            2














                            Here's a python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the space necessary in memory for holding the filenames, and the current line, plus a little overhead.



                            #!/usr/bin/env python
                            import gzip, bz2
                            import os
                            import fnmatch

                            def gen_find(filepat,top):
                            for path, dirlist, filelist in os.walk(top):
                            for name in fnmatch.filter(filelist,filepat):
                            yield os.path.join(path,name)

                            def gen_open(filenames):
                            for name in filenames:
                            if name.endswith(".gz"):
                            yield gzip.open(name)
                            elif name.endswith(".bz2"):
                            yield bz2.BZ2File(name)
                            else:
                            yield open(name)

                            def gen_cat(sources):
                            for s in sources:
                            for item in s:
                            yield item

                            def main(regex, searchDir):
                            fileNames = gen_find(regex,searchDir)
                            fileHandles = gen_open(fileNames)
                            fileLines = gen_cat(fileHandles)
                            for line in fileLines:
                            print line

                            if __name__ == '__main__':
                            parser = argparse.ArgumentParser(description='Search globbed files line by line', version='%(prog)s 1.0')
                            parser.add_argument('regex', type=str, default='*', help='Regular expression')
                            parser.add_argument('searchDir', , type=str, default='.', help='list of input files')
                            args = parser.parse_args()
                            main(args.regex, args.searchDir)


                            The print line command will send every line to std out, so you can redirect to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the python script and you won't need to leave chunks of the file laying around.






                            share|improve this answer




























                              2














                              Here's a python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the space necessary in memory for holding the filenames, and the current line, plus a little overhead.



                              #!/usr/bin/env python
                              import gzip, bz2
                              import os
                              import fnmatch

                              def gen_find(filepat,top):
                              for path, dirlist, filelist in os.walk(top):
                              for name in fnmatch.filter(filelist,filepat):
                              yield os.path.join(path,name)

                              def gen_open(filenames):
                              for name in filenames:
                              if name.endswith(".gz"):
                              yield gzip.open(name)
                              elif name.endswith(".bz2"):
                              yield bz2.BZ2File(name)
                              else:
                              yield open(name)

                              def gen_cat(sources):
                              for s in sources:
                              for item in s:
                              yield item

                              def main(regex, searchDir):
                              fileNames = gen_find(regex,searchDir)
                              fileHandles = gen_open(fileNames)
                              fileLines = gen_cat(fileHandles)
                              for line in fileLines:
                              print line

                              if __name__ == '__main__':
                              parser = argparse.ArgumentParser(description='Search globbed files line by line', version='%(prog)s 1.0')
                              parser.add_argument('regex', type=str, default='*', help='Regular expression')
                              parser.add_argument('searchDir', , type=str, default='.', help='list of input files')
                              args = parser.parse_args()
                              main(args.regex, args.searchDir)


                              The print line command will send every line to std out, so you can redirect to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the python script and you won't need to leave chunks of the file laying around.






                              share|improve this answer


























                                2












                                2








                                2







                                Here's a python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the space necessary in memory for holding the filenames, and the current line, plus a little overhead.



                                #!/usr/bin/env python
                                import gzip, bz2
                                import os
                                import fnmatch

                                def gen_find(filepat,top):
                                for path, dirlist, filelist in os.walk(top):
                                for name in fnmatch.filter(filelist,filepat):
                                yield os.path.join(path,name)

                                def gen_open(filenames):
                                for name in filenames:
                                if name.endswith(".gz"):
                                yield gzip.open(name)
                                elif name.endswith(".bz2"):
                                yield bz2.BZ2File(name)
                                else:
                                yield open(name)

                                def gen_cat(sources):
                                for s in sources:
                                for item in s:
                                yield item

                                def main(regex, searchDir):
                                fileNames = gen_find(regex,searchDir)
                                fileHandles = gen_open(fileNames)
                                fileLines = gen_cat(fileHandles)
                                for line in fileLines:
                                print line

                                if __name__ == '__main__':
                                parser = argparse.ArgumentParser(description='Search globbed files line by line', version='%(prog)s 1.0')
                                parser.add_argument('regex', type=str, default='*', help='Regular expression')
                                parser.add_argument('searchDir', , type=str, default='.', help='list of input files')
                                args = parser.parse_args()
                                main(args.regex, args.searchDir)


                                The print line command will send every line to std out, so you can redirect to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the python script and you won't need to leave chunks of the file laying around.






                                share|improve this answer













                                Here's a python script to open a globbed set of files from a directory, gunzip them if necessary, and read through them line by line. It only uses the space necessary in memory for holding the filenames, and the current line, plus a little overhead.



                                #!/usr/bin/env python
                                import gzip, bz2
                                import os
                                import fnmatch

                                def gen_find(filepat,top):
                                for path, dirlist, filelist in os.walk(top):
                                for name in fnmatch.filter(filelist,filepat):
                                yield os.path.join(path,name)

                                def gen_open(filenames):
                                for name in filenames:
                                if name.endswith(".gz"):
                                yield gzip.open(name)
                                elif name.endswith(".bz2"):
                                yield bz2.BZ2File(name)
                                else:
                                yield open(name)

                                def gen_cat(sources):
                                for s in sources:
                                for item in s:
                                yield item

                                def main(regex, searchDir):
                                fileNames = gen_find(regex,searchDir)
                                fileHandles = gen_open(fileNames)
                                fileLines = gen_cat(fileHandles)
                                for line in fileLines:
                                print line

                                if __name__ == '__main__':
                                parser = argparse.ArgumentParser(description='Search globbed files line by line', version='%(prog)s 1.0')
                                parser.add_argument('regex', type=str, default='*', help='Regular expression')
                                parser.add_argument('searchDir', , type=str, default='.', help='list of input files')
                                args = parser.parse_args()
                                main(args.regex, args.searchDir)


                                The print line command will send every line to std out, so you can redirect to a file. Alternatively, if you let us know what you want done with the lines, I can add it to the python script and you won't need to leave chunks of the file laying around.







                                share|improve this answer












                                share|improve this answer



                                share|improve this answer










                                answered Jan 23 '12 at 13:39









                                Spencer RathbunSpencer Rathbun

                                27125




                                27125























                                    2














                                    Here's a perl program that can be used to read stdin, and split the lines, piping each clump to a separate command that can use a shell variable $SPLIT to route it to a different destination. For your case, it would be invoked with



                                    zcat hugefile.txt.gz | perl xsplit.pl 40000000 'cat > tmp$SPLIT.txt; do_something tmp$SPLIT.txt; rm tmp$SPLIT.txt'



                                    Sorry the command-line processing is a little kludgy but you get the idea.



                                    #!/usr/bin/perl -w
                                    #####
                                    # xsplit.pl: like xargs but instead of clumping input into each command's args, clumps it into each command's input.
                                    # Usage: perl xsplit.pl LINES 'COMMAND'
                                    # where: 'COMMAND' can include shell variable expansions and can use $SPLIT, e.g.
                                    # 'cat > tmp$SPLIT.txt'
                                    # or:
                                    # 'gzip > tmp$SPLIT.gz'
                                    #####
                                    use strict;

                                    sub pipeHandler {
                                    my $sig = shift @_;
                                    print " Caught SIGPIPE: $sign";
                                    exit(1);
                                    }
                                    $SIG{PIPE} = &pipeHandler;

                                    my $LINES = shift;
                                    die "LINES must be a positive numbern" if ($LINES <= 0);
                                    my $COMMAND = shift || die "second argument should be COMMANDn";

                                    my $line_number = 0;

                                    while (<STDIN>) {
                                    if ($line_number%$LINES == 0) {
                                    close OUTFILE;
                                    my $split = $ENV{SPLIT} = sprintf("%05d", $line_number/$LINES+1);
                                    print "$splitn";
                                    my $command = $COMMAND;
                                    open (OUTFILE, "| $command") or die "failed to write to command '$command'n";
                                    }
                                    print OUTFILE $_;
                                    $line_number++;
                                    }

                                    exit 0;





                                    share|improve this answer




























                                      2














                                      Here's a perl program that can be used to read stdin, and split the lines, piping each clump to a separate command that can use a shell variable $SPLIT to route it to a different destination. For your case, it would be invoked with



                                      zcat hugefile.txt.gz | perl xsplit.pl 40000000 'cat > tmp$SPLIT.txt; do_something tmp$SPLIT.txt; rm tmp$SPLIT.txt'



                                      Sorry the command-line processing is a little kludgy but you get the idea.



                                      #!/usr/bin/perl -w
                                      #####
                                      # xsplit.pl: like xargs but instead of clumping input into each command's args, clumps it into each command's input.
                                      # Usage: perl xsplit.pl LINES 'COMMAND'
                                      # where: 'COMMAND' can include shell variable expansions and can use $SPLIT, e.g.
                                      # 'cat > tmp$SPLIT.txt'
                                      # or:
                                      # 'gzip > tmp$SPLIT.gz'
                                      #####
                                      use strict;

                                      sub pipeHandler {
                                      my $sig = shift @_;
                                      print " Caught SIGPIPE: $sign";
                                      exit(1);
                                      }
                                      $SIG{PIPE} = &pipeHandler;

                                      my $LINES = shift;
                                      die "LINES must be a positive numbern" if ($LINES <= 0);
                                      my $COMMAND = shift || die "second argument should be COMMANDn";

                                      my $line_number = 0;

                                      while (<STDIN>) {
                                      if ($line_number%$LINES == 0) {
                                      close OUTFILE;
                                      my $split = $ENV{SPLIT} = sprintf("%05d", $line_number/$LINES+1);
                                      print "$splitn";
                                      my $command = $COMMAND;
                                      open (OUTFILE, "| $command") or die "failed to write to command '$command'n";
                                      }
                                      print OUTFILE $_;
                                      $line_number++;
                                      }

                                      exit 0;





share|improve this answer

answered Jan 23 '12 at 20:54

Liudvikas Bukys
Directly split the .gz file into .gz files:

    zcat bigfile.gz | split -l 400000 --filter='gzip > $FILE.gz'

I think this is what the OP wanted, because he doesn't have much space.
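Since each chunk is only needed transiently, the --filter hook can also run the processing directly, so no chunk has to be kept once it's handled. A sketch, assuming a GNU split recent enough to have --filter (it appeared around coreutils 8.13) and a hypothetical do_stuff command that reads the chunk from stdin:

    # Each 40M-line chunk is piped straight into do_stuff (placeholder
    # for the real per-chunk processing); $FILE is the name split would
    # have used for the chunk, handy for labelling the output.
    zcat bigfile.gz | split -l 40000000 --filter='do_stuff > "$FILE".out'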













share|improve this answer

answered Feb 24 at 10:53

siulkilulki